
Lecture Notes in Computer Science 7553

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Alessandro E.P. Villa, Włodzisław Duch, Péter Érdi,
Francesco Masulli, and Günther Palm (Eds.)

Artificial Neural Networks and Machine Learning – ICANN 2012

22nd International Conference on Artificial Neural Networks
Lausanne, Switzerland, September 11-14, 2012
Proceedings, Part II

Volume Editors

Alessandro E.P. Villa
University of Lausanne, Neuro Heuristic Research Group
1015 Lausanne, Switzerland
E-mail: alessandro.villa@unil.ch

Włodzisław Duch
Nicolaus Copernicus University, Department of Informatics
87-100 Toruń, Poland
E-mail: wduch@is.umk.pl

Péter Érdi
Kalamazoo College, Center for Complex Systems Studies
Kalamazoo, MI 49006, USA
E-mail: peter.erdi@kzoo.edu

Francesco Masulli
Università di Genova, Dipartimento di Informatica e Scienze dell’Informazione
16146 Genoa, Italy
E-mail: masulli@disi.unige.it

Günther Palm
Universität Ulm, Institut für Neuroinformatik
89069 Ulm, Germany
E-mail: guenther.palm@uni-ulm.de

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-33265-4 e-ISBN 978-3-642-33266-1
DOI 10.1007/978-3-642-33266-1
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012946038

CR Subject Classification (1998): I.2, F.1, I.4, I.5, J.3, H.3

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 2012


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

The International Conference on Artificial Neural Networks (ICANN) is the
annual flagship conference of the European Neural Network Society (ENNS). It is
the premier European event covering all topics concerned with neural networks
and related areas. The aim of ICANN is to bring together researchers from two
worlds: information sciences and neurosciences. The scope is wide, ranging from
machine learning algorithms to models of real nervous systems. The aim is to
facilitate discussions and interactions toward developing more intelligent artifi-
cial systems and increasing our understanding of neural and cognitive processes
in the brain.
The ICANN series of conferences was initiated in 1991 and soon became the
major European gathering for experts in these fields. The 22nd International
Conference on Artificial Neural Networks (ICANN 2012, http://icann2012.org)
was held on 11–14 September 2012 in Lausanne, Switzerland, with pre-conference
workshops and satellite meetings on robotics and consciousness studies held on 11
September 2012. The host organization was the University of Lausanne (UNIL) and
its Faculty of Business and Economics (HEC); the venue was the Internef Building
on the UNIL Dorigny Campus on the shore of Lake Geneva. We acknowledge
the support of the Fondation du 450ème, the Société Académique Vaudoise, the
Rectorate of UNIL, the Faculty of Business and Economics, and its Department
of Information Systems. The ICANN 2012 organization was non-profit and all
financial transactions were checked by the accounting office of UNIL.
The 2012 conference was characterized by two notable facts: the consolidation
of two parallel tracks with a new scheme of reduced fees, and the first ICANN
conference without the late John G. Taylor.
Paper submissions covered a wide variety of topics, and it was often
difficult to assign a paper to either the brain-inspired computing track
or the machine learning research track. However, after the successful initia-
tive of the organizers of ICANN 2011 in Espoo, Finland, to limit the parallel
sessions to two, it appeared that a broader audience would follow the oral pre-
sentations if the same formula were adopted in 2012. From 247 papers submitted
to the conference, the Program Committee and Editorial Board – after a thor-
ough peer-review process – selected 162 papers for publication, subdivided into
82 oral presentations in 16 sessions and 80 poster presentations. The quality
of the submissions was high, and many good papers could not be included
in the conference program. Papers selected for oral or poster
presentations were equally good and the attribution to a specific type of presen-
tation was decided, in the vast majority of cases, according to the preference
expressed by the authors. The two tracks, initially intended as a brain-inspired
computing track and a machine learning research track, simply became track A and
track B, because many papers presented an interdisciplinary approach, which is
in the spirit of ICANN and the goals promoted by ENNS. All posters remained
on display during the three days of the conference, with presenters required
to stand by odd-numbered posters on Thursday the 13th and by even-numbered
posters on Friday the 14th. This year the organizers decided to slash the
registration fee and focus on the core ICANN activities, at the cost of
excluding lunches. This scheme proved successful and attracted many foreign
participants, coming from 35 countries and all continents, particularly at
the graduate and postgraduate levels.
This was the first ICANN after the death of Prof. John Gerald Taylor (JGT),
the first president and co-founder of the European Neural Network Society
(ENNS). John was born in Hayes, Kent, on August 18, 1931. He obtained a PhD
in Theoretical Physics from Christ’s College, Cambridge (1950–1956), where he
was strongly influenced by the teaching of Paul Dirac. John G. Taylor started
research in neural networks in 1969 and contributed to many, if not all, of the
field's subfields. In 1971 he was appointed to the established Chair in Applied Math-
ematics at King’s College London where he founded and directed the Centre
for Neural Networks. His research interests were wide, spanning high-energy
physics, superstrings, quantum field theory and quantum gravity, neural
computation, the neural bases of behavior, and mathematical modelling in neurobiology.
After observing the metal “bending” skills of Uri Geller in 1974, Prof. J.G. Tay-
lor became interested in parapsychology and sought to establish whether there
was an electromagnetic basis for the phenomenon. After a careful investigation
characterized by initial enthusiasm and later skepticism, he came to the conclusion,
expressed in his book Science and the Supernatural (1980), that the paranor-
mal cannot be reconciled with modern physics. After Francis Crick’s hypothesis
(1984) on the internal attentional searchlight role played by the thalamic reticu-
lar nucleus, Prof. Taylor became involved in developing a higher-level cognitive
model of consciousness, using the most recent results on attention to describe it
as an engineering control system. This led him to the CODAM (attention copy)
model of consciousness. In 2007, Prof. Taylor developed the first program of its
kind in the hedge fund industry, using artificial intelligence techniques to create
portfolios of hedge funds. He also trained as an actor and performed in plays
and films, wrote several science fiction plays, and directed stage productions
in Oxford and Cambridge. Throughout his career Prof. Taylor encouraged
young scientists to follow their curiosity in their search for a better understand-
ing of nature and he served on numerous PhD dissertation juries around the
world. This brief biographical sketch of John G. Taylor is not intended to be
exhaustive; it is an attempt to portray an exceptional person, humble and
unassuming yet out of the ordinary, who was part of our community from the
very beginning. At the ICANN conferences Prof. Taylor spent much time in the
poster sessions interacting with the participants and his presence at the oral
sessions was often marked by his questions and comments. Attendees of past
ICANN conferences remember that at the banquet dinner Prof. Taylor usually
gave a short speech that distilled his elegance and humor.
I had the privilege of his friendship during the past twenty years and I am sure
that many of us will remember stories about Prof. John Gerald Taylor. Dear
John, thank you for your legacy; it is now up to us to pursue your effort and
make it grow and flourish.

July 2012 Alessandro E.P. Villa

John Gerald Taylor (18.VIII.1931-10.III.2012)


Organization

Committees
General Chair: Alessandro E.P. Villa
Special Sessions Chair: Marco Tomassini
Tutorials Chair: Lorenz Goette
Competitions Chair: Giacomo Indiveri

Program Chairs
Wlodek Duch
Péter Érdi
Francesco Masulli
Guenther Palm
Alessandro E.P. Villa

Program Committee and Editorial Board

Cesare Alippi
Bruno Apolloni
Yoshiyuki Asai
Lubica Benuskova
Roman Borisyuk
Antônio Braga
Hans Albert Braun
Jérémie Cabessa
Angelo Cangelosi
Angel Caputi
Ke Chen
Gerard Dreyfus
Jean-Pierre Eckmann
Marina Fiori
Jordi Garcia-Ojalvo
Philippe Gaussier
Michele Giugliano
Tatiana V. Guy
Barbara Hammer
Ulrich Hoffrage
Timo Honkela
Brian I. Hyland
Lazaros Iliadis
Giacomo Indiveri
Nikola Kasabov
Mario Koeppen
Stefanos Kollias
Petia Koprinkova-Hristova
Irena Koprinska
Vera Kurkova
Giancarlo La Camera
Diego Liberati
Alessandra Lintas
André Longtin
Teresa Ludermir
Thomas Martinetz
Francesco Masulli
Maurizio Mattia
Claudio Mirasso
Francesco C. Morabito
Manuel Moreno Arostegui
Ernst Niebur
Jose Nunez-Yanez
Klaus Obermeyer
Takashi Omori
Hélène Paugam-Moisy
Jaako Peltonen
Danil Prokhorov
Barry Richmond
Jean Rouat
John Rinzel
Stefan Rotter
Stefano Rovetta
Jorge Santos
Walter Senn
Isabella Silkis
Alessandro Sperduti
Marco Tomassini
Tatyana Turova
Roseli Wedemann
Stefan Wermter

Additional Reviewers
Fabio Babiloni
Simone Bassis
Fülöp Bazso
Francesco Camastra
Alessandro Di Nuovo
Simona Doboli
Alessio Ferone
Maurizio Filippone
Stefan Heinrich
Hassan Mahmoud
Alfredo Petrosino
Ramin Pichevar
Marina Resta
Alessandro Rozza
Justus Schwabedal
Vladyslav Shaposhnyk
Giorgio Valentini
Eleni Vasilaki
Jan K. Woike
Sean Wood

ENNS Travel Grant Committee

Wlodzislaw Duch
Péter Érdi
Guenther Palm
Alessandro E.P. Villa

Secretariat and Publicity

Daniela Serracca Fraccalvieri
Edy Ceppi
Elisabeth Fournier

Registration Committee
Paulo Monteiro
Table of Contents – Part II

Multilayer Perceptrons and Kernel Networks (A6)

Complex-Valued Multilayer Perceptron Search Utilizing Eigen Vector Descent
and Reducibility Mapping . . . . . 1
Shinya Suzumura and Ryohei Nakano

Theoretical Analysis of Function of Derivative Term in On-Line Gradient
Descent Learning . . . . . 9
Kazuyuki Hara, Kentaro Katahira, Kazuo Okanoya, and Masato Okada

Some Comparisons of Networks with Radial and Kernel Units . . . . . 17
Věra Kůrková

Multilayer Perceptron for Label Ranking . . . . . 25
Geraldina Ribeiro, Wouter Duivesteijn, Carlos Soares, and Arno Knobbe

Electricity Load Forecasting: A Weekday-Based Approach . . . . . 33
Irena Koprinska, Mashud Rana, and Vassilios G. Agelidis

Training and Learning (C4)

Adaptive Exploration Using Stochastic Neurons . . . . . 42
Michel Tokic and Günther Palm

Comparison of Long-Term Adaptivity for Neural Networks . . . . . 50
Frank-Florian Steege and Horst-Michael Groß

Simplifying ConvNets for Fast Learning . . . . . 58
Franck Mamalet and Christophe Garcia

A Modified Artificial Fish Swarm Algorithm for the Optimization of Extreme
Learning Machines . . . . . 66
João Fausto Lorenzato de Oliveira and Teresa B. Ludermir

Robust Training of Feedforward Neural Networks Using Combined
Online/Batch Quasi-Newton Techniques . . . . . 74
Hiroshi Ninomiya

Estimating a Causal Order among Groups of Variables in Linear Models . . . . . 84
Doris Entner and Patrik O. Hoyer

Training Restricted Boltzmann Machines with Multi-tempering: Harnessing
Parallelization . . . . . 92
Philemon Brakel, Sander Dieleman, and Benjamin Schrauwen

A Computational Geometry Approach for Pareto-Optimal Selection of Neural
Networks . . . . . 100
Luiz C.B. Torres, Cristiano L. Castro, and Antônio P. Braga

Learning Parameters of Linear Models in Compressed Parameter Space . . . . . 108
Yohannes Kassahun, Hendrik Wöhrle, Alexander Fabisch, and Marc Tabie

Control of a Free-Falling Cat by Policy-Based Reinforcement Learning . . . . . 116
Daichi Nakano, Shin-ichi Maeda, and Shin Ishii

Gated Boltzmann Machine in Texture Modeling . . . . . 124
Tele Hao, Tapani Raiko, Alexander Ilin, and Juha Karhunen

Neural PCA and Maximum Likelihood Hebbian Learning on the GPU . . . . . 132
Pavel Krömer, Emilio Corchado, Václav Snášel, Jan Platoš, and
Laura García-Hernández

Inference and Recognition (C5)

Construction of Emerging Markets Exchange Traded Funds Using
Multiobjective Particle Swarm Optimisation . . . . . 140
Marta Díez-Fernández, Sergio Alvarez Teleña, and Denise Gorse

The Influence of Supervised Clustering for RBFNN Centers Definition:
A Comparative Study . . . . . 148
André R. Gonçalves, Rosana Veroneze, Salomão Madeiro,
Carlos R.B. Azevedo, and Fernando J. Von Zuben

Nested Sequential Minimal Optimization for Support Vector Machines . . . . . 156
Alessandro Ghio, Davide Anguita, Luca Oneto, Sandro Ridella, and
Carlotta Schatten

Random Subspace Method and Genetic Algorithm Applied to a LS-SVM
Ensemble . . . . . 164
Carlos Padilha, Adrião Dória Neto, and Jorge Melo

Text Recognition in Videos Using a Recurrent Connectionist Approach . . . . . 172
Khaoula Elagouni, Christophe Garcia, Franck Mamalet, and
Pascale Sébillot

An Investigation of Ensemble Systems Applied to Encrypted and Cancellable
Biometric Data . . . . . 180
Isaac de L. Oliveira Filho, Benjamín R.C. Bedregal, and
Anne M.P. Canuto

New Dynamic Classifiers Selection Approach for Handwritten
Recognition . . . . . 189
Nabiha Azizi, Nadir Farah, and Abdel Ennaji

Vector Perceptron Learning Algorithm Using Linear Programming . . . . . 197
Vladimir Kryzhanovskiy, Irina Zhelavskaya, and Anatoliy Fonarev

A Robust Objective Function of Joint Approximate Diagonalization . . . . . 205
Yoshitatsu Matsuda and Kazunori Yamaguchi

TrueSkill-Based Pairwise Coupling for Multi-class Classification . . . . . 213
Jong-Seok Lee

Analogical Inferences in the Family Trees Task: A Review . . . . . 221
Sergio Varona-Moya and Pedro L. Cobos

An Efficient Way of Combining SVMs for Handwritten Digit
Recognition . . . . . 229
Renata F.P. Neves, Cleber Zanchettin, and Alberto N.G. Lopes Filho

Comparative Evaluation of Regression Methods for 3D-2D Image
Registration . . . . . 238
Ana Isabel Rodrigues Gouveia, Coert Metz, Luís Freire, and
Stefan Klein

A MDRNN-SVM Hybrid Model for Cursive Offline Handwriting
Recognition . . . . . 246
Byron Leite Dantas Bezerra, Cleber Zanchettin, and
Vinícius Braga de Andrade

Extraction of Prototype-Based Threshold Rules Using Neural Training
Procedure . . . . . 255
Marcin Blachnik, Miroslaw Kordos, and Wlodzislaw Duch

Instance Selection with Neural Networks for Regression Problems . . . . . 263
Miroslaw Kordos and Marcin Blachnik

A New Distance for Probability Measures Based on the Estimation of Level
Sets . . . . . 271
Alberto Muñoz, Gabriel Martos, Javier Arriero, and Javier Gonzalez

Low Complexity Proto-Value Function Learning from Sensory Observations
with Incremental Slow Feature Analysis . . . . . 279
Matthew Luciw and Juergen Schmidhuber

Improving Neural Networks Classification through Chaining . . . . . 288
Khobaib Zaamout and John Z. Zhang

Feature Ranking Methods Used for Selection of Prototypes . . . . . 296
Marcin Blachnik, Wlodzislaw Duch, and Tomasz Maszczyk

A "Learning from Models" Cognitive Fault Diagnosis System . . . . . 305
Cesare Alippi, Manuel Roveri, and Francesco Trovò

Support Vector Machines (A5)

Improving ANNs Performance on Unbalanced Data with an AUC-Based
Learning Algorithm . . . . . 314
Cristiano L. Castro and Antônio P. Braga

Learning Using Privileged Information in Prototype Based Models . . . . . 322
Shereen Fouad, Peter Tino, Somak Raychaudhury, and
Petra Schneider

A Sparse Support Vector Machine Classifier with Nonparametric
Discriminants . . . . . 330
Naimul Mefraz Khan, Riadh Ksantini, Imran Shafiq Ahmad, and
Ling Guan

Training Mahalanobis Kernels by Linear Programming . . . . . 339
Shigeo Abe

Self-Organizing Maps and Clustering (A8)

Correntropy-Based Document Clustering via Nonnegative Matrix
Factorization . . . . . 347
Tolga Ensari, Jan Chorowski, and Jacek M. Zurada

SOMM – Self-Organized Manifold Mapping . . . . . 355
Edson Caoru Kitani, Emilio Del-Moral-Hernandez, and
Leandro A. Silva

Self-Organizing Map and Tree Topology for Graph Summarization . . . . . 363
Nhat-Quang Doan, Hanane Azzag, and Mustapha Lebbah

Variable-Sized Kohonen Feature Map Probabilistic Associative
Memory . . . . . 371
Hiroki Sato and Yuko Osana

Learning Deep Belief Networks from Non-stationary Streams . . . . . 379
Roberto Calandra, Tapani Raiko, Marc Peter Deisenroth, and
Federico Montesino Pouzols

Separation and Unification of Individuality and Collectivity and Its
Application to Explicit Class Structure in Self-Organizing Maps . . . . . 387
Ryotaro Kamimura

Clustering, Mining and Exploratory Analysis (C6)

Autoencoding Ground Motion Data for Visualisation . . . . . 395
Nikolaos Gianniotis, Carsten Riggelsen, Nicolas Kühn, and
Frank Scherbaum

Examining an Evaluation Mechanism of Metaphor Generation with
Experiments and Computational Model Simulation . . . . . 403
Asuka Terai, Keiga Abe, and Masanori Nakagawa

Pairwise Clustering with t-PLSI . . . . . 411
He Zhang, Tele Hao, Zhirong Yang, and Erkki Oja

Selecting β-Divergence for Nonnegative Matrix Factorization by Score
Matching . . . . . 419
Zhiyun Lu, Zhirong Yang, and Erkki Oja

Neural Networks for Proof-Pattern Recognition . . . . . 427
Ekaterina Komendantskaya and Kacper Lichota

Using Weighted Clustering and Symbolic Data to Evaluate Institutes'
Scientific Production . . . . . 435
Bruno Almeida Pimentel, Jarley P. Nóbrega, and
Renata M.C.R. de Souza

Comparison of Input Data Compression Methods in Neural Network Solution
of Inverse Problem in Laser Raman Spectroscopy of Natural Waters . . . . . 443
Sergey Dolenko, Tatiana Dolenko, Sergey Burikov, Victor Fadeev,
Alexey Sabirov, and Igor Persiantsev

New Approach for Clustering Relational Data Based on Relationship and
Attribute Information . . . . . 451
João Carlos Xavier-Júnior, Anne M.P. Canuto,
Luiz M.G. Gonçalves, and Luiz A.H.G. de Oliveira

Comparative Study on Information Theoretic Clustering and Classical
Clustering Algorithms . . . . . 459
Daniel Araújo, Adrião Dória Neto, and Allan Martins

Text Mining for Wellbeing: Selecting Stories Using Semantic and Pragmatic
Features . . . . . 467
Timo Honkela, Zaur Izzatdust, and Krista Lagus

Hybrid Bilinear and Trilinear Models for Exploratory Analysis of Three-Way
Poisson Counts . . . . . 475
Juha Raitio, Tapani Raiko, and Timo Honkela

Estimating Quantities: Comparing Simple Heuristics and Machine Learning
Algorithms . . . . . 483
Jan K. Woike, Ulrich Hoffrage, and Ralph Hertwig

Bioinformatics (A2)

Rademacher Complexity and Structural Risk Minimization: An Application
to Human Gene Expression Datasets . . . . . 491
Luca Oneto, Davide Anguita, Alessandro Ghio, and Sandro Ridella

Using a Support Vector Machine and Sampling to Classify Compounds as
Potential Transdermal Enhancers . . . . . 499
Alpa Shah, Gary P. Moss, Yi Sun, Rod Adams, Neil Davey, and
Simon Wilkinson

The Application of Gaussian Processes in the Predictions of Permeability
across Mammalian Membranes . . . . . 507
Yi Sun, Marc B. Brown, Maria Prapopoulou, Rod Adams,
Neil Davey, and Gary P. Moss

Protein Structural Blocks Representation and Search through Unsupervised
NN . . . . . 515
Virginio Cantoni, Alessio Ferone, Ozlem Ozbudak, and
Alfredo Petrosino

Time Series and Forecasting (C7)

Evolutionary Support Vector Machines for Time Series Forecasting . . . . . 523
Paulo Cortez and Juan Peralta Donate

Learning Relevant Time Points for Time-Series Data in the Life
Sciences . . . . . 531
Frank-Michael Schleif, Bassam Mokbel, Andrej Gisbrecht,
Leslie Theunissen, Volker Dürr, and Barbara Hammer

A Multivariate Approach to Estimate Complexity of FMRI Time
Series . . . . . 540
Henry Schütze, Thomas Martinetz, Silke Anders, and
Amir Madany Mamlouk

Neural Architectures for Global Solar Irradiation and Air Temperature
Prediction . . . . . 548
Pierrick Bruneau, Laurence Boudet, and Cécilia Damon

Sparse Linear Wind Farm Energy Forecast . . . . . 557
Carlos M. Alaíz, Alberto Torres, and José R. Dorronsoro

Diffusion Maps and Local Models for Wind Power Prediction . . . . . 565
Ángela Fernández Pascual, Carlos M. Alaíz,
Ana Ma González Marcos, Julia Díaz García, and
José R. Dorronsoro

A Hybrid Model for S&P500 Index Forecasting . . . . . 573
Ricardo de A. Araújo, Adriano L.I. Oliveira, and Silvio R.L. Meira

Author Index . . . . . 583


Table of Contents – Part I

Theoretical Neural Computation (A3)

Temporal Patterns in Artificial Reaction Networks . . . . . 1
Claire Gerrard, John McCall, George M. Coghill, and
Christopher Macleod

Properties of the Hopfield Model with Weighted Patterns . . . . . 9
Iakov Karandashev, Boris Kryzhanovsky, and Leonid Litinskii

Dynamics and Oscillations of GHNNs with Time-Varying Delay . . . . . 17
Farouk Chérif

A Dynamic Field Architecture for the Generation of Hierarchically Organized
Sequences . . . . . 25
Boris Durán, Yulia Sandamirskaya, and Gregor Schöner

Information and Optimization (C1)

Stochastic Techniques in Influence Diagrams for Learning Bayesian Network
Structure . . . . . 33
Michal Matuszak and Jacek Miękisz

The Mix-Matrix Method in the Problem of Binary Quadratic
Optimization . . . . . 41
Iakov Karandashev and Boris Kryzhanovsky

A Rule Chaining Architecture Using a Correlation Matrix Memory . . . . . 49
James Austin, Stephen Hobson, Nathan Burles, and Simon O'Keefe

A Generative Multiset Kernel for Structured Data . . . . . 57
Davide Bacciu, Alessio Micheli, and Alessandro Sperduti

Spectral Signal Unmixing with Interior-Point Nonnegative Matrix
Factorization . . . . . 65
Rafal Zdunek

Hybrid Optimized Polynomial Neural Networks with Polynomial Neurons
and Fuzzy Polynomial Neurons . . . . . 73
Dan Wang, Donghong Ji, and Wei Huang

Tikhonov-Type Regularization for Restricted Boltzmann Machines . . . . . 81
KyungHyun Cho, Alexander Ilin, and Tapani Raiko

From Neurons to Neuromorphism (A1)

Modeling of Spiking Analog Neural Circuits with Hebbian Learning, Using
Amorphous Semiconductor Thin Film Transistors with Silicon Oxide Nitride
Semiconductor Split Gates . . . . . 89
Richard Wood, Ian Bruce, and Peter Mascher

Real-Time Simulations of Synchronization in a Conductance-Based Neuronal
Network with a Digital FPGA Hardware-Core . . . . . 97
Marcel Beuler, Aubin Tchaptchet, Werner Bonath,
Svetlana Postnova, and Hans Albert Braun

Impact of Frequency on the Energetic Efficiency of Action Potentials . . . . . 105
Anand Singh, Pierre J. Magistretti, Bruno Weber, and
Renaud Jolivet

A Large-Scale Spiking Neural Network Accelerator for FPGA
Systems . . . . . 113
Kit Cheung, Simon R. Schultz, and Wayne Luk

Silicon Neurons That Compute . . . . . 121
Swadesh Choudhary, Steven Sloan, Sam Fok, Alexander Neckar,
Eric Trautmann, Peiran Gao, Terry Stewart, Chris Eliasmith, and
Kwabena Boahen

A Communication Infrastructure for Emulating Large-Scale Neural Networks
Models . . . . . 129
Andres Gaona Barrera and Manuel Moreno Arostegui

Spiking Dynamics (B2)

Pair-Associate Learning with Modulated Spike-Time Dependent
Plasticity . . . . . 137
Nooraini Yusoff, André Grüning, and Scott Notley

Associative Memory in Neuronal Networks of Spiking Neurons: Architecture
and Storage Analysis . . . . . 145
Everton J. Agnes, Rubem Erichsen Jr., and Leonardo G. Brunnet

Bifurcating Neurons with Filtered Base Signals . . . . . 153
Shota Kirikawa, Takashi Ogawa, and Toshimichi Saito

Basic Analysis of Digital Spike Maps . . . . . 161
Narutoshi Horimoto, Takashi Ogawa, and Toshimichi Saito
From Single Neurons to Networks (C2)

Cyfield-RISP: Generating Dynamic Instruction Set Processors for
Reconfigurable Hardware Using OpenCL . . . . . 169
Jörn Hoffmann, Frank Güttler, Karim El-Laithy, and Martin Bogdan

A Biophysical Network Model Displaying the Role of Basal Ganglia Pathways
in Action Selection . . . . . 177
Cem Yucelgen, Berat Denizdurduran, Selin Metin,
Rahmi Elibol, and Neslihan Serap Sengor

How Degrading Networks Can Increase Cognitive Functions . . . . . 185
Adam Tomkins, Mark Humphries, Christian Beste,
Eleni Vasilaki, and Kevin Gurney

Emergence of Connectivity Patterns from Long-Term and Short-Term
Plasticities . . . . . 193
Eleni Vasilaki and Michele Giugliano

Artificial Neural Networks and Data Compression Statistics for the
Discrimination of Cultured Neuronal Activity . . . . . 201
Andres Perez-Uribe and Héctor F. Satizábal

Liquid Computing in a Simplified Model of Cortical Layer IV: Learning to
Balance a Ball . . . . . 209
Dimitri Probst, Wolfgang Maass, Henry Markram, and
Marc-Oliver Gewaltig

Timing Self-generated Actions for Sensory Streaming . . . . . 217
Angel A. Caputi

The Capacity and the Versatility of the Pulse Coupled Neural Network in the
Image Matching . . . . . 223
Yuta Ishida, Masato Yonekawa, and Hiroaki Kurokawa

A Novel Bifurcation-Based Synthesis of Asynchronous Cellular Automaton
Based Neuron . . . . . 231
Takashi Matsubara and Hiroyuki Torikai

Biomimetic Binaural Sound Source Localisation with Ego-Noise
Cancellation . . . . . 239
Jorge Dávila-Chacón, Stefan Heinrich, Jindong Liu, and
Stefan Wermter

A Biologically Realizable Bayesian Computation in a Cortical Neural
Network . . . . . 247
Daiki Futagi and Katsunori Kitano

Complex Firing Patterns (B5)


Evaluating the Effect of Spiking Network Parameters on
Polychronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Panagiotis Ioannou, Matthew Casey, and André Grüning

Classification of Distorted Patterns by Feed-Forward Spiking Neural


Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
Ioana Sporea and André Grüning

Spike Transmission on Diverging/Converging Neural Network and Its


Implementation on a Multilevel Modeling Platform . . . . . . . . . . . . . . . . . . . 272
Yoshiyuki Asai and Alessandro E.P. Villa

Differential Entropy of Multivariate Neural Spike Trains . . . . . . . . . . . . . . 280


Nanyi Cui, Jiaying Tang, and Simon R. Schultz

Movement and Motion (B7)


Learning Representations for Animated Motion Sequence and Implied
Motion Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
Georg Layher, Martin A. Giese, and Heiko Neumann

Exploratory Behaviour Depends on Multisensory Integration during


Spatial Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Denis Sheynikhovich, Félix Grèzes, Jean-Rémi King, and
Angelo Arleo

Control of Biped Robot Joints’ Angles Using Coordinated Matsuoka


Oscillators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Asiya M. Al-Busaidi, Riadh Zaier, and Amer S. Al-Yahmadi

Self-calibrating Marker Tracking in 3D with Event-Based Vision


Sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Georg R. Müller and Jörg Conradt

Integration of Static and Self-motion-Based Depth Cues for Efficient


Reaching and Locomotor Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Beata J. Grzyb, Vicente Castelló, Marco Antonelli, and
Angel P. del Pobil

A Proposed Neural Control for the Trajectory Tracking of a


Nonholonomic Mobile Robot with Disturbances . . . . . . . . . . . . . . . . . . . . . . 330
Nardênio A. Martins, Maycol de Alencar, Warody C. Lombardi,
Douglas W. Bertol, Edson R. De Pieri, and Humberto F. Filho

From Sensation to Perception (B8)


Simulating Light Adaptation in the Retina with Rod-Cone Coupling . . . 339
Kendi Muchungi and Matthew Casey

Evolving Neural Networks for Orientation Behavior of Sand Scorpions


towards Prey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
Hyungu Yim and DaeEun Kim

Evolving Dendritic Morphology and Parameters in Biologically


Realistic Model Neurons for Pattern Recognition . . . . . . . . . . . . . . . . . . . . . 355
Giseli de Sousa, Reinoud Maex, Rod Adams, Neil Davey, and
Volker Steuber

Neural Network Providing Integrative Perception of Features and


Subsecond Temporal Parameters of Sensory Stimuli . . . . . . . . . . . . . . . . . . 363
Isabella Silks

An Effect of Short and Long Reciprocal Projections on Evolution of


Hierarchical Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Vladyslav Shaposhnyk and Alessandro E.P. Villa

Some Things Psychopathologies Can Tell Us about Consciousness . . . . . . 379


Roseli S. Wedemann and Luı́s Alfredo V. de Carvalho

Object and Face Recognition (B1)


Elastic Graph Matching on Gabor Feature Representation at Low
Image Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Yasuomi D. Sato and Yasutaka Kuriya

Contour Detection by CORF Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395


George Azzopardi and Nicolai Petkov

Hybrid Ensembles Using Hopfield Neural Networks and Haar-Like


Features for Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Nils Meins, Stefan Wermter, and Cornelius Weber

Face Recognition with Disparity Corrected Gabor Phase Differences . . . . 411


Manuel Günther, Dennis Haufe, and Rolf P. Würtz

Visual Categorization Based on Learning Contextual Probabilistic


Latent Component Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Masayasu Atsumi

Biological Brain and Binary Code: Quality of Coding for Face


Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
João da Silva Gomes and Roman Borisyuk

Reinforcement Learning (B4)


Making a Reinforcement Learning Agent Believe . . . . . . . . . . . . . . . . . . . . . 435
Klaus Häming and Gabriele Peters

Biologically Plausible Multi-dimensional Reinforcement Learning in


Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
Jaldert O. Rombouts, Arjen van Ooyen, Pieter R. Roelfsema, and
Sander M. Bohte

Adaptive Neural Oscillator with Synaptic Plasticity Enabling Fast


Resonance Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
Timo Nachstedt, Florentin Wörgötter, and Poramate Manoonpong

Learning from Delayed Reward und Punishment in a Spiking Neural


Network Model of Basal Ganglia with Opposing D1/D2 Plasticity . . . . . . 459
Jenia Jitsev, Nobi Abraham, Abigail Morrison, and
Marc Tittgemeyer

Understanding the Role of Serotonin in Basal Ganglia through a


Unified Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Balasubramani Pragathi Priyadharsini, Balaraman Ravindran, and
V. Srinivasa Chakravarthy

Learning How to Select an Action: A Computational Model . . . . . . . . . . . 474


Berat Denizdurduran and Neslihan Serap Sengor

Bayesian and Echo State Networks (A4)


A Dynamic Binding Mechanism for Retrieving and Unifying Complex
Predicate-Logic Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
Gadi Pinkas, Priscila Lima, and Shimon Cohen

Estimation of Causal Orders in a Linear Non-Gaussian Acyclic Model:


A Method Robust against Latent Confounders . . . . . . . . . . . . . . . . . . . . . . . 491
Tatsuya Tashiro, Shohei Shimizu, Aapo Hyvärinen, and
Takashi Washio

Reservoir Sizes and Feedback Weights Interact Non-linearly in Echo


State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
Danil Koryakin and Martin V. Butz

Learning to Imitate YMCA with an ESN . . . . . . . . . . . . . . . . . . . . . . . . . . . 507


Rikke Amilde Løvlid

A New Neural Data Analysis Approach Using Ensemble Neural


Network Rule Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Atsushi Hara and Yoichi Hayashi

Bayesian Inference with Efficient Neural Population Codes . . . . . . . . . . . . 523


Xue-Xin Wei and Alan A. Stocker

Recurrent Neural Networks and Reservoir


Computing (C3)
Learning Sequence Neighbourhood Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Justin Bayer, Christian Osendorfer, and Patrick van der Smagt

Learning Features and Predictive Transformation Encoding Based on a


Horizontal Product Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
Junpei Zhong, Cornelius Weber, and Stefan Wermter

Regulation toward Self-organized Criticality in a Recurrent Spiking


Neural Reservoir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
Simon Brodeur and Jean Rouat

Adaptive Learning of Linguistic Hierarchy in a Multiple Timescale


Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Stefan Heinrich, Cornelius Weber, and Stefan Wermter

The Spherical Hidden Markov Self Organizing Map for Learning Time
Series Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
Gen Niina and Hiroshi Dozono

Echo State Networks for Multi-dimensional Data Clustering . . . . . . . . . . . 571


Petia Koprinkova-Hristova and Nikolay Tontchev

The Counter-Change Model of Motion Perception: An Account Based


on Dynamic Field Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
Michael Berger, Christian Faubel, Joseph Norman,
Howard Hock, and Gregor Schöner

Self-organized Reservoirs and Their Hierarchies . . . . . . . . . . . . . . . . . . . . . . 587


Mantas Lukoševičius

On-Line Processing of Grammatical Structure Using Reservoir


Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
Xavier Hinaut and Peter F. Dominey

Constructing Robust Liquid State Machines to Process Highly Variable


Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
Stefan Schliebs, Maurizio Fiasché, and Nikola Kasabov

Coding Architectures (B3)


Infinite Sparse Threshold Unit Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
Michiel Hermans and Benjamin Schrauwen

Learning Two-Layer Contractive Encodings . . . . . . . . . . . . . . . . . . . . . . . . . 620


Hannes Schulz and Sven Behnke

Effects of Architecture Choices on Sparse Coding in Speech


Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
Fionntán O’Donnell, Fabian Triefenbach, Jean-Pierre Martens, and
Benjamin Schrauwen

Generating Motion Trajectories by Sparse Activation of Learned


Motion Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Christian Vollmer, Julian P. Eggert, and Horst-Michael Groß

Interacting with the Brain (B6)


Kinetic Modelling of Synaptic Functions in the Alpha Rhythm Neural
Mass Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Basabdatta Sen Bhattacharya, Damien Coyle,
Liam P. Maguire, and Jill Stewart

Integrating Neural Networks and Chaotic Measurements for Modelling


Epileptic Brain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
Maurizio Fiasché, Stefan Schliebs, and Lino Nobili

Dynamic Stopping Improves the Speed and Accuracy of a P300


Speller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
Hannes Verschore, Pieter-Jan Kindermans, David Verstraeten, and
Benjamin Schrauwen

Adaptive SVM-Based Classification Increases Performance of a


MEG-Based Brain-Computer Interface (BCI) . . . . . . . . . . . . . . . . . . . . . . . . 669
Martin Spüler, Wolfgang Rosenstiel, and Martin Bogdan

Recognizing Human Activities Using a Layered Markov Architecture . . . 677


Michael Glodek, Georg Layher, Friedhelm Schwenker, and
Günther Palm

Swarm Intelligence and Decision-Making (A7)


PSO for Reservoir Computing Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 685
Anderson Tenório Sergio and Teresa B. Ludermir

One-Class Classification through Optimized Feature Boundaries


Detection and Prototype Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
George G. Cabral and Adriano L.I. Oliveira

Bi-objective Genetic Algorithm for Feature Selection in Ensemble


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Laura E.A. Santana and Anne M.P. Canuto

Dual Support Vector Domain Description for Imbalanced


Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 710
Felipe Ramı́rez and Héctor Allende

Learning Method Inspired on Swarm Intelligence for Fuzzy Cognitive


Maps: Travel Behaviour Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
Maikel León, Lusine Mkrtchyan, Benoı̂t Depaire, Da Ruan,
Rafael Bello, and Koen Vanhoof

A Computational Model of Motor Areas Based on Bayesian Networks


and Most Probable Explanations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
Yuuji Ichisugi

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735


Complex-Valued Multilayer Perceptron Search
Utilizing Eigen Vector Descent
and Reducibility Mapping

Shinya Suzumura and Ryohei Nakano

Chubu University 1200 Matsumoto-cho, Kasugai, 487-8501 Japan


tp11018-4021@sti.chubu.ac.jp,
nakano@cs.chubu.ac.jp

Abstract. A complex-valued multilayer perceptron (MLP) can approximate a periodic or unbounded function, which cannot be easily realized by a real-valued MLP. Its search space is full of crevasse-like forms having huge condition numbers; thus, it is very hard for existing methods to perform efficient search in such a space. The space also includes the structure of reducibility mapping. The paper proposes a new search method for a complex-valued MLP, which employs both eigen vector descent and reducibility mapping, aiming to stably find excellent solutions in such a space. Our experiments showed the proposed method worked well.

Keywords: complex-valued multilayer perceptron, Wirtinger calculus, search method, eigen vector, reducibility mapping.

1 Introduction
A complex-valued MLP (multilayer perceptron) has attractive capabilities that a real-valued MLP does not. For example, a complex-valued MLP can be used naturally in fields where complex values are indispensable, and it can naturally fit a periodic or unbounded function.
Our preliminary experiments showed that the search space of complex-valued MLP parameters is full of crevasse-like forms having huge condition numbers, much the same as in a real-valued MLP [8]. In such an extraordinary space, it will be hard for the usual gradient-based search methods such as BP to find excellent solutions because the search will easily get stuck. Recently a higher-order search method has been proposed to get better performance for a complex-valued MLP [1].
This paper proposes a totally new search method for a complex-valued MLP,
which utilizes eigen vector descent and reducibility mapping [3,9], aiming to sta-
bly find excellent solutions in such an extraordinary search space full of crevasse-
like forms. Our experiments showed that the proposed method worked well for
two data sets generated by an unbounded function and Bessel functions.

2 Complex-Valued Multilayer Perceptron


Figure 1 shows a model of a complex-valued MLP. Here f_i^μ and z_j^μ are output values of output unit i and hidden unit j for data point μ, respectively.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 1–8, 2012.

© Springer-Verlag Berlin Heidelberg 2012
2 S. Suzumura and R. Nakano

Fig. 1. Complex-valued multilayer perceptron


f_i^μ = Σ_{j=0}^{J} w_{ij}^{(2)} z_j^μ,    z_j^μ = g(h_j^μ),    h_j^μ = Σ_{k=0}^{K} w_{jk}^{(1)} x_k^μ        (1)

The following activation function g(h) is employed: g(h) = 1/(1 + i + e^{−h}). This function has unbounded and periodic features. Since an activation function plays an important role in a complex-valued MLP, many activation functions have been proposed so far [4,5]. Our function is quite similar to that proposed in [7].
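As a concrete illustration of eq. (1) and this activation, here is a minimal forward-pass sketch (my own construction, not the authors' code: the layer sizes, random weights, and the convention that unit j = 0 acts as a hidden bias with z_0 = 1 are assumptions):

```python
import numpy as np

def g(h):
    # Complex activation g(h) = 1/(1 + i + exp(-h)); unbounded, and
    # periodic along the imaginary axis of h with period 2*pi*i
    return 1.0 / (1.0 + 1j + np.exp(-h))

def forward(x, W1, W2):
    # x: (K+1,) complex input vector including the bias entry x_0 = 1
    # W1: (J+1, K+1) hidden weights w^(1); W2: (I, J+1) output weights w^(2)
    h = W1 @ x          # hidden pre-activations h_j
    z = g(h)
    z[0] = 1.0          # assumed bias convention for the j = 0 unit
    return W2 @ z       # outputs f_i

rng = np.random.default_rng(0)
K, J, I = 1, 3, 1
W1 = rng.normal(size=(J + 1, K + 1)) + 1j * rng.normal(size=(J + 1, K + 1))
W2 = rng.normal(size=(I, J + 1)) + 1j * rng.normal(size=(I, J + 1))
x = np.array([1.0, 0.5 + 0.2j])      # bias plus one complex input
f = forward(x, W1, W2)
print(f.shape)  # (1,)
```

The periodicity claimed for g follows from e^{−(h+2πi)} = e^{−h}, which the sketch makes easy to check numerically.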

3 Search Space of Complex-Valued MLP


Wirtinger calculus [2] is used to calculate the gradient or the Hessian matrix for
a complex-valued MLP. Let E and w denote a sum-of-squares error and weights
respectively. Weights w (= w_x + i w_y) are used to make the complex variables c and the real variables r.

c = (w, w̄)^T,    r = (w_x, w_y)^T        (2)

Then the complex Hessian H c and the real Hessian H r are defined respectively
as below. Note that H c is Hermitian and H r is symmetric. The former is more
convenient to calculate than the latter.
H_c = (∂/∂c) (∂E/∂c)^H,    H_r = (∂/∂r) (∂E/∂r)^T        (3)

What kind of landscapes does the search space of a complex-valued MLP have?
Here search space means the error surface the weights of a complex-valued MLP
form. To the best of our knowledge, little has been known about the landscapes.
Since the search space is usually high-dimensional, eigen values of the Hessian
will give us the exact nature of the landscapes.

Our preliminary experiments revealed the search space of a complex-valued


MLP is full of crevasse-like forms having huge condition numbers ranging from 10^6 to 10^15 or more. A crevasse-like form means a form where there exists a flat
line surrounded by very steep walls. In such crevasse-like forms the usual steepest
descent cannot move along the bottom but just go back and forth toward the
steepest wall until termination.

4 Search Methods for Complex-Valued MLP


A new search method for a complex-valued MLP is explained. The proposed
method combines steepest descent with a new descent called eigen vector de-
scent under a new search framework which makes use of reducibility mapping [3].

Steepest Descent
Now the sum-of-squares error E is formally defined below, where y_i^μ is a teacher signal for data point μ. The error is a real-valued scalar.

E = Σ_{μ}^{N} Σ_{i}^{I} δ̄_i^μ δ_i^μ,    δ_i^μ = f_i^μ − y_i^μ        (4)

Using Wirtinger calculus, the gradient is defined as follows.

∂E/∂w̄_{jk}^{(1)} = Σ_{μ}^{N} Σ_{i}^{I} δ_i^μ w̄_{ij}^{(2)} ḡ'(h_j^μ) x̄_k^μ,    ∂E/∂w̄_{ij}^{(2)} = Σ_{μ}^{N} δ_i^μ z̄_j^μ        (5)

Steepest descent obtains the search direction by multiplying the gradient by the learning rate. Since a constant learning rate does not work well in crevasse-like forms, a line search [6] is employed to get an adaptive learning rate.

Eigen Vector Descent


This section explains a new descent called eigen vector descent. The error func-
tion is approximated using the second-order Taylor expansion.
E(w + Δw) = E(c + Δc) ≈ E(c) + (∂E/∂c)^H Δc + (1/2) Δc^H H_c Δc        (6)
From the definitions we can see that the complex variables c and the real variables r are linearly connected: c = J r. Using J, the real Hessian H_r is calculated from the complex one H_c as shown below [2], where I denotes the identity matrix.

H_r = J^H H_c J,    J = [ I   iI ; I  −iI ]        (7)
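A small numerical sketch of the map c = J r and the conversion of eq. (7) (my own illustration; the dimension M and the block structure assumed for the complex Hessian of a real-valued E are stated in the comments):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2                                   # number of complex weights (arbitrary choice)
I2 = np.eye(M)
Jmat = np.block([[I2,  1j * I2],
                 [I2, -1j * I2]])       # the matrix J of eq. (7)

# c = (w, conj(w)) is exactly J applied to r = (wx, wy)
wx, wy = rng.normal(size=M), rng.normal(size=M)
c = Jmat @ np.concatenate([wx, wy])
print(np.allclose(c[:M], wx + 1j * wy), np.allclose(c[M:], wx - 1j * wy))  # True True

# Assumption: the complex Hessian of a real-valued E has the block form
# [[A, B], [conj(B), conj(A)]] with A Hermitian and B symmetric;
# then H_r = J^H H_c J comes out real and symmetric
P = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
Q = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
A, B = P + P.conj().T, Q + Q.T
Hc = np.block([[A, B], [B.conj(), A.conj()]])
Hr = Jmat.conj().T @ Hc @ Jmat
print(np.max(np.abs(Hr.imag)) < 1e-10, np.allclose(Hr, Hr.T))  # True True
```

This also shows why working with the Hermitian H_c is convenient: the real symmetric H_r needed for the eigen analysis below is obtained by a single congruence with J.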

Let λ_m and v_m be the m-th eigen value and eigen vector of H_r respectively. The main idea of eigen vector descent is to consider each eigen vector as a candidate of the search direction. Let η_m be the suitable step length in the direction v_m. Putting together the result of each direction, we get the real update Δr = Σ_m η_m v_m. The complex update Δc is obtained from Δr; that is, Δc = J Σ_{m}^{2M} η_m v_m. Substituting this into eq.(6), we get the following, where the basis {v_m} is assumed to be orthonormal.

E(c + Δc) ≈ E(c) + Σ_{m}^{2M} (∂E/∂c)^H J v_m η_m + (1/2) Σ_{m}^{2M} λ_m η_m^2        (8)

By minimizing the above with respect to ηm , we get the suitable step length
ηm . When λm < 0, the above ηm gives the maximal point; then, ηm is selected
so as to reduce E. Moreover, we check if ηm surely reduces E, and if that does
not hold, we set ηm = 0. Thus, the weight update rule of eigen vector descent is
given as below.


w^{new} ← w^{old} + Σ_{m}^{2M} Δw_m,    Δw_m = η_m (I  iI) v_m        (9)
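The idea can be sketched on a real quadratic surrogate (my own minimal illustration, not the paper's MLP: the ill-conditioned test function, the dimensions, and the fixed fallback step 0.1 for negative curvature are assumptions). Each eigen vector of the Hessian is tried as a search direction with its own step length η_m = −(g·v_m)/λ_m, and a step is kept only if it surely reduces E:

```python
import numpy as np

def eigen_vector_descent_step(E, grad, hess, r):
    """One eigen vector descent step on a real error surface E(r)."""
    g, H = grad(r), hess(r)
    lam, V = np.linalg.eigh(H)           # orthonormal eigen pairs of the real Hessian
    for m in range(len(lam)):
        v = V[:, m]
        gv = g @ v
        if lam[m] > 0:
            eta = -gv / lam[m]           # minimizer of the second-order model along v
        else:
            eta = -np.sign(gv) * 0.1     # negative curvature: assumed small downhill move
        if E(r + eta * v) < E(r):        # keep the step only if it reduces E, else eta = 0
            r = r + eta * v
    return r

# Ill-conditioned quadratic mimicking a crevasse (condition number 1e6)
Hfix = np.diag([1e6, 1.0])
E = lambda r: 0.5 * r @ Hfix @ r
grad = lambda r: Hfix @ r
hess = lambda r: Hfix
r = np.array([1.0, 1.0])
for _ in range(3):
    r = eigen_vector_descent_step(E, grad, hess, r)
print(E(r))  # essentially 0: each eigen direction of a quadratic is solved exactly
```

Note how steepest descent on the same surface would oscillate across the steep wall, while stepping along each eigen direction with its own η_m handles both the steep and the flat directions at once.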

Crevasse Search (CS)


Assuming the search space of a complex-valued MLP is full of crevasse-like forms, steepest descent is combined with the eigen vector descent described above. A new search routine called Crevasse Search repeats a pair of steepest descent and eigen vector descent as many times as specified. Let N_max be the maximum number of repetitions to specify.

Reducibility Mapping (RM)


Sussmann [10] pointed out the uniqueness of weights and the reducibility of
real-valued MLPs. Much the same uniqueness and reducibility hold for complex-
valued MLPs [9]. Let MLP(J) and u be a complex-valued MLP having J hidden units and its optimal weights respectively. Applying the α-type reducibility mapping, we get MLP(J + 1) having the following weights w from MLP(J) having the optimal weights u. Note that hidden unit J + 1 is newly created, and the free weights {w_{J+1,k}^{(1)}} can take arbitrary values. This reducibility mapping will give us a good starting point for the search of MLP(J + 1). Incidentally, this reducibility mapping does not create a singular region of MLP(J + 1).

{w | w_{jk}^{(1)} = u_{jk}^{(1)},  w_{ij}^{(2)} = u_{ij}^{(2)},  w_{i,J+1}^{(2)} = 0,  i = 1, ..., I, j = 0, ..., J, k = 0, ..., K}
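The α-type mapping above can be sketched as follows (a minimal illustration with assumed array layouts: W1 is a (J+1)×(K+1) matrix of input-side weights including bias columns, W2 is I×(J+1)):

```python
import numpy as np

def alpha_reducibility_mapping(W1, W2, rng):
    """Build MLP(J+1) weights from optimal MLP(J) weights u = (W1, W2).

    The new hidden unit J+1 gets random input-side weights (the free
    weights w^(1)_{J+1,k}) and zero output-side weights w^(2)_{i,J+1},
    so the network function is unchanged at the starting point.
    """
    K1 = W1.shape[1]
    free = rng.normal(size=(1, K1)) + 1j * rng.normal(size=(1, K1))
    W1_new = np.vstack([W1, free])                        # add hidden unit J+1
    W2_new = np.hstack([W2, np.zeros((W2.shape[0], 1))])  # w^(2)_{i,J+1} = 0
    return W1_new, W2_new

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2))  # J = 2 hidden rows + bias row (assumed)
W2 = rng.normal(size=(1, 3)) + 1j * rng.normal(size=(1, 3))
W1n, W2n = alpha_reducibility_mapping(W1, W2, rng)
print(W1n.shape, W2n.shape)        # (4, 2) (1, 4)
print(np.all(W2n[:, -1] == 0))     # True: the new unit contributes nothing yet
```

Because the output weights of the new unit are zero, the error E is unchanged by the mapping, which is what makes it a good starting point for the search of MLP(J + 1).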

New Search Method: CS+RM


Making use of Crevasse Search and reducibility mapping, the procedure of our
new search method called CS+RM is given as below. Let Jmax and Lmax be
the maximum number of hidden units to consider and the maximum number of
search trials respectively. Moreover, θ is an error improvement threshold.

1. Initialize weights randomly, and J ← 1.


2. Call Crevasse Search, and let E(J) and w(J) be the error and weights after
learning.
Complex-Valued MLP Search Utilizing EVD and RM 5

3. while J ≤ Jmax do
   3.1 Apply reducibility mapping to get w(J + 1) from w(J), where the free
       weights {w_{J+1,k}^{(1)}} are left undetermined.
   3.2 for ℓ = 1, 2, · · · , Lmax do
       a. Initialize the free weights {w_{J+1,k}^{(1)}} randomly.
       b. Call Crevasse Search, and let E(J + 1) and w(J + 1) be the error and
          weights after learning.
       c. if E(J) − E(J + 1) > θE(J) then break end if
       end for
   3.3 J ← J + 1.
   end while
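The control flow of CS+RM can be sketched as the following driver loop (schematic only: `crevasse_search` and `reducibility_mapping` are stubs standing in for the routines above, and the stub "training" that halves a scalar error each sweep is an assumption just to make the loop runnable):

```python
def crevasse_search(w, n_max):
    # Stub for N_max repetitions of (steepest descent, eigen vector descent);
    # here it just shrinks a scalar "error" so the driver loop is runnable.
    w = dict(w, error=w["error"] * 0.5 ** n_max)
    return w["error"], w

def reducibility_mapping(w):
    # Stub for the alpha-type mapping MLP(J) -> MLP(J+1) of the previous section.
    return dict(w, J=w["J"] + 1)

def cs_rm(j_max=3, l_max=5, n_max=4, theta=1e-6):
    w = {"J": 1, "error": 1.0}              # step 1: random initialization, J = 1
    E, w = crevasse_search(w, n_max)        # step 2
    errors = {w["J"]: E}
    while w["J"] <= j_max:                  # step 3
        w_next = reducibility_mapping(w)    # 3.1 (free weights left undetermined)
        for _ in range(l_max):              # 3.2: up to L_max restarts
            # a. (real code would re-initialize the free weights here)
            E_next, w_next = crevasse_search(w_next, n_max)   # b.
            if E - E_next > theta * E:      # c. sufficient improvement -> accept
                break
        errors[w_next["J"]] = E_next
        E, w = E_next, w_next               # 3.3: proceed with J+1 hidden units
    return errors

errs = cs_rm()
print(errs)
```

The point of the structure is that each MLP(J + 1) search starts from a mapped solution of MLP(J), so the recorded error can only improve as hidden units are added.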

5 Experiments and Consideration


Experiment Using Unbounded Function
We compared the performance of the proposed method with steepest descent
using the following unbounded function: f(x) = 2x + i/(10x). Real variable x changes within the range [−1, 1], and the value range of the imaginary part of function f is unbounded. Training data set {(x^μ, f(x^μ)), μ = 1, 2, · · ·, 200} is generated for points with the equal interval 0.01; that is, x = −1, −0.99, · · ·, 0.99, 1. The point x = 0 is excluded. Table 1 shows the experimental conditions.
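The training set can be generated as follows (a sketch; reading the imaginary part as 1/(10x) is what makes it unbounded and requires excluding x = 0, and the grid construction below is my own way of building the 201-point grid before dropping the origin):

```python
import numpy as np

# Imaginary part 1/(10x): unbounded near x = 0, hence x = 0 must be excluded
f = lambda x: 2 * x + 1j / (10 * x)

x = np.arange(-100, 101) * 0.01          # -1.00, -0.99, ..., 0.99, 1.00
x = x[np.abs(x) > 1e-12]                 # drop x = 0 -> 200 training points
X, Y = x, f(x)
print(len(X), np.max(np.abs(Y.imag)))    # 200 points; |Im f| peaks near 10 at x = +/-0.01
```

Excluding the origin leaves exactly the 200 data points stated above, with the largest imaginary magnitudes at the two grid points closest to the pole.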

Table 1. Experimental conditions for learning the unbounded function

items                                        steepest descent    the proposed method
the number of hidden units J                 5                   -
max number of hidden units Jmax              -                   5
max number of sweeps                         10000               -
max number of CS iterations Nmax             -                   2000
max number of RM search iterations Lmax      -                   50
error improvement threshold θ                10^{-6}             10^{-6}
value range of initial weights               [−10, 10]           [−10, 10]

Figures 2 and 3 show the learning processes of steepest descent and the proposed method respectively. The error of the best solution found by steepest descent is around 10^0, while the proposed method found solutions whose errors are around 10^{-12}, much better than steepest descent. In Fig. 3 we see reducibility mapping (red circles) successively triggered error reductions to guide the search into a new promising search field.
The generalization of the complex-valued MLP learned by the proposed method was evaluated. Points with the equal interval 0.001 were used, ten times finer than the training data, in the range x ∈ [−2, 2], twice as wide as the training data. Thus, both interpolation and extrapolation capabilities were checked. Figure 4 shows excellent fitting; in Fig. 5, which shows the first quadrant in double log scale, we see some mismatches only for very small real parts around 10^{-3}.

Fig. 2. Transition of training error in the learning process of the unbounded function by steepest descent

Fig. 3. Transition of training error in the learning process of the unbounded function by the proposed method

Fig. 4. Output of complex-valued MLP for unknown data of the unbounded function

Fig. 5. Output of complex-valued MLP for unknown data of the unbounded function (log plot for the first quadrant)

Experiment Using Bessel Functions


Next, our method was applied to fit Bessel functions of the 1st and 2nd kinds.

J_α(x) = (x/2)^α Σ_{k=0}^{∞} (−x²/4)^k / (k! Γ(α + k + 1)),    Y_α(x) = (J_α(x) cos(απ) − J_{−α}(x)) / sin(απ)        (10)
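The series for J_α in eq. (10) can be evaluated directly as a partial sum (a sketch; the 60-term truncation is an arbitrary choice that is ample for the x range used here, and the small nudge of α in `bessel_y` is my crude way of handling the integer-α limit of the Y_α formula):

```python
import math

def bessel_j(alpha, x, terms=60):
    # J_alpha(x) = (x/2)^alpha * sum_k (-x^2/4)^k / (k! * Gamma(alpha + k + 1))
    s = 0.0
    for k in range(terms):
        s += (-x * x / 4.0) ** k / (math.factorial(k) * math.gamma(alpha + k + 1))
    return (x / 2.0) ** alpha * s

def bessel_y(alpha, x, eps=1e-6, terms=60):
    # Y_alpha from eq. (10); for integer alpha the quotient is a 0/0 limit,
    # approximated here by nudging alpha slightly off the integer
    a = alpha + eps
    return (bessel_j(a, x, terms) * math.cos(a * math.pi)
            - bessel_j(-a, x, terms)) / math.sin(a * math.pi)

print(round(bessel_j(1, 1.0), 6))   # 0.440051
```

This gives the "true values" the paper's figures compare against; a production computation would use a library routine instead of the raw series, but the partial sum makes the structure of eq. (10) explicit.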

Table 2. Experimental conditions for learning Bessel functions

items                                        values
max number of hidden units Jmax              10
max number of CS iterations Nmax             2000
max number of RM search iterations Lmax      50
error improvement threshold θ                10^{-6}
value range of initial weights               [−10, 10]

Fig. 6. Transition of training error (error vs. iteration number, log scale) in the learning process of Bessel functions by the proposed method; circles mark the points where reducibility mapping was performed

Fig. 7. True values of Bessel function J_α(x), α = 1, 2, ..., 5

Fig. 8. Output of complex-valued MLP for unknown data of Bessel function J_α(x), α = 1, 2, ..., 5. The number of hidden units J was 3, 5, 7 from top to bottom.

We used a complex-valued MLP which inputs real variable x and real integer α and outputs J_α(x) and Y_α(x). Variable x changes from 1 to 20 with the equal interval 0.1, and α is set to 1, 2, and 3; thus, the sample size is 191 × 3 = 573. Generalization was evaluated using points from 1 to 40, a range twice as large as the training data, together with α = 1, 2, 3, 4, 5, where α = 4, 5 require extrapolation. Table 2 shows the experimental conditions.
Figure 6 shows the learning process of the proposed method. We see again that reducibility mapping (red circles) triggered error reductions, nicely guiding the search. Figure 7 shows true values of Bessel function J_α(x), while Fig. 8 shows output of the complex-valued MLP learned by the proposed method. From Fig. 8, a small J (= 3) gives rather poor fitting and poor extrapolation, while a large J (= 7) gives unstable fitting. Excellent fitting and extrapolation were obtained for J = 5. Much the same tendency was observed for Y_α(x).

6 Conclusion
The paper proposed a new search method called CS+RM for a complex-valued
MLP, which makes use of eigen vector descent and reducibility mapping. Our
experiments using an unbounded function and Bessel functions showed the pro-
posed method worked well with nice generalization.

Acknowledgments. This work was supported by Grants-in-Aid for Scientific


Research (C) 22500212 and Chubu University Grant 24IS27A.

References
1. Amin, M. F., Amin, M.I., Al-Nuaimi, A.Y.H., Murase, K.: Wirtinger Calcu-
lus Based Gradient Descent and Levenberg-Marquardt Learning Algorithms in
Complex-Valued Neural Networks. In: Lu, B.-L., Zhang, L., Kwok, J. (eds.)
ICONIP 2011, Part I. LNCS, vol. 7062, pp. 550–559. Springer, Heidelberg (2011)
2. Delgado, K.K.: The complex gradient operator and the CR-calculus. ECE275A-
Lecture Supplement, Fall (2006)
3. Fukumizu, K., Amari, S.: Local minima and plateaus in hierarchical structure of
multilayer perceptrons. Neural Networks 13(3), 317–327 (2000)
4. Kim, T., Adali, T.: Approximation by fully complex multilayer perceptrons. Neural
Computation 15(7), 1641–1666 (2003)
5. Kuroe, Y., Yoshida, M., Mori, T.: On Activation Functions for Complex-Valued
Neural Networks–Existence of Energy Functions. In: Kaynak, O., Alpaydın, E.,
Oja, E., Xu, L. (eds.) ICANN 2003 and ICONIP 2003. LNCS, vol. 2714, pp. 985–
992. Springer, Heidelberg (2003)
6. Luenberger, D.G.: Linear and nonlinear programming. Addison-Wesley (1984)
7. Leung, H., Haykin, S.: The complex backpropagation algorithm. IEEE Trans. Sig-
nal Processing 39(9), 2101–2104 (1991)
8. Nakano, R., Satoh, S., Ohwaki, T.: Learning method utilizing singular region of
multilayer perceptron. In: Proc. 3rd Int. Conf. on Neural Computation Theory and
Applications, pp. 106–111 (2011)
9. Nitta, T.: Reducibility of the complex-valued neural network. Neural Information
Processing - Letters and Reviews 2(3), 53–56 (2004)
10. Sussmann, H.J.: Uniqueness of the weights for minimal feedforward nets with a
given input-output map. Neural Networks 5(4), 589–593 (1992)
Theoretical Analysis of Function of Derivative
Term in On-Line Gradient Descent Learning

Kazuyuki Hara¹, Kentaro Katahira²,³, Kazuo Okanoya³,⁴, and Masato Okada⁴,³,²

¹ College of Industrial Technology, Nihon University,
  1-2-1, Izumi-cho, Narashino, Chiba 275-8575, Japan
  hara.kazuyuki@nihon-u.ac.jp
² Center for Evolutionary Cognitive Sciences, The University of Tokyo,
  3-8-1, Komaba, Meguro-ku, Tokyo, Japan
³ Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan
⁴ Graduate School of Frontier Science, The University of Tokyo,
  5-1-5, Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
  okada@k.u-tokyo.ac.jp

Abstract. In on-line gradient descent learning, the local property of the derivative term of the output can slow convergence. Improving the derivative term, such as by using the natural gradient, has been proposed for speeding up convergence. Besides this sophisticated method, a "simple method" that replaces the derivative term with a constant has been proposed and shown to greatly increase convergence speed. Although this phenomenon has been analyzed empirically, theoretical analysis is required to show its generality. In this paper, we theoretically analyze the effect of using the simple method. Our results show that, with the simple method, the generalization error decreases faster than with the true gradient descent method when the learning step is smaller than the optimum value η_opt. When it is larger than η_opt, the error decreases more slowly with the simple method, and the residual error is larger than with the true gradient descent method. Moreover, when there is output noise, η_opt is no longer optimum; thus, the simple method is not robust in noisy circumstances.

1 Introduction
Learning in neural networks can be formulated as optimization of an objective
function that quantifies the system’s performance. An important property of
feed-forward networks is their ability to learn a rule from examples. Statistical
mechanics has been successfully used to study this property, mainly for the
simple perceptron [1,2,3]. A compact description of the learning dynamics can
be obtained by using statistical mechanics, which uses a large input dimension
N and provides an accurate model of mean behavior for a realistic N [2,3,4].
Several studies have investigated ways to accelerate the learning process [5,6,7]. For example, slow convergence due to plateaus occurs in the learning process when a gradient descent algorithm is used. In gradient descent learning, the

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 9–16, 2012.

c Springer-Verlag Berlin Heidelberg 2012
10 K. Hara et al.

parameters are updated in the direction of the steepest descent of the objec-
tive function, and the derivative of the output is taken into account. Fahlman
[8] proposed, on the basis of empirical studies, a “simple method” in which the
derivative term is replaced with a constant, thereby speeding up the conver-
gence. However, the results should be supported by theoretical analysis to show
its generality.
In this paper, we theoretically analyze the effect of using the simple method by
using statistical mechanics methods and derive coupled differential equations of
the order parameters depicting its learning behavior. We validate the analytical
solutions by comparing them with those of computer simulation. Then we com-
pare the behavior of the true gradient descent method and the simple method
from the theoretical point of view. Our results show that the simple method leads to faster convergence up to an optimum learning rate; beyond this rate it leads to slower convergence. We also show that, in the presence of output noise, the optimum learning rate changes, which means the simple method is not robust in noisy circumstances. Consequently, how the derivative term affects the learning speed and the robustness to noise is clarified.

2 Formulation

In this section, we formulate teacher and student networks and a gradient descent
algorithm in which the derivative term is replaced with a constant. We use a
teacher-student formulation, so we assume the existence of a teacher network
that produces the desired outputs. Teacher output t is the target of student
output s.
Consider a teacher and a student that are perceptrons with connection weights B = (B_1, ..., B_N) and J^m = (J_1^m, ..., J_N^m), respectively, where m denotes the number of learning iterations. We assume that the teacher and student perceptrons receive the N-dimensional input ξ^m = (ξ_1^m, ..., ξ_N^m), that the teacher outputs t^(m) = g(y_m), and that the student outputs s^(m) = g(x_m). Here, g(·) is the output function, y_m is the inner potential of the teacher calculated using y_m = Σ_{i=1}^N B_i ξ_i^m, and x_m is the inner potential of the student calculated using x_m = Σ_{i=1}^N J_i^m ξ_i^m.
We assume that the elements ξ_i^m of the independently drawn input ξ^m are uncorrelated random variables with zero mean and unit variance; that is, the i-th element of the input is drawn from a probability distribution P(ξ_i). The thermodynamic limit of N → ∞ is also assumed. The statistics of the inputs at the thermodynamic limit are ⟨ξ_i^m⟩ = 0, ⟨(ξ_i^m)²⟩ = 1, and ‖ξ^m‖ = √N, where ⟨···⟩ denotes the average and ‖·‖ denotes the norm of a vector. Each element B_i, i = 1, ..., N, is drawn from a probability distribution with zero mean and 1/N variance. With the assumption of the thermodynamic limit, the statistics of the teacher weight vector are ⟨B_i⟩ = 0, ⟨(B_i)²⟩ = 1/N, and ‖B‖ = 1. The distribution of the inner potential y_m follows a Gaussian distribution with zero mean and unit variance in the thermodynamic limit. For the sake of analysis, we assume that each element of J_i^0, which is the initial value of the student vector J^m, is drawn
Function of Derivative in Gradient Descent Learning 11

from a probability distribution with zero mean and 1/N variance. The statistics of the initial student weight vector are ⟨J_i^0⟩ = 0, ⟨(J_i^0)²⟩ = 1/N, and ‖J^0‖ = 1 at the thermodynamic limit. The output function of the student g(·) is the same as that of the teacher. The distribution of the inner potential x_m follows a Gaussian distribution with zero mean and (Q^m)² variance in the thermodynamic limit. Here, (Q^m)² = J^m · J^m.
Next, we introduce the gradient descent algorithm. For the possible inputs
{ξ}, we want to train the student network to produce desired outputs t = s. The
generalization error is defined as the squared error averaged over the possible
inputs.

ε_g = ⟨(1/2)(t − s − n)²⟩ = ⟨(1/2)(g(y_m) − g(x_m) − n)²⟩   (1)
Angle brackets ⟨·⟩ denote the average over possible inputs. We assume the presence of noise n in the student output, where n is drawn from a probability distribution with zero mean and unit variance. At each learning step m, a new uncorrelated input ξ^m is presented, and the current student weight vector J^m
is updated using

J^{m+1} = J^m + (η/N)(g(y_m) − g(x_m)) g′(x_m) ξ^m,   (2)

where η is the learning step size and g′(x) is the derivative of the output function g(x).

3 Theory
In this section, we first show why the local property of the derivative of the
output slows convergence and then derive equations that depict the learning
dynamics.
We use a sigmoid function as the output function of the perceptrons: g(x) = erf(x/√2). The derivative of this function is g′(x) = √(2/π) exp(−x²/2). Since g′(x) is a Gaussian function, it decreases quickly along x. As explained in the previous section, the distribution of the inner potential P(x) follows a Gaussian distribution with mean zero and unit variance in the thermodynamic limit of N → ∞. Consequently, g′(x) for non-zero x is very small, so the update of the student weight from (2) is very small, which reduces the convergence speed.
We expand exp(−x²/2) ∼ 1 − x²/2 + x⁴/8 − ··· and use the first term. When the constant first term is used in place of g′(x), the update for non-zero x becomes larger. A better approach might be to use a constant value a instead of 1 (the first term). We thus modify the learning equation to include a constant term:

J^{m+1} = J^m + (ηa/N)(erf(y_m/√2) − erf(x_m/√2) − n) ξ^m = J^m + (ηa/N) δ ξ^m.   (3)

We replace ηa with η′ for simplicity.
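As a concrete illustration (not part of the paper's analysis), the two update rules (2) and (3) can be sketched as below; the network size, step sizes, random seed, and the omission of output noise are arbitrary demo choices.

```python
import numpy as np
from math import erf, exp, sqrt, pi

def g(u):                        # output function g(u) = erf(u / sqrt(2))
    return erf(u / sqrt(2.0))

def g_prime(u):                  # its derivative, sqrt(2/pi) * exp(-u^2 / 2)
    return sqrt(2.0 / pi) * exp(-u * u / 2.0)

def true_gradient_step(J, B, xi, eta):
    """One update of eq. (2): the derivative term g'(x) is kept."""
    y, x = float(B @ xi), float(J @ xi)
    return J + (eta / len(xi)) * (g(y) - g(x)) * g_prime(x) * xi

def simple_method_step(J, B, xi, eta_prime):
    """One update of eq. (3): g'(x) replaced by a constant absorbed into
    eta_prime = eta * a; output noise omitted for this demo."""
    y, x = float(B @ xi), float(J @ xi)
    return J + (eta_prime / len(xi)) * (g(y) - g(x)) * xi

rng = np.random.default_rng(0)
N = 1000
B = rng.normal(0.0, 1.0 / sqrt(N), N)      # teacher weights, variance 1/N
J0 = rng.normal(0.0, 1.0 / sqrt(N), N)     # initial student weights
J_t, J_s = J0.copy(), J0.copy()
for _ in range(5 * N):                     # 5N examples, i.e. alpha = 5
    xi = rng.normal(0.0, 1.0, N)           # fresh input with unit-variance elements
    J_t = true_gradient_step(J_t, B, xi, eta=0.5)
    J_s = simple_method_step(J_s, B, xi, eta_prime=0.5)
print(np.dot(J_t, B), np.dot(J_s, B))      # overlaps R = J·B for the two methods
```

With this small step size (below η_opt), the simple method's overlap grows faster, consistent with the comparison in Section 4.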

The generalization error is given by (4). Noise n is added to the student output. The generalization error when using the sigmoid function is calculated using

ε_g = ⟨(1/2)(g(y_m) − g(x_m) − n)²⟩
    = (1/π) sin⁻¹(1/2) + (1/π) sin⁻¹(Q²/(1 + Q²)) − (2/π) sin⁻¹(R/√(2(1 + Q²))) + σ²/2,   (4)

where σ² is the variance of the additive noise n. By substituting Q² and R at every time step, we can obtain the generalization error.
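The closed form (4) can be checked against a direct Monte Carlo average; the particular Q², R, and σ² values below are arbitrary test points, not from the paper.

```python
import numpy as np
from math import asin, sqrt, pi, erf

def eps_g(Q2, R, sigma2=0.0):
    """Generalization error, eq. (4)."""
    return (asin(0.5) / pi
            + asin(Q2 / (1.0 + Q2)) / pi
            - 2.0 * asin(R / sqrt(2.0 * (1.0 + Q2))) / pi
            + sigma2 / 2.0)

# Monte Carlo check: (y, x) jointly Gaussian with Var y = 1, Var x = Q^2,
# Cov(y, x) = R, plus independent output noise of variance sigma2.
rng = np.random.default_rng(1)
Q2, R, sigma2 = 1.3, 0.6, 0.1
yx = rng.multivariate_normal([0.0, 0.0], [[1.0, R], [R, Q2]], size=200_000)
y, x = yx[:, 0], yx[:, 1]
n = rng.normal(0.0, sqrt(sigma2), len(y))
gv = np.vectorize(lambda u: erf(u / sqrt(2.0)))
mc = 0.5 * np.mean((gv(y) - gv(x) - n) ** 2)
print(eps_g(Q2, R, sigma2), mc)    # the two estimates nearly coincide
```

Note that ε_g vanishes at perfect learning (Q² = 1, R = 1, σ² = 0), since the three arcsine terms cancel.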
The differential equations of the order parameters Q² = J · J and R = J · B are the same as those used by Biehl and Schwarze [2]. (Their derivation is given in the Appendix.)

dQ²/dα = 2η′⟨δx⟩ + η′²⟨δ²⟩,   (5)

dR/dα = η′⟨δy⟩,   (6)

where δ = erf(y/√2) − erf(x/√2) − n, and α is time, defined as α = m/N; we assume the limit of N → ∞. Note that (5) and (6) are macroscopic equations while (2) and (3) are microscopic equations. By calculating the three averages ⟨δx⟩, ⟨δ²⟩, and ⟨δy⟩, we get two closed differential equations.


dR/dα = (η′/√π)(1 − 2R/√(2(1 + Q²)))   (7)

dQ²/dα = (2η′/√π)(R − 2Q²/√(2(1 + Q²)))
  + (η′)²{ (2/π)[ sin⁻¹(1/2) + sin⁻¹(Q²/(1 + Q²)) − 2 sin⁻¹(R/√(2(1 + Q²))) ] + σ² }   (8)
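A minimal Euler integration of (7) and (8) reproduces the learning curves; the step size, initial conditions, and η′ below are illustrative choices.

```python
import numpy as np
from math import asin, sqrt, pi

def rhs(Q2, R, eta, sigma2=0.0):
    """Right-hand sides of eqs. (7) and (8) for the simple method."""
    root = sqrt(2.0 * (1.0 + Q2))
    dR = (eta / sqrt(pi)) * (1.0 - 2.0 * R / root)
    dQ2 = ((2.0 * eta / sqrt(pi)) * (R - 2.0 * Q2 / root)
           + eta ** 2 * ((2.0 / pi) * (asin(0.5) + asin(Q2 / (1.0 + Q2))
                                       - 2.0 * asin(R / root)) + sigma2))
    return dQ2, dR

def integrate(eta, sigma2=0.0, alpha_max=20.0, dalpha=1e-3):
    Q2, R = 1.0, 0.0                 # ||J^0|| = 1 and no initial overlap
    for _ in range(int(alpha_max / dalpha)):
        dQ2, dR = rhs(Q2, R, eta, sigma2)
        Q2, R = Q2 + dalpha * dQ2, R + dalpha * dR
    return Q2, R

Q2, R = integrate(eta=0.5)
print(Q2, R)     # noise-free: both order parameters approach 1 (perfect learning)
```

Feeding the resulting Q²(α) and R(α) into (4) gives the analytical learning curves plotted in Section 4.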

4 Results
In this section, we first present the results for noise-free cases and compare the
analytical solutions with those of computer simulation. We then present and
discuss the results for noise cases. In the figures presented here, the horizontal
axis is continuous time α = m/N , where m is the learning iteration. The vertical
axis for the analytical solutions is the generalization error ε_g; for the simulation solutions, it is the mean squared error over N inputs.

Figure 1 shows the results for the noise-free cases. The learning step size η′ is 0.1, 0.5, 2.7, 3.0, or 5.0, and we set ‖B‖ = 1, ‖J^0‖ = 1, and ‖ξ‖ = √N. In the simulations, N = 1000. The curves in the figure show the analytical solutions, and the symbols show the simulation results: "+" is for η′ = 0.1, "×" is for 0.5, "∗" is for 2.7, "□" is for 3.0, and "○" is for 5.0. The close agreement between the analytical and simulation results validates the analytical results.

Fig. 1. Analytical solutions and simulation results for noise-free cases (generalization error vs. α)

Next, we compare the true gradient descent method with the simple method using the analytical solutions. As reported by Biehl and Schwarze, the optimum learning step size is ηopt ≈ 2.7 [2]. With this in mind, we compare the generalization errors for learning step sizes η′ = 0.1, 0.5, 3.0, and 5.0. Figure 2 shows the results. In the figures, "T" is the true gradient descent method, and "P" is the simple method. For η′ = 0.1 and 0.5, the generalization error with the simple method decreases faster than with the true gradient descent method. For η′ = 3.0, the generalization error with the true gradient descent method decreases faster than with the simple method. With both methods, the generalization error approaches zero. For η′ = 5.0, the residual generalization error with the simple method is larger than with the true gradient descent method.
Figure 3 shows the results for η′ = ηopt = 2.7 for both the analytical and simulation solutions. Label "T" shows the results of the true gradient descent method, and label "P" shows the results of the simple method. The analytical solutions agree with the simulation ones, meaning that the generalization error is reduced at the same rate with both methods when ηopt is used. Therefore, when the learning step size is η′ < ηopt, the generalization error with the simple method decreases faster than with the true gradient descent one, and the generalization error with the true gradient descent method decreases faster than with the simple one when η′ > ηopt.
Next, we present and discuss the results for noisy cases. Figure 4 shows the results. The learning step size η′ is 0.1, 0.5, 2.7, 3.0, or 5.0, and we set ‖B‖ = 1, ‖J^0‖ = 1, and ‖ξ‖ = √N. In the simulation, N = 1000. The curves in the figures

Fig. 2. Comparison of generalization error between true gradient descent and simple methods (generalization error vs. α; "T" denotes the true gradient descent method and "P" the simple method)

Fig. 3. Comparison of asymptotic property between simple and true gradient descent methods for both analytical and simulation solutions (generalization error vs. α)

show the analytical solutions, and the symbols show the simulation solutions: "+" is for η′ = 0.1, "×" is for 0.5, "∗" is for 2.7, "□" is for 3.0, and "○" is for 5.0. As shown in the figures, the presence of noise greatly increases the residual error for η′ ≥ 2.7. The optimum learning step size is no longer η′ = 2.7; the fastest convergence is attained with η′ = 0.5.

From the analytical results, the effect of the noise in the differential equation of Q² increases from σ²/√3 to σ² when the derivative term g′(x) is replaced with the constant a (see eqs. (10) and (8)). This change in Q² causes a larger generalization error. This demonstrates that the simple method is sensitive to added output noise.

Fig. 4. Comparison of learning behavior between noise-free (left) and noisy (right) cases (generalization error vs. α)

5 Conclusion

We have analyzed the simple method, which uses a constant value a instead of g′(x). We derived closed order-parameter differential equations depicting the dynamic behavior of the learning system and solved for the generalization error by theoretical analysis. The analytical solutions were confirmed by the simulation results. We found that the generalization error decreases faster with the simple method than with the true gradient descent method when the learning step size is fixed at η′ < ηopt. When η′ > ηopt, the generalization error decreases more slowly with the simple method and the residual error is larger than with the true gradient descent method. The addition of output noise changed the optimum learning rate, meaning that the simple method is not robust in noisy circumstances. Consequently, how the derivative term affects the learning speed and the robustness to noise has been clarified.

A Derivation of Order Parameter Equations

The order parameter equations ((5) and (6)) are derived from learning equation
(3). To obtain the deterministic differential equation for Q, we square both sides
of (3) and then average the terms in the equation by using the distribution of
P(x, y). Since Q has a self-averaging property, we get

(Q^{m+1})² = (Q^m)² + (2η′/N)⟨δx⟩ + (η′²/N)⟨δ²⟩,   (9)

where N = ‖ξ‖². Denoting time as α = m/N and assuming N → ∞, (9) becomes a differential equation. We then set (Q^m)² → Q², (Q^{m+1})² → Q² + dQ², and 1/N → dα, resulting in eq. (5).
The differential equation for the overlap R (6) is obtained by calculating the product of B and eq. (3). We then average the terms in the equation. Since R also has a self-averaging property, we get (6).

B Order Parameter Equation of Q² with Derivative Term g′(x)

When the derivative g′(x) is taken into account in (3), the differential equation of the order parameter Q² becomes

dQ²/dα = (4η/π)(1/(1 + Q²))[ R/√(2(1 + Q²) − R²) − Q²/√(1 + 2Q²) ]
  + (2η²/π)(2/π)(1/√(1 + 2Q²))[ sin⁻¹((1 + 2(Q² − R²))/(2(1 + 2Q² − R²))) + sin⁻¹(Q²/(1 + 3Q²))
  − 2 sin⁻¹(R/√(2(1 + 2Q² − R²)(1 + 3Q²))) ] + σ²/√3.   (10)

Here, σ² is the variance of the additive noise n.

References
1. Krogh, A., Hertz, J., Palmer, R.G.: Introduction to the Theory of Neural Compu-
tation. Addison-Wesley, Redwood City (1991)
2. Biehl, M., Schwarze, H.: Learning by on-line gradient descent. Journal of Physics
A: Mathematical and General Physics 28, 643–656 (1995)
3. Saad, D., Solla, S.A.: On-line learning in soft-committee machines. Physical Review
E 52, 4225–4243 (1995)
4. Hara, K., Katahira, K., Okanoya, K., Okada, M.: Statistical Mechanics of On-Line
Node-perturbation Learning. Information Processing Society of Japan, Transactions
on Mathematical Modeling and Its Applications 4(1), 72–81 (2011)
5. Fukumizu, K.: A Regularity Condition of the Information Matrix of a Multilayer
Perceptron Network. Neural Networks 9(5), 871–879 (1996)
6. Rattray, M., Saad, D.: Incorporating Curvature Information into On-line learning.
In: Saad, D. (ed.) On-line Learning in Neural Networks, pp. 183–207. Cambridge
University Press, Cambridge (1998)
7. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10,
251–276 (1998)
8. Fahlman, S.E.: An Empirical Study of Learning Speed in Back-Propagation Net-
works, CMU-CS-88-162 (1988)
9. Williams, C.K.I.: Computation with Infinite Neural Networks. Neural Computa-
tion 10, 1203–1216 (1998)
Some Comparisons of Networks with Radial
and Kernel Units

Věra Kůrková

Institute of Computer Science, Academy of Sciences of the Czech Republic


Pod Vodárenskou věžı́ 2, 18207 Prague, Czech Republic
vera@cs.cas.cz

Abstract. Two types of computational models, radial-basis function networks with units having varying widths and kernel networks where all units have a fixed width, are investigated in the framework of scaled kernels. The impact of widths of kernels on approximation of multivariable functions, generalization modelled by regularization with kernel stabilizers, and minimization of error functionals is analyzed.

Keywords: Radial and kernel networks, universal approximation property, fixed and varying widths, minimization of error functionals, stabilizers induced by kernels.

1 Introduction

Originally, artificial neural networks were built from biologically inspired per-
ceptrons. Later, other types of computational units became popular in neuro-
computing merely due to their good mathematical properties. Among them,
radial-basis-function (RBF) units introduced by Broomhead and Lowe [1] and
kernel units introduced by Girosi and Poggio [2] became most popular. In partic-
ular, kernel units with symmetric positive semidefinite kernels have been widely
used due to their good classification properties [3]. In contrast to RBF networks,
where both centers and widths are adjustable, in networks with units defined by
symmetric kernels, all units have the same fixed width determined by the choice
of the kernel. Both computational models have their advantages. RBF networks
are known to be universal approximators [4,5]. In addition to the capability of
RBF networks to approximate arbitrarily well all reasonable real-valued func-
tions, model complexity of RBF networks is often lower than complexity of
traditional linear approximators (see, e.g., [6,7,8] for some estimates). On the
other hand, kernel models with symmetric positive semidefinite kernels benefit
from geometrical properties of Hilbert spaces generated by these kernels. These
properties allow application of maximal margin classification [3], generate suit-
able stabilizers for modeling of generalization in terms of regularization [9], and
lead to mathematical description of theoretically optimal solutions of learning
tasks [10,11,12]. Thus both types of computational models, the one with units
having fixed widths and the one with units having variable widths, have their
advantages.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 17–24, 2012.

c Springer-Verlag Berlin Heidelberg 2012
18 V. Kůrková

In this paper, we investigate mathematical properties of these two types of computational models in the framework of scaled kernels. First, we show that besides the well-known classification capabilities of networks with units defined by positive definite symmetric kernels, many such networks are also suitable for regression. We prove the universal approximation property for networks with units
induced by convolution kernels with positive Fourier transforms. Further we in-
vestigate minimization of error functionals over kernel networks with units with
fixed and varying widths. We describe multiple minima of empirical error func-
tionals in spaces of continuous functions and estimate dependence of stabilizers
induced by kernels on scalings of these kernels and on the input dimension.
The paper is organized as follows. In section 2, notations and basic concepts
on RBF and kernel models are introduced. In section 3, the role of width in
proofs of the universal approximation is discussed and this property is proven
for a wide class of kernel networks with fixed widths. In section 4, minima of
error functionals over networks with fixed and varying widths are investigated, and the effect of a change of the width on regularization is estimated.

2 Radial and Kernel Units


Radial-basis-function networks as well as kernel models belong to a class of
one-hidden-layer networks with one linear output unit. Such networks compute
functions from sets of the form
span_n G := { Σ_{i=1}^n w_i g_i | w_i ∈ R, g_i ∈ G },

where the set G is called a dictionary [13], n is the number of hidden units, and R denotes the set of real numbers. The set of input-output functions of networks with an arbitrary number of hidden units is denoted

span G := { Σ_{i=1}^n w_i g_i | w_i ∈ R, g_i ∈ G, n ∈ N₊ },

where N+ is the set of positive integers. Often, dictionaries are parameterized


families of functions modeling computational units, i.e., they are of the form

GK (X, Y ) := {K(., y) : X → R | y ∈ Y }

where K : X × Y → R is a function of two variables, an input vector x ∈ X ⊆


Rd and a parameter y ∈ Y ⊆ Rs . Such functions of two variables are called
kernels. This term, derived from the German term “kern”, has been used since
1904 in theory of integral operators [14, p.291]. Kernel units were introduced to
neurocomputing by Girosi and Poggio [2] as an extension of radial units, however
later the term kernel became often reserved to symmetric positive semidefinite
kernels.
Radial units (RBF) are nonsymmetric kernels B_ψ : R^d × R^{d+1} → R defined as B_ψ(x, (b, v)) = ψ(b‖x − v‖), where ψ : R → R is a one-variable function (sometimes it is defined merely on R₊ and typically lim_{t→∞} ψ(t) = 0), ‖·‖ is
Networks with Radial and Kernel Units 19

the Euclidean norm on R^d, v is called a center, and 1/b a width. Thus RBF units generate dictionaries G_{B_ψ}(X, Y) := {ψ(b‖· − v‖) : X → R | b ∈ R₊, v ∈ Y}. So for RBF units, sets of inputs X differ from sets of parameters Y, as in addition to centers, also widths are varying. Fixing a width b > 0, we get from an RBF kernel B_ψ a symmetric kernel B_ψ^b : R^d × R^d → R.
We investigate radial and symmetric kernel units in terms of scaled kernels. For K : R^d × R^d → R, we denote by K^a : R^d × R^d → R the kernel defined as

K^a(x, y) = K(ax, ay).

When it is clear from the context, we also use K^a to denote the restriction of K^a to X × X, where X ⊆ R^d. Thus a symmetric kernel K induces two dictionaries: a dictionary with fixed widths

G_K(X) := {K(·, y) : X → R | y ∈ X}

and a dictionary with varying widths

F_K(X) := {K^a(·, y) : X → R | a ∈ R₊, y ∈ X} = ∪_{a>0} G_{K^a}(X).

3 Universal Approximation Property


In this section, we show that although kernel units with fixed widths have fewer free parameters than radial units with varying widths, for many kernels they generate classes of input-output functions large enough to be universal approximators. Recall that a class of one-hidden-layer networks with units from a dictionary G is said to have the universal approximation property in a normed linear space (X, ‖·‖_X) if it is dense in this space, i.e., cl_X span G = X, where cl_X denotes the closure with respect to the topology induced by the norm ‖·‖_X. Function spaces where the universal approximation property has been of interest are the spaces (C(X), ‖·‖_sup) of continuous functions on subsets X of R^d (typically compact) with the supremum norm and the space (L²(R^d), ‖·‖_L²) of square integrable functions on R^d with the norm ‖f‖_L² = (∫_{R^d} f(y)² dy)^{1/2}. Note that the capability to approximate arbitrarily well all real-valued functions is much stronger than the capability of classification, which merely needs approximation up to a certain accuracy of functions with finite (or even binary) domains.
For RBF networks with functions ψ satisfying 0 ≠ ∫_R ψ(t)dt < ∞, the universal approximation property was proven by Park and Sandberg [4,5]. Their proof exploits varying widths – it is based on a classical result on approximation of functions by sequences of their convolutions with scaled kernels. This proof
might suggest that variability of widths is essential for the universal approxi-
mation. However for the special case of Gaussian kernels with any fixed width,
Mhaskar [15] proved the universal approximation capability in spaces of contin-
uous functions on compact subsets of Rd . His proof is based on properties of the
derivatives of the Gaussian function (they have the form of products of Hermite
polynomials with the Gaussian function) and so it cannot be extended to other
kernels.

Our next theorem shows that for convolution kernels with certain properties
of their Fourier transforms, variability of widths is not a necessary condition for
the universal approximation. Networks with such kernel units even with fixed
widths can approximate arbitrarily well all functions from L2 (Rd ).
Recall that a convolution kernel K is induced by translations of a one-variable function k : R^d → R, i.e., K(x, y) = k(x − y), and so G_K(X) := {k(· − y) | y ∈ Y}. The convolution is an operation defined as

f ∗ g(x) = ∫_{R^d} f(x − y)g(y)dy = ∫_{R^d} f(y)g(x − y)dy

[16, p.170]. The d-dimensional Fourier transform is an isometry on L²(R^d) defined on L²(R^d) ∩ L¹(R^d) as f̂(s) = (1/(2π)^{d/2}) ∫_{R^d} e^{i x·s} f(x) dx and extended to L²(R^d) [16, p.183]. By λ is denoted the Lebesgue measure.
Theorem 1. Let d be a positive integer, k ∈ L¹(R^d) ∩ L²(R^d) be such that λ({s ∈ R^d | k̂(s) = 0}) = 0, and K : R^d × R^d → R be defined as K(x, y) = k(x − y). Then span G_K is dense in (L²(R^d), ‖·‖_L²).
Proof. Suppose that cl_L² span G_K(R^d) ≠ L²(R^d). Then by the Hahn-Banach Theorem [16, p. 60] there exists a linear functional l on L²(R^d) such that for all f ∈ cl_L² span G_K(R^d), l(f) = 0, and for some f₀ ∈ L²(R^d) \ cl_L² span G_K(R^d), l(f₀) = 1. By the Riesz Representation Theorem [17], there exists h ∈ L²(R^d) such that for all g ∈ L²(R^d), l(g) = ∫_{R^d} g(y)h(y)dy. Thus for all f ∈ cl_L² span G_K(R^d), ∫_{R^d} f(y)h(y)dy = 0. In particular, for all x ∈ R^d, ∫_{R^d} h(y)k(x − y)dy = h ∗ k(x) = 0. Thus by the Plancherel Theorem [16, p.188], ‖h ∗ k‖_L² = 0. As (h ∗ k)^ = (1/(2π)^{d/2}) ĥ k̂ [16, p.183], we have ‖ĥ k̂‖_L² = 0 and so ∫_{R^d} (ĥ(s)k̂(s))² ds = 0. Denoting S_k = {s ∈ R^d | k̂(s) = 0}, we have λ(S_k) = 0 and so ∫_{R^d} ĥ(s)²k̂(s)² ds = ∫_{R^d \ S_k} ĥ(s)²k̂(s)² ds = 0. Hence ∫_{R^d} ĥ(s)² ds = ∫_{R^d \ S_k} ĥ(s)² ds = 0 and thus ‖ĥ‖²_L² = 0. So by the Plancherel Theorem, ‖h‖_L² = 0. Hence we get 1 = l(f₀) = ∫_{R^d} f₀(y)h(y)dy ≤ ‖f₀‖_L² ‖h‖_L² = 0, which is a contradiction. □

Corollary 1. Let d be a positive integer, k ∈ L¹(R^d) ∩ L²(R^d) be such that λ({s ∈ R^d | k̂(s) = 0}) = 0, and K : R^d × R^d → R be defined as K(x, y) = k(x − y). Then
(i) for a Lebesgue measurable X ⊆ R^d, span G_K(X) is dense in (L²(X), ‖·‖_L²);
(ii) for a compact X ⊂ R^d and k continuous on R^d, span G_K(X) is dense in (C(X), ‖·‖_sup).

Proof. (i) Extending functions from L²(X) to L²(R^d) by setting their values equal to zero outside of X and restricting their approximations from span G_K(R^d) to X, we get the statement from Theorem 1.
(ii) The statement follows from (i) as for X compact, C(X) ⊂ L²(X). □

Note that Theorem 1 and Corollary 1 imply the universal approximation property of Gaussian kernel networks with any fixed width both in (L²(R^d), ‖·‖_L²) and in (C(X), ‖·‖_sup) with X compact. Indeed, for any a > 0, the Fourier transform of the scaled d-dimensional Gaussian function satisfies (e^{−a²‖·‖²})^ = (√2 a)^{−d} e^{−‖·‖²/(4a²)} [16, p.186]. So our results provide an alternative to Mhaskar's proof of the universal approximation property of Gaussian networks with fixed widths. Moreover, our proof technique applies to a wider class of kernels than Gaussians and holds in both L²(R^d) and C(X). In particular, it applies to all convolution kernels induced by functions with positive Fourier transforms. Such kernels are known to be positive definite and thus they play an important role in classification and generalization [18].
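Positive definiteness of such kernels can be sanity-checked numerically: the Gram matrix of a Gaussian kernel on any set of distinct centers should have no negative eigenvalues. The random centers and the particular widths below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=(30, 3))                     # 30 random centers in R^3
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
mins = []
for a in (0.5, 1.0, 2.0):                        # several fixed widths
    G = np.exp(-(a ** 2) * sq)                   # Gram matrix of the scaled Gaussian K^a
    mins.append(float(np.linalg.eigvalsh(G).min()))
print(mins)                                      # no significantly negative eigenvalues
```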

4 Minimization of Error Functionals over Networks with Scaled Kernels
In this section, we investigate minimization of error functionals over networks with kernel units with fixed and varying widths. Recall that a kernel K : X × X → R is called positive semidefinite if for any positive integer m, any x₁, ..., x_m ∈ X and any a₁, ..., a_m ∈ R, Σ_{i=1}^m Σ_{j=1}^m a_i a_j K(x_i, x_j) ≥ 0. For symmetric positive semidefinite kernels K, the sets span G_K(X) of input-output functions of networks with units induced by the kernel K are contained in Hilbert spaces defined by these kernels. Such spaces are called reproducing kernel Hilbert spaces (RKHS) and denoted H_K(X). They are formed by functions from span G_K(X) together with limits of their Cauchy sequences in the norm ‖·‖_K. The norm ‖·‖_K is induced by the inner product ⟨·,·⟩_K, which is defined on G_K(X) = {K_x | x ∈ X} as ⟨K_x, K_y⟩_K := K(x, y), where K_x(·) = K(x, ·).
Mathematical properties of RKHSs make it possible to derive characterizations of theoretically optimal solutions of learning tasks and to model generalization. An empirical error functional E_z, determined by a training sample z = {(u_i, v_i) ∈ R^d × R | i = 1, ..., m} of input-output pairs of data and a quadratic loss function, is defined as

E_z(f) := (1/m) Σ_{i=1}^m (f(u_i) − v_i)².

Girosi and Poggio [2] initiated mathematical modeling of generalization in terms of regularization, which has been used to improve stability of inverse problems (see, e.g., [19]). Tikhonov regularization adds to the empirical error a functional called a stabilizer, penalizing undesired properties of solutions. Girosi, Jones and Poggio [20] considered as stabilizers suitably weighted Fourier transforms; later Girosi [9] realized that such stabilizers are squares of norms on RKHSs. We denote by

E_{z,α,K} := E_z + α‖·‖²_K

the regularized empirical error with the stabilizer ‖·‖²_K. The next theorem characterizes argminima of E_z and E_{z,α,K}. The part (ii), often called the Representer Theorem, was proven by several authors [21,10,11] using Fréchet derivatives. The parts (i) and (iii) are from [12], where they were derived using methods from the theory of inverse problems. By K[u] is denoted the matrix K[u]_{i,j} = K(u_i, u_j), K_m[u] = (1/m) K[u], and K[u]⁺ denotes the Moore-Penrose pseudoinverse of the matrix K[u].

Theorem 2. Let X ⊆ R^d, K : X × X → R be a symmetric positive semidefinite kernel, m be a positive integer, and z = (u, v) with u = (u₁, ..., u_m) ∈ (R^d)^m, v = (v₁, ..., v_m) ∈ R^m. Then
(i) there exists an argminimum f⁺ of E_z over H_K(X) which satisfies
f⁺ = Σ_{i=1}^m c_i K_{u_i}, where c = (c₁, ..., c_m) = K[u]⁺ v,
and for all f° ∈ argmin(H_K(X), E_z), ‖f⁺‖_K ≤ ‖f°‖_K;
(ii) for all α > 0, there exists a unique argminimum f^α of E_{z,α,K} over H_K(X) which satisfies
f^α = Σ_{i=1}^m c_i^α K_{u_i}, where c^α = (c₁^α, ..., c_m^α) = (K_m[u] + α I_m)^{−1} v;
(iii) lim_{α→0} ‖f^α − f⁺‖_K = 0.

Note that both argminima, f⁺ and f^α, are computable by networks with m kernel units from G_K(X). The argminima differ merely in the coefficients of the linear combinations (output weights) of kernel units with parameters corresponding to the data u₁, ..., u_m. Thus, in the case of theoretically optimal solutions, generalization is achieved merely by modification of output weights.
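Both argminima of Theorem 2 are easy to compute numerically. The sketch below uses an illustrative Gaussian kernel on a few 1-d points; note the solve uses v/m, making the 1/m inside E_z explicit (some formulations absorb this factor into v), so that the regularized weights converge to the pseudoinverse weights as in part (iii).

```python
import numpy as np

def gaussian_gram(u, width=1.0):
    """Gram matrix K[u]_ij = exp(-width * (u_i - u_j)^2) for 1-d inputs u."""
    d = u[:, None] - u[None, :]
    return np.exp(-width * d ** 2)

u = np.linspace(-2.0, 2.0, 5)          # sample inputs u_1, ..., u_m
v = np.sin(3.0 * u)                    # target outputs v_1, ..., v_m
m = len(u)
K = gaussian_gram(u)

c_plus = np.linalg.pinv(K) @ v         # (i): output weights c = K[u]^+ v
diffs = []
for alpha in (1e-1, 1e-3, 1e-5):
    # (ii): regularized output weights (the 1/m matches the 1/m in E_z)
    c_alpha = np.linalg.solve(K / m + alpha * np.eye(m), v / m)
    d = c_alpha - c_plus
    diffs.append(float(np.sqrt(d @ K @ d)))   # RKHS norm ||f^alpha - f^+||_K
print(diffs)                           # decreases toward 0, illustrating (iii)
```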
The following theorem shows that in the space of continuous functions C(X)
on a compact X ⊂ Rd , for any training sample z the empirical error functional
Ez has a large convex set of argminima computable by kernel networks with
varying widths.
Theorem 3. Let X be a compact subset of R^d, K : X × X → R be a convolution kernel such that K(x, y) = k(x − y), where k is even, k ∈ L¹(R^d) ∩ L²(R^d), and k̂(s) > 0 for all s ∈ R^d, m be a positive integer, and z = (u, v) with u = (u₁, ..., u_m) ∈ (R^d)^m, v = (v₁, ..., v_m) ∈ R^m. Then the set of argminima of E_z in C(X) contains the convex hull conv{f_a⁺ | a > 0}, where f_a⁺ = Σ_{i=1}^m c_i^a K^a_{u_i} with c^a = (c₁^a, ..., c_m^a) = K^a[u]⁺ v.
Proof. By Corollary 1(ii), for any a > 0, span G_{K^a}(X) is dense in (C(X), ‖·‖_sup). It is easy to show that E_z is continuous on C(X), that an argminimum of a continuous functional over a dense subset is an argminimum over the whole space, and that a convex combination of argminima is an argminimum. The statement then follows from Theorem 2. □

The next theorem describes the dependence of stabilizers induced by kernels on scaling.
Theorem 4. Let K : R^d × R^d → R be a convolution kernel such that K(x, y) = k(x − y), where k is even, k ∈ L¹(R^d) ∩ L²(R^d), k̂ is nonincreasing, and k̂(s) > 0 for all s ∈ R^d. Then for all 0 < b ≤ a,
(i) H_{K^b}(R^d) ⊆ H_{K^a}(R^d);
(ii) the inclusion J_{b,a} : (H_{K^b}(R^d), ‖·‖_{K^b}) → (H_{K^a}(R^d), ‖·‖_{K^a}) is continuous;
(iii) for all f ∈ H_{K^b}(R^d), ‖f‖_{K^a} ≤ (a/b)^{d/2} ‖f‖_{K^b}.
Proof. It was noticed in [9] and rigorously proven in [22] that for a convolution kernel K induced by a one-variable function k with a positive Fourier transform, ‖f‖²_K = (1/(2π)^{d/2}) ∫_{R^d} f̂(s)²/k̂(s) ds. As for all a > 0, k̂^a(s) = a^{−d} k̂(s/a) [16, p.183], we have ‖f‖²_{K^a} = (a^d/(2π)^{d/2}) ∫_{R^d} f̂(s)² k̂(s/a)^{−1} ds. Hence

‖f‖²_{K^a} / ‖f‖²_{K^b} = (a^d ∫_{R^d} f̂(s)² k̂(s/a)^{−1} ds) / (b^d ∫_{R^d} f̂(s)² k̂(s/b)^{−1} ds).

As k̂ is nonincreasing, b ≤ a implies k̂(s/a)^{−1} ≤ k̂(s/b)^{−1}. Thus ‖f‖_{K^a}/‖f‖_{K^b} ≤ (a/b)^{d/2}. □

Theorem 4 shows that Hilbert spaces induced by "flatter" modifications of a kernel K are embedded in spaces induced by "sharper" modifications of K. Thus we have a nested family {H_{K^a}(R^d) | a > 0} of RKHSs with continuous embeddings. "Sharpening" of a kernel K, realized by replacing it with a scaled kernel K^a, increases the penalty represented by the stabilizer ‖·‖²_K at most by a factor of a^d.
In practical applications, instead of ‖·‖²_K, simpler stabilizers, such as the ℓ_1- or ℓ_2-norm of output weights, have been used [23]. If G_{K^a}(X) is linearly independent (which holds for any strictly positive definite kernel K), each f ∈ span G_{K^a}(X) has a unique representation f = Σ_{i=1}^n w_i^a K^a_{x_i}. So we can define a functional W^a : span G_{K^a}(X) → R by W^a(f) = Σ_{i=1}^n |w_i^a|. When K : R^d × R^d → R is bounded with c_K = sup_{x∈R^d} |K(x, x)|, we have for all a > 0 and all X ⊆ R^d, sup_{x∈X} |K^a(x, x)| ≤ c_K. Thus ‖f‖_{K^a} ≤ Σ_{i=1}^n |w_i^a| ‖K^a_{x_i}‖_{K^a} ≤ W^a(f) c_K^{1/2}. So for any a > 0, a decrease of ℓ_1-norms of output weights of functions computable by networks with units from the dictionary G_{K^a}(X) implies a decrease of ‖·‖_{K^a}-norms.
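The bound ‖f‖_{K^a} ≤ W^a(f) c_K^{1/2} can be checked numerically. The sketch below assumes the Gaussian kernel (so c_K = 1) on scalar centers; the scaling convention and all names are illustrative assumptions.

```python
import numpy as np

def gram(x, a):
    # Gram matrix K_a[x] of the scaled Gaussian kernel exp(-(a (x - y))^2)
    # on scalar centers; the scaling convention is an assumption.
    d = x[:, None] - x[None, :]
    return np.exp(-(a * d) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                 # centers x_1, ..., x_n
w = rng.normal(size=8)                 # output weights w_1^a, ..., w_n^a
for a in (0.5, 1.0, 3.0):
    G = gram(x, a)
    rkhs_norm = np.sqrt(w @ G @ w)     # ||f||_{K^a} for f = sum_i w_i^a K^a_{x_i}
    l1_bound = np.abs(w).sum()         # W^a(f) * c_K^{1/2}, with c_K = 1
    assert rkhs_norm <= l1_bound + 1e-12
```

The inequality holds for every width a, which is exactly why an ℓ_1 penalty on output weights controls the RKHS stabilizer.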
To illustrate our results, consider the Gaussian kernel K(x, y) := e^{−‖x−y‖²}. It was shown in [24] that the set ⋃_{a>0} G_{K^a}(R^d) of Gaussians with all widths and centers is linearly independent. Thus for a ≠ b, span G_{K^a}(R^d) ∩ span G_{K^b}(R^d) = {0}. By Theorem 1 and Corollary 1, all these sets are dense subspaces of L²(R^d) and C(X), respectively. So we have a family of dense subspaces, any two of which intersect only in the zero function, each formed by input-output functions of Gaussian networks with some fixed width. However, by Theorem 4, for 0 < b < a, the whole space H_{K^b}(R^d), and hence also its subset span G_{K^b}(R^d), is contained in the space H_{K^a}(R^d). So every nonzero element of span G_{K^b}(R^d) lies in H_{K^a}(R^d) \ span G_{K^a}(R^d). The RKHSs induced by Gaussians with decreasing widths are nested, but the subsets computable by kernel networks with fixed widths intersect only trivially. In the space of continuous functions on any compact subset of R^d, for any training sample z, the empirical error functional E_z has many argminima formed by functions computable by networks with units of all widths.
Acknowledgments. This work was partially supported by GA ČR grant P202/11/1368 and RVO 67985807.

References
1. Broomhead, D.S., Lowe, D.: Multivariable functional interpolation and adaptive networks. Complex Systems 2, 321–355 (1988)
2. Girosi, F., Poggio, T.: Regularization algorithms for learning that are equivalent
to multilayer networks. Science 247(4945), 978–982 (1990)
3. Cortes, C., Vapnik, V.N.: Support-vector networks. Machine Learning 20, 273–297 (1995)
4. Park, J., Sandberg, I.: Universal approximation using radial–basis–function net-
works. Neural Computation 3, 246–257 (1991)
5. Park, J., Sandberg, I.: Approximation and radial basis function networks. Neural
Computation 5, 305–316 (1993)
6. Kainen, P.C., Kůrková, V., Sanguineti, M.: Complexity of Gaussian radial basis
networks approximating smooth functions. J. of Complexity 25, 63–74 (2009)
7. Gnecco, G., Kůrková, V., Sanguineti, M.: Some comparisons of complexity in
dictionary-based and linear computational models. Neural Networks 24(1), 171–
182 (2011)
8. Gnecco, G., Kůrková, V., Sanguineti, M.: Can dictionary-based computational
models outperform the best linear ones? Neural Networks 24(8), 881–887 (2011)
9. Girosi, F.: An equivalence between sparse approximation and support vector ma-
chines. Neural Computation 10, 1455–1480 (1998) (AI memo 1606)
10. Cucker, F., Smale, S.: On the mathematical foundations of learning. Bulletin of
AMS 39, 1–49 (2002)
11. Poggio, T., Smale, S.: The mathematics of learning: dealing with data. Notices of
AMS 50, 537–544 (2003)
12. Kůrková, V.: Inverse problems in learning from data. In: Kaslik, E., Sivasundaram,
S. (eds.) Recent advances in dynamics and control of neural networks. Cambridge
Scientific Publishers (to appear)
13. Gribonval, R., Vandergheynst, P.: On the exponential convergence of matching
pursuits in quasi-incoherent dictionaries. IEEE Trans. on Information Theory 52,
255–261 (2006)
14. Pietsch, A.: Eigenvalues and s-Numbers. Cambridge University Press, Cambridge
(1987)
15. Mhaskar, H.N.: Versatile Gaussian networks. In: Proceedings of IEEE Workshop
of Nonlinear Image Processing, pp. 70–73 (1995)
16. Rudin, W.: Functional Analysis. McGraw-Hill (1991)
17. Friedman, A.: Modern Analysis. Dover, New York (1982)
18. Schölkopf, B., Smola, A.J.: Learning with Kernels – Support Vector Machines,
Regularization, Optimization and Beyond. MIT Press, Cambridge (2002)
19. Bertero, M.: Linear inverse and ill-posed problems. Advances in Electronics and
Electron Physics 75, 1–120 (1989)
20. Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks ar-
chitectures. Neural Computation 7, 219–269 (1995)
21. Wahba, G.: Splines Models for Observational Data. SIAM, Philadelphia (1990)
22. Loustau, S.: Aggregation of SVM classifiers using Sobolev spaces. Journal of Ma-
chine Learning Research 9, 1559–1582 (2008)
23. Fine, T.L.: Feedforward Neural Network Methodology. Springer, Heidelberg (1999)
24. Kůrková, V., Neruda, R.: Uniqueness of functional representations by Gaussian
basis function networks. In: Proceedings of ICANN 1994, pp. 471–474. Springer,
London (1994)
Multilayer Perceptron for Label Ranking

Geraldina Ribeiro¹, Wouter Duivesteijn², Carlos Soares³, and Arno Knobbe²

¹ Faculdade de Economia, Universidade do Porto, Portugal
  100414012@fep.up.pt
² LIACS, Leiden University, the Netherlands
  {wouterd,knobbe}@liacs.nl
³ INESC TEC, Universidade do Porto, Portugal
  csoares@fep.up.pt

Abstract. Label Ranking problems are receiving increasing attention
in machine learning. The goal is to predict not just a single value from
a finite set of labels, but rather the permutation of that set that applies
to a new example (e.g., the ranking of a set of financial analysts in
terms of the quality of their recommendations). In this paper, we adapt
a multilayer perceptron algorithm for label ranking. We focus on the
adaptation of the Back-Propagation (BP) mechanism. Six approaches
are proposed to estimate the error signal that is propagated by BP. The
methods are discussed and empirically evaluated on a set of benchmark
problems.

Keywords: Label ranking, back-propagation, multilayer perceptron.

1 Introduction
In many real-world applications, assigning a single label to an example is not
enough. For instance, when trading in the stock market based on recommenda-
tions from financial analysts, predicting who is the best analyst does not suffice
because 1) he/she may not make a recommendation in the near future and 2)
we may prefer to take into account recommendations of multiple analysts, to be
on the safe side [1]. Hence, to support this approach, a model should predict a
ranking of analysts rather than suggesting a single one. Such a situation can be
modeled as a Label Ranking (LR) problem: a form of preference learning, aiming
to predict a mapping from examples to rankings of a finite set of labels [2].
Recently, several solutions have been proposed for the label ranking problem [2], including one based on the Multilayer Perceptron algorithm (MLP) [4].
MLP is a type of neural network architecture, which has been applied in a super-
vised learning context using the error back-propagation (BP) learning algorithm.
In this paper, we try a different approach to the simple adaptation proposed ear-
lier [4]. We adapt the BP learning mechanism to LR. More specifically, we inves-
tigate how the error signal explored by BP can use information from the LR loss
function. We introduce six approaches and evaluate their (relative) performance.
We also show some preliminary experimental results that indicate whether our
new method could compete with state-of-the-art LR methods.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 25–32, 2012.
© Springer-Verlag Berlin Heidelberg 2012
The remainder of this paper is organized as follows. Section 2 formalizes the
LR problem and recalls the BP algorithm for neural networks. In Section 3 we
introduce our new adaptation of a multilayer perceptron to solve the LR problem,
and the approaches created to estimate the error signal. The experimental results
are presented in Section 4, and Section 5 concludes this paper.

2 Preliminaries
Throughout this paper, we assume a training set T = {(x_n, π_n)} consisting of t examples x_n and their associated label rankings π_n. Such a ranking is a permutation of a finite set of k labels L = {λ_1, . . . , λ_k}, taken from the permutation space Ω_L. Each example x_n consists of m attributes x_n = {a_1, . . . , a_m} and is taken from the example space X. The position of λ_a in a ranking π_n is denoted by π_n(a) and assumes a value in the set {1, . . . , k}.

2.1 Label Ranking

Given T = {(x_n, π_n)}, the goal in LR is to learn a function f : X → Ω_L that minimizes a given loss function l = (1/t) Σ_{n=1}^t τ(π_n, π̂_n). With this mapping, we are able to predict a ranking π̂_n of the labels in L for a new example x_n. Loss functions in LR are typically based on measures of rank correlation, which assess the similarity between two rankings. One such measure is Kendall's τ coefficient, denoted τ(π̂_n, π_n). The LR error e_τ(n) on the nth training example is defined by e_τ(n) = 1/2 − (1/2) · τ(π̂_n, π_n). The LR error always lies between 0 and 1, where e_τ(n) = 0 means that the network returns a prediction equal to π_n and e_τ(n) = 1 means that the labels in π̂_n are sorted in the reverse order of π_n.
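The LR error can be computed directly from two position vectors. The sketch below assumes the common definition of Kendall's τ as the normalized difference between concordant and discordant label pairs; the paper does not spell out which tie-handling variant it uses, so treat this as illustrative.

```python
from itertools import combinations

def kendall_tau(pi, pi_hat):
    """Kendall's tau between two rankings given as position vectors:
    pi[a] is the position of label a in the ranking."""
    k = len(pi)
    concordant = discordant = 0
    for a, b in combinations(range(k), 2):
        s = (pi[a] - pi[b]) * (pi_hat[a] - pi_hat[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (k * (k - 1) / 2)

def lr_error(pi, pi_hat):
    # e_tau(n) = 1/2 - 1/2 * tau(pi_hat, pi), which always lies in [0, 1]
    return 0.5 - 0.5 * kendall_tau(pi, pi_hat)

assert lr_error([1, 2, 3, 4], [1, 2, 3, 4]) == 0.0   # perfect prediction
assert lr_error([1, 2, 3, 4], [4, 3, 2, 1]) == 1.0   # completely reversed ranking
```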
There are different approaches to solve LR problems. In reduction techniques,
the method is to learn a utility function for each label using the constraint
classification technique [5] or a log-linear model [6]. Ranking by pairwise com-
parisons [2,7] is a well-known method to model rankings as pairwise binary pref-
erences [5]. In probabilistic discriminative methods, the purpose is to estimate a
distribution for the probability of a ranking given an example [2, 8, 9]. Another
approach is adapting a machine learning algorithm based on similarity between
rankings. In [10], an adaptation of association rules was created where the goal
is to discover frequent pairs of attributes associated with a ranking. In [1], an
adaptation of a naive Bayes model is proposed where probabilities are replaced
by the concept of distance between rankings. In [4], different architectures of
an MLP are used to obtain a ranking prediction. In this paper, we propose an
adaptation of an MLP for LR problems based on similarity measures.

2.2 Neural Networks and Back-Propagation
An MLP is a particular form of a Neural Network (NN), a computational model
often used to solve learning problems [11]. It consists of a weighted directed
graph of an interconnected set of neurons organized in separate layers: the input
Adaptation of a Neural Network for Label Ranking 27

layer, the hidden layer(s) and the output layer. Each layer has one or more
neurons. Every neuron i is connected to the j neurons of the next layer by a set
of weighted links denoted by w1i , . . . , wji . At the input layer, {a1 , a2 , . . . , am }
represent m input signals associated with the m attributes. At the hidden and
output layers, each neuron j receives the input signals as a linear combination of the outputs of the previous layer, given by v_j = Σ_{i=0}^m w_{ji} a_i. The linear combinations are transformed
into output signals using an activation function ϕ(vj ). These signals are sent in
a forward direction layer by layer to the output layer which delivers an output
yj for each output neuron j. In classification, each class is associated with an
output neuron and the prediction is typically given by the one with the highest
activation level.
The goal is to define the values of the connection weights that return the outputs with the lowest error, i.e., the output most similar to the desired value,
d(n). One method to learn the weights is BP, which propagates errors in a
backward direction from the output layer to the input layer, updating the weight
connections if an error is detected at the output layer. A weight correction on the
nth training example is defined in terms of the error signals cj (n) for each output
neuron j. Considering a sequential mode in which the weights are updated after
every training example, the predicted output yj (n) is compared with the desired
target dj (n), and the individual error ej (n) is estimated as follows: ej (n) =
dj (n) − yj (n). In a typical NN, the error signal is equal to the individual error,
because the predicted output is directly compared with the target. The correction
is given by Δwji (n) = ηδj (n)yi (n), where η is the learning rate, yi (n) is the
output signal of the previous neuron i, and the local gradient δ_j is defined by δ_j = e_j(n) ϕ′(v_j(n)). For a hidden neuron i, the local gradient is defined in a recursive form by δ_i(n) = ϕ′_i(v_i(n)) Σ_j δ_j(n) w_{ji}(n).
To prevent the MLP learning from getting stuck in a local optimum we use
random-restart hill climbing, by generating new random weights wji ∼ N (0, 1).
For each restart we present every example in the training set to the learning
process a user-defined number of times, called an epoch. The weights associated
with the best performance are returned.
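One sequential BP update following the formulas above can be sketched as follows; the sigmoid activation and the tiny network dimensions are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bp_step(x, d, W1, W2, eta=0.2):
    """One sequential back-propagation update for a 1-hidden-layer MLP.
    x: input signals, d: desired outputs; sigmoid activations assumed."""
    v1 = W1 @ x;  y1 = sigmoid(v1)               # hidden layer forward pass
    v2 = W2 @ y1; y2 = sigmoid(v2)               # output layer forward pass
    e = d - y2                                   # e_j(n) = d_j(n) - y_j(n)
    delta2 = e * y2 * (1.0 - y2)                 # delta_j = e_j * phi'(v_j)
    delta1 = y1 * (1.0 - y1) * (W2.T @ delta2)   # recursive form for hidden neurons
    W2 = W2 + eta * np.outer(delta2, y1)         # delta rule: eta * delta_j * y_i
    W1 = W1 + eta * np.outer(delta1, x)
    return W1, W2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))  # random restart, w ~ N(0, 1)
x, d = np.array([0.5, -0.2]), np.array([1.0, 0.0])
err0 = np.abs(d - sigmoid(W2 @ sigmoid(W1 @ x))).sum()
for _ in range(200):                             # repeated presentations (epochs)
    W1, W2 = bp_step(x, d, W1, W2)
err = np.abs(d - sigmoid(W2 @ sigmoid(W1 @ x))).sum()
assert err < err0                                # training reduces the output error
```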

3 Multilayer Perceptron for Label Ranking

Our adaptation of MLP for LR essentially consists of 1) the method to generate a
ranking from the output layer and 2) the error functions guiding the BP learning
process. The output layer contains k neurons (one for each label). The output
yj of a neuron j at the output layer does not represent a target value or class
but rather the score associated with a label λ_j. By ordering all the scores, we obtain the predicted rank π̂_n(j) of each label λ_j and, thus, the predicted ranking.
The tricky point of adapting an MLP for LR is the weight corrections in
the BP process: minimizing the individual errors does not necessarily lead to
minimizing the LR loss. We propose six approaches to define the error signal
cj at the output layer. The weight connection wji (n) is updated based on the
estimated cj (n) using the delta rule Δwji (n) = ηcj (n)yi (n).
28 G. Ribeiro et al.

Local Approach (LA). The error signal is the individual error of each output
neuron, cj (n) = ej (n) = πn (j) − πˆn (j), as in the original MLP. The LR error,
eτ , is only used to evaluate the activation of the BP.

Global Approach (GA). The error signal is defined in terms of the LR error. In
this case, it is simply given by cj (n) = eτ (n).

Combined Approach (CA). CA is a combination of GA and LA: c_j(n) = e_j(n) e_τ(n). We note that a neuron which returns the correct position π_n(j) = π̂_n(j) (i.e., e_j(n) = 0) is not penalized even if e_τ > 0.

Weight-Based Signed Global Approach (WSGA). The error signal is defined in terms of the LR error and the incoming weight connections of the output layer. We assume that a high LR error means that some weights of neurons are too high and others are too low. The output neurons are ranked according to their average weights w̄_j = (1/q) Σ_{i=0}^q w_{ji}, resulting in a position p_w(j) ∈ [1, . . . , k]. The error of the neurons with a position above the mean is negative and it is positive otherwise:

  c_j(n) = −e_τ(n),  if p_w(j) > ⌊k/2 + 0.5⌋,
  c_j(n) =  e_τ(n),  if p_w(j) < ⌊k/2 + 0.5⌋,      (1)
  c_j(n) =  0,       if p_w(j) = ⌊k/2 + 0.5⌋.

Score-Based Signed Global Approach (SSGA). The motivation for SSGA is the
same as for WSGA. The difference is that we rank the output neuron scores yj
instead of the input weights. The positions of the weights, p_w(j), are replaced in eq. (1) with the positions of the scores, p_s(j).

Individual Weight-Based Signed Global Approach (IWSGA). This assumes that all the weight connections at the output layer are important to define the error signal and are considered independently of the neurons they connect. The error signal, denoted c_{ji}(n), is associated with the weight of the connection between output neuron j and hidden neuron i. This is similar to WSGA, but we rank all weight connections individually, rather than the average weights for each output neuron. The weight corrections are given by Δw_{ji}(n) = η c_{ji}(n) y_i(n), where

  c_{ji}(n) = −e_τ(n),  if p_{gw}(ji) > qk/2,
  c_{ji}(n) =  e_τ(n),  if p_{gw}(ji) ≤ qk/2.
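The simplest three of the six error signals can be sketched directly from their definitions; the example rankings below are hypothetical and e_τ is assumed to be computed beforehand (e.g. from Kendall's τ).

```python
def error_signals(pi, pi_hat, e_tau):
    """Per-output-neuron error signals c_j(n) for LA, GA and CA.
    pi, pi_hat: true and predicted positions of each label."""
    k = len(pi)
    e = [pi[j] - pi_hat[j] for j in range(k)]   # individual errors e_j(n)
    la = e                                      # Local Approach: c_j = e_j
    ga = [e_tau] * k                            # Global Approach: c_j = e_tau
    ca = [e[j] * e_tau for j in range(k)]       # Combined Approach: c_j = e_j * e_tau
    return la, ga, ca

# A label placed correctly (e_j = 0) gets a zero CA signal even when e_tau > 0:
la, ga, ca = error_signals([1, 2, 3], [1, 3, 2], e_tau=0.25)
assert ca[0] == 0.0 and ga == [0.25, 0.25, 0.25]
```

The signed variants (WSGA, SSGA, IWSGA) would additionally rank weights or scores and flip the sign of e_τ according to eq. (1).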

4 Experimental Results
The goal is to compare the performance of the proposed approaches on dif-
ferent datasets. The datasets used for the evaluation are from the KEBI Data
Repository [12] hosted by the Philipps University of Marburg. These datasets,
which are commonly used for LR, are presented in Table 1. Our approach starts
Table 1. Datasets for LR

Dataset     Type  k   m   Instances
Authorship  A     4   70    841
Bodyfat     B     7    7    252
Calhousing  B     4    4  20640
Cpu-small   B     5    6   8192
Elevators   B     9    9  16599
Fried       B     5    9  40768
Glass       A     6    9    214
Housing     B     6    6    506
Iris        A     3    4    150
Pendigits   A    10   16  10992
Segment     A     7   18   2310
Stock       B     5    5    950
Vehicle     A     4   18    846
Vowel       A    11   10    528
Wine        A     3   13    178
Wisconsin   B    16   16    194

Table 2. Experimental results of MLP-LR and their ranks

Dataset      GA (τ, rij)   LA (τ, rij)   CA (τ, rij)   WSGA (τ, rij)   SSGA (τ, rij)   IWSGA (τ, rij)
Authorship 0.291 6 0.889 1 0.829 2 0.307 5 0.528 4 0.548 3
Bodyfat -0.004 5 0.056 2 0.075 1 0.033 3 0.022 4 -0.006 6
Calhousing 0.054 6 0.083 3 0.106 2 0.078 4 0.130 1 0.076 5
Cpu-small 0.109 6 0.295 2 0.357 1 0.176 5 0.293 3 0.181 4
Elevators 0.110 6 0.687 1 0.684 2 0.135 5 0.419 3 0.168 4
Fried -0.002 6 0.532 2 0.660 1 0.157 4 0.446 3 0.133 5
Glass 0.317 5 0.818 1 0.757 2 0.258 6 0.475 4 0.493 3
Housing 0.077 6 0.531 2 0.574 1 0.290 3 0.241 4 0.094 5
Iris 0.178 6 0.911 1 0.800 2 0.609 4 0.693 3 0.351 5
Pendigits 0.161 5 0.694 2 0.752 1 0.122 6 0.314 3 0.257 4
Segment 0.177 6 0.799 2 0.842 1 0.341 4 0.338 5 0.346 3
Stock 0.032 6 0.732 2 0.745 1 0.303 4 0.403 3 0.197 5
Vehicle 0.106 6 0.801 1 0.800 2 0.482 4 0.504 3 0.339 5
Vowel 0.065 6 0.474 2 0.545 1 0.098 5 0.130 3 0.125 4
Wine 0.324 6 0.931 1 0.874 2 0.503 4 0.598 3 0.341 5
Wisconsin 0.007 6 0.221 2 0.235 1 0.066 3 0.060 4 0.028 5
Rj 5.8125 1.6875 1.4375 4.3125 3.3125 4.4375

by normalizing all attributes, and separating the dataset into a training and
a test set. On each dataset we tested the six approaches with h = 3 hidden
neurons, η = 0.2, using 5 epochs with 5 random restarts. The error estimation
methodology is 10-fold cross-validation. The results are presented in terms of the
similarity between the rankings πi and π̂i with the Kendall τ coefficient, which
is equivalent to the error measure described in Section 2.
In Table 2, we show the resulting τ -values for each approach, and associated
rank (lower is better) per dataset. The bottom row shows the average rank for
each approach, which allows us to compare the relative performance of the ap-
proaches using the Friedman test with post-hoc Nemenyi test [13]. The Friedman
test shows that the average ranks are significantly unequal (at α = 1%). Then
the Nemenyi test gives us a critical difference of CD = 2.225 (with α = 1%).
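The average ranks R_j in the bottom row of Table 2, on which the Friedman test operates, are simply the per-dataset ranks averaged over the 16 datasets. For example, for GA and LA (ranks read off Table 2):

```python
# Per-dataset ranks r_ij of GA and LA, copied from Table 2 (16 datasets).
ga_ranks = [6, 5, 6, 6, 6, 6, 5, 6, 6, 5, 6, 6, 6, 6, 6, 6]
la_ranks = [1, 2, 3, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2]

R_ga = sum(ga_ranks) / len(ga_ranks)
R_la = sum(la_ranks) / len(la_ranks)
assert R_ga == 5.8125 and R_la == 1.6875   # matches the bottom row of Table 2
```

Two approaches then differ significantly whenever these averages differ by more than the critical difference CD.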
Fig. 1. Results of Kendall’s τ correlation coefficient: (a) boxplot of the results according to the approaches; (b) results per number of epochs on the Iris dataset
The test implies that for each pair of approaches Ai and Aj , if Ri < Rj − CD,
then Ai is significantly better than Aj . Hence we can see from the table that
approaches LA and CA significantly outperform all other approaches except for
SSGA. However, at α = 10% the critical difference becomes CD = 1.712, so at
this significance level CA significantly outperforms SSGA too.
As we can see from Table 2, not all approaches have a very high τ -value
for all datasets. Notice, however, that these experiments are performed with
a rather arbitrary set of parameters. Varying parameters such as the number
of hidden neurons in the MLP, the number of epochs used when learning the
neural network, and the number of random restarts, could benefit performance.
To illustrate this, Figure 1b displays the variation of τ -values for the different
approaches on the Iris dataset, when varying the number of epochs. As we
can see, we can subtantially improve the results when tweaking the number of
epochs. For some approaches using more epochs is better, but for others this
monotonicity does not hold. We see similar behavior when varying the number
of stages and hidden neurons. Hence, we expect that much better results can be
gained with the new approaches when the parameter space is properly explored
for each dataset, but this is beyond the scope of this paper.
In Table 3, we compare the performance of approaches LA and CA with pub-
lished results of the state-of-the-art algorithms equal width apriori label ranking
(EW), minimum entropy apriori label ranking (ME) [10], constraint classifica-
tion (CC), instance-based label ranking (IBLR) and ranking trees (LRT) [8, 10],
in terms of Kendall’s τ coefficient. Notice that the new methods do not gener-
ally outperform the current state-of-the-art methods, but they do achieve results
that are often of the same magnitude. Since the results for the new approaches
are obtained without any form of parameter optimization, we feel confident that
exploration of the parameter space can yield a competitive algorithm.
Table 3. Comparison of MLP-LR with other methods

Dataset      MLP-LR: CA (τ, rij), LA (τ, rij)   APRIORI-LR: EW (τ, rij), ME (τ, rij)   CC (τ, rij)   IBLR (τ, rij)   LRT (τ, rij)
Authorship 0.829 5 0.889 3 NA NA 0.608 6 0.920 2 0.936 1 0.882 4
Bodyfat 0.074 5 0.056 7 0.161 3 0.059 6 0.281 1 0.248 2 0.117 4
Calhousing 0.106 6 0.083 7 0.139 5 0.291 3 0.250 4 0.351 1 0.324 2
Cpu-small 0.357 5 0.295 6 0.279 7 0.439 4 0.475 2 0.506 1 0.447 3
Elevators 0.684 5 0.687 4 0.623 7 0.643 6 0.768 1 0.733 3 0.760 2
Fried 0.660 6 0.532 7 0.676 5 0.774 4 0.999 1 0.935 2 0.890 3
Glass 0.757 7 0.818 5 0.794 6 0.871 2 0.846 4 0.865 3 0.883 1
Housing 0.574 6 0.531 7 0.577 5 0.758 2 0.660 4 0.745 3 0.797 1
Iris 0.800 7 0.911 4 0.883 5 0.960 2 0.836 6 0.966 1 0.947 3
Pendigits 0.752 4 0.694 5 0.684 6 NA NA 0.903 3 0.944 1 0.935 2
Segment 0.842 4 0.799 6 0.496 7 0.829 5 0.914 3 0.959 1 0.949 2
Stock 0.745 5 0.732 7 0.836 4 0.890 3 0.737 6 0.927 1 0.895 2
Vehicle 0.800 5 0.801 4 0.675 7 0.774 6 0.855 2 0.862 1 0.827 3
Vowel 0.545 6 0.474 7 0.709 3 0.680 4 0.623 5 0.900 1 0.794 2
Wine 0.874 6 0.931 3 0.910 4 0.844 7 0.933 2 0.949 1 0.882 5
Wisconsin 0.235 5 0.221 6 0.280 4 0.031 7 0.629 1 0.506 2 0.343 3

To learn more about our results, we crafted a metalearning dataset from Ta-
bles 1 and 3. We performed a Subgroup Discovery [14, 15] run using the dataset
characteristics from Table 1 as search space, and mined for local patterns wherein
the rank of LA or CA deviates from the average over all datasets. Such a run results in a set of conditions on dataset characteristics, under which our approaches perform unusually well or badly, giving pointers for further research.
The most convincing metasubgroup under which both LA and CA perform
well, is defined by m ≥ 13. Datasets belonging to this subgroup are indicated
by bold blue names in Table 3. When the dataset at hand has relatively many
attributes, our approaches have relatively many input signals in the MLP. Hence
there are many more connections with the hidden layer, and much more interac-
tions between the neurons in the network. Apparently, this increased complexity
of the MLP adds subtlety to its predictions, which allows the MLP-LR method
to induce more accurate representations of the underlying concepts.

5 Conclusions
Empirical results indicate that the two methods that directly incorporate the
individual errors perform significantly better than the methods that focus on the
LR error. However, the best results are obtained by combining both errors (CA).
A comparison with results published for other methods additionally indicates
that our method has the potential to compete with other methods. This holds
even though no parameter tuning was carried out, which is known to be essential
for learning accurate networks. Our method becomes more competitive when the
32 G. Ribeiro et al.

data contains more attributes; this increases the amount of input neurons, and
the MLP-LR predictions benefit from the more complex network. As future
work, apart from parameter tuning we will investigate other ways of combining
the local and global errors and we will investigate how to give more importance
to higher ranks.

Acknowledgments. This research is financially supported by the Netherlands Organisation for Scientific Research (NWO) under project number 612.065.822 (Exceptional Model Mining) and by FCT project Rank! (PTDC/EIA/81178/2006).

References
1. Aiguzhinov, A., Soares, C., Serra, A.P.: A Similarity-Based Adaptation of Naive
Bayes for Label Ranking: Application to the Metalearning Problem of Algorithm
Recommendation. In: Discovery Science (2010)
2. Vembu, S., Gärtner, T.: Label Ranking Algorithms: A Survey. In: Fürnkranz, J.,
Hüllermeier, E. (eds.) Preference Learning. Springer (2010)
3. Hüllermeier, E., Fürnkranz, J.: On loss functions in label ranking and risk mini-
mization by pairwise learning. JCSS 76(1), 49–62 (2010)
4. Kanda, J., Carvalho, A.C.P.L.F., Hruschka, E.R., Soares, C.: Using Meta-learning
to Classify Traveling Salesman Problems. In: SBRN (2010)
5. Brinker, K., Hüllermeier, E.: Label Ranking in Case-Based Reasoning. In: Weber,
R.O., Richter, M.M. (eds.) ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 77–91.
Springer, Heidelberg (2007)
6. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. In:
Advances in Neural Information Processing Systems (2003)
7. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning
pairwise preferences. Artif. Intell., 1897–1916 (2008)
8. Cheng, W., Dembczynski, K., Hüllermeier, E.: Label Ranking Methods based on
the Plackett-Luce Model. In: ICML (2010)
9. Cheng, W., Huhn, J.C., Hüllermeier, E.: Decision tree and instance-based learning
for label ranking. In: ICML (2009)
10. de Sá, C.R., Soares, C., Jorge, A.M., Azevedo, P., Costa, J.: Mining Association
Rules for Label Ranking. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD
2011, Part II. LNCS, vol. 6635, pp. 432–443. Springer, Heidelberg (2011)
11. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall (1998)
12. KEBI Data Repository,
http://www.uni-marburg.de/fb12/kebi/research/repository
13. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research 7, 1–30 (2006)
14. Klösgen, W.: Subgroup Discovery. In: Handbook of Data Mining and Knowledge
Discovery, ch. 16.3. Oxford University Press, New York (2002)
15. Pieters, B.F.I., Knobbe, A., Džeroski, S.: Subgroup Discovery in Ranked Data, with
an Application to Gene Set Enrichment. In: Proc. Preference Learning Workshop
(PL 2010) at ECML PKDD (2010)
Electricity Load Forecasting:
A Weekday-Based Approach

Irena Koprinska¹, Mashud Rana¹, and Vassilios G. Agelidis²

¹ School of Information Technologies, University of Sydney, Sydney, Australia
  {irena,mashud}@it.usyd.edu.au
² Australian Energy Research Institute, University of New South Wales, Sydney, Australia
  vassilios.agelidis@unsw.edu.au

Abstract. We present a new approach for building weekday-based prediction
models for electricity load forecasting. The key idea is to conduct a local
feature selection using autocorrelation analysis for each day of the week and
build a separate prediction model using linear regression and backpropagation
neural networks. We used two years of 5-minute electricity load data for the
state of New South Wales in Australia to evaluate performance. Our results
showed that the weekday-based local prediction model, when used with linear
regression, obtained a small and statistically significant increase in accuracy in
comparison with the global (one for all days) prediction model. Both models,
local and global, when used with linear regression were accurate and fast to
train and are suitable for practical applications.

Keywords: electricity load forecasting, autocorrelation analysis, linear
regression, backpropagation neural networks, weekday-based prediction model.

1 Introduction

Electricity load forecasting is the task of predicting the electricity load (demand)
based on previous electricity loads and other variables such as weather conditions. It
is important for the management of power systems, including daily decision making,
dispatching of generators, setting the minimum reserve and planning maintenance. In
this paper we focus on 5-minute-ahead prediction from previous 5-minute electricity
load data. This is an example of 1-step-ahead and very short-term prediction, and is
especially useful in competitive electricity markets, to help the market operator and
participants in their transactions. The overall goal is to ensure reliable operation of the
electricity system while minimizing the costs.
There are two main groups of approaches for electricity load prediction: the
traditional statistical approaches, which are linear and model-based, such as
exponential smoothing and autoregressive integrated moving average, and the more
recent machine learning approaches, with neural network-based approaches being
most popular. Taylor’s work [1-3] is the most prominent example of the first group.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 33–41, 2012.
© Springer-Verlag Berlin Heidelberg 2012
He studied a number of methods for very short-term and short-term prediction based
on exponential smoothing and autoregressive integrated moving average using British
and French data, and found that the best method for 5-minute-ahead prediction was double seasonal Holt-Winters exponential smoothing. Notable examples of the
second group are [4, 5] that used Backpropagation Neural Networks (BPNN) and [6-
8] that also used other prediction algorithms such as support vector machines and
Linear Regression (LR). For example, in [5] Shamsollahi et al. used a BPNN for 5-
minute ahead load forecasting. The data was preprocessed by applying a logarithmic
differencing of the consecutive loads; the BPNN’s architecture consisted of one
hidden layer and one output node; the stopping criterion was based on the use of a validation set. They obtained excellent accuracy and their method was integrated
into the New England energy market system.
Most of the previous work has focused on building global models. Exceptions are
[7] where local models for each season were created and the wavelet-based
approaches [9, 10] where the load was decomposed into different frequency
components and a local model was built for each of them. In this paper we consider
another type of local prediction models - models based on the day of the week.
The key idea is to exploit the differences in the load profiles for the different days
of the week, e.g. it is well known that the load during the weekend is smaller than the
load during the working days. If a global model is built, it will treat all days in the
same way and will capture an average of the previous dependencies for all days. For
example, in [6, 7] we found that one of the most important predictors is the load from
the previous day at the same time as the prediction time. It is an important predictor,
on average, if a single prediction model is built for all days of the week. However,
this predictor is not equally important for all days of the week. For example, the load
profile on Monday is more likely to be similar to the load profile on the previous
Friday, not the previous day (Sunday). Similarly, the load profile on Saturday is more
likely to be similar to the load profile on the previous Saturday and Sunday, not the
previous day (Friday).
The key contributions of this paper are:
1) We propose a new approach for building local weekday-based model using
autocorrelation analysis. It is a generic approach and can be applied to other time
series and local components, not only to electricity load data and day of the week.
2) We compare the performance of the local model with the performance of a
global model, i.e. one single model for all days of the week. We conduct a
comprehensive evaluation using two years of Australian electricity data.

2 Load Forecasting Approach


Our forecasting is based only on previous load values. We do not consider weather
variables such as temperature and humidity as they were shown not to affect very
short-term load forecasting such as 5-minute ahead forecasting [1].
We follow the main idea of our approach [6] that uses autocorrelation feature
selection and machine learning prediction algorithms. In contrast to [6], we do not
apply additional feature selection after the autocorrelation selection and we also select
Electricity Load Forecasting: A Weekday-Based Approach 35

the autocorrelation features in a slightly different way (different number of peaks and
neighborhood size). Our approach consists of two main steps: 1) selecting features
using autocorrelation analysis (local and global selection) and 2) building prediction
models (local and global) using the LR and BPNN algorithms.
The autocorrelation function shows the correlation of a time series with itself at
different time lags. It is used to investigate the cyclic nature of a time series and is
appropriate for electricity load data as there are well defined daily and weekly cycles.
The first graph in Fig. 1 (“global”) shows the autocorrelation function of the
electricity load in 2006 for the state of New South Wales (NSW) in Australia. Values
close to 1 or -1 (i.e. peaks) indicate high positive or negative autocorrelation and
values close to 0 indicate lack of autocorrelation. The data is highly correlated; the
strongest dependence is at lag 1 (i.e. values that are 1 lag apart), the second strongest
dependence is at lag 2016 (i.e. values that are exactly 1 week apart) and so on.

global Monday

Friday Saturday

Fig. 1. Autocorrelation function for the global model and the local (weekday) models for
Monday, Friday and Saturday

To form a feature set we extract load variables from the seven highest peaks and
their neighbourhoods. The number of peaks and the size of the neighbourhoods are
selected empirically. As the higher peaks indicate stronger dependence and more
informative variables, we extract more variables from them than from the lower
peaks. More specifically, we extracted the following 37 variables:
• from peak 1 (the highest peak): the peak and the 10 lags before it; note that there
are no lags after it (11 features).
• from peaks 2 and 3: the peak and the three lags before and after it (7 features each)
• from peaks 4-7: the peak and the surrounding 1 lag before and after it (3 features
each).
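For illustration, this feature-extraction step can be sketched in Python with NumPy. The function names are ours, and the peak search below simply takes the seven largest autocorrelation values, whereas the paper selects peaks and neighbourhood sizes empirically:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of series x for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = float(np.dot(x, x))
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, max_lag + 1)])

def select_lag_features(acf, n_peaks=7, widths=(10, 3, 3, 1, 1, 1, 1)):
    """Pick the n_peaks highest-autocorrelation lags and surround each with
    widths[i] neighbouring lags; for the first peak only earlier lags are
    taken, mirroring the 37-feature scheme described in the text."""
    order = np.argsort(acf)[::-1][:n_peaks] + 1   # lags are 1-based
    lags = set()
    for i, (lag, w) in enumerate(zip(order, widths)):
        if i == 0:
            lags.update(range(lag, lag + w + 1))  # peak and w lags before it
        else:
            lags.update(range(max(1, lag - w), lag + w + 1))
    return sorted(lags)
```

With peaks at lags 1, 2016, 288, and so on, the default widths reproduce the 11 + 7 + 7 + 3 + 3 + 3 + 3 = 37 selected lags.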
36 I. Koprinska, M. Rana, and V.G. Agelidis

Using the selected features we build prediction models that learn from the training
data. As prediction algorithms we used LR and BPNN; BPNN is the most popular
algorithm for load forecasting and LR is the algorithm we found to work best in
previous work [6, 7]. LR assumes linear decision boundary and uses the least squares
method to find the line of best fit. We used a variant of stepwise regression with
backward elimination based on the M5 method. BPNN is a classical neural network
trained with the backpropagation algorithm and capable of producing complex non-
linear decision boundaries. We used 1 hidden layer; to tune the BPNN parameters we
experimented with different number of hidden nodes, learning rate, momentum and
maximum number of epochs and report the best results that we obtained.

2.1 Global Prediction Model


We build one model that is used to predict the load for all days of the week. The
autocorrelation analysis and feature selection are conducted for all weekdays together.
The first graph in Fig. 1 shows the autocorrelation function for this model; the first
rows in Table 1 and 2 show the location of the seven highest peaks and the extracted
features, respectively.

2.2 Local (Weekday) Prediction Models

We build one model for each day of the week, e.g. one for Monday, one for Tuesday
and so on. It is used to predict the load only for this day of the week. The
autocorrelation analysis and feature selection are conducted separately for each
weekday. Fig. 1 shows the autocorrelation function for three of the days (Monday,
Friday and Saturday). Table 1 shows the location of the highest peaks and Table 2
shows the extracted features for each of the local models.

Table 1. The seven highest autocorrelation peaks for the global and local models
Peak number 1 2 3 4 5 6 7
global 1 2016 288 4032 6048 1728 2304
same day 1 week 1 day 2 weeks 3 weeks 6 days 8 days
local 1 2016 4032 6048 864 1152 1440
Monday same day 1 week 2 weeks 3 weeks 3 days 4 days 5 days
local 1 2016 4032 6048 288 1152 1440
Tuesday same day 1 week 2 weeks 3 weeks 1 day 4 days 5 days
local 1 2016 4032 6048 288 576 1440
Wednesday same day 1 week 2 weeks 3 weeks 1 day 2 days 5 days
local 1 2016 4032 6048 288 576 864
Thursday same day 1 week 2 weeks 3 weeks 1 day 2 days 3 days
local 1 2016 4032 6048 288 576 864
Friday same day 1 week 2 weeks 3 weeks 1 day 2 days 3 days
local 1 2016 4032 6048 288 1728 8064
Saturday same day 1 week 2 weeks 3 weeks 1 day 6 days 4 weeks
local 1 2016 4032 6048 288 8064 2304
Sunday same day 1 week 2 weeks 3 weeks 1 day 4 weeks 8 days

2.3 Comparison of the Features Selected in the Global and Local Models
The two strongest dependencies are the same for all prediction models – at the same
day and 1 week before. However, there are considerable differences in the remaining
5 strongest dependencies. For the global model, they are (in decreasing order) at 1
day, 2 weeks, 3 weeks, 6 days and 8 days. Hence, there is a mixture between daily
and weekly dependencies as the global model captures the dependencies for all days
of the week and represents an overall average dependence measure.
In contrast, for the local models, the weekly dependencies are stronger than the
daily, with all of them having 2 weeks and 3 weeks as the third and fourth highest
dependencies, followed by mainly daily dependencies for peaks 5-7. The daily
dependencies for the different weekdays are as expected. For example, the load for
Monday correlates with other working days – the previous Friday, Thursday and
Wednesday and not with the previous Sunday which is the third strongest predictor in
the global model. The load for Sunday correlates with other weekend days – the
Saturday before, Sunday 4 weeks ago and Saturday 1 week ago. The load for Tuesday
correlates with the other workdays - the previous Monday, Friday and Thursday.
Hence, the features extracted by the local models are meaningful and represent
better the load dependencies for the respective day of the week than the global model
which averages these dependencies for all days of the week.

Table 2. Selected features to predict the load Xt+1 for the global and local models
model Selected features to predict Xt+1:
global Xt-10 to Xt, XWt-3 to XWt+3, XDt-3 to XDt+3, XW2t-1 to XW2t+1,
XW3t-1 to XW3t+1, XD6t-3 to XD6t+3, XD8t-3 to XD8t+3
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Monday XD3t-1 to XD3t+1, XD4t-1 to XD4t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Tuesday XDt-1 to XDt+1, XD4t-1 to XD4t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Wednesday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD5t-1 to XD5t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Thursday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD3t-1 to XD3t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Friday XDt-1 to XDt+1, XD2t-1 to XD2t+1, XD3t-1 to XD3t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Saturday XDt-1 to XDt+1, XD6t-1 to XD6t+1, XW4t-1 to XW4t+1
local Xt-10 to Xt, XWt-3 to XWt+3, XW2t-3 to XW2t+3, XW3t-1 to XW3t+1,
Sunday XDt-1 to XDt+1, XW4t-1 to XW4t+1, XD8t-1 to XD8t+1
where: Xt – load on forecast day at time t; XDt, XD2t, XD3t, XD4t, XD5t, XD6t, XD8t –
load 1, 2, 3, 4, 5, 6 and 8 days before the forecast day at time t, XWt, XW2t, XW3t,
XW4t - load 1, 2, 3 and 4 weeks before the forecast day at time t.

3 Data and Performance Measures


We used electricity load data for 2006 and 2007 for the state of NSW provided by the
Australian electricity market operator [11]. The 2006 data was used as training data

(99,067 instances) and the 2007 data was used as testing data (105,119 instances).
To measure the predictive accuracy, we used the Mean Absolute Error (MAE) and
the Mean Absolute Percentage Error (MAPE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|L\_actual_i - L\_forecast_i\right| , \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left|L\_actual_i - L\_forecast_i\right|}{L\_actual_i}\ [\%]$$

where L_actuali and L_forecasti are the actual and forecasted load at the 5-minute lag
i and n is the total number of predicted loads. MAE is a standard metric used by the
research community and MAPE is widely used by the industry forecasters.
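A minimal NumPy sketch of the two error measures (function names are illustrative):

```python
import numpy as np

def mae(actual, forecast):
    """Mean Absolute Error in the units of the load (MW here)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast)))

def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(100.0 * np.mean(np.abs(actual - forecast) / actual))
```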
The performance of our models was compared with four naïve baselines where the
predicted load value was: 1) the mean of the class variable in the training data (Bmean),
2) the load from the previous lag (Bplag), 3) the load from the previous day at the same
time (Bpday) and 4) the load from the previous week at the same time (Bpweek).

4 Results and Discussion

Table 3 shows the accuracy results of the global model and the local weekday model.
In the global case, one prediction model is built and is then used to predict the load
for all examples in the test data. In the local case, seven prediction models are built,
one for each day of the week. Each model is used to predict the load only for the test
examples from the respective day (e.g. the model for Monday predicts the Mondays in
the test set) and the reported result is the average of these predictions for the test data.

Table 3. Accuracy of the global and local prediction models, and the baselines
global local baselines
LR BPNN LR BPNN Bmean Bplag Bpday Bpweek
MAE [MW] 25.07 25.06 24.73 25.47 1159.42 41.24 453.88 451.03
MAPE [%] 0.286 0.286 0.282 0.291 13.484 0.473 5.046 4.940

The best model was the local LR achieving MAPE=0.282%. It was slightly more
accurate than the global LR model (MAPE=0.286%) and this difference was
statistically significant at p<0.001 (Wilcoxon rank sum test). The local BPNN was
slightly less accurate than the global BPNN model (MAPE=0.291% vs. 0.286%) and
again this difference was statistically significant at p<0.001. The only pair-wise
accuracy difference that was not statistically significant was between LR global and
BPNN global. All global and local models considerably outperformed all baselines.
The computational complexity of the local model is higher than that of the global
model, as it requires training seven models instead of one. As the training is done
offline, this is not a problem for LR, which is very fast to train (approx. 10 sec), but
it is prohibitive for BPNN, which requires hours.
A closer examination of the selected features shows that all prediction models have
65% common and 35% different features. Specifically, there are 24 common features:

11 from the same day, 7 from 1 week ago, 3 from 2 weeks ago and 3 from 3 weeks
ago. The results show that the 35% different features, with the prediction algorithms
used, are not sufficient to make a big difference in the predictive accuracy.
In order to understand the differences between the global and local model for each
day of the week and also to investigate if there are days of the week associated with
higher and lower accuracy, we calculated the accuracy separately for each day of the
week, see Table 4. A comparison of the global and local models shows that their
accuracies are very similar for all days of the week, for both LR and BPNN. A
comparison along the days of the week shows that it was easiest to predict the load on
Wednesday, followed by Thursday, then Sunday, Saturday and Friday, then Monday
and it was most difficult to predict the load on Tuesday. The difference between
Tuesday and Wednesday requires further investigation; it might be due to public
holidays or random disturbances from large loads.
In the global model, LR and BPNN were equally accurate. In the local model, LR
was more accurate than BPNN. LR has another advantage over BPNN – it is much
faster to train (seconds versus hours). These results are consistent with our previous
work [6, 7] showing that LR is more accurate and faster than BPNN and thus more
suitable for load prediction. There is a debate in the literature [12-13] about the need
to use nonlinear prediction models such as BPNN for forecasting electricity load data.

Table 4. Accuracy of the global and local prediction models for each weekday using LR and
BPNN. In bold is the best result for each day for each prediction algorithm.
LR global
Mon Tue Wed Thu Fri Sat Sun
MAE [MW] 26.91 27.20 20.17 23.85 26.11 25.66 25.54
MAPE [%] 0.304 0.302 0.225 0.261 0.288 0.306 0.314
local
Mon Tue Wed Thu Fri Sat Sun
MAE [MW] 25.71 27.35 20.02 23.59 25.54 25.45 25.45
MAPE [%] 0.291 0.303 0.223 0.258 0.282 0.303 0.314
BPNN global
Mon Tue Wed Thu Fri Sat Sun
MAE [MW] 26.70 27.22 20.34 23.90 25.98 25.72 25.53
MAPE [%] 0.301 0.302 0.227 0.261 0.287 0.307 0.315
local
Mon Tue Wed Thu Fri Sat Sun
MAE [MW] 27.19 27.92 20.81 23.66 25.93 25.88 26.87
MAPE [%] 0.307 0.310 0.232 0.259 0.286 0.307 0.334

In summary, the performance of the global model and the local weekday model
was very similar. There was a small and statistically significant gain in accuracy when
building local (versus global) model and using LR as a prediction algorithm. LR was
also very fast to train and predict new data, so both local and global prediction models
with LR are suitable for practical applications.

5 Conclusions

We considered the task of predicting the electricity load every 5 minutes from
previous 5-minute loads. We presented a new approach for building weekday-based
prediction models using local autocorrelation feature selection and machine learning
algorithms (LR and BPNN). We compared the performance of the local weekday
model with a global (non-weekday dependent) model using two full years of
Australian electricity data. We found that the performance of the local and global
models was comparable. The local model when used with LR obtained a small and
statistically significant gain in accuracy over the global model and also achieved the
highest overall accuracy of MAPE=0.282%. Both models, global and local, with LR
are accurate and fast to train and are thus suitable for practical applications. Our
approach for building local models using autocorrelation analysis is not limited to
electricity load data and day of the week; it is a generic approach that can be applied
to other time series and local characteristics. Future work will include learning
weights for the individual features and investigating the effectiveness of local
prediction models for holidays and irregular days.

Acknowledgement. Mashud Rana was supported by an Endeavour postgraduate
award and a NICTA scholarship.

References
1. Taylor, J.W.: An evaluation of methods for very short-term load forecasting using minute-
by-minute British data. International Journal of Forecasting 24, 645–658 (2008)
2. Taylor, J.W.: Short-term electricity demand forecasting using double seasonal exponential
smoothing. Journal of the Operational Research Society 54, 799–805 (2003)
3. Taylor, J.W.: Triple seasonal methods for short-term electricity demand forecasting.
European Journal of Operational Research 204, 139–152 (2010)
4. Charytoniuk, W., Chen, M.-S.: Very short-term load forecasting using artificial neural
networks. IEEE Transactions on Power Systems 15, 263–268 (2000)
5. Shamsollahi, P., Cheung, K.W., Chen, Q., Germain, E.H.: A neural network based VSTLF
for the interim ISO New England electricity market system. In: 22nd IEEE PICA Conf.,
pp. 217–222 (2001)
6. Sood, R., Koprinska, I., Agelidis, V.: Electricity load forecasting based on autocorrelation
analysis. In: International Joint Conference on Neural Networks (IJCNN), Barcelona, pp.
1772–1779 (2010)
7. Koprinska, I., Rana, M., Agelidis, V.: Yearly and Seasonal Models for Electricity Load
Forecasting. In: International Joint Conference on Neural Networks (IJCNN 2011), USA
(2011)
8. Chen, B.J., Chang, M.-W., Lin, C.-J.: Load forecasting using support vector machines: a
study on EUNITE competition 2001. IEEE Transactions on Power Systems 19, 1821–1830
(2004)
9. Reis, A.J.R., Alves da Silva, A.P.: Feature Extraction via Multiresolution Analysis
for Short-Term Load Forecasting. IEEE Transactions on Power Systems 20, 189–198
(2005)

10. Chen, Y., Luh, P.B., Guan, C., Zhao, Y., Michel, L.D., Coolbeth, M.A.: Short-term load
forecasting: similar day-based wavelet neural network. IEEE Trans. on Power
Systems 25, 322–330 (2010)
11. Australian Energy Market Operator (AEMO), http://www.aemo.com.au
12. Hippert, H.S., Pedreira, C.E., Souza, R.C.: Neural Networks for short-term load
forecasting: a review and evaluation. IEEE Transactions on Power Systems 16(1), 44–55
(2001)
13. Darbellay, G.A., Slama, M.: Forecasting the short-term demand for electricity - Do neural
networks stand a better chance? International Journal of Forecasting 16, 71–83 (2000)
Adaptive Exploration Using Stochastic Neurons

Michel Tokic¹,² and Günther Palm¹

¹ Institute of Neural Information Processing, University of Ulm, Germany
² Institute of Applied Research, University of Applied Sciences Ravensburg-Weingarten, Germany

Abstract. Stochastic neurons are deployed for the efficient adaptation of
exploration parameters by gradient-following algorithms. The approach
is evaluated in model-free temporal-difference learning using discrete ac-
tions. A particular advantage is memory efficiency, because memorizing
exploratory data is required only for starting states. Hence, if a learning
problem consists of only one starting state, the exploratory data can be
considered global. Results suggest that the presented approach can be
efficiently combined with standard off- and on-policy algorithms such as
Q-learning and Sarsa.

Keywords: reinforcement learning, exploration/exploitation.

1 Introduction
One of the most challenging tasks in reinforcement learning (RL) is balancing
the amount of exploration and exploitation [1]. If the behavior of an agent is
too exploratory, the outcome of randomly selected bad actions can prevent it from
maximizing short-term reward. In contrast, if an agent is too exploitative, the se-
lection of only sub-optimal actions prevents it from maximizing long-term reward,
because the outcome of the true optimal actions is underestimated. Consequently,
the optimal balance is somewhere in between, dependent on many parameters
such as the learning rate, discounting factor, learning progress, and of course on
the learning problem itself.
Many different approaches exist for trading off exploration and exploitation.
Based on a single exploration parameter, some basic policies select random ac-
tions either equally distributed (ε-Greedy) or value sensitively (Softmax) [1], or
by a combination of both [2], with the advantage of not requiring to memorize
any exploratory data. In contrast, approaches utilizing counters in every state
(exploration bonuses) direct the exploration process towards finding the optimal
policy in polynomial time under certain circumstances [3, 4]. Nevertheless, basic
policies can be effective having a proper exploration parameter configured, which
has been successfully shown for example in board games with huge discrete state
spaces like Othello [5] or English Draughts [6]. For such state spaces, utility func-
tions are hard to approximate, and conducting experiments for determining a
proper exploration parameter can be time consuming. A non-convergent counter
function is even harder to approximate than a convergent utility function [7]. In-
terestingly, Daw et al. revealed in biologically-motivated studies on exploratory

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 42–49, 2012.
© Springer-Verlag Berlin Heidelberg 2012

decisions in humans that there is "[...] no evidence to justify the introduction of
an extra parameter that allowed exploration to be directed towards uncertainty
(softmax with an uncertainty bonus): at optimal fit, the bonus was negligible,
making the model equivalent to the simpler softmax" [8]. However, the search for
an appropriate exploration parameter for such a policy remains.
In the following, a stochastic neuron model [9] is deployed for adapting the
exploration parameter of basic exploration policies instead of tuning such param-
eter by hand in advance. We evaluate the algorithm in discrete and continuous
state spaces using variants of the cliff-walking and mountain-car problems.

2 Methodology
The learning problems considered in this paper can be described as Markovian
Decision Processes (MDP) [1], which basically consist of a set of states, S, and a
set of possible actions within each state, A(s) ⊆ A, ∀s ∈ S. A stochastic transi-
tion function P(s, a, s′) describes the (stochastic) behavior of the environment,
i.e. the probability of reaching successor state s′ after selecting action a ∈ A(s)
in state s. The selection of an action is rewarded by a numerical signal from
the environment, r ∈ R, used for evaluating the utility of the selected action.
The goal of an agent is finding an optimal policy, π ∗ : S → A, maximizing the
cumulative reward. In the following, it is allowed for S to be continuous, but
assumed that A is a finite set of actions. Action-selection decisions are taken at
regular time steps, t ∈ {1, 2, . . . , T }, until a maximum number of T actions is
exceeded or a terminal state is reached.

2.1 Learning Algorithms


In reinforcement learning, behavior can be derived from a utility function esti-
mating the knowledge learned so far as the expected discounted future reward,
$Q(s,a) = E\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right\}$, for taking action a in state s [1]. If
no model of the environment is present, utility functions must be sampled from
observations of the interaction between the agent and its environment, which is
also known as temporal-difference learning. Two commonly used learning algo-
rithms are Sarsa [7] and Q-learning [10], where the technical difference between
both algorithms is the kind of successor-state information used for evaluating
action at in state st . Sarsa includes the discounted value of the actual selected
action in the successor state, for which reason it is classified being an on-policy
algorithm. In contrast, Q-learning includes the discounted value of the estimated
optimal action in the successor state for which reason it is classified being an
off-policy algorithm. On-policy algorithms have the advantage of including in Q
the respective costs of stochastic action-selection policies, but in turn have no
convergence guarantee, except when the policy exhibits greedy behavior¹ [1].
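The technical difference between the two update rules can be made explicit in a short sketch (the dictionary-based Q-table and argument names are our choices; for a terminal successor state the bootstrap term would be dropped):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.2, gamma=1.0):
    # Off-policy target: reward plus discounted value of the *estimated
    # optimal* action in the successor state s2
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.2, gamma=1.0):
    # On-policy target: reward plus discounted value of the action a2
    # *actually selected* in s2 by the (possibly exploratory) policy
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```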
For sampling Q accurately, a proper trade-off between exploration and ex-
ploitation is required. Basic exploration policies such as ε-Greedy [10] select a
¹ i.e. taking in state s an action having the highest estimated utility Q(s, a).

certain amount of random actions with regard to an exploration rate ε. A disad-


vantage of such policy is that exploration actions are selected equally distributed
among all possible actions, which might cause the income of high negative re-
wards from several bad actions, even if their true utility is correctly estimated.
For this, another basic policy is selecting actions according to their weighting in
a Boltzmann distribution (the Softmax policy), which also takes so far estimated
utility into account [1]. A known problem of Softmax is that it [...] has large prob-
lems of focusing on the best actions while still being able to sometimes deviate
from them [2]. For this, Wiering proposed to combine Softmax with ε-Greedy
into the Max-Boltzmann Exploration (MBE) rule [2], which selects exploration
actions according to Softmax instead of being equally distributed.
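For illustration, the three basic policies can be sketched as follows (a minimal sketch; the function names and the NumPy random interface are our choices):

```python
import numpy as np

def softmax_probs(q, tau=1.0):
    """Boltzmann distribution over actions at temperature tau."""
    z = np.asarray(q, float) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def egreedy_action(q, eps, rng):
    """epsilon-Greedy: explore uniformly with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def mbe_action(q, eps, tau, rng):
    """Max-Boltzmann Exploration: explore with probability eps, but draw
    the exploratory action from Softmax instead of uniformly."""
    if rng.random() < eps:
        return int(rng.choice(len(q), p=softmax_probs(q, tau)))
    return int(np.argmax(q))
```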
A drawback of the above-mentioned basic exploration policies is that an explo-
ration parameter (ε or τ, or both for MBE) needs to be found. Such a parameter
varies depending on the learning problem, and typically an exploratory behavior
is desired at the beginning of the learning process, when everything is unknown
(in model-free RL). For this reason, some applications make use of a decreasing ex-
ploration rate [11], which is, however, known to be inefficient for non-stationary
environment responses. Other approaches such as the VDBE-Softmax policy
dynamically adapt a state-dependent exploration rate, ε(s), based on learning
progress measured as fluctuations in Q [12].

3 Exploration Control Using Stochastic Neurons


Finding a near-optimal exploration parameter by trial and error can be a very
time-consuming task, and it is therefore desirable to have algorithms that adapt
this parameter based on current operating conditions. The proposed idea
is adapting the parameter, ae , of an action-selection policy, π(ae , ·, ·), towards
improving the future outcome of π with regard to a performance measure ρ. For
maximizing ρ in the future, we deploy Williams’ “REINFORCE with multipa-
rameter distributions” algorithm using a stochastic neuron model [9]. The input
to such neuron is a weighted parameter vector θ, from which the neuron deter-
mines an adaptable stochastic scalar as its output, i.e. reinforcement learning in
continuous action spaces. However, our investigated domain consists of discrete
actions, thus the algorithm is applied for adapting the continuous-valued explo-
ration parameter, i.e. REINFORCE Exploration-Parameter Control (REC). For
example, if π(ae , ·, ·) is an ε-Greedy policy, the exploration parameter ae refers
to the exploration rate ε, i.e. ae ≡ ε.
Assuming one starting state, the exploration parameter ae is drawn at the
beginning of an episode (being valid over the whole episode) from a Gaussian
distribution, ae ∼ N (μ, σ), whose density function is given by
$$g(a_e,\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(a_e-\mu)^2/2\sigma^2} \, . \quad (1)$$
Let θ denote the vector of adaptable parameters consisting of

$$\theta = \begin{pmatrix} \mu \\ \sigma \end{pmatrix} . \quad (2)$$

At the end of episode i, the components of θ are adapted towards the gradient
with regard to the outcome ρ of the current episode

θi+1 ≈ θi + α∇θ ρ . (3)

For improving the future performance of π(ae , ·, ·), the policies outcome is mea-
sured as the cumulative reward in the current episode

ρ = E{r1 + r2 + · · · + rT |π(ae , ·, ·)} . (4)

Next, the characteristic eligibility of each component of θ is estimated by

$$\frac{\partial \ln g(a_e,\mu,\sigma)}{\partial \mu} = \frac{a_e-\mu}{\sigma^2} \quad (5)$$

$$\frac{\partial \ln g(a_e,\mu,\sigma)}{\partial \sigma} = \frac{(a_e-\mu)^2-\sigma^2}{\sigma^3} \, , \quad (6)$$
and a reasonable algorithm for adapting μ and σ has the following form
$$\Delta\mu = \alpha_R\,(\rho-\bar{\rho})\,\frac{a_e-\mu}{\sigma^2} \quad (7)$$

$$\Delta\sigma = \alpha_R\,(\rho-\bar{\rho})\,\frac{(a_e-\mu)^2-\sigma^2}{\sigma^3} \, . \quad (8)$$
The learning rate $\alpha_R$ has to be chosen appropriately, e.g. as a small positive
constant, $\alpha_R = \alpha\sigma^2$ [9]. The baseline ρ̄ is adapted by a simple reinforcement-
comparison scheme
ρ̄ = ρ̄ + α(ρ − ρ̄) . (9)
Analytically, in Eqn. 7 the mean μ is shifted towards ae in case of ρ ≥ ρ̄. On the
contrary, μ is shifted in the opposite direction if ρ is less than ρ̄. Similarly,
in Eqn. 8 the standard deviation σ is adapted in a way that the occurrence of ae
is increased if ρ ≥ ρ̄, and decreased otherwise (see proof in [9]). In simple words,
the standard deviation controls exploration in the space of ae .
Importantly, the proper functioning of the proposed algorithm depends on some
requirements. In order to limit the search for reasonable parameters, the explo-
ration parameter, the mean and the standard deviation must be bounded to obtain
reasonable performance. Furthermore, if the learning problem consists of more
than one starting state, all parameters must be associated to each occurring
starting state, i.e. μ → μ(s), σ → σ(s) and ρ̄ → ρ̄(s), since way costs might
affect ρ unevenly. However, if a learning problem consists of just one starting
state, all utilized parameters can be considered as global parameters.
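A compact sketch of the resulting REC procedure, under the assumptions above (one starting state, bounded parameters; the class name, default bounds and initial values are ours, not prescribed by the paper):

```python
import numpy as np

class RECExploration:
    """REINFORCE-style control of a scalar exploration parameter
    (illustrative implementation of Eqs. 1 and 7-9)."""

    def __init__(self, mu=0.5, sigma=0.2, alpha=0.1,
                 ae_bounds=(0.0, 1.0), sigma_bounds=(0.01, 0.5)):
        self.mu, self.sigma, self.alpha = mu, sigma, alpha
        self.baseline = 0.0
        self.ae_bounds, self.sigma_bounds = ae_bounds, sigma_bounds
        self.ae = mu

    def draw(self, rng):
        # Eq. 1: draw the episode's exploration parameter from N(mu, sigma)
        self.ae = float(np.clip(rng.normal(self.mu, self.sigma), *self.ae_bounds))
        return self.ae

    def update(self, rho):
        # Eqs. 7-8 with alpha_R = alpha * sigma^2, then the bounds
        alpha_r = self.alpha * self.sigma ** 2
        adv = rho - self.baseline
        d = self.ae - self.mu
        new_mu = self.mu + alpha_r * adv * d / self.sigma ** 2
        new_sigma = self.sigma + alpha_r * adv * (d ** 2 - self.sigma ** 2) / self.sigma ** 3
        self.mu = float(np.clip(new_mu, *self.ae_bounds))
        self.sigma = float(np.clip(new_sigma, *self.sigma_bounds))
        self.baseline += self.alpha * adv   # Eq. 9
```

On a toy problem whose return is highest for a small exploration rate, the mean μ drifts towards that rate over the episodes while σ controls the search around it.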

4 Experiments
The presented approach is evaluated in two environments using Q-learning and
Sarsa. First, a variation of the cliff-walking problem [1] is proposed as the
non-stationary cliff-walking problem comprising a non-stationary environment.

Fig. 1. The non-stationary cliff-walking problem (a) and the mountain-car problem
with two goals (b). Panel (a) includes the phase schedule (a: episodes 1–200, b:
201–1000, c: 1001–3000) with goal rewards rG1 = 3, 21, 41 and rG2 = n/a, n/a, 500;
panel (b) shows the car position range from −1.2 to 0.3 with the two goals rewarded
by rt+1 = 1000 and rt+1 = 0.

Second, a variation of the mountain-car problem is investigated comprising a


continuous-valued state space approximated by a table, which causes partial ob-
servability of the actual coordinates. Investigated basic exploration policies are
ε-Greedy, Softmax and MBE. Since MBE requires two parameters to be set (ε
and τ ), we only adapt ε of this policy, while setting the temperature parameter
constantly to the value of τ = 1, and normalizing all Q-values in state s into the
interval [−1, 1]. For comparison, the recently proposed VDBE-Softmax policy
is also evaluated, which works similar to the MBE policy, but adapts a state-
dependent exploration rate ε(s) based on fluctuations in the utility function Q
[12]. For this policy the adaptable parameter is the sensitivity parameter used
for controlling greediness.

4.1 The Non-stationary Cliff-Walking Problem

The non-stationary cliff-walking problem is a modification of the cliff-walking


problem presented by Sutton and Barto [1], but additionally comprising non-
stationary responses of the environment. The goal for the agent is learning a
path from starting state S to the goal state G1 , which is rewarded with the
absolute costs of the shortest path minus 1 if successful (see Fig. 1(a)). The way
costs (reward) for each action are defined as rstep = −1. The environment also
comprises unsafe cliff states, which lead to a high negative reward of rcliff = −100
when entered, and also reset the agent back to the starting state S.
At the beginning of the experiment, learning takes place in phase (a) of
Fig. 1(a) having one cliff state (at left border). After 200 learning episodes,
the grid world changes to phase (b), now comprising 10 cliffs. After an additional
800 episodes, the problem is tightened as shown in phase (c), where the number
of cliffs is increased to 20. Additionally, an alternative goal state, G2, appears,
which is rewarded much higher, with rG2 = 500 when entered.
In each episode the agent starts in state S, and an episode terminates
when a goal state, G1 or G2, is entered or the time limit of Tmax = 200 actions
is exceeded. Since the learning problem is episodic, no discounting (γ = 1) is
used. Finally, all utility values are optimistically initialized with Qt=0 (s, a) = 0,
and learned using a constant step-size parameter of α = 0.2.
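The experimental schedule can be summarized in a small configuration helper (the rG1 values 3, 21 and 41 are read from the table in Fig. 1(a); the remaining constants follow the text):

```python
# Constants from the text: step/cliff rewards, time limit, discount, step size
R_STEP, R_CLIFF, T_MAX, GAMMA, ALPHA = -1, -100, 200, 1.0, 0.2

def phase_config(episode):
    """Environment configuration for the non-stationary cliff-walking
    problem as a function of the episode number."""
    if episode <= 200:
        return {"phase": "a", "n_cliffs": 1, "r_G1": 3, "r_G2": None}
    if episode <= 1000:
        return {"phase": "b", "n_cliffs": 10, "r_G1": 21, "r_G2": None}
    return {"phase": "c", "n_cliffs": 20, "r_G1": 41, "r_G2": 500}
```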

Fig. 2. The non-stationary cliff-walking problem: averaged results (smoothed) using
Q-learning and Sarsa. (a) Averaged reward per episode; (b) mean and standard
deviation for MBE and Softmax. Note the dynamics of exploration for non-stationary
environment responses.

Results. Figure 2(a) shows the averaged reward per episode over 500 runs for
the non-stationary cliff-walking problem. Averages of the mean μ and standard
deviation σ are shown in Figure 2(b). It is observable that VDBE-Softmax max-
imizes the reward per episode. MBE shows the best performance of the remaining
three basic policies, with the advantage of not requiring the memorization of any
further exploratory data such as that utilized by VDBE-Softmax. The Sarsa learning
algorithm shows better results for all four investigated policies. All REC policies
achieve a much higher reward in episode 3000 than a pure greedy policy, which
converges to a reward per episode of 0 and is only the optimal policy for the first
1000 episodes. In contrast, a pure random policy converges to a reward per episode
of about −2750.

4.2 The Mountain-Car Problem with Two Goals


In the mountain-car problem, the goal is to drive an underpowered car up a
mountain road [1], starting in the valley between two mountains. The problem
is that gravity is stronger than the car's engine, so the mountain top cannot
be reached by full throttle alone. Instead, the car has to swing back and
forth to collect enough inertia to overcome gravity. In the modification of
the original learning problem presented here, two goal states are used, as
depicted in Fig. 1(b), which are rewarded differently upon arrival, because in
the original description of the learning problem a simple greedy policy
already leads to optimal performance. The bounded state variables are
continuous, consisting of the position of the car, −1.2 ≤ x ≤ 0.3, and its
velocity, −0.07 ≤ ẋ ≤ 0.07. The car's dynamics are described by the equations

    x_{t+1} = bound(x_t + ẋ_{t+1})
    ẋ_{t+1} = bound(ẋ_t + 0.001 a_t − 0.0025 cos(3 x_t)).          (10)
At each discrete time step, the agent can choose among seven actions, at ∈
{−1.0, −0.66, −0.33, 0, 0.33, 0.66, 1.0}, each rewarded by rt+1 = −1, except
for reaching the right goal, which is rewarded by rt+1 = 1000. An episode
terminates when either of the two goals has been reached or when a maximum
number of actions, Tmax = 10000, is exceeded. At the beginning of each episode,
the car is positioned in the valley at position x = −0.5 with initial velocity
ẋ = 0.0. The state space is approximated by a 100 × 100 matrix, thus causing
the actual state to be only partially observable. Since the learning problem
is episodic, no discounting (γ = 1) is used. Finally, all utility values are
optimistically initialized with Qt=0 (s, a) = 0, and learned using a step-size
parameter of α = 0.7.
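The dynamics of Eq. (10) translate directly into code; `bound` clips each state variable to its stated range:

```python
import math

# State bounds and action set as given in the text.
X_MIN, X_MAX = -1.2, 0.3
V_MIN, V_MAX = -0.07, 0.07
ACTIONS = (-1.0, -0.66, -0.33, 0.0, 0.33, 0.66, 1.0)

def bound(value, lo, hi):
    """Clip a state variable to its admissible interval."""
    return min(max(value, lo), hi)

def mc_step(x, v, a):
    """One step of the mountain-car dynamics (Eq. (10))."""
    v_next = bound(v + 0.001 * a - 0.0025 * math.cos(3 * x), V_MIN, V_MAX)
    x_next = bound(x + v_next, X_MIN, X_MAX)
    return x_next, v_next

# Start of an episode: car at rest in the valley, full throttle to the right.
x, v = mc_step(-0.5, 0.0, ACTIONS[-1])
```

At x = −0.5 the gravity term −0.0025 cos(3x) nearly cancels the engine term, which is why the car must swing up rather than drive straight out of the valley.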

Results. The results, averaged over 200 runs, are shown in Figure 3. Similar
to the non-stationary cliff-walking problem, episodic adaptation of MBE
outperforms ε-Greedy and Softmax. Furthermore, the Sarsa algorithm is
advantageous only in combination with the ε-Greedy and Softmax policies,
whereas MBE and VDBE-Softmax behave more efficiently in combination with
Q-learning. In the first phase of learning, a degradation of performance is
recognizable for episodic MBE and VDBE-Softmax, which is due to first learning
a path to the left goal and only afterwards the path to the right (better)
goal. For comparison, a greedy policy converges to an average reward per
episode of 114, whereas a pure random policy converges to −5440.

Fig. 3. The mountain-car problem with two goals: Averaged reward using Q-learning
and Sarsa (smoothed)

5 Discussion and Conclusions


In this paper, stochastic neurons have been deployed to adapt the exploration
parameter of basic exploration policies using gradient-following algorithms. A
global variant was evaluated with two popular learning algorithms, Q-learning
and Sarsa, on two different learning problems, showing performance
improvements in trading off exploration and exploitation. Results from the
non-stationary cliff-walking problem show how the exploration parameter is
readapted, based on the learning progress, when non-stationary environment
responses are received. The MBE policy turned out to be reliable for achieving
good performance without having to memorize any exploratory data. However,
additional exploratory data may further improve results, as shown by
VDBE-Softmax in the non-stationary cliff-walking problem, although this is not
always the case, as observed in the mountain-car problem with two goals.

Acknowledgements. Michel Tokic received funding from the collaborative center
for applied research ZAFH-Servicerobotik. The authors gratefully acknowledge
the research grants of the federal state of Baden-Württemberg and the European
Union.

References
[1] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
[2] Wiering, M.: Explorations in Efficient Reinforcement Learning. PhD thesis, Uni-
versity of Amsterdam, Amsterdam (1999)
[3] Thrun, S.B.: Efficient exploration in reinforcement learning. Technical Report
CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, USA (1992)
[4] Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. The
Journal of Machine Learning Research 3, 397–422 (2002)
[5] van Eck, N.J., van Wezel, M.: Application of reinforcement learning to the game
of Othello. Computers and Operations Research 35, 1999–2017 (2008)
[6] Faußer, S., Schwenker, F.: Learning a strategy with neural approximated temporal-
difference methods in english draughts. In: Proceedings of the 20th International
Conference on Pattern Recognition, ICPR 2010, pp. 2925–2928. IEEE Computer
Society (2010)
[7] Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems.
Technical Report CUED/F-INFENG/TR 166, Cambridge University (1994)
[8] Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical sub-
strates for exploratory decisions in humans. Nature 441(7095), 876–879 (2006)
[9] Williams, R.J.: Simple statistical Gradient-Following algorithms for connectionist
reinforcement learning. Machine Learning 8, 229–256 (1992)
[10] Watkins, C.: Learning from Delayed Rewards. PhD thesis, University of Cam-
bridge, England (1989)
[11] Grzes, M., Kudenko, D.: Online learning of shaping rewards in reinforcement learn-
ing. Neural Networks 23(4), 541–550 (2010)
[12] Tokic, M., Palm, G.: Value-Difference Based Exploration: Adaptive Control be-
tween Epsilon-Greedy and Softmax. In: Bach, J., Edelkamp, S. (eds.) KI 2011.
LNCS, vol. 7006, pp. 335–346. Springer, Heidelberg (2011)
Comparison of Long-Term Adaptivity
for Neural Networks

Frank-Florian Steege1,2 and Horst-Michael Groß1

1 Ilmenau Technical University, Neuroinformatics and Cognitive Robotics Lab,
P.O. Box 100565, D-98684 Ilmenau, Germany
2 Powitec Intelligent Technologies GmbH, 45219 Essen-Kettwig, Germany
frank-florian.steege@stud.tu-ilmenau.de

Abstract. Neural Networks can be used for the prognosis of important


quality measures in industrial processes to complement or reduce costly
laboratory analysis. Problems occur if the system dynamics change over
time (concept drift). We survey different approaches to handle concept
drift and to ensure good prognosis quality over long time ranges. Two
main approaches - data accumulation and ensemble learning - are ex-
plained and implemented. We compare the concepts on artificial datasets
and on industrial data from three cement production plants and analyse
strengths and weaknesses of different approaches.

Keywords: Neural Network, Concept Drift, Incremental Learning, Long


Term Learning.

1 Introduction
Artificial Neural Networks (ANN) for function approximation can be used for
modern process monitoring and process control. One of the most common ap-
proaches is to use ANN to predict future values of a measurement [1]. If this
prediction is used in combination with a controller, the system is known as Model
Predictive Control (MPC).
One of the most important problems with MPC in industrial applications is a
slowly changing environment, which results in a drift of the properties and
dynamics of the process. This change is known as concept drift [2]. Hence, the
prognosis of the ANN worsens the more time has passed since the ANN was
trained.
To counter this worsening, the ANN has to be retrained with new data to
adapt to changes of the process. In this paper, we compare different methods
to retrain neural networks (in particular Multi-Layer Perceptrons) with new
data. We are especially interested in long term effects of different retraining
approaches. We compare all methods on artificial data and on data obtained
from several industrial cement production plants.
The remainder of this paper is organized as follows: we present a Model
Predictive Control scenario and the data we used to benchmark different
algorithms in Sect. 2. Accordingly, in Sect. 3, we give a brief review of
related approaches to adapting Neural Networks for function approximation to
new data and compare their applicability to the environment. The algorithms
we used are explained in Sect. 4, and a description of the experimental
investigations is given in Sect. 5. We conclude with a summary and an outlook
on possible improvements.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 50–57, 2012.
© Springer-Verlag Berlin Heidelberg 2012

2 Experimental Environment and Prerequisites


Automatic learning and continuous adaptation are required in process control
if the dynamics of the controlled system change over time. Industrial
combustion processes, as used in coal-fired power plants, waste incineration
plants, or cement plants, are a good example of such changing dynamics. Carbon
black and slag are combustion by-products and coat the furnace walls. Over
time, the coating grows and changes the properties of the combustion process.
An even more challenging issue is the fast change in raw material quality.
Cement plants are very challenging but also very promising applications for
Model Predictive Control. Today, cement is usually produced in rotary kiln
plants [3]. The residence time of the raw material in the rotary kiln varies
from 35 to 60 minutes and depends on the steepness, length, and rotation speed
of the kiln. The quality of the produced cement is determined by laboratory
analysis, which can take from five minutes (X-ray analysis) up to two hours
(chemical analysis). A cement sample is usually taken every two hours. Hence
it may take up to four hours until a quality measurement is available for the
current process situation. If a neural network can predict and estimate cement
quality from continuous measurements, such as air temperature, kiln rotation
speed, raw meal feed, etc., a controller can react much faster to changes in
the cement quality.
Figure 1 shows the prognosis of one main quality measure, the free lime value,
obtained with a Multi-Layer Perceptron.

Fig. 1. Free lime value prognosis at a cement plant. The plot shows three days of plant
operation with laboratory measurements and neural network prognosis of the free lime
value. While a laboratory measurement is only available every four hours, the prognosis
is available the whole time.

The free lime value [3] is a major quality criterion of the cement. If a good pre-
diction of this value is possible, the whole production process can be stabilized.

This prognosis can only be used for control purposes if it is correct.
Obtaining a correct prognosis over a long time range of two or more years is
very demanding, as the process dynamics of the cement plant change, as
mentioned above. Figure 2 shows long-term changes of an important process
measurement from a cement plant over a period of two years. To test the
capability of different long-term network adaptation algorithms, we used data
from three cement plants. The target for the approximation was the free lime
value of the cement produced by the plant. Network inputs were signals
obtained from the process, such as kiln rotation speed, kiln temperature, and
raw meal feed.

Fig. 2. Time plot of kiln inlet meal temperature over the period of two years. The
measurement is low-pass filtered with a sliding time window of one week. Time spans
without signal denote stops of the plant. Over the whole time, a change in the level
and dynamics of the signal of about 80 degrees is visible.

3 State-of-the-Art
One of the first publications to mention the difficulties of long-term
adaptivity of artificial neural networks in changing environments is [4]. The
author analysed the problem of catastrophic interference in neural networks.
Catastrophic interference describes the phenomenon that learning new facts
disrupts performance on previously learned old facts; however, the author of
[4] did not give solutions for how the problem could be solved or avoided.
In [5] the author proposes the FLORA framework, which accepts only certain
samples for training a classifier. Old samples that do not fit the current
window are not used for training. A method to weight old and new samples for
a two-layered network is introduced in [6], using a forgetting function that
reduces the influence of old samples in the training process.
Other approaches use not only one approximator but an ensemble. Examples
are Learn++ [7], dynamic weighted majority [8], incremental adaptive learning
[9], or iRGLVQ [10]. All ensemble based approaches are similar in their use of
more than one model, where each model is trained with different data. They can
be discerned by the basic model used and the way the ensemble members are
used to compute an approximation.

Summarizing the aforementioned publications, two different classes of handling
concept drift can be distinguished:

1. Data accumulation/Instance selection [5,6,9]: learn a new model with


all/selected data acquired; discard the old model
2. Ensemble learning [7,8,9,10]: learn a new model with new data; select the
best model for every situation or merge output of old and new model

When comparing these different approaches, two major problems occur. First,
every approach uses a different basic approximation model which is adapted to
concept drift: in [10] Learning Vector Quantisation (LVQ) is used, in [9,7]
Multi-Layer Perceptrons (MLP), in [6] a combination of two subnetworks, in [5]
attribute-value logic, and in [8] an Incremental Tree Inducer (ITI) and Bayes
learners. Second, most results are obtained on artificial data or on real data
with artificially induced concept drift.
In the following, we compare the algorithms on industrial data, using the same
basic approximator (MLP) for all adaptation algorithms. For the purpose of
explanation and to allow other researchers to reproduce our results, we also
use three artificial datasets for benchmarking.

4 Algorithms for Automatic Network Adaptation

In Sec. 3 we showed that there are two paradigms for adapting a model to
concept drift: 1. data accumulation and 2. ensemble learning. In this section
we propose algorithms that apply these paradigms to the training of
Multi-Layer Perceptrons used as function approximators for Model Predictive
Control.
Every approach starts with the same preconditions: there is an initial dataset
Sinit with si ∈ Sinit , si = (x1 , . . . , xm , y), where x1 , . . . , xm are the input
values/measurements for m input dimensions and y is the target value of the
respective sample. This dataset is used to train an MLP Ninit with the
Levenberg-Marquardt training algorithm.

4.1 Data Accumulation

The principle of data accumulation is to have one model which is adapted when
a certain amount of new data Snew is available. We applied three variations of
this concept:

– data acc.1: create a dataset Saccu = Sinit ∪ Snew and retrain Ninit with
  dataset Saccu
– data acc.2: retrain Ninit only with dataset Snew ; ignore old data
– data acc.3: split Snew into training data Snew^train and validation data
  Snew^val; create a new MLP Nnew and train it with Snew^train; if the
  approximation error of Nnew on Snew^val is lower than the approximation
  error of Ninit on Snew^val, delete Ninit and use Nnew

Each of the three concepts is repeated every time a new dataset Snew is
available (the exact number of samples sufficient to create a new dataset
depends on the data used). The first concept uses all samples/information
acquired over time but may suffer from catastrophic interference as described
in [4]. The second concept is a very basic version of the sliding time window
used in [5]. The third concept tries to compensate for the drawbacks of the
first two: a new (retrained) model is only accepted if its results are better
than those of the old model.
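The accept-if-better rule of data acc.3 can be sketched independently of the MLP details; `train` and `err` below are hypothetical stand-ins for Levenberg-Marquardt training and validation-set error:

```python
def retrain_acc3(current_model, s_new, train, err, n_val=50):
    """data acc.3: train a candidate network on the training part of S_new and
    keep it only if it beats the current model on the validation part."""
    s_train, s_val = s_new[:-n_val], s_new[-n_val:]
    candidate = train(s_train)                       # N_new
    if err(candidate, s_val) < err(current_model, s_val):
        return candidate                             # replace N_init
    return current_model                             # keep the old network

# Toy stand-ins: a "model" is a constant prediction, error is mean abs. deviation.
train = lambda data: sum(y for _, y in data) / len(data)
err = lambda model, data: sum(abs(model - y) for _, y in data) / len(data)
s_new = [(i, 1.0) for i in range(250)]               # drifted data: target now 1.0
model = retrain_acc3(0.0, s_new, train, err)         # old model predicted 0.0
```

The 250/50 split mirrors the retraining regime used in the experiments of Sect. 5.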

4.2 Ensemble Learning


The principle of ensemble learning is to train a new model Nnew whenever
sufficient new data Snew is available. This results in a pool P of models
Ni ∈ P, i = 1 . . . n, where n is the number of currently available models. The
important question is which model Ni is activated at a certain time step t. We
compare two different ways to determine the active model Ni :

– ensemble1: use all Ni ∈ P to simulate the target y for the last d time
  steps, y^s = (y^s_{t−d} , . . . , y^s_{t−1} ), and compare the prognosis error
  e = Σ |y^s − y| for all Ni ∈ P ; the model with the lowest error is chosen
– ensemble2: compare the current input with the training data of each model
  Ni ∈ P to choose the best model; therefore:
  • cluster the training data of each network i (for example with a k-means
    clusterer) into k clusters K and calculate the characteristic input
    values x^{ik} = (x^{ik}_1 , . . . , x^{ik}_m ) and the mean prognosis error
    e^{ik} for each cluster centre
  • for every new sample snew = (x1 , . . . , xm ) calculate the Euclidean
    distance d^{ik} to each cluster centre x^{ik}
  • calculate the distance-weighted mean prognosis error
    e^i = Σ_{c∈K} (e^c · d^c) / Σ_j d^j over the cluster centres c ∈ K of
    each model Ni ∈ P
  • the model Ni with the minimal e^i is chosen and set active
Both approaches are simpler versions of the ensemble methods used in
[7,8,9,10]. Comparing input/output relations is not as easy in Multi-Layer
Perceptrons as in LVQs or rule-based systems. Nevertheless, both approaches
enable the use of adaptive ensembles of MLPs for Model Predictive Control.
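ensemble1's selection rule, for instance, reduces to choosing the pool member with the lowest summed prognosis error over the last d observed samples; `predict` below is a placeholder for the MLP forward pass:

```python
def select_active_model(pool, recent, predict):
    """ensemble1: simulate every model on the last d observed samples and
    return the one with the lowest summed absolute prognosis error."""
    def error(model):
        return sum(abs(predict(model, x) - y) for x, y in recent)
    return min(pool, key=error)

# Toy pool: each "model" predicts a constant; the recent target level is 2.0.
pool = [0.0, 1.0, 2.0, 3.0]
recent = [(None, 2.0)] * 50            # the last d = 50 samples
best = select_active_model(pool, recent, lambda m, x: m)
```

With d = 50 this matches the window size used for the ensemble1 test in Sect. 5.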

5 Experiments
In this section, we apply the adaptation algorithms explained in Sec. 4 to data
with concept drift. We use two different types of data. The first three data sets
are obtained from rotary kiln cement production plants. The target y for the
MLP prognosis is the free lime value, which indicates the quality of the
cement produced [3]. Five to ten different measurements from each kiln, such
as kiln inlet temperature, secondary air temperature, raw meal feed, etc. (see
[3] for a detailed description of the measurements), are used as input values
for a sample si = (x1 , . . . , xm , y). We use two years of data from each plant,
which results in 2,100-4,000 samples, depending on the sample rate of the
laboratory (from three up to eight hours) and the revision times of the plants.
For purpose of explanation and to allow for other researchers to reproduce
our results, we also use three simple artificial datasets. We generate a target y(t)
from five input signals x1 (t), . . . , x5 (t) as shown in equation 1:

y(t) = α · x1 (t) · x2 (t) + x3 (t) + α · x4 (t) + x1 (t) · x5 (t) + d(t) (1)

Each input xi (t) is randomly sampled from a Gaussian distribution of N (0, 1).
We add noise data d(t) sampled from N (0, 0.3). Linear concept drift is induced
by the parameter α which changes over the simulation time. We apply three
different variations for the concept drift:

1. α changes linearly from 0.1 to 1 and is set back to 0.1 at certain times;
   this corresponds to slagging in industrial plants, which grows over time
   but is set back to a low level after a plant revision
2. α changes linearly from 0.1 to 1
3. α does not change, which corresponds to a process without concept drift

Fig. 3 illustrates the progress of parameter α for all three artificial datasets.

Fig. 3. Progress of parameter α used to model concept drift in three artificial datasets
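The artificial benchmark can be reproduced from the description above; the sketch below implements Eq. (1) with the variation-1 drift schedule (the reset times and dataset length are illustrative assumptions):

```python
import random

def alpha_schedule(T, resets=()):
    """Variation 1: alpha drifts linearly from 0.1 to 1 within each segment
    and is reset to 0.1 at every 'plant revision' time in `resets`."""
    bounds = [0, *resets, T]
    values = []
    for lo, hi in zip(bounds, bounds[1:]):
        n = hi - lo
        values += [0.1 + 0.9 * i / max(n - 1, 1) for i in range(n)]
    return values

def make_dataset(T=1200, resets=(400, 800), seed=0):
    """Sample Eq. (1): five Gaussian inputs and noise d(t) ~ N(0, 0.3)."""
    rng = random.Random(seed)
    data = []
    for a in alpha_schedule(T, resets):
        x = [rng.gauss(0, 1) for _ in range(5)]
        y = a * x[0] * x[1] + x[2] + a * x[3] + x[0] * x[4] + rng.gauss(0, 0.3)
        data.append((x, y))
    return data

data = make_dataset()
```

Variation 2 corresponds to `alpha_schedule(T)` without resets, and variation 3 to a constant α.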

For the prognosis of the target, we use a Multi-Layer Perceptron with one
hidden layer of five neurons. The training algorithm is standard
Levenberg-Marquardt training as included in the Neural Network library of
Matlab. All networks (except for approach data acc.1) are trained/retrained
with 250 samples of data, where the last 50 samples are used for validation.
For data acc.1, we use a growing training set which includes all samples
available since the start of retraining. For ensemble2, we apply a k-means
clusterer with k = 10 to cluster the training set of each network. The test
performed in ensemble1 to determine the best model is carried out with the
last 50 samples observed. No additional pruning algorithms are used, as we
focus on the effects the different adaptation algorithms have on long-term
prognosis error. Table 1 shows the results on the different datasets.

Table 1. Median prognosis error eQ50% of 200 trials of network training. The
two best results for each data set are marked with a grey background. The
number of plant revisions (resets of the concept drift) and the sum of errors
over all datasets are also listed.

eQ50%       plant1  plant2  plant3  Σ plant | art.data1  art.data2  art.data3  Σ art.
no adapt.   0.635   0.871   0.782   2.288   | 0.309      0.334      0.135      0.778
data acc.1  0.492   0.766   0.714   1.972   | 0.220      0.217      0.124      0.561
data acc.2  0.701   0.773   0.768   2.242   | 0.201      0.156      0.124      0.481
data acc.3  0.520   0.801   0.779   2.100   | 0.249      0.167      0.134      0.550
ensemble1   0.478   0.795   0.749   2.022   | 0.193      0.168      0.134      0.495
ensemble2   0.524   0.850   0.793   2.167   | 0.306      0.275      0.185      0.766
revisions   3       4       0               | 3          0          0

For the evaluation of the results, we repeated every simulation 200 times. The
mean prognosis error over the whole time period was calculated. Afterwards,
we compared the median eQ50% of all 200 trials for each concept and data set.
We chose the median and not the mean because approximately 1% of the trained
networks produce a very high error due to disadvantageous initialisation,
which would influence the mean error of all 200 simulations disproportionately.
Prognosis without adaptation of the network produces the worst result. This
was expected, as it does not counter the concept drift. ensemble2 also
performs very poorly. This is a result of the imprecise representation of the
input space we chose with the k-means clustering. If a better method were
found to map and compare input/output relations in trained MLPs, this approach
would likely produce better results. The potential of ensembles is revealed by
ensemble1, which is the second-best method of the six approaches we tested.
Only if the concept does not change (art.data3) or there is no revision of the
plant included in the data (art.data2, plant3) do data accumulation approaches
outperform this ensemble approach.
Of the three data accumulation approaches, data acc.1 performs best on
real-world data. This is surprising, since data acc.1 uses all data available,
which results in ambiguous data due to the changing parameters (boiler
slagging in the plants and α in the artificial data). Nonetheless, the
prediction acquired with unambiguous data but fewer training samples is worse.
We expect the results of data acc.2/3 to improve if the sampling rate is
increased and more samples are available for the used time window.
On plant3 the differences between the approaches are smaller than on the other
plants. The reason is that in plant3 other sensor measurements than in plant1
and plant2 had to be used because of the plant architecture. Hence the overall
prognosis quality decreases and the differences between the adaptation
concepts disappear.

6 Conclusion

Concept drift influences the quality of neural network prognosis in industrial
combustion processes. Through growing boiler slagging and the use of different
fuels, the prognosis of important performance figures worsens if the networks
used are not adapted to the changing data. We applied different approaches to
adapt networks to concept drift over long time ranges. The best approach
depends on the type of concept drift. If the dynamics and properties of the
plant change very slowly and old states do not reappear, it is advantageous to
use a sliding-window technique and data accumulation to constantly retrain a
single network with new data.
If changes in the dynamics and properties appear very abruptly and old states
reappear (due to revisions of a plant or a small selection of used fuels),
ensemble learning with more than one model is superior to the other concepts.
Our future work concentrates on improving the use of ensemble methods.
Furthermore, we want to apply the approaches to other industrial MPC problems
and compare them with the results obtained on cement plants.

References
1. Agachi, P.S., Nagy, Z.K., Cristea, M.V., Imre-Lucaci, A.: Model based control:
Case Studies in Process Engineering. Wiley-VCH (2006)
2. Tsymbal, A.: The problem of concept drift: Definitions and related work. Technical
Report, Department of Computer Science, Trinity College: Dublin, Ireland (2004)
3. Alsop, P.A.: The Cement Plant Operations Handbook, 5th edn. Tradeship Publi-
cations Ltd. (2007)
4. McCloskey, M., Cohen, N.: Catastrophic interference in connectionist networks:
The sequential learning problem. Psychology of Learning and Motivation 24, 109–
164 (1989)
5. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden
context. Machine Learning 23(1), 69–101 (1996)
6. Pérez-Sánchez, B., Fontenla-Romero, O., Guijarro-Berdiñas, B.: An incremental
learning method for neural networks in adaptive environments. In: Int. Joint Conf.
on Neural Networks (IJCNN 2010), pp. 1–8 (2010)
7. Elwell, R., Polikar, R.: Incremental Learning of Concept Drift in Nonstationary
Environments. IEEE Transactions on Neural Networks 22(10), 1517–1531 (2011)
8. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: A new ensemble method
for tracking concept drift. In: Proc. IEEE Int. Conf. on Data Mining (ICDM 2003),
pp. 123–130 (2003)
9. He, H.: Self-Adaptive Systems for Machine Intelligence. John Wiley & Sons (2011)
10. Kirstein, S., Wersing, H., Gross, H.-M., Koerner, E.: A life-long learning vector
quantization approach for interactive learning of multiple categories. Neural Net-
works 28, 90–105 (2012)
Simplifying ConvNets for Fast Learning

Franck Mamalet1 and Christophe Garcia2


1 Orange Labs, 4 rue du Clos Courtel, 35512 Cesson-Sévigné, France
franck.mamalet@orange.com
2 LIRIS, CNRS, Insa de Lyon, 17 avenue Jean Capelle, 69621 Villeurbanne, France
christophe.garcia@liris.cnrs.fr

Abstract. In this paper, we propose different strategies for simplifying the
filters, used as feature extractors, to be learnt in convolutional neural
networks (ConvNets), in order to modify the hypothesis space and to speed up
learning and processing times. We study two kinds of filters that are known to
be computationally efficient in feed-forward processing: fused
convolution/sub-sampling filters, and separable filters. We compare the
complexity of the back-propagation algorithm on ConvNets based on these
different kinds of filters. We show that using these filters allows reaching
the same level of recognition performance as with classical ConvNets for
handwritten digit recognition, up to 3.3 times faster.

1 Introduction

Convolutional Neural Networks (ConvNets), proposed by LeCun et al. [1], have


shown great performances in various computer vision applications, such as hand-
written character recognition [1,2], facial analysis [3,4,5], videoOCR [6,7], or
vision-based navigation [8]. ConvNets consist of a pipeline of convolution and
pooling operations followed by a multi-layer perceptron. They tightly couple
local feature extraction, global model construction and classification in a single
architecture where all parameters are learnt conjointly using back-propagation.
Constructing efficient ConvNets to solve a given problem requires exploring
several network architectures by choosing the number of layers, the number of
features per layer, convolution and sub-sampling sizes, and connections
between layers, all of which directly impact training time.
Several approaches have been proposed to improve learning speed and
generalization by reducing or modifying the hypothesis space, i.e. the network
architecture, or the activation function [9]. Pruning neural networks has been
broadly studied for MLPs (a survey is given in [10]). Concerning ConvNets,
Jarrett et al. [11] proposed to add sophisticated non-linearity layers such as
rectified sigmoids or local contrast normalization to improve network
convergence. Mrazova et al. [12] proposed to replace convolutions by Radial
Basis Function (RBF) layers, leading to faster learning when associated with a
"winner-takes-all" strategy.
Optimization methods for efficient implementation on processors or hardware
systems have also been proposed to accelerate learning. Some studies were
carried out on fractional transformations of back-propagation [13], others on
parallelization schemes (comparisons are given in [14]). Recent works have
targeted graphics processing units (GPUs) to speed up back-propagation [2,15].

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 58–65, 2012.
© Springer-Verlag Berlin Heidelberg 2012

Fig. 1. (a) A typical ConvNet architecture with two feature extraction stages;
(b) fusion of convolution and sub-sampling layers.
In this paper, we focus on modifications of the ConvNet hypothesis space,
using simplified convolutional filters to accelerate epoch processing time. In
section 2, we describe the reference ConvNet architecture, detail the proposed
equivalent convolutional filters, and compare their back-propagation
complexity. Section 3 presents in-depth experiments on handwritten digit
recognition using different kinds of convolutional filters, and compares both
generalization performance and training time. Finally, conclusions and
perspectives are drawn in section 4.

2 Simplifying Convolutional Filters


In this section, we first describe the classical ConvNet LeNet-5 [1], proposed
by LeCun et al. Then, we propose several equivalent network architectures
using simplified convolutional filters, and compare the complexity of the
back-propagation algorithm on these layers.
The original ConvNet model, illustrated in Figure 1.(a), is based on
convolutional filter layers interspersed with non-linear activation functions,
followed by spatial feature pooling operations such as sub-sampling.
Convolutional layers
Ci contain a given number of planes. Each unit in a plane receives input from
a small neighborhood (local receptive field) in the planes of the previous layer.
Each plane can be considered a feature map with a fixed feature detector,
which corresponds to a convolution with a trainable mask of size Ki × Ki ,
applied over the planes in the previous layer. A trainable bias is added to the
results of each convolutional mask, and a hyperbolic tangent function, used as
an activation function, is applied. Multiple planes are used in each layer so that
multiple features can be detected. Once a feature has been detected, its exact lo-
cation is less important. Hence, each convolutional layer Ci is typically followed
by a pooling layer Si that computes the average (or maximum) values over a
neighborhood in each feature map, multiplies it by a trainable coefficient, adds
a trainable bias, and passes the result through an activation function.
Garcia et al. [3,5,7] have shown that, for different object recognition tasks,
state-of-the-art solutions can be achieved without non-linear activation functions
in convolutional layers. Thus, in the rest of this paper, we will only consider Ci
layers with identity activation function. We will also consider average pooling
layers Si performing a sub-sampling by two. For a Ci layer, its input map size
Win × Hin , its output map size Wi × Hi , and the following Si sub-sampled
output map size SWi × SHi are connected to the convolution kernel size Ki by:
(Wi , Hi ) = (Win − Ki + 1, Hin − Ki + 1) and (SWi , SHi ) = (Wi /2, Hi /2).
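These size relations can be checked with a short sketch (our own illustration, using the 32 × 32 inputs and 5 × 5 kernels of the LeNet-5-style network described later):

```python
def conv_output_size(w_in, h_in, k):
    # Valid convolution with a k x k kernel: Wi = Win - Ki + 1
    return w_in - k + 1, h_in - k + 1

def pool_output_size(w, h):
    # Sub-sampling by two: SWi = Wi / 2
    return w // 2, h // 2

w, h = conv_output_size(32, 32, 5)   # C1 output maps: 28 x 28
sw, sh = pool_output_size(w, h)      # S1 output maps: 14 x 14
print(w, h, sw, sh)  # 28 28 14 14
```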
Since these layers rely on local receptive fields, the complexity of the back-
propagation delta-rule algorithm for a given element is proportional to its output
map size and the cardinality of its connections with the following layer, that is,
proportional to (Wi × Hi) for Ci layers and (SWi × SHi × K_{i+1}²) for Si layers.
Weight sharing in these layers implies a complexity of the weight update
algorithm that is proportional to output map and kernel sizes, i.e. (Wi × Hi × Ki²)
for Ci layers, and (SWi × SHi) for Si layers.
In the remainder of this section, we present our proposition to learn modified
ConvNets where Ci and Si layers are replaced by equivalent convolutional filters,
and compare the back-propagation complexity of these layers.

2.1 Fused Convolution and sub-sampling Filters


It has been shown by Mamalet et al. [16,17] that a convolutional layer Ci with
identity activation followed by a sub-sampling layer Si can be replaced in the
feed-forward pass (once the learning phase is complete) by an equivalent fused
convolution/sub-sampling layer CSi, which consists of a single convolution with a
(Ki + 1) × (Ki + 1) kernel applied with horizontal and vertical input steps of
two pixels, followed by a non-linear activation function (this two-pixel step
serves as sub-sampling, see Figure 1.(a)), leading to speed-up factors of up to
2.5 [16,17]. The kernel weights w̃ and bias b̃ are obtained respectively by linear
combination of the original weights w and bias b.
In this paper, we propose to learn these fused convolution/sub-sampling
layers directly, i.e. convolution maps of kernel size (Ki + 1) × (Ki + 1) with an
input step of two pixels. One can notice that the hypothesis space represented
by fused convolution/sub-sampling layers CSi is larger than the one represented
by the pair (Ci, Si).
The output map size of a layer CSi is SWi × SHi and is connected to a CSi+1
convolution with a step of two pixels. The complexity of the back-propagation
algorithm for such a CSi layer is proportional to (SWi·SHi·(K_{i+1} + 1)²/4). The
weight update algorithm complexity is proportional to (SWi·SHi·(Ki + 1)²).
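As a sanity check of this feed-forward equivalence, the following sketch (our own, not the authors' code) builds a fused 6 × 6 stride-2 kernel from a 5 × 5 kernel, assuming identity activation, plain 2 × 2 averaging (unit pooling coefficient) and zero biases, and verifies that it reproduces the (Ci, Si) output:

```python
import numpy as np

def conv2d_valid(img, k, stride=1):
    # Plain 'valid' 2D correlation (no padding), optionally strided
    kh, kw = k.shape
    H = (img.shape[0] - kh) // stride + 1
    W = (img.shape[1] - kw) // stride + 1
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(img[y*stride:y*stride+kh, x*stride:x*stride+kw] * k)
    return out

def avg_pool2(m):
    # 2x2 average pooling with stride 2
    return 0.25 * (m[0::2, 0::2] + m[0::2, 1::2] + m[1::2, 0::2] + m[1::2, 1::2])

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
w = rng.standard_normal((5, 5))

# Classical pair: 5x5 convolution (identity activation), then 2x2 average pooling
ref = avg_pool2(conv2d_valid(img, w))

# Fused 6x6 kernel: w spread over the four 2x2 pooling offsets, weighted by 1/4
w_fused = np.zeros((6, 6))
for dy in (0, 1):
    for dx in (0, 1):
        w_fused[dy:dy+5, dx:dx+5] += 0.25 * w

fused = conv2d_valid(img, w_fused, stride=2)
print(np.allclose(ref, fused))  # True
```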

2.2 Separable Convolution Filters


Another special case of convolutional filters is separable filters, i.e. convolutions
that can be expressed as the outer product of two vectors: Ci = Chi ∗ Cvi =
Cvi ∗ Chi, where Chi (resp. Cvi) is a row (resp. column) vector of size Ki.
Figure 2.(a) shows a separable Ci feature map split into two successive 1D-
convolutions. In feed-forward computation applied over a Win × Hin input image,
this transformation leads to a Ki²/(2Ki) = Ki/2 speedup factor. While separable filters are
Simplifying ConvNets for Fast Learning 61

Fig. 2. (a) Separable convolution layers; (b) Fused separable convolution and sub-
sampling layers

broadly used in image processing, as far as we know, no study has been published
on learning separable filters within ConvNet architectures.
We thus propose to restrict the hypothesis space using only separable convolutions
in ConvNets, and to directly learn two successive 1D-filters. Although
horizontal and vertical filters commute in the feed-forward pass, back-propagation
may lead to different trained weights depending on their order. Thus, we will
evaluate either a horizontal convolution Chi, whose output map size is Wi × Hin,
followed by a vertical one Cvi (Figure 2.(a)), or a vertical convolution Cvi, whose
output map size is Win × Hi, followed by a horizontal one Chi. No activation
function is used in the Chi and Cvi layers.
We denote the first (resp. second) configuration Chi ∗ Cvi (resp. Cvi ∗ Chi ).
The delta-rule complexity of the Chi ∗ Cvi configuration is proportional to
(Wi Hin Ki + Wi Hi ), since the Chi layer is connected to the Cvi layer, which
is itself connected to the Si layer. The weight update algorithm is proportional
to (Wi (Hin + Hi )Ki ). The complexity of the Cvi ∗ Chi configuration is obtained
by replacing H and W .
The hypothesis space represented by these separable convolutional filters is a
more restricted set than the one of classical ConvNets.
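The feed-forward equivalence between a rank-1 2D kernel and two successive 1D convolutions can be verified numerically; the sketch below (illustrative, not from the paper) uses random filters:

```python
import numpy as np

def conv2d_valid(img, k):
    # Plain 'valid' 2D correlation (no padding)
    kh, kw = k.shape
    H, W = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(img[y:y+kh, x:x+kw] * k)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((32, 32))
ch = rng.standard_normal((1, 5))    # horizontal (row) filter Chi
cv = rng.standard_normal((5, 1))    # vertical (column) filter Cvi

k2d = cv @ ch                        # rank-1 5x5 kernel (outer product)

full = conv2d_valid(img, k2d)                   # one 5x5 convolution
sep = conv2d_valid(conv2d_valid(img, ch), cv)   # Chi then Cvi
print(np.allclose(full, sep))  # True
```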

2.3 Fused Separable Convolution and sub-sampling Filters

Our third proposition is to combine the two previous kinds of filters to learn
fused separable convolution and sub-sampling layers, which consist of either a
horizontal convolution CShi with a horizontal step of two, whose output map
size is SWi × Hin, followed by a vertical one CSvi with a vertical step of two
and an activation function, or a vertical convolution CSvi with a vertical step
of two, whose output map size is Win × SHi, followed by a horizontal one CShi
with a horizontal step of two and an activation function.
We denote the first configuration CShi ∗CSvi and the second CSvi ∗CShi . The
CShi ∗CSvi configuration is described in Figure 2.(b), underlining its equivalence
with a traditional (Ci , Si ) couple or a CSi layer.

Table 1. Complexity comparison of the back-propagation algorithm for different filters

                                          (Wi, Hi, Ki, K_{i+1})                            (28, 28, 5, 5)  (10, 10, 5, 5)
  Ci                                      Wi·Hi·Ki²                                             19600           2500
  Si                                      Wi·Hi·K_{i+1}²/4                                       4900            625
  CSi                                     (4(Ki+1)² + (K_{i+1}+1)²)·Wi·Hi/16                     8820           1125
  Speedup factor (Ci, Si)/CSi             (16Ki² + 4K_{i+1}²)/(4(Ki+1)² + (K_{i+1}+1)²)           2.8            2.8
  Chi ∗ Cvi                               Ki·Wi·(2Hin + Hi)                                     12880           1900
  Speedup factor Ci/(Chi ∗ Cvi)           Ki·Hi/(2Hin + Hi)                                       1.5            1.3
  CShi ∗ CSvi                             (Ki+1)·(3Hin + Hi)·Wi/4                                5208            780
  Speedup factor (Ci, Si)/(CShi ∗ CSvi)   Hi·(4Ki² + K_{i+1}²)/((Ki+1)·(3Hin + Hi))               4.7            4.0

The CShi ∗ CSvi delta-rule complexity is proportional to (SWi·Hin·(Ki +
1)/2 + SWi·SHi·(K_{i+1} + 1)/2) and its weight update complexity is proportional
to (SWi·Hin·(Ki + 1) + SWi·SHi·(Ki + 1)). The complexity of the CSvi ∗ CShi
configuration is obtained by swapping H and W.
The hypothesis space represented by these fused separable convolution and
sub-sampling filters is larger than the one represented by separable convolutional
filters (Section 2.2), but smaller than the one presented in Section 2.1.

2.4 Comparison of the Back-Propagation Complexity for These Filters
Table 1 gathers the complexity of the learning phase for each filter type described
in this section. It also gives estimated speedup factors compared to traditional
Ci , Si ConvNet layers, for some parameter values.
We can see in Table 1 that the back-propagation complexity of CSi layers
is up to 2.8 times lower than that of traditional ConvNet (Ci, Si) layers. Separable
convolution Chi ∗ Cvi learning is only 1.3 to 1.5 times faster, while fused separable
convolution and sub-sampling CShi ∗ CSvi can lead to a speedup factor of
up to 4.7. The complexity and speedup factors of the Cvi ∗ Chi and CSvi ∗ CShi
configurations can be obtained by swapping H and W.
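The entries of Table 1 can be recomputed directly from these cost models; the following sketch (our own, assuming Hin = Hi + Ki − 1 for valid convolutions) reproduces the tabulated speedup factors:

```python
# Cost models from Table 1 (back-propagation, per feature map), assuming
# valid convolutions so that Hin = Hi + Ki - 1.
def costs(Wi, Hi, Ki, Kip1):
    Hin = Hi + Ki - 1
    ci = Wi * Hi * Ki ** 2                                    # Ci delta rule
    si = Wi * Hi * Kip1 ** 2 / 4                              # Si delta rule
    csi = (4 * (Ki + 1) ** 2 + (Kip1 + 1) ** 2) * Wi * Hi / 16
    sep = Ki * Wi * (2 * Hin + Hi)                            # Chi * Cvi
    fsep = (Ki + 1) * (3 * Hin + Hi) * Wi / 4                 # CShi * CSvi
    return ci, si, csi, sep, fsep

for Wi, Hi in ((28, 28), (10, 10)):
    ci, si, csi, sep, fsep = costs(Wi, Hi, 5, 5)
    print(round((ci + si) / csi, 1),     # (Ci,Si)/CSi         -> 2.8, 2.8
          round(ci / sep, 1),            # Ci/(Chi*Cvi)        -> 1.5, 1.3
          round((ci + si) / fsep, 1))    # (Ci,Si)/(CShi*CSvi) -> 4.7, 4.0
```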
In the next section, we present experiments showing that using such
modified convolutional layers leads to comparable classification and recognition
performances, and enables per-epoch speedups close to those given in Table 1.

3 Experiments
The main goal of these experiments is not to propose novel convolutional
architectures for the following tasks, but to compare learning capabilities with
modified filters. We thus use a reference ConvNet architecture similar to the
well-known LeNet-5 proposed by LeCun et al. [1] for handwritten digit recognition.
From now on, we denote networks with the corresponding filter notation,
i.e. a CSi network stands for a ConvNet with CSi layers.

3.1 Handwritten Digit Recognition


This experiment is based on the MNIST database introduced by LeCun et al. [1]
which comprises 60,000 training and 10,000 test 28 × 28 images. State-of-the-art
methods achieve a recognition rate of 99.65% [15] using a deep MLP trained on
GPUs and elastic distortions on training images.
In this paper, we use a reference ConvNet architecture inspired by LeNet-
5 [1], and do not apply any distortion to the training images. As in [1], ConvNet
inputs are padded to 32 × 32 images and normalized so that the background
level corresponds to a value of −0.1 and the foreground corresponds to 1.175.
For each network, we launch six training runs of 25 epochs and save the network
after the last epoch (no overfitting is observed, as in [1]). Then, generalization
is estimated on the test set, and we retain the best run.
The 32 × 32 input image is connected to six C1 convolution maps with 5 × 5
kernels, followed by six S1 sub-sampling maps. The C2 layer consists of fifteen
convolution maps with 5 × 5 kernels, each taking input from one of the possible
pairs of different feature maps of S1. These maps are connected to fifteen S2
sub-sampling maps. The N1 layer has 135 neurons: each of the fifteen S2 feature
maps is connected to two neurons, and each of the remaining 105 neurons takes
input from one of the possible pairs of different feature maps of S2. N2 is a fully
connected layer of 50 neurons. The ten fully connected N3 output neurons use a
softmax activation function. This network comprises 14,408 trainable weights.
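The weight count can be verified from this architecture description, assuming the usual LeNet-5-style map sizes (32 → 28 → 14 → 10 → 5) and one trainable coefficient plus one bias per sub-sampling map:

```python
from math import comb

c1 = 6 * (5 * 5 + 1)                 # six 5x5 kernels, one bias each       = 156
s1 = 6 * 2                           # one coefficient + one bias per map   = 12
c2 = comb(6, 2) * (2 * 5 * 5 + 1)    # 15 maps, each over a pair of S1 maps = 765
s2 = 15 * 2                          #                                      = 30
# Map sizes: 32 -> 28 (C1) -> 14 (S1) -> 10 (C2) -> 5 (S2): S2 maps have 25 units
n1 = 15 * 2 * (25 + 1) + comb(15, 2) * (2 * 25 + 1)   # 780 + 5355 = 6135
n2 = 50 * (135 + 1)                  # fully connected to N1 (+ bias)       = 6800
n3 = 10 * (50 + 1)                   # ten softmax outputs (+ bias)         = 510
total = c1 + s1 + c2 + s2 + n1 + n2 + n3
print(total)  # 14408
```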
We train the networks using modified convolutional filters:
– Fused convolution and sub-sampling network where Ci + Si layers have been
replaced by 6 × 6 kernel size CSi filters (Figure 1.(b)). This network has
only five layers and 14,762 trainable weights,
– Separable convolution networks have nine layers, replacing each Ci layer by
two layers, Chi ∗ Cvi or Cvi ∗ Chi (Figure 2.(a)). They have 13,814 trainable
weights,
– Fused separable convolution and sub-sampling networks comprise seven
layers, where each (Ci, Si) couple is replaced by CShi ∗ CSvi or CSvi ∗ CShi
layers (Figure 2.(b)). They have 13,829 trainable weights.
Figure 3 shows the feature maps obtained on an '8' handwritten digit with the
learnt networks CSi, CShi ∗ CSvi and Chi ∗ Cvi. Table 2 presents the results
obtained on the MNIST training and test databases with the different kinds of
convolutional filters. The first line gives the reference performance of the LeNet-5
architecture with the same training database (no distortion). Our reference
ConvNet architecture (Ci, Si) reaches a 1.28% error rate on the MNIST test
database. This small gap with respect to the LeNet-5 results is mainly due to
architecture differences (layer connections and output units).

Fig. 3. Feature maps obtained with simplified convolutional filters (upper left: CSi ;
bottom left: CShi ∗ CSvi ; right: Chi ∗ Cvi )

Table 2. MNIST error rate (ER) for each kind of network

Training ER (%) Test ER (%) Speedup Factor


LeNet-5 (no distortion) [1] 0.35 0.95
(Ci , Si ) 0.46 1.28 1.0
CSi 0.07 1.32 2.6
Chi ∗ Cvi 0.68 1.52 1.6
Cvi ∗ Chi 0.44 1.45 1.6
CShi ∗ CSvi 0.36 1.49 3.3
CSvi ∗ CShi 0.14 1.61 2.9

The CSi network obtains the same generalization performance as the traditional
ConvNet and requires 2.6 times less processing time per epoch, which is
comparable to the estimation given in Table 1. The other configurations induce
a loss of performance smaller than 0.4%, and enable speedup factors of 1.6 for
separable filters, and up to 3.3 for fused separable ones. The latter is slightly
lower than the estimation given in Table 1, due to the Ni back-propagation time,
which becomes predominant.

4 Summary and Future Work


In this paper, we have introduced several modifications of the hypothesis space
of Convolutional Neural Networks (ConvNet ), replacing convolution and sub-
sampling layers by equivalent fused convolution/sub-sampling filters, separable
convolution filters or fused separable convolution/sub-sampling filters. We have
proposed a complexity estimation of the back-propagation algorithm for these
different kinds of filters, which allows the expected learning speedup factors to
be evaluated. We have
presented experiments on the handwritten digit database MNIST using reference
ConvNets which performed comparably to similar systems in the literature. We
have trained the modified ConvNets using the simplified filters, and proven that
classification and recognition performances are almost the same, with a training
time divided by up to 3.3. To enhance convergence and generalization, the
proposed convolutional filters could be interspersed with other non-linear units,
such as rectification or local normalization [11], or could form part of wider
networks, enabling faster architecture and parameter-space exploration. Furthermore,
we plan to combine these filter optimizations with parallel implementations on
GPUs, which are known to be efficient for 1D and 2D convolution processing, and
we believe this would allow the processing of larger deep-learning networks.

References
1. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. of the IEEE (November 1998)
2. Chellapilla, K., Puri, S., Simard, P.: High Performance Convolutional Neural Net-
works for Document Processing. In: Proc. of the Int. Workshop on Frontiers in
Handwriting Recognition, IWFHR 2006 (2006)
3. Garcia, C., Delakis, M.: Convolutional Face Finder: a neural architecture for fast
and robust face detection. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence (November 2004)
4. Osadchy, M., LeCun, Y., Miller, M.L., Perona, P.: Synergistic face detection and
pose estimation with energy-based model. In: Proc. of Advances in Neural Infor-
mation Processing Systems, NIPS 2005 (2005)
5. Garcia, C., Duffner, S.: Facial image processing with convolutional neural networks.
In: Proc. Int. Workshop on Advances in Pattern Recognition (2007)
6. Delakis, M., Garcia, C.: Text detection with Convolutional Neural Networks. In:
Proc. of the Int. Conf. on Computer Vision Theory and Applications (2008)
7. Saidane, Z., Garcia, C.: Automatic scene text recognition using a convolutional
neural network. In: Proc. of Int. Workshop on Camera-Based Document Analysis
and Recognition (2007)
8. Hadsell, R., Sermanet, P., Scoffier, M., Erkan, A., Kavukcuoglu, K., Muller, U.,
LeCun, Y.: Learning long-range vision for autonomous off-road driving. Journal of
Field Robotics (February 2009)
9. Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear transfor-
mations in perceptrons. In: Conf. on AI and Statistics (2012)
10. Reed, R.: Pruning algorithms - a survey. IEEE Trans. on Neural Networks (1993)
11. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-
stage architecture for object recognition? In: Proc. Int. Conf. on Computer Vision
(2009)
12. Mrazova, I., Kukacka, M.: Hybrid convolutional neural networks. In: Proc. of IEEE
Int. Conf. on Industrial Informatics, INDIN 2008 (2008)
13. Holt, J., Baker, T.: Back propagation simulations using limited precision calcula-
tions. In: Proc. of Int. Joint Conf. on Neural Networks, IJCNN 1991 (1991)
14. Petrowski, A.: Choosing among several parallel implementations of the backprop-
agation algorithm. In: Proc. of IEEE Int. Conf. on Neural Networks (1994)
15. Ciresan, D., Meier, U., Gambardella, L.M., Schmidhuber, J.: Handwritten digit
recognition with a committee of deep neural nets on GPUs. In: Computing Research
Repository (2011)
16. Mamalet, F., Roux, S., Garcia, C.: Real-time video convolutional face finder on
embedded platforms. EURASIP Journal on Embedded Systems (2007)
17. Mamalet, F., Roux, S., Garcia, C.: Embedded facial image processing with convo-
lutional neural networks. In: Proc. of Int. Symp. on Circuits and Systems (2010)
A Modified Artificial Fish Swarm Algorithm
for the Optimization of Extreme Learning
Machines

João Fausto Lorenzato de Oliveira and Teresa B. Ludermir

Center of Informatics, Federal University of Pernambuco


Av. Jornalista Anibal Fernandes, s/n, Recife, PE, 50.740-560, Brazil

Abstract. Neural networks have been largely applied to many real-world
pattern classification problems. During the training phase, any
neural network can suffer a loss of generalization caused by overfitting,
whereby the learning process becomes highly biased. In this work we use
the Extreme Learning Machine (ELM), an algorithm for training single-hidden-layer
neural networks, and propose a novel swarm-based method for
optimizing its weights and improving generalization performance. The
algorithm combines the basic Artificial Fish Swarm Algorithm (AFSA)
with features from Differential Evolution (crossover and mutation)
to improve the quality of the solutions during the search process. The
results of the simulations demonstrate good generalization capacity for
the best individuals obtained in the training phase.

Keywords: Neural Networks, Optimization.

1 Introduction

Artificial Neural Networks (ANNs) have been largely applied in many applications
in recent years. ANNs need to be trained by an algorithm to gather
enough information about the solution space of a given problem. Traditional
gradient descent algorithms such as backpropagation and Levenberg-Marquardt
present low training speed and may easily get stuck in local minima [1]. An
alternative algorithm is the Extreme Learning Machine (ELM) [2], which is a
fast learning algorithm for the training of neural networks with only one hidden
layer.
The knowledge in a given ANN is represented by its weights, and thus its
performance will vary depending on how well those weights cover the solution
space. Finding weights that improve the performance of an ANN is not an easy
task, due to the many possible weight configurations. In the literature, many
search algorithms have been applied to ANNs in order to find a set of weights
that improves the performance of the system. Algorithms such as Particle Swarm
Optimization (PSO) [3], Genetic Algorithms (GA) [3] and the Artificial Fish Swarm
Algorithm (AFSA) [4] may be applied to find the best solution (set of weights)
in the solution space.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 66–73, 2012.

© Springer-Verlag Berlin Heidelberg 2012

The ELM optimization has been studied in [5,6,7,8] where algorithms such as
PSO, Differential Evolution (DE) and GA are applied.
In this work we propose a hybrid model based on the optimization of ELM
weights by a modified AFSA algorithm called Modified Artificial Fish Swarm
Algorithm for the Optimization of Extreme Learning Machines (MAFSA-ELM).

2 Extreme Learning Machine


The ELM [2] is a simple learning algorithm for single-hidden-layer feedforward
neural networks (SLFNs), whose learning speed can be faster than traditional
gradient descent methods such as back-propagation (BP) and its variation
Levenberg-Marquardt (LM) [1]. The RBF network [1] maps the input space into
a higher-dimensional space based on a radial basis function, so that the process
of classification becomes easier. The ELM algorithm can also be used with a radial
basis activation function, although in this work we use the sigmoid function.
In this algorithm, the input weights and hidden-layer biases are randomly set,
and the output weights are calculated through matrix operations in an off-line
fashion. The absence of a tuning phase enables a faster training phase.
Given n distinct training samples (xi, ti), where xi = [xi1, xi2, ..., xin]ᵀ ∈ ℝⁿ
is a vector containing the inputs and ti = [ti1, ti2, ..., tim]ᵀ ∈ ℝᵐ is a vector
containing the desired outputs (i ∈ [1, ..., n]), the SLFN with s hidden nodes needs
to learn all the training samples. Since the input weights wj = [wj1, wj2, ..., wjn]ᵀ
and hidden biases B = [b1, b2, ..., bs] (j = [1, ..., s]) are randomly generated, the
task performed in the training phase is to find the appropriate output weights for
the given input weights and biases through the solution of the linear system
Hβ = T.
The matrix H = {hij} is the hidden-layer output matrix, where hij = g(wj · xi +
bj) corresponds to the output of the jth hidden neuron for xi; T = [t1, t2, ..., tn]ᵀ
represents the target (desired output) matrix, and βj = [βj1, βj2, ..., βjm]ᵀ (j ∈
[1, ..., s]) is the output weight matrix. The output weights are computed as the
minimum least-squares (LS) solution of this linear system, through a pseudo-inverse
matrix operation [9]: β̂ = H⁺T, where H⁺ is the Moore-Penrose pseudo-inverse
of H.
The main drawback of this technique is that the initial input weights are
randomly set and kept unchanged for the rest of the execution of the algorithm.
This procedure adds speed to the process, however it may not have the best
coverage of the solution space in the training data.
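A minimal sketch of this training procedure (our own illustration, on a toy XOR problem rather than the datasets used later; the function names are ours) is:

```python
import numpy as np

def elm_train(X, T, s, seed=0):
    # Input weights W and biases b are drawn at random and never tuned;
    # only the output weights beta are fitted by least squares.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], s))
    b = rng.standard_normal(s)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # minimum-norm LS solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy usage: fit XOR with one-hot targets and s = 10 hidden units
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
W, b, beta = elm_train(X, T, s=10)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
print(pred.tolist())  # [0, 1, 1, 0]
```

With more hidden units than training samples, the pseudo-inverse solution interpolates the targets exactly, which is why no iterative tuning phase is needed.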

3 Proposed Method
The objective of the proposed algorithm is to find the best initial input weights
for the ELM algorithm through a hybrid search algorithm. The AFSA algorithm
does not produce good results in comparison with other existing algorithms [10];
thus, in this work we propose a modification to the original algorithm.

One of the limitations of the AFSA algorithm comes from the visual parameter,
which allows a given fish to gather information about other fish inside its
visual field. With this restriction, some fish could sink into a local minimum, while
only a few could reach a better region. The absence of information from fish
near a global minimum may reduce the search capacity of the algorithm.
The MAFSA-ELM has the basic AFSA behaviors as in [4]; however, the
Crossover-Mutation phase from the DE algorithm is used as an additional
behavior, without the influence of the visual and step parameters.
The configurable parameters of the algorithm are: the number of fish in the
population N; the position of each fish Si, which in this work is defined as a
weight matrix W and a bias vector B; an objective function Y = f(x), which
is the accuracy rate of the ELM on the validation set; the capacity to locate
neighbor fish, visual; the number of neighbor fish nf; a distance measure Dij
between fish; a swim measure of each fish, step; the crowd factor of a given region,
δ; and the number of tries (Trynumber) of execution of the prey behavior.
The visual parameter will influence the behaviors of each fish determining
the environment conditions. From the definition of this parameter a fish can
determine whether the region is crowded. According to the current environment
conditions, the fish evaluates the results from the behaviors and chooses one to
execute and update the fish position.
Optimization algorithms such as PSO have their search process based on
previous movements and experiences of the population in the problem environment.
However, the AFSA and MAFSA techniques are not only based on
previous movements, but also on the current state of the population. Thus more
parameters are needed to measure the current state of the population in order
to ascertain how the movements will be performed.
The proposed method is based on the hybridization of the AFSA, DE
and ELM algorithms, called MAFSA-ELM, where each fish in the MAFSA
algorithm represents an ELM network, and the search is conducted based on
the behaviors of each fish. We also study the performance of the hybridization
between the AFSA and ELM, which we call AFSA-ELM. This algorithm does not
have the Crossover-Mutation behavior in its execution, and it was implemented
to verify its performance against the proposed method.
In order to determine whether a certain fish is within the visual field of another
fish, the distance between them is calculated using a Euclidean distance. The
output weights of the ELM algorithm are unknown a priori, hence the distance
is calculated as shown in Equation 1.
d_pq = sqrt( Σ_{i=1}^{Q} Σ_{j=1}^{n} (Wij(p) − Wij(q))² + Σ_{i=1}^{Q} (bi(p) − bi(q))² )    (1)

The original AFSA algorithm has some basic behaviors executed by each fish,
whose fitness values are calculated through the objective function Y = f(x). The
behavior that produces the highest accuracy rate is the one used for updating
the fish position. The basic behaviors are: Follow, Swarm, Prey and Leap.

3.1 Follow
Let S = [S1, S2, ..., SN] be a vector with the position of each fish in the swarm,
and let Smax be the position found by the fish with the best food concentration.
The update is done as shown in Equation 2 when the region is not crowded
(nf/N < δ) and the best-positioned fish is in a better food concentration region
than the current fish (Ymax > Yi):

Follow(Si) = Si + rand() · step · (Smax − Si) / ‖Smax − Si‖    (2)

If the above conditions are not satisfied, the update is done as shown in
Equation 3:

Follow(Si) = Prey(Si)    (3)

3.2 Prey
In this behavior, the fish chooses a position inside its visual field; if after a
number of tries (Trynumber) it does not find a better food concentration
region, it moves randomly. If the selected fish Sj has a better food concentration,
the update is performed as shown in Equation 4:

Prey(Si) = Si + rand() · step · (Sj − Si) / ‖Sj − Si‖    (4)

If the fish does not find a neighbor with a better food concentration after
Trynumber tries, it moves randomly through the space as shown in Equation 5:

Prey(Si) = Leap(Si)    (5)

3.3 Swarm
This behavior is based on the number of neighbor fish, where a central position
Sc is calculated using the positions of each fish. The update is done as shown
in Equation 6 when the region is not crowded (nf/N < δ) and the central position
is in a better food concentration region than the current fish (Yc > Yi):

Swarm(Si) = Si + rand() · step · (Sc − Si) / ‖Sc − Si‖    (6)

Otherwise, the update is done as follows:

Swarm(Si) = Prey(Si)    (7)

3.4 Leap
The Leap behavior consists of random movements, independent from the rest of
the swarm; it is a stochastic behavior of the fish.

Leap(Si) = Si + rand() · step    (8)
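The Follow, Prey and Swarm updates above share the same form, a random-length move of at most `step` towards a target position. The sketch below (our own reading of Eqs. 2-8, on flattened positions; interpreting rand() in Leap as an independent draw per component is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def move_towards(s_i, target, step):
    # Shared form of the Follow, Prey and Swarm updates (Eqs. 2, 4 and 6):
    # S_i + rand() * step * (target - S_i) / ||target - S_i||
    d = target - s_i
    return s_i + rng.random() * step * d / np.linalg.norm(d)

def leap(s_i, step):
    # Eq. 8, reading rand() as an independent draw per component (assumption)
    return s_i + rng.random(s_i.shape) * step

s = np.zeros(3)
best = np.array([3.0, 4.0, 0.0])            # ||best - s|| = 5
s_new = move_towards(s, best, step=0.6)
s_rand = leap(s, step=0.6)
print(np.linalg.norm(s_new - s) <= 0.6)     # True: the move length is at most `step`
```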

3.5 Crossover-Mutation
This behavior is part of the MAFSA-ELM search strategy. In order to avoid
sinking into local minima and to improve the performance of the algorithm, we
randomly select three fish in the swarm and combine them using the basic crossover
and mutation strategies of the DE algorithm [11]. This behavior is not restricted
by the visual parameter, to ensure that any fish in the swarm can be selected and
to gather the information necessary for escaping from possible local minima. The
step parameter does not influence this behavior either, thus the global search
capacity of the algorithm is increased.
The mutation phase is shown as follows:

Gi,t+1 = Gr1 ,t + F (Gr2 ,t − Gr3 ,t ), (9)


where r1, r2 and r3 are randomly chosen indexes of three distinct fish from the
swarm, and F is an amplitude factor for the term (Gr2,t − Gr3,t). The amplitude
factor F takes values in the interval [0, 2].
In the crossover phase, a vector V is created with the same dimension as each
individual. The vector is initialized as follows:

Vij = Gij,t+1  if rand() < CR or j = randIndex(i)
Vij = Gij,t    otherwise

where CR is a crossover constant set by the user, and randIndex(i) is a
randomly chosen index that ensures that at least one element from Gi,t+1 is
included in Vi. The crossover process either accepts or rejects the mutation, for
each feature of the new vector, based on the CR constant.
In Equations 2, 4, 6 and 8, the rand() factor is sampled once per vector.
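A sketch of this behavior on flattened fish positions (our own illustration; the function name, array layout and index handling are assumptions, following standard DE/rand/1 mutation with binomial crossover):

```python
import numpy as np

rng = np.random.default_rng(3)

def crossover_mutation(swarm, i, F=1.0, CR=0.5):
    # DE/rand/1 mutation (Eq. 9) plus binomial crossover on flattened positions
    N, D = swarm.shape
    candidates = [k for k in range(N) if k != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    mutant = swarm[r1] + F * (swarm[r2] - swarm[r3])   # Eq. 9
    j_rand = rng.integers(D)                           # randIndex(i)
    mask = rng.random(D) < CR
    mask[j_rand] = True        # guarantee at least one mutant component
    return np.where(mask, mutant, swarm[i])

swarm = rng.standard_normal((30, 8))   # 30 fish, 8 flattened weights each
trial = crossover_mutation(swarm, i=0)
print(trial.shape)  # (8,)
```

Because this step ignores visual and step, the trial position can lie anywhere reachable from the whole swarm, which is what restores the global search capacity discussed above.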
The proposed method is presented in Algorithm 1.

Algorithm 1. Modified Artificial Fish Swarm Algorithm

count ← 0
Initialize population X, ∀xi ∈ X
while count < maxIterations do
  for i = 1 to N do
    Determine the output weights of fish i and assess the quality of the solution
    Execute the behaviors (follow, swarm, leap, crossover-mutation) and update the solution
    with the best result
  end for
  count ← count + 1
end while
Select the best solution in the swarm

4 Experiments
The experiments were performed on four datasets from the UCI Machine Learning
Repository [12]. The data from each dataset were split into a training set (50%),
a validation set (25%) and a test set (25%), randomly generated for 30 iterations.
For each iteration, all the classifiers receive the same training, validation and
test sets. All the attributes from the datasets were normalized into the interval
[0, 1]. The simulations were performed with 10, 15 and 20 hidden neurons, and
the configuration that produced the best result was selected. The initialization
of the AFSA and MAFSA parameters in this work was based on several
simulations with distinct parameters, and the best configuration was selected. The
parameters are as follows: number of fish N = 30, step = 0.6,
crowd factor δ = 0.8, amplitude factor for the mutation F = 1 and crossover
rate CR = 0.5. For the experiments using the PSO-ELM method, we used the
same configuration presented in [5] with some modifications to match the
parameters of the other techniques, such as population size and maximum number
of iterations (C1 = 2, C2 = 2, w = 0.9, 30 particles, 50 iterations). The parameters
of the E-ELM algorithm are the same as those presented in [6]; however, the
total number of individuals was increased to 30. Table 1 gives the following
results on the test set: the mean accuracy rate, standard deviation (SD) and the
number of hidden neurons Q.

Fig. 1. Validation error through the iterations: (a) Glass; (b) Ionosphere; (c) Sonar; (d) Vehicle



Table 1. Results for the Glass, Ionosphere, Sonar and Vehicle datasets
(a) Glass (b) Ionosphere
Technique Mean± SD Q Technique Mean± SD Q
ELM 63.45 ± 5.39 20 ELM 84.24 ± 3.41 20
RBF [1] 62.42 ± 4.21 20 RBF [1] 84.65 ± 2.39 20
LM [1] 52.55 ± 16.70 15 LM [1] 88.16 ± 11.36 20
AFSA-ELM 64.21 ± 5.34 20 AFSA-ELM 87.15 ± 4.08 20
MAFSA-ELM 64.96 ± 4.87 20 MAFSA-ELM 88.10 ± 4.06 20
E-ELM 63.52 ± 5.97 20 E-ELM 88.06 ± 4.22 20
PSO-ELM 64.65 ± 5.77 15 PSO-ELM 85.11 ± 3.70 20

(c) Sonar (d) Vehicle


Technique Mean± SD Q Technique Mean± SD Q
ELM 71.05 ± 4.42 20 ELM 73.36 ± 2.07 20
RBF [1] 70.35 ± 4.96 15 RBF [1] 75.70 ± 2.24 20
LM [1] 75.76 ± 10.41 10 LM [1] 71.44 ± 15.94 15
AFSA-ELM 72.69 ± 7.09 20 AFSA-ELM 74.37 ± 2.90 20
MAFSA-ELM 74.35 ± 6.43 15 MAFSA-ELM 75.15 ± 3.60 20
E-ELM 74.87 ± 6.26 20 E-ELM 75.98 ± 3.40 20
PSO-ELM 74.48 ± 4.54 20 PSO-ELM 75.03 ± 2.56 20

For all datasets we used the Wilcoxon signed-rank hypothesis test at the 5%
significance level for the statistical comparison of the results. The results for the
Glass dataset (figure 1a and table 1a) show that the proposed technique achieved
lower validation errors. Through the hypothesis test we concluded that PSO-ELM
and MAFSA-ELM achieved similar results, and they were superior to the
remaining methods.
In the Ionosphere dataset (figure 1b and table 1b) the MAFSA-ELM and
E-ELM methods performed similarly in the accuracy on the test set and on
validation error. In this dataset the MAFSA-ELM technique also outperformed
the traditional AFSA-ELM algorithm.
In the Sonar dataset (figure 1c and table 1c), the MAFSA-ELM, E-ELM and
PSO-ELM methods achieved similar results; however, the MAFSA-ELM
algorithm obtained the lowest validation error. In this dataset the MAFSA-ELM also
achieved better results than the AFSA-ELM algorithm. On the Vehicle dataset
(figure 1d and table 1d), the PSO-ELM, MAFSA-ELM and E-ELM achieved
similar classification accuracies, and the PSO-ELM and MAFSA-ELM had
similar validation errors.

5 Conclusions and Future Work


This work proposed a modification of the original Artificial Fish Swarm Algorithm and performed experiments against recent work on the optimization of ELMs. The AFSA-ELM algorithm, also implemented in this work, achieved promising results; however, it did not achieve low validation errors on some datasets. The introduction of the Crossover-Mutation behavior, free of the influence of the visual and step parameters, allowed each fish to have knowledge of the entire swarm, improving the performance of the proposed technique on all datasets. The visual and step parameters of the MAFSA-ELM algorithm remain unchanged during its execution. In future work, adaptive procedures such as fuzzy strategies can be applied to regulate exploration and exploitation in the MAFSA algorithm, in order to achieve lower validation errors and improve classification accuracy.

References
1. Haykin, S.: Neural networks: a comprehensive foundation. Prentice Hall PTR, Up-
per Saddle River (1994)
2. Huang, G.B., Wang, D.H., Lan, Y.: Extreme learning machines: a survey. Interna-
tional Journal of Machine Learning and Cybernetics, 1–16 (2011)
3. Engelbrecht, A.P.: Fundamentals of computational swarm intelligence, vol. 1. Wi-
ley, NY (2005)
4. Wang, C.R., Zhou, C.L., Ma, J.W.: An improved artificial fish-swarm algorithm
and its application in feed-forward neural networks. In: Proceedings of 2005 Inter-
national Conference on Machine Learning and Cybernetics, vol. 5, pp. 2890–2894.
IEEE (2005)
5. Xu, Y., Shu, Y.: Evolutionary extreme learning machine based on particle swarm
optimization. In: Advances in Neural Networks, ISNN 2006, pp. 644–652 (2006)
6. Zhu, Q.Y., Qin, A.K., Suganthan, P.N., Huang, G.B.: Evolutionary extreme learn-
ing machine. Pattern Recognition 38(10), 1759–1763 (2005)
7. Saraswathi, S., Sundaram, S., Sundararajan, N., Zimmermann, M., Nilsen-
Hamilton, M.: ICGA-PSO-ELM approach for accurate multiclass cancer classification.
IEEE/ACM Transactions on Computational Biology and Bioinformatics 8(2), 452–
463 (2011)
8. Qu, Y., Shang, C., Wu, W., Shen, Q.: Evolutionary fuzzy extreme learning machine
for mammographic risk analysis. International Journal of Fuzzy Systems 13(4)
(2011)
9. Rao, C.R., Mitra, S.K.: Generalized inverse of matrices and its applications. Wiley,
NY (1971)
10. Yazdani, D., Nadjaran Toosi, A., Meybodi, M.: Fuzzy Adaptive Artificial Fish
Swarm Algorithm. In: Li, J. (ed.) AI 2010. LNCS, vol. 6464, pp. 334–343. Springer,
Heidelberg (2010)
11. Storn, R., Price, K.: Differential evolution. Journal of Global Optimization 11(4),
341–359 (1997)
12. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
Robust Training of Feedforward Neural Networks
Using Combined Online/Batch Quasi-Newton Techniques

Hiroshi Ninomiya

Department of Information Science, Shonan Institute of Technology


1-1-25, Tsujido-Nishikaigan, Fujisawa, Kanagawa, 251-8511, Japan
ninomiya@info.shonan-it.ac.jp

Abstract. This paper describes a robust training algorithm based on a quasi-Newton process in which online and batch error functions are combined by a weighting coefficient parameter. The parameter is adjusted to ensure that the algorithm gradually changes from online to batch. Furthermore, an analogy between this algorithm and the Langevin algorithm is considered; the Langevin algorithm is a gradient-based continuous optimization method incorporating the Simulated Annealing concept. Neural network training experiments demonstrate the validity of the combined algorithm, which achieves more robust training and more accurate generalization than other quasi-Newton-based training algorithms.

Keywords: feedforward neural network, quasi-Newton method, online training


algorithm, batch training algorithm, Langevin algorithm.

1 Introduction

Neural network techniques have been recognized as a useful tool for the function
approximation problems with high-nonlinearity [1]. For example, the techniques are
useful for microwave modeling and design in which neural networks can be trained
from Electro-Magnetic (EM) data over a range of geometrical parameters and trained
neural networks become models providing fast solutions of the EM behavior [2][3].
Training is the most important step in developing a neural network model.
Gradient based algorithms such as Back propagation and quasi-Newton are popularly
used for this purpose [1]. For a given set of training data, the gradient algorithm
operates in one of two modes: online (stochastic) or batch. In the online mode, the
synaptic weights of all neurons in the network are adjusted in a sequential manner,
pattern by pattern. In the batch mode, by contrast, the adjustments to all synaptic
weights are made on a set of training data, with the result that a more accurate
estimate of the gradient vector is utilized. Despite its disadvantages, the online form is the one most frequently used for training multilayer perceptrons, particularly for large-scale problems; it also has better global searching ability than batch-mode training, being less prone to getting trapped in local minima [1]. The quasi-Newton method, one of the most efficient optimization techniques [4], is widely utilized as a robust training algorithm for highly nonlinear function approximation using

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 74–83, 2012.
© Springer-Verlag Berlin Heidelberg 2012

feedforward neural networks [1]-[3]. Most of these are batch-mode methods. On the other hand, an online quasi-Newton training algorithm referred to as oBFGS (online BFGS, based on the Broyden-Fletcher-Goldfarb-Shanno formula [4]) was introduced in [5] as an algorithm for machine learning with huge training data sets. This algorithm works with gradients obtained from small subsamples (mini-batches) of the training data and can greatly reduce computational requirements on huge, redundant data sets. However, when applied to highly nonlinear function modeling and neural network training, oBFGS still converges too slowly and the optimization error cannot be effectively reduced within finite time, in spite of this advantage [6].
Recently, Improved Online BFGS (ioBFGS) was developed for neural network training [6]. The gradient of ioBFGS is calculated on a variable number of training samples: the samples used for a weight update are automatically increased from a mini-batch to all samples as the quasi-Newton iteration progresses. That is, ioBFGS gradually changes from online to batch during the iteration. This algorithm overcomes the problem of local minima in the prevailing quasi-Newton batch-mode training, and the slow convergence of existing stochastic-mode training.
This paper describes a robust quasi-Newton-based training algorithm in which the online and batch error functions are combined by a weighting coefficient. The coefficient is adjusted to ensure that the algorithm gradually changes from online to batch; in other words, the transition from online to batch is parameterized within the quasi-Newton iteration. The parameterized method not only has an effect similar to ioBFGS, but also facilitates the analysis of the algorithm through an analogy with the Langevin algorithm, a gradient-based continuous optimization method incorporating the Simulated Annealing (SA) concept [7]. The proposed technique, called poBFGS (Parameterized Online BFGS), substantially improves the quality of solutions in global optimization compared with other quasi-Newton-based algorithms. The algorithm is tested on several function approximation problems with high nonlinearity.

2 Formulation of Training and Improved Online BFGS

2.1 Formulation of Training


The input-output characteristic of a feedforward neural network, which consists of an input layer, an arbitrary number of hidden layers and an output layer, is defined as

    o_p = f(x_p, w),                                        (1)

where o_p, x_p and w are the p-th output, the p-th input and the weight vectors, respectively. Neurons of the hidden layers have the sigmoid activation function. Let d_p be the p-th desired vector; the average error function is then defined as

    E(w) = (1/N_T) Σ_{p∈T} E_p(w),  and  E_p(w) = (1/2) ‖d_p − o_p‖²,   (2)

where T denotes a training data set and N_T is the number of sample pairs within T.
Training is the most important step in developing a neural network model. Gradient-based algorithms such as Back propagation and the quasi-Newton method are popularly used for this purpose [1]. In these gradient-based algorithms, the objective function of (2) is minimized by the following iterative formula,

    w_{k+1} = w_k − η_k g_k,                                (3)

where k is the iteration count and g_k is the gradient vector. The gradient vectors of the online and batch training algorithms are defined as g_k = ∂E_p(w)/∂w and g_k = ∂E(w)/∂w, respectively. The learning rate η_k is either a positive number, for Back propagation, or a positive definite matrix H_k, for the (quasi-)Newton method. The quasi-Newton method is considered in this paper because this method, in which the positive definite matrix H_k is updated by using the BFGS formula, is one of the most efficient optimization algorithms [4] and a commonly-used training method for highly nonlinear function problems [1]-[3].
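For reference, the BFGS formula invoked above is the standard inverse-Hessian update of [4]; a minimal NumPy sketch (variable names are ours, not the paper's) is:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation H.

    s = w_{k+1} - w_k (step), y = g_{k+1} - g_k (gradient change).
    Returns H_{k+1}, which satisfies the secant condition H_{k+1} @ y == s.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Toy check: the updated matrix satisfies the secant condition.
H = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([2.0, 1.0])
H_new = bfgs_inverse_update(H, s, y)
print(np.allclose(H_new @ y, s))  # True
```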

2.2 Improved Online BFGS (ioBFGS)

Most quasi-Newton methods operate in batch mode. The batch BFGS (BFGS) depends on the initial values of w, giving good results only if the initial guess is suitable. On the other hand, oBFGS, in which the training data set is divided into Seg subsamples (mini-batches), was introduced in [5]; Seg denotes the number of mini-batches. In this paper a mini-batch is called a "segment" and includes N_T/Seg training samples. The gradient g_k of oBFGS is calculated from the training samples in a segment, and the positive definite matrix H_k is updated using the BFGS formula. oBFGS improved efficiency over BFGS for convex optimization with large data sets, as reported in [5]. However, oBFGS still converges too slowly, and the optimization error cannot be effectively reduced within finite time, when applied to highly nonlinear function modeling and neural network training. A notable recent advance in data-driven optimization for highly nonlinear function modeling is the improved online quasi-Newton training method called Improved Online BFGS (ioBFGS) [6]. ioBFGS combines the following aspects of the two existing BFGS variants. First, in the early stage of training, the weight vector is updated using a mini-batch. Next, the mini-batch size is gradually increased by overlapping multiple segments. Finally, a mini-batch includes all training samples, i.e., the algorithm becomes batch BFGS. The details of the strategy for increasing the mini-batch size are given in [6].
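As an illustration of the increasing strategy, one simple schedule of this flavor lets the number of samples per update grow linearly from one segment to the full training set; the linear ramp is our own assumption for illustration, not the exact rule of [6].

```python
def minibatch_size(k, k_max, n_total, seg):
    """Number of training samples used at iteration k.

    Starts at one segment (n_total // seg samples) and grows to the
    full training set by iteration k_max.
    """
    start = n_total // seg
    frac = min(k / k_max, 1.0)
    size = start + int(frac * (n_total - start))
    return min(size, n_total)

# Example: 1,680 samples, Seg = 10, over 100 iterations.
sizes = [minibatch_size(k, 100, 1680, 10) for k in range(0, 101, 10)]
print(sizes[0], sizes[-1])  # 168 1680
```

Early updates behave like oBFGS on a single segment; late updates use the whole set and so behave like batch BFGS.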

3 Parameterized Online BFGS (poBFGS)

ioBFGS makes use not only of the strong global searching ability of online BFGS, namely the capability to avoid local minima, but also of the strong local searching ability of batch BFGS, by systematically combining online with batch. In this paper, a robust training algorithm built on the same concept as ioBFGS, that

is, a gradual change from online to batch, is described for highly nonlinear function modeling. In this algorithm, the online and batch error functions are combined by a weighting coefficient, which is adjusted to ensure that the algorithm gradually changes from online to batch. In other words, the transition from online to batch is parameterized within the quasi-Newton iteration. This not only has an effect similar to ioBFGS, but also facilitates the analysis of the algorithm through an analogy with the Langevin algorithm (LA), a gradient-based continuous optimization method incorporating the SA concept [8]. The algorithm is referred to as Parameterized Online BFGS (poBFGS).

3.1 Parameterized Online BFGS (poBFGS)


The difference between online and batch is the number of training samples used in one epoch of a weight update. That is, in the online method the gradient of E_p(w) is calculated using the p-th training sample, which changes at each epoch; in batch mode, all training samples are used to calculate the gradient of E(w) at every epoch. In poBFGS these errors are associated by a weighting coefficient parameter. Let λ be the weighting coefficient parameter; the error function at the k-th epoch is defined as

    E_k(w) = λ E_p(w) + (1 − λ) E(w).                       (4)

Its gradient vector g_k = ∂E_k(w)/∂w is utilized as the gradient vector of (3) in poBFGS. Note that when λ = 1, g_k = ∂E_p(w)/∂w, that is, oBFGS (online), and when λ = 0, g_k = ∂E(w)/∂w, namely, BFGS (batch). Therefore, poBFGS starts at λ = 1, λ is then gradually decreased as the iteration progresses, and finally λ approaches 0. In this process, poBFGS progressively changes from oBFGS to BFGS, following the same concept as ioBFGS. Here the following hypothesis is considered: some online gradients may be hill-climbing directions for E(w). The idea of this algorithm is that the parameter λ can be updated following the concept of SA [8]; that is, λ can be considered as an acceptance probability of the Metropolis function, which is controlled by the temperature parameter T. The acceptance function of the j-th Metropolis loop [8] is defined as

    λ = exp((E_min − E_max) · (1/T_j)),                     (5)

where E_min and E_max are the minimum and maximum errors observed during the j-th Metropolis loop, respectively. The temperature parameter T_j is constant within the j-th Metropolis loop. The standard SA starts at a "high" temperature T_1 given by the user, and the temperature then gradually decreases toward 0 using a "cooling schedule" with a cooling parameter γ. The cooling schedule is defined as

    T_{j+1} = γ T_j,   0 < γ < 1.                           (6)

Furthermore, the number of iterations M_j in the j-th Metropolis loop is given by

    M_j = k_max / 10.                                       (7)
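Our reading of (4)-(6) can be sketched as follows; the exact forms were partly lost in extraction, so treat the acceptance and cooling expressions below as assumptions rather than the paper's verbatim formulas.

```python
import math

def combined_error(lam, e_online, e_batch):
    # Eq. (4): weighted mix of online and batch errors.
    return lam * e_online + (1.0 - lam) * e_batch

def next_lambda(e_min, e_max, temperature):
    # Eq. (5), as we read it: Metropolis-style acceptance value in (0, 1].
    return math.exp((e_min - e_max) / temperature)

def cool(temperature, gamma=0.7):
    # Eq. (6): geometric cooling schedule, 0 < gamma < 1.
    return gamma * temperature

# At lam = 1 the combined error is the online one; at lam = 0, the batch one.
T = 10.0
history = []
for _ in range(5):
    lam = next_lambda(e_min=0.1, e_max=0.5, temperature=T)
    T = cool(T)
    history.append(lam)
print(history[0] > history[-1])  # True: lambda shrinks as T cools
```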

The algorithm of poBFGS is illustrated in Algorithm 1, in which the inverse matrix H of the Hessian is iteratively approximated by the BFGS formula [4]. poBFGS substantially improves the quality of solutions in global optimization compared with the other quasi-Newton-based algorithms.

3.2 Langevin Algorithm and poBFGS

Here the relationship between the Langevin algorithm (LA), a gradient-based continuous optimization method incorporating the SA concept [7], and poBFGS is considered. LA is based on the Langevin equation, a stochastic differential equation describing Brownian motion [7]. LA for neural network training [9] uses the following discretized version of the Langevin equation,

    w_{k+1} = w_k − η_L ∂E(w_k)/∂w + √(2T_k) n_k,           (8)

where n_k is a white Gaussian noise sequence, T_k is a "temperature" parameter which is slowly decreased as the algorithm proceeds, and η_L denotes the learning rate of LA. This algorithm can escape from local minima because the noise term is artificially added to the standard gradient method; furthermore, theoretical analyses of its global convergence are given in [7].
The analogy between poBFGS and LA is shown as follows: (8) can be redefined as

    w_{k+1} = w_k − η_k (1 − λ) ∂E(w_k)/∂w + η_k λ n_k,     (9)

by the following conversions,

    η_k = η_L / (1 − λ)  and  √(2T_k) = η_L λ / (1 − λ).    (10)

(10) shows that as λ changes from 1 to 0, T_k is cooled from a high temperature toward 0. The idea behind these conversions is that (9) is similar to the poBFGS update except for its second (noise) term. Namely, if the online gradient ∂E_p(w)/∂w can play the role of the "noise" n_k in poBFGS, the algorithm could have the same global convergence property as LA. A rigorous analysis of global convergence will be presented in future work.
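As a numerical sanity check of the analogy, assume the conversion sqrt(2·T_k) = η_L·λ/(1−λ) between the weighting coefficient and the temperature (our reading of the garbled conversion in this section, so an assumption); the implied temperature then falls monotonically as λ goes from 1 toward 0.

```python
def implied_temperature(lam, eta=0.1):
    """Temperature implied by the lambda <-> T conversion of Sec. 3.2.

    From sqrt(2*T) = eta * lam / (1 - lam)  =>  T = (eta*lam/(1-lam))**2 / 2.
    Valid for 0 <= lam < 1; T diverges as lam -> 1 (the "high" start).
    """
    x = eta * lam / (1.0 - lam)
    return 0.5 * x * x

temps = [implied_temperature(lam) for lam in (0.99, 0.9, 0.5, 0.1, 0.0)]
print(all(a > b for a, b in zip(temps, temps[1:])))  # True
```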

4 Simulation Results

Computer simulations are conducted in order to demonstrate the validity of poBFGS. The feedforward neural network considered here has three layers, that is, a single hidden layer. The performance of poBFGS is compared with that of BFGS [1], oBFGS [5], ioBFGS [6], and LA [9]. In this paper the learning rate of LA is a positive definite matrix updated by the BFGS formula (hereafter laBFGS), although it was a positive scalar, i.e., Back propagation, in [9]; this change is made for a fair comparison. Moreover, two types of noise vectors are utilized as n_k in (9) to verify practical effectiveness: a Gaussian random noise sequence [9] with mean 0 and variance 0.01 (laBFGS/G), and a uniform random sequence in [−0.01, 0.01] (laBFGS/u). Thirty (30) independent runs were performed

for all algorithms with different starting values of w. Each trained neural network was evaluated by the average training error (in units of 10⁻³) and the average computational time (sec).
<example A> First of all, the following functions are considered [6][10][11][12]:

    f(x₁, x₂) = 1.9 [1.35 + e^{x₁} sin(13(x₁ − 0.6)²) e^{−x₂} sin(7x₂)],   (11)

    f(x₁, x₂) = 1.3356 [1.5(1 − x₁) + e^{2x₁−1} sin(3π(x₁ − 0.6)²)
                + e^{3(x₂−0.5)} sin(4π(x₂ − 0.9)²)],                        (12)

    f(x) = (π/n) {10 sin²(πx₁) + Σ_{i=1}^{n−1} (x_i − 1)² [1 + 10 sin²(πx_{i+1})]
                + (x_n − 1)²},   x_i ∈ [−4, 4].                             (13)
(11) and (12) are 2-dimensional benchmark problems, referred to as the Complicated Interaction (Fig. 1) and Additive (Fig. 2) functions, respectively [10][11]. In both problems the training set includes 1,680 training samples with (x₁, x₂) ∈ [−1, 1]². (13) is the Levy function [12], which is usually used as a benchmark problem for multimodal function optimization; it has a huge number of local minima, as shown in Fig. 3, even when n = 2. As a result, the Levy function can be regarded as a highly nonlinear function for neural network modeling. Moreover, its input dimension n can be chosen arbitrarily, so two Levy examples are considered here, with parameters (n, N_T) = (5, 1,000) and (10, 2,000), respectively. The numbers of hidden neurons for the Complicated Interaction, Additive, and the two Levy problems are 27, 9, 20 and 40, respectively. The maximum iteration count is 2 10 for all algorithms. The cooling parameters γ of laBFGS and poBFGS are experimentally set to 0.7 for the first two problems, and 0.2 for the two Levy problems. The simulation results are illustrated
in Table 1. Several mini-batch settings are tested for oBFGS and ioBFGS; a mini-batch includes N_T/Seg training samples. The table shows that poBFGS and laBFGS/G obtain slightly smaller errors than BFGS, and results similar to those of ioBFGS, for the Complicated Interaction function. For the Additive and the two Levy problems, however, poBFGS and ioBFGS reduce the error compared with BFGS and laBFGS/G without taking extra computational time. The results of ioBFGS and poBFGS are better than those of the other BFGS-based algorithms, indicating that the increasing strategy of training samples is
Algorithm 1: Parameterized online BFGS (poBFGS)

1.  k ← 1, j ← 1, m ← 1, λ ← 1;
2.  Initialize w and H by uniform random numbers and the unit matrix, respectively;
3.  Initialize T₁ and M₁;
4.  While (k < k_max and ‖∂E_k(w)/∂w‖ > ε)
5.      Calculate g_k = ∂E_k(w)/∂w;
6.      Execute line search to decide the step size α_k;
7.      Update w ← w − α_k H g_k;
8.      Update H by using the BFGS formula [4];
9.      If m > M_j then
            j ← j + 1, m ← 1;
            Update T_j, λ and M_j
            using (6), (5), and (7), respectively;
10.     k ← k + 1, m ← m + 1;
11. EndWhile

effective for these problems. On the other hand, oBFGS cannot obtain sufficiently small errors; that is, oBFGS easily gets stuck in local minima for these problems, although its computational times are short. Furthermore, it is shown that Gaussian random noise is more effective than uniform noise for laBFGS.
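The example-A benchmarks can be sketched in NumPy. The first two follow the "complicated interaction" and "additive" functions of [10][11], and the Levy form below is one common variant from [12]; the exact constants are our reading of a garbled original, so treat them as assumptions.

```python
import numpy as np

def f_ci(x1, x2):
    # Complicated Interaction function, Eq. (11).
    return 1.9 * (1.35 + np.exp(x1) * np.sin(13.0 * (x1 - 0.6) ** 2)
                  * np.exp(-x2) * np.sin(7.0 * x2))

def f_add(x1, x2):
    # Additive function, Eq. (12).
    return 1.3356 * (1.5 * (1.0 - x1)
                     + np.exp(2.0 * x1 - 1.0) * np.sin(3.0 * np.pi * (x1 - 0.6) ** 2)
                     + np.exp(3.0 * (x2 - 0.5)) * np.sin(4.0 * np.pi * (x2 - 0.9) ** 2))

def f_levy(x):
    # Levy function, Eq. (13); this variant has global minimum 0 at x = (1, ..., 1).
    x = np.asarray(x, dtype=float)
    n = x.size
    inner = np.sum((x[:-1] - 1.0) ** 2 * (1.0 + 10.0 * np.sin(np.pi * x[1:]) ** 2))
    return (np.pi / n) * (10.0 * np.sin(np.pi * x[0]) ** 2
                          + inner + (x[-1] - 1.0) ** 2)

print(round(float(f_levy(np.ones(5))), 10))  # 0.0
```

The many local minima of the Levy surface are what make the fitted network highly nonlinear.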

Fig. 1. Complicated Interaction function.  Fig. 2. Additive function.  Fig. 3. Levy function (n = 2).  Fig. 4. The function of the second example of (14).

<example B> Next, (14) is used as a function approximation problem with high-
nonlinearity [6][13],
, , 1 2 sin 4,4 , (14)
where x₁, x₂ and x₃ are input variables for the neural networks. (14) can be regarded as a highly nonlinear function because the three variables appear in a sine function as frequency elements. In particular, the EM behavior of microwave circuits is quite similar to such test functions [3][6]. For (14), three examples are considered, and two maximum iteration counts are used for each example, namely 2 10 and 1 10 .

Table 1. Summary of simulation results for example A. Each cell shows the average training error (in units of 10⁻³) / the average computational time (sec); Seg is the number of mini-batches for oBFGS and ioBFGS.

Algorithm   Seg   Comp. Interaction   Additive        Seg   Levy (n = 5)   Seg   Levy (n = 10)
BFGS        -     0.947 / 617         1.04 / 292      -     3.03 / 186     -     3.04 / 2157
oBFGS       5     13.8 / 158          48.9 / 108      2     62.3 / 120     5     35.5 / 503
oBFGS       10    12.8 / 160          7.05 / 109      10    29.7 / 125     10    32.7 / 506
oBFGS       80    10.0 / 192          2.17 / 120      50    13.6 / 146     50    12.6 / 577
oBFGS       168   5.00 / 233          3.58 / 134      100   8.67 / 174     200   7.36 / 840
ioBFGS      5     0.927 / 571         0.824 / 254     2     0.131 / 239    5     1.66 / 2180
ioBFGS      10    0.902 / 598         0.866 / 264     10    1.85 / 163     10    1.43 / 1637
ioBFGS      80    0.928 / 607         0.747 / 268     50    2.80 / 250     50    2.81 / 2293
ioBFGS      168   0.975 / 608         0.841 / 268     100   2.54 / 225     200   1.80 / 1757
laBFGS/u    -     0.933 / 614         1.26 / 286      -     3.96 / 290     -     6.70 / 2169
laBFGS/G    -     0.884 / 615         1.03 / 286      -     5.17 / 291     -     2.82 / 2172
poBFGS      -     0.884 / 618         0.641 / 293     -     1.68 / 202     -     1.08 / 2159

The first example is very simple, but still a function with relatively high nonlinearity. In [13], a radial basis function network with 20 hidden neurons was used for this problem. The structure of the 3-layer neural network here is 1-7-1; x₂ = 1 and x₃ = 0, i.e., these variables are held constant. The training set includes 400 training points. The cooling

Table 2. Summary of simulation results for example B. Each cell shows the value for the smaller maximum iteration count / the value for the larger one; errors in units of 10⁻³, times in sec.

Algorithm     first example             second example             third example
              error        time         error        time          error          time
BFGS          2.52/1.82    102/483      2.05/2.03    2363/9163     0.829/0.748    8324/41611
ioBFGS(10)    1.23/1.15    111/556      1.10/0.909   1708/9068     4.20/0.806     7416/45902
laBFGS/G      3.54/2.13    117/550      1.53/1.56    1901/9208     0.764/0.376    8248/41094
poBFGS        1.01/0.599   105/581      1.75/0.573   1853/10652    0.796/0.321    8355/41749

parameters of laBFGS/G and poBFGS are set to 0.7 and 0.9, respectively. In the second example (Fig. 4), the network has two inputs; x₁, x₂ ∈ [−1, 1] are variables, and x₃ is fixed to 0. The neural network is 2-45-1, and the training set includes 3,320 training points. The cooling parameters are the same as in the first example. A more complicated problem is used as the third example. The structure of the network is 3-55-1; that is, x₁, x₂ ∈ [−1, 1] and x₃ ∈ [−0.5, 0.5] are variables. The training set includes 10,080 training points. The cooling parameters are set to 0.05 and 0.5, respectively. The mini-batch setting (Seg) of ioBFGS is set to 10 for all three examples. The simulation results are illustrated in Table 2, where the training errors and computational times are shown as (value for the smaller maximum iteration count)/(value for the larger one). From the
table, it is demonstrated for the first example that poBFGS obtains the smallest error among the tested algorithms for each maximum iteration count. This indicates that the quality of the solutions found by poBFGS is more certain than that of the other algorithms with respect to a randomly chosen starting point. From the results of the second example, when the maximum iteration count is the smaller one, the smallest error is obtained by ioBFGS(10); on the other hand, the minimum error is obtained by poBFGS when the maximum iteration count is the larger one. This means that poBFGS has a strong ability to search for a global minimum, covering a wider solution area without being trapped in local minima by taking much
more iterations. Namely, poBFGS is a robust algorithm, less dependent on the choice of the initial guess of w. Furthermore, it is confirmed that the error of the third example obtained by poBFGS was also the smallest among the four algorithms. As a result, poBFGS can obtain small training errors with much certainty, while it is impossible for the other algorithms to obtain such small training errors.

Table 3. Comparison of training and testing errors (in units of 10⁻³) for the first and second examples of (14)

                     first example        second example
N_T                  100       400        1,680     3,320
BFGS     training    0.335     1.82       1.46      2.03
         testing     2.80      1.66       2.74      1.81
poBFGS   training    0.190     0.599      0.589     0.573
         testing     1.62      0.558      2.51      0.545
<example C> Finally, the generalization abilities of neural networks trained using BFGS and poBFGS are studied for the first and second examples. In this simulation, new training data sets smaller than those of example B are generated: N_T is set to 100 and 1,680 for the first and second examples, respectively, and the neural networks have the same structures as in example B. Each trained neural network is evaluated by its testing error. The testing error measures the generalization ability, i.e., the ability of a trained neural network to produce accurate outputs for inputs never seen during training. It is calculated by (2) using a data set of 10,000 randomly selected input-output pairs. The training and testing errors of the trained neural networks are presented in Table 3.
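The testing-error computation just described, i.e., Eq. (2) averaged over held-out pairs, amounts to the following; the data here is purely illustrative.

```python
import numpy as np

def average_error(D, O):
    """Average error of Eq. (2): mean over samples of 0.5 * ||d_p - o_p||^2."""
    D = np.atleast_2d(D)
    O = np.atleast_2d(O)
    per_sample = 0.5 * np.sum((D - O) ** 2, axis=1)
    return per_sample.mean()

# Illustrative held-out evaluation with 10,000 random pairs.
rng = np.random.default_rng(1)
targets = rng.uniform(-1.0, 1.0, size=(10_000, 1))
outputs = targets + rng.normal(0.0, 0.01, size=targets.shape)
print(average_error(targets, outputs) < 1e-3)  # True
```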

From the table, the testing errors of the first and second examples with the small training data sets are larger than those of the networks trained in example B, although the corresponding training errors are lower than the results of example B. On the other hand, the testing errors of the networks trained in example B (N_T = 400 and 3,320) are almost exactly the same as the training errors. This means that the small data sets used here are insufficient for useful neural models, and that an adequate number of training samples is necessary to generate an accurate neural model. At the same time, the increased number of training samples makes neural network training especially difficult. As a result, it is clear that poBFGS is practical and useful for highly nonlinear function modeling and neural network training.

5 Conclusions
In this paper we have presented a robust training technique for feedforward neural networks. The technique combines the global optimization capability of the online BFGS with the fast and strong local search capability of the batch BFGS. Furthermore, an analogy between the proposed algorithm and the Langevin algorithm was considered. The method overcomes the problem of local minima in conventional gradient-based neural network training; it is robust, provides high-quality training and testing solutions regardless of initial values, and helps provide accurate neural network models for highly nonlinear function approximation problems.
In the future, the validity of the proposed algorithm for real-world problems such as microwave circuit modeling [2][3][6] will be demonstrated.

References
1. Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Pearson (2009)
2. Zhang, Q.J., Gupta, K.C., Devabhaktuni, V.K.: Artificial neural networks for RF and
microwave design-from theory to practice. IEEE Trans. Microwave Theory and Tech. 51,
1339–1350 (2003)
3. Ninomiya, H., Wan, S., Kabir, H., Zhang, X., Zhang, Q.J.: Robust training of microwave
neural network models using combined global/local optimization techniques. In: IEEE
MTT-S International Microwave Symposium (IMS) Digest, pp. 995–998 (June 2008)
4. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer (2006)
5. Schraudolph, N.N., Yu, J., Gunter, S.: A stochastic quasi-Newton method for online
convex optimization. In: Proc. 11th Intl. Conf. Artificial Intelligence and Statistics (2007)
6. Ninomiya, H.: An improved online quasi-Newton method for robust training and its
application to microwave neural network models. In: Proc. IEEE&INNS/IJCNN 2010, pp.
792–799 (July 2010)
7. Gelfand, S.B., Mitter, S.K.: Recursive stochastic algorithms for global optimization in R^d.
SIAM J. Control and Optimization 29(5), 999–1018 (1991)
8. Corana, A., Marchesi, M., Martini, C., Ridella, S.: Minimizing multimodal functions of
continuous variables with the Simulated Annealing algorithm. ACM Trans. Math.
Softw. 13(3), 262–280 (1987)
9. Rögnvaldsson, T.: On Langevin updating in multilayer perceptrons. Neural Computation 6(5),
916–926 (1994)

10. Kwok, T.Y., Yeung, D.Y.: Objective functions for training new hidden units in
constructive neural networks. IEEE Trans. Neural Networks 8(5), 630–645 (1997)
11. Ma, L., Khorasani, K.: New training strategies for constructive neural networks with
application to regression problems. Neural Networks 17, 589–609 (2004)
12. Levy, A., Montalvo, A., Gomez, S., Calderon, A.: Topics in Global Optimization. Lecture
Notes in Mathematics, vol. 909. Springer, New York (1981)
13. Benoudjit, N., Archambeau, C., Lendasse, A., Lee, J., Verleysen, M.: Width optimization
of the Gaussian kernels in radial basis function networks. In: Proc. Eur. Symp. Artif.
Neural Netw., pp. 425–432 (April 2002)
Estimating a Causal Order among Groups
of Variables in Linear Models

Doris Entner and Patrik O. Hoyer

HIIT & Department of Computer Science, University of Helsinki, Finland

Abstract. The machine learning community has recently devoted much


attention to the problem of inferring causal relationships from statistical
data. Most of this work has focused on uncovering connections among
scalar random variables. We generalize existing methods to apply to col-
lections of multi-dimensional random vectors, focusing on techniques ap-
plicable to linear models. The performance of the resulting algorithms
is evaluated and compared in simulations, which show that our methods
can, in many cases, provide useful information on causal relationships
even for relatively small sample sizes.

1 Introduction

Many techniques have recently been developed for inferring causal relationships
from data over a set of random variables [1,2,3,4,5,6,7,8,9]. While most of this
work has focused on uncovering connections among scalar random variables,
in many actual cases each of the variables of interest may consist of multiple
related, but distinct, measurements. For instance, in fMRI data analysis one
is often interested in the functional connectivity among brain regions, and for
each such region-of-interest one has data measured from a set of multiple voxels.
Typically, in these cases, some aggregate of each area is computed, after which
the standard approaches are directly applicable. However, it can be shown that
not only may information be lost when computing aggregates, but the outputs
of such methods may not even be correct in the large sample limit.
A simple example illustrating one of the problems inherent with working with
aggregates is the following. Consider three sets of variables with causal connec-
tions X → Y → Z, i.e. the variables in X may influence the variables in Y, but
not directly the variables in Z, and the variables in Y may influence the variables
in Z. In this case, each variable x ∈ X is independent of each z ∈ Z conditional
on the full set of mediating variables Y. However, when replacing the variables
of each group with their respective mean value (the typical aggregate used),
denoted by x̄, ȳ, and z̄, in general we obtain x̄ ⊥̸⊥ z̄ | ȳ [1,10]. Thus, it is impor-
tant to develop methods for causal discovery that exploit the full information
available, as opposed to only aggregates of the data.
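The X → Y → Z example can be checked numerically. In the sketch below (the coefficient matrices are our own illustrative choice), conditioning on the group mean ȳ leaves x̄ and z̄ clearly correlated, even though x ⊥⊥ z holds given the full mediating vector y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Groups: x in R^2, y in R^2, z in R^1, with causal chain X -> Y -> Z.
x = rng.normal(size=(n, 2))
B_yx = np.array([[1.0, 0.0], [0.0, -1.0]])       # illustrative effect of X on Y
y = x @ B_yx.T + 0.1 * rng.normal(size=(n, 2))
z = y[:, [0]] + 0.1 * rng.normal(size=(n, 1))    # z depends on y_1 only

x_bar, y_bar, z_bar = x.mean(axis=1), y.mean(axis=1), z.mean(axis=1)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    def resid(v):
        coef = np.polyfit(c, v, 1)
        return v - np.polyval(coef, c)
    return np.corrcoef(resid(a), resid(b))[0, 1]

# Aggregation loses information: the group means stay dependent given y_bar.
print(abs(partial_corr(x_bar, z_bar, y_bar)) > 0.1)  # True
```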
Towards this end, in this paper we extend two existing approaches [5,6] de-
signed for causal discovery among scalar random variables to the case of random
vectors (i.e. groups of variables), both exploiting any kind of non-Gaussianity


present in the data. We also extend a recent method [8] for inferring the causal
relationship among two arbitrarily distributed multi-dimensional variables to an
arbitrary number of such variables. After describing the resulting algorithms, we
evaluate and compare their performance in numerical simulations.

2 Model and Problem Statement


Throughout the paper, we will use the term ‘group’ to denote a set of un-
derlying variables all belonging to the same (multi-dimensional) random vector
representing a single concept (e.g. one region in fMRI analysis). We use the term
‘variable’ to represent a single scalar random variable belonging to one of the
groups. Thus, for g = 1, . . . , G, let Xg denote group g, and let the random vector
x_g = (x_1^(g), . . . , x_{n_g}^(g))^T collect the n_g random variables belonging to group g. We
assume that the groups X_g can be arranged in a causal order K = (k_1, . . . , k_G),


such that the causal relationships among the groups can be represented by a
directed acyclic graph. The data generating process is assumed to be a set of
linear equations, given by

    x_{k_i} = Σ_{j<i} B_{k_i,k_j} x_{k_j} + e_{k_i} ,   i = 1, . . . , G,   (1)

with B_{k_i,k_j} arbitrary (real) matrices of dimension n_{k_i} × n_{k_j}, containing the
direct effects from group X_{k_j} to group X_{k_i}. The vectors of disturbance terms
e_{k_i} are assumed to be zero mean, and mutually independent over groups, i.e.
e_{k_i} ⊥⊥ e_{k_j} for i ≠ j, but are allowed to be dependent within each group. If we
arrange the groups in a causal order K and define x = (xk1 , . . . , xkG ) and e =
(ek1 , . . . , ekG ), we can rewrite Equation (1) in matrix form as x = Bx+e with B
a lower block triangular matrix. The model reduces to standard LiNGAM (Linear
Non-Gaussian Acyclic Model, [4,5]) when ∀g : ng = 1 and all disturbances e are
non-Gaussian. It also includes the model of [6] when G = 2, n1 = n2 = 1 and the
disturbances are non-Gaussian. Finally, it contains as a special case the noisy
model of [8] when G = 2 (but with no restriction on the ng and e).
We assume that all variables in x are observed, and that the grouping of these
variables is known. Given merely observations of x generated by Model (1) (i.e.
B and e are unknown), we want to estimate the unknown causal order K among
these groups. We denote the data matrix of observations over the variables x as
X = (X1 , . . . , XG )T , where each column corresponds to one observation and each
row to one variable. The observations are grouped according to the G groups,
arranged in a random order, such that the first n1 rows correspond to group X1 ,
the following n2 rows to group X2 , and so on.
We note that our model family is equivalent to that given by [7]. The main difference between our approach and theirs is that they do not assume knowledge of which variable belongs to which group, which results in algorithms exponential in the number of involved variables, whereas our algorithm explicitly builds upon such knowledge, allowing the construction of computationally and statistically more efficient algorithms, polynomial in the number of groups.
86 D. Entner and P.O. Hoyer

3 Method
The overall algorithm for finding a causal order among the groups follows the
approach introduced in [5]. We first search for an exogenous group (Section 3.1),
and then ‘regress out’ the effect of this group on all other groups (Section 3.2).
We iterate this process to generate a full causal order over the G groups.

3.1 Finding an Exogenous Group


We generalize three existing methods to search for an exogenous group, formally
defined below. Note that since the connections among the groups are assumed
to be acyclic, there always exists at least one such exogenous (‘source’) group.
Definition 1. A group Xj is exogenous if for x_j all matrices B_{j,i} of Equation (1) are zero.

GroupDirectLiNGAM. As our first approach, we generalize the idea of DirectLiNGAM [5], which finds an exogenous variable, to finding an exogenous group. The following lemma, which corresponds directly to Lemma 1 in [5], states a criterion to find an exogenous group using regressions and independence tests.

Lemma 1. Let x follow Model (1) with non-Gaussian disturbance terms e. Let r_i^{(j)} := x_i − C x_j be the residuals when regressing x_i on x_j using ordinary least squares (OLS). A group Xj is exogenous if and only if x_j ⊥⊥ r_i^{(j)} for all i ≠ j.
The proof of this lemma, and the proof of Lemma 2 in Section 3.2 are left to the
online appendix at http://www.cs.helsinki.fi/u/entner/GroupCausalOrder/
To apply Lemma 1 in practice, we need to test for (in)dependence between two vectors of random variables, and combine the results of several such independence tests. Assuming that the test returns p-values p_{ji} under the null hypothesis of x_j ⊥⊥ r_i^{(j)}, i ≠ j, we can get a measure of how exogenous group Xj is by combining these p-values using Fisher’s method [11]. This means that we select as the exogenous group the one minimizing

    \mu(j) = - \sum_{i \neq j} \log(p_{ji}).    (2)

To obtain the p-values p_{ji} we can test for joint dependence of the two vectors x_j and r_i^{(j)} using the Hilbert Schmidt Independence Criterion (HSIC, [12]), which, however, requires many samples to detect dependencies for high dimensional vectors. Alternatively, we can perform pairwise tests of each variable in x_j against each variable in r_i^{(j)} using nonlinear correlations, and combine the resulting n_j × n_i p-values appropriately. Details are left to the online appendix.
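As an illustration, a minimal NumPy/SciPy sketch of this exogeneity score. The name `exogeneity_score` is ours; it regresses every other group on group j and Fisher-combines pairwise p-values. As a crude stand-in for the HSIC or nonlinear-correlation tests of the paper, it correlates squared values, which detects only some forms of dependence.

```python
import numpy as np
from scipy import stats

def exogeneity_score(X_groups, j):
    """Fisher-combined score mu(j) of Eq. (2); smaller = more exogenous.

    X_groups: list of (n_g, m) data matrices (rows = variables).
    """
    Xj = X_groups[j]
    score = 0.0
    for i, Xi in enumerate(X_groups):
        if i == j:
            continue
        # OLS regression of group i on group j: residuals r_i^(j) = x_i - C x_j
        C = Xi @ Xj.T @ np.linalg.pinv(Xj @ Xj.T)
        R = Xi - C @ Xj
        # Crude pairwise dependence test: correlation of squared values
        # (stand-in for HSIC / nonlinear correlations; see the paper's appendix).
        pvals = [stats.pearsonr(a ** 2, b ** 2)[1] for a in Xj for b in R]
        # Fisher's method: add -sum log p for this group pair
        score -= np.sum(np.log(np.maximum(pvals, 1e-300)))
    return score
```

On data generated from Model (1) with non-Gaussian disturbances, the truly exogenous group should obtain the smallest score.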

Pairwise Measure. Our second approach is based on modifying the pairwise measure [6] designed for inferring the causal relationship between two linearly related non-Gaussian scalar random variables x and y. If the true underlying causal direction is from x to y, the model is defined as y = ρx + e_y with x ⊥⊥ e_y.
Estimating a Causal Order among Groups of Variables in Linear Models 87

As pointed out in Section 2, this is just a special case of our more general model.
The (normalized) ratio of the log likelihoods for the two possible causal models is
given by R(x, y) = (log L(x → y) − log L(y → x)) /m, where m is the sample size
and L the likelihood of the specified direction, under some suitable assumption
on the distributions of the disturbances. If the true underlying causal direction
is x → y, then R(x, y) > 0 in the large sample limit. Symmetrically, if x ← y,
then R(x, y) < 0 in the limit.
To use the ratio R(· , ·) to find an exogenous group Xj, the naïve approach is to calculate R(x_k^{(j)}, x_l^{(i)}) for each pair with x_k^{(j)} ∈ Xj, k = 1, . . . , n_j, x_l^{(i)} ∈ Xi, l = 1, . . . , n_i, i ≠ j, and combine these measures. However, even if Xj is exogenous, these pairs do not necessarily meet the model assumption because of the dependent error terms within each group, and hence there is no guarantee of correctness even in the large sample limit. This approach, termed the Naïve Pairwise Measure, may however have a statistical advantage for small sample sizes (see Section 4).
To obtain a consistent method (simply termed Pairwise Measure in Section 4), we replace the second variable of the pairs (x_k^{(j)}, x_l^{(i)}) with a quantity which guarantees that the model assumption is met if Xj is exogenous: We first estimate the regression model x_l^{(i)} = \sum_{\tilde{k}=1}^{n_j} \hat{b}_{l\tilde{k}} x_{\tilde{k}}^{(j)} + r_{l,(i)}. If Xj is exogenous then the regression coefficients \hat{b}_{l\tilde{k}} are consistent estimators of the true total effects (when marginalizing out any intermediate groups). Hence, defining

    z_{k,l}^{(i)} := x_l^{(i)} - \sum_{\tilde{k}=1; \tilde{k} \neq k}^{n_j} \hat{b}_{l\tilde{k}} x_{\tilde{k}}^{(j)} = \hat{b}_{lk} x_k^{(j)} + r_{l,(i)}

yields a pair (x_k^{(j)}, z_{k,l}^{(i)}) meeting the model assumption of [6] if Xj is exogenous. Thus, in this case, R(x_k^{(j)}, z_{k,l}^{(i)}) > 0, in the limit, for all k, l, and i ≠ j. On the contrary, if Xj is not exogenous the measure can take either sign, and simulations show that it is unlikely to always obtain a positive one. A way to combine the ratios is suggested in [6], which can be modified for the group case as

    \mu(j) = \frac{1}{n_j \sum_{i \neq j} n_i} \sum_{k=1}^{n_j} \sum_{i \neq j} \sum_{l=1}^{n_i} \min\{0, R(x_k^{(j)}, z_{k,l}^{(i)})\}^2.    (3)

That is, we penalize each negative value according to its squared magnitude and
adjust for the group sizes. We select the group minimizing this measure as the
exogenous one.
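A sketch of this group-wise pairwise measure. `pairwise_R` uses the tanh-based approximation of the likelihood ratio from [6], which holds for standardized, super-Gaussian variables; it is only one of the estimators discussed there, and both function names are ours.

```python
import numpy as np

def pairwise_R(x, y):
    """Approximate normalized log-likelihood ratio R(x, y); positive => x -> y.

    Tanh-based approximation from [6], for standardized super-Gaussian data.
    """
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    rho = np.mean(x * y)
    return rho * np.mean(x * np.tanh(y) - np.tanh(x) * y)

def pairwise_group_score(X_groups, j):
    """mu(j) of Eq. (3): penalized negative ratios R(x_k^(j), z_kl^(i))."""
    Xj = X_groups[j]
    nj = Xj.shape[0]
    n_other = sum(X.shape[0] for i, X in enumerate(X_groups) if i != j)
    total = 0.0
    for i, Xi in enumerate(X_groups):
        if i == j:
            continue
        # OLS estimate of the total effects of group j on group i
        Bhat = Xi @ Xj.T @ np.linalg.pinv(Xj @ Xj.T)
        for l in range(Xi.shape[0]):
            for k in range(nj):
                # z_kl^(i): remove the estimated effect of all of x^(j) except x_k^(j)
                mask = np.arange(nj) != k
                z = Xi[l] - Bhat[l, mask] @ Xj[mask]
                total += min(0.0, pairwise_R(Xj[k], z)) ** 2
    return total / (nj * n_other)
```

Note that pairwise_R is exactly antisymmetric in its arguments, so a negative value for one ordering implies a positive value for the reverse.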

Trace Method. Our third method for finding an exogenous group is based on
the approach of [8,9], termed the Trace Method, designed to infer the causal order
among two groups of variables X and Y with nx and ny variables, respectively.
If the underlying true causality is given by X → Y, the model is defined as y =
Bx+e, where the connection matrix B is chosen independently of the covariance
matrix of the regressors Σ := cov(x, x), and the disturbances e are independent
of x. Note that this method is based purely on second-order statistics and does
not make any assumptions about the distribution of the error terms e, as opposed
to the previous two approaches where we needed non-Gaussianity. The measure
to infer the causal direction defined in [8] is given by

    \Delta_{X \to Y} := \log\!\left( \mathrm{tr}(\hat{B}\hat{\Sigma}\hat{B}^T)/n_y \right) - \log\!\left( \mathrm{tr}(\hat{\Sigma})/n_x \right) - \log\!\left( \mathrm{tr}(\hat{B}\hat{B}^T)/n_y \right)    (4)

where tr(·) denotes the trace of a matrix, Σ̂ an estimate of the covariance matrix
of x, and B̂ the OLS estimate of the connection matrix from x to y. The measure
for the backward direction ΔY→X is calculated similarly by exchanging B̂ with
the OLS estimate of the connection matrix from y to x and Σ̂ with the estimated
covariance matrix of y. If the correct direction is given by X → Y, Janzing et
al. [8] (i) conclude that ΔX→Y ≈ 0, (ii) show for the special case of B being an
orthogonal matrix and the covariance matrix of e being λI, that ΔY→X < 0,
and (iii) show for the noise free case that ΔY→X ≥ 0. Hence, the underlying
direction is inferred to be the one yielding Δ closer to zero [8]. In particular, if
|ΔX→Y | / |ΔY→X | < 1, then the direction is judged to be X → Y.
We suggest using the Trace Method to find an exogenous group Xj among G groups in the following way. For each j, we calculate the measures \Delta_{X_j \to X_i} and \Delta_{X_i \to X_j}, for all i ≠ j, and infer as exogenous group the one minimizing

    \mu(j) = \sum_{i \neq j} \left( \Delta_{X_j \to X_i} / \Delta_{X_i \to X_j} \right)^2.    (5)
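The trace-based measures of Equations (4) and (5) are computed directly from second-order statistics; a sketch (function names are ours):

```python
import numpy as np

def trace_delta(X, Y):
    """Delta_{X->Y} of Eq. (4); X, Y are (n, m) sample matrices (rows = variables)."""
    nx, ny = X.shape[0], Y.shape[0]
    Sigma = np.cov(X)                      # covariance of the regressors
    B = Y @ X.T @ np.linalg.pinv(X @ X.T)  # OLS connection matrix x -> y
    return (np.log(np.trace(B @ Sigma @ B.T) / ny)
            - np.log(np.trace(Sigma) / nx)
            - np.log(np.trace(B @ B.T) / ny))

def trace_score(X_groups, j):
    """mu(j) of Eq. (5): squared ratios of forward to backward Deltas."""
    Xj = X_groups[j]
    return sum((trace_delta(Xj, Xi) / trace_delta(Xi, Xj)) ** 2
               for i, Xi in enumerate(X_groups) if i != j)
```

As the text notes, this relies only on second-order statistics, so no non-Gaussianity is required, but it is designed for reasonably high-dimensional groups.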

3.2 Estimating a Causal Order


Following the approach of [5], after finding an exogenous group we ‘regress out’
the effect of this group on all other groups. Since the resulting data set follows
again the model in Equation (1) having the same causal order as the original
groups, we can search for the next group in the causal order in this reduced data
set. This is formally stated in the following lemma, which corresponds to the
combination of Lemma 2 and Corollary 1 in [5].
Lemma 2. Let x follow Model (1), and assume that the group Xj is exogenous. Let r_i^{(j)} := x_i − C x_j be the residuals when regressing x_i on x_j using OLS, for i = 1, . . . , G, i ≠ j, and denote by r^{(j)} the column vector concatenating all these residuals. Then r^{(j)} = B^{(j)} r^{(j)} + e^{(j)} follows Model (1). Furthermore, the residuals in r_i^{(j)} follow the same causal order as the original groups x_i, i ≠ j.
Using Lemma 2, and the methods of Section 3.1, we can formalize the approach
to find a causal order among the groups as shown in Algorithm 1.

3.3 Handling Large Variable Sets with Few Observations


The OLS estimation used in Algorithm 1 requires an estimate of the inverse covariance matrix, which can lead to unreliable results in the case of low sample size. One approach to solving this problem is to use regularization. For the L2-regularized estimate of the connection matrix we obtain \hat{C}_{i,j} = X_i X_j^T (X_j X_j^T + \lambda I)^{-1} = m \, cov(X_i, X_j) \, (m \, cov(X_j, X_j) + \lambda I)^{-1}, with m the sample size and \lambda the regularization parameter; see for example [13]. In particular, this provides a regularized estimate of the covariance matrix.
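A one-line sketch of this regularized estimator on (variables × samples) data matrices; λ = 0 recovers plain OLS (function name is ours):

```python
import numpy as np

def ridge_connection(Xi, Xj, lam):
    """L2-regularized OLS connection matrix C_ij = X_i X_j^T (X_j X_j^T + lam I)^{-1}.

    Xi, Xj: (n_i, m) and (n_j, m) data matrices with rows = variables,
    so the Gram matrix X_j X_j^T is only n_j x n_j regardless of m.
    """
    nj = Xj.shape[0]
    return Xi @ Xj.T @ np.linalg.inv(Xj @ Xj.T + lam * np.eye(nj))
```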

Algorithm 1. (Estimating a Causal Order among Groups)

Input: Data matrix X generated by Model (1), arranged in a random causal order
Initialize the causal order K := [ ].
repeat
    Find an exogenous group Xj from X using one of the approaches in Section 3.1.
    Append j to K.
    Replace the data matrix X with the matrix R^{(j)} concatenating all residuals
    R_i^{(j)}, i ≠ j, from the regressions of x_i on x_j using OLS:
        X_i = C_{i,j} X_j + R_i^{(j)} with C_{i,j} = cov(X_i, X_j) cov(X_j, X_j)^{-1}.
until G − 1 group indices are appended to K
Append the remaining group index to K.
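The loop of Algorithm 1 can be sketched as follows, parameterized by any of the exogeneity measures μ(j) of Section 3.1, passed in as a scoring function (names are ours):

```python
import numpy as np

def estimate_causal_order(X_groups, score):
    """Algorithm 1: iteratively pick an exogenous group and regress it out.

    X_groups: list of (n_g, m) data matrices; score(groups, j) is any of the
    mu(j) measures of Section 3.1 (smaller = more exogenous).
    Returns a list of original group indices in causal order.
    """
    groups = [X.copy() for X in X_groups]
    remaining = list(range(len(X_groups)))   # original indices still in play
    order = []
    while len(remaining) > 1:
        scores = [score(groups, g) for g in range(len(groups))]
        j = int(np.argmin(scores))
        order.append(remaining.pop(j))
        Xj = groups.pop(j)
        # regress out the exogenous group: R_i^(j) = X_i - C_ij X_j
        P = Xj.T @ np.linalg.pinv(Xj @ Xj.T) @ Xj
        groups = [Xi - Xi @ P for Xi in groups]
    order.append(remaining[0])
    return order
```

By Lemma 2, the residual groups after each round again follow Model (1), which is what justifies reusing the same scoring function on the reduced data.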

Another approach is to apply the methods of Section 3.1 for finding an exogenous group to N data sets, each of which consists of G groups formed by taking subsets of the variables of the corresponding original groups. We then calculate measures \mu_n^{(j)}, j = 1, . . . , G, n = 1, . . . , N, as in Equations (2), (3) or (5), for each such data set separately, and pick the group X_{j^*} which minimizes the sum over these sets to be an exogenous one, i.e.

    j^* = \arg\min_j \sum_{1 \le n \le N} \mu_n^{(j)}    (6)

where \mu_n^{(j)} is the measure of group j in the n-th data set. We can then proceed as in Algorithm 1 to find the whole causal order.


Note that the same approach can be used when multiple data sets are avail-
able, which are assumed to have the same causal order among the groups but
possibly different parameter values. An example for such a scenario is given by
fMRI data from several individuals. An equivalent of Equation (6) was suggested
in [14] for the single variable case with multiple data sets.
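The aggregation of Equation (6) is a straightforward arg-min over summed scores; a sketch (function name is ours):

```python
import numpy as np

def exogenous_by_subsets(datasets, score):
    """Eq. (6): pick j* minimizing the score summed over N data sets.

    datasets: N 'group lists' with identical group structure;
    score(groups, j) is any mu(j) measure of Section 3.1.
    """
    G = len(datasets[0])
    totals = [sum(score(groups, j) for groups in datasets) for j in range(G)]
    return int(np.argmin(totals))
```

The same call works whether the N data sets are variable subsets of one sample or genuinely distinct data sets (e.g. fMRI recordings from several individuals).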

4 Simulations
Together, the methods of Section 3 provide a diverse toolbox for inferring the
model of Section 2. Here, we provide simulations to evaluate the performance of
the variants of Algorithm 1, and compare it to a few ad hoc methods. Matlab
code is available at http://www.cs.helsinki.fi/u/entner/GroupCausalOrder/
We generate models following Equation (1) by randomly creating the connection matrices B_{k_i,k_j}, i > j, with, on average, s% of the entries nonzero, additionally ensuring that at least one entry is nonzero so that the graph over the groups is complete. To obtain the disturbance terms e_{k_i} for each group, we
linearly mix random samples from various independent non-Gaussian variables
as to obtain dependent error terms within each group. Finally, we generate the
sample matrix X and randomly block-permute the rows (groups) to hide the
generating causal order from the inference algorithms.

[Figure 1: three panels (6, 12 and 100 variables per group) plotting error rate (y-axis) against sample sizes of 200, 500 and 1000 (x-axis) for the algorithms listed in the caption below.]
(a) 100 models with 5 groups (b) 50 models with 3 groups

Fig. 1. Sample size (x-axis) against error rate (y-axis) for various model sizes and
algorithms, as indicated in the legends (abbreviations: GDL = GroupDirectLiNGAM;
nlcorr, HSIC: nonlinear correlation or HSIC as independence test; TrMeth. = Trace
Method; PwMeas. = Pairwise Measure; ICA-L = modified ICA-LiNGAM approach;
DL = DirectLiNGAM on the mean-variables; 10sets = Equation (6) on N = 10 data
sets; L2reg = L2 -regularization for covariance matrix). The dashed black line indicates
the number of mistakes made when randomly guessing an order.

We compare the variants of Algorithm 1 to two ad hoc methods. The first


one is a modified ICA-based LiNGAM approach [4] where instead of searching
for a permutation yielding a lower triangular connection matrix B (i.e. finding a
causal order among the variables), we search for a block permutation yielding a
lower block triangular matrix B (i.e. finding a causal order among the groups).
Secondly, we compare our approach to DirectLiNGAM [5], when replacing each
group by the mean of all its variables.1
We measure the performance of the methods by computing the error rates for
predicting whether Xi is prior to Xj , for all pairs (i, j), i < j.
Results for simulated data of sample size 200, 500 and 1000 generated from
100 random models having 5 groups with either 6 or 12 variables each, and
s = 10%, are shown in Figure 1 (a). As expected, most methods based on
Algorithm 1 improve their performance with increasing sample size. The only
exception is the Trace Method on the smaller models; to be fair, the method was
not really designed for so few dimensions. Overall, the best performing method
is the Pairwise Measure, closely followed by GroupDirectLiNGAM for the larger
sample sizes. The ad hoc methods using DirectLiNGAM on the mean perform
about as well as guessing an order (indicated by the dashed black line), whereas
the modified ICA-LiNGAM approach performs better than guessing. However, it
does not seem to converge for growing sample size, probably due to the dependent
errors within each group, which is a violation of the ICA model assumption.
We next replace each group by a subset of its variables of size m = 1, . . . , ng ,
and apply Algorithm 1 to these subgroups. As expected, the larger m is, the fewer
ordering mistakes are made. Details can be found in the online appendix.
¹ We do not compare our results to methods such as PC [1] or GES [3], as they cannot distinguish between Markov-equivalent graphs. Hence, in these simulations, they cannot provide any conclusions about the ordering among the groups, since we generate complete graphs over the groups to ensure a total causal order.

Finally, we test the strategies described in Section 3.3 for handling low sample
sizes in high dimensions on 50 models with 3 groups of 100 variables each, using
200, 500 and 1000 samples, and s = 5%. For L2-regularization, we choose the parameter λ using 10-fold cross validation on the covariance matrix. When taking subgroups, we use N = 10 data sets, with each subgroup containing ten variables.
The error rates are shown in Figure 1 (b) (we only show the L2 -regularized results
if they were better than without regularization). Unreliable estimates of the
covariance matrix seem to affect especially the Trace Method, and the Pairwise
Measure on the smaller sample sizes. On the smallest sample, using subsets seems to be advantageous for most methods; however, the best performing approach is the Naïve Pairwise Measure, which does not appear to be consistent, whereas GroupDirectLiNGAM and the Pairwise Measure are.
In general, the simulations show that the introduced method often correctly
identifies the true causal order, and clearly outperforms the simple ad hoc ap-
proaches. It is left to future work to study the performance in cases of model
violations as well as to apply the method to real world data.
Acknowledgments. We thank Ali Bahramisharif and Aapo Hyvärinen for discussion. The authors were supported by Academy of Finland project #1255625.
References

1. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search, 2nd edn. MIT Press (2000)
2. Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press (2009)
3. Chickering, D.M., Meek, C.: Finding optimal Bayesian networks. In: UAI (2002)
4. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.J.: A linear non-Gaussian acyclic model for causal discovery. JMLR 7, 2003–2030 (2006)
5. Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P.O., Bollen, K.: DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. JMLR 12, 1225–1248 (2011)
6. Hyvärinen, A.: Pairwise measures of causal directions in linear non-Gaussian acyclic models. JMLR W.&C.P. 13, 1–16 (2010)
7. Kawahara, Y., Bollen, K., Shimizu, S., Washio, T.: GroupLiNGAM: Linear non-Gaussian acyclic models for sets of variables. arXiv, 1006.5041v1 (June 2010)
8. Janzing, D., Hoyer, P.O., Schölkopf, B.: Telling cause from effect based on high-dimensional observations. In: ICML (2010)
9. Zscheischler, J., Janzing, D., Zhang, K.: Testing whether linear equations are causal: A free probability theory approach. In: UAI (2011)
10. Scheines, R., Spirtes, P.: Causal structure search: Philosophical foundations and problems. In: NIPS 2008 Workshop: Causality: Objectives and Assessment (2008)
11. Fisher, R.A.: Statistical Methods for Research Workers, 11th edn. Oliver and Boyd, London (1950)
12. Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.J.: A kernel statistical test of independence. In: NIPS (2008)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd edn. Springer (2008)
14. Shimizu, S.: Joint estimation of linear non-Gaussian acyclic models. Neurocomputing 81, 104–107 (2012)
Training Restricted Boltzmann Machines
with Multi-tempering: Harnessing Parallelization

Philemon Brakel, Sander Dieleman, and Benjamin Schrauwen

Department of Electronics and Information Systems, Ghent University,


Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
{philemon.brakel,sander.dieleman,benjamin.schrauwen}@elis.ugent.be

Abstract. Restricted Boltzmann Machines (RBM’s) are unsupervised probabilistic neural networks that can be stacked to form Deep Belief Networks. Given the recent popularity of RBM’s and the increasing availability of parallel computing architectures, it becomes interesting to investigate learning algorithms for RBM’s that benefit from parallel computations. In this paper, we look at two extensions of the parallel tempering algorithm, which is a Markov Chain Monte Carlo method to approximate the likelihood gradient. The first extension is directed at a more effective exchange of information among the parallel sampling chains. The second extension estimates gradients by averaging over chains from different temperatures. We investigate the efficiency of the proposed methods and demonstrate their usefulness on the MNIST dataset. Especially the weighted averaging seems to benefit Maximum Likelihood learning.

Keywords: Markov Chain Monte Carlo, Restricted Boltzmann Machines, Neural Networks, Machine Learning.

1 Introduction

Since the recent popularity of deep neural architectures for learning [2], Restricted Boltzmann Machines (RBM’s; [6,5]), which are the building blocks of Deep Belief Networks [7], have been studied extensively. An RBM is an undirected graphical model with a bipartite connection structure. It consists of a layer
of visible units and a layer of hidden units and can be trained in an unsupervised
way to model the distribution of a dataset. After training, the activations of the
hidden units can be used as features for applications such as classification or
clustering. Unfortunately, the likelihood gradient of RBM’s is intractable and
needs to be approximated.
Most approximations for RBM training are based on sampling methods.
RBM’s have an independence structure that makes it efficient to apply Gibbs
sampling. However, the efficiency of Gibbs sampling depends on the rate at which
independent samples are generated. This property is known as the mixing rate.
While Gibbs samplers will eventually generate samples from the true underlying

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 92–99, 2012.

© Springer-Verlag Berlin Heidelberg 2012

distribution they approximate, they can get stuck in local modes. This is especially problematic for distributions that contain many modes that are separated
by regions where the probability density is very low.
In this paper, we investigate two methods for improving both the mixing rate
of the sampler and the quality of the gradient estimates at each sampling step.
These two methods are extensions of the so-called Replica Exchange method and were recently proposed for statistical physics simulations [1]. The first extension allows every possible pair of replicas to swap positions to increase the
number of sampling chains that can be used in parallel. The second extension
is to use a weighted average of the replicas that are simulated in parallel. The
weights are chosen in a way that is consistent with the exchange mechanism.

2 Restricted Boltzmann Machines

An RBM defines an energy function that depends on the joint configuration of


a set of visible variables v and a set of hidden variables h. In an RBM where all
variables are binary, this energy function is given by
    E(v, h) = - \sum_{i=1}^{N_h} \sum_{j=1}^{N_v} W_{ij} h_i v_j - \sum_{i=1}^{N_h} h_i a_i - \sum_{j=1}^{N_v} v_j b_j,    (1)

where Nh and Nv are, respectively, the number of hidden and the number of visible units. The symbols W, a and b denote trainable weight and bias parameters. This function can be used to define a Gibbs probability distribution of the form p(v) = \sum_h e^{-E(v,h)} / Z, where Z is the partition function, given by Z = \sum_{h,v} e^{-E(v,h)}.
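For intuition, the energy and the partition function of Equation (1) can be written down directly; the brute-force enumeration of Z below is only feasible for tiny RBM's (function names are ours):

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, W, a, b):
    """Energy of a joint binary configuration (v, h), Eq. (1)."""
    return -(h @ W @ v + h @ a + v @ b)

def log_partition(W, a, b):
    """Exact log Z by enumerating all 2^(Nv+Nh) configurations (tiny RBM's only)."""
    Nh, Nv = W.shape
    energies = [rbm_energy(np.array(v), np.array(h), W, a, b)
                for v in product([0, 1], repeat=Nv)
                for h in product([0, 1], repeat=Nh)]
    return np.log(np.sum(np.exp(-np.array(energies))))
```

This exponential cost of Z is exactly why the second term of the gradient below must be approximated by sampling.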
The gradient of this likelihood function is given by

∂ln p(v)  ∂E(v, h)  ∂E(v, h)


=− p(h|v) + p(v, h) , (2)
∂θ ∂θ ∂θ
h v,h

where θ is an element in the set of parameters {W, a, b}. The first term of this
gradient can be evaluated analytically in RBM’s but the second term needs to
be approximated. This second term is the gradient of the partition function and
will be referred to as the model expectation.

3 Training RBM’s

The most commonly used training method for RBM’s is the Contrastive Divergence (CD; [6]) algorithm. During training, a Gibbs sampler is initialized at
a sample from the data and run for a couple of iterations. The last sample of
the chain is used to replace the intractable model expectation. This strategy
assumes that many of the low energy configurations, that contribute most to the
model expectation, can be found near the data. However, it is very likely that

there are many other valleys of low energy. Furthermore, the algorithm does not
necessarily optimize the likelihood function at all.
In Persistent Contrastive Divergence (PCD) learning [13], a Markov chain
is updated after every parameter update during training and used to provide
samples that approximate the model expectation. The difference with normal
CD is that the chain is not reset at a data point after every update, but keeps on
running so it can find low energy regions that are far away from the data. Given
infinite training time, this algorithm optimizes the true likelihood. However, as
training progresses and the model parameters get larger, the energy landscape
becomes more rough. This will decrease the size of the steps the chain takes and
increase the chance that the chain gets stuck in local modes of the distribution.
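The block-Gibbs update underlying both CD and PCD follows from Equation (1): the bipartite structure gives p(h_i = 1 | v) = σ(Σ_j W_ij v_j + a_i) and symmetrically for the visibles. A minimal sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One block-Gibbs sweep of a binary RBM: sample h | v, then v | h."""
    h = (rng.random(W.shape[0]) < sigmoid(W @ v + a)).astype(float)
    v = (rng.random(W.shape[1]) < sigmoid(W.T @ h + b)).astype(float)
    return v, h
```

CD runs a few such sweeps from a data point; PCD keeps the chain persistent across parameter updates.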
To obtain better mixing rates for the sampling chains in PCD, the Fast PCD
algorithm was proposed [12]. This algorithm uses a copy of the model that is
trained using a higher learning rate to obtain samples. The training itself is in
this case pushing chains out of local modes. Unfortunately, the training algorithm
is now not necessarily converging to the true likelihood anymore.
Another way to improve the mixing rate is Replica Exchange Monte Carlo
[11], also referred to as Parallel Tempering (PT). Recently, PT has been applied to RBM training as well [4]. This algorithm runs various chains in parallel
that sample from replicas of the system of interest that operate under different
temperatures. Chains that operate at lower temperatures can escape from local
modes by jumping to locations of similar energy that have been proposed by
chains that operate at higher temperatures. A serial version of this idea has also
been proposed for training RBM’s [9].
One downside of PT for training RBM’s is that the number of parallel sampling chains that can be used by this algorithm is limited. One can use many
chains in PT to cover more temperatures. This will cause more swaps between
neighbouring chains to be accepted because they are closer together. However,
it will also take more sequential updates before a certain replica moves back and
forth between the lowest and the highest temperatures. Another disadvantage of
PT is that only the chain with the lowest temperature is actually used to gather
statistics for the learning algorithm.

4 Multi-tempering

To increase the number of parallel chains that PT can effectively use, we propose
Multiple Replica Exchange methods for RBM training. These methods have
already been shown to work well in statistical physics [3,1]. To prevent the use
of very different names for similar algorithms, we will refer to this method as
Multi-Tempering (MT). Since MT is a modification of PT Markov Chain Monte
Carlo, it is necessary to describe the original algorithm in some more detail.
The idea behind PT is to run several Markov chains in parallel and treat
this set of chains as one big chain that generates samples from a distribution
with augmented variables. Transition steps in this combined chain can now also
include possible exchanges among the sub chains. Let X = {x1 , · · · , xM } be

the state of a Markov chain that consists of the states of M sub chains that operate under inverse temperatures {β_1, · · · , β_M}, where β_1 = 1 corresponds to the model we want to compute expectations for. The combined energy of this system is given by E(X) = \sum_{i=1}^{M} \beta_i E(x_i). The difference in total energy that
results from switching two arbitrary sub chains with indices i, j, is given by

E(X̂(i, j)) − E(X) = (βi − βj )(E(xj ) − E(xi )) , (3)

where X̂(·) denotes the new state of the combined chain that results from the
exchange indicated by its arguments¹. If i and j are selected uniformly and
forced to be neighbours, the Metropolis-Hastings acceptance probability is given
by rij = exp(E(X) − E(X̂(i, j))). This is the acceptance criterion that is used
in standard Parallel Tempering.
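A sketch of one standard PT exchange attempt between neighbouring chains, using the energy difference of Equation (3) in a Metropolis acceptance step (function name is ours):

```python
import numpy as np

def pt_swap(energies, betas, rng):
    """One PT exchange attempt between a random pair of neighbouring chains.

    energies: E(x_i) per chain; betas: inverse temperatures.
    Returns the resulting permutation of chain indices.
    """
    order = np.arange(len(betas))
    i = rng.integers(len(betas) - 1)
    j = i + 1
    # Eq. (3): total-energy change caused by exchanging replicas i and j
    dE = (betas[i] - betas[j]) * (energies[j] - energies[i])
    if rng.random() < np.exp(min(0.0, -dE)):   # Metropolis acceptance r_ij
        order[i], order[j] = order[j], order[i]
    return order
```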
In Multi-Tempering [1], index i is selected uniformly and index j is selected
with a probability that is based on the difference in total energy the proposed
exchange would cause:
    p(j|i) = \frac{r_{ij}}{\sum_{j'=1}^{M} r_{ij'}}.    (4)

The acceptance probability is now given by

    A(i, j) = \min\left\{ 1, \frac{\sum_{j'} e^{-E(\hat{X}(i,j'))}}{\sum_{k} e^{-E(\hat{X}(i,j,k))}} \right\}.    (5)
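Sampling the exchange partner according to Equation (4) amounts to a softmax over the negated energy differences of Equation (3); a numerically stabilized sketch (the chain itself, j = i, is included in the sum with r_ii = 1; the function name is ours):

```python
import numpy as np

def mt_select_partner(energies, betas, i, rng):
    """Sample an exchange partner j for chain i with probability p(j|i) of Eq. (4)."""
    energies = np.asarray(energies, dtype=float)
    betas = np.asarray(betas, dtype=float)
    # Eq. (3) evaluated for every candidate j (j = i gives dE = 0, r_ii = 1)
    dE = (betas[i] - betas) * (energies - energies[i])
    log_r = -dE - np.max(-dE)        # stabilize before exponentiating
    p = np.exp(log_r)
    return rng.choice(len(betas), p=p / p.sum())
```

Chains whose exchange would lower (or barely raise) the total energy are thus strongly favoured, which is what lets non-neighbouring replicas swap.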

5 Using a Weighted Average of the Chains

Given the selection probabilities p(j|i) from Equation 4 and the acceptance prob-
abilities A(i, j|X), one can compute a weighted average to estimate the gradient
of the intractable likelihood term. This average is given by
    g_1 = \sum_{j=1}^{M} \left[ (1 - A(i, j)) g(x_1) + A(i, j) g(x_j) \right] p(j|i),    (6)

where g(·) is short for ∂E(·)/∂θ. This extension was originally called Information Retrieval, but this term might lead to confusion in a Machine Learning context. We will refer to this version of the algorithm as Multi-Tempering with weighted averaging (MTw).
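Given per-chain gradient estimates and the probabilities above, the weighted average of Equation (6) is a convex blend of each chain's gradient with that of the base chain (function name is ours):

```python
import numpy as np

def mtw_gradient(grads, accept_probs, select_probs):
    """Weighted gradient estimate g_1 of Eq. (6).

    grads: (M, d) array of per-chain gradient estimates g(x_j);
    accept_probs, select_probs: A(i, j) and p(j|i) for the chosen i,
    both arrays of length M (select_probs sums to 1).
    """
    grads = np.asarray(grads, dtype=float)
    A = np.asarray(accept_probs)[:, None]
    p = np.asarray(select_probs)[:, None]
    # each term blends the base chain's gradient with chain j's, weighted by p(j|i)
    return np.sum(p * ((1.0 - A) * grads[0] + A * grads), axis=0)
```

With A(i, j) = 0 everywhere this reduces to the base chain's gradient, and with a single certain, always-accepted partner it returns that partner's gradient.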

6 Experiments

All experiments were done on the MNIST dataset. This dataset is a collection of
70, 000 28 × 28 grayscale images of handwritten digits that has been split into a
¹ So X̂(i, j, k) would mean that i is first swapped with j and subsequently, the sample at position j is swapped with the one at position k.

train set of 50000 images and test and validation sets of 10000 images each. The
pixel intensities were scaled between 0 and 1 and interpreted as probabilities
from which binary values were sampled whenever a datapoint was required.
First, it was investigated how the MT and the PT algorithms behave with different numbers of parallel chains by looking at the rate at which replicas travel from the highest temperature chain to the one with the lowest temperature. Ten RBM’s with 500 hidden units were trained with PCD using a linearly decaying learning rate with a starting value of .002 for 500 epochs. Subsequently, both
sampling methods were run for 10000 iterations and the number of times that
a replica was passed all the way from the highest to the lowest temperature
chain was counted. This experiment was done for different numbers of parallel
chains. The inverse temperatures were uniformly spaced between .8 and 1. In
preliminary experiments, we found that almost no returns from the highest to
the lowest temperature occurred for any algorithm for much larger intervals.
The second experiment was done to get some insight into the mixing rates of the sampling methods and their success at approximating the gradient of the partition function. A small RBM with 15 hidden units was trained on the MNIST dataset using the PCD algorithm. The different sampling methods were then run for 20000 iterations while their estimates of the gradient were compared with the true gradient, which had been computed analytically. Because the success of
the samplers partially depends on their random initialization, we repeated this
experiment 10 times.
Finally, to see how the different sampling algorithms perform at actual training, a method called annealed importance sampling (AIS) [8,10] was used to estimate the likelihood of the data under the trained models. PCD, PT, MT and MTw were each used to train 10 RBM models on the train data for 500 epochs. Each method used 100 chains in parallel. The inverse temperatures for the Tempering methods were linearly spaced between .85 and 1, as we expected a slightly more conservative temperature range would be needed to make PT competitive. We used no weight decay and the order of magnitude of the starting
learning rates was determined using a validation set. The learning rate decreased
linearly after every epoch.

7 Results and Discussion

Fig. 1 displays the results of the first experiment. The number of returns is a lot
higher for MT at the start and seems to go down at a slightly slower rate than
for PT. This allows a larger number of chains to be used before the number of
returns becomes negligible.
As Fig. 2 shows, the MT estimator was the most successful at approximating
the gradient of the partition function of the RBM with 15 hidden units. To
our surprise, the MT estimator also performed better than the MTw estimator.
However, it seems that the algorithms that use a single chain to compute the
expectations (MT and PT) fluctuate more than the ones that use averages
(MTw and PCD).
Training Restricted Boltzmann Machines with Multi-tempering 97

Fig. 1. Number of returns for parallel tempering and multiple replica exchange as a
function of the number of parallel chains that are used

Fig. 2. Mean Square Error (MSE) between the approximated and the true gradients of
the partition function of an RBM with 15 units as a function of the number of samples

Table 1. Means and standard deviations of the AIS estimates of the likelihood of the
MNIST test set for different training methods. Means are based on 10 experiments
with different random initializations.

Epochs  MTw             MT             PT              PCD
250     −82.25 (10.33)  −92.59 (7.79)  −93.48 (11.54)  −94.43 (1.71)
500     −65.09 (7.66)   −83.74 (6.76)  −84.18 (7.79)   −80.45 (11.36)

Table 1 displays the AIS estimates of the likelihood of the MNIST test set
for each of the training methods. MTw outperforms all other methods on this
task. The standard deviations of the results are quite high, and MT, PT and
PCD do not seem to differ much in performance. The fact that MT and PT use
only a single chain to estimate the gradient seems to be detrimental. This is
98 P. Brakel, S. Dieleman, and B. Schrauwen

not in line with the results for the gradient estimates for the 15-unit RBM. It
could be that larger RBMs benefit more from the higher stability of gradient
estimates that are based on averages than small RBMs do. The results suggest that
PCD with averaged parallel chains is preferable to Tempering algorithms that
use only a single chain as an estimate, due to its relative simplicity, but that MTw
is an interesting alternative.

Fig. 3. Plot of inter-chain replica exchanges for MT. (a) Matrix of exchange
frequencies cut off at 100. (b) Binarized matrix of exchanges.

During MT training, we also recorded the transition indices for further inspection.
There are clearly many exchanges that are quite large, as can be seen in
Fig. 3a, which shows a matrix in which each entry (i, j) represents the number
of times that a swap occurred between chains i and j. While there seems to be
a bottleneck that is difficult to cross, it is clear that some particles still make it
to the other side once in a while. In Fig. 3b, one can see that occasionally some
very large jumps occur that span almost the entire temperature range.

8 Conclusion
We proposed two methods to improve Parallel Tempering training for RBMs and
showed that the combination of the two methods leads to improved performance
on learning a generative model of the MNIST dataset. We also showed that the
MTw algorithm allows more chains to be used in parallel and directly improves
the gradient estimates for a small RBM. While the weighted average did not seem
to improve the mixing rate, it seemed to stabilize training. For future work, it
would be interesting to see how the sampling algorithms compare when the
RBMs are used for pre-training a Deep Belief Network.

Acknowledgments. The work presented in this paper is funded by the EC
FP7 project ORGANIC (FP7-231267).

References
1. Athènes, M., Calvo, F.: Multiple-Replica Exchange with Information Retrieval.
ChemPhysChem 9(16), 2332–2339 (2008)
2. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine
Learning 2(1), 1–127 (2009), also published as a book. Now Publishers (2009)
3. Brenner, P., Sweet, C.R., VonHandorf, D., Izaguirre, J.A.: Accelerating the Replica
Exchange Method through an Efficient All-Pairs Exchange. The Journal of Chem-
ical Physics 126(7), 074103 (2007)
4. Desjardins, G., Courville, A.C., Bengio, Y., Vincent, P., Delalleau, O.: Tempered
Markov chain Monte Carlo for training of restricted Boltzmann machines. Journal
of Machine Learning Research - Proceedings Track 9, 145–152 (2010)
5. Freund, Y., Haussler, D.: Unsupervised Learning of Distributions on Binary Vectors
Using Two Layer Networks. Tech. rep., Santa Cruz, CA, USA (1994)
6. Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Divergence.
Neural Computation 14(8), 1771–1800 (2002)
7. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
8. Neal, R.M.: Annealed importance sampling. Statistics and Computing 11, 125–139
(1998)
9. Salakhutdinov, R.: Learning in Markov random fields using tempered transitions.
In: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (eds.)
NIPS, pp. 1598–1606. Curran Associates, Inc. (2009)
10. Salakhutdinov, R., Murray, I.: On the quantitative analysis of Deep Belief Net-
works. In: McCallum, A., Roweis, S. (eds.) Proceedings of the 25th Annual Inter-
national Conference on Machine Learning (ICML 2008), pp. 872–879. Omnipress
(2008)
11. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo Simulation of Spin-Glasses.
Physical Review Letters 57(21), 2607–2609 (1986)
12. Tieleman, T., Hinton, G.: Using Fast Weights to Improve Persistent Contrastive Di-
vergence. In: Proceedings of the 26th International Conference on Machine Learn-
ing, pp. 1033–1040. ACM, New York (2009)
13. Tieleman, T.: Training restricted Boltzmann machines using approximations to the
likelihood gradient. In: Proceedings of the International Conference on Machine
Learning (2008)
A Computational Geometry Approach
for Pareto-Optimal Selection of Neural Networks

Luiz C.B. Torres, Cristiano L. Castro, and Antônio P. Braga

Federal University of Minas Gerais, Department of Electronics Engineering,
Av. Antonio Carlos 6627, Pampulha, 30161-970 Belo Horizonte, MG, Brazil
{luizlitc,crislcastro}@gmail.com, apbraga@ufmg.br

Abstract. This paper presents a Pareto-optimal selection strategy for
multiobjective learning that is based on the geometry of the separation
margin between classes. The Gabriel Graph, a method borrowed from
Computational Geometry, is constructed in order to obtain margin pat-
terns and class borders. From the border edges, a target separator is
derived in order to obtain a large-margin classifier. The model selected
from the generated Pareto-set is the one that is closest to the target
separator. The method presents robustness on both synthetic and real
benchmark datasets. It is efficient for Pareto-optimal selection of neural
networks, and no claim is made that the obtained solution is equivalent
to a maximum margin separator.

Keywords: decision-making, multiobjective machine learning, Gabriel
graph, classification.

1 Introduction
Multi-objective (MOBJ) learning of Artificial Neural Networks (ANNs) provides
an alternative approach for implementing Structural Risk Minimization (SRM)
[1]. Its basic principle is to explicitly minimize two separate objective functions,
one related to the empirical risk (training error) and the other to the network
complexity, usually represented by the norm of the network weights [3,4,5,6]. It
is known from Optimization Theory, however, that the minimization of these two
conflicting objective functions does not yield a single minimum but results, instead,
in a set of Pareto-optimal (PO) solutions [10]. Similarly to Support Vector Ma-
chines (SVMs) [11] and other regularization learning approaches, the choice of a
PO solution is analogous to selecting the regularization parameter, which pro-
vides a balance between smoothness and dataset fitness. The selection of the PO
solution and of the regularization parameter in SVMs should be accomplished
according to an additional decision criterion. In SVM learning, cross-validation is
often adopted.
Some PO selection strategies have been proposed in the literature in the con-
text of MOBJ learning. Current approaches include searching the Pareto-set for

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 100–107, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Computational Geometry Approach for Pareto-Optimal Model Selection 101

the solution that minimizes the error on a validation set [4], making assumptions
about the error distribution based on prior knowledge [7], and assuming that the
error and the approximation function are uncorrelated [2]. However, these strategies
are only valid under restricted conditions and cannot be regarded as general. In
any case, the selection method embodies the criterion that guides the search
towards a given kind of solution. For instance, margin maximization is a well-
accepted criterion for model selection in classification and regression problems.
Nevertheless, margin maximization with SVMs depends on setting the regular-
ization parameter first, since the solution of the corresponding quadratic opti-
mization problem can only be accomplished after the learning and kernel parameters
are set [11].
In this paper we present a parameterless PO selection method that is based on
the geometrical definition of the separation margin, which is estimated according
to concepts borrowed from Computational Geometry [8]. The Gabriel Graph
(GG) [14] is adopted in order to construct a model of the graph formed by the
input data and their relative distances. Once the graph model is constructed,
it is possible to identify the patterns that lie in the separation margin and
then to point out the PO solution that maximizes a given smoothness metric
defined according to the margin patterns. Results presented in this paper show
that its performance on benchmark UCI datasets is similar to that obtained
by SVMs and LS-SVMs (Least Squares Support Vector Machines) on the same
data.
The remainder of this paper is organized as follows. Section 2 presents the
underlying principles of decision making in MOBJ learning and the main mo-
tivations for PO selection. Section 3 extends the MOBJ section and shows the
main principles of multi-criteria decision making. Section 4 presents the quality
function proposed in this paper, followed by results and conclusions in the final
two sections.

2 MOBJ Learning

It is well accepted that the general supervised learning problem can be for-
mulated as the minimization of two, sometimes conflicting, objective functions,
one related to the learning set error and the other related to the model
complexity [1]. This general formulation of learning thus has a bi-objective
nature since, for most problems, there is no single set of model parameters
that minimizes the two objectives concurrently. In any case, the two objectives
should be minimized, and the learning problem, according to this formulation,
can be stated as: “find the minimum complexity model that fits the learning set
with minimum error”. Learning algorithms differ in how this general statement
is implemented and many approaches that attempt to solve the problem have
appeared in the literature in the last decades. However, after the widespread
acceptance of Statistical Learning Theory (SLT) [1] as a general framework for
learning, the popularity of SVMs and the formal proof that the “magnitude of
the weights is more important than the number of weights” [12], algorithms that
102 L.C.B. Torres, C.L. Castro, and A.P. Braga

minimize both the learning set error and the norm of the network weights became
popular for ANN learning. For instance, MOBJ learning [4] can be described
according to the multi-objective formulation that follows.
Given the dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, MOBJ learning aims at solving the
optimization problem of Equation (1) [4]:

$$\min \left\{ \begin{array}{l} J_1(w) = \sum_{i=1}^{N} \left( y_i - f(x_i, w) \right)^2 \\ J_2(w) = \|w\| \end{array} \right. \qquad (1)$$

where $f(x_i, w)$ is the output of the model for the input pattern $x_i$, $w$ is the
vector of network parameters (weights), $y_i$ is the target response for $x_i$, and
$\|\cdot\|$ is the Euclidean norm operator.
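For a set of candidate weight vectors, the two objectives of Equation (1) and the non-dominated (Pareto-optimal) subset can be computed directly. A minimal sketch for a linear-in-parameters model (the names and data below are our own, purely illustrative):

```python
import numpy as np

def objectives(W, Phi, y):
    """J1 (sum of squared errors) and J2 (weight norm) for each row of W."""
    errs = Phi @ W.T - y[:, None]                 # (samples, candidates)
    return (errs**2).sum(axis=0), np.linalg.norm(W, axis=1)

def pareto_mask(j1, j2):
    """True for candidates that are non-dominated in (J1, J2)."""
    dominated = [np.any((j1 < j1[i]) & (j2 <= j2[i])) or
                 np.any((j1 <= j1[i]) & (j2 < j2[i])) for i in range(len(j1))]
    return ~np.array(dominated)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 3))                    # fixed basis outputs
y = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 50)
W = rng.normal(size=(200, 3))                     # candidate weight vectors
j1, j2 = objectives(W, Phi, y)
front = pareto_mask(j1, j2)
print(front.sum(), "non-dominated candidates")
```

A candidate is dominated when some other candidate is at least as good on both objectives and strictly better on one; the remaining candidates form the Pareto-set discussed next.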

3 Decision Making

The generated PO solutions are all optimal according to Equation (1) [10], so
the choice of any of them would be acceptable from the optimization
point of view. They differ, however, in how they trade off the two objectives.
The two extremes of the Pareto-set are $\left( J_1(w), \min J_2(w) \right)$ and
$\left( \min J_1(w), J_2(w) \right)$, so as we move from one extreme to the other, one
objective increases while the other decreases. This can be clearly seen in Fig.
1(a), where PO solutions are shown for a binary classification problem. As can be
observed, the solutions vary in smoothness ($\|w\|$) and adaptation to the training
set ($\sum_{i=1}^{N} (y_i - f(x_i, w))^2$) from one extreme to the other. The goal of the
decision-making strategy is to pick one of them.
Since the optimization problem was fully defined by Equation (1), which is
satisfied by all the PO solutions, an additional criterion is needed in order to
measure the quality of each solution and then to select the one that maximizes
it. In general, the multi-criteria decision-making problem can
be described by Equation (2) [7]:

$$w^* = \arg\max_{w \in W} f_e \qquad (2)$$

where $f_e$ is a function that is able to assess the quality of PO solutions. In the
next section a quality measure function is proposed for classification problems
based on the geometry of the separation margin.

4 Quality Function for PO Selection

At this point we aim at a quality function for selecting PO solutions of binary
classification problems. It is well accepted that, for this kind of problem, the
discrimination function should maximize the separation margin between classes.
However, the margin width in learning machines like SVMs, for instance, is usually
given by a pre-established margin parameter which, in turn, is selected by
an external quality function, e.g., simple inspection or cross-validation. We aim

to obtain a quality function that does not depend on external parameters and
that can be assessed directly from the dataset.
The concept of separation margin is well understood, especially for a separable
dataset. It is defined by the distance of the nearest patterns, or support vectors
in SVMs’ terminology, of each class to the separation hyperplane in feature space
[1]. The hyperplane should separate the dataset evenly or, in other words, the
distances of the support vectors to the separation hyperplane should be maximal
and equal for the two classes. When the dataset is not linearly separable in feature
space, slack variables determine the tolerance allowed in the overlapping region
between classes. In practice, the effect of formulating the problem according to
slack variables is to transform the problem into a linearly separable one, so that
the margin concept above can be applied. Our quality measure function aims,
therefore, at identifying the patterns that are in the overlapping region directly
from the dataset by applying concepts from Computational Geometry. Once the
overlapping patterns are identified, similarly to the slack variables formulation,
they are not considered in margin estimation and PO selection.
Considering that the PO solutions have already been generated, the proposed
selection strategy is accomplished in three distinct phases. The first one aims at
identifying the separation region between the two classes. This is carried out by
identifying the edges of a Gabriel Graph [8] that have patterns from different
classes at their vertices. The corresponding patterns at the extremes of the border
edges are analogous to the support vectors of an SVM, although we should make
it clear that we do not claim that they correspond exactly to the actual support
vectors that would have been obtained from an SVM solution. They will simply
be called border patterns here, although their importance is similar to that of the
support vectors of SVMs. Our selection strategy aims at choosing the maximum
margin separator from the PO solutions or, in other words, the one closest to the
mean of the border patterns. So, in the second phase the mean vector of each pair of
border patterns is obtained, so that the selection procedure can be accomplished
in the last phase. Each of the three phases is described next.

– Phase 1. Separation region
  1. Gabriel Graph. Obtain the Gabriel Graph GG of the training set
     $D = \{(x_i, y_i)\}_{i=1}^{N}$ with vertices formed by $\{x_i\}_{i=1}^{N}$,
     i.e., $V = \{x_i \mid i = 1 \ldots N\}$, and edges $E$ satisfying the
     condition of Expression (3):

$$(v_i, v_j) \in E \leftrightarrow \delta^2(v_i, v_j) \leq \delta^2(v_i, z) + \delta^2(v_j, z) \quad \forall\, z \in V,\ v_i, v_j \neq z \qquad (3)$$

     where $\delta(\cdot, \cdot)$ is the Euclidean distance operator.


  2. Eliminate overlapping. Eliminate vertex $x_i \in V$ from $V$ if the majority
     of the pattern-vertices connected to $x_i$ by an edge belong to the class
     opposite to that of $x_i$. Steps 1 and 2 are repeated until no more
     patterns to be eliminated are found.
  3. Interclass edges. Find the interclass edges $B_r$ by selecting all edges
     whose vertices belong to opposite classes.

– Phase 2. Margin Target
  1. Target separator. For each edge $(x_i, x_j) \in B_r$, calculate the mean
     vector $(x_i + x_j)/2$ of the two vertices $x_i$ and $x_j$. The set $P_M$
     is then formed by the mean vectors of all edges belonging to $B_r$.
– Phase 3. PO Selection
  1. PO selection. Considering that $W = \{w_k \mid k = 1 \ldots L\}$ is the set of all
     PO solutions, select the one that is closest to the elements of $P_M$.
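The three phases above can be sketched directly from the definitions. The following NumPy sketch (our own naming and toy data; a brute-force O(N³) edge test, with the overlap-elimination step of Phase 1 omitted for brevity) builds the Gabriel graph, collects the interclass midpoints, and selects the candidate model closest to them, where "closest" is taken here as the smallest mean absolute decision value at the midpoints — a simplified reading of Phase 3:

```python
import numpy as np

def gabriel_edges(X):
    """Edges (i, j) with no z violating Expression (3)."""
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    N = len(X)
    edges = []
    for i in range(N):
        for j in range(i + 1, N):
            others = [z for z in range(N) if z != i and z != j]
            if np.all(D2[i, j] <= D2[i, others] + D2[j, others]):
                edges.append((i, j))
    return edges

def select_po(X, y, models):
    """Pick the model whose decision function is closest to the border midpoints."""
    border = [(i, j) for i, j in gabriel_edges(X) if y[i] != y[j]]
    mids = np.array([(X[i] + X[j]) / 2.0 for i, j in border])
    scores = [np.abs(f(mids)).mean() for f in models]  # mean |f| at midpoints
    return int(np.argmin(scores))

# Toy usage: two Gaussian blobs and three candidate linear separators.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.3, (20, 2)), rng.normal(1.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
models = [lambda P, b=b: P @ np.ones(2) + b for b in (-0.5, 0.0, 0.5)]
best = select_po(X, y, models)
print("selected model:", best)
```

Because the Gabriel graph contains the Euclidean minimum spanning tree, at least one interclass edge always exists when both classes are present, so the midpoint set is never empty.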


Fig. 1. (a) Pareto-optimal solutions for a binary classification problem. (b) Border
edges, mean separation vectors and the closest PO solution. (c) Chosen solution
from the Pareto-set.

5 Results
Before demonstrating the efficiency of the method on benchmark problems, we
first show the results for a two-dimensional synthetic dataset, known as the “two-
moons” problem. This example is interesting because the non-Gaussian class
distributions present an additional challenge for classification models and also
because the actual graph for PO selection can be visualized. The results are
shown in Fig. 2. The dataset for the classification problem is shown
in Fig. 2(a), the corresponding Gabriel Graph in Fig. 2(b), and the final solution
selected from the Pareto-set in Fig. 2(c), where the solution obtained from a
validation dataset is also shown. It can be observed that the Gabriel Graph
solution has a larger margin than the one obtained with validation.
Next, experiments were carried out with the following binary datasets from
the UCI repository: Statlog Australian Credit (acr), Statlog German Credit (gcr),
Statlog Heart Disease (hea), Ionosphere (ion), Pima Indians Diabetes (pid), Sonar
(snr) and Wisconsin Breast Cancer (wbc). Table 1 shows the characteristics
of each dataset, where N_Tr/Vc is the amount of data used for training or
cross-validation, N_test is the test set size and N is the total number of samples.
The numbers of numerical and categorical attributes are denoted by n_num and
n_cat respectively, and n is the total number of attributes. All datasets were
normalized to mean x̄ = 0 and standard deviation σ = 1. In order to achieve
representative results, 10 random permutations were generated for each dataset.
Then, each permutation was split into training (or cross-validation) (2/3) and
test (1/3) subsets.
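The evaluation protocol above — 10 random permutations, each split into 2/3 for training or cross-validation and 1/3 for testing — can be sketched as follows (a hypothetical helper of our own, not code from the paper):

```python
import numpy as np

def permutation_splits(n_samples, n_permutations=10, train_frac=2 / 3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(n_samples * train_frac))
    for _ in range(n_permutations):
        perm = rng.permutation(n_samples)
        yield perm[:n_train], perm[n_train:]

splits = list(permutation_splits(690))   # e.g. the acr dataset, N = 690
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 460 230
```

The split sizes reproduce the N_Tr/Vc and N_test values of Table 1 for the acr dataset.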

(a) (b) (c)

Fig. 2. (a) Dataset for the two-moons classification problem. (b) Gabriel Graph. (c)
Selected solution with Gabriel Graph compared with the one selected according to a
validation set.

Table 1. Characteristics of databases

          acr   gcr   hea   ion   pid   snr   wbc
N_Tr/Vc   460   666   180   234   512   138   455
N_test    230   334    90   117   256    70   228
N         690  1000   270   351   768   208   683
n_num       6     7     7    33     8    60     9
n_cat       8    13     6     0     0     0     0
n          14    20    13    33     8    60     9

The results were compared with the benchmarks of the LS-SVM algorithm
presented in [9], and also with an SVM implemented with the LIBSVM
toolbox [15]. According to [9], an RBF kernel was selected for the LS-SVMs; the
regularization parameter γ and the kernel parameter ϕ were selected with a 10-fold
cross-validation grid-search procedure. In the case of the SVMs, we used the standard
C-SVC formulation with an RBF kernel. The same methodology (grid search with
10-fold cross-validation) was used to choose the corresponding SVM γ and ϕ
parameters. The best parameters for each dataset are shown in Table 2.
The results obtained with the datasets of Table 1 for SVMs and LS-SVMs
were then compared with those obtained with multiobjective learning of Multi-
layer Perceptrons (MLPs) [4]. The final selection strategy from the PO solutions
was the one described in Section 4. Mean accuracy and standard deviation for

Table 2. Values of parameters for the RBF Kernel

                   acr     gcr     hea     ion     pid     snr     wbc
LS-SVM: ϕ          22.75   31.25   5.69    3.30    240.00  33.00   6.97
LS-SVM: log10(γ)   0.09    2.43    −0.76   0.63    3.04    0.86    −0.66
SVM: ϕ             512     32768   8192    32768   2       32768   8192
SVM: log10(γ)      −2.70   2.30    −4.51   −0.9    −2.10   −1.5    −4.51

Table 3. Results
               acr           gcr           hea          ion           pid           snr           wbc
LS-SVM (RBF)   87.0 (2.1)    76.3 (1.4)    84.7 (4.8)   96.0 (2.1)    76.8 (1.7)    73.1 (4.2)    96.4 (1.0)
MOBJ (Margin)  87.79 (0.88)  78.13 (0.61)  87.3 (2.3)   88.12 (2.39)  76.04 (2.38)  76.28 (1.11)  97.05 (1.01)
SVM            86.24 (0.88)  75.86 (2.20)  83.08 (3.10) 93.86 (3.31)  77.25 (1.20)  75.82 (1.11)  97.02 (1.55)

all methods are presented in Table 3. Although a statistical test was not per-
formed, since comparing the results numerically was not the main goal of this
paper, inspection of Table 3 shows that the methods have similar perfor-
mances on all datasets.

6 Discussions and Conclusions

As a result of solving the problem defined by Equation (1), the learning algorithm
generates an estimate of the set of non-dominated PO solutions [10], which
optimally trade off learning error ($J_1(w)$) and model complexity ($J_2(w)$). The
next step of learning is, therefore, to choose one of the generated PO solutions.
It should be made clear, however, that the need to choose among the non-
dominated solutions appears in all methods that are described according to the
model complexity × error minimization principle. It is intrinsic to the formulation
laid down by SLT and not to the MOBJ approach itself, since learning, according
to these principles, is in fact a trade-off problem due to the conflicting nature
of the two objectives. The selection procedure appears, therefore, also in other
learning approaches, such as SVMs and regularization learning [13]. At some
point an external, ad hoc arbitration must be adopted.
We do not aim to go further into the statistical analysis of the results of
Table 3, since our purpose was not to present a method that is numerically
superior to others. Instead, we presented a method to select a large-margin
classifier from the PO solutions of a multiobjective learning method for MLPs.
The resulting classifiers indeed have performance similar to SVMs and LS-SVMs,
which had their parameters set with exhaustive search and cross-validation. We
do claim, however, that our selection method does not have sensitive parameters,
since the margin and border patterns are obtained geometrically and directly from
the dataset without any a priori parameter setting.

References

1. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience (1998)


2. Teixeira, R.A., Braga, A.P., Saldanha, R.R., Takahashi, R.H.C., Medeiros, T.H.:
The Usage of Golden Section in Calculating the Efficient Solution in Artificial Neu-
ral Networks Training by Multi-objective Optimization. In: de Sá, J.M., Alexandre,
L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 289–298.
Springer, Heidelberg (2007)
Computational Geometry Approach for Pareto-Optimal Model Selection 107

3. Jin, Y., Sendhoff, B.: Pareto-Based Multiobjective Machine Learning: An Overview


and Case Studies. IEEE Transactions on Systems Science and Cybernetics 39, 373
(2009)
4. Teixeira, R.A., Braga, A.P., Takahashi, R.H.C.: Improving generalization of MLPs
with multi-objective optimization. Neurocomputing 35, 189–194 (2000)
5. Costa, M.A., Braga, A.P., Menezes, B.R.: Improving generalization of MLPs with
multi-objective sliding mode control and the Levenberg-Marquardt algorithm.
Neurocomputing 70, 1342–1347 (2007)
6. Kokshenev, I., Braga, A.P.: An efficient multi-objective learning algorithm for RBF
neural network. Neurocomputing 37, 2799–2808 (2010)
7. Medeiros, T.H., Takahashi, H.C.R., Braga, A.: A Incorporação do Conhecimento
Prévio na Tomada de Decisão do Aprendizado Multiobjetivo Congresso Brasileiro
de Redes Neurais - Inteligência Computacional 9, 25–28 (2009)
8. Berg, M., Kreveld, M.V., Overmars, M., Schwarzkopf, O.: Computational Geome-
try: Algorithms and Applications. Springer (2000)
9. Gestel, T., Suykens, J.A.K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G.,
De Moor, B., Vandewalle, J.: Benchmarking least squares support vector machine
classifiers. Machine Learning 54, 5–32 (2004)
10. Sawaragi, Y., Nakayama, H., Tanino, T.: Theory of multiobjective optimization,
vol. 176. Elsevier Science (1985)
11. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297
(1995)
12. Bartlett, P.L.: For valid generalization, the size of the weights is more important
than the size of the network. In: Advances in Neural Information Processing Sys-
tems, pp. 134–140. Morgan Kaufmann Publishers (1997)
13. Lawson, C.L., Hanson, R.J.: Solving least squares problems. Society for Industrial
Mathematics 15 (1995)
14. Sánchez, J., Pla, F., Ferri, F.: Prototype selection for the nearest neighbour rule
through proximity graphs. Pattern Recognition Letters 18(6), 507–513 (1997)
15. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Learning Parameters of Linear Models
in Compressed Parameter Space

Yohannes Kassahun¹, Hendrik Wöhrle², Alexander Fabisch¹, and Marc Tabie¹

¹ Robotics Group, University of Bremen, Bremen, Germany
² Robotics Innovation Center, DFKI GmbH, Bremen, Germany

Abstract. We present a novel method of reducing the training time by
learning the parameters of a model at hand in compressed parameter space.
In compressed parameter space the parameters of the model are rep-
resented by fewer parameters, and hence training can be faster. After
training, the parameters of the model can be generated from the pa-
rameters in compressed parameter space. We show that for supervised
learning, learning the parameters of a model in compressed parameter
space is equivalent to learning the parameters of the model in compressed
input space. We have applied our method to a supervised learning do-
main and show that a solution can be obtained at a much faster speed than
learning in uncompressed parameter space. For reinforcement learning,
we show empirically that searching directly for the parameters of a policy
in compressed parameter space accelerates learning.

Keywords: Compressed Sensing, Supervised Learning, Reinforcement
Learning.

1 Introduction
Many real-world applications make use of high-dimensional inputs, e.g., images
or multi-dimensional electroencephalography (EEG) signals. If we want to ap-
ply machine learning to such applications, we usually have to optimize a large
number of parameters. The optimization process requires a large amount of com-
putational resources and/or a long training time.
One way of reducing the training time is to first project the input signal
onto a subspace of lower dimension. This idea is motivated by work in the
area of compressed sensing [3]. Another way of reducing the training time is to
learn the parameters of the model at hand in compressed parameter space. This
approach, which we consider in this paper, allows us to optimize fewer parameters
without affecting the input data.
The paper is organized as follows: we first give a review of related work,
then continue with the discussion of the closed form solution for weighted sum

This work was supported by the German Bundesministerium für Wirtschaft und
Technologie (BMWi, grant FKZ 50 RA 1012 and grant FKZ 50 RA 1011).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 108–115, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Learning Parameters of Linear Models in Compressed Parameter Space 109

of squared errors in compressed parameter space for regression, and show its
equivalence to learning in compressed input space. Afterwards we discuss learn-
ing in compressed parameter space for classification problems and proceed to
learning in compressed parameter space for reinforcement learning. Finally, we
present some preliminary experimental results.

2 Review of Related Work


For supervised learning, several works have shown that learning in com-
pressed input space is possible. Most of them were developed in the context of
compressed sensing [3]. One of them uses support vector machines (SVM) in
compressed input space for classification and shows that SVM has good gener-
alization properties when used in compressed input space (measurement space)
[1]. Haupt et al. [7] developed a method of classification of signals in compressed
input space. Their method was able to classify signals corrupted with noise and
verified using a classification problem of chirp and sinusoidal signals. Davenport
et al. [2] developed a method for classification of images in compressed input
space. They have shown that it is possible to achieve good classification per-
formance with few measurements without first reconstructing the input image.
Maillard and Munos [11] considered the problem of learning a regression function
in a linear space of high dimension using projections onto a random subspace of
lower dimension. Zhou et al. considered the problem of learning a linear function
in the compressed input space [15]. They also investigated learning a regression
function in a linear space of high dimension using projections onto a random sub-
space of lower dimension. In this paper we show that for linear models, learning
in compressed parameter space is equivalent to learning in compressed input
space, and hence the results which have been developed for compressed input
space can be applied to the compressed parameter space.
We are aware of the works by Koutník et al. [9,10], which apply learning in
compressed parameter space to evolutionary reinforcement learning. The work
in [10] uses an evolutionary algorithm called Cooperative Synapse NeuroEvolu-
tion (CoSyNE) [5] to optimize the parameters in compressed parameter space.
The method has been tested on reinforcement learning tasks, and it has been
shown that it outperforms other standard methods for completely observable do-
mains. In [9] an instance of Practical Universal Search [12] called Universal
Network Search is used to obtain a solution. In this paper, we combine learning
in compressed parameter space with an augmented neural network to accelerate
learning, especially for continuous-state, partially observable domains.

3 Learning in Compressed Parameter Space for Regression
In this section, we consider linear models of the form

$$y(x; w) = \sum_{k=1}^{N} w_k \phi_k(x) + w_0 = \sum_{k=0}^{N} w_k \phi_k(x) = w^T \phi(x), \qquad (1)$$
110 Y. Kassahun et al.

where $w$ is a weight (parameter) vector of length $N + 1$, and
$\phi(x) = [1, \phi_1(x), \phi_2(x), \ldots, \phi_N(x)]^T$ has nonlinear, fixed basis
functions as its components. We show that for linear models there is a closed-form
solution for the weighted sum of squared errors in compressed parameter space, and
that this solution is equivalent to learning in a compressed input space of lower
dimension. Let us assume that we have a training set given by

    T = {(x^(1), c^(1)), (x^(2), c^(2)), . . ., (x^(N), c^(N))}.   (2)

Assume further that x^(n) is a vector of length L and c^(n) ∈ ℝ, where n ∈
{1, 2, . . ., N}. We consider the weighted sum of squared errors given by


    e = (1/2) Σ_n λ_n (w^T φ(x^(n)) − c^(n))^2,   (3)

where Σ_n λ_n = 1.¹ For learning in compressed parameter space, we approximate
w_k using

    w_k = α_0 ϕ_0(t_k) + α_1 ϕ_1(t_k) + α_2 ϕ_2(t_k) + · · · + α_M ϕ_M(t_k),   (4)

where M ≤ N + 1 and {ϕ_0, ϕ_1, ϕ_2, . . ., ϕ_M} forms an orthogonal set of basis
functions over the interval [−1, 1], and t_k ∈ [0, 1] is a parametrization variable,
where t_k = k/N and k ∈ {0, 1, 2, . . ., N}. If we assume that the weight function is an
even function, it suffices to consider only the interval [0, 1]. For example, the set
{1, cos(πt), cos(2πt), . . ., cos(Mπt)} forms an orthogonal set of basis functions
over the interval [−1, 1]. If we define a vector ϕ_m as

    ϕ_m = [ϕ_m(0), ϕ_m(1/N), ϕ_m(2/N), . . ., ϕ_m(k/N), . . ., ϕ_m(1)]^T,   (5)

with k ∈ {0, 1, 2, . . ., N} and m ∈ {0, 1, 2, . . ., M}, then we can write

    d_m^(n) = ϕ_m^T φ(x^(n)).   (6)

Let d^(n) = [d_0^(n), d_1^(n), . . ., d_M^(n)]^T and α = [α_0, α_1, . . ., α_M]^T. The
weighted sum of squared errors is now given by

    e = (1/2) Σ_n λ_n (α^T d^(n) − c^(n))^2.   (7)
One can see that this is equivalent to learning in a compressed input space of
lower dimension, where the training set is given by

    T_c = {(d^(1), c^(1)), (d^(2), c^(2)), . . ., (d^(N), c^(N))}.   (8)

This view of the problem has the advantage that it can easily be applied to classifiers
with more complex optimization algorithms, such as support vector machines

¹ Other forms of error functions can also be considered.
Learning Parameters of Linear Models in Compressed Parameter Space 111

(SVM). Because of the reduced dimension, we obtain solutions in compressed
parameter space faster than in uncompressed parameter space. Let Ω be an
(M + 1) × (M + 1) matrix and γ be a vector of length M + 1. If we differentiate
the error e with respect to α and set the derivative to zero, we see that the
vector α satisfies the linear equation

    Ωα = γ,   (9)
where

    Ω = E[dd^T],   γ = E[cd],   (10)

d is a random variable whose realizations are {d^(1), d^(2), . . ., d^(N)}, and
P(d = d^(n)) = λ_n. For a weighted error with regularization terms, one can show
that the vector α = [α_0, α_1, . . ., α_M]^T minimizing the error function satisfies the
equation

    Ω′α + Ω″α = γ,   (11)

where Ω′ = Ω, Ω″ = diag(ν_0 p_0, ν_1 p_1, . . ., ν_M p_M), and p_m = ϕ_m^T ϕ_m. If ϕ_m is
a unit vector, then p_m = 1. The analysis presented in this section remains the
same for a linear model of the form
    y(x; w) = Σ_{k=1}^{L} w_k x_k + w_0 = w^T φ(x)   (12)

if we define φ(x) = [1, x_1, x_2, . . ., x_L]^T, w = [w_0, w_1, . . ., w_L]^T, and

    ϕ_m = [ϕ_m(0), ϕ_m(1/L), ϕ_m(2/L), . . ., ϕ_m(k/L), . . ., ϕ_m(1)]^T,   (13)

with k ∈ {0, 1, 2, . . ., L} and m ∈ {0, 1, 2, . . ., M}.
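The closed-form solution of this section can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's experimental setup: the dimensions, the cosine basis, and the uniform weights λ_n = 1/N are illustrative choices, and the targets are generated noiselessly from the compressed model so the exact recovery of α can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 50, 8, 200                      # input dim, compressed dim - 1, samples

# Cosine basis over t_k = k/L (Eqs. 4-5): Phi[k, m] = cos(m * pi * k / L)
t = np.arange(L + 1) / L
Phi = np.cos(np.pi * np.outer(t, np.arange(M + 1)))   # (L+1) x (M+1)

# Ground-truth weights drawn from the compressed space itself (smooth profile)
alpha_true = rng.normal(size=M + 1)
w_true = Phi @ alpha_true

X = rng.normal(size=(N, L))
phi = np.hstack([np.ones((N, 1)), X])      # phi(x) = [1, x_1, ..., x_L]^T
c = phi @ w_true                           # noiseless targets

lam = np.full(N, 1.0 / N)                  # weights with sum lambda_n = 1

# Compressed inputs d^(n) = Phi^T phi(x^(n))  (Eq. 6)
D = phi @ Phi                              # N x (M+1)

# Solve Omega alpha = gamma  (Eqs. 9-10)
Omega = (D * lam[:, None]).T @ D
gamma = D.T @ (lam * c)
alpha = np.linalg.solve(Omega, gamma)

w_hat = Phi @ alpha                        # back to uncompressed weights
print(np.max(np.abs(phi @ w_hat - c)))     # tiny residual
```

Note that the solve involves only an (M+1) × (M+1) system, which is the source of the speed-up over working with the L+1 uncompressed parameters.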

4 Learning in Compressed Parameter Space for Classification
In this section we consider the problem of two-class classification. In particular,
we assume that the output of the classifier is given by

    P(C_1|x^(n)) = y_n = y(w; x^(n)) = g(w^T φ(x^(n)))   (14)

with P(C_2|x^(n)) = 1 − P(C_1|x^(n)), where g(·) is the logistic sigmoid function. Since
we do not have a closed-form solution for the classification problem, we make use of
iterative reweighted least squares as follows:
1. Initialize the vector α and the basis functions ϕ_0, . . ., ϕ_M, and set α_old = α.
2. Generate the weighting vector λ = [y_1(1 − y_1), y_2(1 − y_2), . . ., y_N(1 − y_N)]^T.
   Normalize λ as λ ← λ/Σ_n λ_n, so that Σ_n λ_n = 1.
3. Use equation (9) to solve for α.
4. If ‖α − α_old‖ ≤ ε, stop; else set α_old = α and go to step 2. The quantity ε is
   a small positive real number used to stop the iteration.
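The four steps above can be sketched as follows. The toy two-class data in a low-dimensional compressed space is hypothetical, and thresholding the linear output at 0.5 (the midpoint of the {0, 1} regression targets) is a pragmatic choice for this least-squares variant, not something the paper specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N, M1 = 200, 3                        # samples, compressed dimension M + 1

# Toy compressed inputs d^(n): two classes, first component acts as a bias
d_neg = np.hstack([np.ones((N // 2, 1)), rng.normal(-1.5, 1.0, size=(N // 2, M1 - 1))])
d_pos = np.hstack([np.ones((N // 2, 1)), rng.normal(+1.5, 1.0, size=(N // 2, M1 - 1))])
D = np.vstack([d_neg, d_pos])
c = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

alpha = np.zeros(M1)                  # step 1: initialize
alpha_old = alpha.copy()
for _ in range(100):
    y = sigmoid(D @ alpha)            # step 2: current outputs y_n
    lam = y * (1.0 - y)
    lam /= lam.sum()                  # normalize so sum lambda_n = 1
    Omega = (D * lam[:, None]).T @ D  # step 3: solve Omega alpha = gamma (Eq. 9)
    gamma = D.T @ (lam * c)
    alpha = np.linalg.solve(Omega, gamma)
    if np.linalg.norm(alpha - alpha_old) <= 1e-6:   # step 4: convergence check
        break
    alpha_old = alpha.copy()

# classify with the midpoint of the {0, 1} targets as threshold
acc = np.mean(((D @ alpha) > 0.5) == (c > 0.5))
print(acc)
```

On this well-separated toy data the iteration settles quickly and the training accuracy is close to one.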

5 Learning in Compressed Parameter Space for Reinforcement Learning
For reinforcement learning, we perform a direct policy search. The policy is
represented by a neural network augmented by a Kalman filter. The augmented
neural network with Kalman Filter (ANKF) [8] to be learned is made up of
a neural network and a predictor that can estimate the next state based on
the current partially-observable state (which is possibly corrupted by noise).
The predictor we use is composed of n Kalman filters, one for each of the n
sensory readings. The outputs of these Kalman filters are connected to a feed-
forward neural network, whose outputs control the plant. The use of Kalman
filters provides memory to the system and as a result enables the system to
recover missing variables. Because of this, it is not necessary for the neural
network to have a recurrent connection, and the use of a feed-forward neural
network for the policy π to be learned is sufficient.
For the results presented in this paper, we assume that the feed-forward neural
network of the augmented neural network has no hidden layer (equivalent to a
linear model), and thus we can assume that we have a vector of parameters to
optimize. The number of parameters to optimize in uncompressed parameter
space is given by 2n for incomplete state variables, where n is the number of
inputs to the augmented neural network. If the length L = 2n of the parameters
of the augmented neural network is large, we need to determine a large number of
parameters. In order to speed up the neuroevolutionary process, we approximate
wk using

    w_k = α_0 ϕ_0(t_k) + α_1 ϕ_1(t_k) + α_2 ϕ_2(t_k) + · · · + α_M ϕ_M(t_k),   (15)

where {ϕ_0, ϕ_1, ϕ_2, . . ., ϕ_M} forms an orthogonal set of basis functions, t_k ∈ [0, 1]
is a parametrization variable, and M ≤ 2n. Please note that equation (15) is
the same as equation (4) given above. The parametrization parameter tk = 0
corresponds to the first parameter in the uncompressed parameter space, and
tk = 1 corresponds to the last parameter in uncompressed parameter space.
In the compressed parameter space, we evolve the parameters {α0 , α1 , . . . , αM }
using the CMA-ES algorithm [6].
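A minimal sketch of this scheme follows. The decode step is Eq. (15) with a cosine basis; everything else is a stand-in: a simple (1+1)-ES with step-size adaptation replaces CMA-ES [6], and the fitness is a toy surrogate (distance to a smooth target weight vector) rather than the episodic return of the ANKF controller.

```python
import numpy as np

rng = np.random.default_rng(2)
L, M = 40, 5                           # uncompressed weights (e.g. 2n), compressed dim - 1

t = np.arange(L) / (L - 1)             # t_k in [0, 1]
Phi = np.cos(np.pi * np.outer(t, np.arange(M + 1)))   # phi_m(t_k) = cos(m * pi * t_k)

def decode(alpha):
    """Map compressed parameters alpha to the full weight vector (Eq. 15)."""
    return Phi @ alpha

# Stand-in fitness: negative distance to a smooth target weight vector.
# In the paper this would be the episodic return of the ANKF controller.
w_target = decode(rng.normal(size=M + 1))
def fitness(alpha):
    return -np.sum((decode(alpha) - w_target) ** 2)

# Simple (1+1)-ES with step-size adaptation as a placeholder for CMA-ES [6]
alpha, sigma = np.zeros(M + 1), 0.5
best = fitness(alpha)
for _ in range(2000):
    cand = alpha + sigma * rng.normal(size=M + 1)
    f = fitness(cand)
    if f > best:
        alpha, best = cand, f
        sigma *= 1.1                   # expand the step size on success
    else:
        sigma *= 0.98                  # shrink it on failure
print(best)                            # approaches 0 from below
```

The search runs over only M + 1 = 6 parameters instead of the L = 40 uncompressed weights, which is exactly where the acceleration comes from.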

6 Classification of Evoked Potentials in EEG


A difficult task for any learning algorithm is the classification of evoked potentials
in single-trial EEG data. This task is important for brain-computer interfaces
(BCIs). BCIs detect patterns extracted from brain activity signals to estimate
the cognitive state of humans in order to control devices or for the purpose of
communication [14]. A problem in this detection task is that EEG data is usually
high-dimensional and corrupted by a high level of noise, while the amount of
training data is small. In practice, moreover, it is usually very important to
train the classifier quickly on the acquired data.

In the experiment, two kinds of visual stimuli were presented to the test person:
irrelevant "standards" and relevant "targets". When a target was presented,
the test person had to react with a movement of the right arm. The ratio between
standards and targets was 8:1. The data was acquired from 8 different subjects
in 3 distinct sessions per subject. It was recorded at a 5 kHz sampling rate with 136
electrodes, of which 124 were used to record EEG data, 4 to record
electrooculogram (EOG) data, and 8 to record electromyogram data. For the
experiments we used 64 EEG electrodes (in accordance with
the extended 10-20 system with reference at electrode FCz). The data was ac-
quired using an actiCap system (Brain Products GmbH, Munich, Germany) and
amplified by four 32 channel BrainAmp DC amplifiers (Brain Products GmbH,
Munich, Germany). To estimate the effect of the compression on the classification
performance, we performed a stratified 2-fold cross validation on all data sets, and
repeated this experiment 100 times. All epochs of the data were processed as
follows: (1) standardization (the mean of the data in the epoch was subtracted
and the result divided by the standard deviation); (2) decimation to 25 Hz (the
data was first filtered with an anti-alias filter and then subsampled); (3) low-pass
filtering with a cut-off frequency of 4 Hz; (4) standardization per feature; (5)
compression; (6) classification with an SVM, since we showed that learning in
compressed parameter space is equivalent to learning in compressed input space. To
enhance the classification performance of the SVM, in each training attempt 7
different complexity values were investigated and the best one was chosen by
5-fold cross-validation. Figure 1 shows the classification performance (balanced
accuracies [13]) and training times for different compression rates. As can be seen
from the figure, it is possible to reduce the training times of SVMs by learning
in compressed input space. Note that the training time is reduced by a factor
of approximately 11, with only a slight loss of performance, for an SVM trained in
compressed input space at a compression rate of 80% (fraction left 0.2).
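The balanced accuracy used above matters because of the 8:1 class imbalance. A small sketch, assuming the standard definition from [13] as the mean of the per-class recalls:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls [13]; insensitive to the 8:1 class ratio."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == k] == k) for k in classes]
    return float(np.mean(recalls))

# A majority-class classifier looks deceptively good on plain accuracy:
y_true = np.array([0] * 80 + [1] * 10)      # 8:1 standards-to-targets ratio
y_pred = np.zeros(90, dtype=int)            # always predict "standard"
print(np.mean(y_true == y_pred))            # ~0.889 plain accuracy
print(balanced_accuracy(y_true, y_pred))    # 0.5
```

A classifier that never detects a target scores 0.5, the chance level, which is why balanced accuracy is the appropriate metric here.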


Fig. 1. Balanced accuracy (a) and training times (b) versus the fraction of data left for
training. A fraction of one means that the original data is used. The training time
corresponds to steps (4), (5), and (6) described above.

7 Experiments in Reinforcement Learning

The augmented neural network has been tested on the difficult versions of the
single and double pole balancing without velocities benchmarks [4], and has
achieved significantly better results on these benchmarks than the published
results of other algorithms to date. Table 1 shows the performance of learning in
compressed parameter space for both the single and double pole balancing experiments
without velocity information. In this experiment, evolution of the augmented
neural network in compressed parameter space significantly outperforms the
evolution of a recurrent neural network in compressed parameter space. The increase
in performance is due to the simplification of the neural networks through αβ
filters. For CoSyNE, the performance in the compressed parameter space got worse,
which we suspect is due to the recurrent connections in the recurrent
neural networks used to solve the problems.

Table 1. Results for the single and double pole-balancing benchmarks. Average over
50 independent evolutions. DOF stands for discrete orthogonal functions.

Task Method Parameters Evaluations


1 pole non-Markov   CoSyNE         4 (uncompressed)   127
1 pole non-Markov   ANKF           4 (uncompressed)    76
2 poles non-Markov  CoSyNE         5 (uncompressed)   954
2 poles non-Markov  ANKF           6 (uncompressed)   482
1 pole non-Markov   CoSyNE + DCT   4 (compressed)     151
1 pole non-Markov   ANKF + DOF     3 (compressed)      12
2 poles non-Markov  CoSyNE + DCT   5 (compressed)    3421
2 poles non-Markov  ANKF + DOF     4 (compressed)     480

8 Conclusion
For supervised learning, we have shown that it is possible to accelerate training
in compressed input space. For reinforcement learning, we have shown that by
evolving the parameters of the augmented neural network in a compressed pa-
rameter space, it is possible to accelerate neuroevolution for partially observable
domains. The results presented for reinforcement learning are preliminary since
(1) the problem considered is not difficult and (2) the number of parameters to
optimize in compressed parameter space is not large. Therefore, the method has
to be tested on more complex problems to assess the feasibility of evolving
augmented neural networks in compressed parameter space.
In the future, we would like to extend the method to non-linear models such as
neural networks, and test the method on standard benchmark problems for both
supervised and reinforcement learning.

References
1. Calderbank, R., Jafarpour, S., Schapire, R.: Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical report (2009)
2. Davenport, M.A., Duarte, M.F., Wakin, M.B., Laska, J.N., Takhar, D., Kelly, K.F., Baraniuk, R.G.: The smashed filter for compressive classification and target recognition. In: Proceedings of Computational Imaging V at SPIE Electronic Imaging, San Jose, CA (January 2007)
3. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)
4. Gomez, F.J., Miikkulainen, R.: Robust non-linear control through neuroevolution. Technical Report AI-TR-03-303, Department of Computer Sciences, The University of Texas, Austin, USA (2002)
5. Gomez, F.J., Schmidhuber, J., Miikkulainen, R.: Efficient Non-linear Control Through Neuroevolution. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 654–662. Springer, Heidelberg (2006)
6. Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9(2), 159–195 (2001)
7. Haupt, J., Castro, R., Nowak, R., Fudge, G., Yeh, A.: Compressive sampling for signal classification. In: Proceedings of the 40th Asilomar Conference on Signals, Systems and Computers, pp. 1430–1434 (2006)
8. Kassahun, Y., de Gea, J., Edgington, M., Metzen, J.H., Kirchner, F.: Accelerating neuroevolutionary methods using a Kalman filter. In: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 1397–1404. ACM, New York (2008)
9. Koutník, J., Gomez, F., Schmidhuber, J.: Searching for minimal neural networks in Fourier space. In: Baum, E., Hutter, M., Kitzelmann, E. (eds.) Proceedings of the Third Conference on Artificial General Intelligence (AGI), pp. 61–66. Atlantis Press (2010)
10. Koutník, J., Gomez, F.J., Schmidhuber, J.: Evolving neural networks in compressed weight space. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 619–626. ACM, New York (2010)
11. Maillard, O., Munos, R.: Compressed least-squares regression. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 1213–1221 (2009)
12. Schaul, T., Schmidhuber, J.: Towards Practical Universal Search. In: Proceedings of the Third Conference on Artificial General Intelligence (AGI), Lugano (2010)
13. Velez, D.R., White, B.C., Motsinger, A.A., Bush, W.S., Ritchie, M.D., Williams, S.M., Moore, J.H.: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology 31(4), 306–315 (2007)
14. Zander, T.O., Kothe, C.: Towards passive brain-computer interfaces: applying brain-computer interface technology to human-machine systems in general. Journal of Neural Engineering 8(2), 025005 (2011)
15. Zhou, S., Lafferty, J.D., Wasserman, L.A.: Compressed regression. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc. (2008)
Control of a Free-Falling Cat
by Policy-Based Reinforcement Learning

Daichi Nakano, Shin-ichi Maeda, and Shin Ishii

Graduate School of Informatics, Kyoto University


Gokasho, Uji, Kyoto, 611-0011 Japan
{nakano-d,ichi}@sys.i.kyoto-u.ac.jp, ishii@i.kyoto-u.ac.jp

Abstract. Autonomous control of nonholonomic systems is a major
challenge, because there is no unified control method that can handle
arbitrary nonholonomic systems even when their dynamics are known. To
address this challenge, in this study we propose a reinforcement learning
(RL) approach that enables the controller to acquire an appropriate control
policy even without knowing the detailed dynamics. In particular, we focus
on the control problem of a free-falling cat system whose dynamics are
highly nonlinear and nonholonomic. To accelerate the learning, we adopt
the policy gradient method, which exploits the basic knowledge of the system,
and present an appropriate policy representation for the task. It is shown
that this RL method achieves remarkably faster learning than that by
the existing genetic algorithm-based method.

Keywords: Free-falling cat, Nonholonomic system, Policy gradient method.

1 Introduction
A nonlinear dynamical system is said to be 'nonholonomic' if its constraints
cannot be reduced to algebraic equations consisting only of generalized
coordinates x ∈ ℝ^n and time t [1]; that is, they cannot be reduced to a form like
h(x, t) = 0 ∈ ℝ^m (n ≥ m) but are represented by differential equations like
h(x, ẋ, ẍ, t) = 0 ∈ ℝ^m. Cars, space robots, submarines, and other underactuated
systems are examples of nonholonomic systems. According to Brockett's theorem,
however, nonholonomic systems cannot be asymptotically stabilized by static and
smooth feedback control, which indicates the difficulty of designing a controller
for such systems [2]. There have been many studies on the control of nonholonomic
systems. However, they mostly take heuristic approaches specialized for particular
target tasks, and it is difficult to generalize such heuristic approaches into a
general and unified control method. Moreover, the accurate dynamics of target
systems must be known to establish such heuristic controllers, although
they are often unknown or only partially known in practical situations.
In this study, we propose a reinforcement learning (RL) approach, a form of
autonomous control that is applicable even when the detailed dynamics of the
target system are unknown. It enables an adaptive controller (an agent) to
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 116–123, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Control of a Free-Falling Cat by Policy-Based Reinforcement Learning 117

acquire the optimal controller (policy) that maximizes the cumulative or average
reward through trial and error with the target system. RL approaches are clas-
sified into two major categories; one consists of value function-based methods,
by which the policy is updated such to enlarge the value function according to
the scenario of policy iteration, and the other consists of policy search methods,
by which policy parameters are directly updated so as to increase the objective
function. The latter category, policy search methods, is further classified into
two in terms of optimization techniques. One is a policy gradient method [6],
in which the policy parameters are updated based on the approximate gradi-
ent of the objective function. In the other class, any kinds of ’meta-heuristic’
optimization technique, like genetic algorithm (GA), can be used. GA has been
successfully applied to optimize the control policy, which is represented by means
of instances, and enabled controlling nonholonomic systems [7] [8]. While GA can
be applied even when the objective function is not differentiable with respect to
the policy parameters, there is almost no guideline for optimization with respect
to high-dimensional policy parameters; in GA, the dependence of the objective
function on the parameters is obscure.
In contrast, in policy gradient methods, more efficient optimization can be
performed because the (approximate) gradient of the objective function with
respect to the policy parameters represents the knowledge of the target system,
i.e., how the objective function depends on the parameters to be optimized.
Thus, in this study, we propose to use a policy gradient method to exploit the
basic knowledge of the target system. In particular, we use GPOMDP [10],
one of the policy gradient methods. As a typical and interesting example
of nonholonomic systems, we focus on a falling-cat system. Even when a cat
starts falling with its back down, it can twist its body, put its feet down, and land
safely. Such a cat's motion in the air is called a falling-cat motion. During the
falling-cat motion, angular momentum is conserved, and this constitutes a
nonholonomic constraint [9]. To fully utilize the inherent properties of the system,
we also propose to use a stochastic policy incorporating normalized radial basis
functions, which is suitable for control in a periodic state space. We will show
that our approach, which makes use of these inherent properties of the system,
enables quicker learning than the existing GA method.

2 Falling-Cat Motion Problem

In this study, as the simplest model of a free-falling cat, we use the mathematical
model presented in [11]. This model is composed of two cylinders that move
point-symmetrically about their connection point O (Fig. 1). The system is placed
in a vacuum without gravity; i.e., it has to satisfy the law of conservation of
angular momentum.
The state variable x is a three-dimensional vector [ψ, γ, φ], which represents
three angles (rad) of the system: the rotation angle of the cylinders (reduced
to one dimension due to the symmetry), the bending angle of the two cylinders,
and the rotation angle of the entire model, respectively. As long as the object is free
118 D. Nakano, S. Maeda, and S. Ishii

Fig. 1. A free-falling cat model [11]

from external forces, the total angular momentum of the cylinders is conserved
along the whole movement of the model, which provides a nonholonomic
constraint. Thus, although the state is represented by three variables, its
degrees of freedom are constrained to two. Hence, the two angular velocities,
u = [ψ̇, γ̇], are assumed to be directly controllable. In a continuous-time domain,
this model is described by
this model is described by
⎡ ⎤
1 0
ẋ = G(x)u, G(x) = ⎣ 0 1 ⎦, (1)
fψ (ψ, γ) fγ (ψ, γ)
cos γ sin2 γ[ρ + (1 − ) cos2 ψ]
fψ (ψ, γ) = ,
(1 − sin2 γ cos2 ψ)[1 + (ρ −  cos2 ψ) sin2 γ]
cos ψ sin ψ sin γ(1 −  + ρ sin2 γ)
fγ (ψ, γ) = ,
(1 − sin2 γ cos2 ψ)[1 + (ρ −  cos2 ψ) sin2 γ]
where ρ and ε are scalars that determine the system dynamics and are set to
ρ = 3 and ε = 0.01, respectively, following the existing study [11]. As can be
seen from this equation, the system dynamics are highly nonlinear although the
object's structure is quite simple.
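A minimal numerical sketch of this model follows, using the coupling terms f_ψ and f_γ as reconstructed from Eq. (1) above and a simple Euler discretization; the constant input sequence is illustrative, not a learned policy.

```python
import numpy as np

RHO, EPS = 3.0, 0.01   # rho and epsilon from [11]

def G(x):
    """Input matrix of the kinematic model x_dot = G(x) u (Eq. 1, as reconstructed)."""
    psi, gamma, _ = x
    den = (1.0 - np.sin(gamma)**2 * np.cos(psi)**2) * \
          (1.0 + (RHO - EPS * np.cos(psi)**2) * np.sin(gamma)**2)
    f_psi = np.cos(gamma) * np.sin(gamma)**2 * (RHO + (1.0 - EPS) * np.cos(psi)**2) / den
    f_gam = np.cos(psi) * np.sin(psi) * np.sin(gamma) * (1.0 - EPS + RHO * np.sin(gamma)**2) / den
    return np.array([[1.0, 0.0], [0.0, 1.0], [f_psi, f_gam]])

def step(x, u, dt=0.001):
    """One Euler step; the paper discretizes with a 1 ms time step (Sect. 4.1)."""
    return x + dt * (G(x) @ u)

x = np.array([0.0, 0.0, 0.0])        # supine initial state x0
u = np.array([1.0, 0.5])             # constant angular velocities [psi_dot, gamma_dot]
for _ in range(1000):                # 1 s of simulated time
    x = step(x, u)
print(x)  # psi = 1.0, gamma = 0.5, plus a small induced rotation phi > 0
```

Note how φ changes even though only ψ̇ and γ̇ are actuated: that induced drift through f_ψ and f_γ is exactly the nonholonomic coupling the controller must exploit.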
In our simulation, this continuous-time system is observed, in terms of state
and reward, every 0.02 sec. A controller (policy) produces a control signal (an
action) immediately after the observation, and the control signal is continuously
applied until the next observation for the interval of 0.02 sec. Since the system
is observed intermittently, it can be approximated by a discrete-time system in
which a state xt and a reward rt are observed, and an action ut is taken at each
time step t = 0, 1, · · · . The initial state is fixed at x0 = [0, 0, 0] where a cat is in
the supine position. A control sequence of 5.0 sec from the initial state is defined
as an episode.
The objective of RL is defined as the maximization of the average reward

    η = (1/T) Σ_{t=0}^{T−1} r_{t+1}.

Here, the reward r_{t+1} = r(x_t, u_t) is given by the sum of instantaneous rewards
R(x_{t,k}) along the trajectory between time steps t and t + 1, that is,
r(x_t, u_t) = Σ_{k=0}^{K−1} R(x_{t,k}), where x_{t,k} denotes the k-th intermediate
state on the local trajectory, with x_{t,0} = x_t and x_{t,K} = x_{t+1}. The function

    R(x) = λ_1 / ((x − x_g)^T Λ (x − x_g) + 1)

becomes maximal when the system is at the goal state x_g = [2π, 0, π], i.e., when
the cat is in the prone position. Here, Λ = diag(λ_ψ, λ_γ, λ_φ) contains weight
parameters that represent the importance of the three state variables, and was set
to [λ_ψ, λ_γ, λ_φ] = [0.6, 0.6, 0.6]. In the later simulation, K and λ_1 were set to
K = 20 and λ_1 = 10, respectively.
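The reward above can be sketched directly; the only assumption is that Λ is a diagonal matrix with the listed entries, which matches how the weights are given component-wise in the text.

```python
import numpy as np

X_GOAL = np.array([2.0 * np.pi, 0.0, np.pi])   # prone position x_g
LAM = np.diag([0.6, 0.6, 0.6])                 # Lambda (assumed diagonal)
LAMBDA_1 = 10.0

def R(x):
    """Instantaneous reward; maximal (= lambda_1) at the goal state."""
    d = x - X_GOAL
    return LAMBDA_1 / (d @ LAM @ d + 1.0)

def r(intermediate_states):
    """Per-step reward: sum of R over the K intermediate states x_{t,k}."""
    return sum(R(x) for x in intermediate_states)

print(R(X_GOAL))           # 10.0 at the goal
print(R(np.zeros(3)))      # much smaller at the supine start x0
```

Because R decays smoothly away from x_g, the agent receives a shaped signal throughout the episode rather than only on success.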

3 Method
Our RL algorithm is based on GPOMDP [10], a policy gradient method. In this
section, we describe the learning algorithm and its implementation.

3.1 Definition of the Parameterized Policy


An action (a control signal) ut for the interval [t, t + 1) is produced by the
stochastic policy μ(ut |xt , Θ). Here, Θ denotes the set of policy parameters to
be learned. A stochastic policy is advantageous for better exploration. Since
we consider two-dimensional action space, let uti (i ∈ {ψ, γ}) be an element of
ut = [ψ̇, γ̇]. For simplicity, we consider an independent policy over its elements:
that is, μ(ut |xt , Θ) = μ(utψ |xt , Θψ )μ(utγ |xt , Θγ ).
The element-wise policy, μ(u_ti|x_t, Θ_i) (i ∈ {ψ, γ}), is given by Eq. (2) so that
1) each angular velocity does not take too large a value and stays within a fixed
range −l ≤ ψ̇, γ̇ ≤ l (l is a constant), and 2) its support is periodic with respect
to the system's angles.
    μ(u_ti|x_t, Θ_i) = (1/c(x_t, Θ_i)) f(u_ti, x_t, Θ_i),   if f > 0 and −l_i ≤ u_ti ≤ l_i,
                       0,                                    otherwise,                   (2)

    f(u_ti, x_t, Θ_i) = −{u_ti − w_i(p_i)^T b(x_t)}² / (α_i² + α_min) + 1,

    c(x_t, Θ_i) = ∫_{−l_i}^{l_i} f(u_ti, x_t, Θ_i) du_ti.

Here, ^T denotes transpose. According to Eq. (2), the stochastic policy is
represented by the positive part of a concave quadratic function whose center is
w_i(p_i)^T b(x_t) (Fig. 2). Because the distribution function is just polynomial,
the normalization constant c(x_t, Θ_i) can be calculated easily.
w_i(p_i) = [w_i(p_i1), · · ·, w_i(p_iN)]^T and
b(x_t) = (1/Σ_{j=1}^{N} b_j(x)) [b_1(x), · · ·, b_N(x)]^T are an N-dimensional
weight vector with an adjustable parameter p_i = [p_i1, · · ·, p_iN]^T, and an
N-dimensional state-dependent normalized basis vector, respectively. α_i and
α_min (> 0) are an adjustable parameter that controls the variance of μ and a
fixed small positive constant that avoids numerical instability, respectively.
These policy parameters are collected in the parameter vector Θ_i = {p_i, α_i}.
The weight w_i(p) = 2l_i (1/(1 + e^{−p}) − 1/2) is represented via a sigmoid function
to restrict the output range to [−l_i, l_i] irrespective of the parameter value p. To

Fig. 2. A policy is given by a truncated quadratic distribution



represent the system’s periodic character with respect to system angles, the basis
function bj (xt ) is described by cosine functions as;
bj (x) = exp[σ{cos(ψ − ψjb ) + cos(γ − γjb ) + cos(φ − φbj )}], (3)

where σ and x_j^b = [ψ_j^b, γ_j^b, φ_j^b] denote the width and center of the basis
function, respectively. The basis centers x_j^b (j = 1, · · ·, N) were arranged on a
regular grid, independently for each dimension. Putting l centers in the interval
[0, 2π] for ψ, m centers in the interval [−π, π] for γ, and n centers in the interval
[0, 2π] for φ, we have in total l × m × n = N grid points, leading to the basis
vector b(x).
For representing the distribution center, w_i(p_i)^T b(x_t), we used the normalized
radial basis functions (RBFs) b(x) rather than the original RBFs b_j(x). The
normalized RBF provides good interpolation without being affected by the
allocation of the centers {x_j^b | j = 1, · · ·, N}, whereas the original RBF outputs
tend to be small if the input x is far from all of the basis centers. By using
the basis function vector and the weight vector described above, it is guaranteed
that not only the action value u_ti but also the distribution mode w_i(p_i)^T b(x_t)
is constrained within the interval [−l_i, l_i].
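The normalized periodic basis of Eq. (3) and the bounded weight map can be sketched together; the 4 × 4 × 4 grid of centers, σ = 1, and l_i = π are illustrative choices.

```python
import numpy as np

def cosine_rbf(x, centers, sigma=1.0):
    """Normalized basis vector b(x) built from Eq. (3): periodic in each angle."""
    b = np.exp(sigma * np.cos(x[None, :] - centers).sum(axis=1))
    return b / b.sum()

def weight(p, l=np.pi):
    """w_i(p) = 2 l_i (1 / (1 + e^{-p}) - 1/2), bounded in (-l_i, l_i)."""
    return 2.0 * l * (1.0 / (1.0 + np.exp(-p)) - 0.5)

# Regular grid of centers over the three angles (here 4 x 4 x 4 = N)
psi_c = np.linspace(0.0, 2.0 * np.pi, 4)
gam_c = np.linspace(-np.pi, np.pi, 4)
phi_c = np.linspace(0.0, 2.0 * np.pi, 4)
centers = np.array([(a, g, f) for a in psi_c for g in gam_c for f in phi_c])

x = np.array([0.3, -0.1, 1.0])
b = cosine_rbf(x, centers)
p = np.zeros(len(centers))            # adjustable parameters p_i, to be learned
mode = weight(p) @ b                  # distribution center w_i(p_i)^T b(x_t)
print(b.sum(), mode)                  # 1.0 and 0.0 at p = 0
```

Since b sums to one and each w_i(p_ij) lies in (−l_i, l_i), the mode is a convex combination of bounded weights and therefore itself bounded, as stated above.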

3.2 GPOMDP
In this study, we use GPOMDP [10], which is one of the policy gradient methods,
for RL. According to the policy gradient method, policy parameters are updated
such to increase the objective function, the average reward in each episode,
without estimating the value function.
In the policy gradient method, the policy parameters are updated by

    θ_{h+1} = θ_h + β_{h+1} ∇_θ η,   (4)

where h counts the number of parameter updates and β_{h+1} = β_0 / (δLh + 1)
is a reciprocal-linearly scaled learning rate, with β_0, δ, and L being the
pre-determined initial learning rate, decay constant, and number of episodes for
calculating the gradient, respectively. ∇_θ η denotes the derivative of the average
reward η = η(θ) with respect to θ.
However, ∇_θ η cannot be calculated analytically, since the calculation of the
gradient of the average reward requires the unknown state transition probability.
Therefore, we approximate ∇_θ η by GPOMDP. GPOMDP is advantageous
because no information about the system is necessary in the gradient estimation
and the required memory capacity is small. In our implementation, each policy
parameter was updated independently, and the average of L gradients ∇_θ η_v
(v = 1, · · ·, L), each estimated from one of the L episodes ξ_v = [x_{0:T}^v, u_{0:T−1}^v],
was used for the policy update in order to suppress the variance of the gradient
estimate and to allow parallel computation.
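The flavor of a GPOMDP-style estimate can be sketched on a deliberately tiny toy problem (not the falling-cat task): a scalar Gaussian policy whose mean is the only parameter, with an eligibility trace discounted by β accumulating the score function.

```python
import numpy as np

rng = np.random.default_rng(3)

def gpomdp_gradient(theta, episodes=200, T=10, beta=0.9, sigma=0.3):
    """GPOMDP-style estimate of the average-reward gradient.

    Toy problem: scalar action u ~ N(theta, sigma^2), reward r = -(u - 2)^2,
    so the true gradient is -2 (theta - 2).
    grad_theta log mu(u) = (u - theta) / sigma^2.
    """
    total = 0.0
    for _ in range(episodes):
        z, g = 0.0, 0.0                          # eligibility trace, accumulator
        for _ in range(T):
            u = theta + sigma * rng.standard_normal()
            r = -(u - 2.0) ** 2
            z = beta * z + (u - theta) / sigma ** 2
            g += r * z
        total += g / T
    return total / episodes                      # average over the L episodes

theta = 0.0
for _ in range(150):
    theta += 0.02 * gpomdp_gradient(theta)
print(theta)   # close to the optimum 2.0
```

Only rewards and the policy's own log-likelihood gradient enter the estimate; no model of the state transitions is needed, which is the property the text highlights.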

4 Simulations
Our RL algorithm was evaluated by numerical simulations.

4.1 Task and Algorithm Setting


The continuous-time system (Eq. (1)) is approximated by a discrete-time system
with the time step 1 msec. An episode consists of 5.0 sec control time-series
started from the fixed initial state x0 .
The constant and initial parameters of our RL algorithm were set as follows.
The number of episodes used for estimating the gradient was L = 50. The
parameters related to α_i were: the initial value α_i^0 = 6, the minimum
value α_min = 0.5, the initial learning rate β_{0,αi} = 0.0002, and the decay constant
of the learning rate δ_{αi} = 0. The parameters related to p_i were: the initial learning
rate β_{0,pi} = 0.06 and the decay constant of the learning rate δ_{pi} = 0.001. Each
element of the initial value of p_i was set to a small, randomly chosen value within
p_ij ∈ [−0.001, 0.001].

4.2 Result
Fig. 3(a) shows the learning curve, in which the abscissa and ordinate denote
the number of training episodes and the average reward obtained in a single
episode, respectively. The average reward increases rapidly up to 10,000 episodes
and then slowly after that, suggesting that RL obtained a controller that enables
the system to reach the goal state by around 10,000 episodes, and then a better
controller that achieves the goal faster. Interestingly, this learning process is fairly
stable, as can be seen from the relatively small standard deviation over the 10
training runs. After 100,000 training episodes, the controller brings the system
to the goal state in about 2.0 sec (Fig. 3(b)). This efficient control can also be
seen in Fig. 4; within 1.6 sec the target prone position is reached, although the
control trajectory from the initial supine position to the goal prone position is
not simple.

4.3 Comparison with Genetic Algorithm


We compared our RL method with the modified genetic algorithm (GA) [8], which
was also applied to the control problem of a falling-cat system. Our falling-cat
system was the same as that used in the GA study, and the system parameters were
also common. One difference between our RL study and the GA study is the
sampling and control frequency: in the GA study, observation and control
were performed every 0.01 sec, whereas in our RL study they were done every
0.02 sec. Clearly, a higher frequency of observation and control is advantageous,
as it allows more precise control based on richer observations. Another difference
lies in the objective function to optimize; therefore, after every learning episode, we
evaluated the fitness function below, which is the objective function optimized
by GA.
    Fitness = Σ_{t=1}^{T} r(x_t, u_t),   r(x, u) = −(θ² + θ_s² + 5(π − |θ_r|)² + 0.05(θ̇² + θ̇_s²)) / 5000.   (5)


Fig. 3. (a) Learning curve; mean and standard deviation over 10 simulation runs are
shown. (b) A control trajectory, the series of state variables (upper panel) and those
of control signals (lower panel) after 100, 000 training episodes. Here, the deterministic
policy, in which the mean action wi (pi )b(xt ) is always taken, was used. In the upper
panel, the goal state xg for the three state variables are depicted by thin straight lines.
(c) Fitness comparison between the GA study [8] and ours.

Fig. 4. Snapshots of a control trajectory after 100,000 training episodes, shown every 0.02 sec
from the initial state x0 to the last state at 5.0 sec. The two small pins on the cylinders indicate
the direction of the cat's feet.

Fig. 3(c) shows the average fitness curves of our RL method (solid line) and
of the GA method [8] (dashed line). Both fitness curves show the average over 10
simulation runs. Although directly optimizing the fitness function is advantageous
for maximizing the fitness itself, our objective function, the average
reward, is highly correlated with the fitness function. The higher fitness suggests
that our algorithm obtained a policy that reaches the goal with smaller control
inputs.

5 Conclusion
A unified control law that can handle arbitrary nonholonomic systems has not yet
been discovered. In this study, we proposed an RL approach to the control
of systems that cannot be well controlled by conventional feedback control and
Control of a Free-Falling Cat by Policy-Based Reinforcement Learning 123

may include unknown dynamics. As an example, we focused on a free-falling
cat, a typical nonholonomic system with highly nonlinear dynamics.
One possible drawback of RL is its high computational demand, e.g.,
it may require a large number of samples. In this study, however, we showed that
the combination of a well-designed set of basis functions that utilizes knowledge
of the target problem and a good optimization method, GPOMDP, realized
faster and better learning than the GA-based method [8].

References
1. Nakamura, Y.: Nonholonomic robot systems, Part 1: what’s a nonholonomic robot?
Journal of RSJ 11, 521–528 (1993)
2. Brockett, R.W.: Asymptotic stability and feedback stabilization. Progress in Math-
ematics 27, 181–208 (1983)
3. Mita, T.: Introduction to nonlinear control Theory-Skill control of underactuated
robots. SHOKODO Co., Ltd. (2000) (in Japanese)
4. Murray, R.M., Sastry, S.S.: Nonholonomic motion planning: steering using sinu-
soids. IEEE Transactions on Automatic Control 38, 700–716 (1993)
5. Holamoto, S., Funasako, T.: Feedback control of a planar space robot using a
moving manifold. Journal of RSJ 25, 745–751 (1993)
6. Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients.
Neural Networks 21, 682–697 (2008)
7. Miyamae, A., et al.: Instance-based policy learning by real-coded genetic algo-
rithms and its application to control of nonholonomic systems. Transactions of the
Japanese Society for Artificial Intelligence 24, 104–115 (2009)
8. Tsuchiya, C., et al.: SLIP: A sophisticated learner for instance-based policy using
hybrid GA. Transactions of SICE 42, 1344–1352 (2006)
9. Nakamura, Y., Mukherjee, R.: Nonholonomic path planning of space robots via a
bidirectional approach. IEEE Transactions on Robotics and Automation 7, 500–514
(1991)
10. Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. Journal of
Artificial Intelligence Research 15, 319–350 (2001)
11. Ge, X., Chen, L.: Optimal control of nonholonomic motion planning for a free-
falling cat. Applied Mathematics and Mechanics 28, 601–607 (2007)
Gated Boltzmann Machine in Texture Modeling

Tele Hao, Tapani Raiko, Alexander Ilin, and Juha Karhunen

Department of Information and Computer Science


Aalto University, Espoo, Finland
{tele.hao,tapani.raiko,alexander.ilin,juha.karhunen}@aalto.fi

Abstract. In this paper, we consider the problem of modeling complex texture information using undirected probabilistic graphical models.
Texture is a special type of data that one can better understand by con-
sidering its local structure. For that purpose, we propose a convolutional
variant of the Gaussian gated Boltzmann machine (GGBM) [12], inspired
by the co-occurrence matrix in traditional texture analysis. We also link
the proposed model to a much simpler Gaussian restricted Boltzmann
machine where convolutional features are computed as a preprocessing
step. The usefulness of the model is illustrated in texture classification
and reconstruction experiments.

Keywords: Gated Boltzmann Machine, Texture Analysis, Deep Learn-


ing, Gaussian Restricted Boltzmann Machine.

1 Introduction

Deep learning [7] has resulted in a renaissance of neural network research. It
has been applied successfully to various machine learning problems: for instance,
hand-written digit recognition [4], document classification [7], and non-linear
dimensionality reduction [8].
Texture information modeling has been studied for decades; see, e.g., [6]. A texture
can be understood as a combination of several repetitive local features. To this
end, various authors have proposed hand-tuned feature extractors. Instead of
modeling how textures are generated, those extractors treat the problem
discriminatively. An early model, the co-occurrence matrix, was proposed in [6], where
it was used to measure how often a pair of pixels with a certain offset takes
particular values, thus capturing the structure of the texture. Despite
the good performance of these extractors, they suffer from the fact that they
contain only little information about the generative model of textures. Also, these
extractors can only be applied to certain types of data, and it is fairly hard to adapt
them to other tasks if needed. Conversely, generative models of textures can be
applied to various texture modeling applications. In this direction, some statistical
approaches for modeling textures were introduced in [14] and [11]. A pioneering
work on texture modeling using deep networks was presented in [9].
Texture modeling is a very important task in real-world computer vision appli-
cations. An object can have any shape, size, and illumination condition. However,
the texture pattern within the objects can be rather consistent. By understanding

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 124–131, 2012.
© Springer-Verlag Berlin Heidelberg 2012

that, one can improve the understanding of objects in complex real-world recog-
nition tasks.
In this paper, a new type of building block for deep networks is explored
for texture modeling. The new model captures the local
relationships within a texture in a biologically plausible manner. Instead of
searching exhaustively over the whole image patch, we propose to search for
local structures in a smaller region of interest. Also, due to the complexity of
the model, a novel learning scheme for it is proposed.

2 Background
2.1 Co-occurrence Matrices in Texture Classification
The co-occurrence matrix [6] measures how frequently a pair of pixels with a certain
offset takes particular values. Modeling co-occurrence matrices instead of pixels
immediately brings the analysis to a more abstract level, and it has therefore
been used in texture modeling.
The co-occurrence matrix C is defined over an {M × N} image I, where
{1 . . . Ng} gray-scale levels are used to model pixel intensities. Under this
assumption, the size of C is {Ng × Ng}. Each entry of C is defined by


c_ij = Σ_{m=1}^{M} Σ_{n=1}^{N} [ 1 if I(m, n) = i and I(m + δ_x, n + δ_y) = j; 0 otherwise ]   (1)
Different offset schemes for {δ_x, δ_y} result in different co-occurrence matrices. For
instance, one can look for textural patterns over an image with offset {−1, 0} or
{0, 1}. These different co-occurrence matrices typically capture information about
the texture from different orientations. Therefore, a set of invariant features can
be obtained by combining several different co-occurrence matrices.
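A minimal sketch of Eq. (1), assuming an integer-valued image with gray levels indexed from 0 and skipping pixel pairs whose offset neighbour falls outside the image (the boundary handling is an assumption, not specified in the text):

```python
import numpy as np

def cooccurrence(I, dx, dy, n_gray):
    """Co-occurrence matrix of Eq. (1): C[i, j] counts how often a pixel
    with gray level i has a neighbour at offset (dx, dy) with level j.
    I is an integer image with values in {0, ..., n_gray - 1}."""
    M, N = I.shape
    C = np.zeros((n_gray, n_gray), dtype=int)
    for m in range(M):
        for n in range(N):
            m2, n2 = m + dx, n + dy
            if 0 <= m2 < M and 0 <= n2 < N:   # skip pairs falling off the image
                C[I[m, n], I[m2, n2]] += 1
    return C
```

Computing the matrix for several offsets, e.g. {−1, 0} and {0, 1}, and stacking the results yields the orientation-dependent feature set described above.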

2.2 Gaussian Restricted Boltzmann Machines


Gaussian restricted Boltzmann machine (GRBM) [7] is a basic building block for
deep networks. It tries to capture binary hidden features (hidden neurons) from a
continuous valued data vector (visible neurons), where hidden neurons and visi-
ble neurons are fully connected by an undirected graph. Even though an efficient
learning algorithm was proposed for GRBM [7], training is still very sensitive to
initialization and choice of learning parameters. Cho et al. proposed an enhanced
gradient learning algorithm for GRBM in [2]. Throughout the paper, a modified
version of GRBM [3] is adopted, where the energy function is defined as
E(x, h) = − Σ_ik (x_i / σ_i²) h_k w_ik − Σ_k h_k c_k + Σ_i (x_i − b_i)² / (2σ_i²)   (2)

where x_i, i = 1, . . . , N and h_k, k = 1, . . . , K refer to the visible and hidden
neurons, respectively. w_ik characterizes the weight of the connection between
x_i and h_k, and c_k is the bias term for hidden neuron h_k. The mean and variance

[Figure 1 panels: (a) GRBM(X); (b) GGBM; (c) GRBM(X,T).]

Fig. 1. Schematic illustration of the structures of the different Boltzmann machines

of x_i are denoted by b_i and σ_i², respectively. Accordingly, the joint distribution of
such a Boltzmann machine can be computed as P(x, h) = Z⁻¹ exp(−E(x, h)), where
Z is the normalization constant. A schematic illustration of the GRBM is shown in
Figure 1a. The input neurons x connect to the hidden neurons h, where each
connection is characterized by w_ik. A weight matrix and two bias vectors are
used to characterize all the connections in the network.
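The energy of Eq. (2) can be sketched directly in NumPy; the vectorized form below, with per-dimension variances σ_i², is only an illustration, not the implementation used by the authors:

```python
import numpy as np

def grbm_energy(x, h, W, b, c, sigma):
    """Energy of the modified GRBM, Eq. (2):
    E(x, h) = - sum_ik (x_i / sigma_i^2) h_k w_ik - sum_k h_k c_k
              + sum_i (x_i - b_i)^2 / (2 sigma_i^2)."""
    return (-(x / sigma**2) @ W @ h       # visible-hidden interaction term
            - c @ h                        # hidden bias term
            + np.sum((x - b)**2 / (2 * sigma**2)))  # quadratic visible term
```

The joint distribution then follows as P(x, h) = Z⁻¹ exp(−E(x, h)), with Z summing/integrating the exponentiated negative energy over all configurations.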

2.3 Gaussian Gated Boltzmann Machine


The Gaussian gated Boltzmann machine (GGBM) [12] is a higher-order Boltzmann
machine with two sets of visible neurons and one set of hidden neurons.
It was developed to model complex image transformations between paired
images [12] and the internal structure of a single image [13]. The energy function
of the GGBM is defined as
of GGBM is defined as
E(x, y, h) = − Σ_ijk (x_i y_j)/(σ_i σ_j) h_k w_ijk + Σ_i (x_i − b_i^x)²/(2σ_i²) + Σ_j (y_j − b_j^y)²/(2σ_j²) − Σ_k c_k h_k   (3)

A graphical illustration of the GGBM is shown in Figure 1b. The GGBM tries to model
the relationship between the visible neurons x and y by a set of hidden variables h.
A dot at the crossing of two lines in the figure represents one weight scalar w_ijk.
The biases are omitted from the figure for simplicity.
The weight tensor w_ijk can be rather large if there are many visible and hidden
neurons. For instance, two data vectors of size 100 and 200 and a hidden
vector of size 500 raise the number of parameters in w_ijk to 100 × 200 × 500.
To overcome this, a low-rank factorization of the weight tensor is used,
w_ijk → Σ_f w_if^x w_jf^y w_kf^h [12]. A different simplification approach, based on a
convolutional operation on the local structure of textures, is considered in this paper.
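A sketch of the low-rank factorization of [12] and the resulting parameter savings for the sizes quoted above; the number of factors F = 64 is a hypothetical choice, not a value from the paper:

```python
import numpy as np

def factored_tensor(wx, wy, wh):
    """Reassemble w_ijk = sum_f wx[i, f] * wy[j, f] * wh[k, f]."""
    return np.einsum('if,jf,kf->ijk', wx, wy, wh)

# Parameter counts for the example in the text (visible sizes 100 and 200,
# 500 hidden units); F is an assumed factor count.
I, J, K, F = 100, 200, 500, 64
n_full = I * J * K            # 10,000,000 parameters in the unfactored tensor
n_factored = F * (I + J + K)  # 51,200 parameters after factorization
```

Even with a generous number of factors, the factored form stores orders of magnitude fewer parameters than the dense tensor.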

3 Proposed Method
Combining the nature of texture information with the GGBM, a modified GGBM
especially suitable for texture modeling is proposed. To start with, we consider
a slightly modified general gated Boltzmann machine with pairwise
connections between all sets of nodes. This model has the most comprehensive
Gated Boltzmann Machine in Texture Modeling 127

information about the input vectors. Accordingly, the energy function of the
model is written as
E(x, y, h) = − Σ_ijk (x_i y_j)/(σ_i σ_j) h_k w_ijk − Σ_ij (x_i y_j)/(σ_i σ_j) u_ij + Σ_i (x_i − b_i^x)²/(4σ_i²) + Σ_j (y_j − b_j^y)²/(4σ_j²) − Σ_k h_k c_k − Σ_ik x_i/(2σ_i²) h_k v_ik^(1) − Σ_jk y_j/(2σ_j²) h_k v_jk^(2)   (4)

where u_ij, v_ik^(1), and v_jk^(2) are additional parameters that model the pairwise
connections between the two sets of visible neurons {x, y} and the hidden neurons h. Instead
of looking for the image transformation, we seek the internal structure of
texture information. Therefore, the same image patch is fed to the two sets
of visible neurons, that is, x = y. Accordingly, the weights v and biases b for the
two sets of visible neurons are tied: V = V^(1) = V^(2) and b = b^x = b^y.
Also, a unified variance σ² = σ_i² = σ_j² is learned to reduce the complexity of the
model further. The model remains complex, however, as the weight tensor w_ijk
still requires a huge learning effort. Since x = y, x_i and y_j can be considered a pair
of pixels, and h_k is learned to model their interaction. Given an image patch, the
traditional GGBM goes through all combinations of such pairs. This is
highly redundant, as the texture is repetitive within a very small region. Recalling
that the co-occurrence matrix summarizes the interactions of pairs of pixels
over a certain area, this structure can be introduced into the GGBM. To do
so, we assume w_ijk = w_dk, such that the weight w_ijk depends only on the
displacement d and the hidden neuron h_k; d represents the offset from i to j.
Similarly, u_ij = u_d. One can think of w_dk and u_d as a convolutional model over
local regions of image patches only. Convolutional approximations have been
shown to be rather successful in other applications such as image recognition
tasks [10]. It is further assumed that w_dk = 0 for large displacements d.
After these simplifications, the energy function (4) becomes
E(x, y, h) = − (1/σ²) Σ_ijk x_i y_j h_k w_{d_ij k} − (1/σ²) Σ_ij x_i y_j u_{d_ij} + (1/(2σ²)) Σ_i (x_i − b_i)² − (1/σ²) Σ_ik x_i h_k v_ik − Σ_k h_k c_k   (5)
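The weight-sharing assumption w_ijk = w_dk can be sketched as a lookup from a pixel pair to its displacement; the patch-size parameter P and the dictionary representation of w_d are illustrative assumptions, not the authors' data structures:

```python
import numpy as np

def displacement(i, j, P):
    """Offset d_ij = (row_j - row_i, col_j - col_i) between flat pixel
    indices i and j on a P x P patch."""
    return (j // P - i // P, j % P - i % P)

def shared_weight(i, j, P, w_d, cutoff=5):
    """Look up the tied weight vector w_{d_ij, k} over the hidden units;
    it is zero whenever ||d||_inf exceeds the cutoff, as assumed in the text."""
    dr, dc = displacement(i, j, P)
    if max(abs(dr), abs(dc)) > cutoff:
        return np.zeros_like(next(iter(w_d.values())))
    return w_d[(dr, dc)]
```

Every pixel pair with the same offset thus shares one weight vector, which is what collapses the full tensor w_ijk down to the much smaller w_dk.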

Ignoring the restriction x = y, learning and inference in the GGBM can be based
on sequentially sampling from the conditional distributions p(x|y, h), p(y|x, h)
and p(h|x, y). These conditionals can all be written in closed form as
p(x | y, h) = ∏_i N( b_i + Σ_jk y_j h_k w_ijk + Σ_j y_j u_ij + Σ_k h_k v_ik , σ² )   (6)

p(h | x, y) = ∏_k 1 / ( 1 + exp( −(1/σ²) Σ_i x_i v_ik − (1/σ²) Σ_j y_j v_jk − (1/σ²) Σ_ij x_i y_j w_ijk − c_k ) )   (7)
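Equation (7) can be sketched as follows, using the full (unfactored) weight tensor and the tied visible weights V; the shapes and names are illustrative assumptions, and x and y share a dimension since x = y in this model:

```python
import numpy as np

def hidden_probs(x, y, W, V, c, sigma2):
    """p(h_k = 1 | x, y) from Eq. (7): a sigmoid of the summed linear terms
    for the two tied visible layers plus the gated bilinear term.
    W has shape (I, J, K), V has shape (I, K), c has shape (K,)."""
    act = (x @ V + y @ V + np.einsum('i,j,ijk->k', x, y, W)) / sigma2 + c
    return 1.0 / (1.0 + np.exp(-act))   # elementwise sigmoid
```

With all parameters at zero the activation vanishes and every hidden unit is on with probability 0.5, as expected for a sigmoid.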

3.1 GRBM with Preprocessing

We also define a related but much simpler model as follows. Firstly, we define
auxiliary variables t_d = Σ_i x_i y_{i+d}, where d is the offset between pixels i and j
as before. This formulation stems from the principle of the co-occurrence matrix,
where each feature is only related to particular pairs of pixels in the image.
These computations can be done as a preprocessing step. Secondly, we learn
a GRBM using the concatenation of vectors [x, t] as data. We call this model
the GRBM(X,T) and illustrate it in Figure 1c. In the figure, the dashed line
represents t being computed from x.
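Computing the auxiliary variables t_d as a preprocessing step can be sketched as follows, assuming x is a flattened P × P patch, y = x, and pairs whose shifted index leaves the patch are skipped (the boundary handling is an assumption, not specified in the paper):

```python
import numpy as np

def aux_features(x, offsets, P):
    """Auxiliary variables t_d = sum_i x_i * y_{i+d} with y = x, computed on a
    flattened P x P patch; `offsets` is a list of (dr, dc) displacements."""
    img = x.reshape(P, P)
    t = []
    for dr, dc in offsets:
        # Overlapping sub-images: a holds the reference pixels, b the
        # pixels shifted by (dr, dc); their elementwise product is summed.
        a = img[max(0, -dr):P - max(0, dr), max(0, -dc):P - max(0, dc)]
        b = img[max(0, dr):P + min(0, dr), max(0, dc):P + min(0, dc)]
        t.append(np.sum(a * b))
    return np.array(t)
```

The resulting vector t is simply concatenated with x to form the GRBM(X,T) training data.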
When we write the energy function of GRBM(X,T)

E(x, t, h) = − (1/σ²) ( Σ_ik x_i h_k v_ik + Σ_dk t_d h_k w_dk ) − Σ_k h_k c_k + (1/(2σ²)) ( Σ_i (x_i − b_i)² + Σ_d (t_d − u_d)² ),   (8)

we notice the similarities to the GGBM energy function in Equation (5). Each
parameter has its corresponding counterpart. The only remaining difference is

E(x, t, h) − E(x, y, h) = (1/(2σ²)) Σ_d t_d² + const   (9)

It turns out that p(h|x, y) can be written in exactly the same form as in Equation (7).
Since learning higher-order Boltzmann machines is known to be quite difficult,
we propose to use this related model as a way of learning them. So in practice we
first train a GRBM(X,T) and then convert the parameters to the GGBM model.
In fact, in texture classification, the converted model produces exactly the same
hidden activations h and thus the same classification results. In the texture
reconstruction problem, on the other hand, the GRBM(X,T) model cannot be used
directly, since t cannot be computed from partial observations.
We noticed experimentally that the converted GGBM model needs to be
further regularized, since the regularizing terms t_d² in the energy function of
GRBM(X,T) are dropped, as seen in Equation (9). We simply converted w_dk
and u_d by scaling them with a constant factor smaller than 1, chosen to give the
smallest validation reconstruction error.
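The conversion with damping can be sketched as below; the parameter names and the dictionary layout are assumptions for illustration, and alpha stands for the validation-selected constant:

```python
import numpy as np

def convert_grbm_to_ggbm(params, alpha):
    """Copy GRBM(X,T) parameters into the GGBM, damping the pairwise terms
    w_dk and u_d by a factor alpha < 1 to compensate for the dropped t_d^2
    regularizers of Eq. (9). All other parameters carry over unchanged."""
    out = dict(params)
    out['W_d'] = alpha * np.asarray(params['W_d'])  # pairwise hidden weights w_dk
    out['u'] = alpha * np.asarray(params['u'])      # pairwise biases u_d
    return out
```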

4 Experiments

We test our methods in texture classification and reconstruction experiments.
The proposed method is first run to extract a set of meaningful features from
the different datasets, and these features are then used for classification and
reconstruction.

Table 1. The texture classification results on various benchmark data sets

(a) Brodatz 24 data set

Settings  Training  Testing
X         25.0%     16.2%
T         54.2%     50.4%
XT        61.8%     52.8%
FX        87.6%     63.0%
FT        91.7%     65.3%
FXT       94.8%     67.0%

(b) KTH data set

Settings  Training  Testing
X         29.2%     19.0%
T         46.7%     43.8%
XT        57.3%     49.2%
FX        68.2%     60.4%
FT        72.0%     62.2%
FXT       77.4%     66.2%

4.1 Texture Classification


Two publicly available texture data sets are tested. The liblinear library [5]
is used to build the classifiers. In all classification experiments, an L1-regularized
logistic regression (L1LR) is trained. For the feature extraction experiments,
one-step contrastive divergence and some regularization parameters¹ are used.
In all experiments, 1000 hidden neurons are used, and w_dk = 0 for all ||d||∞ > 5.
For comparison, we conducted six different classification experiments:

raw image patches (X): L1LR on X
transforms of X (T): L1LR on T
joint X and T (XT): L1LR on XT
features from X (FX): first run a GRBM on X, then L1LR on FX
features from T (FT): first run a GRBM on T, then L1LR on FT
features from XT (FXT): first run a GRBM on XT, then L1LR on FXT
The classification results in our experiments cannot be directly compared to
those of other texture classification studies, as those typically extract a highly
complex feature set from the whole image, while we extract features directly
from small patches of texture. In other words, our model is capable of performing
classification even when only little information about the texture is available,
whereas conventional texture classification methods typically find it hard to
extract features if the images are too small.

Brodatz 24 Data Set. A subset of 24 different textures was manually selected
from a large collection of 112 textures. Only one large image is
available for each class [11]. The image of each class is divided into 25
{128 × 128} small images; 13 of them are used to generate the training
patches, and the rest are used to generate the testing patches. The
patch size for learning and testing is manually set to {20 × 20}.
240,000 image patches are used for extracting the features, 24,000 samples
are used for training the classifier, and 2,400 samples are used for testing. The
classification results are shown in Table 1a. Among all the experiments, the
proposed method performs best.
1
Weight decay = 0.0002, momentum = 0.2.

KTH Texture Dataset. This dataset [1] has 11 different textures, with 4 different
samples per texture and 108 different images per sample. Each image is of
size {200 × 200}, and the patch size is again {20 × 20}. Only the 108 images
from sample a² of each texture are used: 54 for generating training samples
and 54 for generating testing samples. 118,800 patches are used for extracting
the features, 11,000 patches are used for training the classifier, and 1,100 samples
are used for testing. The best result is obtained with the proposed method. Note
that a poorer overall performance is expected, as the variations within the
training samples make the problem harder. The detailed results are shown in
Table 1b.

4.2 Texture Reconstruction


We also made a texture reconstruction demonstration to show the connection
between the proposed model and its approximation. In this experiment, 6
random image patches are chosen from the Brodatz 24 testing samples, and a
{10 × 10} square at the center of each patch is removed for reconstruction. The
reconstruction results can be seen in Figure 2. For comparison, the reconstruction
results of the GRBM(X) model are provided. From this experiment, we can see that
the learned model is capable of successfully learning a generative model of the
texture. Despite the regularization, the reconstructions still exhibit some blockiness,
over-emphasizing low frequencies. One way to improve the results would be to
use the GRBM(X,T) as an initialization for the GGBM and train it further.

Fig. 2. The texture reconstruction experiment. The first row shows the random samples
with missing centers. The second row shows the reconstructions from the GRBM model,
and the reconstructions from the proposed model are shown in the third row. The original
samples are shown in the last row.

5 Conclusions
In this paper, we tackled the problem of modeling texture information. We proposed
a modified version of the GGBM and a simpler learning algorithm for it.
2
Available at http://www.nada.kth.se/cvap/databases/kth-tips/

From the experimental results, we can argue that the proposed model is beneficial
for modeling structured information such as textures. Among all the results, the
highest accuracies are obtained with the features learned by the proposed model.
Although these accuracies are not state-of-the-art, the proposed model opens up
the possibility of successfully modeling texture information with higher-order
Boltzmann machines.

References
1. Caputo, B., Hayman, E., Mallikarjuna, P.: Class-Specific Material Categorisation.
In: Int. Conf. on Computer Vision, pp. 1597–1604 (2005)
2. Cho, K., Raiko, T., Ilin, A.: Gaussian-Bernoulli Deep Boltzmann Machine. In: NIPS
2011 Workshop on Deep Learning and Unsupervised Feature Learning (2011)
3. Cho, K., Ilin, A., Raiko, T.: Improved Learning of Gaussian-Bernoulli Restricted
Boltzmann Machines. In: Int. Conf. on Artificial Neural Networks, pp. 10–17 (2011)
4. Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, Big, Simple
Neural Nets for Handwritten Digit Recognition. Neural Comput. 22(12) (2010)
5. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A Li-
brary for Large Linear Classification. JMLR, 1871–1874 (2008)
6. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classi-
fication. IEEE Trans. Syst., Man, Cybern. 3(6), 610–621 (1973)
7. Hinton, G., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural
networks. Science 313(5786), 504–507 (2006)
8. Hinton, G., Salakhutdinov, R.: Discovering Binary Codes for Documents by Learn-
ing Deep Generative Models. Topics in Cognitive Science 3(1), 74–91 (2010)
9. Kivinen, J., Williams, C.: Multiple Texture Boltzmann Machines. JMLR
W&CP 22, 638–646 (2012)
10. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In: Int. Conf.
Machine Learning, p. 77 (2009)
11. Liu, L., Fieguth, P.: Texture Classification from Random Features. IEEE Trans.
Pattern Anal. Mach. Intell. 34(3), 574–586 (2012)
12. Memisevic, R., Hinton, G.E.: Learning to Represent Spatial Transformations with
Factored Higher-Order Boltzmann Machines. Neural Comput. 22(6), 1473–1492
(2010)
13. Ranzato, M., Krizhevsky, A., Hinton, G.E.: Factored 3-Way Restricted Boltzmann
Machines For Modeling Natural Images. JMLR W&CP 9, 621–628 (2010)
14. Varma, M., Zisserman, A.: A Statistical Approach to Material Classification Using
Image Patch Exemplars. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2032–
2047 (2009)
Neural PCA and Maximum Likelihood Hebbian
Learning on the GPU

Pavel Krömer1,2, Emilio Corchado2,3, Václav Snášel1,2,


Jan Platoš1,2, and Laura Garcı́a-Hernández4
1
Department of Computer Science, VŠB-Technical University of Ostrava,
17.listopadu 15/2172, 708 33 Ostrava-Poruba, Czech Republic
2
IT4Innovations, 17.listopadu 15/2172, 708 33 Ostrava-Poruba, Czech Republic
{pavel.kromer,vaclav.snasel,jan.platos}@vsb.cz
3
Departamento de Informática y Automática, Universidad de Salamanca, Spain
escorchado@usal.es
4
Area of Project Engineering, University of Cordoba, Spain
ir1gahel@uco.es

Abstract. This study introduces a novel fine-grained parallel implementation of a neural principal component analysis (neural PCA) variant and
the Maximum Likelihood Hebbian Learning (MLHL) network designed
for modern many-core graphics processing units (GPUs). The parallel
implementation as well as the computational experiments conducted in
order to evaluate the speedup achieved by the GPU are presented and
discussed. The evaluation was done on a well-known artificial data set,
the 2D bars data set.

Keywords: neural PCA, Maximum Likelihood Hebbian Learning, Ex-


ploratory Projection Pursuit, GPU, CUDA, performance.

1 Introduction
Modern many-core GPUs have been successfully used to accelerate a variety of
meta-heuristics and bio-inspired algorithms [6,12,13] including different types
of artificial neural networks [1,10,11,14,15,17,18,20,22,24]. To fully utilize the
parallel hardware, the algorithms have to be carefully adapted to the data-parallel
architecture of the GPUs [21].
Artificial neural networks (ANNs) performing PCA and MLHL are known to
be useful for the analysis of high-dimensional data [5,25]. Their main aim is to
identify interesting projections of high-dimensional data to lower-dimensional
subspaces that reveal the hidden structure of the data sets. Due to the relative
simplicity of their operations and their generally real-valued data structures, such
networks are suitable for parallel implementation on multi-core systems and on
GPUs, which reach peak performances of hundreds to thousands of gigaFLOPS
(floating-point operations per second) at low cost.
(floating-point operations per second) at low costs.
This study presents a design and evaluation of a novel fine-grained data-
parallel implementation of an ANN for PCA and MLHL for the nVidia Compute
Unified Device Architecture (CUDA) platform.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 132–139, 2012.
© Springer-Verlag Berlin Heidelberg 2012

1.1 Neural PCA and MLHL


PCA is a standard statistical technique for compressing data; it can be shown to
give the best linear compression of the data in terms of least mean squared error.
Several ANNs have been shown to perform PCA, e.g., [9,19].
The Negative Feedback Network [9] for PCA is defined as follows. Consider
an N-dimensional input vector x and an M-dimensional output vector y, with W_ij
being the weight linking input j to output i, and let η be the learning rate.
Initially, there is no activation at all in the network. The input
data is fed forward via the weights from the input neurons (the x-values) to the
output neurons (the y-values), where a linear summation is performed to obtain the
output neuron activation values. This can be expressed as:
y_i = Σ_{j=1}^{N} W_ij x_j , ∀i   (1)

The activation is fed back through the same weights and subtracted from the
inputs (where the inhibition takes place):
e_j = x_j − Σ_{i=1}^{M} W_ij y_i , ∀j   (2)

After that, simple Hebbian learning is performed between the inputs and outputs:

ΔWij = ηej yi (3)

The effect of the negative feedback is to stabilize the network's learning. This network
is capable of finding the principal components of the input data [9] in a manner
equivalent to Oja's Subspace algorithm [19], so the weights will not
converge to the principal components themselves but to a basis of the subspace
spanned by these components.
Maximum Likelihood Hebbian Learning [2,3,4,8] is based on the above
PCA-type rule and can be described as a family of learning rules defined by the
following equations: a feedforward step (1), followed by a feedback step (2), and
then a weight change as follows:

ΔWij = ηyi sign(ej )|ej |p−1 (4)

Maximum Likelihood Hebbian Learning (MLHL) [2,3,4,8] has been linked to the
standard statistical method of Exploratory Projection Pursuit (EPP) [4,7].
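One training step of the network, covering both the Hebbian rule (3) and the MLHL rule (4) (which reduces to (3) for p = 2), can be sketched as follows; this is only an illustration of Eqs. (1)–(4), not the authors' implementation:

```python
import numpy as np

def nfn_step(W, x, eta, p=None):
    """One training step of the Negative Feedback Network.
    Feedforward (1): y = W x.  Feedback (2): e = x - W^T y.
    Weight update: the plain Hebbian rule (3) when p is None, or the
    MLHL rule (4), dW_ij = eta * y_i * sign(e_j) |e_j|^(p-1), otherwise."""
    y = W @ x                                   # (1) output activations
    e = x - W.T @ y                             # (2) residual after feedback
    if p is None:
        dW = eta * np.outer(y, e)               # (3) Hebbian learning
    else:
        dW = eta * np.outer(y, np.sign(e) * np.abs(e)**(p - 1))  # (4) MLHL
    return W + dW, y, e
```

Setting p = 2 in the MLHL rule recovers the Hebbian update exactly, which makes the family relationship between the two rules explicit.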

2 GPU Computing
Modern graphics hardware has gained an important role in the area of paral-
lel computing. The data-parallel architecture of the GPUs is suitable for vec-
tor and matrix algebra operations and it is nowadays widely used for scientific

computing. The GPUs and general-purpose GPU (GPGPU) programming have
established a new platform for neural computation. The use of GPUs to
accelerate neural information processing and artificial neural networks pre-dates
the inception of general-purpose GPU APIs [1,11,17,18]. At that time, data
structures were mapped directly to native GPU concepts such as textures, and
the operations were implemented using the vertex and pixel shaders of the GPUs.
Often, the ANNs were implemented using graphics-oriented shader programs,
OpenGL functions, or DirectX functions to accelerate ANN operations. For
example, a 20 times faster feedforward network on the GPU was presented
by Oh and Jung in [18]. Martínez-Zarzuela et al. [17] proposed a 33 times faster
GPU-based fuzzy ART network. Ho et al. [11] developed a simulator of cellular
neural networks on the GPU that was 8 to 17 times faster than a corresponding
CPU version, and Brandstetter and Alessandro [1] designed a 3- to 72-fold faster
radial basis function network powered by GPUs.
The GPGPU APIs have significantly simplified the development of neural
algorithms and ANNs for graphics hardware [10,16], and a variety of neurocomputing
algorithms have been ported to the GPUs [10,14,15,16,18,20,22,24]. The
CUDA platform was used by Sierra-Canto et al. [24] to achieve 46 to 63 times faster
learning of a feedforward ANN with the backpropagation algorithm, while
Lopes and Ribeiro [14] reported a 10 to 40 times faster implementation of multiple-backpropagation
training of feedforward and multiple feedforward ANNs.
Ghuzva et al. [10] presented a coarse-grained implementation of the multilayer
perceptron (MLP) on the CUDA platform that operated a set of MLPs in parallel
50 times faster than a sequential CPU-based implementation. The training of a
feedforward neural network by genetic algorithms was implemented on CUDA by
Patulea et al. [20] and was 10 times faster than a sequential version of the same
algorithm. An application of a GPU-powered ANN to speech recognition is due
to Scanzio et al. [22]; the GPU technology accelerated the ANN approximately
6 times. Martínez-Zarzuela et al. [15] used the GPU to speed up a neural texture
classification process and achieved 16 to 26 times better performance than on
the CPU. In [16], the authors implemented a fuzzy ART network on the CUDA
platform and achieved a 57-fold peak speedup.
An example of the use of GPUs for unsupervised neural networks is due to
Shitara et al. [23]. Three different graphics cards were used to benchmark the
performance of the algorithm, and it was shown that GPUs can improve the
performance of the SOM up to 150 times for certain hardware configurations.
In this research, the CUDA platform is used to accelerate the training of the
Negative Feedback Network and also for MLHL.

3 A Version of Neural PCA and MLHL on CUDA

To the authors' knowledge, there is no prior research on GPU acceleration of the
training phase of ANNs for PCA and MLHL. However, Oh and Jung [18] combined
PCA and an ANN with a CUDA-accelerated feedforward pass in a system for
viewpoint-tolerant human pose recognition.

[Figure 1: flowcharts of the CPU–GPU implementations. A CPU thread loads and preprocesses the training data and generates a random weight matrix w; in each training iteration a random data row x is selected, and GPU kernel calls compute y, compute e, and update W.]

(a) Neural PCA Flowchart (b) MLHL Flowchart

Fig. 1. Neural PCA and MLHL on CUDA

The CUDA implementation of the Negative Feedback Network and MLHL


is outlined in fig. 1a and fig. 1b respectively. The GPU was used to accelerate
the iterative phase of the algorithm (i.e. (1) - (4)). The implementation used the
cublas library, a set of custom kernels that implemented operations not available
in cublas such as the sign function, and auxiliary kernels for common operations
such as generation of batches of random numbers. All operations of the iterative
phase of network training were implemented on CUDA to minimize memory
transfers between the host and the device and maximize the performance of the
implementation.
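The iterative phase referred to above ((1)-(4)) can be illustrated with a minimal pure-Python sketch. This is not the authors' CUDA-C code but an assumed reading of the standard negative feedback and MLHL updates (feedforward y = Wx, residual e = x − Wᵀy, Hebbian update ΔW = η y eᵀ, with MLHL transforming e into sign(e)|e|^(p−1) before the update):

```python
def train_step(W, x, lr, p=None):
    """One iteration of negative feedback network training (neural PCA);
    passing the exponent p switches to the MLHL update.  W is an m x n
    weight matrix stored as a list of rows."""
    m, n = len(W), len(W[0])
    # feedforward: y = W x
    y = [sum(W[j][i] * x[i] for i in range(n)) for j in range(m)]
    # feedback residual: e = x - W^T y
    e = [x[i] - sum(W[j][i] * y[j] for j in range(m)) for i in range(n)]
    if p is not None:
        # MLHL: replace e by sign(e) |e|^(p-1) before the Hebbian update
        e = [(1.0 if ei >= 0 else -1.0) * abs(ei) ** (p - 1) for ei in e]
    # Hebbian update: W <- W + lr * y e^T
    for j in range(m):
        for i in range(n):
            W[j][i] += lr * y[j] * e[i]
    return W
```

In the GPU version each of these steps (the matrix-vector products, the residual, and the outer-product update) maps onto a cublas call or a custom kernel, so that the data never leave device memory during training.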

3.1 Experiments and Results


To evaluate the performance of the Negative Feedback Network and MLHL on
CUDA, the fine-grained parallel implementations were compared to sequential
single-threaded CPU implementations of the same algorithms. Both networks
were implemented from scratch in C/C++ and CUDA-C and their execution
times for the same data set were compared. The experiments were performed
on a server with 2 dual core AMD Opteron processors at 2.6GHz and an nVidia
Tesla C2050 device with 448 cores at 1.15GHz. The server was running Linux
operating system and CUDA SDK 4.0 was used.
To obtain a randomized high-dimensional data set with clear internal struc-
ture and simple interpretation, two variants of the 2D bars data set were gener-
ated. The first one contained 10000 records with 256 attributes and the second
136 P. Krömer et al.

Fig. 2. First 20 records of the 1024-dimensional data set as 32 × 32 images

one contained 10000 records with 1024 attributes. Each record in the data set
can be seen as an n × n image with a single vertical or horizontal bar painted
by different shades of gray (represented by real values between 0.7 and 1). The
visualisation of the first 20 records of the 1024-dimensional data set is shown
in fig. 2. It can be expected that in such a data set the pictures with the bar
in the same position ought to form (at least one) cluster, i.e. there might be at
least n + n clusters. The randomized data sets used in this study contained 15
and 31 unique bar positions respectively.
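A generator for such a bars data set might look as follows; this is a hypothetical sketch (the exact randomisation scheme used by the authors is not specified in the text):

```python
import random

def make_bars(records, n, shades=(0.7, 1.0), seed=0):
    """Generate the 2D bars data: each record is an n x n image (flattened
    to n*n attributes) with one horizontal or vertical bar painted in a
    random shade of gray drawn from `shades`."""
    rng = random.Random(seed)
    data = []
    for _ in range(records):
        img = [0.0] * (n * n)
        pos = rng.randrange(n)           # bar position
        shade = rng.uniform(*shades)     # gray level in [0.7, 1]
        if rng.random() < 0.5:           # horizontal bar in row `pos`
            for c in range(n):
                img[pos * n + c] = shade
        else:                            # vertical bar in column `pos`
            for r in range(n):
                img[r * n + pos] = shade
        data.append(img)
    return data
```

With n = 16 this yields the 256-attribute variant and with n = 32 the 1024-attribute variant described above.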
The data sets were processed by both the Negative Feedback Network and the MLHL, on CPU and GPU, with the following parameters: 100000 iterations, learning rate 0.00001, and the MLHL parameter p = 2.2.
In the experiment, the dimension of the target subspace m was set to the
powers of 2 from the interval [2, DIM ] (where DIM was the full dimension of the
data set) and the execution time of network training was measured. The results
are visualized in fig. 3. It clearly illustrates how the execution time grows with the
dimension of the target subspace m and with the number of attributes. These two
parameters define the complexity of the vector-matrix operations. As expected,
the CPU is faster for small m (m < 32 for 256-dimensional data and m < 16
for 1024-dimensional data) for the Negative Feedback Network. The MLHL on
the GPU was faster than the CPU-based implementation of the same algorithm
even for small values of m. The speedup obtained by the parallel implementation
for the 256-dimensional data set ranged from 1.4 for m = 32 to 5.5 for m = 256
for the Negative Feedback Network and from 1.5 to 6.1 for the MLHL. The
performance increase was more significant for the 1024-dimensional data set.
The improvement in the training time of the Negative Feedback Network on the
GPU ranged from 1.36 times faster training for m = 16 to 47.95 times faster training
for m = 512. The processing of the 1024-dimensional data set by the MLHL on
the GPU was between 2.18 to 47.81 times faster than on the CPU.
The performance results of both algorithms for the 1024-dimensional data
set on different hardware are visualized in fig. 3. It displays the dependency of
the execution time (y-axis, note the log scale) on the dimension of the target
subspace m and illustrates how the GPU versions of the algorithms outperform
the CPU versions by an order of magnitude for larger m.
The visual results of the projection of the 1024-dimensional data set to the 2-
dimensional subspace are for both methods shown in fig. 4. Figure 4a shows the
results of the projection by the neural PCA and fig. 4b shows the structure of the
same data processed by the MLHL. Points representing images that had a bar
in the same position were drawn in the same color. We can clearly see that both

[Figure: execution time (Time [ms], log scale from 10^3 to 10^7) against the target subspace dimension m ∈ {2, 4, ..., 1024} for PCA and MLHL on an AMD Opteron 2.2GHz CPU and on a Tesla C2050 GPU.]

Fig. 3. Neural PCA (Negative Feedback Network) and MLHL execution time for the
1024-dimensional data set

(a) Neural PCA (b) MLHL

Fig. 4. The results of projection to 2D for the 1024-dimensional data set

projections have emphasized a structure in the data. The neural PCA version
clearly separated several clusters from the rest of the data, which populates
the center of the graph, while the MLHL led to a more regular pattern of 2D
clusters. This can serve as a visual proof that the CUDA-C implementations of
both algorithms provide projections to lower dimensional subspaces with good
structure.

4 Conclusions
This research introduced a fine-grained data-parallel implementation of two
types of ANNs, the Negative Feedback network for the PCA and the Maxi-
mum Likelihood Hebbian Learning network. The GPU versions of the algorithms
achieved a significant speedup in training times for two high-dimensional artificial
data sets. When projecting to low dimensional subspaces (m < 16), the
CPU version of the negative feedback network was faster but when projecting

the data to spaces with larger dimension, the GPU was up to 47.99 times faster
(for the 1024-dimensional data set and m = 1024). The projection through the
MLHL network was faster on the GPU for all m ∈ [2, DIM ], ranging from 2.1-fold
speedup for m = 8 to 47.81 times faster execution time for m = 1024.
In the future, other variants of the MLHL will be implemented and the GPU
version will be used to process and analyze real world data sets.
Acknowledgements. This research is partially supported through a project of
the Spanish Ministry of Economy and Competitiveness [ref: TIN2010-21272-C02-
01] (funded by the European Regional Development Fund). This work was also
supported by the European Regional Development Fund in the IT4Innovations
Centre of Excellence project (CZ.1.05/1.1.00/02.0070) and by the Bio-Inspired
Methods: research, development and knowledge transfer project, reg. no. CZ.1.07
/2.3.00/20.0073 funded by Operational Programme Education for Competitive-
ness, co-financed by ESF and state budget of the Czech Republic.

References
1. Brandstetter, A., Artusi, A.: Radial basis function networks gpu-based implemen-
tation. IEEE Transactions on Neural Networks 19(12), 2150–2154 (2008)
2. Corchado, E., Fyfe, C.: Orientation selection using maximum likelihood hebbian
learning. Int. Journal of Knowledge-Based Intelligent Engineering 2(7) (2003)
3. Corchado, E., Han, Y., Fyfe, C.: Structuring global responses of local filters using
lateral connections. J. Exp. Theor. Artif. Intell. 15(4), 473–487 (2003)
4. Corchado, E., MacDonald, D., Fyfe, C.: Maximum and minimum likelihood heb-
bian learning for exploratory projection pursuit. Data Mining and Knowledge Dis-
covery 8, 203–225 (2004)
5. Corchado, E., Perez, J.C.: A three-step unsupervised neural model for visualizing
high complex dimensional spectroscopic data sets. Pattern Anal. Appl. 14(2), 207–
218 (2011)
6. De, P., Veronese, L., Krohling, R.A.: Swarm’s flight: accelerating the particles using
c-cuda. In: Proceedings of the Eleventh conference on Congress on Evolutionary
Computation, CEC 2009, pp. 3264–3270. IEEE Press, Piscataway (2009)
7. Friedman, J., Tukey, J.: A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C-23(9), 881–890 (1974)
8. Fyfe, C., Corchado, E.: Maximum likelihood Hebbian rules. In: Verleysen, M. (ed.)
ESANN 2002, Proceedings of the 10th European Symposium on Artificial Neural
Networks, Bruges, Belgium, April 24-26, pp. 143–148 (2002)
9. Fyfe, C.: A neural network for pca and beyond. Neur. Proc. Letters 6, 33–41 (1997)
10. Guzhva, A., Dolenko, S., Persiantsev, I.: Multifold Acceleration of Neural Network
Computations Using GPU. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas,
G. (eds.) ICANN 2009, Part I. LNCS, vol. 5768, pp. 373–380. Springer, Heidelberg
(2009)
11. Ho, T.Y., Lam, P.M., Leung, C.S.: Parallelization of cellular neural networks on
gpu. Pattern Recogn. 41(8), 2684–2692 (2008)
12. Krömer, P., Platoš, J., Snášel, V., Abraham, A.: An Implementation of Differential
Evolution for Independent Tasks Scheduling on GPU. In: Corchado, E., Kurzyński,
M., Woźniak, M. (eds.) HAIS 2011, Part I. LNCS, vol. 6678, pp. 372–379. Springer,
Heidelberg (2011)

13. Langdon, W.B., Banzhaf, W.: A SIMD Interpreter for Genetic Programming
on GPU Graphics Cards. In: O’Neill, M., Vanneschi, L., Gustafson, S., Esparcia
Alcázar, A.I., De Falco, I., Della Cioppa, A., Tarantino, E. (eds.) EuroGP 2008.
LNCS, vol. 4971, pp. 73–85. Springer, Heidelberg (2008)
14. Lopes, N., Ribeiro, B.: GPU Implementation of the Multiple Back-Propagation
Algorithm. In: Corchado, E., Yin, H. (eds.) IDEAL 2009. LNCS, vol. 5788, pp.
449–456. Springer, Heidelberg (2009)
15. Martı́nez-Zarzuela, M., Dı́az-Pernas, F., Antón-Rodrı́guez, M., Dı́ez-Higuera, J.,
González-Ortega, D., Boto-Giralda, D., López-González, F., De La Torre, I.: Multi-
scale neural texture classification using the gpu as a stream processing engine.
Machine Vision and Applications 22, 947–966 (2011)
16. Martı́nez-Zarzuela, M., Pernas, F., de Pablos, A., Rodrı́guez, M., Higuera, J., Gi-
ralda, D., Ortega, D.: Adaptative Resonance Theory Fuzzy Networks Parallel Com-
putation Using CUDA. In: Cabestany, J., Sandoval, F., Prieto, A., Corchado, J.M.
(eds.) IWANN 2009, Part I. LNCS, vol. 5517, pp. 149–156. Springer, Heidelberg
(2009)
17. Martı́nez-Zarzuela, M., Dı́az Pernas, F., Dı́ez Higuera, J., Rodrı́guez, M.: Fuzzy
ART Neural Network Parallel Computing on the GPU. In: Sandoval, F., Prieto,
A.G., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 463–470.
Springer, Heidelberg (2007)
18. Oh, K.S., Jung, K.: GPU implementation of neural networks. Pattern Recogni-
tion 37(6), 1311–1314 (2004)
19. Oja, E.: Neural networks, principal components, and subspaces. International Journal of Neural Systems 1(1), 61–68 (1989)
20. Patulea, C., Peace, R., Green, J.: Cuda-accelerated genetic feedforward-ann train-
ing for data mining. J. of Physics: Conference Series 256(1), 012014 (2010)
21. Sanders, J., Kandrot, E.: CUDA by Example: An Introduction to General-Purpose
GPU Programming, 1st edn. Addison-Wesley Professional (July 2010)
22. Scanzio, S., Cumani, S., Gemello, R., Mana, F., Laface, P.: Parallel implementation
of artificial neural network training for speech recognition. Pattern Recognition
Letters 31(11), 1302–1309 (2010)
23. Shitara, A., Nishikawa, Y., Yoshimi, M., Amano, H.: Implementation and eval-
uation of self-organizing map algorithm on a graphic processor. In: Parallel and
Distributed Computing and Systems 2009 (2009)
24. Sierra-Canto, X., Madera-Ramirez, F., Uc-Cetina, V.: Parallel training of a back-
propagation neural network using cuda. In: Proceedings of the 2010 Ninth In-
ternational Conference on Machine Learning and Applications, ICMLA 2010, pp.
307–312. IEEE Computer Society, Washington, DC (2010)
25. Zhang, K., Li, Y., Scarf, P., Ball, A.: Feature selection for high-dimensional machin-
ery fault diagnosis data using multiple models and radial basis function networks.
Neurocomputing 74(17), 2941–2952 (2011)
Construction of Emerging Markets Exchange Traded
Funds Using Multiobjective Particle Swarm Optimisation

Marta Díez-Fernández1, Sergio Alvarez Teleña1, and Denise Gorse1


1 Dept of Computer Science, University College London,
Gower Street, London WC1E 6BT, UK
{M.Diez,S.Alvarez,D.Gorse}@cs.ucl.ac.uk

Abstract. Multiobjective particle swarm optimisation (MOPSO) techniques are


used to implement a new Andean stock index as an exchange traded fund (ETF)
with weightings adjusted to allow for a tradeoff between the minimisation of
tracking error, and liquidity enhancement by the reduction of transaction costs
and market impact. Solutions obtained by vector evaluated PSO (VEPSO) are
compared with those obtained by the quantum-behaved version of this
algorithm (VEQPSO) and it is found the best strategy for a portfolio manager
would be to use a hybrid front with contributions from both versions of the
MOPSO algorithm.

Keywords: Multiobjective optimisation, particle swarm optimisation, portfolio


management, emerging markets.

1 Introduction

Emerging markets are increasingly being regarded as the new drivers of the global
economy and as a consequence more and more investors regard emerging markets
investments as a critical component in their portfolios. While many such investors
have chosen to focus on the 'Big Four' of Brazil, Russia, India and China, there are
opportunities beyond these, in particular in Andean countries rich in mineral
resources such as Colombia and Chile, whose economies are growing at an
accelerating pace. It is not always easy for foreign investors to gain exposure to these
markets, and one of the best ways is to invest in an exchange traded fund (ETF) that
replicates the behaviour of a representative index of stocks. However in setting up
such a fund it is necessary to consider both the transaction costs involved in buying
and selling the component assets and also the market impact of these transactions,
both of which may be larger in less developed economies. The aim of this work is to
show how multiobjective particle swarm optimisation can be used to implement a
new Andean index as an ETF with internal weightings adjusted to minimise tracking
error (how closely the fund mimics the behaviour of the index) while reducing
transaction costs and market impact.
Particle swarm optimisation (PSO) [1] is a population-based search algorithm that
has achieved popularity by being simple to implement, having low computational

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 140–147, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Construction of Emerging Markets Exchange Traded Funds Using MOPSO 141

costs, and by having been found effective in a wide variety of application areas,
including finance [2]. However in its multiobjective form (MOPSO), where the aim is
to search for solutions that optimally satisfy possibly conflicting demands such as
maximising profit while at the same time minimising risk, it has so far been used
relatively little in a financial context (for examples see [3], [4]) and in particular
neither MOPSO nor any other population-based multiobjective algorithm has been
applied to the problem of minimising an index tracking error while attempting to
enhance liquidity, the subject of the current work.

2 Methods

2.1 Multiobjective Particle Swarm Optimisation

The relative simplicity of PSO and its quantum-behaved variant QPSO made them
natural candidates to be extended for multiobjective optimisation. The methods used
here are vector evaluated PSO (VEPSO) [5] and its quantum-behaved equivalent
VEQPSO [6], in which swarms seeking to optimise two conflicting objectives
exchange information by following each other's leaders. QPSO has the advantage of
needing fewer training parameters to be set—other than the number of iterations in
fact only one, the contraction-expansion coefficient β—compared to the original PSO,
which requires also a decision to be made about the balance between learning based
on each particle's own past best experience (cognitive contribution weighted by φ1)
and learning based on following the swarm's—or in the case of VEPSO a
neighbouring swarm's—best performing member (social contribution weighted by
φ2). The standard form of PSO also requires the specification of an iteration-
decreasing inertia weight W that balances the above forms of learning (exploitation of
the search space) with random search (exploration).
The equations used to update the velocity $v_i^s$ and position $x_i^s$ of particle $i$ in
swarm $s$ (where here $s = 1, 2$) in the two versions of the multiobjective PSO algorithm
are given in summary below. In these expressions $p_{i,t}^s$ ('personal best') is the best
parameter position (in relation to the objective to be optimised by swarm $s$) found at
time $t$ by particle $i$, $g_t^s$ ('global best') is the best position found at this time by any
particle in swarm $s$, and '$s+1$', in the case of two-objective optimisation, as here,
denotes addition mod 2 (i.e. the leader of the competitor swarm is followed).

VEPSO:

$$v_{i,t+1}^s = W v_{i,t}^s + \varphi_1 \beta_1 (p_{i,t}^s - x_{i,t}^s) + \varphi_2 \beta_2 (g_t^{s+1} - x_{i,t}^s) ,$$  (1a)

$$x_{i,t+1}^s = x_{i,t}^s + v_{i,t+1}^s ,$$  (1b)

where $\beta_1$, $\beta_2$ are random numbers chosen uniformly from the interval [0, 1].
142 M. Díez-Fernández, S. Alvarez Teleña, and D. Gorse

VEQPSO:

$$x_{i,t+1}^s = \phi\, p_{i,t}^s + (1 - \phi)\, g_t^{s+1} - \vartheta(k) \times \beta\, (m_t^s - x_{i,t}^s) \times \ln(1/u) ,$$  (2a)

where $\phi$, $k$, $u$ are random numbers chosen uniformly from the interval [0, 1],

$$\vartheta(k) = \begin{cases} 1 & \text{if } k \ge 0.5 \\ 0 & \text{if } k < 0.5 , \end{cases}$$  (2b)

and $m^s$ is the mean of the personal best positions for members of swarm $s$,

$$m^s = \frac{1}{N} \sum_{i=1}^{N} p_i^s .$$  (2c)
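The two update rules (1a)-(2c) can be sketched in Python as below. This is an illustrative reading, not the authors' code; the function names are ours, and the indicator ϑ(k) is implemented exactly as printed in (2b):

```python
import math
import random

def vepso_step(v, x, p_best, g_other, W, phi1=2.0, phi2=2.0, rng=random):
    """VEPSO update (1a)-(1b): the particle is pulled toward its own best
    position and the global best of the *other* swarm (g_other)."""
    b1, b2 = rng.random(), rng.random()
    v = [W * vi + phi1 * b1 * (pi - xi) + phi2 * b2 * (gi - xi)
         for vi, xi, pi, gi in zip(v, x, p_best, g_other)]
    x = [xi + vi for xi, vi in zip(x, v)]
    return v, x

def veqpso_step(x, p_best, g_other, mean_best, beta, rng=random):
    """VEQPSO update (2a): positions only, no velocity term."""
    new_x = []
    for xi, pi, gi, mi in zip(x, p_best, g_other, mean_best):
        phi, k = rng.random(), rng.random()
        u = 1.0 - rng.random()            # in (0, 1], so log(1/u) is defined
        theta = 1.0 if k >= 0.5 else 0.0  # theta(k) from (2b)
        new_x.append(phi * pi + (1 - phi) * gi
                     - theta * beta * (mi - xi) * math.log(1.0 / u))
    return new_x
```

Note that when the personal best, the neighbouring leader, and the mean best all coincide with the current position, both rules leave the particle governed only by its inertia (VEPSO) or hold it at the attractor (VEQPSO).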

It has generally been found that QPSO is both faster and more effective in finding
good solutions than the original PSO [7]. However it will be shown in the Results that
neither form of PSO can be regarded as 'better' for the current problem, as VEPSO
and VEQPSO methods will be seen to contribute solutions appropriate to different
parts of the problem space.

2.2 Construction of the Pareto Front

The most usual way to assess and compare the results of k-objective optimisation
procedures (typically phrased as the need to minimise each of f1(x), f2(x),..., fk(x),
possibly subject to a number of external constraints) in an n-dimensional decision
variable space x = (x1, x2,..., xn) is via a Pareto front

$$PF^* = \{\, f_1(x), f_2(x), \ldots, f_k(x) \mid x \in P^* \,\} ,$$  (3)

where $P^*$ is the Pareto optimal set of nondominated solutions $x$, where one solution $x$
is said to dominate another, $v$, denoted $x \le v$, if it is better than $v$ with respect to at
least one problem objective and no worse with respect to any of the others:

$x \le v$ if and only if:
$f_i(x) \le f_i(v)$ for all $i \in \{1, 2, \ldots, k\}$ and $f_j(x) < f_j(v)$ for at least one $j \in \{1, 2, \ldots, k\}$.  (4)

This method is adopted here for the case of k=2 (as here there are two quantities to be
minimised, index tracking error TE and a joint measure of transaction costs and
market impact, TC&MI) and n=8 (as here there are eight weights, one for each of the
assets in the prototype eight-component portfolio).
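The dominance test (4) and the extraction of the nondominated set translate directly into code; the following is a minimal sketch (for minimisation, as here):

```python
def dominates(a, b):
    """a dominates b: no worse in every objective and strictly better
    in at least one -- definition (4), minimisation."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def pareto_front(points):
    """Keep only the nondominated points (the Pareto optimal set P*)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

For the present problem each point is a pair (TE, TC&MI) evaluated at a candidate weight vector in the 8-dimensional decision space.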

2.3 Definition of the Benchmark Index


While it is later intended to extend to a larger number of stocks drawn from a larger
number of sectors, in the present work the intention is to track an equally weighted
test index made up of eight stocks with the country and sector distributions below.

Table 1. Country and sector distributions for the stocks in the benchmark index

Country % Weight Sector % Weight


Chile 62.5 Financials 37.5
Colombia 25 Utilities 25
Peru 12.5 Energy 12.5
Materials 12.5
Industrials 12.5

In practice, because it is otherwise difficult to gain access to local stocks in


emerging markets, ADRs (American Depository Receipts) of the eight local assets
were used. An ADR is a negotiable security that represents the underlying security of
a non-U.S. company but can be traded like a domestic stock in the U.S. financial
markets. ADRs additionally have the benefit that their prices are calculated constantly
based on the local asset and local currency values, so the effect of currency
movements does not need separate consideration. Daily data was obtained
representing the performance of this benchmark index using the time period 6
October 2009 to 2 January 2010 as the training set and from 3 January 2010 to 16
March 2010 as the test set.

2.4 Measuring Tracking Error, Transaction Costs and Market Impact

The measurement of tracking error (TE), the first of our objectives to be minimised, is
straightforward: it is the standard deviation of the difference between returns from the
above benchmark and from the constructed ETF. The second objective to be
minimised is denoted TC&MI and is the sum of transaction costs (TC) and market
impact (MI). Transaction costs are easy to define, being given as the bid-ask spread
(the difference between what one would pay to buy an asset and what one could sell it
for) and taxes on gains and dividends, if any, associated with the assets held.
However it is considerably more difficult to obtain a workable definition of market
impact, and it is necessary in emerging markets to use local expertise (this is standard
practice in the industry) to assign a parameter γi to each asset i which is calculated
linearly using the expert's estimation of the market impact that buying or selling a
specified amount of shares would have in the market. The market impact of a
modification to an n-asset portfolio is then calculated according to the following formula

$$MI = \sum_{i=1}^{n} \frac{w_i \times \$budget}{price_i \times MeDV_i} \times \gamma_i ,$$  (5)
in which for each of the i=1..n assets wi is its weight in the portfolio (note wi=0 if no
transactions have been carried out for asset i); $budget is the total budget managed, in
US dollars; pricei is the closing price of the asset on the day of the transaction; MeDVi
is the median daily volume of transactions in that asset (the median being calculated
over three months of past data); and γi is the expert-derived parameter discussed
above. Median rather than average daily volume is used to avoid the effect of outliers
generated by 'block trades' in which large volumes of shares may change hands.
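Both objectives can be sketched as follows. This is illustrative only: the function and variable names are ours, and the inputs are assumed to be in consistent units (budget and prices in US dollars, returns as daily fractions):

```python
import math

def tracking_error(bench_returns, etf_returns):
    """TE: standard deviation of the differences between benchmark
    returns and ETF returns."""
    d = [b - e for b, e in zip(bench_returns, etf_returns)]
    mu = sum(d) / len(d)
    return math.sqrt(sum((x - mu) ** 2 for x in d) / len(d))

def market_impact(weights, budget, prices, medv, gammas):
    """Market impact of an n-asset portfolio, eq. (5): weights w_i,
    total budget, closing prices, median daily volumes MeDV_i, and the
    expert-derived parameters gamma_i."""
    return sum(w * budget / (p * v) * g
               for w, p, v, g in zip(weights, prices, medv, gammas))
```

TC&MI is then the sum of this market impact term and the bid-ask-spread-based transaction costs.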

2.5 Learning Algorithm Parameters and Settings


Both the VEPSO and VEQPSO algorithms used 500 iterations and swarm sizes of
N=16 particles. The algorithms were not found to be strongly sensitive to the values
of these parameters provided N ≥16 and at least 200-300 iterations of PSO were
performed. Other relevant parameters were set as follows: for VEPSO, cognitive and
social learning factors are given by φ1= φ2 =2, with a decreasing inertia weight W in
the range [1, 0.4]; for VEQPSO an increasing contraction-expansion coefficient β in
the range [0.4, 1] was used. These parameter values are regarded as good general-
purpose choices in the PSO literature and were not optimised for this particular task.
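The two parameter schedules can be sketched as below. The paper states only the ranges ([1, 0.4] for W, [0.4, 1] for β), so the linear interpolation over the iteration counter, and the function names, are our assumptions:

```python
def inertia_weight(t, T, w_start=1.0, w_end=0.4):
    """Assumed linearly decreasing inertia W for VEPSO at iteration t of T."""
    return w_start + (w_end - w_start) * t / (T - 1)

def contraction_expansion(t, T, b_start=0.4, b_end=1.0):
    """Assumed linearly increasing contraction-expansion beta for VEQPSO."""
    return b_start + (b_end - b_start) * t / (T - 1)
```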

3 Results
As discussed in section 2.2, this work follows the usual methodology for
multiobjective optimisation problems in constructing a Pareto front of candidate
solutions. In the context of the present application a fund manager would be able to
choose from such a front a solution (weighted combination of the eight assets) that
emphasised either close tracking of the underlying index or maximal liquidity
(minimisation of transaction costs and market impact). However it was discovered
that while the Pareto fronts obtained by VEPSO and VEQPSO were of similar
quality, 0.968 and 0.936 respectively in terms of their hypervolume [8] (a measure of
the degree to which both objectives are being jointly achieved, being preferably as
large as possible in a situation such as this in which two or more quantities are to be
simultaneously minimised), they were significantly different in that VEPSO
predominantly found solutions with a low tracking error while VEQPSO in contrast
found solutions with low market impact and transaction costs. It was noteworthy that
no modification of the learning process—changes to learning parameters, running the
algorithms for more iterations, increasing or decreasing the number of particles in the
swarms, or attempting to add to the fronts by reinitialising the weights and re-running
the algorithm—was able to change this. Such an effect has not to our knowledge
been previously observed where standard and quantum behaved PSOs were being
compared for the same multiobjective problem, and the reasons why the algorithms
here appear to specialise in certain areas of the solution space are under investigation.
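For two minimisation objectives, the hypervolume [8] reduces to a sum of non-overlapping rectangles against a chosen reference point; the following sketch assumes such a reference point is given (the one used by the authors is not stated):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimisation front w.r.t. reference
    point `ref`: sweep the nondominated points in increasing order of the
    first objective and accumulate the dominated rectangles."""
    pts = sorted(p for p in front if p[0] <= ref[0] and p[1] <= ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv
```

Larger values indicate that both objectives are being jointly achieved to a greater degree, which is how the 0.968 (VEPSO) and 0.936 (VEQPSO) figures above are to be read.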

Fig. 1. Merged VEPSO (diamonds)-VEQPSO (circles) Pareto front



It was decided on pragmatic grounds that the best solution set would be a merging
of those points derived from VEPSO and those derived from VEQPSO, and this
merged Pareto front is shown in Figure 1 below. Note that the two groups of points
are more strongly separated in terms of tracking error than in terms of TC&MI; this is
another feature not significantly affected by performing additional runs of the
algorithm or by modifying training parameters, and also appears to be a feature of the
application of multiobjective algorithms to this particular data set.
It was clearly of interest to look at the composition of the generated optimal ETF
portfolios as one moves through the Pareto front. Figure 2 shows how the proportions
of the assets allocated to the five industrial sectors (Figure 2a) and the three countries
(Figure 2b) vary as one moves along the x-axis (TE) of the merged Pareto front. Note
the large break along the TE axis in both figures; the parts of the curves to the left of
this are derived purely from VEPSO solutions, and those to the right from VEQPSO,
and reflect the division shown in Figure 1. As TE → 0 these proportions should
automatically approach those of the benchmark portfolio, which was observed to be
the case. As TE increases (corresponding on the Pareto front to a lowered TC&MI) in
the case of sector allocations one sees an increase in the proportion of assets assigned
to the utilities sector, and a corresponding decrease with respect to the other sectors.
In the case of country, one sees an equally marked increase in the allocations to Chile.
The case of Peru is interesting as allocations to this country initially fall, then rise
again for a time at higher allowed values of TE (lower TC&MI).

Fig. 2. Portfolio composition in relation to a) industrial sector and b) country, as a function of


tracking error (TE)

Figure 3 shows in more detail how portfolio composition, now in terms of the eight
individual assets, varies as the allowed tracking error (TE) increases. It can be seen
that just two of the assets would take over the portfolio in the limit of very high TE,
one of them Chile's national energy provider, the other the major bank in Colombia,
these being the component assets within their sectors that are the most liquid.
Figures 4a, 4b show equivalent variations as TC&MI increases. Note that the
variations seen in these figures are expected to be the converse of those seen for TE
variation in Figure 2 (in the sense that a behaviour associated with a low TE
corresponds to a high TC&MI, and vice versa) since the MOPSO algorithms play off
the minimisation of one objective against the other, and it can be observed that this is
broadly the case.

Fig. 3. Portfolio composition in relation to the eight included assets as a function of tracking
error

Fig. 4. Portfolio composition in relation to a) industrial sector and b) country, as a function of


transaction costs and market impact (TC&MI)

To a significant extent the results of Figures 2 and 4 can be explained by the


importance of Chile, which is the most developed of the Andean economies and
currently second only to Brazil in economic importance within Latin America, with
foreign investment in Chile still rapidly increasing. In addition utility stocks are
among the most widely traded in Latin America; this activity acts to decrease TC&MI
very substantially for all Chilean stocks, but especially for Chilean utilities.

4 Discussion
It has been demonstrated that a combination of vector-evaluated PSO (VEPSO) and
its quantum-behaved equivalent VEQPSO can deliver an optimal trade-off between
tracking error minimisation and liquidity enhancement for a portfolio manager who
wishes to launch an ETF to track an index. The experimental results show a hybrid
Pareto front obtained from a combination of these algorithms produces the best range
of well-balanced Pareto-optimal solutions. Future research will be focused on a)
gaining a better understanding of why the two forms of MOPSO appear to specialise
so strongly in the minimisation of one or other of the objectives; b) experimenting
with a range of nonlinear market impact models to replace the linear one used here; c)
analysing the stability of the portfolio weights along the Pareto front in order to see
how robust these solutions are (it is expected VEQPSO will deliver a more steady

tracking error as it generates a more diversified composition); d) looking at the


possibility of using sector-futures, as this could increase the liquidity of the ETF; and
e) using performance as a third variable to optimise, as an ETF that has over-
performed its benchmark could be more attractive to potential clients.

References
1. Kennedy, J., Eberhart, R.: Particle Swarm Optimization. In: IEEE International Conference on Neural Networks, pp. 1942–1948. IEEE Press, New York (1995)
2. Poli, R.: An Analysis of Publications on Particle Swarm Optimisation Application.
Technical report, Department of Computer Science, University of Essex (2007)
3. Mishra, K.S., Panda, G., Meher, S.: Multi-objective Particle Swarm Optimization Approach
to Portfolio Optimization. In: 2009 World Congress on Nature and Biologically Inspired
Computing, pp. 1611–1614. IEEE Press, New York (2009)
4. Briza, A.C., Naval Jr., P.C.: Stock Trading System Based on the Multi-objective Particle
Swarm Optimization of Technical Indicators on End-of-Day Market Data. Applied Soft
Computing 11, 1191–1201 (2011)
5. Parsopoulos, K.E., Vrahatis, M.N.: Particle Swarm Optimization Method in Multiobjective
Problems. In: 2002 ACM Symposium on Applied Computing, pp. 603–607. ACM Press
(2002)
6. Omkar, S.N., Khandelwal, R., Ananth, T.V.S., Naik, G.N., Gopalakrishnan, S.: Quantum
Behaved Particle Swarm Optimization (QPSO) for Multi-objective Design Optimization of
Composite Structures. Expert Systems with Applications 36, 11312–11322 (2009)
7. Sun, J., Xu, W., Feng, B.: A Global Search Strategy of Quantum-Behaved Particle Swarm
Optimization. In: 2004 IEEE Conference on Cybernetics and Intelligent Systems, pp. 111–
116. IEEE Press, New York (2004)
8. Beume, N., Fonseca, C.M., López-Ibáñez, M., Paquete, L., Vahrenhold, J.: On the Complexity
of Computing the Hypervolume Indicator. IEEE Transactions on Evolutionary
Computation 13, 1075–1082 (2009)
The Influence of Supervised Clustering
for RBFNN Centers Definition:
A Comparative Study

André R. Gonçalves, Rosana Veroneze, Salomão Madeiro,


Carlos R.B. Azevedo, and Fernando J. Von Zuben

School of Electrical and Computer Engineering,


University of Campinas, Campinas, SP, Brazil
{andreric,veroneze,salomaosm,azevedo,vonzuben}@dca.fee.unicamp.br

Abstract. Several clustering algorithms have been considered to determine the centers and dispersions of the hidden layer neurons of Radial Basis Function Neural Networks (RBFNNs) when applied both to regression and classification tasks. Most of the proposed approaches use unsupervised clustering techniques. However, for data classification, by performing supervised clustering it is expected that the obtained clusters represent meaningful aspects of the dataset. We therefore compared the original versions of k-means, Neural-Gas (NG) and Adaptive Radius Immune Algorithm (ARIA) along with their variants that use labeled information. The first two already had supervised versions in the literature, and we extended ARIA toward a supervised version. Artificial and real-world datasets were considered in our experiments, and the results showed that supervised clustering is better suited to problems with unbalanced and overlapping classes, and also when the number of input features is high.

Keywords: Radial Basis Function, Clustering, Adaptive Radius Immune Algorithm, Supervised Learning for Data Classification.

1 Introduction
Radial Basis Function Neural Networks (RBFNNs) are universal approximators
and have been successfully applied to deal with a wide range of problems. The
architecture of an RBFNN is composed of an input layer, a hidden layer, and an
output layer. The number of neurons in the input layer is equal to the number
of attributes of the input vector. The hidden layer is composed of an arbitrary
number of RBFs (e.g. Gaussian RBFs), being each one defined by a center and
a dispersion parameter. The response of each neuron in the output layer is a
weighted sum over the values from the hidden layer neurons.
RBFNNs can be trained by either a full or a quick learning scheme. In the
former, nonlinear optimization algorithms (e.g. gradient-descent-based) are used
to determine the whole set of parameters of an RBFNN: (i) location of each
center, (ii) dispersion of each RBF, and (iii) weights of the output layer. In this

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 148–155, 2012.
© Springer-Verlag Berlin Heidelberg 2012
case, the number of RBFs is either defined a priori or estimated by a trial-and-error procedure. As for the quick learning scheme, the internal structure of an RBFNN (the number of RBFs, their centers and dispersions) is given a priori, and the weights of the output layer can be determined by the Least Squares method or by a regularized version such as LASSO.
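As a hedged illustration of the quick learning scheme described above, the following sketch fixes the hidden layer (centers and dispersions given) and solves for the output weights by least squares; the function names are ours, not from any cited implementation.

```python
import numpy as np

def rbf_design_matrix(X, centers, dispersions):
    # Gaussian RBF activations: phi[i, j] = exp(-||x_i - c_j||^2 / (2 rho_j^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * dispersions ** 2))

def train_output_weights(X, y, centers, dispersions):
    # quick learning scheme: hidden layer fixed, output weights by least squares
    phi = rbf_design_matrix(X, centers, dispersions)
    phi = np.hstack([phi, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(phi, y, rcond=None)
    return w

def rbfnn_predict(X, centers, dispersions, w):
    phi = rbf_design_matrix(X, centers, dispersions)
    phi = np.hstack([phi, np.ones((len(X), 1))])
    return phi @ w
```

With the internal structure given, training reduces to a single linear solve, which is what makes the quick scheme attractive.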
There is a plethora of ways to determine a priori the internal structure of
RBFNNs, which can be categorized as: clustering based [3,4,7], heuristic based
[1,6], and growing and pruning based [11]. Most of the proposed approaches to
determine the RBFNN internal structure are based on clustering methods [7].
Qian et al. [12] suggested that by guiding the clustering algorithm with la-
beled information, a more meaningful clustering result can be achieved. Hence,
when using the available labeled data to determine the RBFNN internal struc-
ture, one can expect that: (i) such knowledge will lead to clustering solutions
representing meaningful aspects of the dataset; and (ii) the performance of the
resulting classifier will reflect the trade-off achieved by the clustering algorithm
over different cluster validity metrics.
Even though different learning schemes for clustering have been used to de-
termine the internal structure of RBFNN [3,4], no study was conducted on the
impact of the usage of labeled information on the RBFNN accuracy. In light
of the lack of research on that subject, our study therefore spans through the
comparison of the performance of RBFNN using unsupervised and supervised
clustering algorithms for both synthetic and real-world classification problems.
Three clustering algorithms endowed with interesting properties for this prob-
lem were considered in this study. Two of them already had supervised versions in the literature, and we extended a third algorithm toward a supervised version. The clustering algorithms are described in Section 2. The remainder of the paper
is organized as follows: a detailed description of the experiments carried out in
this study is given in Section 3.1; in Section 3.2, we contrast the performance
of supervised and unsupervised clustering algorithms; the concluding remarks of
this study are outlined in Section 4.

2 Clustering Algorithms

The following clustering algorithms were considered, along with their correspond-
ing versions employing labeled data: (i) k-means, (ii) Neural-Gas (NG) and, (iii)
Adaptive Radius Immune Algorithm (ARIA). The first one was selected due to
its simplicity and wide usage. NG is also a fast clustering algorithm that can be seen as a generalization of k-means, where each data point is assigned to every prototype with different weights, based on their similarities. Both k-means and NG already have supervised versions available in the literature. ARIA is
a self-adaptive algorithm which automatically determines the number of proto-
types based on the local data density. Moreover, ARIA intrinsically defines the
coverage of each prototype, which can clearly be adopted as the RBF disper-
sion. In [14], ARIA achieved good results when applied to determine the internal
structure of RBFNNs for regression problems.
150 A.R. Gonçalves et al.

In the supervised version of k-means [3], the value of k was divided among the
classes proportionally to the number of training samples per class, and k-means
was applied to each class individually. This variant of k-means is named here as
k-meansL .
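A minimal sketch of this supervised variant, under the assumption that a plain Lloyd's k-means is acceptable as the inner routine; the proportional split may not sum exactly to k after rounding, and the helper names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    # plain Lloyd's k-means; returns the k centers
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def kmeans_labeled(X, y, k):
    # k-means_L: split k among classes proportionally to their sizes,
    # then run plain k-means inside each class and pool the centers
    classes, counts = np.unique(y, return_counts=True)
    ks = np.maximum(1, np.round(k * counts / counts.sum()).astype(int))
    return np.vstack([kmeans(X[y == c], kc) for c, kc in zip(classes, ks)])
```

Because each class is clustered separately, every resulting center sits inside a single-class region of the input space.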
In NG [10], the neighborhood ranking is based on Euclidean distance between
the prototypes and the training samples. In the supervised version of NG [9], the
Euclidean distance was replaced by the F-measure to define the neighborhood
ranking. This variant of NG is named here as NGF .
For k-means, k-meansL, NG and NGF, the number of centers (neurons) was estimated based on the Bayesian Information Criterion (BIC) [13]. Moreover, the dispersion ρ of each center is calculated as ρ = d_max / √(2k) [8], where d_max is the largest distance among the centers, and k is the number of centers.
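The d_max/√(2k) heuristic cited above [8] is a one-liner; a sketch (the helper name is illustrative):

```python
import numpy as np

def rbf_dispersion(centers):
    # rho = d_max / sqrt(2 k): d_max is the largest pairwise distance
    # among the k centers (spread heuristic from Haykin, 1999)
    k = len(centers)
    diffs = centers[:, None, :] - centers[None, :, :]
    d_max = np.sqrt((diffs ** 2).sum(axis=2)).max()
    return d_max / np.sqrt(2.0 * k)
```

For example, two 1-D centers at 0 and 3 give d_max = 3 and ρ = 3/2.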

2.1 ARIA Using Labeled Data

Adaptive Radius Immune Algorithm (ARIA) [2] is an immune-inspired clustering algorithm which uses mechanisms of affinity maturation, clonal expansion, radius adaptation, and network suppression to allocate a reduced number of prototypes on the most representative portions of the dataset. ARIA has already been used to determine the number of centers of an RBFNN [1].
In the extended version of ARIA proposed here, called ARIACS , the steps (i)
creation of the initial prototypes, (ii) affinity maturation (prototype updating),
and (iii) network suppression (prototype removal) were revised and the modi-
fications are described as follows. In ARIACS , the number of prototypes in the
initial population corresponds to the number of classes of the problem (number of
distinct labels). Each prototype in the initial population is located at the “center
of mass” of all objects belonging to the class it represents. We are assuming here
that each object in the training dataset is characterized by the same set of nu-
merical attributes, thus corresponding to a point in the space of attributes. The
initial radii of the prototypes are still chosen randomly. Therefore, in ARIACS,
each prototype represents one class, though we may have multiple prototypes
per class. For example, at each iteration, we say that prototype P represents
class C1 if the majority of samples to which it has been assigned as the best
matching unit (BMU) belongs to class C1 .
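The BMU majority-vote labeling described above can be sketched as follows; this is a simplified, hypothetical rendering, not the actual ARIA implementation.

```python
import numpy as np

def label_prototypes(X, y, prototypes):
    # each prototype gets the majority class among the samples for which
    # it is the best matching unit (nearest prototype); a prototype that
    # matches no sample is marked with -1 here (an arbitrary convention)
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)  # index of the nearest prototype per sample
    labels = np.full(len(prototypes), -1)
    for j in range(len(prototypes)):
        owned = y[bmu == j]
        if len(owned) > 0:
            labels[j] = np.bincount(owned).argmax()
    return labels
```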
After one label is assigned to each prototype, as described previously, the new
prototype updating procedure takes place. Through this new mechanism, the
mutation operator is performed only if a prototype was determined as BMU of
one sample of the same class. For instance, if a prototype of class C1 is the BMU
of one sample of class C2 , nothing is done.
The last proposed modification to ARIACS is that the prototype removal is
only performed among candidates that represent the same class. It means that
there is no suppression between prototypes associated with distinct classes even
if the conditions for prototype removal are met.
Several experiments with ARIA and ARIACS were carried out aiming to estimate the dispersions of the RBFs. It was observed that the best results for the RBFNN classifiers regarding accuracy were achieved when the dispersions were set to three times the values of the adaptive radii.

3 Experiments
In this section, we carried out an experimental analysis to compare, by means of
the accuracy of RBFNN classifiers, the supervised and unsupervised clustering
algorithms in determining the RBF centers.

3.1 Experimental Setup


To perform the experiments, we considered nine artificial and nine real-world
datasets, submitting the classifiers to a wide range of scenarios. The artificial
datasets have 1,000 samples and 3 attributes generated from two Gaussian dis-
tributions, establishing two distinct classes. The first Gaussian distribution is centered at μi = 2 and the second one is centered at μi = 5, with i = 1, 2, 3. Both distributions have the same covariance matrix. We varied the standard deviation (σ) and the class balance in order to impose a distinct degree of difficulty on the corresponding classification problem. Table 1 presents additional information regarding the artificial datasets. The real-world datasets were collected from the UCI
repository [5], and their information is shown in Table 2. In these tables, H is the
class distribution entropy and F SM stands for Fisher Separability Measure [15]
of the dataset. All datasets were normalized to avoid problems with attributes
in different scales.

Table 1. Artificial datasets description

Dataset  σ  H     FSM
Artf1    1  0.69  6.83
Artf2    3  0.69  2.30
Artf3    5  0.69  1.44
Artf4    1  0.61  5.45
Artf5    3  0.61  1.94
Artf6    5  0.61  1.09
Artf7    1  0.32  2.47
Artf8    3  0.32  0.88
Artf9    5  0.32  0.56

Table 2. Real-world datasets description

Dataset      # feat.  # classes  H     FSM
Wpbc         34       2          0.54  0.59
Bupa         7        2          0.68  0.066
Ionosphere   33       2          0.65  1.83
Pima         8        2          0.64  0.49
Sonar        60       2          0.69  2.37
Transfusion  5        2          0.54  0.18
Wine         13       3          1.08  10.14
Iris         4        3          1.09  10.29
Glass        9        6          1.50  3.20

For the RBFNN accuracy evaluation, we used a ten-fold cross-validation


method. Data was divided into ten folds and the training was repeated ten
times. Each time, we applied nine folds for RBFNN training and the remaining
fold for validation. The final accuracy was obtained by the average results over
the ten validation folds.
ARIA and ARIACS parameter values were the same. For a complete description of the whole set of ARIA parameters, refer to [2]. For the real-world datasets, the following minimum radius values were used: Wpbc=4, Bupa=1.5, Ionosphere=5, Pima=1.6, Sonar=6, Transfusion=0.5, Wine=2.5, Iris=0.4 and Glass=0.7. For the artificial ones, we used 0.8 for Artf1 to Artf6 and 0.6 for Artf7 to Artf9. These values were obtained through a grid search procedure. The other parameter values were: mutation rate μ = 1, decay rate γ = 0.9 and neighborhood size Ns = 3. In NG and NGF, the initial step size ε was set to 0.5 and the initial neighborhood range, λ, was defined as n/2, where n is the number of neurons. For all algorithms, the number of iterations was fixed at 60.

3.2 Clustering with and without Labeled Data


Each clustering algorithm described previously was used to determine the RBFs
centers, and we analyzed the accuracy of the RBFNNs over the validation set.
Due to the intrinsic variability of the results, each classifier was trained and tested 30 times. Tables 3 and 4 show the results obtained by the application of a one-tailed t-test at the 0.05 level of significance. The t-test result regarding Alg. 1 – Alg. 2 is shown as “+”, “–”, or “∼” when Alg. 1 achieved significantly higher, significantly lower, or equivalent average accuracy levels, respectively, when compared to Alg. 2. In these tables, p-values lower than 1e-5 were considered zero.
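As a hedged aside, the same one-sided hypothesis ("algorithm A beats algorithm B on average") can also be checked without distributional assumptions via a permutation test on the mean accuracy difference; this is an alternative to, not a reproduction of, the t-test used here (stdlib only, illustrative names).

```python
import random
from statistics import mean

def one_sided_perm_test(a, b, n_perm=2000, seed=0):
    # p-value for H1: mean(a) > mean(b), estimated by randomly relabeling
    # the pooled accuracy samples and counting how often the permuted
    # mean difference reaches the observed one
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:len(a)]) - mean(pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0
```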

Table 3. Statistical comparison of algorithms in the artificial datasets

Algorithms          Artf1  Artf2  Artf3  Artf4  Artf5  Artf6  Artf7  Artf8  Artf9
ARIACS – ARIA       (∼)    (+)    (+)    (∼)    (+)    (+)    (∼)    (+)    (+)
                    0.72   0      1e-3   0.88   0      0      0.92   0      0
NGF – NG            (–)    (–)    (–)    (–)    (–)    (–)    (–)    (+)    (+)
                    2e-4   0      0      0      0      0      0      0      0
k-meansL – k-means  (∼)    (–)    (–)    (+)    (+)    (+)    (+)    (+)    (+)
                    0.87   0.02   0      8e-3   0      0      0      0      0

Regarding the results presented in Table 3, the use of labeled information in clustering algorithms to define the centers of an RBFNN increased the classifier
accuracy in problems with unbalanced and overlapping classes. k-means took
advantage of the usage of labeled information only in unbalanced datasets (Artf4-
Artf9), while NGF reached better performance than NG only in the two problems
with the highest level of unbalance and overlapping classes (Artf8 and Artf9).
Another aspect to point out was the performance of RBFNNs using ARIACS
when compared with those using ARIA. Except for well-separated classes (Artf1,
Artf4 and Artf7), where the classification problem is less challenging, including
labeled information significantly increased the RBFNN accuracy.
In Table 4, as already observed in the results obtained with artificial datasets,
ARIACS led to higher accuracy levels than, or at least equivalent to, those
achieved by ARIA. For NGF , better results were only found in overlapping
classes and high dimensional datasets, Bupa and Sonar, respectively. For k-
means, we found a worse outcome only for the dataset Ionosphere.
Table 4. Statistical comparison of algorithms in the real-world datasets

Algorithms          Wpbc  Bupa  Ionosphere  Pima  Sonar  Transfusion  Wine  Iris  Glass
ARIACS – ARIA       (∼)   (+)   (+)         (+)   (+)    (∼)          (∼)   (+)   (+)
                    0.91  2e-4  0           0     0      0.42         0.42  0     9e-4
NGF – NG            (–)   (+)   (–)         (–)   (+)    (–)          (–)   (–)   (–)
                    4e-3  0     0           0     0      0            0     0     0
k-meansL – k-means  (∼)   (+)   (–)         (+)   (+)    (∼)          (∼)   (+)   (+)
                    0.93  0     0.02        0.03  2e-4   0.58         0.94  0     0

Table 5 shows the average and standard deviation of the classifiers’ accuracy
using the considered clustering algorithms for the real-world datasets. A pairwise
comparison was done to assess the effective impact of using labeled information
to define the centers of RBFs. Significantly better results are highlighted.

Table 5. Classifiers’ accuracy for the real-world datasets

Dataset      ARIA        ARIACS      NG          NGF         k-means     k-meansL
Wpbc         76.2(±1.2)  76.6(±1.4)  76.3(±1.0)  75.7(±0.9)  76.3(±0.9)  76.7(±1.1)
Bupa         57.5(±0.7)  58.8(±1.7)  55.7(±1.4)  58.3(±0.6)  56.3(±1.6)  61.5(±1.8)
Ionosphere   70.3(±2.0)  86.7(±0.9)  81.3(±1.2)  62.5(±0.7)  80.4(±1.7)  79.5(±1.3)
Pima         71.7(±0.9)  75.7(±0.6)  66.1(±0.6)  64.8(±0.3)  66.5(±0.6)  66.8(±0.8)
Sonar        54.5(±2.6)  69.7(±2.7)  48.1(±3.7)  52.0(±2.7)  50.5(±3.7)  54.0(±3.5)
Transfusion  76.6(±0.4)  76.6(±0.8)  79.0(±0.5)  75.0(±0.5)  78.9(±0.6)  78.9(±0.6)
Wine         96.7(±0.9)  96.7(±0.6)  75.4(±2.0)  43.5(±1.5)  75.1(±2.5)  76.0(±1.8)
Iris         87.9(±2.2)  93.6(±1.6)  80.9(±1.4)  58.5(±5.2)  75.5(±2.9)  83.2(±0.9)
Glass        66.7(±2.0)  68.3(±1.7)  56.9(±2.8)  43.2(±2.4)  58.3(±2.8)  63.8(±2.3)

In most cases, NGF performed worse than the NG algorithm, indicating that
the F-measure maximization does not improve RBFNN accuracy. Unlike Eu-
clidean distance, F-measure does not necessarily preserve the topological order
of the clusters. Considering multiple prototypes per class, the assignment of a
data point to a distant or a near cluster (representing the same class) may have
the same F-measure, causing a misleading update of the prototypes.
It is possible to infer that the incorporation of labeled information into clustering algorithms may not always lead to an improvement in RBFNN accuracy. Depending on the problem complexity, the two previously proposed supervised clustering algorithms, k-meansL and NGF, can worsen the RBFNN performance. On the other hand, ARIACS achieved greater or equal performance when compared to the original ARIA for all considered problems.

4 Concluding Remarks and Future Work

RBFNN classifiers were implemented by means of three distinct clustering procedures to specify the number of RBFs, their location and dispersion. Labeled information was incorporated into the k-means, NG and ARIA algorithms in
relevant stages of their corresponding clustering procedures. We compared the
performance of RBFNNs using unsupervised and supervised clustering in nine
artificial and nine real-world classification problems.
Regarding the observed results, we can say that the improvement provided by the use of labeled information depends on how this information is used; misusing it can worsen the results. k-meansL and ARIACS led, in general, to significant improvements in the accuracy of the resulting RBFNN classifiers (relative to their respective unsupervised versions), but NGF achieved worse results than NG in most cases. F-measure belongs to a class of clustering metrics that do not preserve the topological order of the clusters. Possibly, this class is not suitable for optimization in a clustering process applied to RBFNN internal structure learning.
Summing up, supervised clustering seems to be better in the most challenging classification problems, more specifically, the ones characterized by unbalanced and overlapping classes, and also when the number of input features is high. Future work should investigate which classes of clustering metrics can indicate when RBFNN classifiers tend to achieve higher accuracy in classification and regression problems.

Acknowledgments. The authors would like to thank CNPq and CAPES for
the financial support.

References
1. Barra, T., Bezerra, G., de Castro, L., Von Zuben, F.: An Immunological Density-
Preserving Approach to the Synthesis of RBF Neural Networks for Classification.
In: IEEE International Joint Conference on Neural Networks, pp. 929–935 (2006)
2. Bezerra, G., Barra, T., De Castro, L., Von Zuben, F.: Adaptive Radius Immune
Algorithm for Data Clustering. In: Jacob, C., Pilat, M.L., Bentley, P.J., Timmis,
J.I. (eds.) ICARIS 2005. LNCS, vol. 3627, pp. 290–303. Springer, Heidelberg (2005)
3. Bruzzone, L., Prieto, D.: A technique for the selection of kernel-function param-
eters in RBF neural networks for classification of remote-sensing images. IEEE
Transactions on Geoscience and Remote Sensing 37(2), 1179–1184 (1999)
4. Cevikalp, H., Larlus, D., Jurie, F.: A supervised clustering algorithm for the ini-
tialization of RBF neural network classifiers. In: 15th IEEE Signal Processing and
Communications Applications, pp. 1–4 (2007)
5. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
6. Gan, M., Peng, H., Dong, X.: A hybrid algorithm to optimize RBF network archi-
tecture and parameters for nonlinear time series modeling. Applied Mathematical
Modelling (2011)
7. Guillén, A., Pomares, H., Rojas, I., González, J., Herrera, L., Rojas, F., Valenzuela,
O.: Studying possibility in a clustering algorithm for RBFNN design for function
approximation. Neural Computing and Applications 17(1), 75–89 (2008)
8. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall (1999)
9. Lamirel, J., Mall, R., Cuxac, P., Safi, G.: Variations to incremental growing neural
gas algorithm based on label maximization. In: IEEE International Joint Confer-
ence on Neural Networks (IJCNN), pp. 956–965 (2011)
10. Martinetz, T.M., Berkovich, S.G., Schulten, K.J.: “Neural-Gas” network for vector
quantization and its application to time-series prediction. IEEE Transactions on
Neural Networks 4(4), 558–569 (1993)
11. Okamoto, K., Ozawa, S., Abe, S.: A Fast Incremental Learning Algorithm of RBF
Networks with Long-Term Memory. In: Proceedings of the International Joint Con-
ference on Neural Networks, pp. 102–107 (2003)
12. Qian, Q., Chen, S., Cai, W.: Simultaneous clustering and classification over cluster
structure representation. Pattern Recognition 45(6), 2227–2236 (2012)
13. Spiegelhalter, D., Best, N., Carlin, B., Van Der Linde, A.: Bayesian measures of
model complexity and fit. Journal of the Royal Statistical Society. Series B: Sta-
tistical Methodology 64(4), 583–616 (2002)
14. Veroneze, R., Gonçalves, A.R., Von Zuben, F.J.: A Multiobjective Analysis of
Adaptive Clustering Algorithms for the Definition of RBF Neural Network Centers
in Regression Problems. In: Yin, H., Costa, J.A.F., Barreto, G. (eds.) IDEAL 2012.
LNCS, vol. 7435, pp. 127–134. Springer, Heidelberg (2012)
15. Wang, X., Syrmos, V.: Optimal cluster selection based on Fisher class separability
measure. In: Proceedings of American Control Conference, pp. 1929–1934 (2005)
Nested Sequential Minimal Optimization
for Support Vector Machines

Alessandro Ghio, Davide Anguita, Luca Oneto,
Sandro Ridella, and Carlotta Schatten

DITEN – University of Genova, Via Opera Pia 11A, Genova, I-16145, Italy
{Alessandro.Ghio,Davide.Anguita,Luca.Oneto,Sandro.Ridella}@unige.it,
Carlotta.Schatten@smartlab.ws

Abstract. We propose in this work a nested version of the well-known Sequential Minimal Optimization (SMO) algorithm, able to contemplate working sets of larger cardinality for solving Support Vector Machine (SVM) learning problems. Contrary to several other proposals in the literature, neither new procedures nor numerical QP optimizations must be implemented, since our proposal exploits the conventional SMO method in its core. Preliminary tests on benchmarking datasets demonstrate the effectiveness of the presented method.

Keywords: Support Vector Machine, Convex Constrained Quadratic Programming, Sequential Minimal Optimization.

1 Introduction
The Support Vector Machine (SVM) [15] is one of the state-of-the-art techniques for classification problems. The learning phase of an SVM consists of solving a Convex Constrained Quadratic Programming (CCQP) problem to identify a set of parameters; however, this training step does not conclude the SVM learning, as a set of hyperparameters must be tuned to reach the optimal performance during the SVM model selection. This last tuning is not trivial: the most used, effective and reliable approach in practice is to perform an exhaustive grid search [5], where the CCQP problem is solved several times with different hyperparameter settings.
As a consequence, identifying an efficient QP solver is of crucial importance for speeding up SVM learning, and several approaches have been proposed in the literature [14]. Two main categories of solvers exist: problem-oriented and
general purpose methods. Problem-oriented techniques make the most of the
characteristics of the problem or of the model to train: e.g., when classification
with a linear SVM is targeted, the LibLINEAR algorithm [3] is a very efficient
solver, which however cannot be used when Radial Basis Function kernels (such
as the Gaussian one) are exploited. In the framework of general purpose meth-
ods, one of the most well-known tools for solving the SVM CCQP problem is
the Sequential Minimal Optimization (SMO) algorithm [12,7]. SMO takes inspi-
ration from the decomposition method of Osuna et al. [11], which suggests to

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 156–163, 2012.
© Springer-Verlag Berlin Heidelberg 2012
solve SVM training problems by dividing the available dataset into an inactive and an active part (namely the working set). In particular, SMO pushes the decomposition idea to the extreme, as it optimizes the smallest possible working set, consisting of only two parameters selected according to proper heuristics [7]. The main advantage of SMO with respect to other general purpose methods (e.g. SVMlight [6]) lies in the fact that solving such a simple problem can be done analytically: thus, numerical QP optimization, which can be costly or slow, can be completely avoided [13]. Moreover, SMO is easy to implement and is included in the well-known LibSVM package [2], which further spread the use of this solver. However, an overall speed-up of the algorithm can still be expected [13]: in particular, performance can improve by exploiting working sets of larger cardinality [9], as the number of accesses to memory, which represents one of the main computational burdens of SMO, can be reduced. Thus, the analytical solution for a modified SMO algorithm, able to optimize three parameters at each iteration, has been proposed in [9]: though efficient, this modified version of SMO (called 3PSMO) requires that a new optimization algorithm be implemented. Moreover, its scalability to larger working sets is not straightforward.
In this paper, we propose an innovative Nested SMO (N–SMO) algorithm: we first pick a subset of data by selecting the samples according to the heuristics proposed in [7]; then, we apply the conventional SMO algorithm to optimize the parameters on the selected subset, so that no ad hoc optimization procedures must be implemented. In addition to being easily scalable and allowing the use of widespread software libraries, our proposal outperforms the state-of-the-art SMO implementation included in LibSVM, as shown by the tests in Section 4.

2 Support Vector Machines for Classification

Let us consider a non-trivial dataset Dn, consisting of n patterns Dn = {(x_i, y_i)}, i ∈ {1, . . . , n}, where x_i ∈ R^d and y_i ∈ {±1}. The SVM classifier is trained by solving the following CCQP problem [15]:
min_α g(α) ≜ (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j q_ij − Σ_{i=1}^n α_i    (1)

s.t. 0 ≤ α_i ≤ C,  Σ_{i=1}^n y_i α_i = 0,

where Q = {q_ij} = {y_i y_j K(x_i, x_j)}, K(x_i, x_j) is the kernel function and C is the hyperparameter that must be tuned during the model selection phase. The SVM classifier is f(x) = Σ_{i=1}^n y_i α_i K(x_i, x) + b, where b ∈ R is the bias. The patterns for which α_i > 0 are called Support Vectors (SVs), while the subset of patterns for which 0 < α_i < C are called True SVs (TSVs).
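A minimal sketch of the quantities just defined, assuming a Gaussian kernel (the helper names are ours, not from LibSVM):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def build_Q(X, y, gamma=1.0):
    # Q_ij = y_i y_j K(x_i, x_j), the quadratic-form matrix of Problem (1)
    return np.outer(y, y) * gaussian_kernel(X, X, gamma)

def decision_function(x, X, y, alpha, b, gamma=1.0):
    # f(x) = sum_i y_i alpha_i K(x_i, x) + b
    k = gaussian_kernel(X, x[None, :], gamma)[:, 0]
    return float((y * alpha) @ k + b)
```

Note that Q is symmetric and, for a Gaussian kernel, its diagonal entries equal y_i^2 K(x_i, x_i) = 1.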
Osuna et al. [11] suggested to solve Problem (1) by selecting working sets
of smaller cardinality, which can be efficiently managed by the optimization
procedure. Let us define the following sets:
S = {α_1, . . . , α_n},  S^opt ⊆ S,  I = {i | α_i ∈ S},  I^opt = {i | α_i ∈ S^opt},    (2)

where |S| = n, |S^opt| ≤ n and |·| is the cardinality of the set. The algorithm proposed in [11] randomly takes a subset S^opt of the α_i ∈ S and optimizes the subproblem defined by these variables. The procedure is then repeated until all the Karush–Kuhn–Tucker (KKT) conditions of Eq. (1) are satisfied [11]. Platt [12], in particular, proposed to select working sets such that |S^opt| = 2, i.e. characterized by the minimum cardinality. As the selection of the two
parameters to optimize deeply affects the performance of the algorithm and
its rate of convergence, ad hoc strategies have been presented in [7]. In that
work, the authors propose to include in the working set the Most Violating Pair
(MVP), i.e. the two parameters corresponding to the samples which violate the
KKT conditions the most. A further improvement has been recently presented
in [4], which takes into account second order information regarding Problem (1)
and which is currently exploited by the last versions of the LibSVM package [2].

3 The Nested SMO Method

The idea on which the Nested SMO (N–SMO) algorithm builds is simple: given Problem (1), a working set S^opt of m patterns (m ≤ n) is chosen according to the strategy proposed in [7], i.e. the m Most Violating (MV) samples are included in S^opt. The conventional SMO algorithm [2] can then be exploited to solve the problem, formulated using the samples of the working set, without any modification: Algorithm 1 details the proposed method. Let ∇g = ∇_α g(α), where ∇g_i = ∂g(α)/∂α_i, and let S = {α_i^(t)} be the set of parameters at the t-th step, where α^(t) is a feasible point. The two following sets can be defined:

I_up = {k | (α_k^(t) < C ∧ y_k = +1) ∨ (α_k^(t) > 0 ∧ y_k = −1)}    (3)
I_low = {k | (α_k^(t) < C ∧ y_k = −1) ∨ (α_k^(t) > 0 ∧ y_k = +1)}.    (4)

Then, by extending the criterion introduced in [7], the m MV samples (if any) can be chosen as follows:

I_up^{m-MV} = argmax_{k_1,...,k_{m/2}} { −y_k ∇g_k(α^(t)) | k ∈ I_up }    (5)
I_low^{m-MV} = argmax_{k_1,...,k_{m/2}} { y_k ∇g_k(α^(t)) | k ∈ I_low },    (6)

where argmax_{k_1,...,k_m} selects the m largest elements of a vector. The indexes of the patterns of the working set are I^opt = I_up^{m-MV} ∪ I_low^{m-MV} and the working set is defined as S^opt = {α_i^(t) | i ∈ I^opt}. We can then optimize Problem (1) only with respect to the parameters included in S^opt by exploiting the conventional SMO algorithm¹:

¹ The proof is omitted here due to space constraints.
min_{α_i ∈ S^opt} (1/2) Σ_{i∈I^opt} Σ_{j∈I^opt} α_i α_j q_ij + Σ_{i∈I^opt} [ ∇g_i|_{α^(t)} − Σ_{j∈I^opt} α_j^(t) q_ij ] α_i    (7)

s.t. 0 ≤ α_i ≤ C, i ∈ I^opt,  Σ_{i∈I^opt} y_i α_i = −Σ_{i∈I\I^opt} y_i α_i^(t).

It is also worth noting that the two sets I_up^{m-MV} and I_low^{m-MV} can be exploited to derive the MVP at the t-th step, u^(t) ∈ I_up^{m-MV} and v^(t) ∈ I_low^{m-MV}, which can be used for defining the stopping criterion, analogously to the conventional SMO algorithm [7].
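The working-set construction of Eqs. (3)–(6) can be sketched as follows; this is an illustrative rendering, not the authors' code.

```python
import numpy as np

def select_working_set(alpha, y, grad, C, m):
    # Eqs. (3)-(4): index sets from the box constraints
    i_up = np.where(((alpha < C) & (y == +1)) | ((alpha > 0) & (y == -1)))[0]
    i_low = np.where(((alpha < C) & (y == -1)) | ((alpha > 0) & (y == +1)))[0]
    half = m // 2
    # Eqs. (5)-(6): m/2 indices maximizing -y_k grad_k in I_up,
    # and m/2 indices maximizing y_k grad_k in I_low
    up_order = np.argsort(-y[i_up] * grad[i_up])[::-1]
    low_order = np.argsort(y[i_low] * grad[i_low])[::-1]
    return np.union1d(i_up[up_order[:half]], i_low[low_order[:half]])
```

With m = 2 this reduces to the classical Most Violating Pair selection of the conventional SMO.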
Note that, every time a parameter α_i ∈ S^opt is updated, the gradient ∇g_i must be re-computed as well (the updating rule is analogous to that of the conventional SMO): for this purpose, a whole row (or column) of the matrix Q must be read, and this represents one of the main computational burdens of the SMO procedure. By considering larger subsets (as also highlighted in [9]), the number of accesses to memory is reduced. A further improvement can be obtained by accessing rows (or columns) which are “near” (in terms of index) to one another, since the caching properties of the computing software architecture can be better exploited: as the cardinality of the set of selected parameters S^opt can be properly tuned, the probability of contemplating rows “close” to one another increases, and so does the overall performance of the method.

4 Experimental Results
This section is devoted to the comparison of the performance of N–SMO against
the state-of-the-art and widely used implementation of the conventional SMO
algorithm included in LibSVM [2] (and exploited by our method as well, see line
13 in Algorithm 1). The datasets used for the comparison are presented in Table
1, where nl is the number of patterns used for the learning phase while nt is the
number of samples reserved for testing purposes. As MNIST and NotMNIST are multi-class datasets and we target two-class problems, we adopted an All-vs-All approach and considered a subset of the resulting binary problems. The Test
grid search is used to explore several hyperparameter values. All the tests have
been performed on an Intel Core i5 processor (2.67 GHz) with 4 GB RAM. The
experiments presented in this section have been replicated 30 times with the
same setup in order to build statistically relevant results.
As a first issue, in Fig. 1 we compare the performance of the two solvers by
considering the MNIST 1 vs 7 problem, where a Gaussian kernel is used. In
particular, the figure on the left compares the time, needed by the algorithms to
compute the solution, when m (i.e. the dimension of the working set) and C are varied (the width of the Gaussian kernel is fixed to the optimal value identified during the model selection phase): when C assumes either small (< 10^-2) or large (> 10^2) values, N–SMO outperforms the LibSVM SMO (for which

Algorithm 1. Nested SMO (N–SMO) algorithm.

Require: Q, C, the maximum number of iterations n_max and a tolerance ε
 1: Initialize t = 0, α^(0) = 0 (α^(0) = 0 is always a feasible starting point)
 2: Compute ∇g|_α^(0)
 3: loop
 4:   (I_opt, u^(t), v^(t)) = search the MV samples in S
 5:   Sort I_opt
 6:   if ∇g_v^(t) − ∇g_u^(t) ≤ ε then
 7:     break
 8:   else
 9:     if t ≥ n_max then
10:       return "Too many iterations"
11:     end if
12:   end if
13:   α^(t+1) = solve Problem (7) using the conventional SMO [2]
14:   ∇g_k|_α^(t+1) = ∇g_k|_α^(t) + Σ_{i∈I_opt} (α_i^(t+1) − α_i^(t)) q_ik, ∀k ∈ {1, …, n}
15:   t = t + 1
16: end loop
17: for i = 1 → n do
18:   if 0 < α_i^(t) < C then
19:     b = −y_i ∇g_i
20:     return α^(t), b
21:   end if
22: end for
23: b = −(∇g_v^(t) + ∇g_u^(t)) / 2
24: return α^(t), b
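The only step that touches the kernel matrix is the incremental gradient update at line 14, which reads just the rows of Q indexed by the working set. A minimal numpy sketch of that update (function and variable names are ours, not from the paper):

```python
import numpy as np

def update_gradient(grad, alpha_new, alpha_old, Q, I_opt):
    """Line 14 of Algorithm 1 (sketch):
    grad_k <- grad_k + sum_{i in I_opt} (alpha_i^(t+1) - alpha_i^(t)) * q_ik.
    Only the rows of Q indexed by the working set I_opt are read, which is
    what keeps the memory access pattern confined and cache-friendly."""
    delta = alpha_new[I_opt] - alpha_old[I_opt]  # changes on the working set
    return grad + delta @ Q[I_opt, :]
```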

Table 1. Description of the datasets used for the experiments

Dataset Reference nl nt d
MNIST 1 vs 7 [8] 11000 2448 784
MNIST 0 vs 1 [8] 10000 3074 784
MNIST 3 vs 8 [8] 10000 2381 784
NotMNIST A vs B [1] 10000 2000 784
NotMNIST C vs D [1] 10000 2000 784
NotMNIST I vs J [1] 10000 2000 784
Daimler [10] 8000 1800 648
Webspamunigram [16] 10000 2000 254

m = 2). Similar results can be obtained on the other datasets in Table 1, but
are not reported here due to space constraints: in particular, after the extensive
numerical simulations we performed on the datasets of Table 1, m ≈ 400 seems
to represent the optimal trade-off for datasets characterized by a cardinality of
approximately 10000 samples in the range of interest for the hyperparameters.
The right plot in Fig. 1, instead, compares the number of accesses to memory
nacc : it is worth noting that, when C is large, N–SMO remarkably outperforms
SMO; on the contrary, when C is small, the number of accesses to memory is
similar, which is surprisingly in contrast with the results on the training
time. Thus, we analyzed the results in more depth, in order to better
understand the reasons for this unexpected behavior.
In Table 2, we present a more detailed list of results for the comparison of
LibSVM SMO against N–SMO, where different kernels (the linear, the Gaussian
Nested Sequential Minimal Optimization for Support Vector Machines 161

[Fig. 1 appears here as two plots: one panel reports the training time (sec) versus log10 C for γ = 1.0e−3, the other the number of accesses nacc to the matrix K(x_i, x_j) versus log10 C; both panels show curves for m = 2, 20, 220, 420, 620, 820 and 1020.]

Fig. 1. Comparison between SMO and N–SMO on the MNIST 1 vs 7 dataset

Table 2. Comparison of performance indexes (computational time, number of mis-


classifications, number of TSVs, accesses to memory and average distance between the
indexes of the rows/columns of the matrix Q) for MNIST 1 vs 7

SMO N–SMO, m = 400


C γ or p time nerr TSV nacc dacc time nerr TSV nacc dacc
Linear kernel
1.0e-7 – 1.1e+0 ± 3.0e-3 162 0 11000 5.5e+3 8.1e-1 ± 5.5e-3 162 0 11160 1.3e+1
1.0e+5 – 2.1e+0 ± 1.4e-2 5 165 30534 3.7e+3 6.0e-1 ± 9.7e-4 5 166 2197 1.3e+1
1.0e-1 – 1.0e+0 ± 1.2e-3 3 141 12984 3.5e+3 3.7e-1 ± 1.9e-3 3 139 2237 1.2e+1
Polynomial kernel
1.0e-7 1 1.2e+0 ± 5.4e-3 164 0 11000 5.5e+3 8.9e-1 ± 3.4e-3 164 0 11160 1.3e+1
1.0e-7 19 9.2e-1 ± 7.0e-3 55 0 11000 5.5e+3 6.5e-1 ± 5.2e-3 55 0 11160 1.3e+1
1.0e+5 1 1.6e+0 ± 3.0e-3 5 163 29278 3.7e+3 4.6e-1 ± 2.3e-3 5 164 2198 1.3e+1
1.0e+5 19 3.9e-1 ± 1.7e-3 4 463 4318 3.6e+3 3.4e-1 ± 2.2e-3 4 463 3504 1.3e+1
1.0e+2 4 6.0e-1 ± 4.3e-3 2 214 9276 3.6e+3 3.2e-1 ± 2.4e-3 2 213 2217 1.3e+1
Gaussian kernel
1.0e-7 1.0e-7 1.1e+0 ± 8.0e-3 162 0 11000 5.5e+3 8.5e-1 ± 2.7e-4 162 0 11160 1.3e+1
1.0e-7 1.0e+5 5.4e-1 ± 1.5e-3 2446 0 11000 5.5e+3 3.9e-1 ± 1.8e-4 2446 0 11160 1.0e+0
1.0e+5 1.0e-7 4.0e-1 ± 3.9e-4 6 113 5604 3.7e+3 2.8e-1 ± 2.3e-3 6 113 2278 1.2e+1
1.0e+5 1.0e+5 5.7e-1 ± 3.9e-3 2446 11000 11000 5.5e+3 3.9e-1 ± 1.2e-3 2446 11000 11080 1.0e+0
1.0e+3 1.0e-3 5.9e-1 ± 5.6e-3 2 213 9802 3.7e+3 3.1e-1 ± 1.1e-4 2 213 2197 1.3e+1

and the polynomial ones) are exploited for the SVM. In particular, we report
the results obtained for “extreme” values of the hyperparameters (C, the degree
of the polynomial p and the width of the Gaussian kernel γ) and for the opti-
mal values, identified during the model selection; the dimension of the working
set is fixed to m = 400. In addition to nacc and the time needed by the solver
to compute the solution, we also present the number of misclassifications nerr
made by the learned model on the test set, and the average distance dacc
between the indexes of the rows (or columns) of the matrix Q read in memory
for updating the gradient value. It can be noted that, when C is small, dacc for
N–SMO is always noticeably smaller than the value obtained for the LibSVM
SMO, while nacc is similar for the two methods: this confirms that the caching
strategy of the computing system has a remarkable influence on the overall
performance of the algorithms, and that choosing larger working sets can help
decrease the computational time.
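As an illustration, given the sequence of row indexes of Q read during training, an average access distance can be computed as below. This is a hypothetical reconstruction: the paper does not give the exact formula for dacc, so we assume it is the mean absolute distance between consecutively accessed indexes.

```python
import numpy as np

def avg_access_distance(accessed_rows):
    """Average absolute distance between consecutively accessed row indexes
    of Q; smaller values suggest better cache locality.
    (Assumed definition: the paper does not spell out the formula.)"""
    idx = np.asarray(accessed_rows)
    if idx.size < 2:
        return 0.0
    return float(np.mean(np.abs(np.diff(idx))))
```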
Finally, the results obtained for all the datasets are presented in Table 3,
where only the values for the Gaussian kernel are reported due to space
constraints: conclusions analogous to those drawn from Table 2 apply.

Table 3. Comparison of performance indexes (computational time, number of mis-


classifications, number of TSVs, accesses to memory and average distance between the
indexes of the rows/columns of the matrix Q) for the datasets of Table 1

SMO N–SMO, m = 400


C γ time nerr TSV nacc dacc time nerr TSV nacc dacc
Daimler
1.0e-4 1.0e-3 6.6e-1 ± 5.4e-3 528 0 8000 4.0e+3 3.9e-1 ± 3.5e-3 528 0 8000 2.0e+1
1.0e-4 1.0e+3 5.0e-1 ± 6.3e-4 1777 0 8000 4.0e+3 3.6e-1 ± 3.3e-3 1777 0 8000 1.0e+0
1.0e+2 1.0e-3 5.7e+0 ± 3.6e-2 53 1251 88620 2.8e+3 1.5e+0 ± 1.4e-1 53 1251 36723 2.1e+1
1.0e+2 1.0e+3 3.8e-1 ± 1.1e-3 996 8000 8000 4.0e+3 2.7e-1 ± 1.5e-3 996 8000 8000 1.4e+0
1.0e+2 1.0e-2 1.3e+0 ± 1.3e-2 25 2579 18212 2.8e+3 1.0e+0 ± 1.5e-1 25 2584 12120 2.0e+1
Mnist01
1.0e-4 1.0e-3 9.6e-1 ± 1.5e-3 62 0 10000 5.0e+3 6.1e-1 ± 6.0e-3 62 0 10240 2.5e+1
1.0e-4 1.0e+3 7.4e-1 ± 7.1e-3 3074 0 10000 5.0e+3 5.6e-1 ± 2.7e-3 3074 0 10240 1.0e+0
1.0e+2 1.0e-3 1.8e-1 ± 1.5e-3 4 103 802 3.5e+3 1.7e-1 ± 2.5e-4 4 103 789 1.9e+1
1.0e+2 1.0e+3 4.7e-1 ± 2.0e-3 3074 10000 10000 5.0e+3 2.9e-1 ± 2.7e-3 3074 10000 10120 1.0e+0
1.0e+2 1.0e-3 1.8e-1 ± 1.4e-3 4 103 802 3.5e+3 1.7e-1 ± 1.7e-3 4 103 789 1.9e+1
Mnist38
1.0e-4 1.0e-3 9.9e-1 ± 6.5e-3 248 0 10000 5.0e+3 6.1e-1 ± 2.2e-4 248 0 10240 2.5e+1
1.0e-4 1.0e+3 7.5e-1 ± 6.4e-3 2381 0 10000 5.0e+3 5.5e-1 ± 5.2e-3 2381 0 10240 1.0e+0
1.0e+2 1.0e-3 1.4e+0 ± 9.2e-3 13 911 17720 3.7e+3 1.0e+0 ± 9.7e-2 13 909 15640 2.5e+1
1.0e+2 1.0e+3 8.5e-1 ± 6.4e-3 2381 10000 10000 5.0e+3 5.3e-1 ± 2.1e-3 2381 10000 10120 1.0e+0
1.0e+0 1.0e-2 6.8e-1 ± 4.5e-3 10 2367 6388 3.4e+3 4.1e-1 ± 1.9e-3 10 2384 4960 2.6e+1
NotMnistAB
1.0e-4 1.0e-3 1.0e+0 ± 7.4e-3 297 0 10000 5.0e+3 6.4e-1 ± 2.0e-4 297 0 10240 2.5e+1
1.0e-4 1.0e+3 7.9e-1 ± 2.2e-3 967 0 10000 5.0e+3 5.1e-1 ± 2.3e-4 967 0 10240 3.6e+0
1.0e+2 1.0e-3 2.4e+0 ± 2.4e-3 94 1742 28120 3.7e+3 1.2e+0 ± 3.5e-1 93 1749 24960 2.6e+1
1.0e+2 1.0e+3 9.7e-1 ± 6.7e-3 967 9850 17676 3.6e+3 4.5e-1 ± 1.4e-3 967 9850 13440 1.3e+1
1.0e+1 1.0e-3 1.3e+0 ± 1.2e-2 94 1647 17656 3.9e+3 9.3e-1 ± 8.0e-3 94 1653 12240 2.6e+1
NotMnistCD
1.0e-4 1.0e-3 1.0e+0 ± 4.5e-3 163 0 10000 5.0e+3 6.0e-1 ± 2.3e-3 163 0 10240 2.5e+1
1.0e-4 1.0e+3 7.5e-1 ± 5.8e-3 969 0 10000 5.0e+3 5.9e-1 ± 4.7e-3 969 0 10240 2.6e+0
1.0e+2 1.0e-3 1.7e+0 ± 3.3e-3 80 1378 20978 3.7e+3 1.1e+0 ± 1.4e-1 80 1382 16440 2.7e+1
1.0e+2 1.0e+3 9.5e-1 ± 4.2e-3 967 9862 11382 4.7e+3 7.8e-1 ± 5.0e-3 967 9862 10040 9.0e+0
1.0e+0 1.0e-3 4.1e-1 ± 2.9e-3 68 411 3536 3.8e+3 2.1e-1 ± 1.6e-2 68 416 2240 2.6e+1
NotMnistIJ
1.0e-4 1.0e-3 9.9e-1 ± 2.7e-3 311 0 10000 5.0e+3 5.7e-1 ± 3.9e-3 311 0 10240 2.5e+1
1.0e-4 1.0e+3 4.7e-1 ± 3.1e-3 883 0 10000 5.0e+3 2.9e-1 ± 4.8e-4 883 0 10240 5.0e+0
1.0e+2 1.0e-3 4.2e+0 ± 5.0e-3 145 2060 49304 3.8e+3 2.4e+0 ± 3.2e-1 144 2057 35360 2.6e+1
1.0e+2 1.0e+3 1.8e+0 ± 1.8e-2 859 9293 41784 4.0e+3 9.9e-1 ± 3.4e-3 859 9298 40640 1.8e+1
1.0e+1 1.0e-3 2.4e+0 ± 1.4e-2 135 1798 28892 3.6e+3 1.7e+0 ± 8.2e-2 135 1796 27360 2.6e+1
Webspamunigram
1.0e-4 1.0e-3 8.5e-1 ± 6.4e-3 795 0 10000 5.0e+3 6.6e-1 ± 1.7e-3 795 0 10240 2.5e+1
1.0e-4 1.0e+3 8.3e-1 ± 4.2e-3 994 0 10000 5.0e+3 7.8e-1 ± 5.4e-3 994 0 10240 7.0e+0
1.0e+2 1.0e-3 8.8e-1 ± 7.9e-3 156 92 9736 3.8e+3 5.0e-1 ± 4.8e-2 156 89 6946 2.6e+1
1.0e+2 1.0e+3 2.6e+0 ± 1.4e-2 890 9946 27484 3.7e+3 1.7e+0 ± 5.1e-3 890 9948 19160 2.3e+1
1.0e+2 1.0e+0 1.4e+0 ± 2.0e-3 47 3015 15812 3.3e+3 1.0e+0 ± 2.6e-2 48 3049 14240 2.6e+1

5 Concluding Remarks
This is a preliminary work: N–SMO still needs to be tested on a larger number
of datasets with different cardinalities, a strategy for tuning m must be
devised, and further comparisons with other state-of-the-art solvers must be
performed as well. Nevertheless, the N–SMO approach proved to be effective and,
as such, represents a basis for further research on these topics. To the
authors' current best knowledge, possible perspectives for further improvement
are twofold: first, the exploitation of the working set selection strategy
proposed in [4] at line 4 of Algorithm 1; second (and even more appealing), the
design of a customized caching algorithm for very large cardinality problems.

References
1. Bulatov, Y.: notMNIST dataset (2011),
http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html

2. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
3. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear
classification. The Journal of Machine Learning Research 9, 1871–1874 (2008)
4. Fan, R., Chen, P., Lin, C.: Working set selection using second order information for
training support vector machines. The Journal of Machine Learning Research 6,
1889–1918 (2005)
5. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification
(2003)
6. Joachims, T.: Making large-scale SVM learning practical. In: Advances in Kernel
Methods (1999)
7. Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K.: Improvements to Platt's
SMO algorithm for SVM classifier design. Neural Computation 13(3), 637–649 (2001)
8. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., Bengio, Y.: An empirical
evaluation of deep architectures on problems with many factors of variation. In:
Proceedings of the International Conference on Machine Learning, pp. 473–480
(2007)
9. Lin, Y., Hsieh, J., Wu, H., Jeng, J.: Three-parameter sequential minimal optimiza-
tion for support vector machines. Neurocomputing 74(17), 3467–3475 (2011)
10. Munder, S., Gavrila, D.: An experimental study on pedestrian classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence 28(11), 1863–1868
(2006)
11. Osuna, E., Freund, R., Girosi, F.: An improved training algorithm for support
vector machines. In: Proceedings of the Workshop Neural Networks for Signal
Processing (1997)
12. Platt, J.: Sequential minimal optimization: A fast algorithm for training sup-
port vector machines. In: Advances in Kernel Methods Support Vector Learning,
vol. 208, pp. 1–21 (1998)
13. Platt, J.: Using analytic QP and sparseness to speed training of support vector
machines. In: Advances in Neural Information Processing Systems, pp. 557–563
(1999)
14. Shawe-Taylor, J., Sun, S.: A review of optimization methodologies in support vector
machines. Neurocomputing 74(17), 3609–3618 (2011)
15. Vapnik, V.: Statistical learning theory. Wiley, New York (1998)
16. Webb, S., Caverlee, J., Pu, C.: Introducing the webb spam corpus: Using email
spam to identify web spam automatically. In: Proceedings of the Conference on
Email and Anti-Spam (2006)
Random Subspace Method and Genetic Algorithm
Applied to a LS-SVM Ensemble

Carlos Padilha, Adrião Dória Neto, and Jorge Melo

Federal University of Rio Grande do Norte, Department of Computer Engineering and


Automation, 59078-900 Natal, Brazil
{carlosalberto,adriao,jdmelo}@dca.ufrn.br

Abstract. The Least Squares formulation of the SVM (LS-SVM) finds the solution
by solving a set of linear equations instead of the quadratic programming
problem solved in the standard SVM. The LS-SVMs provide some free parameters
that have to be correctly chosen in order to achieve satisfactory performance.
Many tools have been developed to improve their performance, mainly new
classification methods and the employment of ensembles. In this paper, our
proposal is to use both ensemble theory and a genetic algorithm to enhance
LS-SVM classification. First, we randomly divide the problem into subspaces to
generate diversity among the classifiers of the ensemble. Then, we apply a
genetic algorithm to find the values of the LS-SVM parameters and also the
weights of the linear combination of the ensemble members, used to take the
final decision.

Keywords: Pattern Classification, LS-SVM, Ensembles, Genetic Algorithm,


Random Subspace Method.

1 Introduction
The Least Squares Support Vector Machine (LS-SVM) is a reformulation of the
standard SVM [1] introduced by Suykens [2] that uses equality constraints instead of
inequality constraints implemented in the problem formulation. Both the SVMs and
the LS-SVMs provide some parameters that have to be tuned to reflect the
requirements of the given task; if they are not correctly chosen, performance
will not be satisfactory. Despite their high performance, several techniques
have been employed in order to improve them further, either by developing new
training methods [3] or by creating ensembles [4].
The most popular ensemble learning methods are Bagging [5], Boosting [6] and
the Random Subspace Method (RSM) [7]. In Bagging, one samples the training set,
generating random independent bootstrap replicates [8], constructs the classifier on
each of these, and aggregates them by a simple majority vote in the final decision
rule. In Boosting, classifiers are constructed on weighted versions of the training set,
which are dependent on previous classification results. Initially, all objects have equal
weights, and the first classifier is constructed on this data set. Then, weights are
changed according to the performance of the classifier. Erroneously classified objects

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 164–171, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Random Subspace Method and Genetic Algorithm Applied to a LS-SVM Ensemble 165

get larger weights, and the next classifier is boosted on the reweighted training set. In
this way, a sequence of training sets and classifiers is obtained, which is then
combined by single majority voting or by weighted majority voting in the final
decision. In the RSM, classifiers are constructed in random subspaces of the data
feature space. These classifiers are usually combined by single majority voting in the
final decision rule.
Several approaches related to RSM can be found in [9, 10, 11,12]. In [9] Bryll et
al. discuss Attribute Bagging, a technique for improving the accuracy and the stability
of classifier ensembles induced using random subsets of features. The Input
Decimation method (ID) [10] generates subsets of the original feature space for
reducing the correlation among the base classifiers in an ensemble.
Each base classifier is presented with a single feature subset, each feature subset
containing features correlated with a single class. The RSM by Ho [11] is a forest
construction method, which builds a decision tree-based ensemble classifier. It
presents randomly selected subspace of features to the individual classifiers, and then
combines their output using voting. The classifier ensemble entitled Classification
Ensembles from Random Partitions (CERP) was described in [12]. In CERP, each
base classifier is constructed from a different set of attributes determined by a
mutually exclusive random partitioning of the original feature space. CERP uses
optimal tree classifiers as the base classifiers and the majority voting as combiner.
In our previous work [13], we used a GA to analyze the importance of each SVM
in the ensemble by means of a weight vector. The diversity in ensemble was
generated providing different parameter values for each model. In this paper, we
propose another way to generate diversity in ensemble and extend the use of GA,
called RSGALS-SVM. We use the combination of the RSM and GA to enhance the
classification of a LS-SVM ensemble. First, we use the RSM, constructing models in
random subspaces of an n-dimensional problem, so that each LS-SVM will be
responsible for the classification of a subproblem. Then, the GA is used to minimize
an error function and therefore it will act finding effective values for the parameters
of each model in the ensemble and a weight vector, measuring the importance of each
one in the final classification. That way, if, for example, there is one LS-SVM
whose decision surface works better than the others, the GA will find the weight
vector so that the final classification is the best possible. Finally, we
compare the proposed method with some other algorithms.
This paper is organized as follows: Section 2 introduces the LS-SVM and some of
its characteristics, and also has a brief explanation on genetic algorithms and its
mechanism. Section 3 describes our proposed method, while Section 4 has the
experimental results and its analysis. Section 5 presents the conclusion of the paper.

2 Theoretical Background

2.1 Least Squares Support Vector Machines


In this Section we consider first the case of two classes. Given a training set
of N data points {(x_k, y_k)}_{k=1}^N, where x_k ∈ R^n denotes the k-th input
pattern and y_k ∈ {−1, +1} the k-th output pattern, the support vector method
approach aims at constructing a classifier of the form:
166 C. Padilha, A.D. Neto, and J. Melo

y(x) = sign[ Σ_{k=1}^N α_k y_k K(x, x_k) + b ],   (1)

where α_k are support values and b is a real constant. For K(·,·) one typically
has the following choices: K(x, x_k) = x_k^T x (linear SVM); K(x, x_k) =
(x_k^T x + 1)^p (polynomial SVM of degree p); K(x, x_k) = exp(−‖x − x_k‖²/σ²)
(RBF SVM); K(x, x_k) = tanh(κ x_k^T x + θ) (MLP SVM), where σ, κ and θ are
constants. For the case of two classes, one assumes

w^T φ(x_k) + b ≥ +1, if y_k = +1
w^T φ(x_k) + b ≤ −1, if y_k = −1   (2)

which is equivalent to

y_k [w^T φ(x_k) + b] ≥ 1, k = 1, …, N   (3)

where φ(·) is a nonlinear function which maps the input space into a higher
dimensional space. LS-SVM classifiers as introduced in [2] are obtained as the
solution to the following optimization problem:

min_{w,b,e} J(w, b, e) = (1/2) w^T w + γ (1/2) Σ_{k=1}^N e_k²   (4)

subject to the equality constraints

y_k [w^T φ(x_k) + b] = 1 − e_k, k = 1, …, N.   (5)

One defines the Lagrangian

L(w, b, e; α) = J(w, b, e) − Σ_{k=1}^N α_k { y_k [w^T φ(x_k) + b] − 1 + e_k }   (6)

where the α_k are Lagrange multipliers, which can be either positive or negative
due to the equality constraints, as follows from the Karush-Kuhn-Tucker (KKT)
conditions. The conditions for optimality

∂L/∂w = 0 → w = Σ_{k=1}^N α_k y_k φ(x_k)
∂L/∂b = 0 → Σ_{k=1}^N α_k y_k = 0
∂L/∂e_k = 0 → α_k = γ e_k, k = 1, …, N   (7)
∂L/∂α_k = 0 → y_k [w^T φ(x_k) + b] − 1 + e_k = 0, k = 1, …, N

can be written, after elimination of w and e, as the linear system [2]:

[ 0    yᵀ          ] [ b ]     [ 0 ]
[ y    Ω + γ⁻¹ I   ] [ α ]  =  [ 1 ]   (8)

where y = [y_1; …; y_N], α = [α_1; …; α_N], 1 = [1; …; 1] and e = [e_1; …; e_N].
Mercer's condition is applied to the matrix Ω with

Ω_kl = y_k y_l φ(x_k)^T φ(x_l) = y_k y_l K(x_k, x_l).   (9)
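System (8) is a single (N+1)×(N+1) dense linear solve. The following numpy sketch (our own function names; an RBF kernel of width σ² is assumed) illustrates training and prediction:

```python
import numpy as np

def lssvm_train(X, y, gamma=1.0, sigma2=1.0):
    """Solve the LS-SVM linear system (8):
    [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    N = X.shape[0]
    # RBF kernel matrix K(x_k, x_l) = exp(-||x_k - x_l||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / sigma2)
    Omega = np.outer(y, y) * K              # Omega_kl = y_k y_l K(x_k, x_l)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                  # alpha, b

def lssvm_predict(X_train, y_train, alpha, b, X_test, sigma2=1.0):
    """Evaluate the classifier (1): y(x) = sign(sum_k alpha_k y_k K(x, x_k) + b)."""
    sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / sigma2)
    return np.sign(K @ (alpha * y_train) + b)
```

Note that, unlike the standard SVM, the α_k are generally all nonzero (α_k = γ e_k), so sparseness is lost in exchange for the linear solve.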

2.2 Genetic Algorithms


The Genetic Algorithm (GA) [14] is a search-based metaheuristic inspired by the
Darwinian principle of evolution and by the theory of genetics. The GA was
initially proposed by John Holland in 1975, as part of his attempts to explain
processes occurring in natural systems and to build artificial systems based on
such processes. In a GA, a coding space replaces the problem space, a fitness
function serves as the evaluation criterion, and a population of encoded
individuals represents a population of candidate solutions; selection and the
genetic mechanism operate on the individuals' bit strings. The population
evolves by stochastically recombining important genes of these bit strings,
stepwise approaching the optimum and thus solving the problem.

3 RSGALS-SVM
Our main objective is to improve the LS-SVM ensemble performance through the
combination of RSM and GA. In order to create this set of LS-SVMs, we delved a
little into the ensemble theory.
In [15,16,17,18] we see that an effective ensemble should consist of a set of
models that are not only highly accurate, but ones that make their errors on different
parts of the input space as well. Thus, varying the feature subsets used by each
member of the ensemble should help promote this necessary diversity. From [4] we
see that the SVM kernel that allows the highest diversity among the most popular
ones is the Radial Basis Function kernel, because its Gaussian width parameter
σ allows detailed tuning.
Therefore, the combination of RSM and GA is used to generate highly accurate
models and promote disagreement among them. Given an n-dimensional problem, we
use RSM to divide it randomly into M subspaces of the data feature space, so
that each LS-SVM will be responsible for the classification of the problem
based on the information that its subspace provides.
Once the division of the original problem into subspaces is defined, we define
how the GA will be used in this work. The GA will act on two different levels of
the ensemble: on the parameters and on the output of each model. At the first
level, the GA will find effective values of σ and of γ, the regularization term
that controls the tradeoff between allowing training errors and forcing rigid
margins, for the M LS-SVMs. At the second level, the GA will find a weight
vector w, measuring the importance of each LS-SVM in the final classification.
The final classification is obtained by a simple linear combination of the
decision values of the LS-SVMs with the weight vector. This way,
the representation of each individual of our population is defined as a vector
containing the adjustable parameters and weights.
s = [σ_1, σ_2, …, σ_M, γ_1, γ_2, …, γ_M, w_1, w_2, …, w_M]

where M is the number of LS-SVMs.
The fitness function of our GA is the error rate of the ensemble and can be seen as:

E(σ, γ, w) = err(d, y),  with  y = Σ_{i=1}^M w_i o_i, i = 1, …, M   (10)

where d contains the output patterns, y contains the final hypothesis, o contains the
LS-SVMs outputs for a given input pattern and w is the weight vector.
So, we can formulate the optimization problem to be solved by the GA:

min_{σ, γ, w} E(σ, γ, w)   (11)

subject to
1. Σ_{i=1}^M w_i = 1
2. σ_i, γ_i > 0 and w_i ≥ 0, i = 1, …, M
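As an illustration, a GA individual of this form can be decoded and scored against a fitness of this kind as follows (a sketch with hypothetical names; normalizing the weights enforces constraint 1):

```python
import numpy as np

def decode(individual, M):
    """Split an individual [sigma_1..M, gamma_1..M, w_1..M] into its parts,
    normalizing the weights so that sum_i w_i = 1 (constraint 1 of (11))."""
    ind = np.asarray(individual, dtype=float)
    sigma, gamma, w = ind[:M], ind[M:2 * M], ind[2 * M:]
    return sigma, gamma, w / w.sum()

def ensemble_error(d, O, w):
    """Fitness (10): error rate of the weighted linear combination.
    d : (n,) desired labels in {-1, +1}
    O : (M, n) decision values of the M LS-SVMs
    w : (M,) normalized weight vector"""
    y = np.sign(w @ O)          # final hypothesis
    return float(np.mean(y != d))
```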
The initial population of the GA was generated randomly. We employed stochastic-
uniform selection, which divides the parents into uniform-sized sections, each
parent’s size being determined by the fitness scaling function. In each of these
sections, one parent is chosen. The mutation function is a Gaussian; the individuals in
mutation are added with a random number from a Gaussian distribution with mean
zero, and variance declining at each generation. The crossover function is the
scattered crossover; a binary vector is generated, and the elements of this vector
decide the outcome of the crossover. If the element is a 1, the corresponding gene of
the child will come from the first parent. If it’s a 0, said gene will come from the
second parent. The size of the population is 20, at each generation two elite
individuals are kept for the next generation, and the fraction generated by crossover is
0.8. The GA runs for 100 generations. Table 1 shows the method's pseudo code.
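The crossover and mutation operators described above can be sketched as follows (an illustration only, with our own naming; the authors' exact implementations may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def scattered_crossover(p1, p2):
    """Scattered crossover: a random binary mask decides, gene by gene,
    which parent contributes to the child."""
    mask = rng.integers(0, 2, size=len(p1)).astype(bool)
    return np.where(mask, p1, p2)

def gaussian_mutation(ind, generation, n_gen, scale0=1.0):
    """Gaussian mutation: add zero-mean noise whose standard deviation
    declines linearly with the generation index."""
    scale = scale0 * (1.0 - generation / n_gen)
    return ind + rng.normal(0.0, scale, size=len(ind))
```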

Table 1. RSGALS-SVM algorithm

Given:
  P = {(x_1, d_1), …, (x_n, d_n)}, x_k ∈ R^n, d_k ∈ {−1, +1}, the input set
Procedure:
  Generate from P the training set T and the test set V
  Randomly divide P into M subspaces of features
  Generate M LS-SVMs to compose the ensemble, each one trained using one of those
  groups of features
  Call the GA to solve the optimization problem:
    min_{σ, γ, w} E(σ, γ, w)
    subject to
    1. Σ_{i=1}^M w_i = 1
    2. σ_i, γ_i > 0 and w_i ≥ 0, i = 1, …, M
  Retrieve the optimal values for σ, γ and the optimal weight vector w
  Evaluate the ensemble using V, with the same division made in P
Output: Final Classification y = sign(Σ_{i=1}^M w_i o_i)
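The "randomly divide P into M subspaces of features" step can be sketched as follows (a minimal illustration, not the authors' code):

```python
import numpy as np

def random_subspaces(n_features, M, rng=None):
    """Randomly partition the feature indexes into M (near-)equal groups,
    one group per LS-SVM in the ensemble."""
    if rng is None:
        rng = np.random.default_rng(0)
    perm = rng.permutation(n_features)   # shuffle feature indexes
    return np.array_split(perm, M)       # M disjoint groups covering all features
```

Each LS-SVM is then trained on the columns of T selected by its own group, and the same column selection is applied to V at evaluation time.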

4 Experimental Results
To evaluate the performance of the proposed method, tests were performed using
11 two-class benchmark data sets as used in [19], which include various types of
classification problems (real-world and artificial). Table 2 shows the number of
inputs, training data, test data, and the number of different partitions for
training and test data.

The data sets chosen vary across a number of dimensions, including the type of
the features in the data set (continuous, discrete or a mix of the two) and the
number of examples in the data set. We compared the proposed method to a single
RBF Network, AdaBoost with RBF Networks and an SVM (with Gaussian kernel)
trained with a GA; their results were obtained from [19]. All results are
averaged over all partitions of every problem. In all tests, we used an
ensemble composed of 5 LS-SVMs.
Table 3 shows the average recognition rates and their standard deviations of the
validation data sets by RBF, AdaBoost, SVM and proposed method.
In almost all tests, the results of AdaBoost are worse than those of the single
classifier. Analyzing these results, this is clearly due to the overfitting of
AdaBoost. In [19], the authors explain that if early stopping is used, this
effect is less drastic but still observable.
The averaged results of RSGALS-SVM are somewhat better than the results achieved
by the other classifiers in most tests (7/11). A significance test (95% t-test)
was performed, as shown in Table 3, and it indicated that the proposed method
gives the best overall performance. The results of SVM are often better than
those of the RBF classifier.

Table 2. Two-class benchmark data sets


Data Inputs Training Test Partitions
B. cancer 9 200 77 100
Diabetis 8 468 300 100
German 20 700 300 100
Heart 13 170 100 100
Image 18 1300 1010 20
Ringnorm 20 400 7000 100
F. solar 9 666 400 100
Splice 60 1000 2175 20
Thyroid 5 140 75 100
Twonorm 20 400 7000 100
Waveform 21 400 4600 100

Table 3. Comparison between the RSGALS-SVM, a single RBF classifier, AdaBoost (AB) and
a Support Vector Machine trained with GA (GA-SVM). The best average recognition rate is
shown in boldface. The columns T1 and T2 show the results of a significance test
(95% t-test) between AB/RSGALS-SVM and GA-SVM/RSGALS-SVM, respectively.

Data RBF AB T1 GA-SVM T2 RSGALS-SVM


B. cancer 72.4±4.7 69.6±4.7 + 74±4.7 74.1±0.1
Diabetis 75.7±1.9 73.5±2.3 + 76.5±1.7 76.3±0.3
German 75.3±2.4 72.5±2.5 + 76.4±2.1 + 78±0.6
Heart 82.4±3.3 79.7±3.4 + 84±3.3 + 87±0.5
Image 96.7±0.6 97.3±0.7 97±0.6 96.3±0.1
Ringnorm 98.3±0.2 98.1±0.3 + 98.3±0.1 - 97.5±0.3
F. solar 65.6±2.0 64.3±1.8 + 67.6±1.8 + 70.3±0.6
Splice 90±1.0 89.9±0.5 + 89.1±0.7 + 90.1±0.1
Thyroid 95.5±2.1 95.6±0.6 - 95.2±2.2 94.7±0.6
Twonorm 97.1±0.3 97±0.3 97±0.2 + 97.9±0.2
Waveform 89.3±1.1 89.2±0.6 + 90.1±0.4 + 92.7±0.2

5 Conclusion
In this work, we proposed two changes relative to our previous work [13]: we
incorporated the RSM to perform feature selection, creating diversity among the
LS-SVMs in the ensemble, and we extended the use of the GA to find good values
for the parameters (σ, γ). The search space of these parameters is enormous in
complex problems due to their large range of values, which is why we extended
this global search technique (GA) to find their values. We tested the previous
work using 4 data sets (Image, Ringnorm, Splice and Waveform), and this work
obtained better results in all cases.
We compared the proposed method RSGALS-SVM to a single RBF classifier,
AdaBoost with RBF networks and GA-SVM (with Gaussian kernel) and it achieved
better results than these traditional classifiers in most tests.
Many improvements are possible and need to be explored. For example, we can
investigate further expanding the use of the GA to perform feature selection,
as in [20], while keeping the fitness function used in this work.

References
1. Vapnik, V.: Statistical Learning Theory. John Wiley and Sons Inc., New York (1998)
2. Suykens, J.A.K., Vandewalle, J.: Least-Squares Support Vector Machine Classifiers.
Neural Processing Letters 9(3) (1999)
3. Osuna, E., Freund, R., Girosi, F.: An Improved Training Algorithm for Support Vector
Machines. In: NNSP 1997 (1997)
4. Lima, N., Dória Neto, A., Melo, J.: Creating an Ensemble of Diverse Support Vector
Machines Using Adaboost. In: Proceedings on International Joint Conference on Neural
Networks (2009)
5. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
6. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Proceedings
13th International Conference on Machine Learning, pp. 148–156 (1996)
7. Ho, T.K.: The Random subspace method for constructing decision forests. IEEE
Transactions Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
8. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall, New York
(1993)
9. Bryll, R., Gutierrez-Osuna, R., Quek, F.: Attribute Bagging: Improving Accuracy of
Classifier Ensembles by using Random Feature Subsets. Pattern Recognition 36, 1291–
1302 (2003)
10. Oza, N.C., Tumer, K.: Input Decimation Ensembles: Decorrelation through
Dimensionality Reduction. In: Kittler, J., Roli, F. (eds.) MCS 2001. LNCS, vol. 2096, pp.
238–247. Springer, Heidelberg (2001)
11. Ho, T.K.: The Random Subspace Method for Constructing Decision Forests. IEEE
Transactions Pattern Analysis and Machine Intelligence 20, 832–844 (1998)
12. Ahn, H., Moon, H., Fazzari, M.J., Lim, N., Chen, J., Kodell, R.: Classification by
ensembles from random partitions of high-dimensional data. Computational Statistics and
Data Analysis 51, 6166–6179 (2007)
13. Padilha, C., Lima, N., Dória Neto, A., Melo, J.: An Genetic Approach to Support Vector
Machines in classification problems. In: Proceedings on International Joint Conference on
Neural Networks (2010)

14. Castro, L., Zuben, F.V.: Algoritmos Genéticos. Universidade Estadual de Campinas
(2002),
ftp://ftp.dca.fee.unicamp.br/pub/docs/
vonzuben/ia707_02/topico9_02.pdf
15. Kuncheva, L., Whitaker, C.: Measures of diversity in classifier ensembles and their
relationship with the ensemble accuracy. Machine Learning 51(2), 181–207 (2003)
16. Hansen, L., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern
Analysis and Machine Intelligence 12, 993–1001 (1990)
17. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning.
In: Advances in Neural Information Processing Systems, vol. 7, pp. 231–238. MIT Press,
Cambridge (1995)
18. Opitz, D., Shavlik, J.: Actively searching for an effective neural-network ensemble.
Connection Science 8(3/4), 337–353 (1996)
19. Rätsch, G., Onoda, T., Müller, K.-R.: Soft Margins for Adaboost. Machine Learning 42
(2001)
20. Opitz, D.: Feature Selection for Ensembles. In: Proceedings of the Sixteenth National
Conference on Artificial Intelligence (1999)
Text Recognition in Videos Using a Recurrent
Connectionist Approach

Khaoula Elagouni1,2, Christophe Garcia3 , Franck Mamalet1 ,


and Pascale Sébillot2
1
Orange Labs R&D, 35512 Cesson Sévigné, France
{khaoula.elagouni,franck.mamalet}@orange.com
2
IRISA, INSA de Rennes, 35042 Rennes, France
pascale.sebillot@irisa.fr
3
LIRIS, INSA de Lyon, 69621 Villeurbanne, France
christophe.garcia@liris.cnrs.fr

Abstract. Most OCR (Optical Character Recognition) systems devel-


oped to recognize texts embedded in multimedia documents segment the
text into characters before recognizing them. In this paper, we propose a
novel approach able to avoid any explicit character segmentation. Using
a multi-scale scanning scheme, texts extracted from videos are first rep-
resented by sequences of learnt features. Obtained representations are
then used to feed a connectionist recurrent model specifically designed
to take into account dependencies between successive learnt features and
to recognize texts. The proposed video OCR evaluated on a database of
TV news videos achieves very high recognition rates. Experiments also
demonstrate that, for our recognition task, learnt feature representations
perform better than hand-crafted features.

Keywords: Video text recognition, multi-scale image scanning,


ConvNet, LSTM, CTC.

1 Introduction

Visual patterns in multimedia documents usually contain relevant information that allows content indexing. In particular, texts embedded in videos often provide high-level semantic clues that can be used to develop several applications and services, such as multimedia document indexing and retrieval schemes, teaching videos and robotic vision systems. In this context, the design of efficient
Optical Character Recognition (OCR) systems specifically adapted to video data
is an important issue. However, the huge diversity of texts and their difficult ac-
quisition conditions (low resolution, complex background, non uniform lighting,
etc.) make the task of video embedded text recognition a challenging problem.
Most prior research in OCR has focused on scanned documents and handwritten text recognition. Recently, systems dedicated to texts embedded in videos have generated significant interest in the OCR community [10,2,12]. Most of the proposed approaches rely on an initial segmentation step that splits texts

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 172–179, 2012.

c Springer-Verlag Berlin Heidelberg 2012
Text Recognition Using a Connectionist Approach 173

into individual characters (a complete survey of character segmentation methods is presented in [1]) and a second one that recognizes each segmented character.
However, the different distortions in text images and the low resolution of videos
make the segmentation very hard, leading to poor recognition results. To improve
performance, some authors reduce segmentation ambiguities by considering char-
acter recognition results and by introducing some linguistic knowledge [4]. In our
recent work [3] dedicated to natural scene text recognition, an approach that
avoids the segmentation step by using a multi-scale character recognition and a
graph model was proposed. Even though this method obtained good results, its
main drawback remains the high complexity of the graph model which has to
deal with all the recognition results. Other recent work applied to handwriting
recognition [7] has also proposed to avoid any segmentation, using a connec-
tionist model that relies on a recurrent neural network (RNN) trained with
hand-crafted features extracted from the input image. In this paper, we adapt
this segmentation-free idea to the task of video text recognition,
and propose an OCR scheme that relies on a connectionist temporal approach; a
second contribution lies in a multi-scale representation of text images by means
of learnt features particularly robust to complex background and low resolution.
The remainder of the paper is organized as follows: after presenting the outline of our approach in section 2, we detail our method to generate feature-based
representations of texts (section 3). Section 4 introduces the fundamentals of the
recurrent neural model used and describes the chosen architecture. Finally, ex-
periments and obtained results are reported in section 5 before concluding and
highlighting our future work in section 6.

2 Proposed Approach
The first task for video text recognition consists in detecting and extracting texts
from videos as described in [4]. Once extracted, text images are recognized by
means of two main steps as depicted in fig. 1: generation of text image repre-
sentations and text recognition. In the first step, images are scanned at different
scales so that, for each position in the image, four different windows are ex-
tracted. Each window is then represented by a vector of features learnt with a
convolutional neural network (ConvNet). Considering the different positions in
the scanning step and the four windows extracted each time, a sequence of learnt feature vectors (X^0, ..., X^t, ..., X^p) is thus generated to represent each image.
The second step of the proposed OCR is similar to the model presented in [7],
using a specific bidirectional recurrent neural network (BLSTM) able to learn
to recognize text making use of both future and past context. The recurrent
network is also characterized by a specific objective function (CTC) [7], that
allows the classification of non-segmented characters. Finally, the network’s out-
puts are decoded to obtain the recognized text. The following sections describe
these different steps and their interactions within the recognition scheme.

Fig. 1. Scheme of the proposed approach

3 Multi-scale Feature Learning for Character Representation
Our objective is to produce a relevant representation of texts extracted from
videos, which has to be robust to noise, deformations, and translations. To that
end, text images are scanned with different window sizes (cf. section 3.1) then
each window is represented by a set of learnt features (cf. section 3.2).

3.1 Multi-scale Image Scanning Scheme


Text images usually consist of a succession of characters having different sizes
and shapes depending on their labels. We therefore propose to scan each full
text image at several scales and at regular, close positions (typically, a step of h/8, where h is the image height) to ensure that at least one window will be aligned with each character in the image. Thus, at each position, different scales are considered to handle various character sizes. Experiments have shown that good results are obtained with four windows of widths h/4, h/2, 3h/4 and h.
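The multi-scale scan described above can be sketched as follows. This is a minimal illustration; the function name and the filtering of windows that fall outside the image are our own assumptions, not taken from the paper:

```python
def scan_positions(image_width, h):
    """Generate (x, width) pairs for the multi-scale scan.

    The image is scanned with a horizontal step of h/8 (h = image height);
    at each position four window widths are tried: h/4, h/2, 3h/4 and h.
    """
    step = max(1, h // 8)
    widths = [h // 4, h // 2, 3 * h // 4, h]
    windows = []
    for x in range(0, image_width, step):
        for w in widths:
            if x + w <= image_width:  # keep only windows inside the image
                windows.append((x, w))
    return windows
```

Each resulting window is then fed to the feature extractor of section 3.2, so that every scan position contributes one feature vector per scale.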

Furthermore, since the characters can have different morphologies, we adapt the window borders to the local morphology of the image and hence, when possible, clean the neighborhood of characters. For each window position and scale, the borders within the full image are computed as follows. (For figure clarity, the computation of non-linear borders is not shown in fig. 1, but interested readers can refer to [4] for more details.) Assuming that pixels in text images belong to two classes—“text” and “background”—a pre-processing step generates a fuzzy map which encodes, for each pixel, its membership degree to the class “text”. Using a shortest path algorithm within the obtained map, non-linear vertical borders are computed, following pixels that have a low probability of belonging to the class “text”. In case of complex background or non-separated characters, the shortest path algorithm induces straight vertical borders.
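The shortest-path border computation can be sketched as a dynamic program over the fuzzy map, in the spirit of seam carving. This is a simplified illustration under our own assumptions (cost function, one-column neighborhood); the actual algorithm in [4] may differ:

```python
import numpy as np

def vertical_border(fuzzy_map):
    """Trace a non-linear vertical border through the fuzzy map.

    fuzzy_map[y, x] is the membership degree to the class "text" (0..1);
    the returned path follows pixels with low text probability, moving at
    most one column left or right per row (a seam-style shortest path).
    """
    h, w = fuzzy_map.shape
    cost = np.full((h, w), np.inf)
    cost[0] = fuzzy_map[0]
    back = np.zeros((h, w), dtype=int)
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(0, x - 1), min(w, x + 2)
            best = lo + int(np.argmin(cost[y - 1, lo:hi]))
            cost[y, x] = fuzzy_map[y, x] + cost[y - 1, best]
            back[y, x] = best
    # backtrack from the cheapest bottom-row pixel
    x = int(np.argmin(cost[-1]))
    path = [x]
    for y in range(h - 1, 0, -1):
        x = int(back[y, x])
        path.append(x)
    return path[::-1]  # one column index per row, top to bottom
```

On a map where one column has uniformly low text probability, the path simply follows that column, which matches the straight-border fallback mentioned above.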

3.2 Neural-Based Model for Feature Learning


For each position in the text, considering different scales, four windows are extracted, from which accurate representations that preserve the information useful for the recognition task have to be found. In [7], Graves et al. used hand-crafted features, which are known not to be robust to noise and deformations.
We thus propose to use learnt features. In this context, Convolutional Neural
Networks (ConvNets) [9] have shown to be well adapted [11] and particularly
robust to complex background and low resolution. A ConvNet is a bio-inspired
hierarchical multi-layered neural network able to learn visual patterns directly
from the image pixels without any pre-processing. Relying on specific properties
(local receptive fields, weight sharing and sub-sampling), this model learns to
extract appropriate descriptors and to recognize characters at the same time.
The proposed method consists in representing sliding windows by the de-
scriptors learnt by ConvNets. First a ConvNet is trained to classify images of
individual characters. Then, the stimulation of this network is applied on each
window (being a character or not), and the vector of the penultimate layer
activations, considered as a feature extraction layer, is used as the window’s
descriptor. In our experiments, several configurations of ConvNets have been
tested. The best configuration takes as input a color window image mapped into
three 36 × 36 input maps, containing values normalized between −1 and 1, and
returns a vector of values normalized with the softmax function. The architecture
of our ConvNet is similar to the one presented in [4] and consists of six hidden
layers. The first four ones are alternated convolutional and sub-sampling layers
connected to three other neuron layers where the penultimate layer contains 50
neurons. Therefore, using this network architecture, each position in the text
image is represented by a vector of 200 values (50 values for each scale window)
corresponding to the features learnt by the ConvNet model.
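The two core ConvNet operations mentioned above (convolution with shared weights over local receptive fields, and sub-sampling) can be illustrated in isolation. This is a didactic numpy sketch of the operations, not the trained six-layer network:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """'Valid' 2-D convolution: a local receptive field slides over the
    image and the same kernel weights are shared at every position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def subsample(feature_map, factor=2):
    """Average-pooling sub-sampling layer, reducing each dimension by
    `factor` and providing some translation invariance."""
    h, w = feature_map.shape
    h, w = h - h % factor, w - w % factor
    fm = feature_map[:h, :w]
    return fm.reshape(h // factor, factor,
                      w // factor, factor).mean(axis=(1, 3))
```

In the paper's setting, a 36 x 36 input map passes through alternating layers of this kind before reaching the 50-neuron penultimate layer whose activations serve as the window descriptor.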

4 Text Recognition Using a Recurrent Neural Model


Once text images are represented by sequences of automatically learnt features,
we combine a particular RNN (BLSTM) and a connectionist classification model
(CTC) to build a model able to learn how to classify the feature sequences and
hence recognize texts. While the BLSTM allows to handle long-range depen-
dencies between features, the CTC enables our scheme to avoid any explicit
segmentation in characters, and learn to recognize jointly a sequence of classes
and their positions in the input data.

4.1 Bidirectional Long Short-Term Memory (BLSTM)

The basic idea of RNNs is to introduce recurrent connections which enable the network to maintain an internal state and thus to take into account the past context. However, these models have a limited “memory” and are not able to look far back into the past [8], becoming insufficient when dealing with long input sequences, such as our feature sequences. To overcome this problem, the Long
Short-Term Memory (LSTM) [5] model was proposed to handle data with long
range interdependencies. An LSTM neuron contains a constant “memory cell”—namely the constant error carousel (CEC)—whose access is controlled by multiplicative gates. For these reasons, we chose to use the LSTM model to classify
our learnt feature sequences. Moreover, in our task of text recognition, the past
context is as important as the future one (i.e., both previous and next letters are
important to recognize the current letter). Hence, we propose to use a bidirectional LSTM, which consists of two separate hidden layers of LSTM neurons. The first one processes the forward pass, making use of the past context, while the second serves the backward pass, making use of the future context.
Both hidden layers are connected to the same output layer (cf. fig. 1).
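A single LSTM step, with the gated memory-cell update described above, can be sketched as follows. This is a minimal variant without peephole connections, and the stacked parameter layout is our own convention; a bidirectional layer simply runs two such recurrences, one of them over the reversed sequence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4n x d), U (4n x n) and b (4n) stack the
    input-gate, forget-gate, output-gate and cell-candidate parameters
    along their first axis."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])            # input gate
    f = sigmoid(z[n:2 * n])       # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:])        # candidate cell value
    c = f * c_prev + i * g        # gated CEC update
    h = o * np.tanh(c)            # gated output
    return h, c
```

The gates decide what enters, stays in, and leaves the CEC, which is what lets the cell preserve information over long input sequences.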

4.2 Connectionist Temporal Classification (CTC)

Even though BLSTM networks are able to model long-range dependencies, as for
classical RNNs, they require pre-segmented training data to provide the correct
target at each timestep. The Connectionist Temporal Classification (CTC) is a particular objective function, defined in [6], that extends the use of RNNs to the case of
non-segmented data. Given an input sequence, it allows the network to jointly
learn a sequence of labels and their positions in the input data. By considering
an additional class called “Blank”, the CTC makes it possible to transform the BLSTM
network outputs into a conditional probability distribution over label sequences
(“Blank” and Characters). Once the network is trained, CTC activation outputs
can be decoded, removing the “Blank” timesteps, to obtain a sequence of labels
corresponding to a given input sequence. In our application, a best path decoding
algorithm is used to identify the most probable sequence of labels.
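Best path decoding can be sketched directly: take the most probable class at each timestep, collapse consecutive repeats, then remove the blanks. Using index 0 for the “Blank” class is our own convention here:

```python
def best_path_decode(outputs, blank=0):
    """Greedy CTC decoding.

    outputs: one list of class scores per timestep. Returns the label
    sequence after collapsing repeats and dropping the blank class.
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in outputs]
    labels = []
    prev = None
    for k in best:
        if k != prev and k != blank:
            labels.append(k)
        prev = k
    return labels
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can output doubled letters.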

4.3 Network Architecture and Training

After testing several architectures, a BLSTM with two hidden layers of 150
neurons, each one containing recurrent connections with all the other LSTM cells
and fully connected to the input and the output layers, has been chosen. The
network takes as input a sequence of vectors of 200 values normalized between −1 and 1, and returns a sequence of vectors of 43 outputs (42 neurons corresponding
to 42 classes of characters, namely letters, numbers and punctuation marks, and
one more neuron for the class “Blank”). In our experimental data, depending on
the text image size, the sequence of inputs can contain up to 300 vectors. The
network is trained with the classical back-propagation through time algorithm, using a learning rate of 10^-4 and a momentum of 0.9.

5 Experimental Setup and Results


This section reports several tests and discusses obtained results. After presenting
the datasets, the proposed OCR scheme is evaluated and compared to other
state-of-the-art methods. Learnt feature representations are shown to outperform
hand-crafted features, leading to better recognition results.

5.1 Datasets
Our experiments have been carried out on a dataset of 32 videos of French
news broadcast programs. Each video, encoded in MPEG-4 (H.264) format at
720 × 576 resolution, is about 30 minutes long and contains around 400 words
which correspond to a set of 2200 characters (i.e., small and capital letters,
numbers and punctuation marks). Embedded texts can vary a lot in terms of
size (from 8 to 24 pixels of height), color, font and background. Four videos
were used to generate a dataset of 15168 images of single characters perfectly
segmented. This database—called CharDb—consists of 42 classes of characters
(26 letters, 10 numbers, the space character and 5 special characters; namely ’.’,
’-’, ’(’, ’)’ and ’:’) and is used to train the ConvNet described in section 3.2. The
remaining videos were annotated and divided into two sets: VidTrainDb and
VidTestDb containing respectively 20 and 8 videos. While the first one is used
to train the BLSTM, the second is used to test the complete OCR scheme.

5.2 Experimental Results


The training phase of the BLSTM is performed on a set of 1399 text images
extracted from VidTrainDb. Once trained, the BLSTM network is evaluated on a
set of 734 text images extracted from VidTestDb. To evaluate the contribution of
learnt features, the BLSTM was trained with two types of input features (namely
hand-crafted and learnt ones) and evaluated with the Levenshtein distance.
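The Levenshtein distance used for evaluation, and a character recognition rate derived from it, can be sketched as follows. Normalizing the edit distance by the reference length is our assumption about how the rate is computed:

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between a
    reference string and a hypothesis string."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,                 # deletion
                           row[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (r != h)))     # substitution
        prev_row = row
    return prev_row[-1]

def char_recognition_rate(ref, hyp):
    """Character recognition rate as 1 minus the normalized edit
    distance, expressed as a percentage."""
    return 100.0 * (1.0 - levenshtein(ref, hyp) / max(1, len(ref)))
```

For example, a single wrong character in a four-character reference gives a 75% rate.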
On the one hand, texts are represented by sequences of hand-crafted features
as proposed in [7] for handwriting recognition. In this experimentation, text im-
ages are first binarized; then nine geometrical features per column are extracted
(average, gravity center, etc.). The BLSTM trained with these features achieves
a good performance of 92.73% of character recognition rate (cf. table 1).
On the other hand, as proposed in section 3, we represent text images by
means of multi-scale learnt features. For this, the ConvNet was trained to recognize images of characters on 90% of CharDb, and its classification performance

Fig. 2. Example of recognized text: each class is represented by a color, the label “ ”
represents the class “space” and the gray curve corresponds to the class “Blank”

Table 1. Usefulness of learnt features

Used features          Character recognition rate
Geometrical features   92.73%
Learnt features        97.18%

Table 2. Comparison of the proposed scheme to a state-of-the-art method and commercial OCR engines

System                 Character recognition rate   Word recognition rate
Proposed OCR scheme    97.35%                       87.20%
Elagouni et al. [4]    94.95%                       78.24%
ABBYY engine           94.68%                       87.23%
Tesseract engine       88.19%                       69.22%
SimpleOCR engine       73.58%                       29.01%
GNU OCR engine         56.47%                       12.79%

was evaluated on the remaining 10%. A very high recognition rate of 98.04%
was obtained. Learnt features were thus generated with the trained ConvNet
and used to feed the BLSTM. Fig. 2 illustrates an example of recognized text
and shows its corresponding BLSTM outputs where each recognized character
is represented with a peak. Even though extracted geometrical features achieve
good performance, for our application, they seem to be less adapted than learnt
features which obtain a high character recognition rate of 97.18% (cf. table 1).
The main improvement is observed for text images with complex background,
for which the geometrical features introduced high inter-class confusions.
We further compare our complete OCR scheme to another previously pub-
lished method [4] and commercial OCR engines, namely ABBYY, Tesseract,
GNU OCR, and SimpleOCR. Using the detection and extraction modules pro-
posed in [4], these different systems were tested and their performances were
evaluated. As shown in table 2, the proposed OCR yields the best results and
outperforms commercial OCRs.

6 Conclusions
We have presented an OCR scheme adapted to the recognition of texts extracted
from digital videos. Using a multi-scale scanning scheme, a novel representation
of text images built with features learnt by a ConvNet is generated. Based on a particular recurrent neural network—namely the BLSTM—and a connectionist
classification—namely the CTC—our approach takes as input generated repre-
sentations and recognizes texts. Besides its ability to make use of learnt feature
dependencies, the proposed method permits to avoid the difficult character seg-
mentation step. Our experiments have highlighted that learnt feature representations are well adapted to texts embedded in videos and yield better results than hand-crafted features. Our complete scheme was evaluated on a dataset of
news TV videos and obtained promising results (exceeding 97% of characters and
87% of words correctly recognized) outperforming other state-of-the-art meth-
ods and commercial OCRs. As future extensions of this work, we plan to test
our approach on scene text images (i.e., no longer on embedded text) and also
to produce new text representations based on unsupervised learning techniques
(autoencoders) and evaluate their contribution to our recognition task.

References
1. Casey, R., Lecolinet, E.: A survey of methods and strategies in character segmentation. PAMI 18(7), 690–706 (1996)
2. Chen, D., Odobez, J., Bourlard, H.: Text detection and recognition in images and
video frames. PR 37(3), 595–608 (2004)
3. Elagouni, K., Garcia, C., Mamalet, F., Sébillot, P.: Combining multi-scale character
recognition and linguistic knowledge for natural scene text OCR. In: DAS, pp. 120–
124 (2012)
4. Elagouni, K., Garcia, C., Sébillot, P.: A comprehensive neural-based approach for
text recognition in videos using natural language processing. In: ICMR (2011)
5. Gers, F., Schraudolph, N., Schmidhuber, J.: Learning precise timing with LSTM recurrent networks. JMLR 3(1), 115–143 (2003)
6. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal
classification: Labelling unsegmented sequence data with recurrent neural net-
works. In: ICML, pp. 369–376 (2006)
7. Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhu-
ber, J.: A novel connectionist system for unconstrained handwriting recognition.
PAMI 31(5), 855–868 (2009)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computa-
tion 9(8) (1997)
9. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series.
In: The Handbook of Brain Theory and Neural Networks. MIT Press (1995)
10. Lienhart, R., Effelsberg, W.: Automatic text segmentation and text recognition for
video indexing. Multimedia Systems 8(1), 69–81 (2000)
11. Saidane, Z., Garcia, C.: Automatic scene text recognition using a convolutional
neural network. In: ICBDAR, pp. 100–106 (2007)
12. Yi, J., Peng, Y., Xiao, J.: Using multiple frame integration for the text recognition
of video. In: ICDAR, pp. 71–75 (2009)
An Investigation of Ensemble Systems Applied
to Encrypted and Cancellable Biometric Data

Isaac de L. Oliveira Filho, Benjamín R.C. Bedregal, and Anne M.P. Canuto

Department of Informatics and Applied Mathematics,
Federal University of RN, Natal, RN, Brazil, 59072-970
isaac@ppgsc.ufrn.br, {anne,bedregal}@dimap.ufrn.br

Abstract. In this paper, we propose the simultaneous use of cryptography and transformation functions in biometric-based identification systems, aiming to increase the security level of biometric data as well as the performance of these systems. Additionally, we aim to keep a reasonable efficiency level of these data through the use of more elaborate classification structures, such as ensemble systems. With this proposal, we intend to have a robust and secure identification system using signature data.

Keywords: Ensemble systems, Cryptosystem, Cancellable biometric data.

1 Introduction
The use of different approaches for the identification of individuals in user-access
systems reflects the relevance of information security in data storage. For exam-
ple, passwords, key phrases and identification numbers have traditionally been
used in the authentication process. However, they can be used in a fraudulent
way. In order to increase the security and robustness of identification systems, it
is important to use more elaborate approaches, such as biometric data. These features are unique to each person, which increases the reliability, convenience and universality of identification systems [3]. However, there are still some issues that need to be addressed in biometric-based identification systems. The
main issues are concerned with the security of biometric identification systems
since these systems need to ensure their integrity and public acceptance. For
biometric-based identification systems, security is even more important than for
the non-biometric systems, since a biometric is permanently associated with a
user and cannot be revoked or cancelled if compromised. Therefore, it is impor-
tant to avoid an explicit storage of biometric templates in the system, eliminating
any possibility of leakage of the original biometric trait.
Cancellable biometrics have been increasingly applied to address such security
issues [8]. This term commonly refers to the application of non-invertible
and repeatable modifications to the original biometric templates. However, the
use of transformation functions on biometric data still allows the improper use of this information by unauthorized individuals. In [4], for instance, it was shown

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 180–188, 2012.

c Springer-Verlag Berlin Heidelberg 2012
Applying Ensembles on Encrypted and Cancellable Signature Data 181

that the use of ensemble systems on cancellable data achieved an accuracy level similar to that of the original data. Therefore, in case of being stolen, the original data could not be obtained, but the transformed dataset could still be used with a reasonable performance level.
Aiming to increase the security level of biometric datasets, we propose an analysis of the combined use of cryptography and transformation methods on biometric data, focusing on signature as the main biometric of this analysis. The main goal of this work is to analyze elaborate classification structures on secure datasets. In order to do this, a transformation function was applied to the original signature dataset, creating a transformed dataset. Then, the latest version of a strong cryptosystem, initially developed in [2], was applied to the transformed signature dataset, creating a cryptographic/transformed dataset. A comparative analysis of these three datasets is made, and the results show that the cryptosystem breaks the relationship between attributes and patterns, decreasing the performance level of the ensemble systems. In this way, it is possible to say that these data are more secure than with the use of a transformation method alone. In addition, the use of a transformation method guarantees that the original data will not be recovered, even if the encrypted data is broken, providing robust and secure biometric-based identification systems.

2 Increasing Security in Biometric Data

In the context of biometric data, the unauthorized copy of stored data is proba-
bly the most dangerous threat, regarding the privacy and security of the users. In
order to offer security for biometric-based identification systems, the biometric
templates must be stored in a protected way. There are several template protec-
tion methods proposed in the literature. In [7], these methods were broadly di-
vided into two classes of methods, which are: biometric cryptosystem and feature
transformation functions. In the latter, a transformation function (f) is applied to the biometric template (T) and only the transformed template (f(T)) is stored
in the database. These functions can be categorized as salting and non-invertible
transformations. In salting, the transformation function (f) is invertible, while f
is (as implied in the name) non-invertible in the non-invertible transformations.
In this work, we will focus on the use of non-invertible transformation functions.
Hence, hereafter, the terms transformation function and template protection will
be taken as referring to the non-invertible transformation function.
The use of signature template protection systems was first considered in [13], based on the biometric cryptosystem approach (key generation cryptosystem). In this method, a set of parametric features was extracted from the
acquired dynamic signatures and a hash function was applied to the feature
binary representation, exploiting some statistical properties of the enrolment
signatures. Another study can be found in [5]. In this work, an adaptation of the
fuzzy vault to signature protection has been proposed, employing a quantized
set of maxima and minima of the temporal functions mixed with chaff points in
order to provide security.

In the field of non-invertible transforms, in [11], for instance, a signature template protection scheme, called BioConvolving, in which non-invertible transforms are applied to a set of signature sequences, has been presented and its non-invertibility discussed. In the proposed method, the signature is represented by 7 (seven) signature time sequences (x-position, y-position, pressure signal, path-tangent angle, path velocity magnitude, log curve radius and total acceleration
tangent angle, path velocity magnitude, log curve radius and total acceleration
magnitude), which represent the features extracted and all the transformation
functions are applied to these sequences. In this paper, we will use the BioConvolving method on the signature dataset.
In all the aforementioned works, only one protection method (transformation or cryptosystem) was applied. Nevertheless, the question with this procedure is whether the biometric data can be captured and understood by others even when it is encrypted and/or transformed. The sole use of transformation functions usually allows a good performance for the classification methods applied to the dataset, which means that the relationship between the original and transformed data is still very strong. However, the sole use of a cryptosystem still relies on the original dataset in the classification process (using a coding/decoding algorithm to obtain the original dataset), leaving it vulnerable to being stolen. It is therefore necessary to find a balance in the use of these methods: a strong cryptosystem must not completely break the attribute-pattern relationships, so as to provide security while keeping a minimal level of correct classification at the same time. The main aim, however, is to use a transformation function so that only a transformed dataset is stored and, in case of theft, the original dataset cannot be recovered; the cryptosystem is then applied to the transformed dataset (after the transformation function). Thus, we aim to overcome the main drawbacks of both approaches.

3 Papílio Cryptosystem and Transformation Method


In the literature, the main cryptosystems are AES [9], RSA [12] and RC4 [1]. In this paper, we use the Papílio cryptosystem [2]. It is a Feistel cipher encryption algorithm in which the function F is computed by the Modified Viterbi algorithm [2], whose parameters are the codification rate n/s, Q, m and the polynomial generator. Currently, blocks of any size of bits are considered. Also, the size of the key is 128 bits, but this size could be made variable, and the number of rounds may vary between zero and sixteen.
The Papílio decryption process is essentially the same as the encryption process, except that the sub-keys are employed in reverse order. There is a function F for each Papílio round in both processes. This function F is the same for all rounds and is considered the main component of the process. The encryption and decryption processes always begin by dividing the text block into two halves of m/2 bits. The right half is used as the input of the function F, and the left half is XORed with the output of F. The result of this XOR operation becomes the input of F in the next round, and so on until the last round.
The size (number of bits) of the resulting encrypted text is the same as that of the plaintext (the original file), which is an advantage of the Papílio method.
Therefore, it is possible to obtain a ciphertext for each completed round. By varying the number of rounds and through the operation modes ECB, CBC, CFB and OFB, it is possible to achieve a high level of diffusion and confusion.
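The Feistel structure described above can be sketched generically. The round function F below is a toy placeholder, not Papílio's Modified-Viterbi-based F, and the byte-sized halves are our own simplification:

```python
def feistel(block, subkeys, F):
    """Generic Feistel encryption: split the block into two halves, then
    in each round set (L, R) <- (R, L XOR F(R, k))."""
    left, right = block
    for k in subkeys:
        left, right = right, left ^ F(right, k)
    return left, right

def feistel_decrypt(block, subkeys, F):
    """Decryption is the same network with the subkeys in reverse order:
    each round inverts one encryption round without inverting F itself."""
    left, right = block
    for k in reversed(subkeys):
        left, right = right ^ F(left, k), left
    return left, right
```

Note that decryption never needs the inverse of F, which is why a non-invertible round function such as one built on a decoding algorithm can still yield a reversible cipher.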
The BioConvolving method was originally proposed in [11]. The main aim of this function is to divide each original signature sequence into W non-overlapping segments, according to a randomly selected transformation key d. Then, the transformed functions are obtained by performing a linear convolution between the obtained segments. A general description of BioConvolving follows.

1. Randomly select (W − 1) values d_j, each between 1 and 99, in increasing order. The selected values are arranged in a vector d = [d_0, ..., d_W], with d_0 = 0 and d_W = 100. The vector d represents the key of the employed transformation.
2. Convert the values d_j according to the relation b_j = round((d_j / 100) · n), j = 0, ..., W, where n is the number of attributes and round denotes rounding to the nearest integer.
3. Divide the original sequence Γ ∈ R^n into W segments Γ^(q) of length N_q = b_q − b_{q−1}, each lying in the interval [b_{q−1}, b_q].
4. Apply the linear convolution of the segments Γ^(q), q = 1, ..., W, in order to obtain the transformed function

   f = Γ^(1) ∗ ... ∗ Γ^(W)    (1)

As can be seen, due to the convolution operation in (1), the length of the transformed functions is equal to K = N − W + 1, and is therefore almost the same as that of the original functions. A final signal normalization, intended to obtain zero-mean, unit-standard-deviation transformed functions, is then applied. Different realizations can be obtained from the same original functions simply by varying the size or the values of the parameter key d.
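The steps above can be sketched for a single sequence as follows (the random key selection of step 1 is omitted; the key d is passed in directly):

```python
import numpy as np

def bioconvolving(seq, d):
    """Apply the BioConvolving transform to one time sequence.

    `d` is the transformation key [d_0 = 0, ..., d_W = 100]; the sequence
    is split at b_j = round(d_j / 100 * n) and the resulting segments are
    combined by linear convolution, then normalized.
    """
    n = len(seq)
    b = [int(round(dj / 100.0 * n)) for dj in d]
    segments = [np.asarray(seq[b[q - 1]:b[q]], dtype=float)
                for q in range(1, len(d))]
    out = segments[0]
    for seg in segments[1:]:
        out = np.convolve(out, seg)        # linear convolution of segments
    out = (out - out.mean()) / out.std()   # zero mean, unit std deviation
    return out
```

With W segments of total length N, the full linear convolution has length N − W + 1, matching the K given above.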

4 Methods and Materials


In the original dataset, proposed in [6] and called OriginalDataset, data was
collected from 359 individual subjects. A total of 7428 signatures were donated.
The number of signature samples for each subject varied between 2 and 79. This
was dependent on the frequency of the signature donation sessions. However,
we used a reduced version of this dataset, with 100 users and 10 samples per user, making a total of 1000 signature samples. The dataset had 18 attributes describing aspects like execution time, pen lifts, signature width and
height, height to width ratio, average horizontal pen velocity, average vertical
pen velocity, vertical midpoint pen crossings and invariant moments.
For the transformation function, called TransfDataset, we have used only pen
lifts, signature width and height, height to width ratio, average vertical pen ve-
locity and average horizontal pen velocity. These attributes can be represented as
time sequences and, therefore, allow application of the BioConvoling method. In


the transformation procedure, we used W = 2. Once we have applied the trans-
formation function, the resulting functions are time sequences and we divided
these functions into 4 intervals and calculate the average values for each inter-
val, resulting in a total of 24 attributes (6 time sequences * 4 interval averages).
Finally, we have the CryptDataset that was generated from output application
of the Papı́lio cryptosystem in the transformed dataset.
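The interval-averaging step described above can be sketched as follows (a hypothetical helper, not the authors' code; `sequences` is assumed to hold the 6 transformed time sequences of one signature):

```python
def interval_averages(seq, n_intervals=4):
    """Reduce a time sequence to n_intervals averages by splitting it
    into (roughly) equal consecutive intervals."""
    k = len(seq)
    bounds = [round(i * k / n_intervals) for i in range(n_intervals + 1)]
    return [sum(seq[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]

def to_feature_vector(sequences):
    """Concatenate the interval averages of all transformed sequences:
    6 sequences x 4 averages = 24 attributes per signature."""
    feats = []
    for seq in sequences:
        feats.extend(interval_averages(seq))
    return feats
```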
The ensemble structures were defined by three main parameters of an ensemble system: the number of base classifiers, the classifier types and the combination method. We used homogeneous and heterogeneous ensemble structures composed of 3, 6 and 12 base classifiers. The chosen base classifiers were Naive Bayes, Neural Networks, Decision Tree and k-NN [14]. In addition, these individual classifiers are combined using four common combination methods: Sum, Majority Voting, Support Vector Machine (SVM) and k-Nearest Neighbor (k-NN) [10]. For each system size, two different ensemble structures are used: the homogeneous option (which combines classification algorithms of the same type as individual classifiers) and the heterogeneous option (which combines different types of classification algorithms). As there are several possibilities for each structure, we report the average accuracy delivered by all configurations within the corresponding structure. The individual classifiers and combination methods used in this study were exported from the WEKA package (http://www.cs.waikato.ac.nz/ml/weka/).
The ensemble systems investigated were implemented using a standard stacking procedure to define the learning process for the individual classifiers and for the combination methods [10]. In order to obtain a better estimation, 10-fold cross validation is applied to all ensembles (as well as to the individual classifiers). Thus, all accuracy results presented in this paper refer to the mean over 10 different test sets. Some of the combination methods are trainable (k-NN and SVM); in these cases, a validation set is used to train them. For the parameter setting of the combination methods, we opted for the simplest version of each method: k-NN with k = 1 and SVM with a polynomial kernel and c = 1. Finally, a statistical test is applied to compare the accuracy of the classification systems. We use the hypothesis test (t-test), which tests two learned hypotheses on identical test sets. In this investigation, we use the two-tailed t-test with a confidence level of 95% (α = 0.05).
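For illustration, the two non-trainable combiners mentioned above (Sum and Majority Voting) can be sketched in a few lines — a simplified sketch, assuming each base classifier outputs either a crisp label or a per-class support vector:

```python
from collections import Counter

def majority_vote(labels):
    """Majority voting over the crisp labels predicted by the base
    classifiers; ties resolved by the first most-common label."""
    return Counter(labels).most_common(1)[0][0]

def sum_rule(posteriors):
    """Sum rule: add the per-class support vectors of all base
    classifiers and pick the class with the largest total support."""
    n_classes = len(posteriors[0])
    totals = [sum(p[c] for p in posteriors) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)
```

The trainable combiners (k-NN and SVM) instead treat the base-classifier outputs as a new feature vector and learn a meta-classifier over them, as in the stacking procedure described above.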

5 Comparative Results
This section presents the results of the experiments described in the previous section. It is important to emphasize that we address the user identification task: once a biometric sample is provided, the classification systems (classifiers and ensembles) output the class (user) of that sample. From a classification point of view, a verification task is a two-class problem, while an identification task is an N-class problem (where N is the number of users). Identification is therefore a more complex and time-consuming task.
Applying Ensembles on Encrypted and Cancellable Signature Data 185

Table 1. The results of the Ensembles on OriginalDataset

OriginalDataset
Size 3   Ind          Sum          Voting       K-NN         SVM
Het      82.55±5.90   87.80±5.10   86.56±5.96   85.47±4.26   88.56±2.24
Hom      81.41±7.21   83.47±5.32   82.10±6.83   79.13±8.30   83.37±5.77
Size 6   Ind          Sum          Voting       K-NN         SVM
Het      81.66±5.57   88.29±6.26   86.14±6.86   87.26±4.82   89.50±2.55
Hom      80.59±7.70   84.03±5.18   81.30±6.51   79.54±6.94   83.43±5.82
Size 12  Ind          Sum          Voting       K-NN         SVM
Het      81.49±5.99   88.04±6.17   87.46±6.78   87.46±4.74   89.09±2.49
Hom      81.19±8.22   83.90±5.67   82.83±6.51   79.77±7.49   82.37±6.75

As we want to analyze the benefits of well-established classification structures, we have chosen identification. Nevertheless, the classification systems could easily be applied to verification.
The results of the classification systems on the three datasets (OriginalDataset, TransfDataset and CryptDataset) are shown in Tables 1, 2 and 3, respectively. These tables report the accuracy level and standard deviation of the classification systems (individual classifiers, Ind, and ensemble systems combined with Sum, Majority Voting, k-NN and SVM). It is important to emphasize that, as we use different individual classifiers to construct the homogeneous and heterogeneous ensembles, they have different accuracy values.
As already mentioned, we carried out a statistical test comparing the ensemble systems on OriginalDataset, TransfDataset and CryptDataset, on a pairwise basis. Table 1 shows that the results were satisfactory on the original dataset. The use of ensemble systems was positive for the accuracy level of the classification systems, increasing performance when compared with the individual classifiers. In comparing the two ensemble structures, the heterogeneous ensembles obtained better results than the homogeneous ones. This shows the importance of having diversity (heterogeneity) among the base classifiers: heterogeneous structures contain classifiers with different specialties for reaching the problem goal.
When applying these ensembles to TransfDataset (Table 2), as expected, the accuracy level decreased in all cases (and these decreases were statistically significant). This shows that the use of the transformation functions made the decision process more complex and decreased the general accuracy level. However, these results alert us to an important observation: the protection (transformation method) applied to the signature dataset is not really strong. Using this transformation function, the original dataset cannot be recovered; however, it is still possible to use the transformed dataset to obtain a satisfactory classification result. The best ensemble, for instance, was the heterogeneous ensemble using SVM as combination method, with 79.33 ± 4.25%. This accuracy level can be considered an excellent result on an "encrypted" (transformed) dataset. Therefore, in case of theft, the transformed dataset could still be used for classification tasks with reasonable performance.
Table 2. The results of the Ensembles when applied to TransfDataset

TransfDataset
Size 3   Ind          Sum          Voting       K-NN          SVM
Het      74.01±5.18   76.41±9.57   75.23±8.65   74.49±6.81    78.47±4.12
Hom      72.67±5.51   74.33±7.00   72.97±5.89   69.33±10.18   73.43±7.76
Size 6   Ind          Sum          Voting       K-NN          SVM
Het      73.25±5.01   76.66±9.44   74.84±9.09   75.74±6.62    79.33±4.25
Hom      71.92±5.21   74.37±7.04   72.00±5.24   69.07±8.05    73.33±7.85
Size 12  Ind          Sum          Voting       K-NN          SVM
Het      72.35±5.19   76.63±8.82   75.99±9.81   75.88±6.60    79.02±3.53
Hom      72.02±11.33  74.50±12.33  73.30±11.61  68.93±16.25   72.50±13.77

Table 3. The results of the Ensembles when applied to CryptDataset

CryptDataset
Size 3   Ind         Sum         Voting      K-NN        SVM
Het      5.59±2.26   7.61±2.87   4.30±2.00   5.23±1.00   5.89±1.52
Hom      5.41±1.92   5.43±1.98   5.00±1.80   4.53±2.75   5.90±3.46
Size 6   Ind         Sum         Voting      K-NN        SVM
Het      5.42±2.02   8.18±2.80   5.86±2.15   5.87±1.34   6.44±1.50
Hom      5.36±1.82   5.70±2.20   4.90±1.51   4.73±2.26   4.80±2.86
Size 12  Ind         Sum         Voting      K-NN        SVM
Het      5.28±1.89   6.43±2.95   5.97±2.27   5.37±1.37   6.80±1.67
Hom      5.42±1.93   5.40±2.00   5.57±2.11   7.20±2.43   4.97±3.78

One of the main principles of data security is that a template protection method must ensure diffusion and confusion between the ciphertext and the plaintext. The correct classification of the transformed database is useful for classification purposes, but it also reveals the fragility of the transformation method used in this process: the transformed data still preserve strong relationships among themselves. The transformation procedure allows a certain level of classification, but it leaves the data vulnerable to attacks such as brute force or differential cryptanalysis.
Therefore, in this work the Papílio method is applied to the signature dataset to increase its security level. It is important to emphasize that this method could be used with any biometric modality to verify whether an encryption method is strong enough to guarantee confusion and diffusion of the encrypted outputs with respect to the original data. In this paper, the main aim is to show that this method can indeed break the relationships among patterns and that it is actually stronger than the transformation method alone.
As expected, the results in Table 3 show that the encryption algorithm is much stronger than the transformation function when the accuracy levels of these systems are analyzed. The performance shown in this table illustrates a sharp decrease in accuracy. We believe that this decrease occurs because the cryptosystem was able to break the relationship between the values of each attribute of the original dataset. Regarding the results, it is worth noting that the
pattern of behaviour of the ensemble systems is still the same, with the best results obtained by the heterogeneous ensembles. However, there is a difference in the best combination method, since ensembles combined by Sum obtained the best accuracy level on the CryptDataset. Thus, it is possible to reiterate that ensembles can still take advantage of heterogeneous structures, even in very difficult scenarios.

6 Conclusion

Considering the results provided by the use of ensemble systems on all three datasets of this work, it was possible to establish the importance of using ensembles on signature datasets, given the good accuracy rates obtained by these systems. However, as biometric datasets require confidentiality of the stored values, it is necessary to apply some template protection methods. In this paper, this was done by applying two methods: transformation functions and the Papílio cryptosystem. Through this analysis, it was possible to verify that the ensemble systems applied to the transformed database had better results than those obtained on the encrypted dataset. This shows that Papílio really broke the interdependence of the values of each pattern in the dataset, providing a greater level of complexity than the transformation function by itself. The encrypted data therefore cannot be used for classification purposes (only for storage). In addition, the use of a transformation function means that a single break of the cryptographic algorithm does not give access to the original data, but only to the transformed data. In this case, the biometric data becomes more secure while keeping a reasonable level of performance, since the transformed data is used for classification purposes.
This analysis suggests a hypothesis: a cryptosystem can be considered strong when classification performance is drastically reduced, even when more elaborate classification structures such as ensemble systems are used. In other words, the strength of an encryption method is inversely proportional to the efficiency of the classification method applied to its output. The use of other cryptosystems and/or transformation functions, and their application to other modalities, is the subject of on-going research.

References
1. Akgün, M., Kavak, P., Demirci, H.: New Results on the Key Scheduling Algorithm
of RC4. In: Chowdhury, D.R., Rijmen, V., Das, A. (eds.) INDOCRYPT 2008.
LNCS, vol. 5365, pp. 40–52. Springer, Heidelberg (2008)
2. Araujo, F.S., Ramos, K.D., Bedregal, B.R., Silva, I.: Papílio cryptography algorithm. In: International Symposium on Computational and Information Sciences (2004)
3. Bringer, J., Chabanne, H., Kindarji, B.: The best of both worlds: Applying secure
sketches to cancellable biometrics. Science of Computer Programming 74(1-2), 43–
51 (2008)
4. Canuto, A.M., Fairhurst, M.C., Pintro, F., Junior, J.C.X., Neto, A.F., Gonalves,
L.M.G.: Classifier ensembles and optimization techniques to improve the perfor-
mance of cancellable fingerprint. Int. J. of Hybrid Intelligent Systems 8(3), 143–154
(2011)
5. Freire-Santos, M., Fierrez-Aguilar, J., Ortega-Garcia, J.: Cryptographic key gen-
eration using handwritten signature. In: Biometric Technology for Human Identi-
fication III. SPIE, Int. Society for Optical Engineering, United States (2006)
6. Guest, R.: The repeatability of signatures. In: The 9th Int. Workshop on Frontiers
in Handwriting Recognition, IWFHR 2004, pp. 492–497 (2004)
7. Jain, A.K., Nandakumar, K., Nagar, A.: Biometric template security. Eurasip Jour-
nal on Advance in Signal Processing (2008)
8. Jin, A.T.B., Hui, L.M.: Cancelable biometrics. Scholarpedia (2010)
9. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption
Standard (2002)
10. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley
(2004)
11. Maiorana, E., Martinez-Diaz, M., Campisi, P., Ortega-Garcia, J., Neri, A.: Tem-
plate protection for hmm-based on-line signature authentication. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition Workshops, CVPRW, pp. 1–6
(2008)
12. Rivest, R.L., Shamir, A., Adleman, L.: A method for obtaining digital signatures
and public-key cryptosystems. Commun. ACM 21(2), 120–126 (1978)
13. Vielhauer, C., Steinmetz, R., Mayerhofer, A.: Biometric hash based on statistical
features of online signatures. In: Proceedings of 16th International Conference on
Pattern Recognition, vol. 1, pp. 123–126 (2002)
14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Elsevier (2005)
New Dynamic Classifiers Selection Approach
for Handwritten Recognition

Nabiha Azizi1, Nadir Farah1, and Abdel Ennaji2


1 Labged Laboratory: Laboratoire de Gestion Électronique de Documents, Département d'Informatique, Université Badji Mokhtar, BP n°12, Annaba, 23000, Algeria
2 LITIS Laboratory: Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes (LITIS), Rouen University, France
{azizi,farah}@labged.net, abdel.ennaji@univ-rouen.fr

Abstract. In this paper a new approach based on dynamic selection of ensembles of classifiers is discussed to improve a handwritten recognition system. For pattern classification, dynamic ensemble learning methods explore the use of different classifiers for different samples and may therefore achieve better generalization ability than static ensemble learning methods. Our proposed DECS-LR algorithm (Dynamic Ensemble of Classifiers Selection by Local Reliability) enriches the selection criterion by incorporating a new Local-Reliability measure and dynamically chooses the most confident ensemble of classifiers to label each test sample. The confidence level is estimated by the proposed reliability measure, using confusion matrices constructed during the training phase. After validation with voting and weighted voting fusion methods, ten different classifiers and three benchmarks, we show experimentally that choosing the ensemble of classifiers dynamically, taking into account the proposed L-Reliability measure, increases the recognition rate of a handwritten recognition system on the three benchmarks.

Keywords: Multiple classifier system, Dynamic classifier selection, Local accuracy estimation, Classifiers fusion, Handwritten recognition.

1 Introduction
Almost any real-world pattern recognition problem can be tackled by a range of approaches and procedures. After more than 20 years of continuous and intensive effort devoted to solving the challenges of handwriting recognition, progress in recent years has been very promising [1].
Classical approaches to pattern recognition require the selection of an appropriate
set of features for representing input samples and the use of a powerful single
classifier. In recent years, in order to improve the recognition accuracy in complex
application domains, there has been a growing research activity in the study of
efficient methods for combining the results of many different classifiers [2], [3].
The application of an ensemble creation method, such as bagging [4], boosting
and random subspace, generates a set of classifiers C, where C = {C1, C2, . . . , Cn}.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 189–196, 2012.
© Springer-Verlag Berlin Heidelberg 2012
190 N. Azizi, N. Farah, and A. Ennaji

Given such a pool of classifiers, classifier selection focuses on finding the most relevant subset of classifiers E, rather than combining all L available classifiers, where |E| ≤ L. Indeed, the selection of classifiers relies on the idea that either each member classifier is an expert in some local region of the feature space or the component classifiers are redundant.
Ensembles of classifiers (EoCs) exploit the idea that different classifiers can offer complementary information about the patterns to be classified. It is desirable to take advantage of the strengths of individual classifiers and to avoid their weaknesses, thus improving classification accuracy. Both theoretical and empirical research has demonstrated that a good ensemble can not only improve generalization ability significantly but also strengthen the robustness of the classification system [2], [3]. EoCs have become a hotspot in machine learning and pattern recognition and have been successfully applied in various fields, including handwriting recognition [5], [6], speaker identification and face recognition.
In our previous works, we dealt with the recognition of handwritten Arabic words in Algerian and Tunisian town names using single classifiers [6]. We later focused on multiple classifier approaches and tried several combination schemes for the same application [8], [9]. While studying the role of diversity in improving multiple classifier systems (MCS), and in spite of the weak correlation between diversity and performance, we argued that diversity can be useful for building ensembles of classifiers. We demonstrated through experimentation that using diversity jointly with performance to guide selection can avoid overfitting during the search. We therefore proposed three new approaches based on static classifier selection, using diversity measures and individual classifier accuracies to choose the best set of classifiers [9]-[11].
The static classifier selection strategy, also called the “overproduce and select” strategy, suffers from a main problem: a fixed subset of classifiers defined using a training/optimization data set may not be well adapted for the classification of the whole test set [12]. This problem is similar to searching for a universal best individual classifier: due to differences among samples, there is no individual classifier perfectly adapted to every test set. In dynamic classifier selection, on the other hand, the competence of each classifier in the ensemble is calculated during the classification phase and the most competent classifier is then selected [7], [12], [13]. The competence of a classifier is usually defined in terms of its estimated local accuracy [7]. Recently, dynamic ensemble of classifiers selection (DES) methods have been developed, in which a subset of classifiers is first dynamically selected from the ensemble and the selected classifiers are then combined by majority voting. However, the computational requirements of the DES methods developed so far are still high [14].
In this paper, we propose a new dynamic ensemble of classifiers selection approach based on local reliability estimation. The proposed algorithm extracts the best EoC for each test sample using a new measure computed for each class of every classifier. This measure, named the Local-Reliability measure, is calculated from information extracted from the confusion matrices constructed during the training phase. Once an ensemble of classifiers (EoC) has been selected based on our algorithm and the L-Reliability measure, two fusion methods, voting and weighted voting, are applied to generate the final class label with the appropriate confidence.
New Dynamic Classifiers Selection Approach for Handwritten Recognition 191

The remainder of this paper is organized as follows: the next section describes the DCS paradigm and the main idea of our proposed Dynamic Ensemble Classifier Selection methodology based on Local Reliability (DECS-LR), together with the proposed algorithm. The main results are presented in Section 3.

2 Proposed Approach Based on Dynamic Classifier Selection


Dynamic classifier selection methods are divided into three levels, as illustrated in Fig. 1. First, the classifier generation level uses a training data set to obtain the pool of classifiers; secondly, region-of-competence generation uses an independent evaluation data set (EV) to produce regions of competence (Rj); and thirdly, dynamic selection chooses a winning partition or the winning classifier (Ci*), over the samples contained in Rj, to assign the label M to the sample I from the test data set. Several methods reported in the literature as DCS methods pre-estimate the regions of competence during the training phase [12] and perform only the third level during the test phase. For each unknown test pattern, the problem addressed is the selection of the ensemble of classifiers out of L that is most likely to classify it correctly.

[Figure 1 depicts the three DCS components: classifier generation from the training set, producing the pool of classifiers; competence generation from the evaluation set, via measures, clustering, or various training data sets; and dynamic classifier selection applied to the test set.]

Fig. 1. Dynamic Classifier Selection Components

The main difference between the various DCS methods is the strategy employed to generate the regions of competence and the proposed selection algorithm. Among the different DCS schemes, the most representative one is Dynamic Classifier Selection by Local Accuracy (DCS-LA) [7].
DCS-LA explores a local neighbourhood of each test instance to evaluate the base classifiers, where the neighbourhood is defined as the k Nearest Neighbours (kNN) of the test instance in the evaluation set EV. The intuitive assumption behind DCS-LA is quite straightforward: given a test instance I, we find its neighbourhood δI in EV (using the Euclidean distance), and the base classifier that has the highest accuracy in classifying the instances in δI should also have the highest confidence in classifying I. Let Cj (j = 1,…,L) be a classifier and I an unknown test instance. We first label I with all individual classifiers (Cj; j = 1,…,L) and obtain L class labels C1(I),…,CL(I). If the individual classifiers disagree, the
local accuracy is estimated for each classifier. Given EV, the local accuracy of classifier Cj over the neighbourhood δI, LocCj(δI), is determined by the number of local evaluation instances that classifier Cj classifies correctly, over the total number of instances considered.
The final decision for I is the base classifier providing the maximum local accuracy. This best classifier C* for classifying sample I can be selected by [16],[34]:

C* = argmax_Cj ( Σ_{xj∈δI} Wj · [Cj(xj) = yj] ) / ( Σ_{xj∈δI} Wj )    (1)

where Wj = 1/dj is the weight, and dj is the Euclidean distance between the test pattern I and its neighbor sample xj.
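A rough sketch of this distance-weighted DCS-LA selection (an illustration under the definitions above, not the authors' implementation; classifiers are modelled as callables and the evaluation set as (sample, label) pairs):

```python
def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def dcs_la(classifiers, eval_set, x, k=5):
    """Pick the classifier with the highest distance-weighted local
    accuracy over the k nearest evaluation samples of x."""
    # k nearest neighbours of x in the evaluation set (delta_I).
    neigh = sorted(eval_set, key=lambda s: euclidean(s[0], x))[:k]

    def local_accuracy(clf):
        # Weight each neighbour by W = 1/d (inverse distance).
        num = den = 0.0
        for xi, yi in neigh:
            w = 1.0 / (euclidean(xi, x) or 1e-12)
            num += w * (clf(xi) == yi)
            den += w
        return num / den

    return max(classifiers, key=local_accuracy)
```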
The advantage of using local accuracy is that, instead of using the entire evaluation set, DCS-LA uses a local neighbourhood of the given test instance to explore the reliability of the base classifiers; it is an efficient mechanism for selecting the "best" classifier.
We observe that the selection criterion used in local accuracy estimation takes into account only the local accuracy of each classifier, without considering the behaviour of the output classes of each classifier. This behaviour criterion may add new information about the selected set in the evaluation region, and may therefore improve the classification rate.
To achieve this objective, and also to dynamically choose the best set of classifiers rather than a single winning classifier, we propose a new algorithm based on the definition of the DCS-LA method which makes it possible to select, for each test pattern, the ensemble of classifiers that has the best chance of classifying that pattern correctly. The proposed criterion uses a new measure named the Local-Reliability measure, which is calculated over the k nearest neighbours of the input pattern I (neighbourhood(I)), defined with respect to the evaluation set.
To calculate the L-Reliability measure, we construct a confusion matrix for each classifier during the training phase. The confusion matrix is a square matrix with N rows (the calculated label class) and N columns (the predicted label class). Each cell (d, f) holds the number of training samples classified into label class d when the predicted label class is f.
We can also define the local classifier accuracy Ac(Ci) by Equation (2):

Ac(Ci) = ( Σ_{j=1}^{N} a_jj ) / N    (2)

where a_jj is the number of correct predictions for each class j (j = 1,…,N).
After the training phase, the proposed local reliability of each class j of each classifier Ci is defined by the following equation:

a jj * Ac ( C i )
L − reliabilit y ( C )= d=N
i, j
a
(3)
d,j
d =1 etd ≠ j
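Equations (2) and (3) can be computed directly from a classifier's confusion matrix; a minimal sketch (assuming rows index the calculated class and columns the predicted class, as defined above; the guard for an empty denominator is our addition):

```python
def accuracy(conf):
    """Eq. (2): mean of the diagonal entries a_jj of the N x N
    confusion matrix (correct-prediction counts per class)."""
    n = len(conf)
    return sum(conf[j][j] for j in range(n)) / n

def l_reliability(conf, j):
    """Eq. (3): a_jj * Ac(Ci) divided by the off-diagonal counts of
    column j (samples predicted as class j but belonging elsewhere)."""
    n = len(conf)
    off = sum(conf[d][j] for d in range(n) if d != j)
    if off == 0:
        # Class j is never confused with others: maximal reliability.
        return float('inf')
    return conf[j][j] * accuracy(conf) / off
```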

In general, the framework consists of three parts: (1) use N heterogeneous classifiers and train these individual classifiers on the same training database; (2) employ the new DCS-LA-based algorithm with the Local-Reliability measure to dynamically select the best set of classifiers associated with the k neighbourhoods; (3) combine the outputs of the candidate classifiers with the voting and weighted voting fusion methods to give the final prediction.
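The weighted voting fusion of part (3) can be sketched as below — a generic illustration; the choice of weights (e.g. classifier accuracy or L-Reliability) is not fixed here:

```python
def weighted_vote(predictions, weights):
    """Weighted voting: each classifier's predicted label contributes
    its weight; the label with the largest total weight wins."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)
```

With all weights equal, this reduces to plain majority voting.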
The proposed DECS-LR algorithm can be summarized as follows:

Algorithm 1: MODIFIED DECS-LR USING CONFUSION MATRIX

1: Design a pool of classifiers C.
2: Perform the competence level using the confusion matrices.
3: For each test pattern I do
4:   If all classifiers Ci agree on the same class j for pattern I
     Then assign class j to I
     Else
5:     Find the k nearest neighbours of I in the evaluation set
       using the Euclidean distance.
6:     Calculate the accuracy of each neighbourhood sample m with
       all base classifiers Ci: AC(Ci)(m).
7:     Combine the accuracy results AC(Ci)(m) with the local
       reliability associated with the actual label (Eq. 3) for
       each neighbourhood sample m to obtain the combined accuracy.
8:     Delete from the candidate list all pairs (neighbourhood
       sample, classifier Ci) whose combined accuracy value < ε.
9:     If there are more than 2 classifiers in the candidate list
       Then
10:      For each classifier, retain the one neighbourhood sample
         having the maximal Local-Reliability value (Eq. 3).
11:      If the number of remaining classifiers >= 2 Then
12:        Combine these classifiers with the fusion methods.
13:        If a majority voting or weighted voting class label
           among all candidate classifiers is identified
14:        Then assign that class to I
15:        Else randomly select a candidate classifier to label I.
16:      Else the remaining classifier is responsible for
         classifying pattern I.
17:    Else if there is one classifier left, this one is the winner;
18:    Else pattern I is rejected.
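The candidate-filtering core of Algorithm 1 — combining local accuracy with the L-Reliability of the neighbour's label and thresholding at ε — might look roughly like this (an illustrative sketch, not the authors' implementation; the product is our assumed way of "combining" the two scores, and `local_acc`/`l_rel` are assumed callables):

```python
def select_candidates(classifiers, neighbours, local_acc, l_rel, eps):
    """Score every (neighbour, classifier) pair by combining local
    accuracy with the L-Reliability of the neighbour's label, drop
    pairs below the threshold eps, then keep each classifier's
    best-scoring neighbourhood."""
    candidates = []
    for clf in classifiers:
        for m, label in neighbours:            # (sample, actual label)
            score = local_acc(clf, m) * l_rel(clf, label)
            if score >= eps:
                candidates.append((clf, m, score))
    best = {}
    for clf, m, score in candidates:
        if clf not in best or score > best[clf][1]:
            best[clf] = (m, score)
    return best  # surviving classifiers form the ensemble to fuse
```

The classifiers surviving in `best` would then be fused by majority or weighted voting, as in the later steps of the algorithm.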

3 Experimental Results

3.1 Ensemble Creation

The pool of classifiers used to validate the proposed approach is the same ensemble of classifiers published in our previous work based on static classifier selection, so that both sets of results can be compared. We have used different classification algorithms:
- 02 SVMs (Support Vector Machines) with the "one against all" strategy, implemented with the libSVM library, version 2.7. The inputs of these SVMs are the structural features; we used polynomial and Gaussian kernel functions.
- 03 KNNs (k-Nearest Neighbours, with K = 2, 3 and 5).
- 03 NNs (Neural Networks with different numbers of hidden-layer neurons).
- 02 HMMs (Hidden Markov Models: discrete and continuous, with a modified Viterbi algorithm).
The individual classifier performances on the IFN-ENIT, AL-LRI and MNIST databases are summarized in Table 1.

3.2 Ensemble Selection: Dynamic Ensemble Classifier Selection Steps

During the training phase, the confusion matrix of each classifier Ci (i = 1,…,10) and the local reliability of each output class aj (j = 1,…,48) for all classifiers are calculated.
Before executing the DECS-LR algorithm, we need to select two parameters. The first is the value of k, which represents the number of neighbours chosen for the local decision set. The second is the threshold ε. A series of experiments was carried out to determine the best value of k for the dynamic selection level proposed in our approach, and to show whether or not DECS is better than the SECS (Static Ensemble Classifier Selection) of our previous work on Arabic handwritten recognition. For ensemble combination, we tested two fusion methods: majority voting and weighted voting.
Table 2 shows the performance of the various implemented MCSs based on the proposed DECS-LR algorithm, in comparison with classical DCS-LA. For validation we tested our approach on three databases: IFN-ENIT [15], the Algerian database [11] and the MNIST digit database [16].
We can conclude that with k = 4 our general methodology offers the best accuracy for both fusion methods and all used databases, with 94.22% as the best accuracy (weighted voting on the MNIST digit database, Table 2). The performance obtained by our novel DECS-LR-based algorithm is better than that of our previous work and better than classical DCS based on local accuracy estimation.
New Dynamic Classifiers Selection Approach for Handwritten Recognition 195

Table 1. Individual classifier accuracy

Member     IFN-ENIT   AL-LRI     MNIST-Digit
Classifier Database   Database   Database

Svm(1)     86.03      85.88      90.12
Svm(2)     86.69      86.12      90.42
Knn(1)     81.78      81.45      85.36
Knn(2)     81.42      82.41      87.64
Knn(3)     83.96      84.02      87.95
Nn(1)      86.12      85.69      88.45
Nn(2)      86.46      86.08      88.89
Nn(3)      86.05      85.23      89.45
Hmm(1)     87.78      87.23      91.11
Hmm(2)     88.23      88.15      91.45

Table 2. Classification accuracies on the test set provided by our DECS-LR Algorithm using voting methods

                 Ifn-Enit database    Algerian town name    MNist digit database
                 M.voting  W.voting   M.voting  W.voting    M.voting  W.voting
decs-lr k=1      88.78     89.02      87.15     87.72       91.98     92.45
decs-lr k=2      89.28     89.79      87.89     88.42       92.54     92.87
decs-lr k=3      89.42     90.36      88.75     88.98       93.21     93.48
decs-lr k=4      90.89     91.25      90.87     89.97       93.49     94.22
decs-lr k=5      89.56     90.14      88.68     88.32       93.21     93.87
classical dcs-la 89.94     90.14      88.12     87.02       92.02     92.86

This experimentation shows that introducing the behaviour of the output classes of the classifiers into the dynamic ensemble process can improve global performance. In our case, we modeled the classifier behaviour by combining the local accuracy estimation of each classifier with the reliability of each output class. Another new point in our work is the selection of the best ensemble of classifiers instead of a single classifier; the fusion of this ensemble further increases the final results.

4 Conclusion

In this paper, a new DES strategy based on local accuracy estimation and a proposed Local-Reliability measure is presented to improve the performance of handwritten lexicon classification. This strategy, using the DECS-LR algorithm, exploits local accuracy
estimation to propose a new measure named the Local-Reliability measure. The L-Reliability measure employs confusion matrices which are filled from the classifier outputs (i.e., the output profile) during the training phase. The experiments performed on the three databases indicate that DECS-LR can achieve a higher level of accuracy. Future work consists of investigating the adaptive capabilities of the proposed strategy for large or dynamic lexicons. We also want to generalize this approach to other classification problems.

References
1. Govindaraju, V., Krishnamurthy, R.K.: Holistic handwritten word recognition using
temporal features derived from off-line images. Pattern Recognition Letters 17(5), 537–
540 (1996)
2. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Trans.
Pattern Anal. Mach. Intell. 20(3), 226–238 (1998)
3. Kuncheva, L.I., Whitaker, C.S.C., Duin, R.P.W.: Is independence good for combining
classifiers? In: Proceedings of the 15th International Conference on Pattern Recognition,
Barcelona, Spain, pp. 168–171 (2000)
4. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
5. Azizi, N., Farah, N., Khadir, M., Sellami, M.: Arabic Handwritten Word Recognition
Using Classifiers Selection and Features Extraction / Selection. In: 17th IEEE Conference
on Intelligent Information Systems, IIS 2009, Poland, pp. 735–742 (2009)
6. Azizi, N., Farah, N., Sellami, M., Ennaji, A.: Using Diversity in Classifier Set Selection
for Arabic Handwritten Recognition. In: El Gayar, N., Kittler, J., Roli, F. (eds.) MCS
2010. LNCS, vol. 5997, pp. 235–244. Springer, Heidelberg (2010)
7. Woods, K., Kegelmeyer, W.P., Bowyer, K.: Combination of multiple classifiers using
local accuracy estimates. IEEE Trans. Pattern Anal. Mach. Intell. 19(4), 405–410 (1997)
8. Azizi, N., Farah, N., Sellami, M.: Off-line handwritten word recognition using ensemble of
classifier selection and features fusion. Journal of Theoretical and Applied Information
Technology, JATIT 14(2), 141–150 (2010)
9. Azizi, N., Farah, N., Sellami, M.: Ensemble classifier construction for Arabic handwritten
recognition. In: The 7th IEEE International Workshop on Signal Processing and Systems,
WOSSPA 2011, Tipaza, Algeria, May 8-10 (2011)
10. Azizi, N., Farah, N., Sellami, M.: Progressive Algorithm for Classifier Ensemble
Construction Based on Diversity in Overproduce and Select Paradigm: Application to the
Arabic handwritten Recognition. In: The 2nd ICICS 2011, Jordan, May 22-24, pp. 27–33
(2011)
11. Farah, N., Souici, L., Sellami, M.: Classifiers combination and syntax analysis for Arabic
literal amount recognition. Engineering Applications of Artificial Intelligence 19(1) (2006)
12. Dos Santos, E.M., Sabourin, R., Maupin, P.: A dynamic overproduce-and-choose strategy
for the selection of classifier ensembles. Pattern Recognition 41, 2993–3009 (2008)
13. Singh, S., Singh, M.: A dynamic classifier selection and combination approach to image
region labelling. Signal Processing: Image Communication 20, 219–231 (2005)
14. Woloszynski, T., Kurzynski, M.: A Measure of Competence Based on Randomized
Reference Classifier for Dynamic Ensemble Selection. In: ICPR 2010, Turkey, August 23-
26, pp. 4194–4198 (2010)
15. Pechwitz, M., Maergner, V.: Baseline estimation for arabic handwritten words. In:
Frontiers in Handwriting Recognition, pp. 479–484 (2002)
16. http://yann.lecun.com/exdb/mnist/
Vector Perceptron Learning Algorithm
Using Linear Programming

Vladimir Kryzhanovskiy1, Irina Zhelavskaya1, and Anatoliy Fonarev2


1 Scientific Research Institute for System Analysis, Russian Academy of Sciences,
Vavilova st., 44/2, 119333 Moscow, Russia
2 CUNY City University of New York, Department of Engineering and Science,
2800 Victory Blvd. SI, NY 10314, USA
Vladimir.Krizhanovsky@gmail.com, winjei@ya.ru

Abstract. Application of Linear Programming to binary perceptron
learning allows reaching the theoretical maximum loading of the
perceptron that had been predicted by E. Gardner. In the present paper the
idea of learning using Linear Programming is extended to vector multistate
neural networks. Computer modeling shows that the probability of false
identification for the proposed learning rule decreases by up to 50 times
compared to the Hebb rule.

Keywords: vector neural networks, simplex method, linear programming.

1 Introduction

Vector models of neural networks (VNN) have been investigated in many papers
[1-7]. Among them the most well-known is the Potts spin-glass model [1]. Its
properties have been investigated rather thoroughly by means of statistical
physics methods [2-4]. The memory characteristics of the Potts model were
analyzed mainly with the aid of computer simulations. In [5-6] the so-called
parametrical neural networks (PNN), aimed at realization in the form of optical
devices, were investigated. In the latter case, rather simple analytical expressions
describing PNN efficiency, storage capacity and noise immunity were obtained.
Similar ideas were applied to the Correlation Memory Matrix [8]. The authors of
these papers succeeded in practical applications of their models.
The aim of neural network learning is to calculate optimal values of the synaptic
coefficients. They are usually calculated by the Hebb learning rule. On the one
hand, it is very fast and easy to use, but on the other, as is evident from E.
Gardner's research, this rule cannot bring out the full potential of a neural
network.
E. Gardner et al. proved that the maximum achievable binary perceptron
loading is 2 [13]. Linear programming (LP) allows one to approach this critical
loading. However, the main disadvantage of the LP approach is its high
computational complexity (exponential in the worst case).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 197–204, 2012.

© Springer-Verlag Berlin Heidelberg 2012
198 V. Kryzhanovskiy, I. Zhelavskaya, and A. Fonarev

The idea of using LP for perceptron learning was first put forward by Krauth
and Mezard in 1987 [9]. We have also suggested a learning rule based on LP,
described in the paper [10]. The main and significant difference between these
rules lies in the number of objective variables and in the variables themselves. In
the Krauth and Mezard learning rule, N values of synaptic weights are optimized.
In the suggested rule we optimize M coefficients (having an indirect effect on the
interconnections), where M is the number of stored patterns. It is obvious that in
the regime of greater practical interest (M < N) our algorithm is faster.
In the present paper the idea of applying LP to vector neural network learning
is considered for the first time. The number of synaptic coefficients increases
significantly for this type of network, so it becomes impossible to use the Krauth
and Mezard algorithm for solving problems of high dimensionality (comparable to
practical tasks). But even at low dimensions, when this algorithm is applicable,
the suggested rule is faster by up to 10^4 times. The LP approach is, of course,
considerably slower than the Hebb rule, but this disadvantage is balanced out by
a decrease of up to 50 times in the probability of incorrect recognition.

2 Problem Statement
Consider the following model problem. Suppose we have photos of some objects,
the reference patterns. They are grayscale images made under favorable
conditions (with Q gradations of gray). The system receives photos of these
objects as inputs (the photos are taken from the same angle etc., therefore the
problems of scaling and others are not covered here). These input photos differ
from the reference ones in distortions imposed as:

    x̃i = xi + δ,    (1)

where xi ∈ {1, ..., Q} is the color of the i-th pixel and δ is a normally distributed
random variable with parameters N(0, σout). By σout we denote the parameter of
the external environment distortions. For simplicity, δ is assumed to be
identically distributed for all pixels. The parameter σout of this distribution is
not known in advance, but it can be estimated from practical considerations.
Knowledge of the distortion distribution and its parameters allows us to tune a
neural network to this kind of distortion in a proper way.
The problem consists in constructing a neural network that would allow defining
exactly which photo (of which object) was presented.
Thus the primary objective of this paper is to study the performance of the
suggested algorithm on this model problem.
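The distortion model of Eq. (1) can be sketched as follows; this is an illustrative sketch, and the rounding and clipping of out-of-range colors back into {1..Q} are assumptions of the sketch, not part of the paper:

```python
import random

def distort(pattern, sigma_out, Q, rng=None):
    """Distortion model (1): each pixel colour x_i in {1..Q} is shifted by
    delta ~ N(0, sigma_out).  Rounding and clipping the result back into
    {1..Q} is an assumption of this sketch; the paper leaves out-of-range
    values unspecified."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    noisy = []
    for x in pattern:
        x_tilde = x + rng.gauss(0.0, sigma_out)
        noisy.append(min(Q, max(1, round(x_tilde))))
    return noisy

reference = [1, 5, 9, 16, 16]                      # hypothetical pixel colours
observed = distort(reference, sigma_out=0.7, Q=16)
```

Because each pixel is perturbed independently with the same σout, a single scalar fully characterizes the external environment, which is what makes the estimation of σout by an internal parameter σin (Section 3.2) feasible.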

3 Model Description
3.1 Vector Perceptron
Let us consider a vector perceptron (VP) consisting of two layers of vector
neurons; each neuron of the input layer (N neurons) is connected with all output
layer neurons (n neurons). The input and output layer neurons have Q and q
discrete states, respectively (in the general case, Q ≠ q). The states of the input
layer neurons are described by the basis vectors {ek}^Q in Q-dimensional space,
and the states of the output layer neurons by the basis vectors {vl}^q in
q-dimensional space. The vectors ek and vl are zero vectors whose k-th and l-th
components, respectively, are equal to one.
Let each reference vector Xm = (xm1, xm2, ..., xmN) be put in a one-to-one
correspondence with the response vector Ym = (ym1, ym2, ..., ymn), where
xmj ∈ {ek}^Q, ymi ∈ {vl}^q, and m = 1, M. Then the synaptic connection
between the i-th and j-th neurons is assigned a q × Q matrix, according to the
generalized Hebb rule:

    Wij = Σ_{m=1}^{M} rm ymi xmj^T J,   i = 1, n,  j = 1, N,    (2)

where rm ∈ (0, 1] is a weight coefficient, which we are going to optimize by
means of Linear Programming (putting rm = 1, we get the classic Hebb rule).
The coefficient rm defines the size of the basin of attraction of the m-th pattern
in fully connected neural networks [11]. Research on the quasi-Hebb rule has
shown that by varying these coefficients we can achieve minimal interaction
between attractors. For binary perceptrons we proved that by finding optimal
values of the depths and widths of the attractors (by means of the simplex
method) we can achieve the maximal theoretical perceptron storage capacity. J is
a matrix of proximity measures to be discussed in detail in subsection 3.2.
When an unknown input vector X = (x1, x2, ..., xN) with xj ∈ {ek}^Q is
presented, the local field on the i-th output neuron is calculated as follows:

    Hi = Σ_{j=1}^{N} Wij xj    (3)

Then the i-th neuron, similar to a spin in a magnetic field, assumes under the
influence of the local field Hi the position closest to the direction of this field
(the neuron's state is discrete, which is why it cannot be oriented exactly along
the vector Hi). In other words, if the projection of the local field Hi on a basis
vector vs is maximal, the neuron will be oriented along this basis vector. Let it
be, for instance, the projection on the basis vector v3. Then the i-th output
neuron will switch into state 3, described by the basis vector v3:

    yi = v3    (4)

This procedure is carried out concurrently for all output neurons (i = 1, n).
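The training rule (2) and the recall dynamics (3)-(4) can be sketched in Python with NumPy as follows. This is an illustrative sketch, not the authors' MATLAB code from [12]; the dense (n, N, q, Q) weight tensor and the 0-based state indices are implementation choices of the sketch:

```python
import numpy as np

def hebb_weights(X, Y, r, J, Q, q):
    """Generalized Hebb rule (2): W_ij = sum_m r_m * y_mi x_mj^T * J,
    stored as an (n, N, q, Q) tensor.  X (M x N) and Y (M x n) hold the
    0-based state indices of the one-hot vectors e_k and v_l."""
    M, N = X.shape
    _, n = Y.shape
    W = np.zeros((n, N, q, Q))
    for m in range(M):
        for i in range(n):
            for j in range(N):
                # y_mi x_mj^T is a one-hot outer product, so right-multiplying
                # by J copies row X[m, j] of J into row Y[m, i] of W_ij.
                W[i, j, Y[m, i], :] += r[m] * J[X[m, j], :]
    return W

def recall(W, x):
    """Local field (3) and winner-take-all update (4): each output neuron
    switches to the basis vector v_s with the maximal field projection."""
    n, N, q, _ = W.shape
    H = np.zeros((n, q))
    for i in range(n):
        for j in range(N):
            H[i] += W[i, j, :, x[j]]  # W_ij applied to the one-hot x_j
    return H.argmax(axis=1)
```

With J equal to the identity matrix this reduces to the classic Potts perceptron: storing a single pattern and recalling it undistorted returns exactly the stored response.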

3.2 Measure of Proximity


Consider a specific example. Suppose there is a keyboard operator who must
type certain previously known messages that are five characters in length, for
example, the word "ICANN". It is obvious that the operator can make mistakes,
so these messages can be typed with errors. For example, the most probable
erroneous inputs in place of the letter "I" are the letters "U", "O" and "K"
(these letters are neighbours of "I" on a keyboard). For constructing a neural
network for input word recognition it would be reasonable to take the
information about these most probable errors (nearby keys) into account. These
letters are clearly not neighbours in the alphabet, but in the present case they
are nearer to "I" than the "habitual" "G", "H" or "J". This information is
represented in the above-mentioned matrix of proximity measures J. Let us
describe it more formally.
J is a symmetric matrix of proximity measures between the states of the input
layer neurons; the elements Jkl = Jlk are the proximity measures between the
states k and l, k, l = 1, Q. No proximity measure between the states of the
output layer neurons is introduced. If J is the identity matrix E (i.e., J = E),
expression (2) describes the weights of the classic Potts perceptron [1, 3, 4].
Therefore, to endow an already trained Potts perceptron with a proximity
measure, it is sufficient to modify the interconnection weights by multiplying
them by the matrix J on the right.
The proximity measures matrix J may be defined either by the problem's
specifications or from an analysis of the data and the nature of the noise. To
embed information about the noise distribution into the VNN, it is suggested to
specify the proximity measures between the states of neurons as the probabilities
of switching from one state to another under the influence of distortions:

    Jkl = Pkl,   k, l = 1, Q,    (5)

where Pkl is the probability of switching from the state k to the state l under the
influence of distortions. For the model problem at hand, the matrix P is
characterized by a single parameter σout, called the external environment
parameter:

    Pkl = (1 / (√(2π) σout)) exp(−(k − l)² / (2σout²)),    (6)
The parameter σout is not known precisely; therefore we use an estimate of this
parameter, σin. The parameter σin is an internal adjustable parameter of the
model, to be chosen such that the recognition error is minimal. From general
considerations it can be expected that σin = σout; however, as computer modeling
shows, this relation holds only up to a multiplicative factor: σin = c · σout, where
1 < c < 2.
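Building J for the model problem then reduces to tabulating the Gaussian weights of Eq. (6). A minimal sketch, using Eq. (6) directly as the proximity weight without further normalization:

```python
import math

def proximity_matrix(Q, sigma_in):
    """Proximity measures (5)-(6): J_kl = P_kl, the Gaussian weight of a
    distortion switching colour k to colour l, evaluated with the internal
    estimate sigma_in in place of the unknown sigma_out."""
    norm = math.sqrt(2.0 * math.pi) * sigma_in
    return [[math.exp(-((k - l) ** 2) / (2.0 * sigma_in ** 2)) / norm
             for l in range(1, Q + 1)]
            for k in range(1, Q + 1)]

J = proximity_matrix(Q=16, sigma_in=1.3)
```

The result is symmetric, is largest on the diagonal (no switch), and decays for states farther apart, exactly the structure the keyboard example calls for.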

4 LP Learning Rule
In accordance with the algorithm described above, the conditions for correct
recognition of all reference patterns may be presented as the system of M(q − 1)
inequalities:

    hi(m)·ymi − hi(m)·vl > Δ,   ymi ≠ vl,  m = 1, M,  l = 1, q,
    0 < rm < 1,                                                     (7)
    Δ > 0,

where hi(m) is the local field on the i-th output neuron when the undistorted
m-th reference pattern is presented, and ymi is the expected response value.
The parameter Δ is introduced for better recognition stability. In the language
of the fully connected Hopfield model, Δ is responsible for the depth and size of
the basins of attraction of the local minima being formed. The larger Δ is after
training, the higher the probability of correct recognition of noisy patterns.
Therefore it is necessary to find weight coefficients rm such that system (7)
holds for all reference patterns at the largest possible value of Δ. In this case,
the depth of the local minima formed is the maximum possible.
Thus we have a linear programming problem with the set of constraints (7)
and with the following objective function:

    f(r1, r2, ..., rM) = Δ → max    (8)

It is required to find the (M + 1) variables that are the solution to this linear
programming problem.
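Because hi(m) is linear in the coefficients rm, the system (7)-(8) is a standard LP. The following sketch builds the constraint matrix in the form expected by off-the-shelf LP solvers; it is an illustration of the construction, not the authors' code, and the variable layout z = (r1, ..., rM, Δ) is a choice of this sketch:

```python
import numpy as np

def build_lp(X, Y, J, q):
    """Express constraints (7) as a standard-form LP over z = (r_1..r_M, Delta).

    The local field of output neuron i for undistorted pattern m, projected
    on basis vector v_l, is linear in the weights r:
        H_l(i, m) = sum_m2 r_m2 * o[m2, m] * [Y[m2, i] == l],
    where o[m2, m] = sum_j J[X[m2, j], X[m, j]] is the J-weighted overlap
    of patterns m2 and m.  Maximizing Delta subject to
    H_{y_mi}(i, m) - H_l(i, m) >= Delta then becomes:
        minimize c @ z  subject to  A_ub @ z <= 0,  0 < r_m <= 1."""
    M, N = X.shape
    _, n = Y.shape
    o = np.array([[sum(J[X[m2, j], X[m, j]] for j in range(N))
                   for m in range(M)] for m2 in range(M)])
    rows = []
    for m in range(M):
        for i in range(n):
            for l in range(q):
                if l == Y[m, i]:
                    continue  # only competing states l != y_mi constrain Delta
                coef = np.zeros(M + 1)
                for m2 in range(M):
                    coef[m2] = -o[m2, m] * (int(Y[m2, i] == Y[m, i])
                                            - int(Y[m2, i] == l))
                coef[M] = 1.0  # +Delta moved to the left of "<= 0"
                rows.append(coef)
    c = np.zeros(M + 1)
    c[M] = -1.0  # minimizing -Delta maximizes the stability parameter
    return c, np.array(rows)
```

With SciPy installed, the program can then be solved by, e.g., `scipy.optimize.linprog(c, A_ub=A_ub, b_ub=np.zeros(len(A_ub)), bounds=[(1e-6, 1)] * M + [(0, None)])`; the first M components of the solution are the weights rm and the last is Δ.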
A similar idea was formulated by Krauth and Mezard [9]. It concerns binary
neural networks but, for a fair comparison, we extend their algorithm to vector
multistate ones (MATLAB programs with implementations of all the methods
can be found at [12]). The unknown quantities in the algorithm of Krauth and
Mezard (by analogy with [10]) are N · Q · q weight coefficients and the stability
parameter Δ. For binary perceptrons the two algorithms (ours and theirs) are
nearly equal in noise immunity and memory capacity; but for vector perceptrons
they cannot be applied under the same conditions, since the inequality
N · Q · q >> M always holds. Even at low values of the parameters, such as
N = 100, Q = 20, q = 24, the resulting number of variables prohibits solving the
formulated problem in a reasonable time. Note that the RAM requirements for
solving the problem grow with its size. For example, at such low parameters the
KM algorithm uses more than 19 GB of RAM while the proposed one uses only 1 GB.

5 Suggested Learning Rule Analysis


In this section, we experimentally analyze the properties of the proposed
algorithm and compare it with the KM algorithm and with the Hebb rule. Here
we give the results of experiments in which the parameters σin and σout are
varied. To begin with, let us compare our approach with the KM one: first the
recognition errors, then the learning time.
We have conducted a large number of experiments at different values of the
parameters N, q, Q, M, σin, and σout. Fig. 1 shows the typical dependence of the
incorrect recognition probability on the external environment parameter σout at
chosen parameters N, q, Q, M, σin. The best results are shown by the proposed
rule (curve with square markers); the algorithm of Krauth and Mezard performs
slightly worse; the Hebb rule shows the worst results.
As can be seen in this figure, neural networks trained by the Hebb rule make
errors even at small values of the external environment parameter, i.e., at a small
distortion level (σout = 0.3), while networks trained by the linear programming

Fig. 1. The probability of incorrect noisy reference patterns recognition as a
function of the external environment parameter σout at the chosen parameters
N = 50, q = 6, Q = 16, M = 60, σin = 1.3

Fig. 2. The ratio of the KM algorithm learning time tKM to the proposed
algorithm learning time tOUR as a function of the problem size N at the
parameters q = 6, Q = 16, σin = 1.3 and σout = 0.7

approach prove to be considerably more noise-resistant: they are insensitive to
distortions up to σout = 0.9.
Now let us look at the KM-to-proposed algorithm learning time ratio. Fig. 2
demonstrates the results of the experiment wherein the parameters q, Q, σin, σout
were held fixed while the parameters M and N were varied in such a manner
that the network loading remained constant: M/(Nq) = 0.06. It can be seen that
the proposed algorithm outperforms the KM one by a growing margin as the
problem size increases; the difference reaches several orders of magnitude (4
orders in this experiment).
Such small values of the parameters were selected because the computational
complexity of the KM algorithm rises too quickly as the problem size increases,
so it is not physically possible to obtain experimental data for this algorithm at
larger values of the parameters.
Now we compare the proposed algorithm and the Hebb rule. We show
experimentally how the network properties change; for example, how the
probability of incorrect recognition depends on the internal adjustable parameter
of the model, σin. Fig. 3 illustrates the curves of the probability of incorrect
recognition for σout = 0.6, 0.7, 0.8. It can be seen that for both algorithms there
is an optimal value of the parameter σin (a point at which the recognition error
is minimal). For the Hebb rule it is σin* ≈ σout; for the proposed algorithm
σin* ≈ 2σout, i.e., it is shifted to the right. Similar behaviour was observed in all
the experiments conducted.
Let us examine the recognition error ratios at the optimal points. Fig. 4 shows
that the use of LP in learning is justified, because it allows reducing the error
probability by up to 50 times. The error ratio quickly increases as σout decreases,
i.e., in the region of small errors the benefit of LP application is more significant.
It is also worth mentioning that the proposed algorithm is more resistant to
errors in the estimation of σin than the Hebb rule. This follows from the flatness
of the curves around the optimal points.

Fig. 3. The probability of incorrect patterns recognition as a function of the
internal variable parameter σin at the fixed parameters N = 100, q = 12,
Q = 16, M = 408 and σout = 0.6, 0.7, 0.8 for the Hebb rule (curves with
markers) and for our algorithm (solid curves)

Fig. 4. KM-to-proposed algorithm recognition errors ratio at the optimal points
{σin*} as a function of the external environment parameter σout. The curve has
been plotted experimentally for N = 100, q = 12, Q = 16, M = 408. (The
dashed line indicates the level of one.)

6 Conclusions
In this paper we have considered three algorithms for vector neural network
learning: the Hebb rule, the Krauth-Mezard learning rule (generalized to vector
neural networks) and our algorithm. The last two algorithms involve the use of
Linear Programming.
It was shown that, despite a greater computational complexity than that of the
Hebb rule, the use of LP for vector neural network learning is justified, because
this approach allows reducing the incorrect recognition probability by up to 50
times. (Note that the application of linear programming to binary perceptron
learning allows reaching the theoretical maximum loading that had been
predicted by E. Gardner.)
The suggested algorithm was compared with the algorithm of Krauth and
Mezard. The proposed algorithm differs from the KM one in an essentially
smaller number of objective variables. This has a positive effect on the learning
rate: the suggested algorithm outperforms the KM one by several orders of
magnitude (by up to 10^4 times). Moreover, the stability of a neural network
trained by this approach is 10-75 percent higher.
We want to highlight that here we suggest only a modification of the Hebb
rule. Therefore the generalization performance of the proposed rule is the same
as that of the Hebb rule, in the sense that the proximity between two patterns
(the distance between them) is measured by the Hamming distance. By using
linear programming we increase the noise immunity in particular, but the
generalization performance remains unchanged.
Throughout the paper we have been referring to the KM algorithm generalized
to vector neural networks. However, the algorithm itself is not presented here
due to space limitations. Materials and MATLAB listings can be found at [12].

Acknowledgments. Dedicated to the memory of PhD Michail Vladimirovich


Kryzhanovskiy, the best father and a good researcher. This work was supported
by the program of the Presidium of the Russian Academy of Science (project
2.15) and partially by the Russian Basic Research Foundation (grant 12-07-
00295).

References
1. Kanter, I.: Potts-glass models of neural networks. Physical Review A 37(7), 2739–
2742 (1988)
2. Cook, J.: The mean-field theory of a Q-state neural network model. Journal of
Physics A 22, 2000–2012 (1989)
3. Bolle, D., Dupont, P., Huyghebaert, J.: Thermodynamic properties of the q-state
Potts-glass neural network. Phys. Rev. A 45, 4194–4197 (1992)
4. Wu, F.: The Potts model. Review of Modern Physics 54, 235–268 (1982)
5. Kryzhanovsky, B., Mikaelyan, A.: On the Recognition Ability of a Neural Network
on Neurons with Parametric Transformation of Frequencies. Doklady Mathemat-
ics 65(2), 286–288 (2002)
6. Kryzhanovsky, B., Kryzhanovsky, V., Litinskii, L.: Machine Learning in Vec-
tor Models of Neural Networks. In: Koronacki, J., Raś, Z.W., Wierzchoń, S.T.,
Kacprzyk, J. (eds.) Advances in Machine Learning II. SCI, vol. 263, pp. 427–443.
Springer, Heidelberg (2010)
7. Kryzhanovskiy, V.: Binary Patterns Identification by Vector Neural Network with
Measure of Proximity between Neuron States. In: Honkela, T. (ed.) ICANN 2011,
Part II. LNCS, vol. 6792, pp. 119–126. Springer, Heidelberg (2011)
8. Austin, J., Turner, A., Lees, K.: Chemical Structure Matching Using Correlation
Matrix Memories. In: International Conference on Artificial Neural Networks, IEE
Conference Publication 470, Edinburgh, UK, September 7-10. IEE, London (1999)
9. Krauth, W., Mezard, M.: Learning algorithms with optimal stability in neural
networks. J. Phys. A: Math. Gen. 20, L745–L752 (1987)
10. Kryzhanovskiy, V., Zhelavskaya, I., Karandashev, J.: Binary Perceptron Learning
Algorithm Using Simplex-Method. In: Rutkowski, L., Korytkowski, M., Scherer, R.,
Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part I. LNCS,
vol. 7267, pp. 111–118. Springer, Heidelberg (2012)
11. Kryzhanovsky, B., Kryzhanovsky, V.: Binary Optimization: On the Probability of
a Local Minimum Detection in Random Search. In: Rutkowski, L., Tadeusiewicz,
R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp.
89–100. Springer, Heidelberg (2008)
12. Center of Optical-Neural Technologies,
http://www.niisi.ru/iont/downloads/km/
13. Gardner, E., Derrida, B.: Optimal storage properties of neural network models. J.
Phys. A: Math. Gen. 21, 271–284 (1988)
A Robust Objective Function of Joint
Approximate Diagonalization

Yoshitatsu Matsuda1 and Kazunori Yamaguchi2


1
Department of Integrated Information Technology,
Aoyama Gakuin University,
5-10-1 Fuchinobe, Chuo-ku, Sagamihara-shi, Kanagawa, 252-5258, Japan
matsuda@it.aoyama.ac.jp
2
Department of General Systems Studies,
Graduate School of Arts and Sciences, The University of Tokyo,
3-8-1, Komaba, Meguro-ku, Tokyo, 153-8902, Japan
yamaguch@graco.c.u-tokyo.ac.jp

Abstract. Joint approximate diagonalization (JAD) is a method for solving
blind source separation which can extract non-Gaussian sources without
any other prior knowledge. However, it is not robust when the sample size
is small, because JAD is based on an algebraic objective function. In this
paper, a new robust objective function of JAD is derived by an information
theoretic approach. It has been shown in previous works that the "true"
probabilistic distribution of the non-diagonal elements of the
approximately-diagonalized cumulant matrices in JAD is Gaussian with a
fixed variance. Here, the distribution of the diagonal elements is also
approximated as Gaussian, with the variance as an adjustable parameter.
Then, a new objective function is defined as the likelihood of the
distribution. Numerical experiments verify that the new objective function
is effective when the sample size is small.

Keywords: blind source separation, independent component analysis,
joint approximate diagonalization, information theoretic approach.

1 Introduction

Independent component analysis (ICA) is a widely-used method in signal
processing [5,4]. It solves blind source separation problems under the assumption
that the source signals are statistically independent of each other. In the linear
model (given as X = AS), it estimates the N × N mixing matrix A = (aij) and
the N × M source signals S = (sim) from only the observed signals X = (xim).
N and M correspond to the number of signals and the sample size, respectively.
Joint approximate diagonalization (denoted by JAD) [3,2] is one of the efficient
methods for estimating A. In JAD, the following algebraic property of cumulant
matrices is utilized: νpq = V Cpq V^T is diagonal for any p and q if V = (vij) is
equal to the separating matrix A−1, where each (i, j)-th element of Cpq is given
as the 4-th order cumulant of X (denoted by κijpq). Then, the

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 205–212, 2012.

© Springer-Verlag Berlin Heidelberg 2012
206 Y. Matsuda and K. Yamaguchi

error function of JAD for any p and q is defined as the sum of the squared
non-diagonal elements Ψpq(V) = Σ_{i,j>i} (νijpq)², where each νijpq is the
(i, j)-th element of νpq. Consequently, the objective function of JAD is given as

    Ψ(V) = Σ_{p,q>p} Ψpq = Σ_{p,q>p} Σ_{i,j>i} (νijpq)².    (1)

A significant advantage of JAD is its versatility. Because JAD utilizes linear
algebraic properties of the cumulants, it does not depend on the specific
statistical properties of the source signals except for non-Gaussianity [2]. From
the viewpoint of robustness, however, JAD lacks a theoretical foundation.
Because many ICA methods are based on probabilistic models, their estimated
results are guaranteed to be "optimal" within those models. On the other hand,
JAD is theoretically valid only if every non-diagonal element νijpq is equal to 0.
In other words, it is not guaranteed in JAD that a V with smaller non-diagonal
elements is more "desirable." This theoretical problem often causes a deficiency
of robustness in practical applications.
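The objective (1) is straightforward to evaluate once the cumulant matrices are available. A minimal sketch, assuming the Cpq have already been estimated (computing the 4-th order cumulants themselves is omitted):

```python
import numpy as np

def jad_cost(V, C_list):
    """Objective (1): the sum over all matrix pairs (p, q) of the squared
    upper-triangular off-diagonal elements of nu_pq = V C_pq V^T."""
    cost = 0.0
    for C_pq in C_list:
        nu = V @ C_pq @ V.T
        iu = np.triu_indices_from(nu, k=1)  # strictly upper triangle (i, j>i)
        cost += float(np.sum(nu[iu] ** 2))
    return cost
```

If every Cpq is already diagonal and V is the identity, the cost is exactly zero; any rotation of V away from the diagonalizing frame makes it positive, which is the quantity JAD algorithms minimize.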
In this paper, a new objective function of JAD is derived by an information
theoretic approach in order to improve the robustness of JAD. The information
theoretic approach was proposed previously in [7,8]; it incorporates a
probabilistic model into JAD by regarding the non-diagonal elements of the
cumulants as independent random variables. The approach can theoretically
clarify the properties of JAD. It has also been shown that the approach can
improve the efficiency and robustness of JAD in practice by model selection [7]
or by an approximation of the entropy [8]. However, the previous probabilistic
models were too rough to exploit the information theoretic approach
exhaustively. In this paper, the robustness of JAD is improved further by using a
more accurate approximation involving the diagonal elements. This paper is
organized as follows. In
Section 2, the information theoretic approach to non-diagonal elements is briefly
explained. In Section 3.1, a new objective function of JAD is proposed by
applying the information theoretic approach to the diagonal elements, whose
distributions are approximated as Gaussian with unknown variance. In addition,
an optimization algorithm for the objective function is proposed in Section 3.2.
In Section
4, numerical results on artificial datasets verify that the proposed method can
improve the robustness when the sample size is small. This paper is concluded
in Section 5.

2 Information Theoretic Approach to JAD


Here, the original objective function of JAD is derived by an information theo-
retic approach. In the original JAD, each νijpq is regarded as an error to be as
close to 0 as possible. In the information theoretic approach [7,8], each νijpq is
regarded as a random variable generated from a probabilistic distribution. If the
generative probabilistic distribution is true, this approach is expected to derive
a more robust objective function and estimate V more accurately. It has been

shown in [8] that the true generative distribution of a non-diagonal νijpq (i ≠ j)
in the accurate estimation of V (V = A−1) is given by the following theorem:

Theorem 1. Under the following four conditions, each non-diagonal νijpq (i < j
and p < q) is expected to be an independent and identically distributed random
variable according to the Gaussian distribution with the variance of 1/M. In
other words, the true distribution is gn-diag(ν) = exp(−ν²M/2) / √(2π/M),
where the four conditions are given as follows:
1. Linear ICA Model: The linear ICA model X = AS holds, where the mean
and the variance of each source sim are 0 and 1, respectively.
2. Large Number of Samples: The sample size M is so large that the central
limit theorem holds.
3. Random Mixture: Each element aij in A is given randomly and indepen-
dently, whose mean and variance are 0 and 1/N , respectively.
4. Large Number of Signals: The number of signals N is sufficiently large.
The details of the proof are described in [8]. In brief, it is first proved that the
distribution of each νijpq is approximately Gaussian by the central limit theorem.
Then, it is proved that E(νijpq νklrs) is approximated as δik δjl δpr δqs / M under
the above conditions, where δik is the Kronecker delta. Though it was shown
only roughly in [8] that the objective function Ψ is derived from this theorem,
the following more rigorous derivation is given in this paper. First, it is assumed
that each diagonal element νiipq is distributed according to a sparse uniform
distribution u(x) = c because there is no prior knowledge. Regarding
νjipq = νijpq, the value is determined by algebraic symmetry, so any fixed prior
distribution can be employed without changing the likelihood essentially. Here,
the same uniform distribution u(x) = c is employed for simplicity. Then, the
true distribution P^ν_true(νpq) is given as

    P^ν_true(νpq) = c^(N(N+1)/2) Π_{i,j>i} gn-diag(νijpq).    (2)


By the transformation νpq = V Cpq V^T, the linear transformation matrix from
the vectorized elements of Cpq to those of νpq is given as the Kronecker product
V ⊗ V. Therefore, the distribution of Cpq with the parameter V is determined by

    P^C(Cpq|V) = P^ν(νpq) / |V ⊗ V| = P^ν(νpq) / |V|^(2N),    (3)
where |V| is the determinant of V. The log-likelihood function ℓ is given as

    ℓ(V) = Σ_{p,q>p} log P^C_true(Cpq|V)
         = Σ_{p,q>p} ( −2N log |V| + log P^ν_true(νpq) )
         = −N²(N − 1) log |V| + Σ_{p,q>p} Σ_{i,j>i} log gn-diag(νijpq)
         ≅ −N²(N − 1) log |V| − (M/2) Σ_{p,q>p} Σ_{i,j>i} (νijpq)²    (4)
208 Y. Matsuda and K. Yamaguchi

where some constant terms are neglected. In many JAD algorithms, $V$ is constrained
to be orthogonal by pre-whitening (in other words, $|V| = 1$). In this
case, the maximization of the likelihood $\ell$ in Eq. (4) is equivalent to the
minimization of the JAD objective function $\Psi$ in Eq. (1).
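Eq. (1) itself precedes this excerpt, but from Eq. (4) the JAD objective $\Psi$ is, up to constants, the total squared off-diagonal mass of the transformed cumulant matrices. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def jad_objective(V, C_list):
    """Psi: total squared off-diagonal mass of nu_pq = V C_pq V^T,
    summed over the cumulant matrices C_pq (p < q). Each off-diagonal
    pair is counted twice here, a constant factor that does not affect
    the minimization."""
    psi = 0.0
    for C in C_list:
        nu = V @ C @ V.T
        psi += np.sum(nu ** 2) - np.sum(np.diag(nu) ** 2)
    return psi
```

For a $V$ that exactly diagonalizes every $C_{pq}$, this evaluates to zero.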

3 Method
3.1 Objective Function
While the original objective function of JAD is derived by the information-theoretic
approach in Section 2, that derivation by itself offers no way to improve the
objective function. In Section 2, the diagonal elements are assumed to be
distributed uniformly and independently, and this assumption gives no additional
clues for estimating $V$.
Here, the "true" distribution of the diagonal elements is focused on, and a
new objective function is derived. When $V = A^{-1}$ (the accurate estimation),
the dominant term of a diagonal element $\nu_{iipq}$ without the estimation error
is given as $\nu_{iipq} \simeq a_{pi} a_{qi} \bar{\kappa}_{iiii}$, where $\bar{\kappa}_{iiii}$ is the unknown true kurtosis of the $i$-th
source [2,8]. Because each $a_{ij}$ is assumed to be a normally and independently
distributed random variable in Section 2, the dominant term of $\nu_{iipq}$ follows
a normal product distribution with unknown variance. In addition,
$\nu_{iipq}$ depends slightly on every $\nu_{ijpq}$ through $a_{pi}$ and $a_{qi}$. However, in order to
estimate the likelihood easily, independent Gaussian distributions are employed as
approximations in this paper. Therefore, the distribution of $\nu_{iipq}$ ($p < q$)
is approximated as an independent Gaussian one with unknown variance $\sigma_i^2$:
$g_{\text{diag}}(\nu) = \exp(-\nu^2/2\sigma_i^2)/\sqrt{2\pi\sigma_i^2}$. Then, Eq. (2) is rewritten as

$$P^{\nu}_{\text{true}}(\boldsymbol{\nu}_{pq}) = c^{N(N-1)/2} \prod_{i} g_{\text{diag}}(\nu_{iipq}) \prod_{i,\,j>i} g_{\text{n-diag}}(\nu_{ijpq}).$$
Therefore, the log-likelihood $\ell$ depending on $V$ and $\boldsymbol{\sigma} = (\sigma_i^2)$ is given as

$$\ell(V, \boldsymbol{\sigma}) \cong -N^2(N-1) \log |V| - \frac{M}{2} \sum_{p,\,q>p} \sum_{i,\,j>i} \nu_{ijpq}^2 - \sum_{p,\,q>p} \sum_{i} \left( \frac{\log \sigma_i^2}{2} + \frac{\nu_{iipq}^2}{2\sigma_i^2} \right) \qquad (5)$$
where some constant terms are neglected. It is easily shown that the maximum
likelihood estimator $\hat{\boldsymbol{\sigma}} = (\hat{\sigma}_i^2)$ is given by $\hat{\sigma}_i^2 = \sum_{p,\,q>p} \nu_{iipq}^2 / (N(N-1)/2)$.
Consequently, the log-likelihood is given as

$$\ell(V) \cong -N^2(N-1) \log |V| - \frac{M}{2} \sum_{p,\,q>p} \sum_{i,\,j>i} \nu_{ijpq}^2 - \frac{N(N-1)}{4} \sum_{i} \log \left( \sum_{p,\,q>p} \nu_{iipq}^2 \right). \qquad (6)$$
This is the new objective function of JAD. It is worth noting that Eq. (6)
approaches the original JAD objective as the number of samples ($M$) grows
relative to the number of parameters to be estimated ($N^2$).
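As a concrete sketch (names are ours), Eq. (6) can be evaluated directly from $V$ and the cumulant matrices:

```python
import numpy as np

def new_objective(V, C_list, M):
    """Log-likelihood of Eq. (6), constant terms dropped.
    C_list collects the cumulant matrices C_pq for p < q."""
    N = V.shape[0]
    off_sq = 0.0            # sum of squared off-diagonals (j > i)
    diag_sq = np.zeros(N)   # per-signal sums of squared diagonals
    for C in C_list:
        nu = V @ C @ V.T
        off_sq += np.sum(np.triu(nu, k=1) ** 2)
        diag_sq += np.diag(nu) ** 2
    _, logdet = np.linalg.slogdet(V)   # log|V| (zero for orthogonal V)
    return (-N * N * (N - 1) * logdet
            - (M / 2) * off_sq
            - (N * (N - 1) / 4) * np.sum(np.log(diag_sq)))
```

For an orthogonal $V$ the determinant term vanishes and only the off-diagonal penalty and the log-of-diagonal term remain, as used in Section 3.2.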
A Robust Objective Function of Joint Approximate Diagonalization 209

3.2 Optimization Algorithm


Here, an algorithm optimizing Eq. (6) is proposed, which is similar to the well-known
JADE algorithm [3] except for the optimization in each pair of signals.
First, $V$ is orthogonalized by pre-whitening $X$. Rigorously speaking, this involves
an approximation because the true estimation is not accurately orthogonal.
However, pre-whitening is employed in this paper as well because it is
known to be useful in JADE. Then, the first term in Eq. (6) vanishes. In addition,
it is easily shown that $\sum_{i,j} \nu_{ijpq}^2$ is a constant $K$ for any orthogonal $V$. Therefore,
$\sum_{i,\,j>i} \nu_{ijpq}^2 = \bigl(K - \sum_i \nu_{iipq}^2\bigr)/2$. Thus, the maximization of Eq. (6) is equivalent
to the minimization of the following objective function $\Phi$ under orthogonality
constraints:

$$\Phi(V) = -\sum_{p,\,q>p} \sum_{i} \nu_{iipq}^2 + \lambda \sum_{i} \log \left( \sum_{p,\,q>p} \nu_{iipq}^2 \right), \qquad (7)$$

where $\lambda = N(N-1)/M$ can be regarded as a non-linearity parameter. Though
$\lambda$ is theoretically determined, it can be adjusted to improve the performance. In
the same way as in JADE, $\Phi$ is minimized by the Jacobi method (the repetition
of optimal rotations of pairs). In the Jacobi method, $\Phi$ is simplified into the
following term $\Phi_{ij}$ on only a pair $(i, j)$ and a rotation $\theta$:

   
$$\Phi_{ij}(\theta) = -\sum_{p,\,q>p} \left( \tilde{\nu}_{iipq}^2 + \tilde{\nu}_{jjpq}^2 \right) + \lambda \left( \log \sum_{p,\,q>p} \tilde{\nu}_{iipq}^2 + \log \sum_{p,\,q>p} \tilde{\nu}_{jjpq}^2 \right) \qquad (8)$$

where $\tilde{\nu}_{iipq}(\theta)$ is the element rotated by $\theta$, which is given by

$$\begin{pmatrix} \tilde{\nu}_{iipq} & \tilde{\nu}_{ijpq} \\ \tilde{\nu}_{ijpq} & \tilde{\nu}_{jjpq} \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} \nu_{iipq} & \nu_{ijpq} \\ \nu_{ijpq} & \nu_{jjpq} \end{pmatrix} \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}. \qquad (9)$$

Note that the range of $\theta$ can be limited to $[0, \pi/2)$ by symmetry. Therefore,
$\sum_{p,\,q>p} \tilde{\nu}_{iipq}^2$ and $\sum_{p,\,q>p} \tilde{\nu}_{jjpq}^2$ are given by

$$\sum_{p,\,q>p} \tilde{\nu}_{iipq}^2 = \alpha_1 \sin 4\theta + \alpha_2 \cos 4\theta + \alpha_3 \sin 2\theta + \alpha_4 \cos 2\theta + \alpha_5, \qquad (10)$$
$$\sum_{p,\,q>p} \tilde{\nu}_{jjpq}^2 = \alpha_1 \sin 4\theta + \alpha_2 \cos 4\theta - \alpha_3 \sin 2\theta - \alpha_4 \cos 2\theta + \alpha_5 \qquad (11)$$

where

$$\alpha_1 = \sum_{p,\,q>p} \frac{\nu_{iipq} \nu_{ijpq} - \nu_{ijpq} \nu_{jjpq}}{2}, \qquad (12)$$
$$\alpha_2 = \sum_{p,\,q>p} \frac{\nu_{iipq}^2 + \nu_{jjpq}^2 - 2 \nu_{iipq} \nu_{jjpq} - 4 \nu_{ijpq}^2}{8}, \qquad (13)$$
$$\alpha_3 = \sum_{p,\,q>p} \left( \nu_{iipq} \nu_{ijpq} + \nu_{jjpq} \nu_{ijpq} \right), \qquad (14)$$
$$\alpha_4 = \sum_{p,\,q>p} \frac{\nu_{iipq}^2 - \nu_{jjpq}^2}{2}, \qquad (15)$$

$$\alpha_5 = \sum_{p,\,q>p} \frac{3 \nu_{iipq}^2 + 3 \nu_{jjpq}^2 + 2 \nu_{iipq} \nu_{jjpq} + 4 \nu_{ijpq}^2}{8}. \qquad (16)$$

Note that these coefficients $\alpha_{1\text{--}5}$ can be calculated before the optimization of
$\Phi_{ij}$ because they do not depend on $\theta$. Unlike in the original JADE, $\Phi_{ij}$ cannot be
minimized analytically because it includes logarithms. However, the optimal $\hat{\theta}$ is
easily calculated numerically because the function has only the single parameter
$\theta$ in $[0, \pi/2)$. Though there is the possibility of finding some local optima, the
simple MATLAB function "fminbnd" is employed in this paper. In summary,
the proposed method is given as follows:
1. Initialization. Whiten the given observed matrix $X$ (orthogonalization) and
calculate the cumulant matrices $C_{pq} = (\kappa_{ijpq})$ for every $p$ and $q > p$. In
addition, set $V$ to the identity matrix.
2. Sweep. For every pair $i$ and $j > i$,
(a) Calculate $\hat{\theta}$ minimizing $\Phi_{ij}$ in Eq. (8).
(b) Only if $\hat{\theta}$ is greater than a given small threshold $\epsilon$, do the actual rotation
of $V$ and update every $\nu_{ijpq}$ depending on $i$ or $j$ by $\hat{\theta}$.
3. Convergence decision. If no pair has been actually rotated in the current
sweep, end. Otherwise, go to the next sweep.
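Step 2(a) can be sketched as follows, with SciPy's bounded `minimize_scalar` standing in for MATLAB's `fminbnd`; the helper names are ours, and the formulas follow Eqs. (8) and (10)-(16):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def alpha_coeffs(nus, i, j):
    """Coefficients alpha_1..alpha_5 of Eqs. (12)-(16) for the pair (i, j).
    nus: the current matrices nu_pq = V C_pq V^T for p < q."""
    a = np.zeros(5)
    for nu in nus:
        d_i, d_j, o = nu[i, i], nu[j, j], nu[i, j]
        a[0] += (d_i * o - o * d_j) / 2
        a[1] += (d_i**2 + d_j**2 - 2*d_i*d_j - 4*o**2) / 8
        a[2] += d_i * o + d_j * o
        a[3] += (d_i**2 - d_j**2) / 2
        a[4] += (3*d_i**2 + 3*d_j**2 + 2*d_i*d_j + 4*o**2) / 8
    return a

def phi_ij(theta, a, lam):
    """Phi_ij(theta) of Eq. (8) via the trigonometric expansions (10)-(11)."""
    s4, c4 = np.sin(4 * theta), np.cos(4 * theta)
    s2, c2 = np.sin(2 * theta), np.cos(2 * theta)
    s_ii = a[0]*s4 + a[1]*c4 + a[2]*s2 + a[3]*c2 + a[4]
    s_jj = a[0]*s4 + a[1]*c4 - a[2]*s2 - a[3]*c2 + a[4]
    return -(s_ii + s_jj) + lam * (np.log(s_ii) + np.log(s_jj))

def best_rotation(nus, i, j, lam):
    """Step 2(a): numerically optimal theta in [0, pi/2)."""
    a = alpha_coeffs(nus, i, j)
    res = minimize_scalar(phi_ij, bounds=(0.0, np.pi / 2), args=(a, lam),
                          method="bounded")
    return res.x
```

A quick sanity check of the expansions: at $\theta = 0$, Eq. (10) reduces to $\alpha_2 + \alpha_4 + \alpha_5 = \sum_{p,q>p} \nu_{iipq}^2$, which the coefficients above reproduce exactly.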

4 Results

Here, the proposed method is compared with the original JADE in blind source
separation of artificial sources. Half of the source signals were generated from
the Laplace distribution (super-Gaussian) and the other half from the uniform
distribution (sub-Gaussian). All the sources were normalized (mean 0 and
variance 1). JAD is known to be effective for such cases where sub- and
super-Gaussian sources are mixed. The number of sources $N$ was set to 24 and
30. The mixing matrix $A$ was randomly generated, with each element drawn
from the standard normal distribution. The non-linearity parameter $\lambda$ was
empirically set to $N(N-1)/2M$, which is half of the theoretical value and
weakens the non-linearity. The small threshold $\epsilon$ was set to $10^{-8}$. All the
experiments were averaged over 10 runs. The results are shown in Fig. 1.
Fig. 1-(a) shows the transitions of the separating error along the sample size
for the proposed method and the original JADE. Fig. 1-(b) shows the transitions
of $\Psi$ (the objective function of the original JADE). In order to clarify the
difference between the two methods, the transitions of the t-statistics comparing
the separating error of the proposed method with that of the original JADE
are shown in Fig. 1-(c). The t-statistics were calculated under the assumption
of two independent groups with the same variance, where the sample size of
each group is 10 (the number of runs). Though the results fluctuated
considerably, especially for $N = 30$, the t-statistics tend to be smaller than 0
for relatively small sample sizes (roughly under 1200 for $N = 24$ and 1800 for
$N = 30$). In addition, the t-statistics are often below the t-test threshold at
the 0.1 level. It shows that
[Figure 1 appears here: three rows of paired plots versus the sample size $M$ (500 to 2500), for $N = 24$ (left) and $N = 30$ (right): (a) transitions of the separating error, (b) transitions of $\Psi$ (the JADE objective function), and (c) the t-statistic comparing the proposed method with the original JADE, shown with the zero line and the 10% t-test threshold.]
Fig. 1. Separating error and reduction rate along the sample size: The left and right
sides correspond to N = 24 and N = 30, respectively. (a) The transitions of Amari’s
separating error [1] along the sample size M by the proposed method (solid curves)
and the original JADE (dashed). (b) The transitions of Ψ by the proposed method
(solid) and the original JADE (dashed). (c) The transitions of the t-statistics comparing
the proposed method with the original JADE for the separating error (solid curves).
The dashed and dotted lines are the zero line and the t-test threshold (10% and left-
tailed), respectively. If the t-statistic is smaller than the threshold, the superiority of
the proposed method is statistically significant at the 0.1 level.
the superiority of the proposed method is statistically significant. On the other
hand, the proposed method seems to be slightly inferior to the original JADE
for large sample sizes. Comparing Figs. 1-(a) and 1-(c), this reversal of
superiority seems to occur when the separating error decreases drastically. In
other words, the proposed method seems to be effective only below such "phase-shift"
thresholds. This possibly suggests that the proposed probabilistic model has
to be improved for large $M$. Though these phenomena seem interesting, further
analysis is beyond the scope of this paper. In any case, the results verify that
the proposed method is superior to the original JADE for small sample sizes.

5 Conclusion

In this paper, we propose a new objective function of JAD by an information-theoretic
approach and a JADE-like method minimizing the function. The numerical
results show that the proposed method is effective for limited sample sizes. We
are planning to improve the proposed method by analyzing numerical results
and elaborating the probabilistic model. In particular, we are planning to carry
out extensive numerical experiments in order to find the optimal value of the
non-linearity parameter $\lambda$ and to estimate the accurate distribution of the
diagonal element $\nu_{iipq}$ (which is roughly approximated as Gaussian in this paper).
We are also planning to compare this method with other ICA methods such as
the extended infomax algorithm [6]. In addition, we are planning to apply this
method to various practical applications as well as artificial datasets.

References
1. Amari, S., Cichocki, A.: A new learning algorithm for blind signal separation. In:
Touretzky, D., Mozer, M., Hasselmo, M. (eds.) Advances in Neural Information
Processing Systems 8, pp. 757–763. MIT Press, Cambridge (1996)
2. Cardoso, J.F.: High-order contrasts for independent component analysis. Neural
Computation 11(1), 157–192 (1999)
3. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non Gaussian signals. IEE
Proceedings-F 140(6), 362–370 (1993)
4. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning
Algorithms and Applications. Wiley (2002)
5. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley
(2001)
6. Lee, T.W., Girolami, M., Sejnowski, T.J.: Independent component analysis using
an extended infomax algorithm for mixed subgaussian and supergaussian sources.
Neural Computation 11(2), 417–441 (1999)
7. Matsuda, Y., Yamaguchi, K.: An adaptive threshold in joint approximate diago-
nalization by assuming exponentially distributed errors. Neurocomputing 74, 1994–
2001 (2011)
8. Matsuda, Y., Yamaguchi, K.: An Information Theoretic Approach to Joint Approx-
imate Diagonalization. In: Lu, B.-L., Zhang, L., Kwok, J. (eds.) ICONIP 2011, Part
I. LNCS, vol. 7062, pp. 20–27. Springer, Heidelberg (2011)
TrueSkill-Based Pairwise Coupling
for Multi-class Classification

Jong-Seok Lee

School of Integrated Technology, Yonsei University, Korea
jong-seok.lee@yonsei.ac.kr

Abstract. A multi-class classification problem can be solved efficiently
via decomposition of the problem into multiple binary classification problems.
As a way of such decomposition, we propose a novel pairwise
coupling method based on the TrueSkill ranking system. Instead of aggregating
all pairwise binary classification results for the final decision,
the proposed method keeps track of the ranks of the classes during the
successive binary classification procedure. In particular, selection of a
binary classifier at a certain step is done in such a way that the multi-class
classification decision using the binary classification results up to that
step converges to the final one as quickly as possible. Thus, the number
of binary classifications can be reduced, which in turn reduces the
computational complexity of the whole classification system. Experimental
results show that the complexity is reduced significantly for no or minor
loss of classification performance.

Keywords: TrueSkill, multi-class classification, classifier fusion,
match-making, pairwise coupling, on-line ranking.

1 Introduction

In the field of pattern classification, solving multi-class problems involving more
than two target classes still remains challenging, whereas methods to solve binary
classification problems are rather well established, such as the support vector
machine (SVM), linear discriminant analysis, and AdaBoost. Extending these to
multi-class problems has been investigated in the literature, e.g., multi-class SVM
[3]. However, such extensions sometimes suffer from difficulties such as high
computational complexity [11]. Thus, a different approach to solve multi-class
pattern classification problems has been also researched, which decomposes a
multi-class problem into several binary problems and combines the binary clas-
sification results to obtain the final result for the original problem [13]. In other
words, for a given sample x, a set of N binary classifiers fn , n = 1, ..., N is used
to predict the class label of x among C classes, where C > 2.
There are three major approaches for decomposition of a multi-class problem.
In the “one-vs-all” decomposition approach, each binary classifier is trained to

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 213–220, 2012.

© Springer-Verlag Berlin Heidelberg 2012
214 J.-S. Lee

distinguish a class from the remaining classes [14]. Thus, the total number of
binary classifiers is equal to the number of target classes, i.e., N = C. When
a novel sample is given for classification, the class for which the correspond-
ing classifier shows the highest probability is chosen. In the “one-vs-one” ap-
proach, all the C(C − 1)/2 pairwise combinations of the classes are considered,
for each of which a binary classifier fij is trained to distinguish class i and class
j [12]. For a test sample x, each classifier provides its preference between the
two classes considered by the classifier during training, which is in the form of
rij = P (x in class i|x in class i or j). The final classification decision is drawn
based on the outputs of all the classifiers. In [9], a Bradley-Terry model-based
method was proposed to estimate the probability that the sample is from a
class, pi , by minimizing the weighted Kullback-Leibler distance between rij and
qij = pi /(pi + pj ), i.e.,
  
rij rji
min wij rij log + rji log
{pi }
i<j
qij qji

subject to 0 ≤ pi ≤ 1, i = 1, ..., C, and pi = 1 (1)
i

where rji = 1 − rij , qij = 1 − qji , and wij is a weighting factor that can be simply
put as unity or other values such as the number of training samples for classes
i and j. An iterative algorithm to solve the above optimization problem was
proposed in [9], and a more stable algorithm was developed in [16]. Finally, the
“error-correcting output code” approach considers solving a multi-class problem
as a communication task [4]. Binary classifiers are trained as in the one-vs-all
and one-vs-one approaches, but each of them sets some classes as a positive label
and the rest as a negative label. The number of classifiers must be redundant,
i.e., higher than the minimum number required to uniquely distinguish each
class, so that classification errors by some classifiers can be recovered. When a
test sample is inputted, the final decision is based on decoding the classifiers’
outputs, e.g., Hamming decoding. The approach was extended to allow for binary
classifiers not to consider some classes as either a positive or negative label [1]. In
[11,7], the aforementioned approaches have been compared on various problems,
which showed that their performances are nearly the same. It was also observed
that the one-vs-one approach is more practical than the others due to its
reduced time complexity [11].
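For concreteness, here is a minimal sketch of the pairwise-coupling iteration for problem (1) with unit weights, in the spirit of the fixed-point algorithm of [9] (this is our own transcription, not the published pseudocode):

```python
import numpy as np

def couple_pairwise(R, n_iter=200):
    """Estimate class probabilities p from pairwise outputs
    R[i, j] = r_ij (with r_ji = 1 - r_ij), unit weights assumed.
    Each pass rescales p_i by the ratio of observed to modeled
    pairwise preferences, then renormalizes."""
    C = R.shape[0]
    p = np.full(C, 1.0 / C)
    for _ in range(n_iter):
        for i in range(C):
            num = sum(R[i, j] for j in range(C) if j != i)
            den = sum(p[i] / (p[i] + p[j]) for j in range(C) if j != i)
            p[i] *= num / den
        p /= p.sum()
    return p
```

When the $r_{ij}$ are exactly consistent with some underlying probabilities (i.e., $r_{ij} = p_i/(p_i + p_j)$), the iteration recovers those probabilities.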
In this paper, we address the issue of the complexity of the one-vs-one ap-
proach during the process of classification of an unseen sample, which has been
rarely addressed in prior work. Traditionally, C(C − 1)/2 pairwise classifications
need to be performed for classification of the sample. Then, an aggregation
process takes place to combine the results, which requires additional complex-
ity. For instance, the Bradley-Terry model-based method [9] performs iterative
estimation of the class probabilities according to the problem in (1), from which
the final classification decision is made.
Therefore, we propose a novel aggregation method for the one-vs-one ap-
proach. In contrast to conventional methods, the “score” of each class label for
a given test sample is updated in an on-line manner after the classification by
each binary classifier. This is connected to the on-line ranking problem, for
which various techniques have been developed [5,8,15]. In our case, we employ
the TrueSkill ranking system [10], which was developed to track the skills of
game players in a Bayesian framework. Since the skills (scores in our case) are
updated after each classification step, the winning class at a certain step is known
without aggregating the classification results up to that moment. In particular,
at each step a binary classifier is carefully selected, so that the winning class can
be predicted even before all the pairwise classifications are performed, which
reduces the complexity of the whole classification system.
In the following section, the proposed method is explained in detail. Section
3 demonstrates the performance of the method on several real-world multi-class
classification problems. Finally, the conclusion is drawn in Section 4.

2 Proposed Method
As already mentioned, the proposed method is based on the one-vs-one approach.
Therefore, N = C(C − 1)/2 binary classifiers are constructed and each of them
is trained using the training data having the corresponding two class labels.
When a test sample is given, binary classifications are conducted sequentially
by choosing one among the N trained classifiers, during which the score of each
class label (and consequently the ranking of the classes) is updated at each binary
classification step. The key components of the proposed method are the TrueSkill
system-based on-line ranking scheme and the prioritized match-making scheme
for classifier selection, which are explained below.
The TrueSkill system [10] is characterized by the average skill of a player
and the degree of uncertainty (or variance) in the player’s skill, which models
a player’s skill as a normally distributed random variable. A larger variance
indicates that the player’s performance is more unstable. In our case, a player
can be considered as a class label. Once the game between two players has
finished and thus the winner and loser have been determined (in our case, once
a binary classifier has performed classification of a given sample), their skills (μ)
and degrees of uncertainty ($\sigma$) are updated according to the following rule:

$$\mu_{\text{winner}} \leftarrow \mu_{\text{winner}} + \frac{\sigma_{\text{winner}}^2}{c} \cdot v\!\left(\frac{\Delta\mu - \epsilon}{c}\right) \qquad (2)$$
$$\mu_{\text{loser}} \leftarrow \mu_{\text{loser}} - \frac{\sigma_{\text{loser}}^2}{c} \cdot v\!\left(\frac{\Delta\mu - \epsilon}{c}\right) \qquad (3)$$
$$\sigma_{\text{winner}}^2 \leftarrow \sigma_{\text{winner}}^2 \cdot \left( 1 - \frac{\sigma_{\text{winner}}^2}{c^2} \cdot w\!\left(\frac{\Delta\mu - \epsilon}{c}\right) \right) \qquad (4)$$
$$\sigma_{\text{loser}}^2 \leftarrow \sigma_{\text{loser}}^2 \cdot \left( 1 - \frac{\sigma_{\text{loser}}^2}{c^2} \cdot w\!\left(\frac{\Delta\mu - \epsilon}{c}\right) \right) \qquad (5)$$
where $\Delta\mu = \mu_{\text{winner}} - \mu_{\text{loser}}$ and $c^2 = 2\beta^2 + \sigma_{\text{winner}}^2 + \sigma_{\text{loser}}^2$. The functions $v$
and $w$ are given by

$$v(x) = \frac{\mathcal{N}(x)}{\Phi(x)} \qquad (6)$$
$$w(x) = v(x) \cdot (v(x) + x) \qquad (7)$$

where $\mathcal{N}$ and $\Phi$ are the probability density function and cumulative distribution
function of the standard normal distribution, respectively. The parameter $\beta^2$ is
a per-game variance; in other words, a larger value of $\beta^2$ means that a game's
outcome becomes less dependent on the players' skills. According to the above
updating rule, the winner's (loser's) skill is increased (decreased), while the
variances of the two players are decreased in order to reflect that the confidence
about the skills increases as more games are conducted.
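The update rule of Eqs. (2)-(7) can be transcribed directly; the function and parameter names are ours, `eps` denotes the draw margin ε, and the default constants anticipate the settings used in Section 3:

```python
from math import sqrt
from statistics import NormalDist

_nd = NormalDist()

def v_fn(x):
    return _nd.pdf(x) / _nd.cdf(x)      # Eq. (6): N(x)/Phi(x)

def w_fn(x):
    return v_fn(x) * (v_fn(x) + x)      # Eq. (7)

def update(mu_w, var_w, mu_l, var_l, beta=25/6, eps=0.5):
    """One winner/loser update, Eqs. (2)-(5)."""
    c2 = 2 * beta**2 + var_w + var_l
    c = sqrt(c2)
    t = (mu_w - mu_l - eps) / c         # argument (delta_mu - eps)/c
    mu_w2 = mu_w + (var_w / c) * v_fn(t)
    mu_l2 = mu_l - (var_l / c) * v_fn(t)
    var_w2 = var_w * (1 - (var_w / c2) * w_fn(t))
    var_l2 = var_l * (1 - (var_l / c2) * w_fn(t))
    return mu_w2, var_w2, mu_l2, var_l2
```

As the text notes, a single call raises the winner's mean, lowers the loser's, and shrinks both variances.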
In order to speed up the convergence of the ranking, the following "match-making"
procedure is used in the proposed method. At the beginning of classification,
the scores of all the players are set to the same initial value, so matches
are selected randomly, i.e., a classifier is selected at random among the $C(C-1)/2$
trained binary classifiers at each step. After a few random matches, candidates
for matches are selected strategically as follows. First, among the players that
have not completed all the matches with the other players, the one having the
best score is selected (player 1). Then, among the players that have not played
with player 1, the one having the largest chance of drawing is selected as the
opponent of player 1 (player 2). A large chance of drawing between two players
means that there exists a large ambiguity in their ranking. Therefore, by promoting
the match between these players, such ambiguity can be reduced quickly.
The chance of drawing is a function of the skills ($\mu_1$ and $\mu_2$) and variances ($\sigma_1^2$
and $\sigma_2^2$) of the two players [10], i.e.,
$$p_{\text{draw}} = \frac{\sqrt{2}\,\beta}{c} \exp\!\left( -\frac{(\mu_1 - \mu_2)^2}{2c^2} \right) \qquad (8)$$

where $c^2 = 2\beta^2 + \sigma_1^2 + \sigma_2^2$. $p_{\text{draw}}$ is a decreasing function with respect to increasing
$\sigma_1$ or $\sigma_2$. Thus, even if the two players have the same skill, the probability of
a draw becomes small for large variances.
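Eq. (8) and the opponent selection it drives can be sketched as follows (illustrative names; `players` maps class labels to pairs (μ, σ²)):

```python
from math import sqrt, exp

def p_draw(mu1, var1, mu2, var2, beta=25/6):
    """Chance of drawing between two players, Eq. (8)."""
    c2 = 2 * beta**2 + var1 + var2
    return (sqrt(2) * beta / sqrt(c2)) * exp(-(mu1 - mu2)**2 / (2 * c2))

def pick_opponent(player1, candidates, players, beta=25/6):
    """Player 2: the not-yet-played candidate with the largest draw chance."""
    mu1, var1 = players[player1]
    return max(candidates,
               key=lambda c: p_draw(mu1, var1, *players[c], beta=beta))
```

With equal skills, larger variances shrink the prefactor $\sqrt{2}\beta/c$ and hence the draw chance, matching the remark above; with equal variances, the candidate whose mean is closest to player 1's is preferred.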

3 Experiments
The performance of the proposed algorithm is evaluated on several multi-class
classification problems. For comparison, the proposed algorithm without the
prioritized match-making, i.e., a match is always selected randomly at each
classification step, is also evaluated. In addition, the method in [9] where the
Bradley-Terry model-based method is applied after the full pairwise matches
are conducted is considered.
Six real-world classification problems were chosen from the UCI Machine
Learning Repository [6]. Table 1 summarizes the characteristics of the chosen
TrueSkill-Based Pairwise Coupling for Multi-class Classification 217

Table 1. Summary of the datasets for multi-class classification. Since the original
Soybean dataset contains missing values, we used only the samples without missing
attributes, as indicated in parentheses. The Soybean dataset specifies the
training and test data, while the other datasets do not.

Dataset   #class    #attribute   #training sample   #test sample
Zoo       7         17           101 (total)
Ecoli     8         7            336 (total)
Yeast     10        8            1484 (total)
Vowel     11        10           990 (total)
Soybean   19 (15)   35           307 (266)          376 (296)
Letter    26        16           20000 (total)

Table 2. Final accuracy and standard deviation values (%) of the Bradley-Terry
model-based method, the TrueSkill-based method with random match selection, and
the proposed TrueSkill-based method with prioritized match-making. Note that for the
Soybean dataset, where the training and test data are fixed over the ten trials, the
Bradley-Terry model-based method always produces the same results for the ten
different random seeds, so that the standard deviation is zero.

Dataset   Bradley-Terry model   TrueSkill (random)   TrueSkill (prioritized)
Zoo       94.60±4.99            94.60±4.99           94.60±4.99
Ecoli     83.75±3.13            83.75±3.13           83.81±3.11
Yeast     55.80±2.50            55.65±2.52           55.74±2.47
Vowel     94.24±1.36            94.38±1.25           94.30±1.38
Soybean   93.92±0.00            94.46±0.36           94.05±0.24
Letter    96.72±0.18            96.73±0.19           96.74±0.17

problems. Except for the Soybean dataset, there is no explicit separation of training
and test data. Thus, we randomly divided the whole data of each dataset
into halves, one for training and the other for test.
SVMs having Gaussian kernels were employed as the binary classifiers by
using the LIBSVM library [2]. In our method, we set the initial scores ($\mu$) to 25,
the initial variances ($\sigma^2$) to $(25/3)^2$, $\beta = 25/6$, and $\epsilon = 0.5$, as in [10]. The number of
random matches at the beginning of our algorithm was set to five. We repeated
each experiment ten times with different random seeds, and report the average
performance below.
The final classification accuracies of the three different methods are compared
in Table 2. It can be seen that the final accuracies and their standard deviations
of the three algorithms are almost the same, which validates the empirical
convergence of the proposed TrueSkill-based methods.
As aforementioned, an important advantage of the proposed method is that
the number of matches can be reduced while the classification performance is
not degraded significantly. Table 3 shows the number of matches (as the ratio
with respect to the number of total matches) that is required to reach the 90%,
Table 3. Required numbers of matches (i.e., pairwise classifications), as the ratio to
the total number of matches in %, to reach 90%, 95%, 98%, and 100% of the final
classification accuracy

Relative accuracy   Dataset   Random match-making   Prioritized match-making
90%                 Zoo       85.7                  76.2
                    Ecoli     78.6                  57.1
                    Yeast     71.1                  44.4
                    Vowel     81.8                  69.1
                    Soybean   79.1                  66.7
                    Letter    81.2                  83.4
                    Average   79.6                  66.2
95%                 Zoo       95.2                  90.5
                    Ecoli     89.3                  67.9
                    Yeast     84.4                  55.6
                    Vowel     90.9                  83.6
                    Soybean   89.5                  81.0
                    Letter    89.9                  89.5
                    Average   89.9                  78.0
98%                 Zoo       100                   95.2
                    Ecoli     96.4                  71.4
                    Yeast     95.6                  68.9
                    Vowel     96.4                  90.9
                    Soybean   96.2                  95.2
                    Letter    96.0                  95.7
                    Average   96.8                  86.2
100%                Zoo       100                   100
                    Ecoli     100                   92.9
                    Yeast     100                   80.0
                    Vowel     100                   100
                    Soybean   100                   100
                    Letter    100                   100
                    Average   100                   95.5

95%, 98%, and 100% of the final accuracy (i.e., the accuracy obtained after all
the matches are performed). On average, the final accuracy is reached when
95.5% of the matches are performed; in other words, the complexity can be reduced
by up to 4.5% without any performance loss. In particular, the maximum reduction
(20%) was obtained for the Yeast dataset. If a slight loss of accuracy is
permitted, the complexity reduction becomes larger: for relative accuracy errors
of 2%, 5%, and 10%, the complexity can be reduced by 13.8%, 22.0%, and 33.8%,
respectively. Higher reductions are harder to obtain for the Zoo and Letter
problems than for the others, while the proposed method is most effective for
the Yeast problem.
The evolution of the classification accuracy with respect to the number of
matches is shown in Fig. 1, where the two match-making methods are compared.
It is clearly seen that the prioritized match-making method helps the accuracy to
[Figure 1 appears here: six panels of classification accuracy (%) versus the number of matches, comparing random and prioritized match-making on each dataset.]

Fig. 1. Evolution of the classification accuracy with respect to the number of matches
for the two different match-making schemes. (a) Zoo (b) Ecoli (c) Yeast (d) Vowel (e)
Soybean (f) Letter.

converge to the final value quickly. It produces stair-shaped curves because
the class label having the highest score competes with the remaining classes
until all the matches for that class are performed, during which the flat regions
in the curves are produced. Because of this, the random match-making is faster
at the beginning. However, as the number of matches increases, the prioritized
match-making scheme converges faster than the random match-making, as was
also shown in Table 3.

4 Conclusion

We have presented a TrueSkill-based on-line aggregation method for binary
decomposition of multi-class classification problems. Through tracking of the scores
of the classes and prioritized match-making, it is possible to make the final
classification decision without performing all pairwise binary classifications. It was
demonstrated that the proposed method can reduce the complexity of classifying
test samples, where the amount of reduction is greater if the permitted
loss of classification accuracy is higher. Moreover, the method does not require
an additional step to obtain the final decision from the binary classification
results (e.g., the iterative process in the Bradley-Terry model-based method).
In the future, we will work on the problem of estimating the class probabilities
from the scores and variances. In addition, the theoretical aspects and the effect
of noise on the performance of the method will be further investigated.
Acknowledgment. This work was supported by the Ministry of Knowledge
Economy, Korea, under the IT Consilience Creative Program (NIPA-2012-H0201-12-1001)
and by Yonsei University Research Fund.

References
1. Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multiclass to binary: a unifying
approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
2. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM
Trans. Intell. Syst. Techn. 2(3), 27:1–27:27 (2011),
http://www.csie.ntu.edu.tw/~cjlin/libsvm
3. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-
based vector machines. J. Mach. Learn. Res. 2, 265–292 (2001)
4. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-
correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
5. Elo, A.E.: The Rating of Chessplayers, Past and Present. Arco Publishing, New
York (1986)
6. Frank, A., Asuncion, A.: UCI machine learning repository. School of Information
and Computer Science, University of California, Irvine (2010)
7. Garcı́a-Pedrajas, N., Ortiz-Boyer, D.: An empirical study of binary classifier fusion
methods for multiclass classification. Inform. Fusion 12, 111–130 (2011)
8. Glickmann, M.E.: Parameter estimation in large dynamic paired comparison ex-
periments. Appl. Statist. 48, 377–394 (1999)
9. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Ann. Stat. 26(2),
451–471 (1998)
10. Herbrich, R., Minka, T., Graepel, T.: Trueskill: a Bayesian skill rating system.
In: Adv. Neural Info. Process. Syst., vol. 19, pp. 569–576. MIT Press, Cambridge
(2007)
11. Hsu, C.W., Lin, C.J.: A comparison of methods for multiclass support vector ma-
chines. IEEE Trans. Neural Netw. 13(2), 415–425 (2002)
12. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B.,
Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods-Support Vector
Learning, pp. 255–268. MIT Press, Cambridge (1999)
13. Lorena, A.C., de Carvalho, A.C.P.L.F., Gama, J.M.P.: A review on the combination
of binary classifiers in multiclass problems. Artif. Intell. Rev. 30, 19–37 (2008)
14. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn.
Res. 5, 101–141 (2004)
15. Weng, R.C., Lin, C.J.: A Bayesian approximation method for online ranking. J.
Mach. Learn. Res. 12, 267–300 (2011)
16. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification
by pairwise coupling. J. Mach. Learn. Res. 5, 975–1005 (2004)
Analogical Inferences in the Family Trees Task:
A Review

Sergio Varona-Moya and Pedro L. Cobos

Department of Basic Psychology, University of Málaga


Faculty of Psychology, Campus Teatinos, 29071, Málaga, Spain
{sergio.varona,p cobos}@uma.es
http://causal.uma.es/

Abstract. We reviewed the performance results on a generalization test
obtained by Hinton in a pioneering study of perceptrons’ capacity to
make analogical inferences. In that test, a five-layer feed-forward network
was exposed to items that were not part of its training set (i.e., four
relationships between members of two genealogically identical families),
to see whether it could nevertheless generalize and produce the correct
output. The good performance shown by two simulations was interpreted
by Hinton as evidence of the network’s ability to detect the common
structure shared by both families and to make analogical inferences from
one family tree to the other. However, we claim that this work’s good
results were clouded by unsuitable test items, and that it also lacked a
control condition to prove that analogical inferences were actually made.
Our study, which tackled these two aspects through experimental
manipulation, provides stronger evidence supporting Hinton’s conclusions.

Keywords: Analogical inference, structure sensitivity, systematicity.

1 Introduction

In a work that has become a recurrent citation in the field, Hinton [1] paved the
way for connectionist modelling of higher-order cognitive processes by exploring
neural networks’ ability to make analogical inferences to complete their knowledge
of a domain by learning another with the same relational structure [2].
This ambitious goal was tackled by means of a pattern association task in
which two families with isomorphic genealogical trees (see Fig. 1) were involved.
Through supervised learning, a multilayer perceptron was trained to point at
the right relative given one family member and the relationship between them
as inputs. For example, given ‘Roberto’ and ‘son’ as inputs, the network had to
learn to point at ‘Emilio’, since Emilio is Roberto’s son.
In this so-called family trees task, only twelve kinship terms were allowed—
mother, daughter, sister, wife, aunt, niece and their male equivalents—so the
relational structure of a family was defined by 52 relationships. These were coded
as triplets like {person 1, relationship, person 2}, in which person 1 and
relationship were the network’s inputs and person 2 its desired output.
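A minimal sketch of this coding, assuming localist (one-hot) input and output vectors as in [1]; the member and relation lists below are truncated to four entries for illustration:

```python
PEOPLE = ["Colin", "Charlotte", "James", "Victoria"]   # illustrative subset of the 24 members
RELATIONS = ["father", "mother", "son", "daughter"]    # illustrative subset of the 12 terms

def one_hot(index, size):
    """Localist code: all zeros except a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_triplet(person1, relationship, person2):
    """Map a {person 1, relationship, person 2} triplet to (input, target) vectors."""
    x = (one_hot(PEOPLE.index(person1), len(PEOPLE))
         + one_hot(RELATIONS.index(relationship), len(RELATIONS)))
    t = one_hot(PEOPLE.index(person2), len(PEOPLE))
    return x, t

x, t = encode_triplet("Colin", "father", "James")
```

With all 24 members and 12 terms, the input vector has 36 components and the target 24, matching the architecture described in Section 2.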

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 221–228, 2012.
© Springer-Verlag Berlin Heidelberg 2012

Fig. 1. Isomorphic family trees of British and Italian relatives. The symbol ‘=’ means
“married to”. Taken from [1].

Hinton trained the network on only 100 of the 104 triplets concerning both
families and checked if it could guess the remaining four triplets. For each of
these unlearned triplets, the filler of the person 1 role had never been associated
with the filler of the person 2 role during training, but it had actually been
associated with other family members in the person 2 role. Thus, solving any of
those triplets would not be explicable in terms of correlation between fillers, but
in terms of some structure-based process, arguably an analogical inference.
Two simulations were run and subjected to this generalization test. Results
were surprisingly good: one of them pointed at the correct filler of the person 2
role in the four test triplets, whereas the other one got three test triplets right.
Besides, an indirect analysis of one simulation’s internal representations of fillers
of the person 1 role showed that they were organized around structural features
like generation and family branch, and that those corresponding to analogous
fillers (e.g., Charlotte and Sophia) were similar. So, it was concluded that the
network had systematically solved the test triplets by discovering both relational
structures and matching analogous concepts together.
However, despite the sound rationale supporting this finding, some aspects in
the way it was obtained cast doubt on it. So, we considered it advisable to replicate
Hinton’s work [1] in order to obtain stronger evidence for his conclusions.
The rest of the paper is organized as follows. First, we will describe the
architecture of that perceptron. Then, those problematic aspects and the
corresponding methodological changes that they impose will be identified. Finally,
after detailing the learning procedure of our simulations, we will show and discuss
the results obtained in our review.

2 Network Architecture

Hinton’s [1] network was a five-layer feed-forward perceptron (see Fig. 2). Each
neuron had a logistic transfer function.
The input layer was divided into two groups of 24 and 12 units, which coded
the fillers of person 1 and relationship roles, respectively. The output layer con-
sisted of a single group of 24 units, which coded the filler of person 2 role.

Fig. 2. Architecture of the perceptron

Since hidden layers allow perceptrons to detect abstract features in task-relevant
patterns that are not explicitly represented in them (e.g., the XOR problem),
Hinton included three hidden layers, so that the network could supposedly
identify implicit structural features regarding family members, grasp the common
genealogical structure and eventually make analogical inferences.
Specifically, hidden layer 1 was divided into two groups of six units, in which
distributed representations of a family member in the person 1 role and a kinship
relationship were formed, respectively. Hidden layer 2 consisted of 12 units that
represented the network’s input compound. Finally, six units in hidden layer 3
supported the representation of a family member in the person 2 role.
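The layer sizes above can be made concrete with a forward-pass sketch. The weights here are arbitrary random values (the real network was trained with back-propagation), so only the shapes and the logistic units reflect the described architecture:

```python
import math
import random

random.seed(0)

def logistic(z):
    """Logistic transfer function used by every unit."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, n_out):
    """One fully connected layer of logistic units with random weights (untrained sketch)."""
    return [logistic(sum(random.uniform(-0.3, 0.3) * xi for xi in x))
            for _ in range(n_out)]

# Input: 24 person-1 units and 12 relationship units, encoded separately.
person1 = [0.0] * 24; person1[0] = 1.0
relation = [0.0] * 12; relation[0] = 1.0

h1 = dense(person1, 6) + dense(relation, 6)  # hidden layer 1: two groups of 6 units
h2 = dense(h1, 12)                           # hidden layer 2: the input compound
h3 = dense(h2, 6)                            # hidden layer 3: person-2 representation
out = dense(h3, 24)                          # output layer: 24 person-2 units
```

Note that the two input groups feed two separate groups of hidden layer 1, which is what lets the network form independent distributed representations of person 1 and of the relationship.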

3 Methodological Improvements on Hinton’s Work

As we said in the introduction, we found some critical aspects regarding the


way in which evidence that the network could make analogical inferences was
obtained in [1]. We considered that it was not sufficiently proved that the network
made analogical inferences to solve unlearned triplets.
First, not all the triplets are, if unlearned, solvable by analogy-based
generalizations only. Solving some triplets can be equally due to a simpler process.
Take, for instance, the five triplets in which Colin acts as person 1:

1. {Colin, father, James}


2. {Colin, mother, Victoria}
3. {Colin, uncle, Arthur and Charles}
4. {Colin, aunt, Margaret and Jennifer}
5. {Colin, sister, Charlotte}

and compare them with the five triplets in which Charlotte acts as person 1:

1. {Charlotte, father, James}


2. {Charlotte, mother, Victoria}
3. {Charlotte, uncle, Arthur and Charles}
4. {Charlotte, aunt, Margaret and Jennifer}
5. {Charlotte, brother, Colin}

As can be seen, except for one out of five cases, these members are perfectly
interchangeable as fillers of the person 1 role.
If from these ten triplets {Colin, father, James} were reserved for the
generalization test, that similarity between Colin and Charlotte regarding the output
pattern would be less marked, but they could still be considered as almost
equivalent fillers of the person 1 role.
Now, it is well known that perceptrons with one or more hidden layers come
up with similar internal representations of input patterns not only when these
are alike, but also when they are associated with the same output patterns. This
way they escape the so-called “tyranny of similarity” [3]. In the family trees task,
this means that those members who are mostly associated with the same fillers
of the person 2 role would have similar internal distributed representations.
Returning to our example, Colin and Charlotte would very likely have similar
hidden layer 1 representations. So, when the input compound {Colin, father}
were presented to the network, hidden layer 1 units would transform it into a
vector similar to that corresponding to the input compound {Charlotte, father},
which was associated with the output pattern of {James} during training. Thus,
the network would produce the correct answer with a high probability, but not
because of an analogy-based generalization from the Italian tree.
This example highlights the need to carefully select the test triplets to be
sure that they are only solvable through analogical inferences. But, as far as we
know, this aspect was not taken into account by Hinton [1]. As a matter of fact,
he did not report which four test triplets were used, but he did report, however,
their correct answers. From these, we can deduce the triplets from which those
four must have been selected. These are:

1. {Christine, husband, Andrew}


2. {Arthur, sister, Victoria}
3. {James, wife, Victoria}
4. {Emilio, sister, Lucia}
5. {Marco, wife, Lucia}
6. {James, father, Andrew}
7. {Jennifer, father, Andrew}
8. {Christopher, daughter, Victoria}


9. {Penelope, daughter, Victoria}
10. {Colin, mother, Victoria}
11. {Charlotte, mother, Victoria}
12. {Roberto, daughter, Lucia}
13. {Maria, daughter, Lucia}
14. {Alfonso, mother, Lucia}
15. {Sophia, mother, Lucia}
Applying the same logic we used in the case of Colin and Charlotte, we can
conclude that the last ten triplets can be solved on the basis of the similarity
between family members acting as person 1. Even in the case of James and
Jennifer, who are not as equivalent as Colin and Charlotte are, it could still be
argued that their hidden layer 1 representations were similar enough to
lead to the correct relative with some probability.
But this same argument could not be held up if any of the first five triplets were
solved. For example, if the triplet {Christine, husband, Andrew} were reserved
for the generalization test, then, when the input compound {Christine, husband}
were presented to the network, hidden layer 1 units would represent it quite close
to the compound {Andrew, husband}, since Christine and Andrew are parents of
the same children and are not related to any other family member. However,
that similarity would be rendered useless, as Andrew is not his own husband
and there are no reflexive relationships that associate a person with himself.
So, if that triplet were solved it would be necessary to turn to other inferential
processes—arguably structure-based processes—as an explanation.
As we see, only one third (the first five) of the triplets Hinton possibly used
were not problematic for testing the network’s analogy-based generalization
capacity. We could now compute the probability that Hinton’s generalization test
had a certain number of problematic triplets, but, due to space limitations, we
will just claim that such triplets were very likely used.
In our review, we avoided those problematic triplets by selecting {Christine,
husband, Andrew} and {James, wife, Victoria} as test triplets. We did not add
any of the triplets with Lucia as person 2 because it would have reduced the
isomorphism between both families—the hypothetical key to solving the test
triplets—at a critical point, since Lucia is Victoria’s Italian analogue.
However, although our two test triplets would arguably induce structure-based
inferential processes, there was room for doubt as to whether these were analogical
inferences or not. It could be possible that the relational structure of one family
conveyed enough information for the network to produce the correct answer.
It was difficult to argue against this possibility, but there was a simple way
to control for it. We only had to compare the network’s performance on the
generalization test when both families were learned with its performance when
only one family was learned. By introducing this latter control condition, we
would be able to draw a neat distinction between analogy-based generalizations
and other structure-based inferences and also test the working hypothesis more
properly, because if the network were capable of detecting the common
genealogical structure and basing its inferences on that knowledge, performance results on
the generalization test should be significantly better after learning about both
families compared to the results after learning one family.
Thus, our simulations were first trained on 50 British triplets only—the
remaining two were reserved for the generalization test—and then they were
trained simultaneously on the 50 British and the 52 Italian triplets.
Besides, to provide evidence of the influence of test triplets on the network’s
performance, we also tested our simulations with two unsuitable triplets, {James,
father, Andrew} and {Penelope, daughter, Victoria}, expecting that they would
be easily solved due to the similarity of fillers of the person 1 role.

4 Learning Procedure

500 simulations were run to strengthen the external validity of our results. The
Neural Network Toolbox of MATLAB (Version 7.4.0.287) was used.
We followed all the procedural specifications detailed in [1] and used the same
parameter values, except for the learning rate and the coefficient of the
momentum term in the second phase of the learning process, since using Hinton’s
values we could not obtain a significant reduction of the cost function (a modified
version [1] of the sum of squared errors [SSE]).
Instead, we used a learning rate of 0.04 and a coefficient of .95 in that second
phase, and allowed 50,000 epochs of training.1 This way, in the case of the two
suitable test triplets, the 500 simulations that learned the 50 British triplets
achieved a mean (standard deviation in parentheses) SSE of 0.046 (0.005) and
of 0.127 (0.016) when they learned the 102 triplets concerning both families.
Likewise, in the case of the two unsuitable test triplets, the corresponding values
were 0.048 (0.006) and 0.129 (0.015), respectively. Thus, each training set was
almost perfectly learned irrespective of the generalization test triplets.

5 Results and Discussion

Performance on each generalization test was scored following Hinton’s [1]
criterion, according to which test triplets were considered solved only if the output
unit (or units) corresponding to the right answer had an activation level above
.5 and all the other output units were below or equal to .5.
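This scoring criterion amounts to a small predicate; the sketch below is our own formulation (the function and argument names are not Hinton’s), where `correct_indices` marks the output unit(s) of the right relative(s):

```python
def triplet_solved(output, correct_indices, threshold=0.5):
    """Hinton's criterion: every correct output unit must be above the
    threshold and every other output unit at or below it."""
    correct = set(correct_indices)
    return all((o > threshold) if i in correct else (o <= threshold)
               for i, o in enumerate(output))

# A triplet with two right answers, e.g. {Colin, uncle, Arthur and Charles}:
solved = triplet_solved([0.9, 0.7, 0.2, 0.4], correct_indices=[0, 1])
```

Note that the criterion is all-or-nothing: a single spurious unit above .5, or a single correct unit at .5 or below, counts the whole triplet as unsolved.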
Results corresponding to the two mentioned training conditions on the two
different generalization tests are displayed in Table 1. For the sake of brevity, we
will refer to the test constituted by the triplets {Christine, husband, Andrew}
and {James, wife, Victoria}, which were thought to induce structure-based
inferences, as generalization test A, and to that constituted by the triplets {James,
father, Andrew} and {Penelope, daughter, Victoria}, which could arguably be
solved by simpler processes, as generalization test B.
1 Our training times contrast with Hinton’s [1], who stated that one of his simulations
achieved null SSE after 1500 epochs. However, according to Melz [4], Hinton himself
declared that the training times reported in his paper [1] were “probably erroneous”.

Table 1. Distribution of the 500 simulations under the two training conditions
according to their performance on both generalization tests

                                              Test triplets solved
Training set                                   Both   One   None
Generalization test A
50 British triplets                               0   280    220
50 British triplets and 52 Italian triplets      33   273    194
Generalization test B
50 British triplets                             360   125     15
50 British triplets and 52 Italian triplets     323   152     25

From these results, several aspects must be highlighted. First, performance
results on generalization test A were generally worse than on test B, which were
quite good, as we expected. In the case of learning only the British triplets, a
chi-square test of independence between the two generalization tests regarding
the three performance categories showed that the network’s ability to solve test
triplets depended on the test itself, χ2(2, N = 1000) = 598, p < .001. The same
conclusion was drawn in the case of learning the British and the Italian triplets,
χ2(2, N = 1000) = 401, p < .001. In both cases, generalization test A was harder
to solve than generalization test B. This confirms that test triplets themselves
are a factor to be controlled for and questions Hinton’s [1] performance results.
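The first statistic can be reproduced directly from the Table 1 counts with an ordinary Pearson chi-square computation (a hand-rolled sketch; a statistics package would normally be used):

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Test A vs. test B after training on the 50 British triplets (counts from Table 1):
stat = chi_square([[0, 280, 220],     # generalization test A: both / one / none
                   [360, 125, 15]])   # generalization test B
```

Rounded, `stat` recovers the reported χ2(2, N = 1000) = 598.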
Secondly, sticking to generalization test A, 33 simulations that also learned
the Italian triplets could solve both test triplets, whereas none of the simulations
that learned only the British triplets achieved such performance. This striking
difference suggested that learning an analogous domain of concepts did influence
generalization, as Hinton defended, although to a lesser extent. A chi-square test
of the results on the three performance categories regarding the two training sets
confirmed that learning about the Italian isomorphic family improved
generalization, χ2(2, N = 1000) = 34.72, p < .001. It should be noted that the
proportion of simulations that managed to solve this test after learning both
families should not be taken as a final quantitative measure of the perceptron’s
generalization capacity, but merely as methodologically stronger evidence of that
capacity, as it was significantly greater than the corresponding proportion when
only the British family had been learned.
The same chi-square test of the results on generalization test B showed that
performance was again dependent on the training set, χ2(2, N = 1000) = 7.14,
p = .028, but now learning the Italian family made performance worse. Yet, the
fact that 72% of the simulations that had not learned the analogous family solved
test B indicates that it did not appeal to analogy-based generalization as test A
did; thus, its results should not be compared to results on test A.
Finally, it was also remarkable that the test triplet {James, wife, Victoria} was
harder to solve than {Christine, husband, Andrew}, irrespective of the training
set, as can be seen in Table 2. Actually, a pair of chi-square tests indicated
that solving {James, wife, Victoria} was dependent on the training set,
χ2(1, N = 47) = 47, p < .001, but solving {Christine, husband, Andrew} was
not, χ2(1, N = 572) = 0.25, p = .882. In other words, despite our careful
selection of test triplets, only {James, wife, Victoria} was really sensitive to
learning an analogous domain. On the contrary, the triplet {Christine, husband,
Andrew} was mainly solvable due to grasping the relational structure of the
British family.
Future work should be done to explain this inequality and to understand why
so few simulations benefited from learning the analogous family tree.

Table 2. Frequency of simulations that solved some triplet of generalization test A
under the two training conditions as a function of the solved triplet

                                              Test triplet solved
Training set                                  {Christine, husband,   {James, wife,
                                               Andrew}                Victoria}
50 British triplets                                280                     0
50 British triplets and 52 Italian triplets        292                    47

6 Conclusions

We reviewed one of the first works [1] that tackled connectionist modelling of
analogical thinking, as some methodological aspects raised doubts about its results.
We focused on the network’s reportedly good capacity to acquire knowledge of
a domain via generalizations from an analogous domain, and proved, through
experimental manipulation, that while such good performance was probably an
artifact, there is strong evidence that a multilayer perceptron is sensitive
to being trained on analogous domains with a common relational structure.

Acknowledgments. This research was supported by Junta de Andalucía
P08-SEJ-03586.

References
1. Hinton, G.E.: Learning Distributed Representations of Concepts. In: Proceedings of
the Eighth Annual Conference of the Cognitive Science Society, pp. 1–12. Lawrence
Erlbaum Associates, New Jersey (1986)
2. Clement, C.A., Gentner, D.: Systematicity as a Selection Constraint in Analogical
Mapping. Cognitive Sci. 15, 89–132 (1991)
3. McLeod, P., Plunkett, K., Rolls, E.T.: Introduction to Connectionist Modelling of
Cognitive Processes. Oxford University Press, New York (1998)
4. Melz, E.R.: Developing Microfeatures by Analogy. In: Proceedings of the Fourteenth
Annual Conference of the Cognitive Science Society, pp. 42–47. Lawrence Erlbaum
Associates, New Jersey (1992)
An Efficient Way of Combining SVMs for Handwritten
Digit Recognition

Renata F.P. Neves, Cleber Zanchettin, and Alberto N.G. Lopes Filho

Center of Informatics, Federal University of Pernambuco, Recife, Brazil


{rfpn,cz,anglf}@cin.ufpe.br

Abstract. This paper presents a method of combining SVMs (support vector
machines) for multiclass problems that ensures a high recognition rate and a
short processing time when compared to other classifiers. This hierarchical
SVM combination considers high recognition rate and short processing time
as evaluation criteria. The case study used was the handwritten digit
recognition problem, with promising results.

Keywords: pattern recognition, handwritten digit classifier, support vector
machine.

1 Introduction
Nowadays the world is digital. Technology has become ubiquitous in people’s
lives, and some human tasks, such as handwriting recognition, voice recognition
and face recognition, are now machine tasks. The main recognition process [1][2]
used in this kind of application requires the following steps: data acquisition;
preprocessing of the data to eliminate noise; segmentation, where the objects (text,
numbers, faces, etc.) to be recognized are located and separated from the background;
feature extraction, where the main features of each object are extracted; and finally
recognition, or classification, where the objects are labeled based on their features.
This paper focuses on the classification task, and we use the handwritten digit
recognition problem as a case study because it exhibits some typical classification
issues. For example, patterns can be ambiguous, or some features may be shared by
more than one class. An example of this problem is presented in Fig. 1. In Fig. 1a
and Fig. 1c the correct value of the image is seven, and in Fig. 1b, four. But Figs. 1a
and 1b are similar and could be the same digit, and Fig. 1c could be confused with
the digit one. Because of this, building a classifier that generalizes well is a hard
task. In some cases the best choice is to use context information to differentiate one
class from another.
Hidden Markov Models (HMMs) [3] are frequently used to analyze context and
improve the recognition rate, but their main disadvantage is processing time:
context-modeling techniques are usually slow. Thus, our research focuses on the
optimization and combination of classical approaches, trying to introduce more
knowledge into the classifier.
A brief overview of the handwritten digit recognition research in recent years shows
that classical classifiers such as the multilayer perceptron (MLP) [5], k-nearest
neighbor (kNN) [2] and support vector machine (SVM) [6] are extensively used. Some

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 229–237, 2012.
© Springer-Verlag Berlin Heidelberg 2012
authors tried combinations of these classifiers to improve results
[7][8][10][11][12]. The main problem of combining different techniques is that
although we combine their advantages, we also join their disadvantages.
The MLP [5] is a powerful classifier for multiclass problems, but there is a
disadvantage when using back-propagation as the learning algorithm: training can
stop in a local minimum. It is possible to use the momentum strategy to escape from
local minima; however, continuing the training phase can overfit the weights,
decreasing the generalization capability. kNN [2] classifies a sample based on the
distance from the sample to the nearest patterns in the training set. Thus, the more
patterns there are in the training set, equally distributed between the classes, the
higher the recognition rate. But the time to classify a sample depends on the number
of patterns in the training database; therefore, this technique is usually slow.

Fig. 1. Images from the NIST database [3]: a) handwritten digit 7; b) handwritten
digit 4; and c) handwritten digit 7 again.

The SVM [6] is considered the best binary classifier because it finds the maximal
margin of separation between two classes. The fact that the SVM is a binary
classifier is its greatest disadvantage, as most recognition tasks are multiclass
problems. To address this, some authors combine SVMs [8] or use the SVM as a
decision-maker classifier [9].
Based on these assumptions, this paper introduces a hierarchical SVM combination
that provides a highly accurate recognition rate and a short response time when
applied to handwritten digit recognition.
The present study is structured as follows: related works are presented in Section 2;
the proposed SVM combination architecture in Section 3; the experiments and
results in Section 4; and the conclusions in Section 5.

2 Related Works
Support Vector Machine (SVM) [6][5] is a binary classification technique. The training
phase consists of finding the support vectors for each class and creating a function that
represents an optimal separation margin between the support vectors of different
classes. Consequently, it is possible to obtain an optimal hyperplane for class separation.
Analyzing the SVM and the characteristics presented above, it seems similar to the
perceptron [1] because it also tries to find a linear function that separates the classes.
But there are two main differences: first, the SVM discovers the optimal linear
function, while the perceptron seeks any linear separation function; second, the
SVM can deal with non-linearly separable data. To do so, the SVM uses a kernel
function to increase the feature dimensionality, consequently making the data
linearly separable.
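The dimensionality-increase idea can be illustrated with a toy explicit feature map (a sketch of the kernel intuition only; a real SVM computes the kernel implicitly and never materializes the lifted space). The 1-D points below cannot be split by any single threshold, but after the hypothetical mapping phi(x) = (x, x²) one straight line separates them:

```python
def lift(x):
    """Explicit feature map phi(x) = (x, x*x): the idea behind a polynomial
    kernel, which makes the classes below linearly separable."""
    return (x, x * x)

inner = [-1.0, 1.0]   # class A: surrounded by class B on the real line
outer = [-3.0, 3.0]   # class B: no 1-D threshold separates it from A

# In the lifted 2-D space, the horizontal line x2 = 5 separates the classes:
separable = (all(lift(x)[1] < 5 for x in inner)
             and all(lift(x)[1] > 5 for x in outer))
```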
There are two classical ways to work with multiple classes using SVMs:
one-against-all and one-against-one [13]. In the one-against-all approach, one SVM
is created for each class. If we have 10 classes, for example, as in digit recognition,
we will have 10 SVMs, one for each digit. In this way we train SVM(0) to
differentiate class 0 from the other classes, labeling its patterns as 1 and the other
patterns as 0; SVM(1) to differentiate class 1 from the other classes in the same way,
and so on. In the recognition phase, the pattern is submitted to the 10 SVMs, and the
SVM that replies with label 1 indicates the class of the pattern [2]. The training set is
the same database for all SVMs, changing just the labels of the patterns: if the set is
used to train SVM(i), patterns of class i are labeled 1 and the other patterns 0.
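The relabeling and recognition logic of one-against-all can be sketched as below. The centroid-based model is only a stand-in for a trained binary SVM, and all names are our own, not from the paper:

```python
class CentroidBinaryClassifier:
    """Stand-in for a binary SVM: predicts 1 when a sample lies nearer the
    centroid of the positive (label 1) class than the negative (label 0) one."""
    def fit(self, samples, labels):
        pos = [s for s, y in zip(samples, labels) if y == 1]
        neg = [s for s, y in zip(samples, labels) if y == 0]
        self.pos_c = sum(pos) / len(pos)
        self.neg_c = sum(neg) / len(neg)
        return self

    def predict(self, sample):
        return 1 if abs(sample - self.pos_c) <= abs(sample - self.neg_c) else 0

def train_one_vs_all(samples, labels, classes):
    """One binary model per class: its class is relabeled 1, every other class 0."""
    return {c: CentroidBinaryClassifier().fit(
                samples, [1 if y == c else 0 for y in labels])
            for c in classes}

def predict_one_vs_all(models, sample):
    """The class whose model replies 1 is the answer (first one that fires)."""
    for c, model in models.items():
        if model.predict(sample) == 1:
            return c
    return None

# Three toy 1-D classes standing in for the 10 digit classes:
samples = [0.0, 1.0, 4.0, 5.0, 20.0, 21.0]
labels  = [0, 0, 1, 1, 2, 2]
models = train_one_vs_all(samples, labels, classes=[0, 1, 2])
```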
Neves et al. [8] presented the one-against-one combination applied to handwritten
digit recognition. The authors used a different SVM for each possible pair of classes.
In this approach, symmetric pairs, such as 0-1 and 1-0, are considered the same pair,
whereas pairs of the same class, such as 1-1, are not considered. In the case of
handwritten digit recognition, there are 10 classes, 0 to 9. Pairing the classes, we
have a grand total of 45 pairs. Each class appears in 9 SVMs; for example, class 0 is
present in the following pairs: 0-1, 0-2, 0-3, 0-4, 0-5, 0-6, 0-7, 0-8, 0-9.
The training phase consists of three steps:
• Separate the database, creating 45 data subsets, one for each pair of SVMs.
  Each subset contains only the patterns of its respective classes. For
  example, if the pair is responsible for differentiating 0 from 1, the subset
  (0,1) will contain only 0 and 1 patterns;
• Find the kernel function that best separates the classes;
• Train all 45 SVMs.
The classification phase consists of submitting the pattern to all 45 SVMs and
identifying the most frequent class among the outputs of the SVMs.
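The pair enumeration and majority vote can be sketched as follows; the 45 pairwise SVM outputs are simulated here, and the names are illustrative rather than from [8]:

```python
from itertools import combinations
from collections import Counter

CLASSES = list(range(10))  # the ten digit classes

# One data subset (and one SVM) per unordered pair of classes:
pairs = list(combinations(CLASSES, 2))   # 45 pairs: (0,1), (0,2), ..., (8,9)

def predict_one_vs_one(pair_predictions):
    """Majority vote: `pair_predictions` maps each pair to the class its SVM chose."""
    votes = Counter(pair_predictions.values())
    return votes.most_common(1)[0][0]

# Hypothetical outputs of the 45 pairwise SVMs for one sample: every SVM whose
# pair contains 7 votes for 7; the rest (arbitrarily) vote for the smaller class.
outputs = {(a, b): (7 if 7 in (a, b) else a) for (a, b) in pairs}
winner = predict_one_vs_one(outputs)
```

Since each class takes part in exactly 9 pairwise decisions, the true class can collect at most 9 votes, which is what `winner` receives here.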
The main advantage of this algorithm comes from the optimal hyperplane
separation given by each SVM: it produces a highly accurate recognition rate.
However, the number of SVMs combined depends on the number of classes. If the
problem has many classes, the system’s processing time will increase.
Another way of using the SVM for handwritten digit recognition is as a decision
classifier, as proposed by Bellili et al. [9]. In this work the authors observed that the
correct class is present in the highest MLP output in 97.45% of cases. However, if
the two highest MLP outputs are considered, the percentage of cases containing the
correct class increases to 99%. The authors then verified which main pairs of classes
were being confused. In their proposed method, when these pairs are present in the
two highest MLP outputs, an SVM decides which output corresponds to the correct
class; in all other cases, the MLP output is used.
In [7] the main idea is to increase the kNN recognition rate by using the SVM as a
decision-maker classifier, similarly to Bellili et al. [9]. The adaptation in this case is
to take the two most frequent classes among the k nearest neighbors and to use the
SVM to decide between these two classes. It is a satisfactory technique for
applications where the cost of a misclassification is high and processing time or
computational cost are not essential concerns.
In Ciresan et al. [17] the digit recognition task is performed using an MLP. The
authors argue that the main problem in finding the correct MLP architecture is to
produce a robust classifier from the training dataset; their proposal is to provide an
MLP with enough hidden layers and neurons. Camastra [18] combines SVMs with
neural gas. The method uses a neural gas network to verify whether the characters
are capitalized or not, and then defines whether the upper and lower case forms of a
character can be included in a single class. The characters are then submitted to the
SVM recognizer to obtain the final classification.
In [19] the authors create an ensemble classifier, using gating networks to combine
the outputs of three different neural networks. In [20] a method is proposed that uses
a new feature extraction technique based on recursive subdivision of the character
image. Combinations of MLP and SVM have also been used to recognize
non-Western languages.

3 Proposed Algorithm
After analyzing the state of the art, we propose another simple SVM combination. The
main idea is to create a hierarchical SVM structure. The first level is composed by a set
of SVMs. There is one SVM for each class pair but if one class is in one pair it cannot
be in other pairs. For example, in the case of digit recognition, we have 10 possible
classes (outputs): 0 to 9. The first level will have five SVMs, one for each of these
pairs: 0-1, 2-3, 4-5, 6-7 and 8-9.
The pattern will be classified by each SVM in the first level. It is expected that the
SVM trained with the correct class correctly classifies the sample, while the
others may choose either class of their pairs. The second level will combine the outputs
obtained in the first level, using the same strategy as the previous level. The process
continues until there is only one output. An example of this hierarchical structure is
shown in Fig. 2, where the letters a, b, x, y and i represent the outputs given by the
SVMs and the numbers in parentheses represent the pair that each SVM can
differentiate.
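The tournament-style reduction described above can be sketched as follows. This is an illustration, not the authors' implementation: `decide` is a hypothetical callback standing in for a trained pairwise SVM, and must return one of its two argument classes.

```python
def hierarchical_classify(classes, decide):
    """Reduce a list of candidate classes to a single winner by repeatedly
    pairing candidates and letting a binary classifier pick one per pair.
    Returns the final class and the number of pairwise decisions made."""
    candidates = list(classes)
    decisions = 0
    while len(candidates) > 1:
        next_level = []
        for i in range(0, len(candidates) - 1, 2):
            next_level.append(decide(candidates[i], candidates[i + 1]))
            decisions += 1
        if len(candidates) % 2:  # an unpaired candidate advances directly
            next_level.append(candidates[-1])
        candidates = next_level
    return candidates[0], decisions
```

For 10 classes this performs 5 + 2 + 1 + 1 = 9 pairwise decisions, i.e. n - 1 decisions for n classes, since each decision eliminates exactly one candidate.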

Fig. 2. Hierarchical SVM structure for digit recognition

4 Methods, Experiments and Results


The images used in the experiments were extracted from the NIST SD19 database [3],
which is a numerical database available from the American National Institute of
Standards and Technology. Each image in this database contains a varied number of
digits, as presented in Fig. 3.

Fig. 3. Sample image of NIST SD19 database [4]


An Efficient Way of Combining SVMs for Handwritten Digit Recognition 233

The images were separated into isolated digits using an algorithm based on
connected component labeling [10]. Each label corresponds to an isolated digit. After
segmentation, vertical and horizontal projections [15] were used to centralize the digit
in the image. Images larger than 20x25 were cropped by removing only the extra white
borders. If the object itself is larger than 20x25, the white border is eliminated
and the digit is resized. The size (20x25) was selected because the majority of digits are
approximately of this size.
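A minimal sketch of projection-based border removal, assuming a binary image stored as a list of rows of 0/1 pixels with at least one foreground pixel (an illustration of the technique, not the authors' code):

```python
def crop_white_borders(img):
    """Crop rows/columns with no ink, using the horizontal and vertical
    projections (per-row and per-column sums of foreground pixels)."""
    row_proj = [sum(row) for row in img]
    col_proj = [sum(col) for col in zip(*img)]
    top = next(i for i, v in enumerate(row_proj) if v)
    bottom = len(img) - next(i for i, v in enumerate(reversed(row_proj)) if v)
    left = next(i for i, v in enumerate(col_proj) if v)
    right = len(col_proj) - next(i for i, v in enumerate(reversed(col_proj)) if v)
    return [row[left:right] for row in img[top:bottom]]
```

The same projections can also be used for centering: the bounding box found here gives the offsets needed to place the digit in the middle of the 20x25 frame.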
Each digit was manually separated and labeled into classes to be used in the
supervised training of the classifiers. The final database contains a total of 11,377 digits,
with an average of 1,150 digits per class. This digit database was separated into a
training set and a test set. The training set contains 7,925 samples, approximately
800 digits per class. The test set contains 3,452 samples, approximately
350 digits per class.
The feature vector is the same for all classifiers: the image matrix structured as
a vector. However, before converting it into a vector, the image was resized again to
12x15 in order to reduce the dimensionality of the feature vector, generating a vector
with 180 binary features. This size was empirically defined based on previous
experiments.
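As an illustration of this step, the 180-element vector can be built as below. The nearest-neighbour sampling is an assumption (the paper does not specify the interpolation used for resizing):

```python
def to_feature_vector(img, out_rows=15, out_cols=12):
    """Downsample a binary image (list of rows) to out_rows x out_cols by
    nearest-neighbour sampling and flatten it row-wise into 0/1 features."""
    in_rows, in_cols = len(img), len(img[0])
    rs = [round(r * (in_rows - 1) / (out_rows - 1)) for r in range(out_rows)]
    cs = [round(c * (in_cols - 1) / (out_cols - 1)) for c in range(out_cols)]
    return [1 if img[r][c] else 0 for r in rs for c in cs]
```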

4.1 Algorithms: Training and Configurations


The methods used in the experiments were MLP, kNN, MLP-SVM, 45SVMs (one-
against-one), kNN-SVM, SVM (one-against-all) and the proposed hierarchical SVM
structure. During the experiments we tried to find the best configuration of each classifier.
The configurations used are described below:
a) Multilayer Perceptron
• Number of inputs: 180 (based on the feature vector);
• Number of hidden layers: one;
• Number of hidden layer nodes: 180 (this number was varied during the
experiments; 180 gave the best topology found);
• Number of outputs: 10 (number of possible classes);
• Hyperbolic tangent as activation function for all nodes;
• Number of training epochs: 30,000;
• Back-propagation (gradient descent) as the training algorithm.
b) Support Vector Machine (pairs)
As previously defined, a training database was created for each possible pair,
and forty-five pairs were used.
The polynomial and Gaussian radial basis functions (RBF) were used as the kernel
function. The polynomial kernel presented the best results and was selected for SVMs
pairs and SVMs one-against-all.
c) K Nearest Neighbor
The main kNN parameters are the value of k and the distance metric used. The k
value was varied between 3 and 11. The distances used in the experiments were
Euclidean, Manhattan and Minkowski. The best results were found with k equal to 3 and
the Euclidean distance.

All algorithms were implemented using Matlab™ [16] version R2010a. All
algorithms were trained after parameter selection. The same trained MLP and
SVMs were used throughout the methods, as were the same kNN parameters.

4.2 Experiments and Results


Two criteria were used to compare the seven algorithms (MLP, kNN, MLP-SVM [9],
45 SVMs – one-against-one [8], hierarchical SVMs combination, kNN-SVM [7] and
SVM – one-against-all). These criteria were processing time and recognition rate.
Table 1 presents the recognition rates obtained by the algorithms on the test set.
Table 2 presents the average and standard deviation of the processing time in seconds
to classify one pattern in each algorithm.
A statistical test of the hypothesis that the hierarchical SVM algorithm is the
fastest (Table 2) confirmed the hypothesis with 98% confidence.
Analyzing the SVM combinations, one-against-all and one-against-one presented
promising results, but they can be insufficient if the criteria for using the classifier
are processing time and high recognition rate. In one-against-all, the main difficulty
was to find a kernel function that expands the feature space so that one class becomes
linearly separable from the others. If the number of classes increases, the complexity of
finding an effective kernel function increases as well. In some cases, this combination
does not return a valid output: for example, after submitting one pattern to all SVMs,
all outputs were labeled as 0 and the classification was not possible.
In one-against-one, the number of SVMs used is based on the number of classes,
and can be obtained from formula (1) below. The architecture proposed in this paper
also depends on the number of classes, but the number of SVMs decreases
significantly, as shown in (3).
Number of SVMs in one-against-one = n * (n-1) / 2 (1)
Number of SVMs in one-against-all = n (2)
Number of SVMs in the new approach = n - 1 (3)
where n is the number of classes.
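The three counts can be checked directly; for the 10-digit case they give 45, 10 and 9 SVMs, respectively:

```python
def svms_one_against_one(n):
    # formula (1): one SVM per unordered pair of classes
    return n * (n - 1) // 2

def svms_one_against_all(n):
    # formula (2): one SVM per class
    return n

def svms_hierarchical(n):
    # formula (3): a tournament over n classes needs n - 1 eliminations
    return n - 1
```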
The techniques based on kNN obtained a high recognition rate but a long
processing time. Thus, these approaches are inadequate for online recognition
tasks, for example. The techniques based on SVM achieve good results if we consider

Table 1. Recognition Rate in the Test Set

Algorithm               Number of samples   Recognition rate (%)
MLP                     3,301               95.63
kNN                     3,350               97.05
45 SVMs                 3,375               97.76
Hierarchical SVMs       3,374               97.74
MLP-SVM                 3,362               97.39
kNN-SVM                 3,382               97.97
SVM (one-against-all)   2,818               81.63

Table 2. Processing time results in the Test Set

Algorithm               Average (s)   Standard deviation (s)
MLP                     0.0221        0.0049
kNN                     0.0674        0.0088
45 SVMs                 0.0603        0.0273
Hierarchical SVMs       0.0115        0.0065
MLP-SVM                 0.0252        0.0102
kNN-SVM                 0.0684        0.0104
SVM (one-against-all)   0.0177        0.0174

the recognition rate criterion. The one-against-all approach is the only SVM approach
that does not present a high recognition rate. However, analyzing it
carefully, we can see that among the 634 errors, 629 were unlabeled cases, in which all
SVMs returned the same classification so that it was not possible to classify the
pattern (the sample was rejected by the classifier). Among the 634 errors, only 15 were
actual misclassifications. This approach can be a good choice when rejection is
preferable to a classification error.
The proposed hierarchical SVM is the third best classifier for handwritten digits, but
the difference between the first and the second is very small, so in the statistical tests
they are equivalent. In this case, processing time is the main goal, because the kNN-
SVM and 45 SVMs techniques are the slowest and the proposed method is the fastest.
Thus, the hierarchical method stands out for its high recognition rate and
short processing time. In Fig. 4, the processing time and error rate of the evaluated
methods were normalized and plotted together in one graph. In this case, the best
method is the one with results closest to the origin of the graph. This analysis shows that
the proposed method is the best evaluated method.

Fig. 4. Comparison graphic between the algorithms

5 Conclusion
This paper presents a method for combining SVMs applied to the handwritten digit
recognition problem, targeting a short processing time and a high recognition rate.
After a brief study of the related work, it was found that classical classifiers are still

used to recognize handwritten text because of their low processing times and high
recognition rates. New approaches, such as those proposed by Neves et al. [8] and
Zanchettin et al. [7], increase the recognition rates but also increase processing time
and computational costs.
Based on these criteria, classical classifiers and classifiers built specifically for
handwritten digit recognition were implemented and tested. The proposed new
approach presented the best results considering processing time and recognition
rate together. The techniques based on kNN had the longest processing time and the highest
recognition rate. On the other hand, the hierarchical SVM combination obtained high
recognition rates with the shortest processing time. As verified by the experiments, this
SVM combination is the best choice when both criteria are important
requirements.
Future work will consider handwritten characters and attempt to combine the
proposed method with word classification methods.

Acknowledgment. This work was supported by CAPES (Brazilian Research Agency)


and Faculdade Boa Viagem (FBV).

References
1. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall Pearson
Education Inc. (2003)
2. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley–Intersciance (2001)
3. Plötz, T., Fink, G.A.: Markov models for offline handwriting recognition: a survey.
International Journal on Document Analysis and Recognition 12(4), 269–298 (2009)
4. NIST Special Database 19. Handprinted Forms and Characters Database,
http://www.nist.gov/srd/nistsd19.cfm
5. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice-Hall
(1999)
6. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
7. Zanchettin, C., Bezerra, B.L.D., Azevedo, W.W.: A KNN-SVM Hybrid Model for Cursive
Handwriting Recognition. In: IEEE Int. Joint Conf. on Neural Networks, Brisbane (2012)
8. Neves, R.F.P., Lopes-Filho, A.N.G., Mello, C.A.B., Zanchettin, C.: A SVM Based Off-
Line Handwritten Digit Recognizer. In: IEEE International Conference on Systems, Man,
and Cybernetics, vol. 1, pp. 510–515 (October 2011)
9. Bellili, A., Gilloux, M., Gallinari, P.: An MLP-SVM combination architecture for offline
handwritten digit recognition. Reduction of recognition errors by Support Vector Machines
rejection mechanisms. International Journal on Document Analysis and Recognition 5,
244–252 (2004)
10. Bhowmik, T.K., Ghanty, P., Roy, A., Parui, S.K.: SVM-based hierarchical architectures for
handwritten Bangla character recognition. International Journal on Document Analysis and
Recognition 12, 97–108 (2009)
11. Parsiavash, H., Mehran, R., Razzazi, F.: A robust free size OCR for omni-font
Persian/Arabic document using combined MLP/SVM. In: Proceedings of Iberoamerican
Congress on Pattern Recognition, pp. 601–610 (2005)
12. Camastra, F.: A SVM-based cursive character recognizer. Pattern Recognition 40, 3721–
3727 (2007)
An Efficient Way of Combining SVMs for Handwritten Digit Recognition 237

13. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines.
IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
14. Parker, J.R.: Algorithms for Image Processing and Computer Vision. John Wiley and Sons
(1997)
15. Gonzalez, R., Woods, C., Richard, E.: Digital Image Processing. Addison-Wesley (1992)
16. Mathworks MatlabTM – The language of technical computing,
http://www.mathworks.com/products/matlab/
17. Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, big, simple neural
nets for handwritten digit recognition. Neural Computation 22, 3207–3220 (2010)
18. Camastra, F.: A SVM-based cursive character recognizer. Pattern Recognition 40, 3721–
3727 (2007)
19. Zhang, P., Bui, T.D., Suen, C.Y.: A novel cascade ensemble classifier system with a high
recognition performance on handwritten digits. Pattern Recognition 40, 3415–3429 (2007)
20. Vamvakas, G., Gatos, B., Perantonis, S.J.: Handwritten character recognition through two-
stage foreground sub-sampling. Pattern Recognition (43), 2807–2816 (2010)
Comparative Evaluation of Regression Methods
for 3D-2D Image Registration

Ana Isabel Rodrigues Gouveia1,2, Coert Metz3, Luís Freire4, and Stefan Klein3
1 CICS-UBI – Health Sciences Research Centre, University of Beira Interior,
Covilhã, Portugal
2 Institute of Biophysics and Biomedical Engineering, University of Lisbon, Lisbon, Portugal
3 Biomedical Imaging Group Rotterdam, Depts. of Medical Informatics & Radiology,
Erasmus MC, Rotterdam, the Netherlands
4 Escola Superior de Tecnologia da Saúde de Lisboa, Instituto Politécnico de Lisboa,
Lisbon, Portugal

Abstract. We perform a comparative evaluation of different regression


techniques for 3D-2D registration-by-regression. In registration-by-regression,
image registration is treated as a nonlinear regression problem that relates
image features of 2D projection images to the transformation parameters of the
3D image. In this work, we evaluate seven regression methods: Multiple Linear
and Polynomial Regression (LR and PR), k-Nearest Neighbour (k-NN),
Multiple Layer Perceptron with conjugate gradient optimization (MLP-CG) and
with Levenberg-Marquardt optimization (MLP-LM), Radial Basis Function
network (RBF) and Support Vector Regression (SVR). The experiments are
performed using simulated X-ray images (DRRs) of nine coronary vessel trees,
allowing us to compute the mean target registration error (mTRE) to the ground
truth. All methods were robust to large initial misalignment and the highest
accuracy was achieved using MLP-LM and RBF.

Keywords: 3D-2D image registration, regression, Multiple Layer Perceptron,


Radial Basis Function network, Support Vector Regression, coronary arteries.

1 Introduction

Medical interventions can benefit from the integration of preoperative (diagnostic)


and intraoperative imaging data. Accurate and fast image registration is required to
find the relation between the preoperative and the intraoperative images of the patient
in the intervention room. A good example is the registration of preoperative 3D
computed tomography angiography (CTA) images and intraoperative 2D X-ray
images, for percutaneous coronary interventions. For a detailed explanation of the 3D-
2D registration problem we refer to [1]. Most previously presented methods are based
on simulated X-ray projection images – digitally reconstructed radiographs (DRRs) –
computed from the preoperative CTA scan. In these methods, registration is
performed by iteratively optimizing a similarity metric, measuring the difference
between the DRR and the X-ray image [1]. Due to local maxima of the similarity

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 238–245, 2012.
© Springer-Verlag Berlin Heidelberg 2012

measure, such iterative optimization procedures usually have a small capture range
and therefore require initialization close to the searched pose [1, 2]. In our recently
proposed registration-by-regression framework [3], image registration is treated as a
nonlinear regression problem, as an alternative for the iterative traditional approach.
A MLP was used as the regression model relating the DRR image to the
transformation parameters of the 3D object.
In the literature, some authors have already treated medical image registration as a
regression problem, but the application, the features, and the outputs predicted by the
function differ from our approach. MLP is used as the regression method in [4, 5], and in
[6] different Neural Networks were studied, including RBF networks. Image
registration using SVR [7] and k-NN [8] has been investigated as well.
In this paper we perform a comparative evaluation of seven different regression
techniques for the 3D-2D registration-by-regression problem, particularly for the
registration of 3D preoperative coronary CTA images to 2D intraoperative X-Ray
images. For reference, the results of a conventional registration method (i.e. based on
iterative optimization) obtained in [3] are reported as well.

2 Method

2.1 Registration by Regression

The 3D-2D registration method based on regression was first presented in [3]. The
regression model relates image features of the 2D projection image (Fig. 1) to the
transformation parameters of the 3D image (translation and rotation). Before the
intervention, a set of simulated 2D images (DRRs) is generated by applying random
transformations to the pre-interventional 3D image followed by projection of its
coronary artery segmentation. A set of features extracted from the DRR and their
corresponding transformation parameters form an input-output pair in the training set
for the learning process. During the intervention, the image features of the 2D
projection image are computed and fed as input to the regression function, which
returns the estimated 3D translation and rotation parameters of the 3D image.
Mathematically, given an input vector $X = (X_1, \dots, X_P)$, we want to predict an
output $Y$ with a model $f$ such that $Y = f(X)$. To estimate the parameters of the
prediction model we use a set of measurements $(x_i, y_i)$, for $i = 1, \dots, N$ (the
training data). For each transformation parameter (three rotation angles and three
translations) an independent regression model is trained.
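The per-parameter setup can be sketched as follows; `fit` is a placeholder for any regression trainer (one of the methods of Section 2.3), not part of the authors' code:

```python
def train_per_parameter(features, transforms, fit):
    """Train one independent regression model per transformation parameter.
    features: list of feature vectors; transforms: list of parameter tuples
    (e.g. 3 rotations + 3 translations); fit: trainer returning a model."""
    n_params = len(transforms[0])
    return [fit(features, [t[i] for t in transforms]) for i in range(n_params)]
```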

2.2 Input Features


In [3], different sets of image features were compared. In this work, we use the best
performing set, consisting of two types of features: 1) 2D geometric moments and 2)
the eigenvalues and eigenvectors obtained by Principal Component Analysis (PCA).
The 2D geometric moments included are the moment of order zero, the two first
order moments and the three second order moments. The PCA features are computed
on the pixels of the object of interest after a coarse segmentation of the image into
object and background, as described in [3]. The PCA is performed on a
combination of the pixel locations and their corresponding intensity values, i.e., a 3D
vector with x, y and I(x,y) as variables, where I(x,y) is the intensity value of the point
at position (x,y). PCA was preceded by computing the z-score of the features, where
we used an identical mean and standard deviation for x and y, to prevent losing pose
information in this normalization procedure. The normalization is necessary since the
intensities I(x,y) have a different unit than the pixel positions x and y.
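A sketch of this normalization step (illustrative, not the authors' code): x and y share one pooled mean and standard deviation so that their relative scale, and hence the pose information, survives the z-scoring, while the intensity is standardized on its own.

```python
def zscore_xyi(points):
    """points: list of (x, y, intensity). Standardize x and y with a shared
    mean/std (pooled over both coordinates), and intensity separately."""
    def mean_std(vals):
        m = sum(vals) / len(vals)
        s = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
        return m, s or 1.0  # guard against zero spread
    xy = [p[0] for p in points] + [p[1] for p in points]
    m_xy, s_xy = mean_std(xy)
    m_i, s_i = mean_std([p[2] for p in points])
    return [((x - m_xy) / s_xy, (y - m_xy) / s_xy, (i - m_i) / s_i)
            for x, y, i in points]
```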

2.3 Regression Models

The following regression methods are compared in this work.

Multiple Regression: Linear and Polynomial. We consider two Multiple Regression
cases. The first is Multiple Linear Regression (LR), where the model in Section 2.1
has the form $f(X) = \beta_0 + \sum_{j=1}^{P} \beta_j X_j$, with $\beta_j$ the model parameters to be trained [9].
The second model is the polynomial regression (PR) model, which uses basis
expansions and interactions between variables:
$f(X) = \beta_0 + \sum_{j=1}^{P} \beta_j X_j + \sum_{j=1}^{P} \beta_{jj} X_j^2 + \sum_{j<k} \beta_{jk} X_j X_k$.
Both regression models are linear in the parameters and
we use the Least Squares Method to compute the parameters [9].
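For a single predictor, the least-squares estimate has a simple closed form. The following is a generic textbook sketch, not the authors' 18-feature implementation:

```python
def fit_linear_1d(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1
```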
k-Nearest Neighbour. The k-nearest neighbour (k-NN) method makes predictions by
averaging the responses of the k nearest points in the training set. The model can
be written as $\hat{f}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighbourhood of $x$ defined
by the k nearest points in the training set [9]. The metric used is the Euclidean
distance.
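The prediction rule above translates directly into code (an illustrative sketch with the Euclidean metric):

```python
def knn_predict(x, train, k):
    """Average the responses of the k training points nearest to x.
    train: list of (feature_vector, response) pairs."""
    def sq_dist(a, b):
        # squared Euclidean distance preserves the neighbour ordering
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda p: sq_dist(p[0], x))[:k]
    return sum(y for _, y in nearest) / k
```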
Multiple-Layer Perceptron. A MLP with one hidden layer is considered, with the
hyperbolic tangent as activation function. For the output layer a linear
function is used. This MLP model for $P$ input features and $H$ hidden units is given by
$f(X) = w_0^{(2)} + \sum_{h=1}^{H} w_h^{(2)}\,\varphi\big(w_{h0}^{(1)} + \sum_{p=1}^{P} w_{hp}^{(1)} X_p\big)$, with $\varphi$ being the hyperbolic
tangent function [9, 10]. The model parameters $w$ represent the synaptic weights,
and the indexes (1) and (2) correspond to the connections between the input-hidden
and hidden-output layers, respectively.
Two optimization techniques for training are evaluated: conjugate gradient (MLP-
CG), as in [3], and Levenberg-Marquardt (MLP-LM) [11]. The weights are randomly
initialized within the range [-1,1]. The regularization parameter in the Levenberg-
Marquardt method is initialized at $10^{-3}$ with an increase factor of 10 and a decrease
factor of $10^{-1}$.
Radial Basis Function Network. The RBF network has a similar architecture to the
MLP. The activation function of the hidden layer is the Gaussian function. The
argument of the activation function in RBF networks is the Euclidean distance
between the input vector and the synaptic weight vector (the RBF centre) of that
neuron [10, 12]. The RBF network model can be written as
$f(X) = \sum_{j} w_j\,\phi(\lVert X - c_j \rVert)$, with $\phi$ a normalized Gaussian function and $c_j$ the RBF centres.

The RBF uses Orthogonal Least Squares as the training algorithm, where RBF
centres are chosen one by one from the input data. Each selected centre maximizes the
increment to the explained variance of the desired output [13].
Support Vector Machine. In Support Vector Regression (SVR) a linear regression
function in a high-dimensional feature space is computed as
$f(X) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*})\,K(x_i, X) + b$, where the input data are mapped to the feature space
via a nonlinear transformation according to the kernel function $K(x_i, x_j)$ [14]. The $\varepsilon$-SVR model
[15] with an $\varepsilon$-insensitive loss function [16] and an RBF kernel function is used.

3 Experiments and Results

3.1 Imaging Data


We used 3D preoperativ ve coronary CTA data with 2D intraoperative X--ray
angiography of ten patien nts. To evaluate the regression methods in a controllled
setting, we used simulated projection images of the coronary vessel tree, with knoown
ground truth transformatio on, instead of real X-ray images. To this end, we m made
binary vessel tree models by segmenting coronary arteries at end-diastole (Fig. 2)
[17]. From these 3D modells, DRRs were generated using the computation proceddure
described in [18]. The projection geometry for the computation of the DRRs and the
initial orientation of the preeoperative data were derived from the interventional X--ray
image, thereby mimicking a clinically relevant view. The size and voxel spacingg of
the CTA images were 256x256x[99-184] voxels and 0.7x0.7x[0.8-1.0] m mm3,
respectively, and the size and
a pixel spacing of the DRR images were 512x512 pixxels
and 0.22x0.22 mm2, respecttively.
For each patient, 11000 0 DRRs were generated, 10000 to obtain the regresssion
model and 1000 to test th he performance of our method. The transformations w were
drawn from a uniform disttribution, with a wide yet relevant range, i.e. between -10
and 10 degrees for rotationss and between -10 and 10 mm for translations.

Fig. 1. Geometry of a C-arm device, which makes 2D projection images of a 3D object

Fig. 2. Coronary CTA slice (a), coronary 3D CTA with segmented coronary arteries (b),
3D coronary artery model (c), and DRR obtained from the model (d)

3.2 Conventional 3D-2D Registration Method


The different registration-by-regression models were compared to a method based on
iterative optimization [3]. This method was proposed in [17, 3] and was designed for
the application considered. It uses a nonlinear conjugate gradient optimizer and a
similarity metric based on the distance transform of the projection of the 3D coronary
segmentation, and a fuzzy segmentation of vessel structures in the 2D image.

3.3 Evaluation Methodology


The evaluation of the registration approaches was performed by the computation of
the mean target registration error (mTRE) before and after registration, which is a
well-known evaluation criterion for registration methods [19]. It computes the
distance between corresponding points, to assess the accuracy of the registration. The
mTRE is computed as the mean 3D distance of the centerline tree at the ground truth
position and orientation to the centerline tree at the position and orientation
determined by the registration-by-regression method, given by
$\text{mTRE} = \frac{1}{N} \sum_{n=1}^{N} \lVert T(p_n) - T_{gold}(p_n) \rVert$, with $T$ the transformation resulting from one of the
registration methods, $T_{gold}$ the known ground-truth transformation and $p_n$ points on the
centerlines of the 3D vessel tree. All reported mTRE values in the following sections
were computed on the test set of 1000 images, which was not used for the
construction of the regression models.
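The criterion follows directly from its definition; in this sketch, `transform` and `transform_gold` are placeholder callables for the estimated and ground-truth transformations:

```python
def mean_tre(points, transform, transform_gold):
    """Mean 3D distance between centerline points mapped by the estimated
    transformation and by the ground-truth transformation."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return sum(dist(transform(p), transform_gold(p)) for p in points) \
        / len(points)
```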

3.4 Optimization and Parameter Settings for Regression Based Registration

Prior to the evaluation of the regression models, some experiments were performed to
optimize k-NN, MLP, RBF and SVR. To this end, the image set of one of the patients
(named as patient 0) was used, whereas the sets of the nine remaining patients were
used for the evaluation in Section 3.5. For all patients, the set of 10000 images for the
construction of the regression model was split into two sets of 70% and 30%. The set
of 7000 images was used to train the regression models; the remaining 3000 images
were considered for validation purposes, to select tuning parameters.
For the k-NN model, the search range for the optimal value of k was limited to
[1,50], based on experiments on patient 0 with a coarse grid-search in a larger range.
For every patient and for each transformation parameter, the optimal value of k ∈
[1,50] was chosen as the point where the validation error started to grow with
increasing k.
For the MLPs we tried several numbers of units for the hidden layer, {9, 18, 36,
54}, using the image set of patient 0. The number of hidden units was set to 36, which
leads to a topology of P=18 input units, 36 hidden units and 1 output unit. The
number of epochs was defined separately for each MLP by a stopping epoch (i.e., the
epoch when the validation error started to grow) with a maximum of 1000.
For RBF networks, a two-level grid-search for the spread of the radial basis functions
was performed for patient 0. First, a rough spread estimate $\sigma_0$ was computed
from $d_{\max}$, the maximum distance between the inputs [10]. The
grid search was performed in the range $[\sigma_0/2, 5\sigma_0]$, and the initial estimate $\sigma_0$ was
found to be optimal. For the other patients, the spread was set to $\sigma_0$, with $d_{\max}$
computed on their respective input sets. The optimum number of RBF centres was
determined for each patient (and each transformation parameter) as the point where
the validation error started to grow, with a maximum of 1000 neurons.
In the SVR case, three parameters need to be set: $\varepsilon$, C and γ [20]. All three are
tuned simultaneously. We use a coarse-to-fine grid-search as recommended by [21,
22], considering exponentially growing sequences of the parameter values. We
performed a wide-range three-level grid-search for patient 0 and the values obtained
were used for the other patients. In the first level, we used the ranges $\{2^{-13}, 2^{-11}, \dots, 2^{5}\}$ for
$\varepsilon$, $\{2^{-1}, 2^{1}, \dots, 2^{15}\}$ for C and $\{2^{-9}, 2^{-7}, \dots, 2^{9}\}$ for γ.
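Such exponential grids can be generated as below; the step of 2 in the exponent is an assumption read off from the listed first-level sequences:

```python
def exp_grid(lo, hi, step=2):
    """Exponentially growing sequence 2**lo, 2**(lo+step), ..., up to 2**hi."""
    return [2.0 ** e for e in range(lo, hi + 1, step)]

eps_grid = exp_grid(-13, 5)    # candidate values for epsilon
c_grid = exp_grid(-1, 15)      # candidate values for C
gamma_grid = exp_grid(-9, 9)   # candidate values for gamma
```

A coarse-to-fine search then repeats the same construction with a narrower range and smaller step around the best point of the previous level.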
All regression experiments were performed using MatLab 7.11.0.584. For SVR
we used version 3.1 of the LIBSVM tool [22].

3.5 Results
The mTRE values of the different regression methods and of the conventional
registration method, considering all patients except the one used for parameter
optimization, are shown in Figure 3. We also present, for each of the above patients,
the Regression Error Characteristic (REC) curves [23] for all methods (Fig. 4). These
REC curves estimate the cumulative distribution function of the error and are a
customization of receiver operating characteristic curves to regression. They plot the
error tolerance on the x-axis (expressed as mTRE in our case) and the accuracy of a
regression function on the y-axis (i.e. the percentage of points that lie within the
tolerance).
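One point of a REC curve is simply the empirical fraction of test errors within a given tolerance; sweeping the tolerance traces the whole curve (an illustrative sketch):

```python
def rec_point(errors, tolerance):
    """Accuracy at a given error tolerance: fraction of samples whose
    error (here, mTRE) does not exceed the tolerance."""
    return sum(e <= tolerance for e in errors) / len(errors)

def rec_curve(errors, tolerances):
    """(tolerance, accuracy) pairs; accuracy is non-decreasing in tolerance."""
    return [(t, rec_point(errors, t)) for t in tolerances]
```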

4 Discussion and Conclusion

We compared different regression methods for the registration-by-regression approach


for 3D-2D image registration of coronary vessel trees. Figure 3 shows that Neural
Networks (MLP and RBF) and SVR behave quite similarly and perform better than
Multiple Regressions and k-NN. The registration-by-regression approach is less
accurate but has a smaller variance when compared to the iterative registration
approach, confirming the results obtained in [3] using the MLP-CG strategy. The
evaluation per patient (Fig. 4) shows that MLP-LM and RBF have the best
performances for all patients except for patient 7 and 9, followed by MLP-CG and
SVR. For patients 7 and 9, MLP-LM still gives the best result, followed by MLP-CG,
SVR and finally RBF. For all patients, the worst results were obtained with k-NN
and LR.
The MLP-LM evaluated in this work uses an optimization strategy which performs
better than the MLP-CG approach in [3]. The relatively low performance of LR and
PR suggests that a highly nonlinear regression model is required for the registration-
by-regression method.
Registration-by-regression with real X-ray images will be the focus of future work.

Fig. 3. Comparison of registration results for all regression methods and for a conventional
registration method (Conv), considering all patients, except patient 0 (which was used for
parameter optimization). The graph also shows the initial mTRE before registration.

Fig. 4. REC curves for all methods and for each patient, except patient 0 (which was used for
parameter optimization)

References
1. Markelj, P., et al.: A review of 3D/2D registration methods for image-guided interventions.
Medical Image Analysis (2010)
2. van de Kraats, E.B., et al.: Standardized evaluation methodology for 2D – 3D registration.
IEEE Transactions on Medical Imaging 24, 1177–1189 (2005)
3. Gouveia, A.R., et al.: 3D-2D Image registration by nonlinear regression. In: IEEE-ISBI (in
press, 2012)
A MDRNN-SVM Hybrid Model for Cursive Offline
Handwriting Recognition

Byron Leite Dantas Bezerra¹, Cleber Zanchettin², and Vinícius Braga de Andrade¹

¹ Polytechnic School of Pernambuco, University of Pernambuco,
50.750-470, Recife - PE, Brazil
{byronleite,viniciusbraga}@ecomp.poli.br
² Federal University of Pernambuco, Center of Informatics,
Cidade Universitária, 50.732-970, Recife - PE, Brazil
cz@cin.ufpe.br

Abstract. This paper presents a recurrent neural network applied to handwritten
character recognition. The Multi-dimensional Recurrent Neural Network method
is evaluated against classical techniques. To improve the model performance we
propose the use of specialized Support Vector Machines combined with the
original Multi-dimensional Recurrent Neural Network in cases of frequently
confused letters. The experiments were performed on the C-Cube database and
compared with different classifiers. The hierarchical combination presented
promising results.

1 Introduction
Despite more than 30 years of handwriting recognition research [15], [12], [21], [3],
developing a reliable, general-purpose system for unconstrained text line recognition
remains an open problem. This is a complex task due to the variation of styles
in handwriting, noise in the acquisition process and similarity among some classes [23]. In
this domain each writer has a different calligraphy and it may change depending on the
writing material (type of pen, pencil or paper). The emotional state of the person, the
available paper space and the time to write may also influence the handwritten
text. In the classical literature the classifiers used to perform handwriting recognition
are statistical, connectionist and probability-based classifiers [3].
Recently, Recurrent Neural Networks (RNNs) have presented promising results in this
field [1]. The architecture proposed by Graves et al. [7] obtained the best results in
the ICDAR 2009 Handwriting Recognition Competition [1]. This model is composed
of a hierarchy of Multi-dimensional Recurrent Neural Network (MDRNN) [9] layers
that uses the Long Short-Term Memory (LSTM) method [8] and a Connectionist Temporal
Classification (CTC) output layer for character and word recognition. The proposed
model is an offline recognition system that works on raw image pixels. As well as
being alphabet independent, such a system has the advantage of being globally
trainable, with the image features optimized along with the classifier.
Since the model does not need an explicit feature extraction procedure on the digi-
talized handwriting image, it simplifies the recognition process; however, this property may
cause misclassification of similar letters. This confusion can happen because the method

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 246–254, 2012.

c Springer-Verlag Berlin Heidelberg 2012

does not have specific information about discriminatory regions of specific letters (e.g. U
and V, Q and O, I and l, among others), as specialized feature extraction techniques
use to avoid misclassifications. In this paper, we evaluate the performance of the
original MDRNN on the handwritten character recognition problem and propose the use
of specialized Support Vector Machines (SVMs) [20] to improve the performance of the
MDRNN in cases of frequently confused letters. The performance of the method is verified on the
C-Cube database and compared with different classifiers using the same database.

2 Handwritten Character Recognition


The handwriting recognition task is traditionally divided into online and offline recog-
nition. In online recognition a time series of coordinates, representing the movement of
the pen-tip, is captured, while in the offline case only an image of the text is available.
Because of the greater ease of extracting relevant features, online recognition generally
yields better results [15]. Another important distinction is between recognizing isolated
characters or words and recognizing whole lines of text. Lastly, handwriting recogni-
tion can be split into cases where the writing style is constrained in some way and the more
challenging scenario where it is unconstrained.
Character recognition consists of recognizing a set of characters from an image,
separating them into 10 classes, in the case of digits, or 26 classes, in the case of the
Western alphabet letters. There are some problems that hinder the implementation of
character recognition. In some cases the scanned image is of low quality, so it is nec-
essary to perform preprocessing to eliminate image noise. Another problem is the
existence of distorted characters, especially when dealing with handwritten documents,
due to the characteristics of the writer's style. Moreover, another difficulty to consider
is the similarity between some characters, such as I and J, Q and O, U and V, among
others. Figure 1 presents a sample of characters with similar characteristics.
To build an efficient handwritten recognition system, it is important to correctly
choose the features to be extracted. Rodrigues et al. [13] performed a study to
evaluate a set of feature extraction techniques for handwritten letter recognition. This
work presented a technique based on the projection of the image outline on the sides of
a regular polygon built around each character. The feature vector is formed by the per-
pendicular distances taken from each side of the polygon to the contour of the image. In
the evaluation process the proposed approach is compared using two polygons (square
and hexagon) and two different amounts of projection lines taken from each side of the
polygon, with two versions of coding bit maps (standard and tuned). The discriminatory
power of each case is examined through the use of a Multi-Layer-Perceptron (MLP).
Vamvakas et al. [19] proposed a methodology based on a new feature extraction
technique that recursively subdivides the character image so that the resulting
sub-images at each iteration have an approximately equal number of foreground
pixels, as far as possible. In the experiments two databases of handwritten
characters (CEDAR and LIC) and two databases of handwritten digits were used
(MNIST and CEDAR). The classification step was performed using SVM with Radial
Basis Function (RBF).
Cruz et al. [5] presented a new approach for recognizing cursive characters us-
ing multiple feature extraction algorithms and an ensemble classifier. Several feature

extraction techniques, using different approaches, were evaluated. Based on the results,
a combination of feature sets was proposed in order to achieve high recognition perfor-
mance. This combination was motivated by the observation that the sets of character-
istics are independent and complementary. The ensemble was conducted by combining
the outputs generated by the classifier in each set of features separately. The database
used for the experiments was the C-Cube and the classifier a three-layer MLP network,
trained with the Resilient Backpropagation.
Bellili et al. [2] proposed a hybrid MLP-SVM method for unconstrained handwritten
digit recognition. This hybrid architecture is based on the idea that the correct digit
class almost systematically belongs to the two maximum MLP outputs and that some
pairs of digit classes constitute the majority of MLP misclassifications. Specialized
local SVMs are introduced to detect the correct class among these two classification
hypotheses.
Camastra [4] presented a cursive character recognizer that performs the character
classification using SVM and neural gas. The neural gas is used to verify whether lower
and upper case versions of a certain letter can be joined into a single class or not. Once
this is done for every letter, SVMs perform the character recognition. The recognizer
compares favorably, in terms of recognition rate, with popular neural classifiers such as
Learning Vector Quantization (LVQ) and MLP. The SVM recognition rate is among the
highest presented in the literature for cursive character recognition.
Neves et al. [11] presented a hierarchical combination of SVMs for handwritten
digit recognition with promising classification results but a high classification time,
due to the need for one SVM model per class pair.
Zanchettin et al. [23] present a hybrid KNN-SVM method for cursive character
recognition. Specialized SVMs are introduced to significantly improve the performance
of KNN in handwriting recognition. This hybrid approach is based on the observation
that, when using KNN for handwritten character recognition, the correct class is
almost always one of the two nearest neighbors; the SVM is used to reduce
misclassifications between these two candidates.
In [13] and [19] it was observed similarities between some letters (e.g. ‘B and D’,
‘H and N’, ‘O and Q’). In [23] it was observed similarities between letters ‘O and D’,
‘B and R’, ‘D and B’, ‘N and M’. In [5] a high error rate was detected for characters
with different ways of writing (e.g., ‘a and A’, ‘f and F’). In this paper we perform
experiments with the different ways of writing: upper, lower and joint case.

3 MDRNN Boosted by SVM Applied to Character Recognition

The hierarchical structure proposed by Graves et al. [7] is composed of the MDRNN
using LSTM. The output layer uses connectionist temporal classification (CTC)
[10]. The concept of MDRNNs [9] is to replace the single recurrent connection found
in standard recurrent networks with as many connections as there are spatio-temporal
dimensions of the data. These connections allow the network to create a flexible internal
representation of surrounding context, which is robust to localized distortions.
This architecture allows cyclic connections among the network nodes. These connec-
tions enable the network to save the entire history of previous inputs, whereas without
cyclic connections the network would only map the current input to the output class.
The key point is that the recurrent connections act as a 'memory' of previous inputs
that persists in the network's internal state, which can be used to influence the network
output [8].
Figure 2 presents an RNN with one hidden layer. Unfortunately, for standard
RNN architectures, the range of accessible context is limited. The problem is that the in-
fluence of a given input on the hidden layer, and therefore on the network output, either
decays or blows up exponentially as it cycles around the network's recurrent connec-
tions. This behavior results in what is generally called the Vanishing Gradient Problem,
illustrated in Figure 4(a). In this figure the luminance of the nodes indicates how much
each previous input can still influence the processing of the next input. The sensitivity
decays exponentially over time as new inputs overwrite the activation of the hidden
units and the network 'forgets' the first input.

Fig. 1. Characters with structural similarity
Fig. 2. RNN with one hidden layer
Fig. 3. Illustration of the LSTM model [8]

To solve this problem Graves et al. [7] proposed the use of the LSTM method. The
LSTM consists of a recurrent subnetwork known as a memory block. Each block
contains one or more self-connected memory cells and three multiplicative units: the
input, output and forget gates. The multiplicative gates allow LSTM memory cells to
store and access information over long periods of time, thereby avoiding the Vanishing
Gradient Problem. An LSTM unit is illustrated in Figure 3, and Figure 4(b) shows
how the LSTM works over time with successive network inputs. A detailed description is
presented in [7].
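The gating mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the weight shapes, initialization and toy dimensions are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, c_prev, W, b):
    """One step of an LSTM memory cell: the input (i), forget (f) and
    output (o) gates control what is written to, kept in and read from
    the cell state c, which lets information survive over long periods."""
    z = W @ np.concatenate([x, h_prev]) + b       # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # multiplicative gates in (0, 1)
    c = f * c_prev + i * np.tanh(g)               # gated cell-state update
    h = o * np.tanh(c)                            # gated output
    return h, c

# toy dimensions: 3 inputs, 2 hidden units, a length-5 input sequence
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * 2, 3 + 2))
b = np.zeros(4 * 2)
h, c = np.zeros(2), np.zeros(2)
for x in rng.standard_normal((5, 3)):
    h, c = lstm_cell_step(x, h, c, W, b)
```

Because the forget gate multiplies the previous cell state rather than overwriting it, the gradient along the cell state does not decay the way it does in the plain RNN of Figure 2.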
To work with multiple dimensions the authors suggested using a multi-dimensional
network, expanding the concept presented above. The basic idea of MDRNNs is to
replace the single recurrent connection found in standard RNNs with as many recurrent
connections as there are dimensions in the data. During the forward pass, at each point
in the data sequence, the hidden layer of the network receives both an external input and
its own activations from one step back along all dimensions. Therefore, in the character
recognition problem we use two recurrent connections based on the image dimensions.
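The two-dimensional forward pass just described can be sketched as follows; a plain tanh hidden layer stands in for the LSTM cells, and all names, shapes and the scan order are illustrative assumptions.

```python
import numpy as np

def mdrnn_2d_forward(image, W_in, W_x, W_y, b):
    """Forward pass of a 2-D RNN: at each pixel the hidden layer receives
    the external input plus its own activations from one step back along
    BOTH dimensions (the left neighbour and the top neighbour)."""
    height, width = image.shape
    n_hidden = b.shape[0]
    h = np.zeros((height, width, n_hidden))
    for i in range(height):
        for j in range(width):
            a = W_in * image[i, j] + b
            if i > 0:
                a = a + W_y @ h[i - 1, j]   # recurrence along the vertical axis
            if j > 0:
                a = a + W_x @ h[i, j - 1]   # recurrence along the horizontal axis
            h[i, j] = np.tanh(a)            # tanh stands in for the LSTM cells
    return h

rng = np.random.default_rng(1)
n = 4                                        # hidden units
img = rng.random((8, 6))                     # a tiny grey-scale "image"
h = mdrnn_2d_forward(img, rng.standard_normal(n),
                     rng.standard_normal((n, n)), rng.standard_normal((n, n)),
                     np.zeros(n))
```

Each hidden activation thus depends, transitively, on the whole rectangle of pixels above and to the left of it, which is what gives the network its flexible internal representation of surrounding context.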
The MDRNN and LSTM activation output is presented to the CTC output layer
designed for sequence labeling with RNNs. Unlike other neural network output layers, it
does not require pre-segmented training data, or postprocessing to transform its outputs
into transcriptions. Instead, it trains the network to directly estimate the conditional
probabilities of the possible classes given the input sequences.
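CTC training itself is beyond a short sketch, but the decoding side illustrates how per-timestep class probabilities become a transcription without pre-segmented data: take the most probable symbol at each timestep, collapse repeats, then drop the blank symbol. The alphabet and probabilities below are toy assumptions.

```python
def ctc_best_path(prob_matrix, alphabet, blank=0):
    """Greedy (best-path) CTC decoding: argmax per timestep,
    collapse consecutive repeats, then remove blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in prob_matrix]
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(alphabet[s])
        prev = s
    return "".join(out)

# alphabet index 0 is the CTC blank symbol '-'
probs = [[0.1, 0.8, 0.1],    # 'a'
         [0.1, 0.7, 0.2],    # 'a' again (repeat collapses)
         [0.9, 0.05, 0.05],  # blank
         [0.1, 0.2, 0.7]]    # 'b'
print(ctc_best_path(probs, alphabet="-ab"))  # -> "ab"
```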
Despite the robustness of the MDRNN-LSTM model across different problems [8],
we have observed some misclassifications when the correct character class or
word has significant similarities to an incorrect class. In some of these cases it is still
possible to distinguish the classes without using context information. The reason this
happens is, at the same time, one of the most interesting advantages of the MDRNN-LSTM
model: it does not have to run a feature extraction step. In contrast, this step is
generally needed in recognition methods grounded on Bayes Decision Theory (BDT) [6].
According to the Bayes decision rule, the class of a given observed feature vector must
be chosen uniquely based on the likelihood of this feature vector given the class, times the
probability of this class. Therefore, suppose we are able to define discriminant functions
which take as input well-defined features extracted from the image of some character
and, as output, compute the probability that the sample belongs to the associated
class. Then the class associated with the discriminant function that produced
the highest output is the chosen class. Our idea is to combine both strategies in order to
improve the recognition results.
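The decision rule above reduces to picking the class that maximizes likelihood times prior. A minimal sketch, with toy one-dimensional Gaussian likelihoods assumed purely for illustration:

```python
import math

def bayes_decide(x, likelihoods, priors):
    """Pick the class c maximizing p(x | c) * P(c)."""
    scores = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    return scores.index(max(scores))

def gaussian(mu, sigma):
    """A 1-D Gaussian density, standing in for a discriminant function."""
    return lambda x: (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                      / (sigma * math.sqrt(2 * math.pi)))

# two toy classes centred at 0 and 3, equal priors
likelihoods = [gaussian(0.0, 1.0), gaussian(3.0, 1.0)]
priors = [0.5, 0.5]
print(bayes_decide(0.2, likelihoods, priors))  # near mu = 0 -> class 0
print(bayes_decide(2.9, likelihoods, priors))  # near mu = 3 -> class 1
```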

Fig. 4. Information flow in the MDRNN over time [8]: (a) the Vanishing Gradient Problem; (b) the LSTM case, where 'O' denotes an open gate and '-' a closed gate

If we adopt the previous approach for each class where the MDRNN-LSTM model
makes more misclassifications, we maximize the chance of deciding the correct class
with the composed model, since it is specialized on the representative features ex-
tracted from the data. Additionally, taking into account that the majority of the confusions
observed in the MDRNN-LSTM model involve specific pairs of classes (e.g., ‘U’ and ‘V’,
‘I’ and ‘l’, among others), we just need a dichotomizer [6] for each pair of classes
misclassified by the MDRNN-LSTM method. In order to determine which dichotomizer
to run, we select the two classes that receive the highest values (the most probable)
in the MDRNN output layer.

Fig. 5. The SVM optimal margin separation [14]
Fig. 6. Features in a bi-dimensional space [14]
Fig. 7. The kernel function

The SVM [20] is a binary classification technique and also a dichotomizer. We pro-
pose the use of the SVM as a class confirmation for the MDRNN-LSTM method. The
SVM training consists of finding the support vectors for each class and creating a func-
tion that represents an optimal margin separation between the support vectors of dif-
ferent classes. Consequently, it is possible to obtain an optimal hyperplane for class
separation, as shown in Figure 5. The SVM thus finds the optimal linear separation
function between two classes and can still deal with non-linearly separable data (see
Figure 6): it uses a kernel function to increase the feature dimensionality, consequently
making the data linearly separable, as shown in Figure 7.
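A full SVM solver is beyond a short sketch, but the dual-form kernel perceptron below illustrates the same mechanism the paragraph describes: an RBF kernel implicitly lifts the features into a higher-dimensional space where a linear separation exists. The data, the XOR example and all parameters are illustrative assumptions, not the paper's SVMs.

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """RBF kernel: an implicit map into a higher-dimensional feature space."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def train_kernel_perceptron(X, y, gamma=0.5, epochs=10):
    """Dual-form kernel perceptron on labels in {-1, +1}: a minimal stand-in
    for a kernel dichotomizer of the SVM kind described in the text."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            s = sum(alpha[j] * y[j] * rbf(X[i], X[j], gamma) for j in range(len(X)))
            if y[i] * s <= 0:            # mistake -> strengthen this point
                alpha[i] += 1.0
    return alpha

def decide(x, X, y, alpha, gamma=0.5):
    s = sum(alpha[j] * y[j] * rbf(x, X[j], gamma) for j in range(len(X)))
    return 1 if s >= 0 else -1

# XOR is not linearly separable in the input space, but becomes
# separable in the space induced by the RBF kernel
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [1, -1, -1, 1]
alpha = train_kernel_perceptron(X, y)
print([decide(x, X, y, alpha) for x in X])  # -> [1, -1, -1, 1]
```

The SVM differs in that it maximizes the margin between the two classes rather than merely finding any separating function, but the kernel trick works identically.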
The proposed system architecture is shown in Figure 8. In the proposed system, the
feature extraction step is performed only when the MDRNN-LSTM returns as output a
class with low confidence that is commonly confused with some other class. In this
case, we take the second most probable class returned by the MDRNN-LSTM and
use the appropriate SVM to resolve the confusion between the two options. In the
experiments we used the 34 features proposed by Camastra [4].
Different SVMs were derived for the pairs of classes (e.g. (U, V), (m, n), (N, n),
etc.) constituting the majority of the confusions observed with the MDRNN-LSTM
classifier. Different kernel functions (linear, polynomial and RBF) were tested and the
best performance was obtained by SVMs trained with the RBF kernel function. The
pairs of confused classes were chosen based on the number of errors, taking as a
minimum 10% of the size of the validation set.
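The decision flow of the proposed architecture can be sketched as follows: the MDRNN decision is accepted directly unless its confidence is low and its two most probable classes form a pair with a trained dichotomizer. The confidence threshold, class names and classifier interface are illustrative assumptions, not values from the paper.

```python
def hybrid_classify(mdrnn_probs, pair_svms, extract_features, image, tau=0.8):
    """mdrnn_probs: dict class -> probability from the MDRNN output layer.
    pair_svms: dict frozenset({a, b}) -> dichotomizer with .predict(features).
    Features are extracted only when the MDRNN is unsure AND its two most
    probable classes form a known confusion pair."""
    ranked = sorted(mdrnn_probs, key=mdrnn_probs.get, reverse=True)
    best, second = ranked[0], ranked[1]
    pair = frozenset({best, second})
    if mdrnn_probs[best] >= tau or pair not in pair_svms:
        return best                        # accept the MDRNN decision
    feats = extract_features(image)        # e.g. Camastra's 34 features
    return pair_svms[pair].predict(feats)  # specialized SVM decides

# toy usage with a stub dichotomizer standing in for a trained SVM
class StubSVM:
    def predict(self, feats):
        return "V"

probs = {"U": 0.48, "V": 0.45, "O": 0.07}
svms = {frozenset({"U", "V"}): StubSVM()}
print(hybrid_classify(probs, svms, lambda img: img, image=None))  # -> "V"
```

Because feature extraction runs only on the doubtful cases, the average classification cost stays close to that of the MDRNN alone.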

Fig. 8. The proposed model architecture

4 Experiments

The C-Cube is a public database available for download on the Cursive Character Chal-
lenge website (http://ccc.idiap.ch). The database consists of 57,293 files, including up-
percase and lowercase letters, manually extracted from the CEDAR and United States
Post Service (USPS) databases. All images are binary and with variable size. The data
are unbalanced and there is a big difference in the number of patterns among the letters.
There are several feature extraction techniques proposed in the literature for character
recognition, and feature choice is an important factor in achieving high accuracy rates [18]. Cruz et
al. [5] performed experiments with different feature extraction techniques with this
database. Camastra [4] used a clustering analysis to verify whether the upper and lower
case versions of the same letters are similar in shape. The letters (c, x, o, w, y, z, m, k,
j, u, n, f, v) presented the highest similarity between the two versions and were joined
into a single class in experiments without much loss of generality.
The classification results for the split (Upper and Lower case) and Joined cases are
shown in Table 1. The edge maps algorithm presented the overall best result. Most
feature sets presented better accuracy for the upper case letters, with the exception of
the method proposed by Camastra, which performed better for lower case. This feature set
also presented the best accuracy (84.37%) for the lower case. The last two lines of Table
1 also present the results of the MDRNN and MDRNN+SVM. The MDRNN presented
promising results, especially in comparison with methods where a feature extraction
step is needed. Additionally, our proposed hybrid model statistically outperforms the
MDRNN model. In fact, our proposed model achieves the best rates on the upper and
joint databases and is statistically equivalent to the Camastra method in the case of the
lower letters database. The hypothesis test used was the t-test at the 1% significance level.
In Tables 1 and 2 the boldface numbers are the best results found.
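A comparison of this kind can be reproduced with a paired t-statistic over matched accuracy estimates (e.g. per-fold results) of two models; the fold accuracies below are illustrative numbers, not the paper's data, and the 1% critical value shown assumes a two-tailed test with 4 degrees of freedom.

```python
import math

def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. per-fold accuracies of two models."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased variance of differences
    return mean / math.sqrt(var / n)

# illustrative per-fold accuracies of two classifiers
m1 = [0.87, 0.88, 0.86, 0.89, 0.87]
m2 = [0.84, 0.86, 0.83, 0.85, 0.84]
t = paired_t_statistic(m1, m2)
# with n - 1 = 4 degrees of freedom, |t| > 4.604 rejects equality at the 1% level
print(t > 4.604)
```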
The best results obtained in recent years for the C-Cube database are displayed in
Table 2. In Thornton et al. [16] the HVQ with temporal pooling algorithm is a partial im-
plementation of hierarchical temporal memory. This biologically-inspired model places
emphasis on the temporal aspect of pattern recognition, and consequently parses all im-
ages as ‘movies’. In [17] the modified direction feature extraction technique combines
the use of direction features (DFs) and transition features (TFs) to produce recognition
rates that are generally better than either DFs or TFs used individually.
Camastra [4] presented a cursive character recognizer. The character classification
is achieved by using SVMs and a neural gas. The neural gas is used to verify whether
lower and upper case version of a certain letter can be joined in a single class or not.
Once this is done for every letter, the character recognition is performed by SVMs.
A method for increasing the recognition rates of handwritten characters by com-
bining MLP and SVM was presented in [22]. The experiments demonstrated that
combining MLP networks with SVM experts for the pairs of classes that constitute the
greatest confusions of the MLP improved performance in terms of recognition rate.

Table 1. Recognition rate by feature set for the upper and lower case separated [13]

Method         Nodes  Upper Case (%)  Lower Case (%)  Joint Case (%)
Edge             490      86.52           81.13           82.49
Binary Grad.     490      86.35           79.89           81.46
MAT Grad.        300      85.77           79.22           80.83
Median Grad.     360      85.10           79.48           79.96
Camastra 34D     400      79.63           84.37           79.97
Zoning           450      84.46           78.07           78.60
Structural       320      81.94           77.70           77.07
Concavities      530      73.35           81.89           74.90
Projections      500      71.73           79.90           73.85
MDRNN              -      89.51           82.98           84.15
MDRNN+SVM          -      90.45           84.49           87.14

Table 2. Recognition rates for the C-Cube Database, Joint Case

Algorithm                #Class  Recog. Rate (%)
HVQ-32 [16]                 52      84.72
HVQ-16 [16]                 52      85.58
MDF-RBF [17]                52      80.92
34D-RBF [17]                52      84.27
MDF-SVM [17]                52      83.60
34D-SVM+Neu. Gas [4]        52      86.20
34-MLP [4]                  52      71.42
MLP+SVM [22]                52      82.53
KNN+SVM [23]                52      83.76
MDRNN                       52      84.15
MDRNN+SVM                   52      87.14

In Zanchettin et al. [23] a combination of KNN and SVM is presented. The main idea
is to use the SVM as a decision-maker classifier to increase the kNN recognition rate.
The adaptation in this case is to take the two most frequent classes among the k nearest
neighbors and to use the SVM to decide between these two classes. It is a satisfactory
technique for applications where a misclassification results in high costs. However, as this
technique depends on the kNN method, its main disadvantage is the processing time.
According to the results presented in Table 2, we conclude that the MDRNN model and our
proposed hybrid model are among the top methods for cursive character recognition. One
advantage of using these methods over the others is that the MDRNN itself designs
and learns everything that is needed from the pixels of the image to distinguish the main
differences between the classes. Therefore, the training step is much easier than in other methods
and the classification step performs faster, even in the proposed MDRNN-SVM model,
since the feature extraction and the classification step with the SVM occur only in case of
doubt about the MDRNN output.

5 Final Remarks

This paper evaluated the performance of the original MDRNN recurrent neural net-
work for handwritten character recognition on a well-known benchmark, the C-Cube
database. Additionally, the use of specialized SVMs is proposed to improve the per-
formance of the MDRNN in a hierarchical way. The performance of the method is ver-
ified and compared with different classifiers using the C-Cube database. The method
presented promising results in the classification task, and the proposed combination
improved the method's performance and robustness, especially on the disjoint Upper and
Lower letters databases.
As future work we suggest evaluating the performance of the MDRNN and of the
proposed method against others on some benchmark of isolated word images, varying
the number of training samples, the number of classes in the dataset, the amount of noise in the
words, the resolution of the images, and other variables.

References

1. El Abed, H., Margner, V., Kherallah, M., Alimi, A.M.: ICDAR 2009 Handwriting Recogni-
tion Competition. In: Int. Conf. Document Analysis and Recognition, pp. 1388–1392 (2009)
2. Bellili, A., Gilloux, M., Gallinari, P.: An Hybrid MLP-SVM Handwritten Digit Recognizer.
In: Int. Conf. on Document Analysis and Recognition, pp. 28–32 (2001)
3. Bunke, H.: Recognition of cursive roman handwriting - past present and future. In: Proc. 7th
Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 448–459 (2003)
4. Camastra, F.: A SVM-Based Cursive Character Recognizer. Pattern Recognition 40(12),
3721–3727 (2007)
5. Cruz, R.M.O., Cavalcanti, G.D.C., Tsang, I.R.: An Ensemble Classifier for Offline Cursive
Character Recognition using Multiple Feature Extraction Techniques. In: IEEE Int. Joint
Conf. on Neural Networks, pp. 744–751 (2010)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons (2001)
7. Graves, A., Fernández, S., Schmidhuber, J.: Multidimensional Recurrent Neural Networks.
In: Proc. of Int. Con. on Artificial Neural Networks, pp. 549–558 (2007)
8. Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Dissertation,
Technische Universität München, München (2008)

9. Graves, A., Schmidhuber, J.: Offline Handwriting Recognition with Multidimensional Re-
current Neural Networks. In: Adv. in Neural Information Proc. Syst., pp. 545–552 (2009)
10. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–
1780 (1997)
11. Neves, R.F.P., Lopes, A.N.G., Mello, C.A.B., Zanchettin, C.: A SVM Based Off-line Hand-
written Digit Recognizer. In: IEEE Int. Conf. on Sys., Man, and Cyb., pp. 510–515 (2011)
12. Plamondon, R., Srihari, S.N.: On-line and Off-line Handwriting Recognition: A Comprehen-
sive Survey. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 63–84 (2000)
13. Rodrigues, R.J., Kupac, G.V., Thomé, A.C.G.: Character Feature Extraction using Polygo-
nal Projection Sweep (Contour Detection). In: Proc. Int. Work. Conf. on Artificial Neural
Networks, pp. 687–695 (2001)
14. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall Pearson
Education Inc. (2003)
15. Tappert, C., Suen, C., Wakahara, T.: The State of the Art in Online Handwriting Recognition.
IEEE Trans. on Patt. Analysis and Machine Intelligence 12(8), 787–808 (1990)
16. Thornton, J., Faichney, J., Blumenstein, M., Hine, T.: Character Recognition using Hierarchi-
cal Vector Quantization and Temporal Pooling. In: Proc. Australasian Joint Con. on Artificial
Intelligence, pp. 562–572 (2008)
17. Thornton, T., Blumenstein, M., Nguyen, V., Hine, T.: Offline Cursive Character Recognition:
A State-of-the-art Comparison. In: Conf. Int. Graphonomics Society (2009)
18. Trier, O.D., Jain, A.K., Taxt, T.: Feature Extraction Methods for Character Recognition - A
Survey. Pattern Recognition 29(4), 641–662 (1996)
19. Vamvakas, G., Gatos, B., Perantonis, S.J.: Handwritten Character Recognition Through Two-
stage Foreground Sub-sampling. Pattern Recognition (43), 2807–2816 (2010)
20. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
21. Vinciarelli, A.: A Survey on Off-line Cursive Script Recognition. Pattern Recognition 35(7),
1433–1446 (2002)
22. Washington, W.A., Zanchettin, C.: A MLP-SVM Hybrid Model for Cursive Handwriting
Recognition. In: Proc. of Int. Joint Conf. on Neural Networks, pp. 843–850 (2011)
23. Zanchettin, C., Bezerra, B.L.D., Azevedo, W.W.: A KNN-SVM Hybrid Model for Cursive
Handwriting Recognition. In: IEEE Int. Joint Conf. on Neural Networks, Brisbane (2012)
Extraction of Prototype-Based Threshold Rules
Using Neural Training Procedure

Marcin Blachnik¹, Mirosław Kordos², and Włodzisław Duch³,⁴

¹ Silesian University of Technology, Department of Management and Informatics,
Katowice, Krasinskiego 8, Poland
marcin.blachnik@polsl.pl
² University of Bielsko-Biala, Department of Mathematics and Computer Science,
Bielsko-Biała, Willowa 2, Poland
mkordos@ath.bielsko.pl
³ Nicolaus Copernicus University, Department of Informatics,
Grudziądzka 5, Toruń, Poland
⁴ School of Computer Science, Nanyang Technological University, Singapore
Google: W. Duch

Abstract. Complex neural and machine learning algorithms usually lack com-
prehensibility. Combination of sequential covering with prototypes based on
threshold neurons leads to a prototype-threshold based rule system. This kind
of knowledge representation can be quite efficient, providing solutions to many
classification problems with a single rule.

Keywords: Data understanding, rule extraction, prototype-based rules.

1 Introduction
Neural networks and other complex machine learning models usually lack the advan-
tage of comprehensibility. This property is very important in many applications, in-
cluding safety-critical ones. Also in technological applications, where the systems are
used to support industrial processes (see for example [1]), simple and understand-
able models are crucial to avoid dangerous situations and to raise process engineers'
confidence in the solutions. The second aspect of building a comprehensible model is
directly related to knowledge extraction. In medical applications or social science the
data driven models may be the only source of knowledge about certain processes. Thus
there are evident advantages of data driven models, which use human-friendly knowl-
edge representation.
Four general approaches, which find comprehensible mappings f (xi ) → yi , are usu-
ally considered for that purpose [2]:
– propositional logic using crisp logic rules (C-rules)
– fuzzy logic and fuzzy rule based systems (F-rules)
– prototype-based rules and logic (P-rules)
– first and higher-order predicate logic
C-rules are the most common form of user-friendly knowledge representation. They
avoid any ambiguity, which assures that there is only one possible interpretation of

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 255–262, 2012.

c Springer-Verlag Berlin Heidelberg 2012

the rules, although at the expense of several limitations. Continuous attributes usually
cannot be discretized in a natural way, crisp decision borders divide the input space into
hyperboxes that cannot easily cover some data distributions, and some simple forms of
knowledge (like "the majority vote") may be expressed only using a very large number
of logical conditions in propositional form. The problems with discretization have been
addressed by fuzzy logic, with fuzzy rule-based systems using membership functions
to represent the degree (in the [0, 1] range) of fulfilling certain conditions [3]. This leads
to more flexible decision borders and allows for handling uncertainty in the real world
data. However, F-rules do not represent many forms of knowledge that predicate logic
can express in a natural way.
Prototype based rules try to capture the intuitive estimation of similarity in decision-
making processes [4], as is done in case-based reasoning. P-rules, which are rep-
resented by reference vectors and similarity measures, are more suitable for learning from
continuous data than predicate logic. The relation of P-rules to F-rules has been studied
in [5,6], where it has been shown that fuzzy rules can be converted to prototype-rules
with additive distance functions. P-rules work with symbolic attributes by applying
heterogeneous distance measures like the VDM distance [7]. Moreover, they can easily ex-
press some forms of knowledge, such as the selection of m out of n true conditions, which
is difficult for crisp and fuzzy rules.
P-rules can be expressed in two forms: as the nearest neighbor rules or as the
prototype-threshold based rules. This article addresses the problem of extracting the
prototype-threshold based rules from the data. A simple algorithm called nOPTDL,
based on the combination of the sequential covering approach to rule extraction with neural-
like training and representation of single rules, is presented below. It allows the use of
gradient-based optimization methods to optimize the parameters of the neurons that rep-
resent single rules.
In the next section the details of prototype-threshold rules are discussed, and in Sec-
tion 3 the nOPTDL algorithm is presented. Section 4 presents a few examples of the
nOPTDL performance and discusses the results. The last section concludes the paper
and presents the directions of further research.
Throughout the paper the data is denoted as a set
T = {[x1 , y1 ] , [x2 , y2 ] , . . . , [xn , yn ]}, where xi ∈ Rm is an input vector and yi ∈ {−1, 1}
is a class label (only two-class problems are considered).

2 Prototype-Threshold Based Rules


A single prototype-threshold based rule has the form:

If D(xi , p) < θ Then yi ← l (1)

where D(xi , p) expresses the distance between the vector xi and the prototype p, and l
is the class label associated with the rule.
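In code, a rule of form (1) is just a distance test against a stored prototype. A minimal Python sketch (the function and variable names are illustrative, not taken from the paper):

```python
import math

def euclidean(x, p):
    """Distance D(x, p) between an input vector and a prototype."""
    return math.dist(x, p)

def rule_fires(x, prototype, theta):
    """A prototype-threshold rule of form (1): fire iff D(x, p) < theta."""
    return euclidean(x, prototype) < theta

# Example: a rule with prototype (0, 0), threshold 1.0 and consequence +1
label = 1 if rule_fires([0.5, 0.5], [0.0, 0.0], 1.0) else None  # label == 1
```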
There are several approaches to constructing this type of rules. Perhaps the simplest one
is based on classical decision trees: the conditions that define each branch of the tree
are used to define a separate distance function for a single prototype associated with the
root of the tree. A more common approach, leading to a lower number of rules, starts
from a distance matrix D (q, w) that is used to construct new features for the decision
tree training. Each new feature represents the distance from a selected training vector,
so the number of new features added to the original set is equal to the number of training
instances, m = n. In this approach each node can consist either of a single prototype-
threshold rule or of a combination of crisp conditions with distance-based conditions [8].
Another approach, called the ordered prototype-threshold decision list (OPTDL), is derived
from the sequential covering principle. In this approach, described in [9], the algorithm
starts by creating a single rule and then adds new rules such that each new rule covers
examples not classified by the previously constructed rules (Algorithm 1). The shape of the
decision border of a single rule depends on the distance function; the Euclidean distance
function creates hyperspherical borders. To avoid unclassified regions the new rules
should overlap with each other. When a test vector falls in such an overlapping region, a
unique decision is made by ordering the rules from the most general to the most specific.
The training algorithm starts by creating the most general rule, and each new rule
is marked as more specific than the previous one. The decision-making process starts
by analyzing the most specific rule, and if its conditions are not fulfilled then more
general rules are analyzed. If an instance is not covered by any rule then the else
clause is used to determine the default class label.
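The decision-making procedure described above can be sketched in Python as follows (an illustrative sketch, not the authors' implementation): the rules are stored from the most general to the most specific and scanned in reverse order, with the else clause as a fallback:

```python
import math

def predict(x, rules, default_label):
    """Classify x with an ordered prototype-threshold decision list.

    `rules` is a list of (prototype, theta, label) tuples, ordered from
    the most general to the most specific; it is scanned in reverse so
    that the most specific matching rule wins.  If no rule covers x,
    the else clause returns the default label.
    """
    for prototype, theta, label in reversed(rules):
        if math.dist(x, prototype) < theta:
            return label
    return default_label

rules = [([0.0, 0.0], 5.0, -1),   # general rule, checked last
         ([1.0, 1.0], 0.5,  1)]   # more specific rule, checked first
predict([1.0, 1.1], rules, -1)    # → 1 (the specific rule fires)
```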

Algorithm 1. Sequential covering algorithm

Require: T, minSupport, maxIterations
  P ← ∅ {set of prototype rules}
  S ← T {set of uncovered or misclassified examples}
  i ← 0
  repeat
    {pi , θi } ← CreateNewRule(S, T)
    P ← P ∪ {pi , θi }
    else ← DetermineElseCondition(S)
    S ← ApplyRules(T, P)
    S ← ApplyElse(S, else)
    i ← i + 1
  until (|S| < minSupport) or (i ≥ maxIterations)
  return P, else

Construction of a single rule requires determination of the prototype position and
its corresponding threshold, which is performed using search strategies and criteria
commonly used in decision trees, such as the information gain, the Gini index or the
SSV separability criterion [9] (the sOPTDL algorithm). The algorithm considers each
input vector xi as a potential prototype, and the best tuple [p, θ, l], where θ is a threshold
and l is the consequence of the rule, is found using the search strategy sketched in
Algorithm 2, where S is the set of examples that are either unclassified or incorrectly
classified.
In experiments reported in [9] this approach worked quite well, although the positions
of all prototypes were restricted to the instances of the training set. That makes
the interpretation of the resulting rules easy, because each prototype is a real example
from the training set. On the other hand, it reduces the ability to create an arbitrary

Algorithm 2. Search-based ordered prototype-threshold decision list algorithm (sOPTDL)

Require: S, T
  for i ∈ T do
    l ← yi
    for k ∈ |S| such that yk−1 ≠ yk do
      θk ← 0.5 · (d (xi , xk ) − d (xi , xk−1 ))
      v ← Criterion(xi , θk , Ty=l , Sy=l )
      if v > v∗ then
        v∗ ← v
        θ ← θk
        p ← xi
      end if
    end for
  end for
  return p, θ

shape of the decision border. For two overlapping Gaussian distributions with identical
σ the optimal decision border is a hyperplane, but such a border cannot be created
using prototypes restricted to the examples of the training set. Without that restriction
a prototype can be moved towards infinity, and with an appropriately large threshold
a good approximation of a linear decision border can be obtained. The next section
presents the optimization procedures used to determine the position of a prototype and
its appropriate threshold.
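The threshold search of Algorithm 2 can be approximated in Python as below. This is a hedged sketch: candidate thresholds are placed midway between consecutive sorted distances where the class label changes (the printed formula in Algorithm 2 uses their difference), and plain rule accuracy stands in for the information gain, Gini or SSV criteria used in the paper:

```python
import math

def best_threshold(p, label, X, y):
    """Search candidate thresholds for one candidate prototype p
    (illustrative sketch of the sOPTDL inner loop)."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], p))
    dsorted = [math.dist(X[i], p) for i in order]
    best_v, best_theta = -1.0, 0.0
    for k in range(1, len(order)):
        if y[order[k - 1]] == y[order[k]]:
            continue  # only consider splits where the class changes
        theta = 0.5 * (dsorted[k - 1] + dsorted[k])
        # Criterion: fraction of examples for which the rule
        # "D(x, p) < theta -> label" agrees with the true class.
        v = sum((math.dist(x, p) < theta) == (yy == label)
                for x, yy in zip(X, y)) / len(X)
        if v > best_v:
            best_v, best_theta = v, theta
    return best_theta, best_v

X = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.1, 0.0]]
y = [1, 1, -1, -1]
theta, v = best_threshold([0.0, 0.0], 1, X, y)  # theta = 1.05, v = 1.0
```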

3 Neural Optimization of Prototype-Threshold Rules


The goal here is to determine the optimal position of the prototype and its associated
threshold. This is done by optimizing the parameters of neurons that implement
hyperspherical isolines, such that each coverage step of the rule induction consists of
training a single neuron (the nOPTDL algorithm). The transfer function of that neuron is
based on a modified logistic function:

z(x|p, θ) = σ (D (x, p)^α − θ)   (2)

σ(x) = 1 / (1 + exp (−x))   (3)

where p is the position of the prototype, D (x, p) is the distance function, α represents
the exponent of the distance function (for Euclidean distance α = 2) and θ denotes
the threshold or bias. The α parameter is used to add flexibility to distance functions,
regulating their shape as a function of differences between vectors.
The inner part of the transfer function g (x) = D (x, p)^α − θ defines the area covered
by the active neuron, such that vectors x that fall into this area give positive values
g (x) > 0 and those outside give negative values g (x) < 0. The logistic function is used
for smooth nonlinear normalization of the g (x) values to fit them into the range [0, 1].

For vectors x close to the border defined by z(·) = 0.5, this value increases towards 1
inside and decreases towards 0 outside the area covered by the neuron, with a speed of
change that depends on the slope of the logistic function and the scaling of the distance
function.
The objective function used to optimize the properties of the neuron is defined as:

E(p, θ) = Σ_{i∈C} z (xi |p, θ) · l · yi   (4)

which is a sum of neuron activations, each multiplied by the product of the rule
consequence l = ±1 (associated with the prototype p) and the label yi . C denotes the
set of training examples that are incorrectly classified (examples of T with l ≠ yi ) and
examples that are not yet covered by the current set of rules (examples of T for which
l = yi ).
The objective function can be optimized using either gradient or non-gradient opti-
mization methods. To avoid local minima and to speed up convergence, a gradient
optimization procedure restarted from 5 different random locations is used, each time
starting from a vector that is not yet properly classified.
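Under the stated formulas, a minimal Python sketch of the neuron (eqs. 2-3), the objective (eq. 4) and one possible gradient-ascent update is given below. The exact update rule, learning rate and toy data are illustrative assumptions, since the paper only specifies that gradient-based optimization with random restarts is used:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def activation(x, p, theta):
    """z(x | p, theta) = sigma(D(x, p)^alpha - theta) with alpha = 2
    and Euclidean D, i.e. eqs. (2)-(3)."""
    g = sum((xi - pi) ** 2 for xi, pi in zip(x, p)) - theta
    return sigmoid(g)

def objective(X, c, p, theta):
    """E(p, theta) = sum_{i in C} z(x_i | p, theta) * l * y_i, eq. (4);
    the products c_i = l * y_i are precomputed for the examples in C."""
    return sum(ci * activation(x, p, theta) for x, ci in zip(X, c))

def gradient_step(X, c, p, theta, lr=0.1):
    """One analytic gradient-ascent step on E (illustrative choice)."""
    gp, gt = [0.0] * len(p), 0.0
    for x, ci in zip(X, c):
        z = activation(x, p, theta)
        w = ci * z * (1.0 - z)                    # dE/dg for this example
        gt -= w                                   # dg/dtheta = -1
        for j in range(len(p)):
            gp[j] -= 2.0 * w * (x[j] - p[j])      # dg/dp_j = -2 (x_j - p_j)
    return [pj + lr * gj for pj, gj in zip(p, gp)], theta + lr * gt

X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]]
c = [1, 1, -1]                   # c_i = l * y_i for the examples in C
p, theta = [1.0, 1.0], 1.0
e0 = objective(X, c, p, theta)
for _ in range(50):
    p, theta = gradient_step(X, c, p, theta)
e1 = objective(X, c, p, theta)   # gradient ascent increases E
```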

4 Numerical Experiments

We experimentally compare the accuracy and comprehensibility of rules induced by
our system. The experiments were performed using 6 benchmark datasets with different
properties taken from the UCI repository [10]: the Cleveland heart disease (Heart
disease), Pima Indian diabetes (Diabetes), sonar (Sonar), Wisconsin breast cancer
(Breast cancer), ionosphere (Ionosphere) and appendicitis (Appendicitis) data. The
properties of these datasets are summarized in Tab. 1. These datasets represent quite
diverse domains, including medical data with heterogeneous types of attributes and
datasets with many continuous attributes, such as the Sonar dataset, which are difficult
to handle using crisp rules [2].

Table 1. Description of the datasets used in rule extraction experiments

Dataset        # vectors  # features  # classes  comment
Heart disease  297        13          2          6 vectors with missing values were removed
Diabetes       768        8           2
Sonar          208        60          2
Breast cancer  683        9           2          16 vectors with missing values were removed
Ionosphere     351        34          2
Appendicitis   106        8           2

In the first experiment the influence of the number of extracted rules on the classification
accuracy of nOPTDL has been analyzed. A 10-fold crossvalidation has been used
for the estimation of accuracy, repeating the test for different numbers of rules in the range
k = [1 . . . 10]. The results are presented in Fig. 1. They show that the
classification accuracy obtained using just a single P-rule is sometimes as good as with many
rules (heart, breast cancer). In other cases adding new rules improves accuracy up to a
certain point, but for all datasets no more than 5 rules were needed to reach the maxi-
mum accuracy. This shows that the prototype-threshold form of knowledge representation
can be quite efficient.

Fig. 1. Classification accuracy and variance as a function of the number of nOPTDL rules:
(a) Heart disease, (b) Diabetes, (c) Sonar, (d) Breast cancer, (e) Ionosphere, (f) Appendicitis
To compare the proposed nOPTDL algorithm with other state-of-the-art rule extraction
algorithms, another test was performed using double crossvalidation: the inner crossval-
idation was used to optimize the parameters of the given classification algorithm (for exam-
ple, the number of rules in our system) and the outer crossvalidation was used to estimate
the final accuracy. The testing procedure is presented in Fig. 2. Our nOPTDL algo-
rithm has been compared with its previous version based on search strategies (sOPTDL),
and also with the C4.5 decision tree [11] and the Ripper rule induction system [12]. The
experiments have been conducted using RapidMiner [13] with the Weka extension and
with the Spider toolbox [14]. The parameters of both the C4.5 and Ripper algorithms have
also been optimized using double crossvalidation, optimizing the pureness and the minimal
weights of instances. The results are presented in Tab. 2.

Fig. 2. The accuracy estimation procedure: an outer crossvalidation loop in which the parameters
are optimized by an inner crossvalidation (model training and testing), the best parameters are
selected, and the final model is trained and tested

Table 2. Comparison of the accuracy of the nOPTDL algorithm with C4.5 decision tree and
Ripper rule induction

Dataset        nOPTDL Acc±std  sOPTDL Acc±std  C4.5 Acc±std  Ripper Acc±std
Heart disease  83.5±5.76       80.48±4.33      77.2±4.3      80.13±7.24
Diabetes       72.00±4.4       71.62±4.01      74.2±4.7      74.61±2.66
Sonar          81.12±11.42     75.02±8.91      72.5±11.2     79.76±6.8
Breast cancer  96.92±2.13      96.93±1.08      95.28±4.7     96.28±1.7
Ionosphere     88.05±5.26      92.02±3.51      90.33±4.7     88.61±4.2
Appendicitis   86.72±6.63      82.27±11.85     83.9±6.0      85.81±6.2

For Heart disease the average accuracy of nOPTDL (1 rule) is over 5% higher than
that of the C4.5 classifier (21 rules) and 3% higher than that of the Ripper algorithm
(4 rules). A very good accuracy was also achieved for the Appendicitis dataset. The
average accuracy on the Sonar dataset (4 rules) was also very high; however, the standard
deviation, comparable to that obtained with the C4.5 decision tree, was much higher than
the standard deviation of Ripper. Diabetes required 3 P-rules. In the other cases a
single rule was sufficient. The results show that knowledge representation using a small
number of P-rules is very efficient.

5 Conclusions and Future Research

A modification of the OPTDL algorithm (nOPTDL) for extraction of prototype-
threshold based rules has been described. Neurons implementing sigmoidal functions
combined with distance-based functions represent single P-rules. Such an approach al-
lows for efficient gradient-based optimization methods for rule extraction. Moreover,
the use of the VDM metric and heterogeneous distance functions makes it possible to apply
this method to datasets consisting of symbolic or mixed types of features.
Experiments performed on diverse types of datasets showed that a good classification
accuracy can be achieved with a small number of P-rules, which is the goal of any rule
induction algorithm. In most cases even a single rule leads to a rather small error rate,
which demonstrates the high expressive power of prototype-based knowledge representation.

Further extensions of this algorithm, including beam search instead of the best-first
search, should improve its quality. Our future work also includes adding local fea-
ture weights to each neuron to automatically adjust feature significance. Enforcing
regularization should increase the sparsity of the obtained feature weights and lead
to improved comprehensibility by filtering useless attributes, thus simplifying
the extracted knowledge. Adopting appropriate distance measures and switching to the
Chebyshev distance (L∞ norm) may allow for classical crisp rule extraction using the
same OPTDL family of algorithms.

Acknowledgment. The work was funded by the grant No. ATH 2/IV/GW/2011 from
the University of Bielsko-Biala and by project No. 4421/B/T02/2010/38 (N516 442138)
from the Polish Ministry of Science and Higher Education.
The software package is available on the web page of The Instance Selection and
Prototype Based Rules Project at http://www.prules.org

References
1. Wieczorek, T.: Neural modeling of technological processes. Silesian University of Technol-
ogy (2008)
2. Duch, W., Setiono, R., Zurada, J.: Computational intelligence methods for understanding of
data. Proceedings of the IEEE 92, 771–805 (2004)
3. Nauck, D., Klawonn, F., Kruse, R.: Foundations of Neuro-Fuzzy Systems. John Wiley &
Sons, New York (1997)
4. Duch, W., Grudziński, K.: Prototype based rules - a new way to understand the data. In: IEEE
International Joint Conference on Neural Networks, pp. 1858–1863. IEEE Press, Washington
D.C. (2001)
5. Duch, W., Blachnik, M.: Fuzzy Rule-Based Systems Derived from Similarity to Prototypes.
In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS,
vol. 3316, pp. 912–917. Springer, Heidelberg (2004)
6. Kuncheva, L.: On the equivalence between fuzzy and statistical classifiers. International Jour-
nal of Uncertainty, Fuzziness and Knowledge-Based Systems 15, 245–253 (1996)
7. Wilson, D.R., Martinez, T.R.: Value difference metrics for continuously valued attributes. In:
Proceedings of the International Conference on Artificial Intelligence, Expert Systems and
Neural Networks, pp. 11–14 (1996)
8. Grąbczewski, K., Duch, W.: Heterogeneous Forests of Decision Trees. In: Dorronsoro, J.R.
(ed.) ICANN 2002. LNCS, vol. 2415, pp. 504–509. Springer, Heidelberg (2002)
9. Blachnik, M., Duch, W.: Prototype-Based Threshold Rules. In: King, I., Wang, J., Chan, L.-
W., Wang, D. (eds.) ICONIP 2006. LNCS, vol. 4234, pp. 1028–1037. Springer, Heidelberg
(2006)
10. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
11. Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
12. Cohen, W.W.: Fast effective rule induction. In: Twelfth International Conference on Machine
Learning, pp. 115–123 (1995)
13. Rapid-I: Rapidminer, http://www.rapid-i.com
14. Weston, J., Elisseeff, A., BakIr, G., Sinz, F.: The spider,
http://www.kyb.tuebingen.mpg.de/bs/people/spider/
Instance Selection with Neural Networks for Regression Problems

Mirosław Kordos¹ and Marcin Blachnik²

¹ University of Bielsko-Biala, Department of Mathematics and Computer Science,
Bielsko-Biała, Willowa 2, Poland
mkordos@ath.bielsko.pl
² Silesian University of Technology, Department of Management and Informatics,
Katowice, Krasińskiego 8, Poland
marcin.blachnik@polsl.pl

Abstract. The paper presents algorithms for instance selection for regression
problems based upon the CNN and ENN solutions known from classification tasks.
A comparative experimental study is performed on several datasets using multilayer
perceptrons and k-NN algorithms with different parameters, and their various
combinations, as the method the selection is based on. Various similarity
thresholds are also tested. The obtained results are evaluated taking into account the
size of the resulting dataset and the regression accuracy obtained with a multilayer
perceptron as the predictive model, and a final recommendation regarding
instance selection for regression tasks is presented.

Keywords: neural network, instance selection, regression.

1 Introduction
1.1 Motivation
There are two motivations for us to undertake research in the area of instance selec-
tion for regression problems. The first one is theoretical: most research on instance
selection done so far refers to classification problems, and the few papers on instance
selection for regression tasks do not cover the topic thoroughly, especially in the case
of practical application to real-world datasets. Our second motivation is very practical:
we have implemented in industry several computational intelligence systems for
technological process optimization [1], which deal with regression problems and huge
datasets, and there is a practical need to optimally reduce the number of instances in
the datasets before building the prediction and rule models.
There are the following reasons to reduce the number of instances in the training dataset:
1. Some instance selection algorithms, such as ENN, which is discussed later, reduce noise
in the dataset by eliminating outliers, thus improving the model performance.
2. Other instance selection algorithms, such as CNN, which is also discussed later, discard
from the dataset instances that are too similar to each other, which simplifies the
data and reduces its size.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 263–270, 2012.

© Springer-Verlag Berlin Heidelberg 2012

3. The above two selection models can be joined together to obtain the benefits of
both.
4. The training performed on a smaller dataset is faster. Although reducing the dataset
size also takes some time, it frequently can be done only once, before trying
various models with various parameters to find the best model for the problem.
5. When using lazy-learning algorithms such as k-NN, reducing the dataset size also re-
duces the prediction time.
6. Instance selection can be joined with prototype selection in prototype-based sys-
tems, e.g. prototype rule-based systems.

1.2 Instance Selection in Classification Problems


In this area much research has been done using the k-nearest neighbor algorithm (k-
NN) for instance selection in classification tasks. The early research led
to the Condensed Nearest Neighbor rule (CNN) [2] and the Edited Nearest Neighbor rule
(ENN) [3]. These basic algorithms were further extended, leading to more complex ones
like Drop1-5 [6], IB3, Gabriel Editing (GE) and Relative Neighborhood Graph Editing
(RNGE), Iterative Case Filtering (ICF), ENRBF2, ELH, ELGrow and Explore [7]. In-
stead of directly selecting instances from the training data, an interesting approach for
training a k-NN classifier was proposed by Kuncheva in [5], where preselected instances
were relabeled, such that each instance was assigned to all class labels with appropriate
weights describing the support for a given label. A large survey including almost 70 dif-
ferent algorithms of instance selection for classification tasks can be found in [7]. The
instance selection algorithms were designed to work with the k-NN classifier, so an inter-
esting idea was proposed by Jankowski and Grochowski in [4]: to use these algorithms
as instance filters for other machine learning algorithms like SVM, decision trees, etc.
By filtering noisy and compacting redundant examples they were able to improve the
quality and speed of other classification algorithms. In this paper we aim to obtain the
same for regression tasks.

1.3 Challenges in Regression Tasks


The instance selection issue for regression tasks is much more complex. The reason is
that in classification tasks only the boundaries between classes must be precisely deter-
mined, while in regression tasks the output value must be properly calculated at each
point of the input space. Moreover, the decision in classification tasks is frequently bi-
nary, or there are at most several different classes, while in regression tasks the output
of the system is continuous, so there is an unlimited number of possible values predicted
by the system. As a result, the dataset compression obtained by instance selection
can be much higher in classification than in non-linear regression problems. Moreover,
the decision about rejecting a given vector in classification tasks can be made based
on the right or wrong classification of the vector by some algorithm. In regression prob-
lems, some threshold defining the acceptable difference between the predicted and the ac-
tual value of the vector output must be set instead. As discussed later, the threshold should
be variable rather than constant, taking different values in different areas of the data. Thus,
determining the threshold is an issue that does not exist in classification tasks. Another
problem is the error measure, which in classification tasks is very straightforward, while
in regression tasks the error measure can be defined in several ways, and in practical
solutions the simple error definitions such as the MSE (mean square error) do not always
work best [1].
Because of these challenges, there have been very few approaches in the literature to in-
stance selection for regression problems. Moreover, the approaches were usually not
verified on real-world datasets. Zhang [9] presented a method to select the input vectors
while calculating the output with k-NN. Tolvi [10] presented a genetic algorithm to per-
form feature and instance selection for linear regression models. In their work, Guillen
et al. [11] discussed the concept of mutual information used for the selection of prototypes
in regression problems.

2 Methodology

2.1 ENN and CNN Instance Selection Methods

The CNN (Condensed Nearest Neighbor) algorithm was proposed by Hart [2]. For clas-
sification problems, as shown in [4], CNN reduces the number of vectors roughly three
times on average. CNN used for classification works in the following way: the algorithm
starts with only one randomly chosen instance from the original dataset T, and this
instance is added to the new dataset P. Then each remaining instance from T is classi-
fied with the k-NN algorithm, using the k nearest neighbors from the dataset P. If the
classification is correct, the instance is not added to the final dataset P; if the clas-
sification is wrong, the instance is added to P. Thus, the purpose of CNN is to reject
those instances which do not bring any additional information into the classification
process.
The ENN (Edited Nearest Neighbor) algorithm was created by Wilson [3]. The main
idea of ENN is to remove a given instance if its class is different from the majority class
of its neighbors; thus ENN works as a noise filter. ENN starts from the entire original
training set T. Each instance which is correctly classified by its k nearest neighbors is
added to the new dataset P, and each instance wrongly classified is not added. Several
variants of ENN exist: in Repeated ENN, proposed by Wilson, the ENN process
is iteratively repeated as long as there are any wrongly classified instances; in the All k-NN
algorithm proposed by Tomek [12], ENN is repeated for all k from k=1 to kmax .
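For illustration, the classical CNN and ENN procedures described above can be written compactly in Python (a 1-NN variant of CNN is used for brevity, and all names are illustrative):

```python
import math
from collections import Counter

def nn_label(x, data):
    """Label of the nearest neighbour of x among (vector, label) pairs."""
    return min(data, key=lambda s: math.dist(x, s[0]))[1]

def cnn(data):
    """Hart's CNN: keep only instances misclassified by the current
    condensed set P (1-NN sketch of the k-NN description above)."""
    P = [data[0]]
    for x, y in data[1:]:
        if nn_label(x, P) != y:
            P.append((x, y))
    return P

def enn(data, k=3):
    """Wilson's ENN: drop every instance whose label disagrees with the
    majority vote of its k nearest neighbours (noise filtering)."""
    P = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        neigh = sorted(rest, key=lambda s: math.dist(x, s[0]))[:k]
        vote = Counter(lab for _, lab in neigh).most_common(1)[0][0]
        if vote == y:
            P.append((x, y))
    return P

data = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0),
        ((5.0, 5.0), 1), ((5.1, 5.0), 1), ((5.0, 5.1), 1),
        ((4.9, 5.0), 0)]              # the last point is a mislabeled outlier
condensed = cnn(data)                 # keeps a few seed/border instances
edited = enn(data)                    # removes the mislabeled outlier
```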

2.2 RegENN and RegCNN: ENN and CNN for Regression Problems

The first step in modifying the CNN and ENN algorithms to enable using them for regres-
sion tasks is to replace the wrong/correct classification decision with a distance measure
and a similarity threshold, which decide whether the examined vector can be considered
similar to its neighbors or not. For that purpose we use the Euclidean measure and a threshold
θ, which expresses the maximum difference between the output values of two vectors for
them to be considered similar. Using θ proportional to the standard deviation of the outputs
of several nearest neighbors of the vector xi reflects the speed of changes of the output
around xi and allows adjusting the threshold to the local landscape, which, as the experiments
showed, allows for better compression of the dataset. We also changed the algorithm used to
predict the output Y (xi ) from k-NN to an MLP (multilayer perceptron) neural net-
work, which in many cases allowed for better results (see Table 1). Additionally, the
best results were obtained if the MLP network was trained not on the entire dataset, but
only on a part of it in the area of the vector of interest. The algorithms are shown in the
following pseudo-codes.

Algorithm 1. RegENN algorithm

Require: T
  m ← sizeof (T)
  for i = 1 . . . m do
    Ȳ (xi ) ← NN((T \ xi ), xi )
    S ← Model(T, xi )
    θ ← α · std (Y (XS ))
    if |Y (xi ) − Ȳ (xi )| > θ then
      T ← T \ xi
    end if
  end for
  P ← T
  return P

Algorithm 2. RegCNN algorithm

Require: T
  m ← sizeof (T)
  P ← ∅
  P ← P ∪ x1
  for i = 2 . . . m do
    Ȳ (xi ) ← NN(P, xi )
    S ← Model(T, xi )
    θ ← α · std (Y (XS ))
    if |Y (xi ) − Ȳ (xi )| > θ then
      P ← P ∪ xi
      T ← T \ xi
    end if
  end for
  return P

Here T is the training dataset, P is the set of selected prototypes, xi is the i-th
vector, m is the number of vectors in the dataset, Y (xi ) is the real output value of vector
xi , Ȳ (xi ) is the predicted output of vector xi , S is the set of nearest neighbors of vector
xi , NN(A,x) is the algorithm which is trained on dataset A and uses vector x as a test
sample for which Ȳ (xi ) is predicted (in our case NN(A,x) is implemented by k-NN
or MLP), kNN is the k-NN algorithm returning the subset S of several closest neighbors
of xi , θ is the threshold of acceptance/rejection of the vector as a prototype, α is a
certain coefficient (discussed in the experimental section), and std (Y (XS ))
is the standard deviation of the outputs of the vectors in S.
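A minimal Python sketch of RegENN with the variable threshold θ = α · std(Y(XS)) follows; k-NN stands in for NN(A, x) (the paper also uses locally trained MLPs in this role), and all names and the toy data are illustrative:

```python
import math
import statistics

def knn_predict(train, x, k=3):
    """Mean output of the k nearest neighbours of x (stands in for NN(A, x))."""
    neigh = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    return sum(y for _, y in neigh) / len(neigh)

def reg_enn(data, alpha=5.0, k=3):
    """RegENN sketch: remove x_i when |Y(x_i) - Yhat(x_i)| exceeds the
    variable threshold theta = alpha * std(Y(X_S)), computed from the
    outputs of the nearest neighbours of x_i."""
    kept = list(data)
    for x, y in data:
        rest = [s for s in kept if s != (x, y)]
        if len(rest) < k:
            break
        y_hat = knn_predict(rest, x, k)
        neigh = sorted(rest, key=lambda s: math.dist(x, s[0]))[:k]
        theta = alpha * statistics.pstdev([yy for _, yy in neigh])
        if abs(y - y_hat) > theta:
            kept.remove((x, y))
    return kept

# A sample of y = x with one outlier: the variable threshold keeps the
# smooth points and rejects the outlier.
data = [((i / 10.0,), i / 10.0) for i in range(10)] + [((0.55,), 5.0)]
selected = reg_enn(data)
```

The variable threshold is what makes this robust: near the outlier the neighbour outputs have a large spread, so clean points next to it are not discarded, while the outlier itself sits among low-variance neighbours and is removed.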

2.3 Instance Selection Extension for RapidMiner

We used the RapidMiner [15] software for implementing the algorithms and performing
the numerical experiments. We implemented the instance selection modules in Java and
incorporated them into the entire model. The source code of the instance selection modules,
the .jar file containing them (which can be used with RapidMiner without the necessity to
compile the sources), the .xml file with the process, and the datasets we used can be ob-
tained from [16]. The outermost loop of the model iterates over the θ or α parameters; the
lower-level loop is a 10-fold crossvalidation, where in the training part the RegCNN
or RegENN algorithm is performed and an MLP network is trained on the selected in-
stances, and in the test part this MLP network is tested on the test set. The RegCNN or
RegENN module contains the NN(A,x) algorithm, which in Fig. 1 is an MLP network.

Fig. 1. The RapidMiner model used in the experiments

3 Experiments and Results


3.1 Dataset Description
We performed the experiments on five datasets. Before the experiments all the datasets
were standardized so that the mean value of each attribute is zero and the standard devi-
ation is one, to enable an easy comparison of the results obtained with different datasets
and of the parameters used in the experiments. All the datasets in the form used in the
experiments can be downloaded from our web page [16]. They include four datasets
from the UCI Machine Learning repository [14]: Concrete Compression Strength (7 in-
put attributes, 1030 instances), Forest Fires (10 input attributes, 518 instances), Crime
and Communities (7 input attributes, 320 instances) and Housing (13 input attributes, 506
instances). One dataset (Steel: 12 input attributes, 960 instances) contains data from a
steel production process; the task here is to predict the amount of carbon that must
be added to the liquid steel in the steel-making process to obtain steel of desired param-
eters, given the chemical and physical properties of the steel and the current melting
process.

3.2 Experimental Results


We performed the experiments using the model described in the previous section. Both
MLP networks (the network used for instance selection and the network used for the
final prediction) had one hidden layer with a sigmoid activation function, and the number
of hidden neurons was rounded to (number of attributes + 1) / 2 + 1. Both networks
were trained with the backpropagation algorithm for 500 training cycles with learning
rate = 0.3 and momentum = 0.2. The stopping error was 1.0E-5.
In the case of ENN, which is a kind of noise filter, θ should be rather large,
because only the outliers should be rejected. In the experiments with ENN we used 50
different values of θ, usually from 0.05 to 5.0, or of α, usually from 0.5(1) to 50(100), in the
case of variable θ. On the other hand, for the CNN algorithm θ should be small, because
one of two vectors should be rejected only if they are very similar. In the experiments
with CNN we used 50 different values of θ from 0.001 to 0.5, or of α, usually from 0.01 to
5, for each dataset. We used the following algorithms to predict the value of the vector
being selected:
– k-NN with k=3
– k-NN with k=9 (frequently k close to 9 is the optimal value [13])
Table 1. Results for the Steel, Concrete, Crime, Forest and Housing datasets obtained with the
optimal α parameters (where θ = α · std (Y (XS )))

algorithm               Steel         Concrete     Crime       Forest      Housing
ENN-3NN      MSE        0.23±0.08     0.94±0.10    0.69±0.09   1.44±0.3    0.46±0.11
             vect(α)    740±3(5)      275±6(4)     211±3(5)    373±6(1)    348±4(5)
ENN-9NN      MSE        0.25±0.08     0.92±0.08    0.68±0.09   1.42±0.3    0.43±0.10
             vect(α)    776±2(5)      277±5(4)     210±2(5)    379±5(10)   350±4(5)
ENN-MLP90    MSE        0.21±0.06     0.92±0.10    0.66±0.09   1.39±0.3    0.42±0.09
             vect(α)    772±3(5)      279±5(3.5)   206±1(6)    389±6(6)    365±3(6)
ENN-MLP30    MSE        0.21±0.06     0.90±0.10    0.67±0.08   1.39±0.3    0.41±0.08
             vect(α)    773±3(5)      245±6(4)     209±1(8)    391±6(7)    359±3(6)
ENN-MLP10    MSE        0.23±0.06     0.90±0.10    0.67±0.08   1.38±0.3    0.42±0.08
             vect(α)    752±3(5)      210±5(3.4)   215±1(7)    388±7(10)   360±3(7)
CNN-3NN      MSE        0.24±0.06     0.97±0.10    0.68±0.07   1.41±0.3    0.40±0.09
             vect(α)    747±6(0.1)    782±7(0.4)   235±2(0.3)  429±6(0.1)  385±3(0.4)
CNN-9NN      MSE        0.23±0.07     0.96±0.10    0.66±0.06   1.38±0.3    0.39±0.08
             vect(α)    746±8(0.15)   783±6(0.4)   243±2(0.3)  429±6(0.8)  387±3(0.5)
CNN-MLP90    MSE        0.23±0.08     0.96±0.10    0.68±0.07   1.39±0.3    0.41±0.08
             vect(α)    741±6(0.1)    787±6(0.5)   240±2(0.3)  423±4(0.2)  387±3(0.3)
CNN-MLP30    MSE        0.21±0.07     0.95±0.09    0.67±0.07   1.38±0.3    0.42±0.10
             vect(α)    752±12(0.1)   784±7(0.5)   238±2(0.3)  426±4(0.1)  391±3(0.3)
CNN-MLP10    MSE        0.22±0.01     0.94±0.10    0.68±0.06   1.42±0.3    0.42±0.08
             vect(α)    763±10(0.1)   780±7(0.5)   243±2(0.3)  424±4(0.1)  387±3(0.3)
ENN-CNN      MSE        0.19±0.05     0.86±0.07    0.64±0.07   1.35±0.3    0.39±0.08
             vect       722±10        186±4        197±4       366±4       339±6
             α          5/0.1         3.5/0.5      8/0.3       8/0.5       6/0.3
No selection MSE        0.23±0.08     1.01±0.10    0.71±0.10   1.50±0.3    0.40±0.08
             vect       864           927          288         466         455

– MLP network trained on the entire training data within one fold of the cross-
validation process (that is, 90 percent of the whole dataset; therefore it is shown in
the results as CNN-MLP90 or ENN-MLP90)
– MLP network trained on the 33 percent of the training vectors which were closest
to the considered vector (shown in the results as CNN-MLP30 or ENN-MLP30)
– MLP network trained on the 11 percent of the training vectors which were closest
to the considered vector (shown in the results as CNN-MLP10 or ENN-MLP10)
Because of the limited space, only the best results obtained for each model with variable θ
are shown in the table, and the comparison of results obtained with various θ (constant and
variable) is shown for one dataset and one method in a graphical form.

Fig. 2. Dependence of the MSE (MSE_CT: with constant θ; MSE_VT: with variable θ) and of the
number of selected vectors (vect_CT: with constant θ; vect_VT: with variable θ) on the threshold θ
(when it is constant) and on α (where θ = α · std(Y(XS)))
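To make the editing rule concrete, the following is a minimal sketch of an ENN-style selection pass for regression with a k-NN predictor and the variable threshold θ = α · std of the neighbourhood targets. The function name and the data are illustrative, not the authors' implementation.

```python
import numpy as np

def reg_enn(X, y, k=3, alpha=3.0):
    """ENN-style instance selection for regression (sketch).

    A vector is dropped when the k-NN prediction of its target differs
    from the true target by more than the variable threshold
    theta = alpha * std(targets of the neighbourhood)."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                    # exclude the vector itself
        nn = np.argsort(d)[:k]
        pred = y[nn].mean()              # k-NN regression estimate
        theta = alpha * np.std(y[nn])    # variable threshold
        if abs(pred - y[i]) > theta:
            keep[i] = False              # treated as noise and removed
    return X[keep], y[keep]

X = np.arange(10, dtype=float).reshape(-1, 1)
y = X.ravel().copy()
y[5] = 100.0                             # one injected outlier
Xs, ys = reg_enn(X, y, k=3, alpha=3.0)
```

A CNN-style condensing pass would work in the opposite direction, adding a vector to the selected subset when its prediction from the already-selected vectors exceeds the threshold.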

4 Conclusions

We presented extensions of CNN and ENN, called the RegCNN and RegENN algorithms,
that can be applied to regression tasks, and experimentally evaluated the influence of the
θ and α parameters and of various learning methods within the selection algorithm on the
number of selected vectors and on the prediction accuracy obtained with an MLP neural
network on the reduced dataset. The general conclusion is that in most cases the best
results are obtained using an MLP network trained on the subset of closest neighbors
of the considered point. It was observed that, in general, the θ used with CNN could
on average be set to 0.1 of the MSE value (or α to 0.5) while performing prediction on the
270 M. Kordos and M. Blachnik

unreduced dataset, while the θ used with ENN could be set to 5 times the MSE value (or α
to 5) for standardized data. The algorithms are not very sensitive to changes of α in terms
of prediction accuracy, although RegENN in particular allows for better dataset compression
with lower α. A variable θ allows more vectors to be removed without affecting the
prediction accuracy. The best results are obtained if ENN is applied to the dataset first,
followed by CNN.
It should be possible to significantly improve the results: first, by tuning the param-
eters of the MLP network and using more efficient MLP training methods, such as the
Levenberg-Marquardt algorithm; and second, by using more advanced instance selection
methods, such as those briefly surveyed in the introduction. These issues will be the area
of our further research.

Acknowledgment. The work was sponsored by the grant ATH 2/IV/GW/2011 from
the University of Bielsko-Biala.
References
1. Kordos, M., Blachnik, M., Wieczorek, T.: Temperature Prediction in Electric Arc Furnace
with Neural Network Tree. In: Honkela, T. (ed.) ICANN 2011, Part II. LNCS, vol. 6792, pp.
71–78. Springer, Heidelberg (2011)
2. Hart, P.E.: The condensed nearest neighbor rule. IEEE Transactions on Information The-
ory 14, 515–516 (1968)
3. Wilson, D.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans-
actions on Systems, Man, and Cybernetics 2, 408–421 (1972)
4. Jankowski, N., Grochowski, M.: Comparison of Instance Selection Algorithms I. Algorithms
Survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
5. Kuncheva, L., Bezdek, J.C.: Presupervised and postsupervised prototype classifier design.
IEEE Transactions on Neural Networks 10, 1142–1152 (1999)
6. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms.
Machine Learning 38, 251–268 (2000)
7. Salvador, G., Derrac, J., Ramon, C.: Prototype Selection for Nearest Neighbor Classifica-
tion: Taxonomy and Empirical Study. IEEE Transactions on Pattern Analysis and Machine
Intelligence 34, 417–435 (2012)
8. Kohavi, R., John, G.: Wrappers for Feature Subset Selection. AIJ special issue on relevance
(May 1997)
9. Zhang, J., et al.: Intelligent selection of instances for prediction functions in lazy learning
algorithms. Artificial Intelligence Review 11, 175–191 (1997)
10. Tolvi, J.: Genetic algorithms for outlier detection and variable selection in linear regression
models. Soft Computing 8, 527–533 (2004)
11. Guillen, A., et al.: Applying Mutual Information for Prototype or Instance Selection in Re-
gression Problems. In: ESANN 2009 Proceedings (2009)
12. Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Transactions on Sys-
tems, Man, and Cybernetics 6, 448–452 (1976)
13. Kordos, M., Blachnik, M., Strzempa, D.: Do We Need Whatever More Than k-NN? In:
Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2010,
Part I. LNCS (LNAI), vol. 6113, pp. 414–421. Springer, Heidelberg (2010)
14. Merz, C., Murphy, P.: UCI repository of machine learning databases (1998-2012),
http://www.ics.uci.edu/mlearn/MLRepository.html
15. http://www.rapid-i.com
16. http://www.kordos.com/icann2012
A New Distance for Probability Measures
Based on the Estimation of Level Sets

Alberto Muñoz, Gabriel Martos, Javier Arriero, and Javier Gonzalez

Department of Statistics, University Carlos III, Madrid, and Johann Bernoulli


Institute for Mathematics and Computer Science, University of Groningen
{alberto.munoz,gabrielalejandro.martos,javier.arriero}@uc3m.es,
j.gonzalez.hernandez@rug.nl

Abstract. In this paper we propose to consider Probability Measures


(PM) as generalized functions belonging to some functional space en-
dowed with an inner product. This approach allows us to introduce a new
family of distances for PMs. We propose a particular (non parametric)
metric for PMs belonging to this class, based on the estimation of den-
sity level sets. Some real and simulated data sets are used for a first
exploration of its performance.

1 Introduction

The study of distances between probability measures/distributions (PM) is in-


creasingly attracting attention in the fields of Data Analysis and Pattern Recog-
nition. Classical examples are homogeneity, independence and goodness of fit
test problems between populations where the general aim is to determine if data
generated from different random samples come from the same population or not.
These problems can be solved by choosing an appropriate distance between PM.
You can also find examples in Clustering [5,25], Image Analysis [9,10], Time Se-
ries Analysis [18,13], Econometrics [24,17] and Text Mining [15,7], just to name
a few.
For a review of interesting distances between probability distributions and
theoretical results, see for instance [26,4] and references therein. Non parametric
estimators often play a role in estimating such distances. In practical situations
only a limited data sample is usually available, and the use of purely non-
parametric estimators often results in poor performance [11].
An appealing point of view, initiated by Fisher and Rao [6,1,3] and continued
with recent development of Functional Data Analysis and Information Geometry
Methods (e.g. [20,2]), is to consider probability distributions as points belonging
to some manifold, and then take advantage of the manifold structure to derive
appropriate metrics for distributions.
In this work we elaborate on the idea of considering PMs as points in a func-
tional space endowed with an inner product, and then obtain different distances
for PMs from the metric structure derived from the ambient inner product. We
propose a particular instance of such metrics for generalized functions based on

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 271–278, 2012.

c Springer-Verlag Berlin Heidelberg 2012

the estimation of density level sets regions. The article is organized as follows:
In Section 2 we present probability measures as generalized functions and then
we define general distances acting on the Schwartz distribution space. Section 3
presents a new distance built according to this point of view. Section 4 illustrates
the theory with some simulated and real data sets.

2 Probability Measures as Schwartz Distributions


Consider a measure space (X, F, μ), where X is a sample space (here a compact
set of a real vector space), F a σ-algebra of measurable subsets of X and μ :
F → IR+ the ambient σ-additive measure. A probability measure P is a σ-
additive finite measure absolutely continuous w.r.t. μ that satisfies the three
Kolmogorov axioms. By the Radon-Nikodym theorem, there exists a measurable
function f : X → IR+ (the density function) such that P(A) = ∫_A f dμ, where
f = dP/dμ is the Radon-Nikodym derivative.
A PM can be regarded as a Schwartz distribution (generalized function) (see
[22] for an introduction to Distribution Theory): We consider a vector space D
of test functions. The usual choice is to consider all the functions in C∞(X)
having compact support. A distribution (also named generalized function) is a
continuous linear functional on D. A probability measure can be regarded as
a Schwartz distribution P : D → IR by defining P(φ) = ⟨P, φ⟩ = ∫ φ dP =
∫ φ(x)f(x) dμ(x) = ⟨φ, f⟩. Thus, the probability distribution identifies with a
linear functional obeying the Riesz representer theorem: the representer for P is
its density function f ∈ D: P(·) = ⟨·, f⟩.
In particular, the familiar condition P(X) = 1 is equivalent to ⟨P, 1[X]⟩ = 1,
where the indicator function 1[X] belongs to D, X being compact. Note that we
do not need f ∈ D; only the integral ⟨φ, f⟩ must be properly defined for every
φ ∈ D.
So a probability measure/distribution is a continuous linear functional
acting on a given function space. Two given linear functionals P1 and P2 will
be the same (or similar) if they act identically (or similarly) on every φ ∈ D.
For instance, if we choose φ = Id, P1(φ) = ⟨P1, x⟩ = ∫ x dP1 = E_{P1}[X], and if P1
and P2 are 'similar' then μ1 = E_{P1}[X] ≈ E_{P2}[X] = μ2, because P1 and P2 are
continuous. Similar arguments apply for the variance (take φ(x) = (x − E[X])²) and
in general for higher-order moments. For φ_ξ(x) = e^{ixξ}, ξ ∈ IR, we obtain the
Fourier transform of the probability measure (called the characteristic function in
Statistics), given by P̂(ξ) = ⟨P, e^{ixξ}⟩ = ∫ e^{ixξ} dP(x).
Thus, two PMs can be identified with their action as functionals on the
test functions, and hence distances between two distributions can be defined
from the differences between functional evaluations for appropriately chosen test
functions.
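The identification of a PM with its action on test functions is easy to emulate numerically: ⟨P, φ⟩ can be estimated by the sample mean of φ over a sample drawn from P. A small illustration (function names and sample sizes are ours), recovering the mean, the variance and the characteristic function of a standard normal:

```python
import numpy as np

def functional_eval(sample, phi):
    """Monte Carlo estimate of <P, phi> = E_P[phi(X)]."""
    return np.mean(phi(sample))

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, 20_000)          # sample from P = N(0, 1)

mean_p = functional_eval(p, lambda x: x)                   # phi = Id
var_p = functional_eval(p, lambda x: (x - mean_p) ** 2)    # second moment
cf_p = functional_eval(p, lambda x: np.exp(1j * 1.0 * x))  # phi_xi at xi = 1
```

The characteristic function of N(0, 1) at ξ = 1 is e^{−1/2} ≈ 0.607, so cf_p should land close to that value.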

3 A Metric Based on the Estimation of Level Sets


Given two PMs P and Q, we consider a family of test functions {φi}_{i∈I} ⊆ D
and then define distances between P and Q by weighting terms of the type
d(⟨P, φi⟩, ⟨Q, φi⟩) for i ∈ I, where d is some distance function. Our test functions
will be indicator functions of α-level sets, introduced below.
Given a PM P with density function fP, minimum volume sets (or α-level sets)
are defined by Sα(fP) = {x ∈ X | fP(x) ≥ α}, such that P(Sα(fP)) = 1 − ν,
where 0 < ν < 1. If we consider an ordered
sequence α1 < . . . < αn, αi ∈ (0, 1), then Sαi+1(fP) ⊆ Sαi(fP). Let us define
Ai(P) = Sαi(fP) − Sαi+1(fP), i ∈ {1, . . . , n − 1}. We can choose α1 ≈ 0 and
αn ≈ max_{x∈X} fP(x) (which exists, given that X is compact and fP continuous);
then ∪i Ai(P) ≈ Supp(P) = {x ∈ X | fP(x) ≠ 0} (equality takes place when
n → ∞, α1 → 0 and αn → 1). Given the definition of the Ai, if Ai(P) = Ai(Q)
for every i when n → ∞, then P = Q. Thus, taking φi = 1[Ai], our choice is
d(⟨P, φi⟩, ⟨Q, φi⟩) = |⟨P, φi⟩ − ⟨Q, φi⟩| = |∫ 1[Ai] dP − ∫ 1[Bi] dQ| ≈ |∫ 1[Ai] dμ −
∫ 1[Bi] dμ|, where Bi = Ai(Q) and μ is the ambient measure. Indeed, given the
definition of level set and the choice of the Ai, both P and Q are approximately
constant on Ai and Bi, respectively, and so we are using the counting (ambient)
measure.
Denote by △ the symmetric difference operator: A △ B = (A − B) ∪ (B − A).
Consider φ1i = 1[Ai(P)−Ai(Q)] and φ2i = 1[Ai(Q)−Ai(P)], and define di(P, Q) =
|⟨P, φ1i⟩ − ⟨Q, φ1i⟩| + |⟨P, φ2i⟩ − ⟨Q, φ2i⟩|. From the previous discussion, di(P, Q) ≈
μ(Ai(P) △ Ai(Q)), which motivates the following:

Definition 1. Weighted level-set distance. Given α = {αi}_{i=1}^{n}, we define

    dα(P, Q) = Σ_{i=1}^{n−1} αi · μ(Ai(P) △ Ai(Q)) / μ(Ai(P) ∪ Ai(Q)) ,        (1)

where μ is the ambient measure. We use μ(Ai(P) △ Ai(Q)) in the numerator
instead of di(P, Q) ≈ μ(Ai(P) △ Ai(Q)) for compactness. When n → ∞ both
expressions are equivalent. Of course, we can calculate dα in eq. (1) only when we
know the distribution function of both PMs. In practice there will be available
two data samples generated from P and Q, and we need to define some plug-in
estimator: Consider the estimators Âi(P) = Ŝαi(fP) − Ŝαi+1(fP); then we can
estimate dα(P, Q) by

    d̂α(P, Q) = Σ_{i=1}^{n−1} αi · #(Âi(P) △_S Âi(Q)) / #(Âi(P) ∪ Âi(Q)) ,      (2)

where #A indicates the number of points in A and △_S indicates the set estimate
of the symmetric difference, defined below.
Both dα(P, Q) and d̂α(P, Q), as currently defined, are semimetrics. The construction
of proper Euclidean metrics will be addressed in work following the present one.
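When both densities are known, dα in eq. (1) can be computed directly. A minimal sketch on a discretized real line, taking μ to be the counting measure over grid points (the grid and the level values are our choices):

```python
import numpy as np

def level_shells(f, alphas):
    """A_i = S_{alpha_i}(f) - S_{alpha_{i+1}}(f) as boolean masks on a grid."""
    return [(f >= alphas[i]) & (f < alphas[i + 1])
            for i in range(len(alphas) - 1)]

def d_alpha(fp, fq, alphas):
    """Weighted level-set distance (Eq. 1) with mu = counting measure."""
    total = 0.0
    for a, Ap, Aq in zip(alphas, level_shells(fp, alphas),
                         level_shells(fq, alphas)):
        union = np.sum(Ap | Aq)
        if union:
            total += a * np.sum(Ap ^ Aq) / union   # ^ is symmetric difference
    return total

x = np.linspace(-5.0, 5.0, 1001)
gauss = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)
alphas = np.linspace(0.01, 0.4, 9)
```

Two identical densities give distance zero, and shifting one Gaussian moves every shell and produces a strictly positive distance.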

3.1 Estimation of Level Sets


To estimate level sets from a data sample we present the following definitions
and theorems, concerning the One-Class Neighbor Machine [19].

Definition 2. Neighbourhood Measures. Consider a random variable X
with density function f(x) defined on IR^d. Let Sn denote the set of random
independent identically distributed (iid) samples of size n (drawn from f). The
elements of Sn take the form sn = (x1, . . . , xn), where xi ∈ IR^d. Let M : IR^d ×
Sn → IR be a real-valued function defined for all n ∈ IN. (a) If f(x) < f(y)
implies lim_{n→∞} P(M(x, sn) > M(y, sn)) = 1, then M is a sparsity measure.
(b) If f(x) < f(y) implies lim_{n→∞} P(M(x, sn) < M(y, sn)) = 1, then M is a
concentration measure.

Example: Consider the distance from a point x ∈ IRd to its k th -nearest neigh-
bour in sn , x(k) : M (x, sn ) = dk (x, sn ) = d(x, x(k) ): it is a sparsity measure. Note
that dk is neither a density estimator nor is it one-to-one related to a density
estimator. Thus, the definition of ‘sparsity measure’ is not trivial.
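The kth-nearest-neighbour distance above is straightforward to compute. A small check (data and names ours) that it behaves as a sparsity measure, i.e. is larger where the density is lower:

```python
import numpy as np

def knn_sparsity(x, sample, k=5):
    """M(x, s_n): distance from x to its k-th nearest neighbour in the
    sample; large in sparsely populated (low-density) regions."""
    d = np.sort(np.linalg.norm(sample - x, axis=1))
    return d[k - 1]

rng = np.random.default_rng(1)
s = rng.normal(0.0, 1.0, size=(2000, 2))     # dense near the origin
m_mode = knn_sparsity(np.zeros(2), s)        # high-density point
m_tail = knn_sparsity(np.array([4.0, 4.0]), s)  # low-density point
```

Near the mode the 5th neighbour is very close; far in the tail it is not, so m_mode should be much smaller than m_tail.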
The Support Neighbour Machine [19] solves the following optimization
problem:

    max_{ρ,ξ}  νnρ − Σ_{i=1}^{n} ξi
    s.t.       g(xi) ≥ ρ − ξi ,
               ξi ≥ 0 ,  i = 1, . . . , n ,                              (3)

where g(x) = M(x, sn) is a sparsity measure, ν ∈ [0, 1] is a predefined constant,
ξi, i = 1, . . . , n, are slack variables and ρ is a threshold variable to be optimized.

Theorem 1. The set Rn = {x : hn (x) = sign(ρ∗n − gn (x)) ≥ 0} converges to


a region of the form Sα (f ) = {x|f (x) ≥ α}, such that P (Sα (f )) = ν. There-
fore, the Support Neighbour Machine estimates a density contour cluster Sα (f )
(around the mode).
Hence, we take Âi (P) = Ŝαi (fP ) − Ŝαi+1 (fP ), where Ŝαi (fP ) is estimated by
Rn defined above.
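A crude stand-in for the resulting decision region, evaluated on the sample itself, thresholds the sparsity measure at its empirical (1 − ν)-quantile. This is a sketch under our own simplifications, not the optimization in (3):

```python
import numpy as np

def estimate_level_set(sample, nu=0.2, k=10):
    """Keep the sample points whose k-th-NN distance (a sparsity
    measure) lies below its (1 - nu)-quantile: an estimate of a density
    contour cluster holding roughly a fraction 1 - nu of the sample."""
    d = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    d.sort(axis=1)
    g = d[:, k]                       # column 0 is the zero self-distance
    rho = np.quantile(g, 1.0 - nu)
    return sample[g <= rho]

rng = np.random.default_rng(2)
s = rng.normal(0.0, 1.0, size=(500, 2))
core = estimate_level_set(s, nu=0.2)
```

On Gaussian data the retained points are the more central ones, as a density contour cluster around the mode should be.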

3.2 Estimation of the Symmetric Difference between Sets

We should not estimate the region A △ B by Â △ B̂ = (Â − B̂) ∪ (B̂ − Â), i.e. by
#(Â ∪ B̂) − #(Â ∩ B̂), given that probably there will be no points in common
between Â and B̂ (which implies Â △ B̂ = Â ∪ B̂). Given that Â is a set of points
estimating the spatial region A, we will estimate the region A by a covering of
the type Ā = ∪_{i=1}^{n} B(xi, r), where the B(xi, r) are closed balls with centres at xi ∈ Â
and radius r [8]. The radius is chosen to be constant because we can assume the
density approximately constant inside the region Âi(P) if the partition {αi} of the
interval (0, 1) is fine enough. The problem of calculating A △ B therefore reduces to
counting the points in B̂ not belonging to the covering estimate of A, plus the
points in Â not belonging to the covering estimate of B. This will be denoted by
A △_S B. Figure 1 illustrates the previous discussion. Notice that △_S implicitly
gives rise to kernels for sets; for example, K(A, B) = 1 − #(A △_S B)/#(A ∪ B),
which allows distances for distributions to be used in the context of kernel methods.
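The covering-based estimate #(A △_S B) amounts to counting the points of each sample that fall outside the union of radius-r balls centred at the other sample's points. A sketch (radius and data are our choices):

```python
import numpy as np

def sym_diff_count(A, B, r=0.5):
    """#(A triangle_S B): points of A outside the ball covering of B,
    plus points of B outside the ball covering of A."""
    def n_outside(P, Q):       # points of P not covered by balls B(q, r)
        d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
        return int(np.sum(d.min(axis=1) > r))
    return n_outside(A, B) + n_outside(B, A)

A = np.zeros((3, 2))           # three coincident points at the origin
B = np.full((3, 2), 10.0)      # three points far away
near = A + 0.1                 # points within r of A
```

From this count, a set kernel of the form K(A, B) = 1 − #(A △_S B)/#(A ∪ B) follows directly.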

Fig. 1. Set estimate of the symmetric difference. (a) Data sets  and B̂. (b)  − B̂.
(c) B̂ − Â.

4 Experimental Work
Since the proposed distance is intrinsically nonparametric, there are no 'simple'
parameters (like mean and variance) on which we can concentrate our attention
to do exhaustive benchmarking. The strategy is to compare the proposed
distance to classical PM distances for some well-known (and parametrized)
distributions, to get a first impression of its performance. Here we consider the
Kullback-Leibler (KL) divergence, the Kolmogorov-Smirnov (KS) distance and the
t-test (T) measure (Hotelling test in the multivariate case).
We begin by testing our Level Distance (LD) on the case most favourable to
the classical PM metrics: normal distributions. Consider the mixtures of normally
distributed populations
α N(μ = −d^{−1/2} 1_d; Σ = 0.75 I) + (1 − α) N(μ = d^{−1/2} 1_d; Σ = I) and
(1 − α) N(μ = −d^{−1/2} 1_d; Σ = I) + α N(μ = d^{−1/2} 1_d; Σ = 0.75 I),
where 1_d denotes the d-dimensional vector of ones, α = 0.6 and d is the dimension
considered. We look for the minimum sample size n for which each PM metric is
able to discriminate between the two samples. In all cases we choose a type I
error of .05 and a type II error of .1. Table 1 reports the results; we can see
that the Level Set Distance (LD) measure is the most efficient (in terms of
sample size) in all the dimensions considered.
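For reference, the two test populations can be sampled as below. This is a sketch: the generator, the sample sizes and the function name are our choices, and 1_d is implemented as a scaled vector of ones.

```python
import numpy as np

def mixture_sample(n, d, alpha=0.6, flip=False, rng=None):
    """Draw n points from alpha N(-mu, 0.75 I) + (1-alpha) N(+mu, I),
    with mu = d**-0.5 * ones(d); the two means swap when flip=True."""
    rng = rng or np.random.default_rng(0)
    mu = np.full(d, d ** -0.5)
    out = np.empty((n, d))
    for i in range(n):
        first = rng.random() < alpha            # the alpha-weighted component
        sign = 1.0 if (first == flip) else -1.0  # alpha-component sits at -mu
        scale = np.sqrt(0.75) if first else 1.0  # Sigma = 0.75 I vs I
        out[i] = sign * mu + scale * rng.standard_normal(d)
    return out

P = mixture_sample(4000, 2)
Q = mixture_sample(4000, 2, flip=True)
```

The two populations have the same component shapes but mirrored means, so the sample means of P and Q fall on opposite sides of zero.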

Table 1. Minimum sample size for a 5% type I and 10% type II errors

Metric d: 1 2 3 4 5 10 20 50 100
KL 1300 1700 1800 1900 2000 2700 > 5000 > 5000 > 5000
T 750 800 900 1000 1100 1400 1500 2100 2800
LD 200 380 650 750 880 1350 1400 1800 2200

Fig. 2. MDS plot for texture groups. A representer for each class is plotted in the map.

Fig. 3. Dendrogram with shaded image texture groups

To conclude, we show an application of the LD measure to evaluate dis-
tances between data sets. To this aim, we consider 9 data sets from the Kylberg
texture data set [14]: 'blanket', 'canvas', 'seat', 'oatmeal', 'rice', 'lentils', 'lin-
seeds', 'stone1', 'stone2', belonging to 3 main types. There are 160 × 11 = 1760
texture images with a resolution of 576 × 576 pixels. We represent each image
using the 32 parameters of the wavelet coefficient histogram proposed in [16].
Next we calculate the (between-sets) distance matrix with the LD measure
and obtain (by multidimensional scaling, MDS) the representation shown in Figure
2, which amounts to a sort of MDS for data sets. It is apparent that the textures
get organized in a way very coherent with human criteria, which seems to indicate
that the proposed Level Distance is appropriate for real pattern recognition
problems (high dimensional, with a small number of data samples).

5 Future Work

In the near future we will address the study of the geometry induced by the pro-
posed measure and its asymptotic properties. An exhaustive evaluation on a variety
of data sets following different distributions is needed. We are also working on
a variation of the LD distance that satisfies the Euclidean metric conditions.

Acknowledgments. This work was partially supported by projects DGUCM


2008/00058/002 and MEC 2007/04438/001 (Spain).

References
1. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differ-
ential Geometry in Statistical Inference. Lecture Notes-Monograph Series, vol. 10
(1987)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathemat-
ical Society (2007)
3. Atkinson, C., Mitchell, A.F.S.: Rao’s Distance Measure. The Indian Journal of
Statistics, Series A 43, 345–365 (1981)
4. Müller, A.: Integral Probability Metrics and Their Generating Classes of Functions.
Advances in Applied Probability 29(2), 429–443 (1997)
5. Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman Diver-
gences. Journal of Machine Learning Research 6, 1705–1749 (2005)
6. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence mea-
sures in probability spaces: A unified approach. Journal of Multivariate Analysis 12,
575–596 (1982)
7. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A Comparison of String Distance
Metrics for Name-matching Tasks. In: Proceedings of IJCAI 2003, pp. 73–78 (2003)
8. Devroye, L., Wise, G.L.: Detection of abnormal behavior via nonparametric esti-
mation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)
9. Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance
matrices, with applications to diffusion tensor imaging. The Annals of Applied
Statistics 3, 1102–1123 (2009)
10. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover's Distance as a Metric
for Image Retrieval. International Journal of Computer Vision 40, 99–121 (2000)
11. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method
for the two-sample problem. In: Advances in Neural Information Processing Sys-
tems, pp. 513–520 (2007)
12. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning, 2nd
edn. Springer (2009)
13. Hayashi, A., Mizuhara, Y., Suematsu, N.: Embedding Time Series Data for Clas-
sification. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587,
pp. 356–365. Springer, Heidelberg (2005)
14. Kylberg, G.: The Kylberg Texture Dataset v. 1.0. Centre for Image Analysis,
Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Swe-
den, http://www.cb.uu.se/gustaf/texture/
15. Lebanon, G.: Metric Learning for Text Documents. IEEE Trans. on Pattern Anal-
ysis and Machine Intelligence 28(4), 497–508 (2006)

16. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Rep-
resentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7),
674–693 (1989)
17. Marriott, P., Salmon, M.: Applications of Differential Geometry to Econometrics.
Cambridge University Press (2000)
18. Moon, Y.I., Rajagopalan, B., Lall, U.: Estimation of mutual information using
kernel density estimators. Physical Review E 52(3), 2318–2321 (1995)
19. Muñoz, A., Moguerza, J.M.: Estimation of High-Density Regions using One-
Class Neighbor Machines. IEEE Trans. on Pattern Analysis and Machine Intel-
ligence 28(3), 476–480 (2006)
20. Ramsay, J.O., Silverman, B.W.: Applied Functional Data Analysis. Springer, New
York (2005)
21. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Scholkopf, B., Lanckriet, G.R.G.:
Non-parametric estimation of integral probability metrics. In: International Sym-
posium on Information Theory (2010)
22. Strichartz, R.S.: A Guide to Distribution Theory and Fourier Transforms. World
Scientific (1994)
23. Székely, G.J., Rizzo, M.L.: Testing for Equal Distributions in High Dimension.
InterStat (2004)
24. Ullah, A.: Entropy, divergence and distance measures with econometric applica-
tions. Journal of Statistical Planning and Inference 49, 137–162 (1996)
25. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning, with Ap-
plication to Clustering with Side-information. In: Advances in Neural Information
Processing Systems, pp. 505–512 (2002)
26. Zolotarev, V.M.: Probability metrics. Teor. Veroyatnost. i Primenen 28(2), 264–287
(1983)
Low Complexity Proto-Value Function Learning
from Sensory Observations with Incremental
Slow Feature Analysis

Matthew Luciw and Juergen Schmidhuber

IDSIA-USI-SUPSI,
Galleria 2, 6928, Manno-Lugano, Switzerland

Abstract. We show that Incremental Slow Feature Analysis (IncSFA)


provides a low complexity method for learning Proto-Value Functions
(PVFs). It has been shown that a small number of PVFs provide a good
basis set for linear approximation of value functions in reinforcement
environments. Our method learns PVFs from a high-dimensional sen-
sory input stream, as the agent explores its world, without building a
transition model, adjacency matrix, or covariance matrix. A temporal-
difference based reinforcement learner improves a value function approx-
imation upon the features, and the agent uses the value function to
achieve rewards successfully. The algorithm is local in space and time,
furthering the biological plausibility and applicability of PVFs.

Keywords: Proto-Value Functions, Incremental Slow Feature Analysis,


Biologically Inspired Reinforcement Learning.

1 Introduction
A reinforcement learning [21] agent, which experiences the world through its con-
tinuous and high-dimensional sensory input stream, is exploring an unknown
environment. It would like to be able to predict future rewards, i.e., learn a value
function (VF), but, due to its complicated sensory input, VF learning must be
preceded by learning a simplified perceptual representation.
There has been a plethora of work on learning representations for RL, specifically
for Markov Decision Processes (MDPs); we can outline four types. 1. Top-
Down Methods. Here, the representation/basis function parameter adaptation
is guided by the VF approximation error only [13,16]. 2. Spatial Unsupervised
Learning (UL). An unsupervised learner adapts to improve its own objective,
which treats each sample independently, e.g., minimize per-sample reconstruc-
tion error. The UL feeds into a reinforcement learner. UL methods used have in-
cluded nearest-neighbor type approximators [17] or autoencoder neural nets [11].
3. Hybrid Systems. Phases of spatial UL and top-down VF-based feedback are
interleaved [5,11]. 4. Spatiotemporal UL. Differs from the spatial UL type by
using a UL objective that takes into account how the samples change through
time. Such methods include the framework of Proto-Reinforcement Learning
(PRL) [15], and Slow Feature Analysis (SFA) [22,12].

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 279–287, 2012.

c Springer-Verlag Berlin Heidelberg 2012

There are some potential drawbacks to types 1,2 and 3. The top-down tech-
niques bias their representation for the reward function. They also require the
reward information for any representation learning to take place. In the spa-
tial UL techniques, the encoding need not capture the information important
for reward prediction, namely the underlying Markov process dynamics. The spa-
tiotemporal UL methods do not have these drawbacks: they capture the state-transition
dynamics, the representation is not biased by any particular reward function,
and they can learn when reward information is not available.
In PRL, the features are called Proto-Value Functions (PVFs); theoretical
analysis shows just a few PVFs can capture the global characteristics of some
Markovian processes [3,4] and that just a few PVFs can be used as building
blocks to approximate value functions with low error. Sprekeler recently showed
how SFA can be considered a function approximation to learning PVFs [20], so
slow features (SFs) can have the same set of beneficial properties for representa-
tion learning for general RL. Kompella, Luciw and Schmidhuber recently devel-
oped an incremental method for updating a set of slow features (IncSFA; [10,9]),
with linear computational and space complexities.
The new algorithm in this paper is the combination of IncSFA and RL — here
we use a method based on temporal-differences (TD) for its local nature, but
other methods like LSTD [1] are possible — for incrementally learning a good
set of RL basis functions for value functions, as well as the value function itself.
The importance is twofold. First, the method gives a way to approximately
learn PVFs directly from sensory data. It does not need to build a transition
model, adjacency matrix, or covariance matrix, and in fact never needs to
know what state it is in. Second, it has linear complexity in the number of
input dimensions. The other methods that derive such features — batch SFA
and graphical embedding (Laplacian EigenMap) — have cubic complexity and
don’t scale up well to a large input dimension. Therefore our method is suited
to autonomous learning on sensory input streams (e.g., vision), which the other
methods are not suited for due to their computational and space complexities.

2 Slow Features as Proto-Value Functions

Due to space limits, we just skim over the background. See elsewhere
[21,15,3,22,20] for further details.
Value Function Approximation for MDPs. An MDP is a five-tuple
(S, A, P, R, γ), where S is a set of states, A is a set of actions, P^a_{s,s'} is the
probability of transition from state s to s' when taking action a, R^a_s is the
expected immediate reward when taking action a in state s, and 0 < γ ≤ 1 is
the discount factor. RL often involves learning a value function on S. Values
are future expected cumulative discounted rewards. A complication: in our
case, the agent does not know s. Instead it gets an observation vector x ∈ IR^I.
The dimension I is large, so the agent relies on its sensory mapping Φ to map x to y ∈ IR^J,

where J ≪ I. Then, values are approximated as a linear combination of these
J mappings:

    V(x; φ, θ) = Σ_{j=1}^{J} φj(x) θj .                                  (1)

We’re not so concerned with learning θ since there are good methods to do
this given suitable basis functions. Here we are interested in learning φ. We’d
like a compact set of mappings, which can deliver a reasonable approximation of
different possible value functions, and which could be learned in an unsupervised
way, i.e., without requiring the reward information. PVFs suit all these criteria,
but require that the state is known, so they do not fit Eq. 1.
Proto-Value Functions. PVFs capture global dynamic characteristics of the
MDP in a low-dimensional space. The objective is to find a Φ that preserves
similarity relationships between each pair of states s_t and s_{t'} with a small set
of basis functions, formally to minimize Ψ(φj) = Σ_{t,t'} A_{t,t'} (φj(s_t) − φj(s_{t'}))²
with unit norm and orthogonality constraints. In general A_{t,t'} is a matrix of
similarities, for MDPs typically a binary adjacency matrix, where a one means
the states are connected (i.e., transition probability higher than some threshold).
The objective penalizes differences between mapped outputs of adjacent states:
e.g., if states s_t and s_{t'} are connected, a good φj will have (y_t − y_{t'})² small.
Laplacian EigenMap (LEM) Procedure. Φ can be solved for through an
eigenvalue problem [2,19]. The eigenvectors of the combinatorial Laplacian L,

    L_{i,j} =  degree(si)   if i = j ,
               −1           if i ≠ j and si is adjacent to sj ,          (2)
               0            otherwise ,

are used, ordered from smallest to largest eigenvalue, for Φ.
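For a small MDP with known adjacency, the PVFs are just the low-order eigenvectors of L. A sketch for a 5-state chain (function name and example ours):

```python
import numpy as np

def proto_value_functions(adjacency, k):
    """Return the k eigenvectors of the combinatorial Laplacian
    L = D - A with the smallest eigenvalues (cf. Eq. (2))."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)       # eigh sorts eigenvalues ascending
    return vecs[:, :k]

# 5-state chain: state i is adjacent to i-1 and i+1
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
pvf = proto_value_functions(A, 3)
```

The first PVF is constant (eigenvalue 0) and the second varies monotonically along the chain, which is what makes a handful of them useful building blocks for smooth value functions.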


Slow Features. SFA’s objective is to find a few instantaneous functions gj on
the input that generate orthogonal output signals varying as slowly as possible
(but not constant) [22]. The slow features do not require the true state to be
known. The SFA objective can be solved via eigendecomposition of the covari-
ance matrix Ċ of the temporal differences of the (whitened) observations. The
slow features are low-order eigenvectors of Ċ. The batch-SFA (BSFA) technique
involves constructing such a covariance matrix from a batch of data.
Equivalence of PVFs and SFs. Sprekeler showed that the two objectives,
of SFA (slowness) and LEM (nearness), are mathematically equivalent under
some reasonable assumptions [20] (which are fulfilled when we collect data from
a random walk on an MDP). For intuition, note that saying two observations
have a high probability of being successive in time is another way of saying that
the two underlying states have a high probability of being connected. In the
LEM formulation, the neighbor relationships are explicit (through the adjacency
matrix), but in SFA’s they are implicit (through temporal succession).
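The batch solution can be sketched in a few lines: whiten the signal, then take the minor eigenvectors of the covariance of its temporal differences. The two-channel toy mixture below is our own example, not from the paper:

```python
import numpy as np

def batch_sfa(X, j):
    """Batch SFA sketch: whiten X, then the j slowest features are the
    minor eigenvectors of the covariance of the whitened differences."""
    Xc = X - X.mean(axis=0)
    d, U = np.linalg.eigh(np.cov(Xc.T))
    W_white = U / np.sqrt(d)          # column j scaled by 1/sqrt(d_j)
    Z = Xc @ W_white                  # whitened signal
    _, V = np.linalg.eigh(np.cov(np.diff(Z, axis=0).T))
    return Z @ V[:, :j]               # slowest output signals first

t = np.linspace(0.0, 4.0 * np.pi, 800)
slow, fast = np.sin(t), np.sin(11.0 * t)
X = np.column_stack([slow + 0.5 * fast, fast - 0.3 * slow])
Y = batch_sfa(X, 1)
```

On this linear mixture of a slow and a fast sinusoid, the first extracted signal should recover the slow source up to sign and scale.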

Algorithm 1. IncSFA-TD(J, η, γ, α, T)
// Autonomously learn J slow features and VF approximation coefficients
// from whitened samples x ∈ IR^I
1   {W, vβ, θ} ← Initialize()
    // W: matrix of slow feature column vectors w1, . . . , wJ
    // vβ: first principal component in the difference space, with
    //     magnitude equal to its eigenvalue
    // θ: coefficients for the VF
2   for t ← 1 to ∞ do
3       (xprev ← xcurr)                       // after t = 1
4       xcurr ← GetWhitenedObsv()
5       r ← ObserveReward()
6       if t > 1 then
7           ẋ ← (xcurr − xprev)
8           vβ ← CCIPCA-Update(vβ, ẋ)        // for the sequential-addition parameter
9           β ← ‖vβ‖
            // slow-features update
10          l1 ← 0
11          for i ← 1 to J do
12              wi ← (1 − η)wi − η [(ẋ · wi)ẋ + li]
13              wi ← wi / ‖wi‖
14              li+1 ← β Σ_{j=1}^{i} (wj · wi)wj
15          end
16          (yprev ← ycurr)                   // after t = 2
17          ycurr ← x_curr^T W
18          if t > T then
19              δ ← r + (γ ycurr − yprev) · θ     // TD error
20              θ ← θ + α δ yprev                 // TD update
21          end
22      end
23      a ← SelectAction()
24  end

The main reason the slow features (SFs) are approximations of the PVFs
depends on the relation of observations to states. If the state is not extractable
from each single observation, the problem becomes partially-observable (and
out of scope here). Even if the observation has the state information embedded
within, there may not be a linear mapping. Expanded function spaces [22] and
hierarchical networks [8] are typically used with SFA to deal with such cases,
and they can be used with IncSFA as well [14].

3 PVF Learning with IncSFA for VFs


Incremental Slow Feature Analysis (IncSFA) updates slow features incrementally
and covariance-free, eventually converging to the same features as BSFA. It is detailed
IncSFA: Low Complexity Proto-Value Function Learning 283

elsewhere [10,9]. We want to use it to develop φ in Eq. 1, but we also need something
to learn θ. Since a motivation behind this work is to move towards biologically plausible,
practical RL methods, we use TD learning, a simple local method for adapting the
value-function coefficients. The resulting algorithm, IncSFA-TD (see
Alg. 1), is biologically plausible to the extent that it is local in space and time [18],
and its updating equation (Line 12) has an anti-Hebbian form [6]. The input
parameters are: J, the number of features to learn; η, the IncSFA learning rate; γ, the
discount factor; α, the TD learning rate; and T, the time to start adapting the
VF coefficients. For simplicity, the algorithm requires the observations to be drawn
from a whitened distribution. Note that the original IncSFA also provides a method for
doing this whitening incrementally.
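The value-function update in lines 19-20 of Algorithm 1 is plain linear TD(0). A minimal sketch (names ours), checked on a two-state toy chain whose true values can be computed by hand:

```python
import numpy as np

def td_update(theta, y_prev, y_curr, r, gamma, alpha):
    """One linear TD(0) step on the VF coefficients theta, mirroring
    lines 19-20 of Algorithm 1."""
    delta = r + (gamma * y_curr - y_prev) @ theta   # TD error
    return theta + alpha * delta * y_prev           # TD update

# Toy check: deterministic cycle s0 -> s1 (r = 0), s1 -> s0 (r = 1),
# gamma = 0.5, one-hot features. Solving the Bellman equations by hand
# gives V(s0) = 2/3 and V(s1) = 4/3.
gamma, alpha = 0.5, 0.05
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta = np.zeros(2)
for _ in range(5000):
    theta = td_update(theta, e0, e1, 0.0, gamma, alpha)
    theta = td_update(theta, e1, e0, 1.0, gamma, alpha)
```

In IncSFA-TD the features y are the slow-feature outputs rather than one-hot state indicators; the update itself is unchanged.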
On Complexity. The following table compares the time and space complexi-
ties of three methods that will give approximately the same features — LEM
(Laplacian EigenMap), BSFA (Batch SFA), and IncSFA — in terms of number
of samples n and input dimension I.

         Computational Complexity    Space Complexity
LEM      O(n³)                       O(n²)
BSFA     O(I³)                       O(n + I²)
IncSFA   O(I)                        O(I)

The computational burden on BSFA and LEM is the one time cost of ma-
trix eigendecomposition, which has cubic complexity [7]. SFA uses covariance
matrices of sensory input, which scale with input dimension I. However LEM’s
graph Laplacian scales with the number of data points n. So the computational
complexity of batch SFA can be quite a bit less than LEM, especially for agents
that collect many samples (since typically I ≪ n). IncSFA has linear updating
complexity, since it avoids batch-based eigendecomposition altogether. However,
as an incremental method, it makes less efficient use of each individual data point.
The space burden of BSFA and LEM comes from collecting the data and building the
matrices, which IncSFA avoids.

4 Experiment

Environment. We use a vision-based Markovian environment (Fig. 1), introduced
by Lange and Riedmiller, 2010 [11]. The observation at any time is a
30 × 30 (slightly noisy) image, which shows a top-down view of the agent's po-
sition in a room. The agent is the 5 × 5 white square in the image. It can move
up, down, left, or right, each by 5 pixels. The white “L” shapes are walls. The envi-
ronment borders at the image edges are also impassable.
Reward. During an initial phase, there are no rewards. During a second phase, one
point in the image will be associated with positive reward; all other places have zero
reward. The agent has to learn features during the first phase (without any rewards)
and approximate (and use) the value function in the second.

[Figure 1 panels: IncSFA and LEM feature-response surfaces and trajectory embeddings (UL, R); a “Same Features, Different Reward Functions” panel; and reward-per-step curves for the random-walk (explore) and exploit phases.]
Fig. 1. Upper left: sample observation (30 × 30 image) from the environment. The
IncSFA features developed from the exploration sequence directly from these observa-
tions are approximately the same as LEM features learned from eigendecomposition
of the graph Laplacian of the true transition model. Upper center/right: embedding
of a trajectory through the environment for both LEM features and IncSFA features
(UL refers to the upper left corner of the room and R refers to the small inner room).
Lower left: feature responses upon a grid of different images, where the agent is at
different possible positions, for each of LEM and IncSFA (best viewed in color). Lower
right: the agent goes to exploitation mode (maximize reward) using its value function
learned upon the incrementally developed slow features. The performance shoots up
to a nearly optimal level, for two different reward positions.

Setup. The agent explores the environment via random walk for 40,000 steps.
After each step, the slow features are updated, with learning rate η = 0.0002.
At t = 40,000, the reward appears, and the agent continues its random walk for
3,000 more steps, while learning the value function coefficients, with learning
rate α = 0.0001. After t = 43,000, the agent enters exploitation mode, where
it picks the action that will take it to the most valuable possible next state
(using its current VF approximation), but with a 5% random action chance. To
avoid the agent staying at the reward in exploitation mode, when the reward
is reached, the agent teleports away. The features and coefficients continue to
adapt. To show some generality, the reward will be placed in two different places
(in two different instances): inside the room or at the bottom center of the
image.
Results. The results are shown in Fig. 1. First, we want to show that the features
learned incrementally from sensory data actually deliver a reasonable LEM embedding.
Visually compare the features of IncSFA, developed online and incrementally
on the high-dimensional noisy images, to the eigenvectors of the graph Laplacian,
computed from the actual underlying transition model. Also note the similarity of the
graphical embeddings of a single trajectory through the entire room upon the
first three (non-constant) features for each of LEM and IncSFA. After going
into exploitation mode, the agent quickly reaches a near-optimal level of reward
accumulation, for both reward functions. The features did not change signifi-
cantly in the roughly 3,000 exploitative steps.
Discussion. We can discuss some other methods that might apply to this set-
ting. As mentioned, this environment was first developed elsewhere [11]. In that
work, deep autoencoder neural nets are trained to compress the observations,
and the bottleneck layer (with the fewest number of neurons) output becomes the
state representation for the agent. Neural-fitted reinforcement learning (NFQ)
learns the Q-function upon this state representation, and the NFQ net error (the
TD-error) backpropagates throughout the autoencoder, causing the state repre-
sentation to conform to a map-like embedding. This effect only emerges when
the Q-error is backpropagated; otherwise the autoencoder representation does
not resemble a map. In our case, the slow features learn a map representation
in the unsupervised phase and therefore do not need the reward information to
learn such a representation.
Another type of method that would apply is a nearest-neighbor prototype
state quantization, where new prototypes/states are added when the distance of
an observation from all existing prototypes exceeds some threshold. This type
of method provides distinct states for RL but does not provide an embedding.
Additionally, this method can lead to a large number of states, which increases
the search space for the RL.
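Such a quantizer can be sketched in a few lines (the function name and threshold semantics are our illustration):

```python
import numpy as np

def quantize(obs, prototypes, threshold):
    """Nearest-neighbor prototype quantization sketch: return the index of
    the nearest stored prototype, adding obs as a new prototype when every
    existing prototype is farther away than threshold."""
    obs = np.asarray(obs, dtype=float)
    if prototypes:
        dists = [np.linalg.norm(obs - p) for p in prototypes]
        i = int(np.argmin(dists))
        if dists[i] <= threshold:
            return i
    prototypes.append(obs)        # open a new discrete state
    return len(prototypes) - 1
```

Note how the prototype count (and hence the RL state space) grows with the diversity of the observations, which is the drawback mentioned above.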
One might want to try an incremental Principal Component Analysis (PCA),
which like SFA will also give a compressed code in a few features, but captures
directions of highest variance (a spatial encoding). SFA uses the temporal in-
formation to learn spatial features, i.e., it casts the data into a low-dimensional
space where similarity information is preserved. A low-dimensional map is quite
useful for planning and control, but PCA’s encoding does not necessarily have
these properties (it will be good for reconstructing the input).

5 Conclusions
A real-world reinforcement learning agent does not get clean states, but messy
observations. Learning to represent its perceptions in a way that aids its future
reward-prediction capabilities is just as important as, if not more important than,
its method for learning a value function. For biological plausibility, the methods
for learning representations and learning values need to be incremental and local
in space and time. IncSFA and TD fulfill these criteria. We hope this method
and the background we provided here influence autonomous real-world rein-
forcement learners.

Acknowledgments. We thank the anonymous reviewers and Sohrob Kazerounian
for their useful comments. This work was funded by Swiss National Science
Foundation grant CRSIKO-122697 (Sinergia project), and through the 7th
framework program of the EU under grant #270247 (NeuralDynamics project).

References
1. Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference
learning. Machine Learning 22(1), 33–57 (1996)
2. Chung, F.R.K.: Spectral graph theory. AMS Press, Providence (1997)
3. Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker,
S.W.: Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps. Proceedings of the National Academy of Sciences of the
United States of America 102(21), 7426 (2005)
4. Coifman, R.R., Maggioni, M.: Diffusion wavelets. Applied and Computational Har-
monic Analysis 21(1), 53–94 (2006)
5. da Motta Salles Barreto, A., Anderson, C.W.: Restricted gradient-descent algo-
rithm for value-function approximation in reinforcement learning. Artificial Intel-
ligence 172(4-5), 454–482 (2008)
6. Dayan, P., Abbott, L.F.: Theoretical neuroscience: Computational and mathemat-
ical modeling of neural systems (2001)
7. Forsythe, G.E., Henrici, P.: The cyclic Jacobi method for computing the princi-
pal values of a complex matrix. Applied Mathematics and Statistics Laboratories,
Stanford University (1958)
8. Franzius, M., Sprekeler, H., Wiskott, L.: Slowness and sparseness lead to place,
head-direction, and spatial-view cells. PLoS Computational Biology 3(8), e166
(2007)
9. Kompella, V.R., Luciw, M.D., Schmidhuber, J.: Incremental slow feature analy-
sis: Adaptive low-complexity slow feature updating from high-dimensional input
streams. Neural Computation (accepted and to appear, 2012)
10. Kompella, V.R., Luciw, M., Schmidhuber, J.: Incremental slow feature analysis.
In: International Joint Conference on Artificial Intelligence (2011)
11. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement
learning. In: International Joint Conference on Neural Networks, Barcelona, Spain
(2010)
12. Legenstein, R., Wilbert, N., Wiskott, L.: Reinforcement learning on slow features
of high-dimensional input streams. PLoS Computational Biology 6(8) (2010)
13. Lin, L.J.: Reinforcement learning for robots using neural networks. School of Com-
puter Science, Carnegie Mellon University (1993)
14. Luciw, M., Kompella, V.R., Schmidhuber, J.: Hierarchical incremental slow feature
analysis. In: Workshop on Deep Hierarchies in Vision (2012)
15. Mahadevan, S.: Proto-value functions: Developmental reinforcement learning. In:
Proceedings of the 22nd International Conference on Machine Learning, pp. 553–
560. ACM (2005)
16. Menache, I., Mannor, S., Shimkin, N.: Basis function adaptation in temporal dif-
ference reinforcement learning. Annals of Operations Research 134(1), 215–238
(2005)
17. Santamaria, J.C., Sutton, R.S., Ram, A.: Experiments with reinforcement learning
in problems with continuous state and action spaces. Adaptive Behavior 6(2), 163
(1997)
18. Schmidhuber, J.: A local learning algorithm for dynamic feedforward and recurrent
networks. Connection Science 1(4), 403–412 (1989)

19. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
20. Sprekeler, H.: On the relation of slow feature analysis and Laplacian eigenmaps.
Neural Computation, 1–16 (2011)
21. Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction, vol. 1. Cam-
bridge Univ. Press (1998)
22. Wiskott, L., Sejnowski, T.: Slow feature analysis: Unsupervised learning of invari-
ances. Neural Computation 14(4), 715–770 (2002)
Improving Neural Networks Classification
through Chaining

Khobaib Zaamout and John Z. Zhang

Department of Mathematics and Computer Science


University of Lethbridge, Lethbridge, AB Canada T1K 3M4
{zaamout,zhang}@cs.uleth.ca

Abstract. We present a new ensemble technique, namely chaining neural net-
works, as part of our efforts to improve neural classification. We show that using
the predictions of a neural network as input to another neural network trained on
the same dataset can improve classification. We propose two variations of this
approach, single-link and multi-link chaining. Both variations include the predictions
of trained neural networks in the construction and training of a new network and then
store them for later predictions. In this initial work, the effectiveness of our proposed
approach is demonstrated through a series of experiments on real and synthetic
datasets.

Keywords: Neural networks, classification, ensemble, chaining.

1 Introduction and Previous Work

Due to the importance of classification, many techniques have been developed over the
past decades to automate it. Artificial neural networks are amongst the best techniques
used for classification [1]. They are a simplification of the human central nervous
system and have been the subject of study for decades [2]. In a nutshell,
neural networks are well suited to learning complex relations between inputs and classes.
They consist of multilayered groups of interconnected components that connect to each
other via weighted links. The training process alters these weights based
on the produced error such that the new weights minimize the classification error [1].
Although neural networks are very effective, there remains interest in enhancing
their effectiveness by increasing learning and convergence speed, estimating optimal
structures, increasing classification accuracy, etc. Many techniques have been proposed
to these ends. For instance, Yu et al. proposed a dynamic learning rate for increasing the
speed of learning, while cascade correlation [3], the upstart algorithm [4], and optimal
brain damage [5] were aimed at estimating network structures. In addition, the modular
approach [6], ensembles of neural networks [7], and the bagging and boosting tech-
niques [8] worked toward increasing classification accuracy. In essence, the modular
approach is concerned mainly with breaking a problem down into smaller subprob-
lems, to which separate networks are applied. The ensemble technique combines

This work is supported by NSERC, Canada.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 288–295, 2012.

© Springer-Verlag Berlin Heidelberg 2012
Improving Neural Networks Classification through Chaining 289

the outputs of structurally or logically different networks trained to predict the same tar-
get, e.g., by voting or averaging. Since different neural networks
make errors on different areas of the input space, a good combining technique will yield
an ensemble that is more accurate and error tolerant than a typical network [7,8].

2 Chaining Neural Networks Ensemble Approach


Our approach is named the chaining neural networks ensemble, or neural chaining for
short. It attempts to improve a neural network's prediction by including the predictions
of previously trained network(s) in the training process of new network(s), forming a
chain-like ensemble. This approach has two variations: single-link chaining (SLC) and
multi-link chaining (MLC). SLC, as shown in Fig. 1 (a), trains a neural network on a
dataset and then uses its predictions as input to another network along with the inputs
from the given dataset. The chaining process continues, forming a chain of networks,
i.e., chain links, until an acceptable error is achieved. Each network in the “chain” is
trained on the original dataset and on the predictions of the network that immediately
precedes it. This approach increases the number of attributes in the original dataset
by only one, keeping the computational cost of creating a new network feasible. MLC, as
shown in Fig. 1 (b), is similar to SLC. However, it differs in that each
neural network in MLC is trained on the original dataset and on the predictions of all
networks that precede it.
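To illustrate the single-link structure, here is a minimal sketch in which regularized linear least-squares models stand in for the member neural networks (the class and all names are our own illustrative choices, not the paper's Weka-based implementation):

```python
import numpy as np

class SingleLinkChain:
    """Single-link chaining (SLC) sketch with ridge-regression stand-ins for
    the member networks: stage k is trained on the original attributes plus
    the prediction of stage k-1."""
    def __init__(self, n_links, lam=1e-3):
        self.n_links, self.lam, self.weights = n_links, lam, []

    def _fit_one(self, X, y):
        Xb = np.hstack([X, np.ones((len(X), 1))])        # bias column
        return np.linalg.solve(Xb.T @ Xb + self.lam * np.eye(Xb.shape[1]),
                               Xb.T @ y)

    def _predict_one(self, W, X):
        return np.hstack([X, np.ones((len(X), 1))]) @ W

    def fit(self, X, y):
        Z = X
        for _ in range(self.n_links):
            W = self._fit_one(Z, y)
            self.weights.append(W)
            p = self._predict_one(W, Z)
            Z = np.hstack([X, p.reshape(-1, 1)])         # original + last prediction
        return self

    def predict(self, X):
        Z = X
        for W in self.weights:
            p = self._predict_one(W, Z)
            Z = np.hstack([X, p.reshape(-1, 1)])
        return p
```

Each stage sees the original attributes plus only the previous stage's prediction, so the input dimensionality grows by one per link, as in SLC; the multi-link variant would instead keep appending every preceding stage's predictions.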

[Figure 1: schematic of the two chain topologies; in (a) each network NNk receives the inputs plus the prediction of NNk−1, while in (b) each NNk receives the inputs plus the predictions of all preceding networks NN1, ..., NNk−1.]

Fig. 1. (a) Single-Link Chaining. (b) Multi-Link Chaining.

In general, the intuition behind SLC and MLC is that a neural network’s predictions
can be used to correct the predictions of upcoming networks since these predictions
resulted from attributes that are indicative of the target classes. Therefore, the predic-
tions are highly correlated with the target classes. Using these predictions is expected
to further improve the predictability of the data.
In particular, the intuition behind SLC is that a network may not need the predictions
of all preceding networks in order to correct its classification. A network trained on the
predictions of a previous network produces predictions influenced by that knowledge.
Therefore, it seems reasonable that the predictions of the new network should replace
that of the previous network in the dataset. On the other hand, the intuition behind MLC
290 K. Zaamout and J.Z. Zhang

Table 1. Summary of datasets

Dataset  # Attributes  # Instances  Class Type (#)   Missing Values?  Data Characteristics
DAGA     17            72           Real             No               Time Series-Real
DSPF     27            1941         Categorical (7)  No               Integer-Continuous
DSYN     17            7280         Real             No               Time Series-Real
DRER     40            405          Categorical (2)  Yes              Integer

is that it may be necessary for a network to have access to the predictions of all the
previous networks, so that the training procedure can learn which of them are most
beneficial for the network's own prediction.

3 Experiment Setup

Four datasets were used to experiment with the proposed approaches, as shown in Ta-
ble 1. The first set, Agassiz, denoted as DAGA, was obtained from an agriculture ap-
plication1 [9]. This dataset is a record of the yields of greenhouse tomato plants under
controlled conditions. The second dataset is called the Synthetic dataset, denoted as DSYN,
and was extrapolated from the Agassiz dataset by randomly increasing or decreasing
each attribute's value by a random amount to create a synthetic value. The third one
is called the Steel Plates Faults dataset, denoted as DSPF. It was obtained from the UCI
machine learning repository2. It records various aspects of steel plates, such as type
of steel, thickness, and luminosity, which allows predicting various faults in the plates.
The fourth dataset, called the Restaurant Reviews dataset, denoted as DRER, was collected
from a website3. The data is intended to determine from customer reviews whether a
customer will return to a restaurant or not.
Before building our chained neural networks, our datasets have undergone some pre-
processing tasks. For each dataset, three attribute selection algorithms, Principal Com-
ponent Analysis (PCA) [10], Correlation-based Feature Selection (CFS) [11], and
ReliefF [12], were applied to diversify the datasets. PCA is a multivariate analysis
technique that takes as input a dataset of inter-correlated attributes with their
instances and produces a new dataset of reduced dimensionality while retaining most of
its properties. CFS selects a subset of attributes such that they are highly correlated
with the class attribute and the least correlated with each other. ReliefF ranks attributes
based on their relevance to the class attribute. A selected attribute would contain values
that are dissimilar for different classes and similar for the same class. The datasets
produced by these selection algorithms, along with the original ones, will assist
us in understanding and interpreting the performance of our chaining neural networks.
1 Provided by Dr. David Ehret of the Pacific Agri-Food Research Centre in BC, Canada.
2 http://archive.ics.uci.edu/ml/
3 Gathered from http://www.restaurantica.com/ by Taha Azizi, the University of Lethbridge, Canada.

Two filters were applied to the datasets: a nominal-to-binary filter and a data normal-
ization/centering filter. The nominal-to-binary filter converts all nominal variables into
binary variables. If a nominal variable consists of k values, and if the class is also nom-
inal, this filter will transform the variable into k binary attributes. The data normalization
filter transforms variables' values into values within a given range; here it normalizes
all variables' values in a given dataset, including the class, to values in the range [0, 1].
The need for data normalization and centering stems from the use of the sigmoid
thresholding function in neural networks.
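The two filters can be sketched as follows (function names are ours; Weka's actual filters handle more cases):

```python
import numpy as np

def min_max_normalize(X):
    """[0, 1] min-max normalization per column, as used before feeding the
    sigmoid-activation networks (constant columns are mapped to 0)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span

def nominal_to_binary(col):
    """One-hot encode a nominal column with k distinct values into k binary
    columns (value order fixed by sorting for reproducibility)."""
    values = sorted(set(col))
    return np.array([[1.0 if v == u else 0.0 for u in values] for v in col])
```
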
For each of the produced datasets and the original dataset, we create two styles of neural net-
works: an SLC-style chain, denoted SLCs, and an MLC-style chain, denoted MLCs.
In order to determine the best-performing network for each chain link, a “trial and er-
ror” process was conducted using a relatively small range of values for learning rates,
momenta, epochs, etc. The best chain link is the one that reduces the error (mean
absolute error, MAE) the most. The use of small ranges for training parameters is intended
to speed up the chain-link selection process.

[Figures 2 and 3: MAE vs. training epochs (0–1800) for the regular networks NNall, NNcfs, NNpca, NNreliefF and the corresponding SLC (resp. MLC) chains.]
Fig. 2. DAGA SLC networks vs. regular neural networks
Fig. 3. DAGA MLC networks vs. regular neural networks

A total of 20 SLCs and MLCs chain links were attempted for each dataset. For
SLC, only the best-performing chain link was chosen. For MLC, a subset of chain links
was selected, starting from the first chain link up to the chain link that reduced error the
most. After the best numbers of chain links had been determined, the entire chain was
trained using a larger range of parameters and evaluated using 10-fold cross-validation.
Then, its performance was compared against a typical individual neural network. Our
proposed approach was implemented on top of Weka4. All the experiments were run on
a network of computers controlled by a distributed computing system called Condor5.

4 Results and Analysis


Figs. 2 to 9 establish the comparison, in terms of prediction errors, between our ap-
proach and typical individual neural networks (NN∗) with an increasing number of
4 http://www.cs.waikato.ac.nz/ml/weka/
5 http://www.cs.wisc.edu/condor/

epochs after their structures have been determined. Tables 2, 3, 4, and 5 show a com-
parison between SLCs and MLCs performance for each dataset and its subsets by
highlighting the lowest error values achieved. In these tables, we also show the num-
bers of chain links for SLCs and MLCs.

Table 2. DAGA chain links summary
DAGA          all     cfs     pca     reliefF
SLCs  Link #  19      16      14      15
      MAE     0.6252  0.7806  0.8611  0.8911
MLCs  Link #  6       6       6       20
      MAE     0.6672  0.7366  0.5821  0.7611

Table 3. DSPF chain links summary
DSPF          all     cfs     pca     reliefF
SLCs  Link #  10      7       19      4
      MAE     0.0858  0.0968  0.0913  0.0856
MLCs  Link #  7       4       2       5
      MAE     0.0887  0.0932  0.0943  0.0843

The power of our approach becomes evident when it is compared against typical
neural networks. For DAGA, Figs. 2 and 3 show that both the SLCs chains and the MLCs
chains significantly outperformed their corresponding typical neural networks, and
did so early in the learning process, regardless of the subsets used. The same
situation is observed in Figs. 6 and 7 for DRER.
Figs. 4 and 5 for DSPF show that the performances of the SLCs and MLCs chains
differ from those of the other datasets' chains. In Fig. 4, three of the four SLCs chains, i.e.,
SLCcfs, SLCpca, and SLCreliefF, slightly outperformed their corresponding
typical neural networks, and in Fig. 5 two MLCs chains, MLCcfs and MLCpca,
slightly outperformed theirs. However, NNall
significantly outperformed all SLCs and MLCs chains. This might be due to the
fact that the data in DSPF is already of high quality, since it has been used for a long
time; any change to its attributes could only lead to poorer classification performance.

[Figures 4 and 5: MAE vs. training epochs (0–1800) for the DSPF chains and the regular networks NNall, NNcfs, NNpca, NNreliefF.]
Fig. 4. DSPF SLC networks vs. regular neural networks
Fig. 5. DSPF MLC networks vs. regular neural networks

From Table 2 and Fig. 2 it is interesting to note that the chain that reduced
error the most in the chain-links selection process did not necessarily reduce error
the most when trained. In the case of SLCs, for example, SLCall reduced the
largest error, followed by SLCcfs, SLCpca, and SLCreliefF, respectively, as shown in
the table. However, when cross-referencing the performance of each of the chains with
Fig. 2, we see that SLCcfs has, by far, reduced the largest error, followed by SLCpca,
SLCreliefF, and SLCall. It is rather interesting that the best-performing chain in the
chain-links selection stage performed the worst in the training process. This sit-
uation is also evident for MLCs in Table 2. The lowest error among the chains was
achieved by MLCpca, followed by MLCall, MLCcfs, and MLCreliefF. However,
when each of the chains was trained, the best-performing chain was MLCpca, followed
by MLCcfs, MLCreliefF, and MLCall, as shown in Fig. 3.
The same situation can be observed for the MLCs chains of DSPF in Table 3. MLCpca
performed the worst in selection, but in the training stage in Fig. 5 it became second. This
could be due to the fact that the parameters used in training these chains were not suffi-
cient to demonstrate the power of these networks. Whether this holds generally
needs more investigation.

[Figures 6 and 7: MAE vs. training epochs (0–1800) for the DRER chains and the regular networks NNall, NNcfs, NNpca, NNreliefF.]
Fig. 6. DRER SLC networks vs. regular neural networks
Fig. 7. DRER MLC networks vs. regular neural networks

Table 4. DRER chain links summary
DRER          all     cfs     pca     reliefF
SLCs  Link #  19      2       13      16
      MAE     0.0537  0.0461  0.0716  0.0367
MLCs  Link #  10      7       2       10
      MAE     0.0562  0.0469  0.0640  0.0387

Table 5. DSYN chain links summary
DSYN          all     cfs     pca     reliefF
SLCs  Link #  12      15      16      11
      MAE     4.9494  4.6457  4.7848  4.4919
MLCs  Link #  3       2       6       12
      MAE     5.0767  4.8037  4.9475  4.6738

It is noteworthy that the typical neural networks demonstrated some overfitting
patterns which disappeared or were reduced in our approach. For DAGA, in Fig. 2 (SLC) we can
see that NNall, NNpca, and NNreliefF show a slight overfitting pattern which
disappeared in the corresponding SLCs chains. On the other hand, Fig. 3 (MLC) shows
that MLCpca demonstrated an overfitting pattern. The same situation is noted in Figs. 6
and 7 for DRER, where NNall, NNcfs, and NNpca demonstrated overfitting that
disappeared in SLCall and SLCpca, and was reduced in SLCcfs and MLCcfs. This could
hint that our approach allows further training while delaying overfitting.

[Figures 8 and 9: MAE vs. training epochs (0–1800) for the DSYN chains and the regular networks NNall, NNcfs, NNpca, NNreliefF.]
Fig. 8. DSYN SLC networks vs. regular neural networks
Fig. 9. DSYN MLC networks vs. regular neural networks

When applying SLC to DSYN, we expected that the process would fail and produce
large errors from the start, since it is a synthetic dataset. Observing Figs. 8 and 9, it is
easily noticed that this is exactly the case: generally, the error increases as the number
of epochs increases. Table 5 is shown here for the purpose of the chain-selection process.
The performance of SLC and MLC varies when they are compared to each other.
Table 6 shows that MLC outperformed SLC on three of the four chains of DAGA,
namely MLCall, MLCpca, and MLCreliefF, with considerable differences, while SLC
outperformed MLC on one subset, SLCcfs, with a marginal difference. For DSPF,
MLCs performed better than SLCs for MLCcfs and MLCpca, and worse than SLCs for
SLCall and SLCreliefF. MLC and SLC for DRER performed relatively similarly:
MLC slightly outperformed SLC for MLCpca and MLCreliefF, while the opposite
holds for SLCall and SLCcfs.

Table 6. Performance summary of typical, SLCs, and MLCs networks

DAGA                 all     cfs     pca     reliefF
Typical
  learning rate      .015    .015    .015    .015
  momentum           .05     .1      .1      .05
  epochs             250     1750    400     250
  MAE                .826    .815    .761    .788
SLCs
  learning rate      .05     .1      .015    .05
  momentum           .4      .05     .3      .3
  epochs             1750    1500    1750    1750
  MAE                .771    .506    .711    .717
MLCs
  learning rate      .1      .1      .3      .1
  momentum           .1      .05     .05     .2
  epochs             1750    1750    1150    1750
  MAE                .606    .507    .445    .552

DSPF                 all     cfs     pca     reliefF
Typical
  learning rate      .2      .3      .2      .2
  momentum           .2      .3      .4      .3
  epochs             1550    1750    1750    1750
  MAE                .081    .096    .090    .083
SLCs
  learning rate      .3      .2      .2      .2
  momentum           .3      .1      .4      .2
  epochs             700     1750    1550    1150
  MAE                .084    .091    .088    .083
MLCs
  learning rate      .1      .05     .1      .3
  momentum           .2      .4      .2      .2
  epochs             1750    1550    1750    800
  MAE                .089    .090    .086    .084

DSYN                 all     cfs     pca     reliefF
Typical
  learning rate      .015    .015    .015    .015
  momentum           .05     .05     .05     .05
  epochs             50      150     200     150
  MAE                4.396   4.392   4.414   4.391
SLCs
  learning rate      .015    .015    .15     .015
  momentum           .05     .05     .05     .05
  epochs             25      50      25      50
  MAE                4.374   4.396   4.370   4.382
MLCs
  learning rate      .015    .015    .05     .015
  momentum           .05     .05     .4      .05
  epochs             50      150     50      25
  MAE                4.385   4.401   4.380   4.350

DRER                 all     cfs     pca     reliefF
Typical
  learning rate      .3      .2      .015    .2
  momentum           .3      .4      .05     .3
  epochs             25      50      650     100
  MAE                .060    .071    .086    .060
SLCs
  learning rate      .05     .3      .15     .3
  momentum           .05     .2      .2      .4
  epochs             1750    150     1650    1750
  MAE                .050    .046    .067    .036
MLCs
  learning rate      .3      .3      .3      .2
  momentum           .4      .3      .2      .3
  epochs             100     150     450     1700
  MAE                .056    .046    .064    .034

5 Discussion and Conclusion


We proposed using a chain of neural networks to improve classification accuracy. Our
approach has two variations, single-link chaining and multi-link chaining. We have
performed experiments to establish their effectiveness. Table 6 showed that MLC
outperformed SLC and the typical neural networks on eight out of 16 dataset variants,
while SLC outperformed MLC and the typical networks on six out of 16, and the typical
network outperformed both variations in only two cases.
Our work aligns with previous efforts to increase the effectiveness of neural
networks, in particular their classification accuracy. It would be beneficial to compare
our proposed chaining methods with similar previous results, e.g., the works in [7,8],
to explore their relative performance further. Due to time and space considerations,
we omit these comparisons in this initial report, but we plan to pursue this direction
as our main work in the near future.
In addition, we will further analyze our experimental results and provide theoretical
justifications as to why our approach works, or under what conditions it fails. Moreover,
we plan to investigate how to choose the number of chain links, or how to detect a
single good chain link, which will lead to automating and speeding up the chaining process.

References
1. Zhang, G.P.: Neural networks for classification: a survey. IEEE Transactions on Systems,
Man, and Cybernetics, Part C: Applications and Reviews 30, 451–462 (2000)
2. Lippmann, R.P.: An introduction to computing with neural nets. IEEE ASSP Magazine 3,
4–22 (1987)
3. Fahlman, S., Lebiere, C.: The cascade-correlation learning architecture. In: Advances in Neu-
ral Information Processing Systems, vol. 2, pp. 524–532 (1990)
4. Frean, M.: The upstart algorithm: A method for constructing and training feedforward neural
networks. Neural Computation 2, 198–209 (1990)
5. Cun, Y.L., Denker, J., Solla, S.: Optimal brain damage. In: Advances in Neural Information
Processing Systems, pp. 598–605 (1990)
6. Lu, B., Ito, M.: Task decomposition and module combination based on class relations: a
modular neural network for pattern classification. IEEE Transactions on Neural Networks 10,
1244–1256 (1999)
7. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence 12, 993–1001 (1990)
8. Maclin, R.: An empirical evaluation of bagging and boosting. In: Proceedings of the Four-
teenth National Conference on Artificial Intelligence, pp. 546–551. AAAI Press (1997)
9. Helmer, T., Ehret, D.L., Bittman, S.: Cropassist, an automated system for direct measurement
of greenhouse tomato growth and water use. Computers and Electronics in Agriculture 48,
198–215 (2005)
10. Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdisciplinary Reviews:
Computational Statistics 2, 433–459 (2010)
11. Hall, M.A.: Correlation-based feature selection for discrete and numeric class machine learn-
ing, pp. 359–366 (2000)
12. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algo-
rithm. In: AAAI, pp. 129–134. AAAI Press and MIT Press (1992)
Feature Ranking Methods
Used for Selection of Prototypes

Marcin Blachnik3 , Włodzisław Duch1,2 , and Tomasz Maszczyk1


1
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
2
School of Computer Engineering, Nanyang Technological University, Singapore
3
Dept. of Management & Informatics, Silesian University of Technology, Katowice, Poland
marcin.blachnik@polsl.pl
http://www.is.umk.pl

Abstract. Prototype selection, as a preprocessing step in machine learning, is
effective in decreasing the computational cost of the classification task by reducing
the number of retained instances. This goal is achieved by reducing the level of
noise and rejecting irrelevant data. Prototypes may also be used to understand
the data by improving the comprehensibility of results. In this paper we discuss an
approach to instance selection based on techniques known from feature selection,
pointing out the dualism between feature and instance selection. Finally, some
experiments are presented that use feature ranking methods for instance selection.

1 Introduction
Prototype selection is frequently used as a preprocessing step in machine learning. Originally
it was designed to improve the accuracy of nearest-neighbour-based classifiers,
but, as shown in [1], these methods are also very useful for improving the quality of
other prediction methods not necessarily based on the nearest neighbour approach.
Prototype selection (also called instance selection) is a process of selecting or
constructing new instances based on the original set of examples, such that n′ ≪ n,
where n is the number of original instances and n′ is the number of instances after selection. This
condition is satisfied by removing redundant examples and outliers, or by rejecting
irrelevant data from the training datasets. It can be shown that instance selection may
be treated as a form of regularization that leads to an improvement of predictive accuracy.
There may also be other benefits obtained from instance selection. An example
is prototype-based rules (P-rules), proposed in [2]. P-rules are based on capturing
similarity to reference examples, keeping the minimum number of prototypes that is
sufficient to achieve an appropriate error rate. Such prototypes may then be treated as
representatives of the original data, or transformed into a rule-based representation in
which the antecedent part of a single rule defines similarity between a prototype and a given vector.
P-rules with additive similarity functions may also be represented as fuzzy rules, but
are even more general [3].
Nowadays new data is acquired and stored from many sources (sensors, web logs, etc.),
leading to a tremendous number of samples. This poses a real challenge for data mining
tasks, because most machine learning algorithms are not designed to meet such
needs. Moreover, many of the algorithms require serial processing, without any chance

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 296–304, 2012.

c Springer-Verlag Berlin Heidelberg 2012

for distributing the training process among the nodes of a computer cluster. One possible
solution to this problem is data filtering. One may observe that in large
datasets the percentage of irrelevant or redundant samples is very high. This allows for
the application of instance selection methods such as ENN [4], CNN [5], or IB3 [6] (a survey
of almost 70 selection algorithms can be found in [7]). However, these algorithms are not
optimized to preserve low computational complexity.
For many years great importance has been attached to feature selection. This has allowed
the development of very effective and efficient methods which, after appropriate adaptation,
can be used for instance selection. A good example of such methods is feature ranking,
one of the most efficient approaches, which we believe may also meet the requirements
of instance filtering in large datasets. In this paper we therefore present how to
adjust feature ranking methods, with standard feature ranking coefficients, to the
instance selection challenge.
The paper is organized as follows: in the next section, feature selection algorithms
based on feature ranking are presented; in Section 3 we discuss the adaptation steps
needed to create a prototype selection algorithm based on feature ranking; illustrative examples
for several benchmark datasets are presented in Section 4; and conclusions are given
in the final section.
In the paper the following notation will be used: T =
{[x_1, y_1], [x_2, y_2], ..., [x_n, y_n]} is a training set of input vectors x_i ∈ R^m,
defined in an m-dimensional feature space f = {f_1, f_2, ..., f_m}; y_i is the label
associated with the vector x_i, and the label attribute is denoted by y.

2 Feature Ranking Methods

Feature selection is one of the fast-growing areas of machine learning
research. One of the reasons for its growing importance is the increasing size of the
datasets, which may consist of tens or hundreds of thousands of variables. The ob-
jective of variable selection is three-fold: improving the prediction performance of the
predictors, providing faster and more cost-effective predictors, and providing a better
understanding of the underlying process that generated the data [8].
Feature selection techniques rely on two main procedures: a search strategy and a
feature quality evaluation. The search strategy controls the order of evaluations of the
quality of feature subsets. Computational complexity of both the search strategy and the
feature evaluation should be low to meet the needs of high scalability. Feature ranking is
based on evaluating each feature independently, calculating the feature-class relevance
index H(fi , y) of some kind [9]. These indices are then sorted in descending order of
quality and m best features are selected.
An overview and comparison of different ranking indices H(·) can be found in [10].
Below, a short description of three commonly used indices based on the Correlation Coefficient
(CC), Mutual Information (MI), and the Switch Index (SI) is given, although our approach
may be used with any index.
Correlation Coefficient. The absolute value of the linear correlation coefficient is a
widely used ranking criterion.

 
R_{CC}(f_j, y) = \left| \frac{\sum_i (f_{ij} - \bar{f}_j)(y_i - \bar{y})}{\sqrt{\sum_i (f_{ij} - \bar{f}_j)^2 \, \sum_i (y_i - \bar{y})^2}} \right| \in [0, 1] \qquad (1)

It is proportional to the dot product between fj and y centered at their average values.
The Correlation Coefficient is applicable to binary and continuous variables or target
values, which makes it very versatile. Categorical variables can be handled by using
some coding method. The Correlation Coefficient is a measure of linear dependency
between variables. Irrelevant variables that are not correlated with the target should
have RCC value near zero, but a value near zero does not necessarily indicate that the
variable is irrelevant: a non-linear dependency may exist, which is not captured by this
measure.
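As an illustration, the |R_CC| ranking of Eq. (1) takes only a few lines of NumPy. The sketch below is our own illustrative code (the function name and sample data are ours, not the authors'):

```python
import numpy as np

def cc_ranking(X, y):
    """Rank features by |R_CC| of Eq. (1).

    X : (n, m) array of feature values, y : (n,) target vector.
    Returns (indices sorted from most to least relevant, the scores)."""
    Xc = X - X.mean(axis=0)              # centre each feature at its mean
    yc = y - y.mean()                    # centre the target
    num = Xc.T @ yc                      # per-feature dot products with y
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.abs(num / den)                # |R_CC| values in [0, 1]
    return np.argsort(-r), r

# feature 0 is perfectly (linearly) correlated with y, feature 1 is not
idx, scores = cc_ranking(
    np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0]]),
    np.array([1.0, 2.0, 3.0, 4.0]),
)
```

As the text warns, a score near zero only rules out a *linear* dependency, not dependence in general.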
Mutual Information. This criterion measures the strength of dependencies between random
variables. It may assess the “information content” also in cases where methods
based on linear relations are prone to mistakes.
The MI(f; y) index for random variables f and y is defined as:

MI(f; y) = \int\!\!\int P_{fy}(f, y) \log \frac{P_{fy}(f, y)}{P_f(f) P_y(y)} \, df \, dy \qquad (2)
The accuracy of estimating the joint probability P_{fy} from experimental data is usually
poor [9], especially for a large number of classes. Discretization of continuous variables
into boxes of size Δf Δy is commonly used to calculate discrete P_{fy} values (Parzen
windows are an obvious alternative). The MI index can then be re-written as:

MI(f; y) = \sum_{r_f} \sum_{r_y} P_{fy}(r_f, r_y) \log \frac{P_{fy}(r_f, r_y)}{P_f(r_f) P_y(r_y)} \qquad (3)

where r_f and r_y are the intervals for the f and y variables. Estimating the MI dependence of two
variables using N_f, N_y intervals requires estimation of N_f N_y probabilities, and
estimation of the index for pairs of features f, g requires N_f N_g N_y probabilities,
which is not only costly but also leads to significant errors, unless a lot of training data
is available.
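A histogram estimate of Eq. (3) can be sketched as follows (our own illustrative code; the default bin count is an assumption):

```python
import numpy as np

def mi_ranking(f, y, n_bins=10):
    """Histogram estimate of MI(f; y) as in Eq. (3): discretise the
    continuous feature f into n_bins intervals r_f; the class labels y
    are treated as already discrete."""
    edges = np.histogram_bin_edges(f, bins=n_bins)
    f_bins = np.digitize(f, edges[1:-1])          # interval index per sample
    mi = 0.0
    for rf in np.unique(f_bins):
        pf = np.mean(f_bins == rf)                # P_f(r_f)
        for ry in np.unique(y):
            py = np.mean(y == ry)                 # P_y(r_y)
            pfy = np.mean((f_bins == rf) & (y == ry))   # P_fy(r_f, r_y)
            if pfy > 0:
                mi += pfy * np.log(pfy / (pf * py))
    return mi

# a perfectly informative feature: MI equals log(2) for two balanced classes
mi = mi_ranking(np.array([0.0] * 4 + [1.0] * 4), np.array([0] * 4 + [1] * 4))
```

The quadratic growth in the number of estimated probabilities mentioned above shows up here directly: the double loop runs over N_f N_y cells.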
Switch Index. This simple measure is applicable to typical classification problems
with features f that can be ordered and discrete target variables y. Sorting the values of
variable f in ascending order, one calculates how many times the value of variable y
changes as subsequent f values are taken. If the correlation between feature values
and labels is ideal, the number of switches equals the number of classes minus one;
if there is no correlation, each increase of f may result in a change of y. The number of
switches is normalized to fit the range [0, 1].
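A minimal sketch of this index follows (our own code; the paper only states that the count is scaled to [0, 1], so the particular linear normalisation used here is our assumption):

```python
import numpy as np

def switch_index(f, y):
    """Sort the labels by feature value and count label changes; map
    c-1 switches (ideal ordering) to 1 and the maximum n-1 switches to 0.
    The linear normalisation is our own assumption."""
    ys = np.asarray(y)[np.argsort(f, kind="stable")]
    switches = int(np.sum(ys[1:] != ys[:-1]))
    n, c = len(ys), len(np.unique(ys))
    if n == c:                        # degenerate case: one sample per class
        return 1.0
    return 1.0 - (switches - (c - 1)) / ((n - 1) - (c - 1))
```

A perfectly separable feature gives 1.0, a fully alternating one gives 0.0.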

2.1 Redundancy Rejection


Ranking methods do not eliminate feature redundancy in an automatic way. Removing
redundancy is quite important to reduce the number of prototypes. Battiti [11] proposed
using a combination of the feature-class index H(f, y) and the feature-feature

index H(f, f′) to account for redundancy. The algorithm chooses features that maximize
the difference between H(f, y) and all H(f, f′), where f′ ranges over the features already
selected. The algorithm selects M features and is formalized as follows:
1. Set F ← “the initial set of features”; S ← ∅.
2. For each feature f ∈ F, compute H(f, y).
3. Find the feature f that maximizes H(f, y).
4. Move f from F to S: F ← F \ {f}; S ← {f}.
5. Repeat until |S| = M:
   – For all pairs of variables f ∈ F, f′ ∈ S calculate H(f; f′).
   – Choose the feature f that maximizes H(f, y) − β/|S| Σ_{f′∈S} H(f; f′).
   – Move f from F to S: F ← F \ {f}; S ← S ∪ {f}.
Parameter β regulates the relative importance of the feature-feature relevance H(f, f′) for
all already selected features with respect to the feature-class relevance H(f, y). The
recommended β value is between 0.5 and 1 [11]. This ranking selects features that are
highly relevant for the target and loosely correlated with the other selected features. This
procedure is more costly than simple ranking, because calculation of the feature-feature
indices requires O(m²n) operations. However, instead of all m features, one can take
into account only the top-ranked features that are worth considering.
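The greedy loop above can be sketched as follows, assuming the index values are precomputed (our own illustrative code; the function and argument names are hypothetical):

```python
import numpy as np

def battiti_select(H_fy, H_ff, M, beta=0.7):
    """Greedy Battiti-style selection of M features [11].

    H_fy : (m,) precomputed relevance indices H(f, y);
    H_ff : (m, m) symmetric feature-feature indices H(f, f');
    beta : redundancy weight, recommended in [0.5, 1]."""
    m = len(H_fy)
    F = set(range(m))
    S = [int(np.argmax(H_fy))]        # steps 3-4: most relevant feature first
    F.remove(S[0])
    while len(S) < M and F:           # step 5: redundancy-penalised additions
        score = {f: H_fy[f] - beta / len(S) * sum(H_ff[f, s] for s in S)
                 for f in F}
        best = max(score, key=score.get)
        S.append(best)
        F.remove(best)
    return S
```

With β = 1 and two strongly redundant top features, the sketch picks the weaker but independent third feature second, as intended.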

3 The Ranking Algorithm for Instance Selection (RBIS)


The RBIS algorithm adapts the feature ranking approach to the prototype (instance) selection
task. It starts by calculating the distance matrix for the training data. At this point the
distance/similarity matrix can be viewed as the dataset in a new kernel feature space
(with the kernel equivalent to the distance function). Each column of the distance matrix is then
ranked as a normal feature, which is equivalent to ranking instances using the specified feature
ranking criterion. To improve the quality of the instance selection process, instances
are ranked separately for each class. RBIS then selects the N_i best instances from each
class, keeping the proportions of the original data. The details are described below:
1. Choose level of compression γ (and β for RBIS+) and ranking criterion H(D, Y ).
2. Compute distance matrix D(·) for training data X.
3. Select and group columns of the distance matrix of each class.
4. Rank, and sort features (columns) separately for each group.
5. From each group select best Ni examples, where Ni is proportional to the number
of examples in i’th class.
6. Train a 1NN classifier on the subset of vectors (with original features) S_γ and test it
to obtain the performance on test data.
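The numbered steps can be sketched as follows (our own illustrative code, with plain Euclidean distances, no kernel weighting, and the final 1NN evaluation of step 6 omitted; `rank_index` can be any of the H(·) criteria of Section 2):

```python
import numpy as np

def rbis(X, y, gamma, rank_index):
    """RBIS sketch. X : (n, m) data, y : (n,) labels, gamma : fraction of
    instances to keep, rank_index(column, y) : any ranking criterion H
    applied to one column of the distance matrix.
    Returns the indices of the selected prototypes."""
    # step 2: Euclidean distance matrix, one "kernel feature" per instance
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    keep = []
    for c in np.unique(y):                       # steps 3-4: rank per class
        cls = np.where(y == c)[0]
        scores = np.array([rank_index(D[:, i], y) for i in cls])
        n_i = max(1, int(round(gamma * len(cls))))   # step 5: keep N_i best
        keep.extend(cls[np.argsort(-scores)[:n_i]].tolist())
    return np.array(sorted(keep))
```

Selecting per class keeps the class proportions of the original data, exactly as step 5 requires.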
Note that there is a choice of kernel here, for example using Gaussian-weighted distances
instead of simple Euclidean distances. Another variant of this method, called
RBIS+, takes into account the redundancy rejection described above. It uses a modified
ranking criterion, subtracting from H(d_i, y) the mean value of H(d_i, d_j), j ≠ i:

H′(d_i) = H(d_i, y) − β \bar{H}(d_i, d_{j, j≠i}). \qquad (4)
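Given precomputed index values, Eq. (4) amounts to one vectorised operation (a sketch of our own; the argument names are hypothetical):

```python
import numpy as np

def rbis_plus_score(H_dy, H_dd, beta):
    """Eq. (4): relevance H(d_i, y) penalised by the mean of the pairwise
    indices H(d_i, d_j) over all j != i."""
    m = H_dd.shape[0]
    mean_off_diag = (H_dd.sum(axis=1) - np.diag(H_dd)) / (m - 1)
    return H_dy - beta * mean_off_diag
```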
[Figure: four panels of accuracy [%] vs. compression [%] curves for the ranking criteria MI, CC, and SI: (a) Ionosphere, (b) Sonar, (c) Vehicle, (d) Car]

Fig. 1. Comparison of the influence of the ranking criterion on the accuracy of the RBIS algorithm;
for Diabetes the differences were not significant

4 Illustrative Examples

The usefulness of the method presented in this paper has been evaluated on 5 popular
datasets from the UCI repository [12]: Ionosphere, Sonar, Vehicle, Car, and Diabetes (also
known as the “Pima Indian diabetes” dataset). Only the last one has a relatively simple
structure [13]. In our experiments, the relation between data compression
and achievable accuracy was analyzed for each dataset. For that purpose the algorithm selected 20%,
40%, 60%, 80%, or 100% of the samples, and the accuracy of the system was evaluated.
All calculations are wrapped in a cross-validation test to estimate the expected accuracy.
For comparison, the Monte Carlo, ENN [4], CNN [5], and IB3 [6] instance
selection algorithms are also used, as well as methods based on editing the distance graph,
such as Gabriel Editing (GE) and the Relative Neighbor Graph (RNG) [14,15].
In the first series of experiments two different aspects of the RBIS algorithm are
compared. First, the influence of the ranking criterion is analyzed. Results are presented
graphically in Fig. 1.
Unlike in the feature ranking comparison [10], the ranking criterion used for
instance selection has a strong influence on the results. In almost all cases the Switch
[Figure: five panels of accuracy [%] vs. compression [%] curves for β ∈ {0, 0.01, 0.1, 0.5}: (a) Ionosphere, (b) Sonar, (c) Vehicle, (d) Car, (e) Diabetes]

Fig. 2. Comparison of the influence of the redundancy rejection parameter on the accuracy of the
RBIS algorithm

Index performed best, but for the Car data Mutual Information is much better. This may
be related to the special properties of the distance function to which the indices were
applied and, in the case of the Car data, to the discrete nature of the features; it also
points to the need for selecting the optimal ranking coefficient for a given problem.

[Figure: four panels of accuracy vs. compression curves for MC, RBIS, and RBIS+, with marked points for ENN, CNN, IB3, GE, and RNG: (a) Ionosphere, (b) Sonar, (c) Vehicle, (d) Car]

Fig. 3. Comparison of the RBIS and RBIS+ algorithms to the state-of-the-art methods ENN, CNN,
IB3, GE, and RNG

In the second series of experiments the effect of redundancy rejection was tested.
To verify the performance of the algorithm described in Section 2.1, five β values were
tested. The results presented in Fig. 2 show that for some data the redundancy filter
may significantly affect the accuracy when high data compression is desired, although the
computational complexity strongly increases in this case.
Comparison of the RBIS and RBIS+ algorithms to the state-of-the-art prototype selection
algorithms ENN, CNN, IB3, GE, and RNG is presented in Fig. 3. The results show
that our algorithms based on instance ranking are quite competitive with these methods,
allowing for regulation of the number of prototypes left, and keeping low computational
complexity (for RBIS).

5 Conclusions

The Ranking Based Instance Selection (RBIS) algorithm for prototype selection, based
on feature ranking methods, has been described. It follows the basic approach of
feature selection, using instance ranking to determine the relevance of particular data
samples. The use of the dualism between feature and instance selection in kernel spaces
seems to be a novel idea worth further exploration.
Results of numerical experiments showed the importance of selecting a good ranking
coefficient. Surprisingly, in almost all experiments the simple Switch Index achieved
better results than the state-of-the-art indices based on Mutual Information or Pearson's
Correlation Coefficient. This may be due to the properties of the obtained distance
matrix. The results of the RBIS algorithm are competitive with the state-of-the-art methods,
although for a small number of prototypes CNN achieved slightly better results on the
Sonar and Ionosphere data. This may be related to the problem of redundancy,
because similar instances usually have similar ranking indices. The methodology of
rejecting redundant instances based on Battiti's feature selection algorithm,
although computationally expensive, sometimes gives good results while selecting a small
number of vectors. Room for improvement includes different kernels replacing the Euclidean
distance (e.g., Gaussian-weighted kernels that stress the importance of closer distances),
as well as clustering methods that reduce the influence of redundancy by selecting a
single prototype from each cluster.

Acknowledgment. The work was sponsored by the Polish Ministry of Science and
Higher Education, project No. 4421/B/T02/2010/38 (N516 442138). The software package
is available at http://www.prules.org

References
1. Grochowski, M., Jankowski, N.: Comparison of Instance Selection Algorithms II. Results
and Comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.)
ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
2. Duch, W., Grudziński, K.: Prototype based rules - new way to understand the data. In: IEEE
International Joint Conference on Neural Networks, pp. 1858–1863. IEEE Press, Washington
D.C. (2001)
3. Duch, W., Blachnik, M.: Fuzzy Rule-Based Systems Derived from Similarity to Prototypes.
In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.) ICONIP 2004. LNCS,
vol. 3316, pp. 912–917. Springer, Heidelberg (2004)
4. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans.
Systems, Man and Cybernetics 2, 408–421 (1972)
5. Hart, P.E.: The condensed nearest neighbor rule. IEEE Transactions on Information The-
ory 114, 515–516 (1968)
6. Aha, D., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6,
37–66 (1991)
7. Salvador, G., Joaquin, D., Cano, J.R., Herrera, F.: Prototype selection for nearest neighbor
classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and
Machine Intelligence 34(3), 417–435 (2010)
8. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.: Feature extraction, foundations and applica-
tions. Springer, Heidelberg (2006)
9. Duch, W.: Filter methods. In: Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L. (eds.) Feature
Extraction, Foundations and Applications, pp. 89–118. Springer, Physica Verlag, Heidelberg,
Berlin (2006)
10. Duch, W., Wieczorek, T., Biesiada, J., Blachnik, M.: Comparison of feature ranking meth-
ods based on information entropy. In: Proc. of International Joint Conference on Neural
Networks, pp. 1415–1420. IEEE Press, Budapest (2004)

11. Battiti, R.: Using mutual information for selecting features in supervised neural net learning.
IEEE Trans. on Neural Networks 5, 537–550 (1994)
12. Asuncion, A., Newman, D.:UCI machine learning repository (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
13. Duch, W., Maszczyk, T., Jankowski, N.: Make it cheap: learning with o(nd) complexity. In:
2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, pp. 132–135
(2012)
14. Bhattacharya, B.K., Poulsen, R.S., Toussaint, G.T.: Application of proximity graphs to edit-
ing nearest neighbor decision rule. In: Proc. Int. Symposium on Information Theory, Santa
Monica, CA, pp. 1–25 (1981)
15. Toussaint, G.T.: The relative neighborhood graph of a finite planar set. Pattern Recogni-
tion 12(4), 261–268 (1980)
A “Learning from Models” Cognitive Fault
Diagnosis System

Cesare Alippi, Manuel Roveri, and Francesco Trovò

Politecnico di Milano, Piazza L. da Vinci 32, 20133 Milano, Italy


{alippi,roveri,trovo}@elet.polimi.it

Abstract. We present an unsupervised cognitive fault diagnosis frame-


work for nonlinear dynamic systems working in the space of approximat-
ing models. The diagnosis system detects and classifies faults by relying
on a fault dictionary that is empty at the beginning of the system’s life
and is automatically populated as faults occur. Outliers are treated as
separate instances until enough confidence is built, and are then either
integrated into existing classes or promoted to a new fault class. Simulation
results show the effectiveness of the proposed approach.

Keywords: cognitive fault diagnosis systems, evolving clustering.

1 Introduction
Unsupervised cognitive fault diagnosis systems for complex dynamic plants take
advantage of machine learning algorithms to catch the description of their nom-
inal status and assess potential fault-induced changes by inspecting deviations
from nominal conditions. Fault isolation and classification phases follow, by ex-
ploiting information and features extracted from available datastreams. As a
consequence, these cognitive fault diagnosis systems are able to work with ap-
preciable accuracy even when a model for the system under monitoring is par-
tially or totally unavailable. In fact, the ability to learn the nominal condition
during the operational life does not require any a-priori knowledge about the
nature of the fault or its time profile, hence making feasible an on-line generation
and maintenance of the fault dictionary, i.e., the dictionary containing
the fault signatures.
While the literature about fault diagnosis provides rather well-established
techniques (e.g., see [3]), research on cognitive fault diagnosis systems is rela-
tively new with few works available [4,5,6,7] mainly focusing on specific applica-
tion cases within an evolving framework. [3]-(Chap. 16) suggests an unsupervised
“clustering-labeling” approach based on SOM which builds rules during training
to distinguish nominal from faulty states. Similarly, [4] and [5] provide diagnos-
tic mechanisms for a quality inspection system. An evolution-based fuzzy-neural
approach to fault diagnosis for marine propulsion systems is presented in [6]. In

This research has been funded by the European Commission's 7th Framework
Programme, under Grant Agreement INSFO-ICT-270428 (iSense).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 305–313, 2012.

c Springer-Verlag Berlin Heidelberg 2012

[7], nominal states are separated from faulty ones by relying on a fuzzy c-means
algorithm. In general, these solutions confine the evolving approach solely to the
training phase.
The paper suggests an evolving model-free mechanism for cognitive fault di-
agnosis working in the space of model parameters. The approach can be briefly
summed up as follows:
– the parameters of a suitable linear model are extracted during the training
phase and used to characterize the nominal conditions of the plant. No
assumption about the linearity of the process generating the data is made;
– when a deviation from the nominal condition is detected, the fault dictionary
is used to determine the class of fault. At the beginning no fault dictionary is
given, and the algorithm automatically builds it over time by following an
evolving mechanism. Procedures for the management of the dictionary, e.g.,
collapsing two equivalence classes, are also provided.
The novelty of the proposed approach resides in the justification of the use of
linear models for building the fault dictionary mechanism, and in the clustering-
based evolving algorithm, which is able to automatically update the fault dic-
tionary over time. During the operational life, the proposed algorithm detects
whether new parameters describing the approximating model can be regarded as
generated from the nominal state or from a previously created type of fault (each
cluster of parameters is associated with a class of faults), whether they represent
a new faulty state or, again, whether they are outliers.
Since the key point of the analysis refers to the use of linear dynamic systems,
and given the fact that the linear hypothesis can be barely accepted in a real
application scenario, we review in Section 2 the theoretical framework justifying
such a choice. Section 3 describes the proposed algorithm for a cognitive fault
diagnosis system, while Section 4 presents the experimental results.

2 Problem Formulation
In the following, we assume that the process P under investigation is unknown
and time invariant. In reality, the above assumption can be weakened by also
considering a time variant process provided that the time variance is explored
during the training phase (e.g., the nominal state can be characterized through
a Markov process in the parameter space). In other words, we are assuming that
the nominal state can be somehow characterized through approximating linear
models despite the fact P is time variant or not.
We approximate P by considering discrete-time linear MISO models:

A(z)\, y(t) = \sum_{i=1}^{m} \frac{B_i(z)}{F_i(z)}\, u_i(t) + \frac{C(z)}{D(z)}\, d(t)

where y(t) ∈ R is the output of the system at time t, u(t) = (u1 (t), . . . , um (t)) ∈
Rm is the vector of input samples at time t, d(t) is an independent and identically

distributed (i.i.d.) random variable accounting for the noise, z is the time-shift
operator, A(z), Bi (z), C(z), D(z), and Fi (z) represent the z-transform functions,
whose parameter vectors are θA , θB1 , . . . , θBm , θC , θD , θF1 , . . . , θFm , respectively.
By using this notation, an element f_θ of the approximating linear model family
M(θ) can be fully described by a parameter vector θ ∈ R^p encompassing the
above parameter vectors. In the following, we assume that the model does not
degenerate, an assumption which might eventually require an underdimensioned
model (here we care about diagnosis ability more than approximation
accuracy).
A linear model places our framework on solid mathematical ground [1,2],
despite the potentially introduced model bias ||f_θ − P||. More specifically, let
M(θ) be a family of models f_θ parameterized by θ ∈ D_M, with D_M a compact
subset of R^p. Consider a loss function V_N(u, y, θ): R^m × R × R^p → R providing,
given a training dataset composed of N samples {u(t), y(t)}_{t=1}^{N}, an estimate θ̂_N
of the optimal parameter θ* = arg min_{θ∈D_M} lim_{N→+∞} W_N(θ), where W_N(θ) =
E[V_N(u, y, θ)]. Under the hypotheses that:
– P satisfies the exponential stability for the closed loop system, i.e., it is
possible to generate accurate approximations of y(t) given time windows of
y(·) and u(·) without requiring data coming from the remote past;
– fθ is, at most, linear with respect to u(t), y(t) and three times differentiable
w.r.t. θ;
– VN (u, y, θ) has partial derivatives up to order 3 bounded by a constant;
– ∃β ∈ R+ , ∃N0 ∈ R+ s.t. WN (θ) > βI ∀θ ∈ DM , N ≥ N0 ;
from [1] we have that

\sqrt{N}\, P_N^{-1/2}\, (\hat{\theta}_N - \theta^*) \sim \mathcal{N}(0, I_p), \quad N \to \infty, \qquad (1)

where I_p is the identity matrix of order p and P_N ∈ R^{p×p} is the covariance
matrix of the model parameters. The theorem assures that, given a large enough
N and a sequence of i.i.d. θ̂_N, the distribution underlying the parameter vector
is a multivariate Gaussian, with mean θ* and covariance matrix P_N.
The theorem contemplates the situation P ∉ M and hence supports the
very relevant case in which ||f_{θ*} − P|| ≠ 0 (e.g., there exists a model bias
induced by nonlinearities). In fact, under the aforementioned hypotheses, there
always exists a unique optimal point θ*, even if there is a model bias. As far
as the approximation error ||f_{θ̂_N} − f_{θ*}|| is concerned, its contribution decreases
as N → +∞.
Model hierarchies satisfying the above include Extreme Learning
Machines [8], Reservoir Computing networks [9] (under the full invertibility
hypothesis, which leads the model parameters to satisfy the above) and, more
generally, any linear dynamic model. Since Eq. 1 holds, for a fixed N we can
build a probabilistic neighborhood centered in θ* and compute the probability
that a generic estimate θ̂N belongs to it.
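As a concrete illustration of this batch-wise estimation, the sketch below (ours, not the authors' code; it borrows the ARX(2,1) model and the batch size N = 400 used later in Section 4) fits θ̂N by least squares on non-overlapping batches and inspects the empirical spread of the estimates, which Eq. 1 predicts to be Gaussian around θ*:

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2, b1 = 0.1, 0.3, -0.2          # true theta* (from Application D1)
T, N = 8000, 400                      # total samples, batch size

u = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    u[t] = 0.4 * u[t - 1] + rng.normal(0, 1.0)
for t in range(2, T):
    y[t] = a1 * y[t - 1] + a2 * y[t - 2] + b1 * u[t - 1] \
           + rng.normal(0, np.sqrt(0.025))

def estimate_theta(y_b, u_b):
    """Least-squares ARX(2,1) fit on one batch: regressors are
    y(t-1), y(t-2), u(t-1)."""
    X = np.column_stack([y_b[1:-1], y_b[:-2], u_b[1:-1]])
    theta, *_ = np.linalg.lstsq(X, y_b[2:], rcond=None)
    return theta

thetas = np.array([estimate_theta(y[i:i + N], u[i:i + N])
                   for i in range(0, T - N + 1, N)])
print("empirical mean of theta_hat:", thetas.mean(axis=0))  # close to (0.1, 0.3, -0.2)
print("empirical covariance:\n", np.cov(thetas.T))
```

The empirical covariance of the batch estimates plays the role of PN when building the probabilistic neighborhood around θ*.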
308 C. Alippi, M. Roveri, and F. Trovò

3 The Cognitive Diagnosis System


The suggested diagnostic mechanism is based on a novel evolving clustering
algorithm able to exploit both spatial (in DM) and temporal proximities. Its
novelty resides in three main features. First, the framework of Section 2 provides
the formal basis for considering elliptical cluster shapes (i.e., Eq. 1 guarantees
that parameter vectors θ̂N asymptotically follow a multivariate Gaussian
distribution), whereas the literature considers circular shapes [10,11]; it also
justifies the use of the Mahalanobis distance for spatial thresholding. Second, we
introduce a logical distinction among nominal clusters, faulty clusters and
outliers, so as to define suitable evolving rules for cluster update/management
(traditional evolving clustering algorithms do not define a structure over clusters).
Finally, spatial and temporal proximities are generally treated separately in the
related literature [10,11,12], while here we rely on a joint temporal/spatial
figure of merit to create new clusters.
The proposed Alg. 1 requires an initial fault-free dataset {(u(t), y(t))}_{t=1}^{M} to
learn the nominal state. Starting from these data, the designer selects a linear
model hierarchy and estimates its optimal order (to be kept fixed also in the
operational phase). In particular, data are divided into batches of N samples,
each one providing a parameter vector θ̂N. The procedure leads to a sequence
of vectors {θ̂N,i}_{i=1}^{M/N} used to model the nominal state Ψ = {Ψj}_{j=1}^{ψ} (ψ = |Ψ|,
where |·| is the cardinality operator), which is composed of a set of ψ a-priori
overlapping elliptic neighbourhoods induced by the Mahalanobis distance.
Each elliptic neighbourhood is characterized by a mean vector θ̄Ψj ∈ Rp and a
covariance matrix SΨj ∈ Rp×p.
After the training phase, the initially empty fault dictionary Φ = {Φj}_{j=1}^{φ}
(φ = |Φ|) and the outlier set O (o = |O|) are populated with data interpreted
as "not belonging" to the nominal state (outliers are instances without enough
confidence to be promoted to faulty states). During the operational life, for each
new parameter vector θ̂N,i, i > M/N (Line 2), the algorithm verifies whether
it belongs to the nominal state Ψ (Line 3) or not, by computing its squared
Mahalanobis distances m(θ̂N,i, Ψj) = (θ̄Ψj − θ̂N,i)^T SΨj^{−1} (θ̄Ψj − θ̂N,i), ∀Ψj ∈ Ψ. We
assert that a parameter vector belongs to the nominal state if:
∃j* ∈ {1, . . . , ψ} s.t. j* = arg min_j m(θ̂N,i, Ψj) ≤ ηs,     (2)

where ηs is a properly chosen spatial threshold, strictly related to the probability
of assigning a point to a cluster. In fact, if θ̂N ∈ Ψj then

φj(φj − p) / (p(φj² − 1)) · m(θ̂N,i, Ψj) ∼ F(p, φj − p),

where φj = |Φj| and F(p, φj − p) is the Fisher distribution with
parameters p and φj − p [13]. As such, by fixing the confidence level, ηs can
be derived from the above Fisher distribution. If the previous parameter
vector θ̂N,i−1 also belongs to Ψj*, we strengthen the hypothesis that θ̂N,i
belongs to the j*-th nominal cluster and, consequently, we update θ̄Ψj* and SΨj*
by taking into account the new parameter vector. Otherwise, no update action
is performed. Other temporal conditions could be considered as well. The same
Evolving Fault Isolation 309

inclusion/update policy is adopted for assigning parameter vectors θ̂N to the
faulty clusters Φj ∈ Φ, ∀j ∈ {1, . . . , φ} (Line 9).
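The spatial test above can be sketched as follows (an illustrative helper of ours, not the paper's implementation): the squared Mahalanobis distance is compared against a threshold ηs obtained from the F-distribution quantile through the scaling constant that appears in the discussion of Eq. 2.

```python
import numpy as np
from scipy import stats

def mahalanobis_sq(theta, mean, cov):
    """Squared Mahalanobis distance of theta to a cluster (mean, cov)."""
    d = theta - mean
    return float(d @ np.linalg.inv(cov) @ d)

def spatial_threshold(n, p, confidence=0.99):
    """eta_s: a member of a size-n cluster in R^p exceeds it only with
    probability 1 - confidence, via the F(p, n - p) quantile."""
    scale = n * (n - p) / (p * (n ** 2 - 1))
    return stats.f.ppf(confidence, p, n - p) / scale

rng = np.random.default_rng(1)
p, n = 3, 200
cluster = rng.multivariate_normal([0.1, 0.3, -0.2], 0.001 * np.eye(p), size=n)
eta_s = spatial_threshold(n, p)
dists = np.array([mahalanobis_sq(x, cluster.mean(axis=0), np.cov(cluster.T))
                  for x in cluster])
print(f"eta_s = {eta_s:.2f}, fraction within: {(dists <= eta_s).mean():.2f}")
```

By construction, roughly the chosen confidence fraction of the cluster members falls within the ηs neighbourhood.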
When the faulty cluster parameters θ̄Φj* and SΦj* have been updated (after
the insertion of θ̂N,i into Φj*), the algorithm checks whether a non-empty set of
outliers O′ ⊂ O can be moved from O to Φj* to improve its statistical
characterization (Line 13). When O′ = {θ̂N,h s.t. h ∈ X(O), m(θ̂N,h, Φj*) ≤ ηs} ≠ ∅,
where X(O) = {i s.t. θ̂N,i ∈ O}, the algorithm removes all points in O′ from O and
updates (removal action) ωk(i), ∀k ∈ X(O) (Line 15); ωk(i) is a weight
associated with each outlier, whose updating procedure is described later. Moreover,
the algorithm verifies whether two faulty clusters Φj* and Φk, k ∈ {1, . . . , φ}, k ≠ j*,
can be merged (Line 18), which happens when m(θ̄Φj*, Φk) ≤ ηs ∨ m(θ̄Φk, Φj*) ≤
ηs (i.e., we evaluate the distance between the two cluster mean values). If θ̂N,i
belongs neither to the nominal state nor to a faulty class, it is labeled as
an outlier (Line 22) and added to O. To decide whether a new cluster must be
created with elements in O, we rely on the parameter ωk, k ∈ X(O), associated
with each outlier in O. This parameter evaluates both the temporal and spatial
proximity of the outliers, and increases in value when a sequence of outliers is
close in time and space. When this parameter exceeds a predefined threshold
ηo, a new cluster is created. In more detail, the parameter ωk, which is set to 0
for the first outlier inserted in O, is updated according to the following iterative
rule (inspired by [12]):
ωk(i) = { ωk(i − 1) + exp( −||θ̂N,i − θ̂N,k||/p − |i − k| )        if k ∈ X(O)
        { Σ_{h ∈ X(O)} exp( −||θ̂N,i − θ̂N,h||/p − |i − h| )      if k = i

It is worth noting that the term added at each parameter vector insertion
is proportional to the spatial (||θ̂N,i − θ̂N,k||) and temporal (|i − k|) proximities.
To decide whether to create a cluster, the algorithm verifies whether ∃k* ∈
X(O) s.t. ωk*(i) ≥ ηo. By setting ηo differently, we decide how densely the
parameter vectors should aggregate in order to identify a new cluster. The newly
created cluster Φφ+1 is formed by the subset O′ ⊆ O with the longest and
most recent temporally contiguous sequence of parameter vectors. O′ becomes
a new cluster and its elements are removed from O. Afterwards, the algorithm
updates the parameter ωk for all points remaining in O:

ωk(i) = ωk(i) − Σ_{h ∈ X(O′)} exp( −||θ̂N,k − θ̂N,h||/p − |k − h| − α(i − max(k, h)) ),

where α ∈ R+ is an oblivion coefficient. Finally, the algorithm computes ωk(i +
1) = ωk(i) · e^{−α}, ∀k ∈ X(O), to reduce the influence of parameter vectors added
in the remote past. By tuning the ηs, ηo and α parameters, we control the cluster
creation frequency, as well as the number of points included in each cluster.
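The weight bookkeeping described above can be sketched as follows (the function names, the burst scenario and the seeding of the newcomer's weight are our illustrative assumptions, not the authors' code):

```python
import numpy as np

def insert_outlier(omega, outliers, theta_new, i, p):
    """Insertion action: weights of existing outliers grow with the spatial
    and temporal proximity of the newcomer; the newcomer's own weight is
    seeded from the same proximities (0 for the very first outlier)."""
    seed = 0.0
    for k, theta_k in outliers.items():
        prox = np.exp(-np.linalg.norm(theta_new - theta_k) / p - abs(i - k))
        omega[k] += prox
        seed += prox
    outliers[i] = theta_new
    omega[i] = seed

def decay(omega, alpha):
    """omega_k(i+1) = omega_k(i) * exp(-alpha), applied at every step."""
    for k in omega:
        omega[k] *= np.exp(-alpha)

p, alpha, eta_o = 3, 0.001, 1.0
omega, outliers = {}, {}
rng = np.random.default_rng(2)
center = np.array([0.5, -0.2, 0.1])
# a burst of outliers close in time and space drives some weight past eta_o
for i in range(100, 110):
    insert_outlier(omega, outliers, center + 0.01 * rng.normal(size=p), i, p)
    decay(omega, alpha)
print(max(omega.values()) >= eta_o)   # True: a new cluster would be created
```

Isolated outliers, by contrast, receive only exponentially small increments and never cross ηo.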

4 Experimental Results

Five figures of merit have been defined to evaluate the method:



Algorithm 1. Fault isolation clustering method


1: Train a cluster for each nominal state Ψ = {Ψj}_{j=1}^{ψ}
2: while A new θ̂N,i is provided do
3: if ∃j ∗ s.t. m(θ̂N,i , Ψj ∗ ) ≤ ηs then
4: Ψj ∗ ← Ψj ∗ ∪ {θ̂N,i }
5: if θ̂N,i−1 ∈ Ψj ∗ then
6: Update mean θ̄Ψj∗ and covariance matrix SΨj∗ using θ̂N,i
7: end if
8: else
9: if ∃j ∗ s.t. m(θ̂N,i , Φj ∗ ) ≤ ηs then
10: Φj ∗ ← Φj ∗ ∪ {θ̂N,i }
11: if θ̂N,i−1 ∈ Φj ∗ then
12: Update mean θ̄Φj∗ and covariance matrix SΦj∗ using θ̂N,i
13: if O′ = {θ̂N,h s.t. h ∈ X(O), m(θ̂N,h, Φj∗) ≤ ηs} ≠ ∅ then
14: Φj∗ ← Φj∗ ∪ O′
15: Remove moved vectors from O and update on removal action ωk (i)
16: end if
17: if ∃k, k ≠ j∗, s.t. m(θ̄Φj∗, Φk) ≤ ηs ∨ m(θ̄Φk, Φj∗) ≤ ηs then
18: Merge Φj ∗ and Φk
19: end if
20: end if
21: else
22: O ← O ∪ {θ̂N,i }
23: Update on insertion action ωk (i)
24: if ∃k∗ ∈ X (O) s.t. ωk∗ (i) ≥ ηo then
25: Create a new cluster Φφ+1 and remove used vectors from O
26: Update on removal action ωk (i)
27: end if
28: end if
29: end if
30: ωk (i + 1) = ωk (i) ∗ e−α ∀k ∈ X (O)
31: end while

– pc : the percentage of runs where the algorithm identified the correct number
of clusters;
– ε: the percentage of parameter vectors assigned to the wrong cluster;
– po : the percentage of parameter vectors assigned to the outlier set O;
– τ : the delay (in terms of number of batches) between the occurrence of the
change and the creation of the corresponding cluster;
– nc : the number of clusters at the end of the experiment (to be compared
with the expected one).

All these values are computed at the end of each experiment; ε, po and τ are
calculated only if the algorithm correctly detects the number of clusters, and are
averaged over the independent experiments. In the following, the clustering
algorithm parameters are set to ηs = 3, ηo = 1 and α = 0.001.

Table 1. Experimental results on Application D1 and D2

Application D1 Application D2
δ pc ε po τ nc pc ε po τ nc
0.01 0% N.a. N.a. N.a. 1.02 0% N.a. N.a. N.a. 1.005
0.05 45% 0.0300 0.0795 5.0926 4.86 12.5% 0.2680 0.0983 11.6667 2.055
Profile 1 0.075 40% 0.0273 0.0807 4.8292 4.92 32.5% 0.0968 0.0774 6.3616 4.605
0.1 44.5% 0.0269 0.0785 4.8801 4.865 39.5% 0.0337 0.0589 4.7137 4.93
0.15 42% 0.0263 0.0778 4.8492 4.935 32.5% 0.0330 0.0674 5.0154 5.075
0.01 1% 0.474 0.1178 39.5 1.01 0% N.a. N.a. N.a. 1
0.05 82% 0.0169 0.0343 4.9512 2.21 34% 0.4225 0.0627 6.6912 1.43
Profile 2 0.075 88% 0.0170 0.0363 4.9375 2.14 75% 0.1579 0.0414 5.46 2.295
0.1 84% 0.0172 0.0359 4.9583 2.19 82.5% 0.0204 0.0361 5.0061 2.225
0.15 72.5% 0.0168 0.0458 4.9517 2.375 84.5% 0.0155 0.035469 4.8994 2.185

Application D1 refers to data generated by the ARX(2, 1) model y(t) =
a1 y(t − 1) + a2 y(t − 2) + b1 u(t − 1) + d(t), where a1 = 0.1, a2 = 0.3, b1 = −0.2,
d(t) ∼ N(0, 0.025), while the exogenous input has a dynamic itself: u(t) =
0.4 u(t − 1) + ε(t), ε(t) ∼ N(0, 1). Each experiment lasts 7 × 10⁴ samples: 1 × 10⁴
samples are used for training and 6 × 10⁴ for test. Each θ̂N is estimated by
considering a non-overlapping batch of N = 400 samples. We modeled the occurrence
of faults by abruptly changing θ = (a1 a2 b1) to a perturbed value θδ = (1 + δ)θ,
with δ ∈ {0.01, 0.05, 0.075, 0.1, 0.15}. In particular, we defined the following fault
profiles:

1. A sequence of intermittent faults with parameter vectors θδ, θ2δ and θ−δ
affecting the system in the intervals [1 × 10⁴; 2.2 × 10⁴], [3.4 × 10⁴; 4.6 × 10⁴]
and [5.8 × 10⁴; 7 × 10⁴];
2. A sequence of two intermittent faults with parameter vector θδ affecting the
system in the intervals [1 × 10⁴; 3 × 10⁴] and [5 × 10⁴; 7 × 10⁴], while the
process is slowly drifting (by 0.1δ at the end of the sampling).

Each experiment is composed of 200 independent runs.
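Under the stated model and fault parameterization, the Application D1 data generator can be sketched as follows (ours; the function name and the single-interval fault handling are illustrative assumptions):

```python
import numpy as np

def generate_d1(T, fault_start, fault_end, delta, seed=0):
    """ARX(2,1) data with an abrupt fault: inside the fault interval the
    parameters theta = (a1, a2, b1) become (1 + delta) * theta."""
    rng = np.random.default_rng(seed)
    theta = np.array([0.1, 0.3, -0.2])
    u = np.zeros(T); y = np.zeros(T)
    for t in range(1, T):
        u[t] = 0.4 * u[t - 1] + rng.normal(0, 1.0)
    for t in range(2, T):
        a1, a2, b1 = theta * (1 + delta) if fault_start <= t < fault_end else theta
        y[t] = a1 * y[t - 1] + a2 * y[t - 2] + b1 * u[t - 1] \
               + rng.normal(0, np.sqrt(0.025))
    return u, y

u, y = generate_d1(T=70_000, fault_start=10_000, fault_end=30_000, delta=0.1)
print(y.shape)   # (70000,)
```

Feeding such data through the batch estimator of Section 2 yields one cloud of θ̂N vectors per operating condition, which is what the clustering algorithm operates on.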
Application D2 refers to data generated by the non-linear model y(t) =
exp(−a1 y(t − 1) − a2 y(t − 2) − b1 u(t − 1) + d(t)). The fault profiles and the
experimental conditions are those of Application D1.
Table 1 shows the simulation results of the fault profiles for Applications D1
and D2. As expected, δs with magnitudes lower than the standard deviation
of the noise cannot be detected. As δ increases (i.e., δ ≥ 0.05), the proposed
evolving clustering algorithm works well. In particular, in Application D1, Profile
2, the algorithm guarantees a very low number of misclassified parameter vectors
(ε < 2%), supporting the idea that the proposed method can be effectively used to
create the fault dictionary and, subsequently, identify the occurring faults (see
Figure 1). Similar comments hold for the very low percentages of outliers
(po < 4.5%). It is worth noting that the algorithm is very reactive in creating
clusters once a fault has occurred (approximately 5 batches of data). As a last

(a) Fault profile 1    (b) Fault profile 2

Fig. 1. PCA projection of the parameter vectors in 2D for Application D1 (δ = 0.3).
Legend: (+) nominal clusters, (·) and (o) faulty clusters, (*) outliers.

comment: results for fault profile 2 show that the algorithm works well also in
slowly drifting, time-variant environments. The simulation results of Application
D2 are particularly interesting, since they show the performance of the algorithm
in the case of a strong model bias. We appreciate that these results are
coherent with those of the linear case when the magnitude δ is above 0.075. The
effect of this strong nonlinearity is an inflation of the covariance matrix of
each state; for this reason, the magnitude δ must be equal to or above 0.075 for
the change to be detected, while in Application D1 the algorithm provides good
performance even when δ is as low as 0.05.

5 Conclusion
The paper presents an evolving mechanism for cognitive fault diagnosis, able to
detect and classify faults by automatically creating the (initially empty) fault
dictionary during the operational phase. The novelty of the proposed approach
resides in the theoretically grounded framework that allows us to work in
the space of linear approximation models even when the system under monitoring
is nonlinear. The experimental section shows the effectiveness of the proposed
solution.

References
1. Ljung, L.: Convergence analysis of parametric identification methods. IEEE Trans-
actions on Automatic Control 23(5), 770–783 (1978)
2. Ljung, L., Caines, P.E.: Asymptotic normality of prediction error estimators for
approximate system models. IEEE Decision and Control 17, 927–932 (1978)
3. Isermann, R.: Fault-diagnosis systems, an introduction from fault detection to fault
tolerance. Springer (2006)
4. Fochem, M., Wischnewski, P., Hofmeier, R.: Quality control systems on the production
line of tape deck chassis using self-organizing feature maps. In: Proc. 1st European
Symp. on Applications of Intelligent Technologies (1997)

5. Naresh, R., Sharma, V., Vashisthcv, M.: An integrated neural fuzzy approach for
fault diagnosis of transformers. IEEE T. Power Del. 23(4), 2017–2024 (2008)
6. Kuo, H.C., Chang, H.K.: A new symbiotic evolution-based fuzzy-neural approach to
fault diagnosis of marine propulsion systems. Engineering Applications of Artificial
Intelligence 17, 919–930 (2004)
7. Joentgen, A., Mikenina, L., Weber, R., Zeugner, A., Zimmermann, H.J.: Auto-
matic fault detection in gearboxes by dynamic fuzzy data analysis. Fuzzy Sets and
Systems 105, 123–132 (1999)
8. Huang, G., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: Theory and appli-
cations. Neurocomputing 70(1-3), 489–501 (2006)
9. Schrauwen, B., Verstraeten, D., Campenhout, J.: An overview of reservoir comput-
ing: Theory, applications and implementations. E. S. on Artificial NNs, 471–482
(2007)
10. Nasraoui, O., Rojas, C.: Robust clustering for tracking noisy evolving data streams.
In: Proc. SIAM Conf. Data Mining, pp. 618–622 (2006)
11. Song, Q., Kasabov, N.: ECM - A Novel On-line, Evolving Clustering Method and
Its Applications. Found. of cognitive science, pp. 631–682. MIT Press (2001)
12. Angelov, P., Filev, D.P., Kasabov, N.: Evolving intelligent systems: Methodology
and Applications, vol. 12. Wiley-IEEE Press (2010)
13. Johnson, R.A., Wichern, D.W.: Applied multivariate statistical analysis. Prentice
Hall, Upper Saddle River (2002)
Improving ANNs Performance on Unbalanced
Data with an AUC-Based Learning Algorithm

Cristiano L. Castro and Antônio P. Braga

Federal University of Lavras
Department of Computer Science, 37200-000, Lavras, MG, Brazil
ccastro@dcc.ufla.br, apbraga@ufmg.br

Abstract. This paper investigates the use of the Area Under the ROC
Curve (AUC) as an alternative criterion for model selection in classification
problems with unbalanced datasets. A novel algorithm, named here
AUCMLP, which incorporates AUC optimization into the Multi-layer
Perceptron (MLP) learning process, is presented. The basic principle
of AUCMLP is the solution of an optimization problem that aims at
ranking quality as well as the separability of the class distributions with
respect to the decision threshold. Preliminary results achieved on real data
point out that our approach is promising, and can lead to better decision
surfaces, especially under more severe unbalance conditions.

Keywords: unbalanced datasets, classification, Area Under the ROC


Curve, parameter estimation criteria.

1 Introduction
Global squared error functions are often used in error-correction learning, since
they simplify the optimization problem, especially for algorithms based on
gradient descent. Many current learning algorithms for Artificial Neural Networks
(ANNs) have inherited this learning principle from Backpropagation [1].
Nevertheless, a global error function may fail to properly represent the true error in
unbalanced classification problems. In such problems, the discrimination
function tends to favor the majority class, since the global error function assumes
uniform losses for all training samples, regardless of the prior probability of the
corresponding class [2]. Model performance on each separate class is often
considered as a final criterion for model assessment, but it is not usually embodied
in the adaptive learning procedure.
Performance assessment and model selection for unbalanced learning problems
have often been accomplished with the aid of the ROC (Receiver Operating
Characteristic) curve [3], which represents the relationship between the true
positive rate (TPrate) and the false positive rate (FPrate) of a family of classifiers
resulting from different output thresholds. A more robust criterion extracted from
the ROC curve is the AUC (Area Under the ROC Curve), a global
metric over all thresholds, regardless of class prior probabilities. Because of that,

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 314–321, 2012.

© Springer-Verlag Berlin Heidelberg 2012
Improving ANNs with an AUC-Based Learning Algorithm 315

the AUC has been applied to ranking quality estimation [4] and also to highly
unbalanced learning problems [5].
Despite being a more robust metric for unbalanced classification problems,
AUC maximization is not guaranteed by global-error-minimization learning
algorithms [6]. To guarantee AUC maximization, learning algorithms are
expected to incorporate AUC optimization into the learning procedure, an
approach that has been adopted by some learning algorithms [7,8,9]. It has also
been shown [6,4] that the RankBoost algorithm, under specific conditions,
computes a function that is equivalent to the AUC.
Although the aforementioned algorithms have been proposed to maximize
ranking in specific domains, such as Information Retrieval, their application
in the context of unbalanced learning has not yet been investigated in the
literature. Since the inherent properties of the AUC metric motivate its use
for model selection in the presence of uneven data, it is natural to suppose
that AUC-optimization-based algorithms could represent an alternative to the
well-known sampling and cost-sensitive learning approaches, which have been
commonly used to increase ANNs' discrimination ability [10,11]. This work
investigates this hypothesis with a novel algorithm for Multi-layer Perceptrons
(MLPs), named here AUCMLP, which embodies AUC optimization in the
learning process.
Our goal is to adopt the AUC as a general cost function in order to improve
MLPs' performance in representing classification functions, particularly those
induced from unbalanced datasets. The main principles behind AUCMLP are
the independence from the prior distributions yielded by the AUC, as well as
the AUC's relationship with the quality of classification ranking. In contrast
with a global error cost function, it may yield better performance on (highly)
unbalanced datasets, especially those with class overlapping.
The paper is organized as follows: Section 2 describes the foundations of our
AUC-based learning approach for MLPs. Section 3 presents the methodology
of the empirical study conducted to evaluate the effectiveness of our approach,
together with a discussion of the results obtained. Finally, conclusions are
provided in Section 4.

2 Area under the ROC Curve


Let D = {(xi, yi)}_{i=1}^{N} be a dataset with N examples belonging to two classes
C1 and C2, where yi ∈ {+1, −1} denotes the label (target output) of each input
xi ∈ Rn. The dataset D consists of N1 examples of the minority (or positive)
class C1 and N2 examples of the majority (or negative) class C2. The union
of the two sets D1 = {(xp, yp)}_{p=1}^{N1} and D2 = {(xq, yq)}_{q=1}^{N2} corresponds to the
dataset D (N = N1 + N2 and D = D1 ∪ D2).
The Area Under the ROC Curve (AUC) of a classifier f can be expressed
by the probability P (f (X+ ) > f (X− )), where f (X+ ) is the random variable
corresponding to the distribution of the classifier outputs for the positive ex-
amples and f (X− ) the one corresponding to the negative examples. A discrete
316 C.L. Castro and A.P. Braga

approximation of the probability function above can be obtained by the Wilcoxon-
Mann-Whitney statistic [12], as described in Equation (1):

AUC(f) = (1/(N1 N2)) Σ_{p=1}^{N1} Σ_{q=1}^{N2} G( f(xp) − f(xq) )     (1)

where G(t) is defined as

G(t) = { 0    if t < 0
       { 0.5  if t = 0     (2)
       { 1    if t > 0

Thus, the AUC can be seen as a performance metric that is based on pairwise
comparisons between the classifications of elements from the two classes.
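Equations (1)-(2) translate directly into a few lines of numpy (a sketch of ours, not the authors' code):

```python
import numpy as np

def auc_wmw(scores_pos, scores_neg):
    """Eq. (1): fraction of (positive, negative) pairs ranked correctly,
    ties counting 1/2, via the step function G of Eq. (2)."""
    diff = scores_pos[:, None] - scores_neg[None, :]   # all N1 x N2 pairs
    g = np.where(diff > 0, 1.0, np.where(diff == 0, 0.5, 0.0))
    return g.mean()

f_pos = np.array([0.9, 0.8, 0.4])
f_neg = np.array([0.7, 0.3, 0.3, 0.1])
print(auc_wmw(f_pos, f_neg))   # 11/12: only the pair (0.4, 0.7) is misranked
```

The O(N1·N2) pairwise form makes explicit why the AUC is insensitive to class priors: each pair contributes equally regardless of how many positives and negatives there are.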

2.1 Cost Function Definition

Since the original expression of the Wilcoxon-Mann-Whitney statistic (1) is
nondifferentiable, a smoothing strategy was proposed in [7], replacing the heaviside
function G(t) (Equation (2)) by the function R(t) defined as

R(t) = { (−(t − κ))^τ   if t < κ,
       { 0              otherwise,     (3)

with 0 < κ ≤ dmax and τ > 1.
Let fp and fq be the MLP outputs (scores) due to the presentation of the
p-th positive example and the q-th negative example, respectively; the difference
between these scores is denoted by dpq. For instance, for an MLP with outputs
in the range −1 ≤ fi ≤ 1, dmax = 2. From expression (3) it is now possible to
obtain a differentiable approximation of the functional (1) [7]:

AUC(w) = (1/(N1 N2)) Σ_{p=1}^{N1} Σ_{q=1}^{N2} R( dpq(w) )     (4)

where w is the collection of all MLP parameters (weights and biases).


The AU C(w) minimization implies on a search for solutions (models) whose
values of dpq are larger or equal to κ for all pairs of examples. As mentioned
earlier, the authors in [7] have argued that κ should assume values larger than
0, so that the scores drawn from the positive class are larger than the negatives
ones. Their strategy aims at ranking by maximizing the AUC without the need
to consider (or assess) a decision threshold.
In order to ensure not only ranking quality, but also that the scores obtained
for each separate class are located on opposite sides of the default threshold
(θz = (fmax + fmin )/2), we suggest a new range for κ in the function R(t):
(dmax /2) < κ ≤ dmax . Let us again consider, for instance, a MLP with continuous
output in the range −1 ≤ fi ≤ 1. In this case, κ should be set larger than 1 so
that the solution achieved from the optimization of (4) leads to fp ≥ θz ∀ xp ∈ D1

and fq < θz ∀ xq ∈ D2, with θz = 0.0. Moreover, θz can also be used to calculate
metrics extracted from the confusion matrix, such as TPr (sensitivity) and TNr
(specificity).
The parameter τ only influences the slope of the function R(t). In empirical
tests, we have observed that better results are achieved with τ = 2 or 3
and 1.2 ≤ κ ≤ 1.5.
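A minimal numpy sketch of the smoothed cost (ours, using κ = 1.2 and τ = 2 as suggested above; note that (−(t − κ))^τ = (κ − t)^τ) illustrates how separated score distributions drive the cost to zero:

```python
import numpy as np

def R(t, kappa=1.2, tau=2):
    """Eq. (3): smoothed replacement for the heaviside G(t)."""
    return np.where(t < kappa, (kappa - t) ** tau, 0.0)

def auc_cost(f_pos, f_neg, kappa=1.2, tau=2):
    """Eq. (4): mean smoothed penalty over all pair differences d_pq = f_p - f_q."""
    d = f_pos[:, None] - f_neg[None, :]
    return R(d, kappa, tau).mean()

well_separated = auc_cost(np.array([0.9, 0.8]), np.array([-0.8, -0.9]))
overlapping = auc_cost(np.array([0.2, -0.1]), np.array([0.1, -0.3]))
print(well_separated, overlapping)   # 0.0 vs. a positive cost
```

With κ > 1 and outputs in [−1, 1], the cost only vanishes when positive scores sit above the default threshold θz = 0 and negative scores below it, which is exactly the separability requirement argued for above.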

2.2 Gradient Descent for AUC Learning


The learning rule is an extension of gradient descent in batch mode, in
which the weight updates are accumulated over an entire presentation of the
training set (an epoch) before being applied. In order to derive this rule, let us
first consider an MLP classifier with n inputs, one hidden layer with h units
(neurons) and an output layer containing a single unit. The network output due to
the presentation of an arbitrary input vector xi is given by the expression

fi = φ(vi) = φ( Σ_{s=0}^{h} w^s φ( Σ_{r=0}^{n} w^{sr} x_i^r ) )     (5)

Let g(w) be the gradient vector associated with the current weight vector w. Each
component of the vector g(w) is given by the partial derivative of AUC(w) with
respect to a network weight w, as described by the expression

∂AUC(w)/∂w = (1/(N1 N2)) Σ_{p=1}^{N1} Σ_{q=1}^{N2} ∂R(dpq(w))/∂w     (6)

where ∂R(dpq(w))/∂w corresponds to the gradient scalar due to the presentation of
the pair of examples xp and xq. For an arbitrary output layer weight w^s, this
term is obtained as follows

∂R(dpq(w))/∂w^s = τ(−fp + fq + κ)^{τ−1} [ −φ′(vp) z_p^s + φ′(vq) z_q^s ]     (7)

with z_i^s = φ(u_i^s) = φ( Σ_r w^{sr} x_i^r ). Similarly, for a hidden layer weight w^{sr}, the
gradient scalar is given by

∂R(dpq(w))/∂w^{sr} = τ(−fp + fq + κ)^{τ−1} [ −φ′(vp) w^s φ′(u_p^s) x_p^r
                                             + φ′(vq) w^s φ′(u_q^s) x_q^r ]     (8)

The weight vector w can then be updated at each iteration in the opposite
direction of the gradient vector, as follows:

Δw = −η g(w)     (9)

w_new = w_old + (1 − ρ) Δw_old + ρ Δw_{old−1}     (10)



Table 1. Characteristics of the datasets

Dataset alias # attr. N1 N2 N1 /(N1 + N2 ) % synt.


Ionosphere iono 34 126 225 0.359 100%
Vehicle (4 vs. all) veh 18 199 647 0.235 200%
WP Breast Cancer wpbc 33 47 151 0.237 200%
Segmentation (1 vs. all) seg 19 30 180 0.143 500%
Euthyroid (1 vs. all) euth 24 238 1762 0.119 600%
Satimage (4 vs. all) sat 36 626 5809 0.097 800%
Vowel (1 vs. all) vow 10 90 900 0.091 900%
Abalone (18 vs. 9) a18-9 08 42 689 0.057 1000%
Glass (6 vs. all) gls6 10 9 205 0.042 1000%
Yeast (5 vs. all) y5 08 51 1433 0.034 1000%
Abalone (19 vs. all) a19 08 32 4145 0.008 1000%

where η is a positive number (learning rate) that controls the size of the update
term applied to the weight vector. The momentum constant 0 ≤ ρ ≤ 1 is
also used in order to speed up convergence.
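To make the pairwise gradient concrete, the sketch below (ours) applies the batch rule with momentum to a plain linear scorer f(x) = w·x instead of the full MLP; the pairwise accumulation of Eq. (6) and the update of Eqs. (9)-(10) are the same, only the inner derivatives simplify. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def pair_grad(w, xp, xq, kappa=1.2, tau=2):
    """Gradient of R(d_pq) w.r.t. w for a linear scorer, d_pq = w.(xp - xq);
    pairs already separated by at least kappa contribute nothing."""
    d = w @ (xp - xq)
    if d >= kappa:
        return np.zeros_like(w)
    return tau * (kappa - d) ** (tau - 1) * -(xp - xq)

def train(X_pos, X_neg, eta=0.05, rho=0.5, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, X_pos.shape[1])
    dw_prev = np.zeros_like(w)
    for _ in range(epochs):
        g = np.zeros_like(w)
        for xp in X_pos:                        # accumulate over all pairs
            for xq in X_neg:
                g += pair_grad(w, xp, xq)
        g /= len(X_pos) * len(X_neg)
        dw = -eta * g                           # Eq. (9)
        w = w + (1 - rho) * dw + rho * dw_prev  # Eq. (10), with momentum rho
        dw_prev = dw
    return w

rng = np.random.default_rng(1)
X_pos = rng.normal(+1.0, 0.5, size=(10, 2))     # toy unbalanced classes
X_neg = rng.normal(-1.0, 0.5, size=(40, 2))
w = train(X_pos, X_neg)
auc = (X_pos @ w > (X_neg @ w)[:, None]).mean()
print(f"pairwise AUC after training: {auc:.2f}")
```

Because the cost sums over positive/negative pairs rather than individual errors, the 4:1 imbalance of the toy data does not bias the update toward the majority class.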

3 Experiments and Results

An empirical study was conducted in order to evaluate the effectiveness of the
proposed algorithm in improving the performance of MLPs. AUCMLP was
compared with known methods from the literature for unbalanced learning: Smote
+ Tomek-Links (SMTTL) [13], Weighted Wilson's Editing (WWE) [11] and
RAMOBoost (RBoost) [14]. An MLP classifier without any strategy to deal with
uneven data was also tested under the same conditions as the other algorithms.
The experiments were performed with 10 datasets from the UCI repository
[15], selected according to the imbalance of their class sizes. The
dataset names are listed in Table 1 along with their characteristics: number
of attributes (# attr.), number of positive examples (N1), number of negative
examples (N2), unbalanced ratio (N1/(N1 + N2)) and percentage of synthetic
data (% synt.) to be generated by a particular sampling procedure.
We followed the same methodology adopted in [16]. Twenty
different cases were generated for each dataset by shuffling the original indexes of
its elements (a random permutation procedure was applied 20 times). Then, each
case was split into training (2/3) and test (1/3) subsets in a stratified manner,
ensuring that each subset had the same imbalance degree as the original dataset.
Hence, there were 20 different training/test cases for each dataset. Accordingly,
the average performance of a particular algorithm was calculated over the 20 runs
(training/test cases) with the following metrics, commonly adopted in imbalanced
learning problems:

– G-mean, defined as √(TPr · TNr), where TPr and TNr are the positive
and negative classes' accuracies, respectively.
– AUC, estimated according to the trapezoidal method described in [3].
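Both metrics can be sketched in a few lines of numpy (ours; the toy labels and scores are illustrative, and ties between scores are not handled):

```python
import numpy as np

def g_mean(y_true, y_pred):
    tpr = np.mean(y_pred[y_true == +1] == +1)    # positive-class accuracy
    tnr = np.mean(y_pred[y_true == -1] == -1)    # negative-class accuracy
    return np.sqrt(tpr * tnr)

def auc_trapezoid(y_true, scores):
    """AUC by trapezoidal integration of the ROC points, sweeping the
    decision threshold from high to low scores."""
    order = np.argsort(-scores)
    y = y_true[order]
    tpr = np.concatenate(([0.0], np.cumsum(y == +1) / np.sum(y_true == +1)))
    fpr = np.concatenate(([0.0], np.cumsum(y == -1) / np.sum(y_true == -1)))
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

y_true = np.array([+1, +1, -1, -1, -1])
gm = g_mean(y_true, np.array([+1, -1, -1, -1, +1]))   # sqrt(0.5 * 2/3)
auc = auc_trapezoid(y_true, np.array([0.9, 0.6, 0.7, 0.2, 0.1]))  # 5/6
print(gm, auc)
```

The geometric mean goes to zero as soon as either class is entirely misclassified, which is why it is preferred over plain accuracy for unbalanced data.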

Table 2. G-mean values (in %) on UCI data sets

Dataset MLP SMTTL WWE RBoost AUCMLP


iono 84.33 ± 4.18 85.20 ± 5.31 85.21 ± 4.60 85.60 ± 5.49 85.15 ± 4.65
wpbc 62.07 ± 16.85 61.33 ± 7.96 61.72 ± 8.38 66.63 ± 8.55 63.39 ± 7.77
veh 95.41 ± 1.97 95.91 ± 1.51 96.00 ± 1.81 97.07 ± 1.14 96.76 ± 1.06
seg 94.54 ± 4.37 95.96 ± 3.71 95.94 ± 2.68 96.53 ± 2.58 96.98 ± 2.51
euth 87.39 ± 3.79 87.29 ± 2.60 88.64 ± 2.35 87.55 ± 1.98 89.34 ± 1.86
sat 69.90 ± 3.68 76.86 ± 2.27 76.88 ± 2.97 77.65 ± 2.31 85.77 ± 0.75
a18-9 70.36 ± 6.85 67.08 ± 8.95 69.51 ± 10.47 70.39 ± 12.11 83.04 ± 6.15
gls6 85.29 ± 14.46 78.13 ± 11.86 81.37 ± 23.15 81.95 ± 13.27 87.26 ± 13.27
y5 42.69 ± 16.48 61.49 ± 9.75 67.80 ± 9.34 62.76 ± 7.97 79.63 ± 5.23
a19 0.00 ± 0.00 21.79 ± 17.22 0.00 ± 0.00 68.14 ± 7.06 75.09 ± 6.90

Table 3. AUC values (in %) on UCI data sets

Dataset MLP SMTTL WWE RBoost AUCMLP


iono 90.29 ± 4.75 91.36 ± 4.85 91.14 ± 4.51 91.47 ± 3.71 89.76 ± 4.05
wpbc 73.84 ± 6.40 71.34 ± 7.15 68.38 ± 6.25 73.52 ± 7.61 73.48 ± 6.25
veh 99.17 ± 0.69 99.46 ± 0.33 98.99 ± 0.68 99.39 ± 1.11 99.29 ± 0.36
seg 98.15 ± 2.19 98.52 ± 2.21 99.14 ± 1.23 98.42 ± 2.14 99.70 ± 0.46
euth 93.04 ± 2.92 94.11 ± 1.51 93.77 ± 1.93 94.60 ± 1.72 95.27 ± 1.19
sat 87.07 ± 2.63 90.90 ± 1.13 91.22 ± 1.40 91.07 ± 2.18 93.58 ± 0.35
a18-9 81.51 ± 6.88 84.70 ± 5.42 84.15 ± 6.13 85.28 ± 9.09 92.35 ± 3.69
gls6 97.35 ± 5.08 97.18 ± 1.26 96.19 ± 5.91 98.38 ± 4.69 98.26 ± 2.87
y5 82.41 ± 5.72 82.89 ± 5.43 82.42 ± 7.77 84.00 ± 6.07 88.00 ± 4.17
a19 68.24 ± 12.71 62.91 ± 12.23 67.44 ± 9.67 81.93 ± 7.15 84.02 ± 5.23

The algorithms were configured following suggestions from the literature. For
RBoost, the numbers of nearest neighbors used to adjust the sampling
probabilities of the minority examples and to generate the synthetic data were set to 5 and
10, respectively; the number of boosting iterations and the scaling coefficient
were set to 20 and 0.3, respectively [14]. For SMTTL, the number of nearest
neighbors was set to 5 [13], while the value 3 was chosen for WWE [11]. Finally,
as suggested in Section 2.1, the AUCMLP cost function parameters κ and τ were
set to 1.2 and 2 in all trials.
Tables 2 and 3 list the G-mean and AUC values achieved by MLP, SMTTL,
WWE, RBoost and AUCMLP. Means and standard deviations were calculated
over the 20 test cases. The best performance achieved on each dataset is
highlighted in bold.
As can be observed from Table 2, AUCMLP achieves better G-mean
performance than the other methods in seven out of ten tested datasets. Although a
statistical test was not carried out to assess the significance of the results, the
means and standard deviations achieved by all methods suggest
that AUCMLP performs better on datasets with larger unbalance
degrees. The performance increase with AUCMLP is more expressive
for the following datasets: euth, sat, a18-9, gls6, y5 and a19. In the particular
case of a19, characterized by a huge imbalance degree (0.008), where both MLP
and WWE were unable to classify positive data (TPr = 0.0), AUCMLP had a
satisfactory performance.
Similar observations can be drawn from Table 3. AUCMLP performed better
than the other algorithms in six out of ten evaluated datasets. It should be noted,
however, that the score gains obtained by AUCMLP on the less unbalanced
datasets (iono, wpbc, veh, seg, euth) are lower than those obtained with sat,
a18-9, y5 and a19. This suggests that AUCMLP can be more effective in
optimizing ROC curves under more severe unbalance conditions. This remark agrees
with the discussion previously conducted in [6]. In that study, the authors
formally showed that algorithms that embody AUC optimization in the learning
process should perform better than overall-error-based algorithms in situations
with a high level of unbalancing and class overlapping. They also argued that in
roughly even conditions the optimization of an overall-error-based cost function
also implies AUC optimization, which could explain the similar
performances obtained with more balanced datasets, such as iono and wpbc.
Finally, average ROC curves were plotted by applying the threshold averaging technique [3] over 20 test cases. Fig. 1 shows the ROC curves obtained for the dataset a19, which has the most severe imbalance (imbalance degree 0.008); it can be observed that AUCMLP generates a better ROC curve than the other algorithms.
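The AUC under discussion has a direct interpretation as the Wilcoxon-Mann-Whitney statistic [7,12]: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch of this pairwise computation (illustrative only, not the authors' implementation):

```python
def auc_wmw(scores_pos, scores_neg):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    positive/negative score pairs ranked correctly (ties count 1/2)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

print(auc_wmw([0.9, 0.8], [0.4, 0.3]))  # perfectly ranked -> 1.0
print(auc_wmw([0.9, 0.4], [0.8, 0.3]))  # one inversion out of 4 pairs -> 0.75
```

Because this statistic depends only on the ranking of scores, optimizing it directly targets the ROC curve rather than a single decision threshold.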

[Figure: ROC curves (TPr vs. FPr) for MLP, SMTTL, WWE, RBoost and AUCMLP.]

Fig. 1. Average ROC Curves for the dataset a19

4 Conclusion

It is well accepted in the literature that the bias induced by scarce and unbalanced learning sets is a direct consequence of minimizing the overall error rate. Changes to this parameter estimation criterion aimed at improving the detection of the underrepresented class have been proposed in different forms. The most common strategies involve assigning unequal costs to the individual errors (the cost-sensitive learning approach), as well as modifying the class probability distributions via data sampling.
A different approach was considered in this paper. Motivated by the theoretical aspects of the AUC, we evaluated the effectiveness of this metric as a criterion for neural model selection in unbalanced classification. Preliminary results suggest the validity of our initial hypothesis about the advantages of optimizing
Improving ANNs with an AUC-Based Learning Algorithm 321

AUC, in contrast with other error-based methods. These results indicate that our algorithm (AUCMLP), which embodies AUC optimization in the learning process, can be used to compensate for the bias imposed by the dominant class, leading to better decision surfaces, especially under severely unbalanced conditions.

References
1. Rumelhart, D.E., McClelland, J.L.: Parallel distributed processing: Explorations
in the microstructure of cognition, vol. 1: Foundations. MIT Press (1986)
2. Lan, J., Hu, M.Y., Patuwo, E., Zhang, G.P.: An investigation of neural network
classifiers with unequal misclassification costs and group sizes. Decis. Support
Syst. 48, 582–591 (2010)
3. Fawcett, T.: An introduction to ROC analysis. Pat. Rec. Lett. 27, 861–874 (2006)
4. Rudin, C., Schapire, R.E.: Margin-based ranking and an equivalence between AdaBoost and RankBoost. J. of Mach. Learn. Research 10, 2193–2232 (2009)
5. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition 30, 1145–1159 (1997)
6. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. In: Advances
in Neural Information Processing Systems 16. MIT Press, Cambridge (2004)
7. Yan, L., Dodier, R.H., Mozer, M., Wolniewicz, R.H.: Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In: ICML 2003: Proceedings of the 20th Int. Conf. on Machine Learning, pp. 848–855 (2003)
8. Joachims, T.: A support vector method for multivariate performance measures. In:
ICML 2005: Proc. of the 22nd Int. Conf. on Machine learning, pp. 377–384 (2005)
9. Herschtal, A., Raskutti, B., Campbell, P.K.: Area under ROC optimization using
a ramp approximation. In: Proc. of 6th Int. Conf. on Data Mining, pp. 1–11 (2006)
10. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. on Knowledge
and Data Engineering 21, 1263–1284 (2009)
11. Khoshgoftaar, T.M., Hulse, J.V., Napolitano, A.: Supervised neural network mod-
eling: An empirical investigation into learning from imbalanced data with labeling
errors. IEEE Trans. on Neural Networks 21, 813–830 (2010)
12. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982)
13. Batista, G., Prati, R., Monard, M.: A study of the behavior of methods for bal-
ancing machine learning training data. SIGKDD Expl. Newsl. 6, 20–29 (2004)
14. Chen, S., He, H., Garcia, E.A.: RAMOBoost: ranked minority oversampling in boosting. IEEE Trans. on Neural Networks 21, 1624–1642 (2010)
15. UCI machine learning repository, http://archive.ics.uci.edu/ml/
16. Wu, G., Chang, E.: KBA: Kernel boundary alignment considering imbalanced data
distribution. IEEE Trans. on Knowl. and Data Eng. 17, 786–795 (2005)
Learning Using Privileged Information
in Prototype Based Models

Shereen Fouad, Peter Tino, Somak Raychaudhury, and Petra Schneider

The University of Birmingham, Birmingham B15 2TT, United Kingdom


{saf942,P.Tino}@cs.bham.ac.uk,
somak@star.sr.bham.ac.uk,
p.schneider@bham.ac.uk
http://www.birmingham.ac.uk

Abstract. In some pattern analysis problems, there exists expert knowledge in addition to the original data involved in the classification process. Most existing approaches ignore such auxiliary (privileged) knowledge. Recently a new learning paradigm, Learning Using Hidden Information, was introduced in the SVM+ framework. This approach is formulated for binary classification and, as is typical for many kernel based methods, can scale unfavorably with the number of training examples. In this contribution we present a novel, more direct methodology, based on a prototype metric learning model, for the incorporation of valuable privileged knowledge. This is done by changing the global metric in the input space, based on distance relations revealed by the privileged information. Our method achieves competitive performance against the SVM+ formulations. We also present a successful application of our method to a large scale multi-class real world problem of galaxy morphology classification.

Keywords: Learning Using Hidden Information (LUHI), Generalized Matrix Learning Vector Quantization (GMLVQ), Information Theoretic Metric Learning (ITML).

1 Introduction

Designing classifiers that incorporate auxiliary 'privileged' knowledge along with the original data set during learning is an important and challenging research issue. Recently, [1] integrated privileged knowledge into the Support Vector Machine (SVM) classifier via a new learning paradigm called Learning Using Hidden Information (LUHI). In the training stage, along with a training input xi ∈ X, a classifier may be given some additional information x∗i ∈ X∗ about xi. Such

Shereen Fouad and Peter Tino are with the School of Computer Science, The
University of Birmingham.

Somak Raychaudhury is with the School of Physics and Astronomy, The University
of Birmingham.

Petra Schneider is with the School for Clinical and Experimental Medicine, The
University of Birmingham.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 322–329, 2012.

© Springer-Verlag Berlin Heidelberg 2012

additional (privileged) information, however, will not be available in the test phase, where labels must be estimated using the trained model for previously
unseen inputs x ∈ X only (without x∗ ). In the SVM context, the additional
information is used to estimate a slack variable model in SVM+. However, SVM
classifiers are inherently constructed to deal with binary classification problems.
In addition, SVM+ training can be computationally expensive (even infeasible
for large-scale data sets)1 .
To address these issues, we propose a completely different approach to learning
with privileged information through metric learning in prototype based models.
Prototype based models lend themselves naturally to multi-class problems, and
can be constructed at a smaller computational cost. In particular, we extend the
recently proposed Generalized Matrix Learning Vector Quantization (GMLVQ)
[4], to the case of (privileged) information available only during the training
phase. The main idea behind our approach is modification of the metric in the
original data space X based on data proximity ‘hints’ obtained from the priv-
ileged information space X ∗ . We also introduce two methods for incorporating
the new metric in X in the context of prototype based classification. We ex-
perimentally study the performance of our general methodology and compare
it with the SVM+ model [1]. In addition, we illustrate its advantage in a large
scale astronomical classification problem.
This paper has the following organization: Section 2 gives the background and
briefly describes previous methods related to this study. Sections 3 and 4 intro-
duce novel approaches for incorporation of privileged knowledge in prototype
based classification. Experimental results are presented in section 5. Section 6
concludes the study by summarizing the key contributions.

2 Background and Related Work


2.1 Learning Using Hidden Information (LUHI)
Learning Using Hidden Information (LUHI) framework [1] aims to improve learn-
ing in the presence of an additional (privileged) information x∗ ∈ X ∗ about
training examples x ∈ X, where the privileged information will not be available
at the test stage.
In a Support Vector Machine (SVM) framework, Vapnik and Vashist [1] proposed the SVM+ algorithm, which takes as input training triplets {(xi, x∗i, yi)}, xi ∈ X, x∗i ∈ X∗, yi ∈ {−1, 1}, i = 1, ..., n, generated according to a fixed (unknown) probability measure P(x, x∗, y). The training triplets are used to estimate two linear functions concurrently: (i) the standard SVM decision function, and (ii) a "correcting function" which models the slack variables based on the
privileged information x∗i . In addition, [1] introduced another SVM approach
for LUHI, called dSVM+. In dSVM+ the space of admissible non-negative correcting functions is constrained to a 1-dimensional space (d-space). Privileged
1. There have been developments in the SVM literature aiming to handle multi-class problems (e.g. [2]) and large data sets (e.g. [3]). However, direct transformation of the LUHI framework to such formulations in a unified single model would be non-trivial.

information x∗i is transformed into so-called deviation (scalar) values di (substituting the privileged information) and the SVM+ method is applied to training
triplets (xi , di , yi ). It has been experimentally verified that classifiers trained
with both privileged information x∗i ∈ X ∗ and original data xi ∈ X can improve
over classifiers fitted on xi ∈ X only [1].

2.2 Prototype Learning Algorithms

Learning Vector Quantization (LVQ) constitutes a family of supervised learning multi-class classification algorithms. Consider training data (xi, yi) ∈ Rm × {1, ..., c}, i = 1, 2, ..., n, with m denoting the data dimensionality and c the number of distinct classes. An LVQ network consists of a number of prototypes wi ∈ Rm, i = 1, 2, ..., L, with class labels c(wi) ∈ {1, ..., c}. Given a distance measure d(x, w) in Rm, classification is based on a winner-takes-all scheme: a data point x is assigned the label c(wi) of the closest prototype i, i.e. d(x, wi) < d(x, wj), ∀j ≠ i.
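The winner-takes-all rule can be sketched as follows (illustrative code; names are ours, not the authors'):

```python
def lvq_classify(x, prototypes, labels, dist):
    """Winner-takes-all: assign x the label of the closest prototype."""
    best = min(range(len(prototypes)), key=lambda i: dist(x, prototypes[i]))
    return labels[best]

def sq_euclidean(x, w):
    """Squared Euclidean distance d(x, w) = (x - w)^T (x - w)."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

protos = [[0.0, 0.0], [1.0, 1.0]]
proto_labels = [0, 1]
print(lvq_classify([0.9, 0.8], protos, proto_labels, sq_euclidean))  # -> 1
```

The distance function is a parameter here precisely because the methods below replace the plain Euclidean distance with an adaptive one.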
The original LVQ scheme [5] introduced by Kohonen uses Hebbian learning to adapt the prototypes to the training data. Researchers have since proposed numerous modifications of the basic learning scheme. For instance, in Generalized LVQ (GLVQ) [6], training is derived as the minimization of an explicit cost function through steepest descent. However, GLVQ suffers from the problem that classification is based on a squared Euclidean distance d(x, w) = (x − w)^T (x − w), which can be problematic for higher dimensional, heterogeneous data sets, where different scalings and correlations of dimensions can be expected. Generalized Matrix LVQ (GMLVQ) [4] directly addresses this issue by allowing the input space metric to be adaptable. The algorithm uses a generalized form of the distance, obtained by extending the metric with a full matrix of adaptive weight factors:

dΛ(x, w) = (x − w)^T Λ (x − w).   (1)

where Λ is an (m × m) matrix restricted to positive-definite forms to guarantee metricity. This can be achieved by substituting Λ = Ω^T Ω, where Ω ∈ Rm×m is full rank. Furthermore, Λ needs to be normalized after each learning step to prevent the algorithm from degenerating. Here, we fix the trace, Λ11 + · · · + Λmm = 1, so that the sum of eigenvalues is constant [4]. GMLVQ training is derived as the minimization of a cost function which adapts the metric parameters Ωij together with the prototypes w by means of stochastic gradient descent.
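The adaptive distance of eq. (1) and the trace normalization can be sketched as follows (illustrative only; the Ω values are arbitrary, not learnt):

```python
import numpy as np

def gmlvq_distance(x, w, omega):
    """Squared adaptive distance of eq. (1): (x - w)^T Omega^T Omega (x - w)."""
    d = omega @ (x - w)
    return float(d @ d)

def normalize(omega):
    """Rescale Omega so that the trace of Lambda = Omega^T Omega equals 1,
    i.e. the eigenvalues of Lambda sum to a constant, as in the text."""
    lam = omega.T @ omega
    return omega / np.sqrt(np.trace(lam))

# Arbitrary full-rank Omega for illustration
omega = normalize(np.array([[2.0, 0.0],
                            [0.0, 1.0]]))
lam = omega.T @ omega
print(round(float(np.trace(lam)), 6))  # -> 1.0
```

Writing Λ as Ω^T Ω guarantees positive (semi-)definiteness by construction, which is why the gradient descent adapts Ω rather than Λ directly.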

2.3 Information Theoretic Metric Learning (ITML)

Over the last few years, there has been considerable research on Distance Metric
Learning algorithms which aim to optimize a target distance for a given set
of data points under various types of constraints (given in the form of side
information), e.g. [7],[8]. In Information Theoretic Metric Learning (ITML) [8],
given a set of n points {x1, ..., xn}, xi ∈ Rm, one learns a positive definite matrix A defining the squared metric dA(xi, xj) = (xi − xj)^T A (xi − xj) (see eq. (1)),

subject to categorical pairwise similarity information on the data points that should be preserved. The closeness relation between the original Euclidean metric and the new one is measured through the K-L divergence between the multivariate zero-mean Gaussians having I2 and A as precision matrices.

3 Information Theoretic Approach for LUHI in GMLVQ

In this section we present an approach in which the privileged information in X∗ is used to describe 'closeness' relations between training points in a categorical manner only, e.g. the points are 'close' or 'far apart'. This categorical information is then imposed on the original space X through the framework of Information Theoretic Metric Learning (ITML) [8] described in Section 2.3.
Consider training data (xi , yi ) ∈ Rm , i = 1, 2, ..., n, as in section 2.2. Assume
that additional information x∗i ∈ X ∗ is given for r ≤ n training examples xi ∈ X,
i = 1, 2, ..., r. Assume further that we are given a global metric tensor M on
space X defining the squared distance,

dM (xi , xj ) = (xi − xj )T M (xi − xj ), xi , xj ∈ X.

In this formulation we would like to modify dM so that the distances under the
’new’ metric dC on X are enlarged and shrunk for pairs of points that have
‘dissimilar’ and ‘similar’ privileged information, respectively.
In the ITML approach, two sets of pairs of data points from X are formed, corresponding to 'similar' and 'dissimilar' data items: S+ = {(xi, xj) | xi and xj are judged to be 'similar'} and S− = {(xi, xj) | xi and xj are judged to be 'dissimilar'}. While in ITML similarity/dissimilarity is decided purely based on class labels, we use proximity information in the privileged space X∗ as well.
In particular, assume we are given a global metric tensor M ∗ on X ∗ giving the
squared distance,

dM ∗ (x∗i , x∗j ) = (x∗i − x∗j )T M ∗ (x∗i − x∗j ), x∗i , x∗j ∈ X ∗ .

We calculate all pairwise distances dM∗(x∗i, x∗j), 1 ≤ i < j ≤ r. These distances are then sorted in ascending order and, given a lower percentile parameter a∗ > 0, a distance threshold l∗ is found such that a∗ percent of the pairwise distances dM∗(x∗i, x∗j) are smaller than l∗. Analogously, given an upper percentile parameter b∗ > a∗, a distance threshold u∗ > l∗ is found such that (1 − b∗) percent of the pairwise distances dM∗(x∗i, x∗j) are greater than u∗. The sets S+ and S− are then constructed as follows3:

– If dM∗(x∗i, x∗j) ≤ u∗ and c(xi) = c(xj) (same label), then (xi, xj) ∈ S+.
– If dM∗(x∗i, x∗j) ≥ l∗ and c(xi) ≠ c(xj) (different label), then (xi, xj) ∈ S−.
2. The identity matrix I stands for the metric tensor of the Euclidean metric.
3. It is not necessary for all training points in X to be included in S+ or S−.

In ITML the 'similarity' between two metrics dC and dM on X ⊂ Rm, given by (m × m) metric tensors C and M, respectively, is measured through the Bregman (Burg) divergence, defined over the cone of positive definite matrices as [8]:

DBurg(C, M) = tr(C M^{-1}) − log det(C M^{-1}) − m,   (2)

where tr denotes the trace operator. Given distance thresholds 0 < l < u on X, the new metric tensor C is found by minimizing the Bregman divergence (2) subject to the constraints: (xi, xj) ∈ S+ ⇒ dC(xi, xj) ≤ l and (xi, xj) ∈ S− ⇒ dC(xi, xj) ≥ u. In ITML the trade-off between minimizing the divergence and satisfying the constraints is controlled by a regularization parameter set through cross-validation. For more details on ITML we refer the interested reader to [8].
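The Burg divergence of eq. (2) can be evaluated directly; a small illustrative sketch (not part of the authors' code):

```python
import numpy as np

def burg_divergence(C, M):
    """D_Burg(C, M) = tr(C M^-1) - log det(C M^-1) - m, defined over the
    cone of positive definite m x m matrices (eq. (2))."""
    m = C.shape[0]
    cm_inv = C @ np.linalg.inv(M)
    sign, logdet = np.linalg.slogdet(cm_inv)  # stable log-determinant
    return float(np.trace(cm_inv) - logdet - m)

I = np.eye(3)
print(burg_divergence(I, I))          # identical metrics -> 0.0
print(burg_divergence(2 * I, I) > 0)  # any deviation is penalized -> True
```

As expected for a divergence, it vanishes only when C = M, so the minimization keeps the learnt metric as close as possible to the original one while honouring the S+/S− constraints.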

4 Incorporating Privileged Information into GMLVQ


We propose two approaches for incorporation of the learnt metric dC into the
GMLVQ operating on X.
1. GMLVQ in transformed basis: The metric tensor C is found in the parametrized form C = U^T U. Then for any x ∈ X, we have

||x||_C^2 = x^T C x = x^T U^T U x = x̃^T x̃ = ||x̃||_2^2,

where (x̃ = U x) is the image of x under the basis transformation U . The layout
of the transformed points x̃i = U xi now reflects the ‘similarity/dissimilarity’ in-
formation from X ∗ . Data points with ‘similar’ privileged data representation will
now in general be closer than in the original data layout. Likewise, data points
with more distant privileged representations will tend to move further apart.
The GMLVQ algorithm (in its original form) is now applied to the transformed
data {(x̃1 , y1 ), ...., (x̃n , yn )}.
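One way to realize the factorization C = U^T U needed above is a Cholesky decomposition (our choice for illustration; the paper does not prescribe a particular factorization, and the matrix C below is hypothetical):

```python
import numpy as np

# Hypothetical learnt metric tensor C (positive definite)
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# Factor C = U^T U; Cholesky gives C = L L^T, so take U = L^T
U = np.linalg.cholesky(C).T

x = np.array([1.0, -1.0])
x_tilde = U @ x  # image of x under the basis transformation

# ||x||_C^2 equals the squared Euclidean norm of the transformed point
assert np.isclose(x @ C @ x, x_tilde @ x_tilde)
print(float(x @ C @ x))  # -> 2.0
```

After this transformation, standard GMLVQ (with its ordinary distance) can be run on the points x̃i unchanged, which is exactly the idea of method 1.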
2. Extended GMLVQ: GMLVQ is first run on the original training set with-
out privileged information, yielding a global metric dM (given by metric tensor
M ) and a set of prototypes wj ∈ Rm , j = 1, 2, ..., L. Then, using the privileged
information, the ITML technique finds metric dC on X that will replace the
metric dM originally found by GMLVQ. The metric dC now incorporates the
privileged information. Finally, GMLVQ is run once more with metric tensor C
fixed to modify the prototype positions4 .

5 Experiments and Evaluations


The effectiveness of the proposed formulation, referred to as ITML-GMLVQ5, was evaluated in terms of the classification accuracy obtained against the state
4. The prototype positions will in general change, since the metric has been changed from dM to dC.
5. We developed the ITML-GMLVQ code using the ITML Matlab code available from http://www.cs.utexas.edu/users/pjain/itml/

of the art, GMLVQ (trained without privileged information), used as a baseline in two experiments. In the first experiment, we compare our approach with SVM+ on a binary classification task used in [1]. In the second experiment we apply our approach to a large scale multi-class real world problem of galaxy morphology classification. In our approach we used one prototype per class. The hyperparameters of all models were obtained by cross-validation on the training set.

5.1 Comparison with SVM and SVM+

The data set used in this section [1] represents binary classification of digit images, for which privileged information is available in the form of 'poetic descriptions' of the images [1]. We followed the experimental settings used in [1].
Original space X: Training inputs consist of the first 50 examples of digits '5' and '8' from the MNIST6 training data (100 data points); the test data has 1,866 samples of digits '5' and '8' from the MNIST test data.
Privileged space X∗: Poetic descriptions of the first 50 examples of digits '5' and '8'. A poetic description captures what language experts saw in a digit image, interpreted in their own words in the form of a poem. An example of the poetic description for the first image of '5' is given in [1]. The poetic descriptions were then translated into 21-dimensional feature vectors7.
As in [1], we used training sets of increasing size 40, 50, ..., 100 (each training set containing the same number of digits '5' and '8'). The training subsets were sub-sampled randomly from the original 100 training inputs 5 times. Fig. 1a shows the mean8 number of misclassified points as a function of training set size. The results show that ITML-GMLVQ outperforms the standard GMLVQ, with slightly better performance for the 'transformed basis' over the 'extended GMLVQ' technique. A comparison of ITML-GMLVQ with the baseline SVM (trained without privileged information) and SVM+ is presented in Fig. 1b. ITML-GMLVQ achieves relative performance improvements of 14%, 6%, and 2% over SVM, SVM+, and dSVM+, respectively.

5.2 Galaxy Morphological Classification Using Full Spectra as Privileged Information

Morphological galaxy classification aims to classify galaxies based on their structure and appearance (e.g. [9]). Most approaches rely on coarse-grained information, e.g. galaxy photometric data, that can be obtained relatively 'cheaply', ignoring possible detailed spectroscopic information that is more expensive to
6. Data images were extracted from http://yann.lecun.com/exdb/mnist/ and were rescaled from 28×28 to 10×10 pixels.
7. http://www.nec-labs.com/research/machine/ml website/department/software/learning-with-teacher/
8. Standard deviations across training set resamplings were generally very small (less than 1% of the mean value), never exceeding 5% of the reported mean values.

[Figure: number of test errors (out of 1,866 test points) vs. training set size (40-90); panel (a) compares GMLVQ, GMLVQ-ITML (transformed basis) and GMLVQ-ITML (extended GMLVQ); panel (b) compares SVM, SVM+, dSVM+ and GMLVQ-ITML (transformed basis).]

Fig. 1. Mean number of misclassifications obtained by (a) GMLVQ based methods (b) SVM based algorithms ([1]) and the ITML-GMLVQ on the MNIST data

Table 1. Mean misclassification rates (standard deviations across 10 runs are reported in brackets) in the galaxy morphological classification

Learning algorithm   Metric incorporation method   Misclassification rate
GMLVQ                N/A                           0.22 (0.005)
ITML-GMLVQ           GMLVQ in transformed basis    0.11 (0.01)
ITML-GMLVQ           extended GMLVQ                0.106 (0.003)

obtain, but is available for a number of galaxies. When classifying a new galaxy, full spectral 'privileged' information will typically not be available.
Our dataset contained 20,000 galaxies, extracted from the Galaxy Zoo project catalog [10] (galaxy IDs and their labels), characterized by 13 photometric features (in X) extracted, based on the galaxy IDs, from the SDSS DR7 data catalogues9. In addition, 8 privileged spectral features (in X∗) were extracted from the MPA-JHU DR7 release of spectrum measurements10. On a set of this size, we found it infeasible to run extensive sets of experiments using the SVM+ based approaches. Galaxies were classified into three morphological classes: Elliptical, Spiral, and Irregular. We report in Table 1 the means and standard deviations of the performance measures across 10 experimental runs. In each run the galaxy set was randomly split into a training set (75%) and a test set (25%). Inclusion of the spectral privileged information in the training phase via the ITML-GMLVQ model reduces the misclassification rate, even though in the test phase the models are fed with the original features only. In general, ITML-GMLVQ achieves an average relative improvement over GMLVQ of about 50%, with slightly better performance for 'extended GMLVQ' over the 'transformed basis' technique.

6 Conclusion
We have introduced a novel methodology, based on Information Theoretic Ap-
proach to metric adaptation [8], for learning with privileged information in the
9. http://cas.sdss.org/astro/en/tools/crossid/upload.asp
10. http://www.mpa-garching.mpg.de/SDSS/DR7/

prototype based model GMLVQ [4]. The privileged information is incorporated into the model operating on the original space X by changing the global metric in X, based on a categorical assessment of distances in X∗. Furthermore, we introduced two scenarios for incorporating the newly learnt metric on X into the prototype based modeling.
We verified our methodology in two experiments. Using the same data and experimental settings employed for the SVM+ model for LUHI [1], we demonstrated comparable (or slightly better) performance of our method. On a large scale galaxy morphological classification problem, the privileged information takes the form of costly-to-obtain full galaxy spectra. Such information is vital for galaxy characterization. This is a situation where high quality privileged information can boost classifier performance substantially.

References
1. Vapnik, V., Vashist, A.: A New Learning Paradigm: Learning Using Privileged
Information. In: Neural Networks (NNs), vol. 22(5-6), pp. 544–555. Elsevier Ltd.
(2009)
2. Cervantes, J., Li, X., Yu, W.: Multi-Class SVM for Large Data Sets Consider-
ing Models of Classes Distribution. In: International Conference on Data Mining
(DMIN), pp. 30–35. CSREA Press, Las Vegas (2008)
3. Cervantes, J., Li, X., Yu, W., Li, K.: Support Vector Machine Classification For
Large Data Sets Via Minimum Enclosing Ball Clustering. In: Neural Networks
(NNs): Algorithms and Applications, 4th International Symposium on Neural Net-
works 2008, vol. 71(4-6), pp. 611–619. Elsevier Science Publishers, Amsterdam
(2008)
4. Biehl, M., Hammer, B., Schneider, P., Villmann, T.: Metric Learning for Prototype-
Based Classification. In: Innovations in Neural Information Paradigms and Appli-
cations, vol. 247, pp. 183–199. Springer (2009)
5. Kohonen, T.: Learning Vector Quantization for Pattern Recognition. Technical re-
port, No (TKK-F-A601), Helsinki University of Technology. Espoo, Finland (1986)
6. Sato, A.S., Yamada, K.: Generalized Learning Vector Quantization. In: Touretzky,
D., Leen, T. (eds.) Advances in Neural Information Processing Systems (ANIPS),
vol. 7, pp. 423–429. MIT Press (1995)
7. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning with
Application to Clustering with Side-Information. In: Neural Information Processing
Systems, vol. 15, pp. 505–512. MIT Press (2002)
8. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-Theoretic Met-
ric Learning. In: Proceedings of the 24th International Conference on Machine
Learning, ICML 2007, pp. 209–216. ACM Press, New York (2007)
9. Wijesinghe, D.B., Hopkins, A.M., Kelly, B.C., Welikala, N., Connolly, A.J.: Mor-
phological Classification of Galaxies and Its Relation to Physical Properties.
Monthly Notices of the Royal Astronomical Society (MNRAS) 404(4), 2077–2086
(2010)
10. Lintott, C., Schawinski, K., Slosar, A., Land, K., Bamford, S., Thomas, D., Rad-
dick, M., Nichol, B., Szalay, A., Andreescu, D.: Galaxy Zoo: Morphologies Derived
From Visual Inspection of Galaxies From The Sloan Digital Sky Survey. Monthly
Notices of the Royal Astronomical Society 389(3), 1179–1189 (2008). ISSN 0035-8711
A Sparse Support Vector Machine Classifier
with Nonparametric Discriminants

Naimul Mefraz Khan1, Riadh Ksantini2, Imran Shafiq Ahmad2, and Ling Guan1

1 Department of Electrical and Computer Engineering, Ryerson University
2 School of Computer Science, University of Windsor

Abstract. This paper introduces a novel Sparse Support Vector Machine model with Kernel Nonparametric Discriminants (SSVMKND) which combines data distribution information from two classifiers, namely the Kernel Support Vector Machine (KSVM) and the Kernel Nonparametric Discriminant (KND). It is a convex quadratic optimization problem with one global solution, so it can be estimated efficiently with the help of numerical methods. It can also be reduced to the classical KSVM model, so existing SVM programs can be used for easy implementation. We show that our method provides a sparse solution through a Bayesian interpretation. This sparsity can be exploited by existing sparse classification algorithms to obtain better computational efficiency. Experimental results on real-world datasets and face recognition applications show that the proposed SSVMKND model improves classification accuracy over other classifiers and also provides a sparser solution.

Keywords: Nonparametric Discriminant Analysis, Support Vector Machines, Sparsity, Gaussian Prior, kernel, face recognition.

1 Introduction
The Kernel Nonparametric Discriminant (KND) [4,16] provides improvements over the Kernel Fisher Discriminant Analysis (KFD) [11] by relaxing the normality assumption [4]. KND measures the between-class scatter matrix on a local basis in the neighborhood of the decision boundary in the feature space. This is based on the observation that the normal vectors on the decision boundary are the most informative for discrimination [9]. We can consider KND a classifier based on the "near-global" characteristics of the data, realized by the κ-nearest neighbors of each data point. Although KND dispenses with the underlying assumptions of KFD and results in better classification performance, it is not always easy to find an appropriate choice of κ-NNs on the decision boundary for all data points to obtain the best discrimination.
On the other hand, the Kernel Support Vector Machine (KSVM) is based
on the idea of maximizing the margin or degree of separation in the training
data. KSVM tries to find the optimal decision hyperplane using support vectors,
which are the training samples that approximate the hyperplane and are the

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 330–338, 2012.

© Springer-Verlag Berlin Heidelberg 2012

most difficult patterns to classify [14]. In other words, they consist of those data
points which are closest to the optimal hyperplane. So it can be said that the
KSVM solution is based on the “local” variations of the training data. However,
KSVM does not take into consideration the “near-global” properties of the class
distribution on which the KND is based.
In this paper, we propose a novel SSVMKND model which combines the KND and KSVM methods. In this way, a decision boundary is obtained which reflects both the near-global characteristics of the training data in feature space (realized by the KND) and its local properties (realized by the local margin concept of the KSVM). The proposed method provides the following significant advantages over our recently developed models [7,8]:
– Unlike the methods in [7,8], our proposed model forms a convex optimiza-
tion problem because the final matrix used to modify the objective function
is positive-definite. As a result, the method generates one global optimum
solution and existing numerical methods can be used to solve this problem
easily and efficiently.
– We provide a probabilistic viewpoint of the proposed method to show that
the optimization problem of SSVMKND can be formulated to be represented
by a sparsity-promoting Gaussian prior [13]. On the other hand, the matrices
in the objective functions in [7,8] are not positive-definite and cannot form
Gaussian priors.
Moreover, the method in [8] uses KFD [11], while the proposed method is a
combination of KSVM and KND. Since KND relaxes the normality assumption,
the proposed method is more robust. We also show that our method is a variation
of the KSVM optimization problem, so that existing SVM implementations can
be used. The experimental results also verify these claims. Finally, we provide
an application of our method for face recognition, which has been one of the
most challenging tasks in the pattern recognition area.

2 KSVM and KND


Let X_1 = {x_i}_{i=1}^{N_1} and X_2 = {x_i}_{i=N_1+1}^{N_1+N_2} be two different classes of data samples constituting an input space of N = N_1 + N_2 samples, and let the associated tags be represented by T = {t_i}_{i=1}^{N}, where t_i ∈ {0, 1}, ∀i = 1, 2, . . . , N. Since real-life data has inherent non-linearity, KSVM tries to map the data samples to a higher-dimensional feature space F, where linear classification might be achieved. Let the two classes be mapped to the higher-dimensional feature classes F_1 = {Φ(x_i)}_{i=1}^{N_1} and F_2 = {Φ(x_i)}_{i=N_1+1}^{N} by the function Φ. Our target is to learn the weight vector w = (w_0, w_1, . . . , w_N) for functions of the form y(x; w) = Σ_{i=1}^{N} f_i^x w_i + w_0 = Φ^T(x)w + w_0, where Φ(x) = (f_1^x, f_2^x, . . . , f_N^x) : X → F describes the non-linear mapping from the input space to the feature space for the input variable x [3]. The kernel trick [14] is used when the dimension of F is very high; a kernel function K calculates the inner products of the higher-dimensional data samples: K(x_i, x_j) = <Φ(x_i), Φ(x_j)>, ∀i, j ∈ {1, 2, . . . , N}.
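As a minimal illustration of this kernel trick (our own sketch, not from the paper; plain NumPy, with an assumed width parameter `sigma`), the Gram matrix of the Gaussian RBF kernel used later in the experiments can be computed without ever forming Φ explicitly:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_matrix(X, sigma=1.0)
```

Any learner that only needs inner products in F can be trained directly from K.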
332 N.M. Khan et al.

In the feature space, KSVM tries to find the optimal decision hyperplane, which has the largest minimal distance from any of the samples. To tackle the cases
of misclassification even in the higher dimensional space, the margin constraint
is slacked and a penalty factor is introduced in the objective function to control
the amount of slack. The KSVM optimization problem is:

min_{w≠0, w_0}  (1/2) w^T w + C Σ_{i=1}^{N} max(0, 1 − t_i(Φ^T(x_i)w + w_0)),   (1)

Here, max(0, 1 − t_i(Φ^T(x_i)w + w_0)) is the hinge loss function [14]. The loss factor for misclassified samples is controlled by C.
Only a fraction of the total samples (Support Vectors or SVs) contribute to the
solution of KSVM [14]. Therefore, KSVM considers only those data points which
are close to the decision hyperplane. In other words, KSVM only considers the
local variations in data samples. The overall distribution of the training samples
is not taken into consideration. Incorporating some kind of global distribution
(e.g. results from classifiers like KND) can provide better classification.
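The role of the hinge loss in this locality can be seen directly: samples classified correctly with a margin contribute zero loss. A small sketch (illustrative only; it uses the conventional ±1 label encoding rather than the {0, 1} tags above):

```python
import numpy as np

def hinge_losses(scores, labels):
    # labels in {-1, +1}; loss is zero for confidently correct points,
    # so only samples near or beyond the margin (the SVs) contribute.
    return np.maximum(0.0, 1.0 - labels * scores)

scores = np.array([2.5, 1.0, 0.3, -0.4])
labels = np.array([1, 1, 1, -1])
losses = hinge_losses(scores, labels)   # approx. [0, 0, 0.7, 0.6]
```

Only the last two (borderline or misclassified) points would influence the solution.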
In [4], the Nonparametric Discriminant Analysis (NDA) is proposed to remove
the normality assumption in the Linear Discriminant Analysis (LDA). The NDA
can be extended to the feature space F based on the ideas developed in extending
the LDA to the KFD [11]. We call this the Kernel Nonparametric Discriminant
(KND).
Instead of calculating the simple mean vectors, the nearest neighbor mean vec-
tors are calculated to formulate the between-class scatter matrix ∇ of the KND.
The motivation behind KND is the observation that essentially the nearest neigh-
bors represent the classification structure in the best way because the nearest
neighbor mean vectors represent the direction of the gradients of the respec-
tive class density functions in the feature space [4]. This way, the between-class
scatter matrix ∇ preserves the classification structure.
The KND does not make any modifications to the within-class scatter matrix
Δ, so it is similar to the KFD. With these definitions of ∇ and Δ, the KND
method proceeds the same way as other LDA-based methods i.e., by computing
the eigenvectors and eigenvalues of (Δ+βI)−1 ∇, and considering the eigenvector
corresponding to the largest eigenvalue to form the optimal decision hyperplane.
βI is added before inversion of Δ to tackle the small sample size problem [11].
The most significant advantage of the KND over the KFD is the removal of
the normality assumption. Because of this, KND is a robust classifier which can
perform better in case of real-life datasets. However, finding the optimum number
of nearest neighbors is not an easy task. Also, KND considers the near-global
variations in the data distribution by considering the k-nearest neighbors. Unlike
the KSVM, the points that are crucial to classify (the local variations) are not
given any special consideration. This can result in degradation of performance.
SSVMKND Classifier 333

3 The SSVMKND Method

In our proposed SSVMKND method, we try to utilize both the local and the
near-global variational information obtained from KSVM and KND. Our objec-
tive is to keep the method close to the KSVM approach as much as possible,
since the KSVM optimization approach is more robust than the discriminant
approach of KND in the sense that it is less sensitive to data distribution or the
small sample size problem. We modify the KSVM optimization problem (Equa-
tion (1)) to incorporate the scatter matrices from the KND. Hence, our method
can be described by the following convex optimization problem:

min_{w≠0, w_0}  (1/2) w^T (ηΔ(∇ + βI)^{-1}Δ) w + (1/2) w^T w + C Σ_{i=1}^{N} max(0, 1 − t_i(Φ^T(x_i)w + w_0)).   (2)

Here, the first term is added to the optimization problem to incorporate the KND
scatter matrices. The coefficient ηΔ(∇ + βI)^{-1}Δ changes the orientation of the weight vector w by incorporating the near-global variational information obtained by the KND. The parameter η is the key control parameter for finding the optimal weight vector orientation; η can take values from 0 to ∞. The second term (1/2) w^T w is the traditional KSVM term that retains the local variational information.
A compact form of the problem is:

min_{w≠0, w_0}  (1/2) w^T Θ w + C Σ_{i=1}^{N} max(0, 1 − t_i(Φ^T(x_i)w + w_0)),   (3)

where Θ = ηΔ(∇ + βI)^{-1}Δ + I. Problem (3) is a convex quadratic optimization problem. This is because the matrices Δ and ∇ are positive-definite by definition (being the within-class and between-class scatter matrices). Moreover, the inverse of a positive-definite matrix is positive-definite [6], so (∇ + βI)^{-1} is positive-definite. We also know that if two matrices A and B are positive-definite, then ABA is positive-definite [6]. Hence, the term Θ = ηΔ(∇ + βI)^{-1}Δ + I in Equation (3) is positive-definite. This clearly makes SSVMKND a convex optimization problem with only one global optimum solution. This is a significant advantage, since numerical methods can be used to solve our problem easily and efficiently while retaining the advantages of both KND and KSVM.
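This positive-definiteness argument is easy to check numerically. The sketch below is our own illustration, with random positive-definite stand-ins for the scatter matrices Δ and ∇ and assumed values for η and β; since Θ − I is positive-definite, every eigenvalue of Θ exceeds 1:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
A = rng.standard_normal((m, m))
Delta = A @ A.T + np.eye(m)     # stand-in within-class scatter (positive-definite)
B = rng.standard_normal((m, m))
Nabla = B @ B.T + np.eye(m)     # stand-in between-class scatter (positive-definite)
eta, beta = 0.5, 0.01           # assumed values of the control parameters

Theta = eta * Delta @ np.linalg.inv(Nabla + beta * np.eye(m)) @ Delta + np.eye(m)
eigvals = np.linalg.eigvalsh((Theta + Theta.T) / 2.0)  # symmetrize for roundoff
assert eigvals.min() > 0        # Theta is positive-definite, so (3) is convex
```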

3.1 Implementing the Model

Since our optimization problem is similar to the KSVM optimization problem, we can solve it using Lagrange multipliers [14]. However, obtaining the SSVMKND solution this way would require an entirely new implementation to test this method. The following lemma gives us an easier alternative for implementing this method [15]:

Lemma 1. The SSVMKND method formulation is equivalent to:

min_{ŵ≠0, w_0}  (1/2) ŵ^T ŵ + C Σ_{i=1}^{N} max(0, 1 − t_i(Φ̂^T(x_i)ŵ + w_0)),   (4)

where ŵ = Θ^{1/2} w, Φ̂(x_i) = Θ^{-1/2} Φ(x_i) and Θ = ηΔ(∇ + βI)^{-1}Δ + I.

Proof. Substituting the terms for ŵ, Φ̂(x_i) and Θ into Equation (4), we get the original SSVMKND problem (Equation (3)).
This essentially means that we can use existing SVM implementations, provided we can calculate the terms Θ^{1/2} and Θ^{-1/2}. This can easily be done with the help of an eigenvalue decomposition [6].
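A sketch of that step (our illustration, not the authors' implementation): for a symmetric positive-definite Θ, both Θ^{1/2} and Θ^{-1/2} follow from one eigendecomposition, after which the transformed features Φ̂ can be fed to any off-the-shelf SVM:

```python
import numpy as np

def matrix_power_sym(Theta, p):
    """Theta**p for a symmetric positive-definite Theta via eigendecomposition."""
    w, V = np.linalg.eigh(Theta)
    return (V * w**p) @ V.T          # V diag(w**p) V^T

Theta = np.array([[2.0, 0.5],
                  [0.5, 1.0]])       # toy positive-definite Theta
half = matrix_power_sym(Theta, 0.5)      # Theta^{1/2}
inv_half = matrix_power_sym(Theta, -0.5) # Theta^{-1/2}
# Transformed features phi_hat = inv_half @ phi; then w_hat = half @ w
# turns the standard SVM problem (4) back into problem (3).
```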

3.2 Probabilistic Viewpoint and Sparsity


We will convert Equation (3) into a log posterior probability term with respect to
the parameters to be optimized (w and w0 ) [13]. The SSVMKND optimization
problem can then be converted into a Maximum a Posteriori Probability (MAP)
estimation problem. The first term in Equation (3) can be treated as a Gaussian prior with probability distribution P(w) ∼ exp(−(1/2) w^T Θ w). In the second term, we are only interested in the components involving w and w_0. But w_0, being an offset from the origin, has a flat prior and can be ignored for simplicity. Hence, the variable part of the second term essentially reduces to Φ^T(x_i)w. To easily
obtain the Gaussian prior, for an input x, we denote this as a function ζ(x) = Φ^T(x)w. Similar to the components of w, ζ(x) has a joint Gaussian distribution with covariances <ζ(x), ζ(x′)> = <(Φ^T(x)w), (w^T Φ(x′))> = <Φ(x), Φ(x′)> = K(x, x′). This essentially means that we can treat the SSVMKND prior as a zero-mean Gaussian process whose covariance function equals the kernel function K(x, x′).
The probability distribution of the second term for each sample is the conditional probability of the output t given the input x and the function value ζ, i.e., P(t|x, ζ) = B_C exp(−C max(0, 1 − tζ(x))), where t ∈ {0, 1}. Here, B_C is set in such a way that the probabilities do not add up to a value greater than one [13]. Now, the total likelihood for the training data can be found as the product of the likelihoods of the individual training samples, i.e., P(X, T|ζ) = Π_{i=1}^{N} P(t_i|x_i, ζ)P(x_i|ζ). Now, let us assume that the MAP solution of this probabilistic model is ζ*, i.e., ζ* = arg max P(ζ|X, T). The posterior probability can be calculated from the prior probabilities defined so far, i.e., P(ζ|X, T) = P(X, T|ζ)P(ζ). Since the prior probabilities are, in fact, the SSVMKND priors, this solution will provide us the SSVMKND solution. To find the solution, we first calculate the log posterior probability:
first calculate the log posterior probability:
lnP (ζ|X , T ) = lnP (X , T |ζ)P (ζ)
N
1 
=− ζ(x)ΘK−1 (x, x )ζ(x ) − C max(0, 1 − ti ζ(xi )) + Constant. (5)
2  i=1
x,x

For non-training inputs, the solution ζ* can be found by differentiating with respect to ζ and setting the derivative to zero. The solution is ζ*(x) = Σ_{i=1}^{N} α_i t_i Θ^{-1} K(x, x_i). The corresponding ζ* for the training inputs divides the training dataset into three parts: t_i ζ*(x_i) > 1 ⇒ α_i = 0, t_i ζ*(x_i) = 1 ⇒ α_i ∈ [0, C], and t_i ζ*(x_i) < 1 ⇒ α_i = C.
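The three cases above can be checked directly for a trained model; the helper below is illustrative only and assumes ±1-encoded labels t and a numerical tolerance:

```python
import numpy as np

def sv_partition(t, zeta, C, tol=1e-8):
    """Split training points by the three cases of t_i * zeta*(x_i) vs. 1."""
    m = t * zeta
    return {
        "alpha = 0": m > 1 + tol,                 # safely inside: not SVs
        "alpha in [0, C]": np.abs(m - 1) <= tol,  # exactly on the margin
        "alpha = C": m < 1 - tol,                 # margin violators
    }

t = np.array([1.0, 1.0, -1.0])
zeta = np.array([2.0, 1.0, -0.2])
parts = sv_partition(t, zeta, C=10.0)
```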
Here, our interest is in the second and third sets or the Support Vectors, which
are inherently difficult to classify [14]. The sparsity of the SSVMKND method
can be understood by noting the difference between the Gaussian prior of the
SSVMKND method and other popular Gaussian priors. The main difference
is the use of the hinge loss function in our method and the use of sigmoidal
functions in the other Gaussian priors. The hinge loss function max(0, 1 − tζ(x))
value is zero when tζ(x) > 1. As a result, the corresponding data points do
not contribute to the solution (in other words, number of support vectors <<
total number of data points). This is why the solution provided by SSVMKND
is a sparse solution. On the other hand, other Gaussian priors, such as the sigmoidal function, have nonzero gradients at all points, resulting in non-sparse solutions.

4 Experimental Results
In this section we evaluate the proposed SSVMKND method against the KSVM,
KND and the KFD. For kernelization of the data, we use the popular Gaussian
RBF Kernel [3]. For KND and KFD, after finding the optimal eigenvector, a Bayes classifier is used for the final classification. The different control
parameters for different methods (e.g. weight parameter η for SSVMKND, RBF
width parameter σ for all methods) are optimized using exhaustive search. If the
optimization needs to be faster, efficient methods can be used at the cost of a
small degradation in accuracy values.
We have applied the classification algorithms to 7 real-world and artificial datasets obtained from the Benchmark Repository in [12]. There are 100 partitions available for each dataset, where about 60% of the data is used for training and the rest for testing. We randomly picked 5 out of these 100 partitions and repeated the random picking 5 times to achieve an unbiased average result.
Table 1 contains the average accuracy values and the standard deviations
obtained over all the runs. We see that the SSVMKND method outperforms
the KSVM, KND and KFD methods in all of the cases. Since the SSVMKND
combines the global and near-global variations provided by the KSVM and the
KND, respectively, it can classify the relatively difficult test samples.
The sparsity values (the ratio of the number of support vectors to the number of training samples) in Table 1 indeed verify that the SSVMKND classifier provides a sparse solution, as explained in Section 3.2. A sparse solution is important for real-world problems, where the data is usually of high dimension. Sparsity ensures that the solution model has fewer parameters than a non-sparse solution [2]. Hence, a sparse solution can directly result in a model with lower computational complexity by using efficient mathematical programming-based algorithms. We also observe that in all cases, the

Table 1. Average % classification accuracy and standard deviation (in parentheses) of each method, and % sparsity for SSVMKND and KSVM, for the 7 datasets (best method in bold, second best emphasized).

Dataset      SSVMKND      KSVM         KND          KFD          % Sparsity (SSVMKND)  % Sparsity (KSVM)
Flare-Sonar  67.7 (0.47)  66.9 (0.41)  67.1 (0.65)  66 (0.40)    4.65                  5.25
German       78.1 (0.40)  77 (0.38)    76.3 (0.68)  75.7 (0.51)  39.1                  38.85
Heart        86.5 (2.21)  85.4 (2.3)   81.7 (1.58)  82.9 (2.13)  39.4                  36.47
Banana       89.8 (0.25)  89.6 (0.29)  89.6 (0.22)  89.5 (0.20)  60.5                  59.5
Diabetes     78.6 (0.50)  77.7 (0.69)  75.7 (0.90)  77.3 (1.03)  53.2                  52.8
Ringnorm     98.5 (0.04)  98.4 (0.04)  98.3 (0.03)  97.4 (0.07)  46                    38.5
Thyroid      97.3 (0.6)   96.5 (1.02)  97.1 (0.64)  96.8 (0.49)  35.7                  31.4

SSVMKND method provides better sparsity than the KSVM. This is because
of the nature of the Gaussian prior used. The prior in the KSVM method [13]
considers the underlying coefficients of the weight vectors to be uncorrelated.
However, in real-world applications, this assumption is not necessarily true. In
the SSVMKND, the prior (Section 3.2) incorporates the between-class and the
within-class scatter matrices from the KND. As a result, in our prior, the co-
efficients are not considered uncorrelated. This is why the SSVMKND provides
better sparsity values than the KSVM.

4.1 Face Recognition

Our purpose is to build a basic face recognition application that can act as a "proof-of-concept" system for comparing different methods. Hence, we have chosen the well-established eigenfaces technique [1] for our application, which essentially uses Principal Component Analysis (PCA) to project the images onto a reduced subspace. To observe the effect of varying the PCA dimension on the different classification methods, we have repeated our experiment

[Figure 1: two panels, "Yale Database" (left) and "JAFFE Database" (right), plotting % Accuracy against PCA Dimension (3-30) for SSVMKND, KSVM, KND and KFD.]

Fig. 1. Evaluation and Comparison on face databases. Classification accuracies of each method in % vs. PCA dimensions for facial databases.

with projecting all the images onto PCA subspaces of 3, 6, 9, . . . , 30 dimensions each. The face databases we used are the Yale face database [5] and the Japanese Female Facial Expression (JAFFE) database [10]. The one-against-one algorithm is used to achieve two-class classification. We have randomly divided each class into a training and a testing set (50% each) and repeated this process twice
datasets. As we can see, regardless of the PCA dimension used, the SSVMKND
achieves better classification accuracies in majority of the cases by combining
necessary information provided by both the KSVM and the KND.

5 Conclusion

We have proposed the SSVMKND method which combines variational informa-


tion from KSVM and KND. SSVMKND is free from any underlying assumption
regarding class distribution and avoids the small sample size problem. Being a
convex optimization problem, our method provides a global optimum solution
and can be solved easily and efficiently using numerical methods. We also proved that our method can be reduced to the classical KSVM model. By incorporating a better Gaussian prior, our method provides a sparser solution than the one obtained by the KSVM. The experimental results on several contemporary datasets and an application to face recognition verify the superiority of our method compared to KSVM, KND and KFD. In the future, we plan to build a multi-class classifier based on the principles of the SSVMKND method.

References
1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. fisherfaces: recognition
using class specific linear projection. IEEE Transactions on Pattern Analysis and
Machine Intelligence 19, 711–720 (1997)
2. Camps-Valls, G., Bruzzone, L.: Kernel-based methods for hyperspectral image clas-
sification. IEEE Transactions on Geoscience and Remote Sensing 43(6), 1351–1362
(2005)
3. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines.
Cambridge University Press, Cambridge (2000)
4. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic
Press (2000)
5. Georghiades, A.: Yale face database (1997)
6. Horn, R., Charles, R.: Matrix Analysis. Cambridge University Press (1990)
7. Khan, N., Ksantini, R., Ahmad, I., Boufama, B.: A novel SVM+NDA model for
classification with an application to face recognition. Pattern Recognition 45(1),
66–79 (2012)
8. Ksantini, R., Ahmad, I., Boufama, B., Khan, N.: A new combined KSVM and
KFD model for classification and recognition. In: Fifth International Conference
on Digital Information Management, pp. 188–193 (2010)
9. Lee, C., Landgrebe, D.: Feature Extraction Based on Decision Boundaries. IEEE
Transactions on Pattern Analysis and Machine Intelligence 15(4), 388–400 (1993)

10. Lyons, M., Budynek, J., Akamatsu, S.: Automatic classification of single facial
images. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(12),
1357–1362 (1999)
11. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.: Fisher Discriminant
Analysis with Kernels. In: Neural Networks for Signal Processing, pp. 41–48 (August 1999)
12. Ratsch, G., Onoda, T., Muller, K.: Soft Margins for Adaboost. Machine Learn-
ing 42(3), 287–320 (2000)
13. Sollich, P.: Bayesian Methods for Support Vector Machines: Evidence and Predic-
tive Class Probabilities. Machine Learning 46(1-3), 21–52 (2002)
14. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons (1998)
15. Xiong, T., Cherkassky, V.: A Combined SVM and LDA Approach for Classification.
In: International Joint Conference on Neural Networks, pp. 1455–1459 (2005)
16. You, D., Hamsici, O.C., Martinez, A.M.: Kernel Optimization in Discriminant
Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(3),
631–638 (2011)
Training Mahalanobis Kernels
by Linear Programming

Shigeo Abe

Kobe University, Rokkodai, Nada, Kobe, Japan


abe@kobe-u.ac.jp
http://www2.kobe-u.ac.jp/~abe

Abstract. The covariance matrix in the Mahalanobis distance can be


trained by semi-definite programming, but training for a large size data
set is inefficient. In this paper, we constrain the covariance matrix to
be diagonal and train Mahalanobis kernels by linear programming (LP).
Training can be formulated by ν-LP SVMs (support vector machines)
or regular LP SVMs. We clarify the dependence of the solutions on the
margin parameter. If a problem is not separable, a zero-margin solu-
tion, which does not appear in the LP SVM, appears in the ν-LP SVM.
Therefore, we use the LP SVM for kernel training. Using the benchmark
data sets we show that the proposed method gives better generalization
ability than RBF (radial basis function) kernels and Mahalanobis kernels
calculated using the training data and has a good capability of selecting
input variables especially for a large number of input variables.

1 Introduction

Regular support vector machines do not assume a priori data distributions and
determine the decision boundary using only support vectors that are near the
boundary. However, if data of one class have a large variance and those of the
other class have a small variance, it may not be good to place the hyperplane in
the middle of the unbounded support vectors. In such a situation, instead of the
Euclidean distance, the Mahalanobis distance is sometimes effective [1,2,3,4,5].
There are two ways to incorporate the Mahalanobis distance into support vec-
tor machines: one is to reformulate support vector machines so that the margin is
measured by the Mahalanobis distance [1,2], and the other is to use Mahalanobis
kernels [3,4,5], which calculate the kernel value according to the Mahalanobis
distance between the associated two argument vectors.
Radial basis function (RBF) kernels are widely used because they usually give
good performance for most applications. To improve the generalization ability of
RBF kernels, generalized RBF kernels are proposed, in which each input variable
has a weight in calculating the kernel value. Mahalanobis kernels are an extension
of generalized RBF kernels and if the covariance matrix is restricted to a diagonal
matrix, the Mahalanobis kernels reduce to generalized RBF kernels [4].

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 339–346, 2012.

c Springer-Verlag Berlin Heidelberg 2012
340 S. Abe

In [3], training of a support vector machine is reformulated so that a generalized RBF kernel is trained simultaneously. The formulation, however, is no longer quadratic. In [4], the covariance matrix for Mahalanobis kernels is calculated using the training data.
There are several approaches that obtain the Mahalanobis metric by training, which results in semi-definite programming [6]. Usually, semi-definite programming is difficult to solve for large-size problems, and its speedup has been discussed. However, if we confine the covariance matrix to a diagonal matrix, training results in linear programming.
In this paper, we train Mahalanobis kernels with the diagonal covariance ma-
trix by linear programming (LP). Restricting the covariance matrix to a diagonal
matrix in the formulation of [6], we obtain a linear programming program with
the explicit margin, similar to a ν-LP SVM [7]. We analyze the dependence of the solution of the ν-LP SVM on the margin parameter and show that if a
training problem is not linearly separable, the zero-margin solution is obtained
for the margin parameter value larger than some value. We show that this does
not happen for the LP SVM, which is equivalent to the ν-LP SVM with the
positive margin and objective function values. We also derive the lower bound
of the margin parameter value of the LP SVM, in which the nonzero solution is
obtained. We, therefore, use the LP SVM for kernel training. We compare the
proposed method with the RBF kernels and Mahalanobis kernels whose diagonal
elements are calculated using training data.
In Section 2, we formulate kernel training by linear programming. In Section
3, we clarify the characteristics of ν-LP SVMs and LP SVMs. In Section 4, we
demonstrate the effectiveness of the proposed method using some benchmark
data sets.

2 Formulation of Kernel Training


In [4], the Mahalanobis kernel for m-dimensional inputs x and x′ is defined by

K(x, x′) = exp(−(δ/m) (x − x′)^T Q^{-1} (x − x′)),   (1)

where δ (> 0) is a parameter, ^T and ^{-1} denote the matrix transpose and inverse, respectively, and Q is the covariance matrix of the M data {x_1, . . . , x_M}:

Q = (1/M) Σ_{i=1}^{M} (x_i − c)(x_i − c)^T,   c = (1/M) Σ_{i=1}^{M} x_i.   (2)
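When Q is diagonal, (1) reduces to a per-variable weighted RBF kernel; a small sketch (our illustration, with the diagonal of Q supplied as a vector):

```python
import numpy as np

def mahalanobis_kernel_diag(x, xp, q_diag, delta):
    """Eq. (1) with diagonal Q: exp(-(delta/m) * sum_i (x_i - x'_i)^2 / Q_ii)."""
    m = len(x)
    d = x - xp
    return np.exp(-(delta / m) * np.sum(d * d / q_diag))

k = mahalanobis_kernel_diag(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                            q_diag=np.ones(2), delta=2.0)
```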

According to the computer experiments in [4], a diagonal covariance matrix is sufficient for high generalization ability. Therefore, in the following we consider only a diagonal covariance matrix in (1).
In [6], the inverse of Q is trained using the training data. We reformulate the method discussed in [6] for a diagonal Q. We let R = Q^{-1}.
Let set P be a set of training triplets:
P = {Pr } = {(xi , xj , xk )} for r = 1, . . . , |P |, i, j, k ∈ {1, . . . , M }, (3)
Training Mahalanobis Kernels by Linear Programming 341

where xi and xj belong to the same class but xk and xi (and thus xj ) belong to
different classes, and |P | denotes the number of elements in P . In [6], the details
of how to generate triplets are not described. Here, we consider generating the
set of triplets, P . For a training sample xr (r = 1, . . . , M ) belonging to a class,
we find the nearest xj belonging to the same class and the nearest xk belonging
to the other class and make (xr , xj , xk ) the triplet. In this way, we obtain P
with M triplets.
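The triplet-generation rule just described can be sketched as follows (our illustration, using brute-force Euclidean distances; for large M a space-partitioning structure would be preferable):

```python
import numpy as np

def make_triplets(X, y):
    """For each x_r, pair it with its nearest same-class x_j and
    nearest other-class x_k, giving the triplet (x_r, x_j, x_k)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # exclude x_r itself
    triplets = []
    for r in range(len(X)):
        same = np.flatnonzero(y == y[r])
        same = same[same != r]
        other = np.flatnonzero(y != y[r])
        j = same[np.argmin(d[r, same])]
        k = other[np.argmin(d[r, other])]
        triplets.append((r, j, k))
    return triplets

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
y = np.array([0, 0, 1, 1])
P = make_triplets(X, y)   # one triplet per training sample, M in total
```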
A Mahalanobis distance between two data belonging to the same class needs
to be shorter than that between data of different classes. Thus, we define a
margin ρr for the rth triplet as follows:
ρ_r = (x_r − x_k)^T R (x_r − x_k) − (x_r − x_j)^T R (x_r − x_j) = Σ_{i=1}^{m} a_ir R_ii,   (4)

where a_ir = (x_ir − x_ik)^2 − (x_ir − x_ij)^2 for i = 1, . . . , m, r = 1, . . . , M; x_ir is the ith element of x_r, and R_ii is the ith diagonal element of R.
We want ρ_r to be as large as possible but, similar to support vector machines, for some triplets we allow negative margins. We then formulate the following optimization problem:

maximize   J_ρ(ρ, R, ξ) = ρ − C_ρ Σ_{r=1}^{M} ξ_r   (5)
subject to Σ_{i=1}^{m} R_ii = 1,   (6)
           R_ii ≥ 0 for i = 1, . . . , m,   (7)
           Σ_{i=1}^{m} a_ir R_ii ≥ ρ − ξ_r for r = 1, . . . , M,   (8)
           ξ_r ≥ 0 for r = 1, . . . , M,  ρ > 0,   (9)

where ρ is the margin, ξ_r are slack variables that allow negative margins, C_ρ is a margin parameter, and (6) makes the left-hand side of (8) unique.
The above formulation is similar to the ν-LP SVM discussed in [7]. Because
the ν-LP SVM treats a two-class problem, air Rii in (8) is multiplied by yr which
takes 1 or −1. But this is a trivial difference. Thus, we call the above support
vector machine primal ν-LP SVM or ν-LP SVM for short.
According to [7], for positive ρ, the ν-LP SVM is equivalent to the following
formulation:
minimize   J(R, ξ) = Σ_{i=1}^{m} R_ii + C_M Σ_{r=1}^{M} ξ_r   (10)
subject to R_ii ≥ 0 for i = 1, . . . , m,   (11)
           Σ_{i=1}^{m} a_ir R_ii ≥ 1 − ξ_r for r = 1, . . . , M,   (12)
           ξ_r ≥ 0 for r = 1, . . . , M,   (13)

where C_M is a margin parameter.
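Formulation (10)-(13) is a plain linear program, so an off-the-shelf solver suffices. The sketch below is our own, assuming SciPy's `linprog` with the HiGHS backend; the variable vector stacks the m diagonal entries R_ii followed by the M slacks ξ_r:

```python
import numpy as np
from scipy.optimize import linprog

def lp_svm(a, C_M):
    """Solve (10)-(13); a is (M, m) with a[r, i] = a_ir."""
    M, m = a.shape
    c = np.concatenate([np.ones(m), C_M * np.ones(M)])  # sum_i R_ii + C_M sum_r xi_r
    A_ub = np.hstack([-a, -np.eye(M)])   # -(sum_i a_ir R_ii) - xi_r <= -1
    b_ub = -np.ones(M)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (m + M), method="highs")
    return res.x[:m], res.x[m:]

# Toy data: each constraint row is easily satisfied by one diagonal entry.
R, xi = lp_svm(np.array([[2.0, 0.0], [0.0, 4.0]]), C_M=10.0)
```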

The above formulation is similar to standard LP SVMs with linear kernels. The differences are that the coefficients of the hyperplane, R_ii, are non-negative and no bias term is included. In the following, we call the above support vector machine the regular LP SVM, or LP SVM for short.
In the ν-LP SVM, the margin is ρ/Σ_{i=1}^{m} R_ii = ρ, whereas in the LP SVM, the margin is 1/Σ_{i=1}^{m} R_ii. By negating (5) and replacing ρ with −Σ_{i=1}^{m} R_ii, the objective function (10) is obtained. Because the equality constraint (6) would make the optimization meaningless, it is deleted, and ρ in (8) is replaced with 1 in the LP SVM.
Let the optimal solution of the ν-LP SVM be (ρ̄, R̄, ξ̄) and that of the LP SVM be (R̂, ξ̂), let J̄_ρ = J_ρ(ρ̄, R̄, ξ̄) and Ĵ = J(R̂, ξ̂), and let R̄ ≠ 0 denote that at least one diagonal element of R̄ is nonzero. In [7], the equivalence of the ν-LP SVM and the LP SVM is shown as follows:

Theorem 1. If the ν-LP SVM has the optimal solution (ρ̄, R̄, ξ̄) with ρ̄ > 0 and J̄_ρ > 0 for C_ρ, then (R̂, ξ̂) = (R̄/ρ̄, ξ̄/ρ̄) is the optimal solution of the LP SVM with C_M = C_ρ/J̄_ρ. Conversely, if the LP SVM with C_M has the optimal solution (R̂, ξ̂) with R̂ ≠ 0, then (ρ̄, R̄, ξ̄) = (1/Σ_{i=1}^{m} R̂_ii, ρ̄ R̂, ρ̄ ξ̂) is the optimal solution of the ν-LP SVM with C_ρ = C_M Ĵ.

We add the condition J̄_ρ > 0 to the first part of Theorem 1 to guarantee equivalence.
In [7], it is recommended to use the ν-LP SVM rather than the LP SVM for the ease of selecting the parameter value C_ρ: C_ρ = 1/(νM) for 1/M < ν < 1. Therefore, 1/M < C_ρ < 1.

3 Properties of ν-LP SVMs and LP SVMs


In this section we clarify the properties of ν-LP SVMs and LP SVMs and then
compare them. Because of the space limitation, we omit proofs of the theorems.

3.1 ν-LP SVMs


We investigate the dependence of the ν-LP SVM on the value of C_ρ. To do so, we derive the dual form of the ν-LP SVM given by (5) to (9) as follows:

minimize   z   (14)
subject to Σ_{r=1}^{M} δ_r ≥ 1,   (15)
           Σ_{r=1}^{M} a_ir δ_r ≤ z for i = 1, . . . , m,   (16)
           C_ρ ≥ δ_r ≥ 0 for r = 1, . . . , M,   (17)

where z is the dual variable associated with the constraint (6) and δ_r are the dual variables associated with (8). We call the above formulation the dual ν-LP SVM. In [7], equality is used in (15), but that is valid only for ρ > 0.

From (15) and (17), the dual ν-LP SVM has a feasible solution for Cρ ≥ 1/M ,
but does not have a feasible solution for 1/M > Cρ > 0. For the ν-LP SVM,
because of the slack variables ξi , a feasible solution always exists for Cρ > 0.
But for 1/M > Cρ > 0, the optimal solution of the ν-LP SVM is unbounded:
Theorem 2. For 0 < Cρ < 1/M , the solution of the ν-LP SVM is unbounded.
But for Cρ ≥ 1, Rii (i = 1, . . . , m) are the same:
Theorem 3. For Cρ ≥ 1 the optimal Rii (i = 1, . . . , m) are the same for dif-
ferent values of Cρ .
From Theorem 3, it is clear that for 1 ≥ Cρ ≥ 1/M , the bounded optimal
solutions exist for the primal and dual ν-LP SVMs. From (15) and (17), for the
optimal solution of the ν-LP SVM with 1/(j − 1) > Cρ ≥ 1/j (j = 2, . . . , M ),
at least j inequality constraints in (8) are active. Theorem 3 is further refined
as follows:
Theorem 4. If for Cρ = C0 (≥ 1/M ), the solution of the ν-LP SVM satisfies
ρ > 0 and ξ = 0, or ρ = 0, the same solution is obtained for Cρ = t C0 (t > 1).
Assume that for C_ρ = 1/j (j = 2, . . . , M) the solutions of the primal and dual ν-LP SVMs are (ρ̄, R̄, ξ̄) and (z̄, δ̄), respectively. Then what can we say about the solutions for C_ρ = t/j (j/(j − 1) > t > 1)? From Theorem 4, if ρ̄ > 0 and ξ̄ = 0, or ρ̄ = 0, R does not change for C_ρ = t/j. Then, for ρ̄ > 0 and ξ̄ ≠ 0, can (ρ̄, R̄, ξ̄) and (z̄, δ̄), or (ρ̄, R̄, ξ̄) and (t z̄, t δ̄), be the solutions for C_ρ = t/j? Neither can. The solutions (ρ̄, R̄, ξ̄) and (z̄, δ̄) do not satisfy the complementarity condition for ξ̄_r > 0, and the solutions (ρ̄, R̄, ξ̄) and (t z̄, t δ̄) do not satisfy it either. This means that the optimal R may change for 1/(j − 1) > C_ρ > 1/j.

3.2 LP SVMs
We now investigate the dependence of the solution of the LP SVM on the C_M value.
Because of the slack variables, the LP SVM has a feasible solution. In addition,
because the objective function given by (10) is restricted to be non-negative, the
optimal solution of the minimization problem always exists.
For a small CM value, the solution Rii = 0 (i = 1, . . . , m) is obtained as the
following theorem shows:
Theorem 5. Define

C_min = min_{i = 1, . . . , m : Σ_{r=1}^{M} a_ir > 0}  1 / (Σ_{r=1}^{M} a_ir).   (18)

Then, if C_min exists, for C_min > C_M > 0 the solution of the LP SVM is

R_ii = 0 for i = 1, . . . , m.   (19)

Otherwise, (19) is satisfied for C_M > 0.
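C_min in (18) is cheap to compute from the matrix of a_ir values; a sketch (our illustration, returning None when no column sum is positive, i.e., when C_min does not exist):

```python
import numpy as np

def c_min(a):
    """Eq. (18): smallest 1/(sum_r a_ir) over inputs i with positive column sum."""
    col_sums = a.sum(axis=0)        # sum_r a_ir for each input variable i
    pos = col_sums[col_sums > 0]
    return (1.0 / pos).min() if pos.size else None

a = np.array([[1.0, -1.0],
              [2.0, -3.0]])        # column sums: 3 and -4
```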



This is a degenerate solution, but because the tradeoff between the margin and the classification accuracy is controlled by the margin parameter C_M, the degenerate solution is not an anomalous one. The relation Σ_{r=1}^{M} a_ir ≤ 0 may hold when the nearest training pairs, measured by the ith input variable, with opposite classes are dominant.
Now consider a general case where not all the constraints (12) are active at the optimal solution, and let S be the index set of the active constraints. Then, (10) at the optimal solution becomes

J(R, ξ) = |S| C_M + Σ_{i=1}^{m} (1 − C_M Σ_{r∈S} a_ir) R_ii.   (20)

The set {x_r | r ∈ S} is the set of support vectors. Thus, for 1 − C_M Σ_{r∈S} a_ir > 1, R_ii = 0. This means that by training the LP SVM, input variable selection is performed simultaneously.
Similar to the ν-LP SVM, for a large value of CM , the optimal solution does
not change as the following theorem shows:
Theorem 6. There exists a positive C0 such that for CM ≥ C0 the optimal
solution (R, ξ) of the LP SVM does not change.

3.3 Comparison of ν-LP SVMs and LP SVMs

As Theorem 1 shows, the ν-LP SVM and LP SVM are equivalent when ρ is
positive and the objective function value for the ν-LP SVM is positive. To obtain
a solution with positive ρ and the positive objective function, Cρ needs to be
selected in [1/M, 1/j], where j ∈ {1, . . . , M −1}. But, there is no way of selecting
the value of j.
For the optimal solution with Cρ = 1/j, at least j constraints are active.
Therefore, controlling the number of active constraints is easy but again the
optimal j is not known in advance.
For the LP SVM, the lower bound of CM that gives nonzero R is Cmin given
by (18). But unlike the ν-LP SVM, there is no upper bound of CM . This is
because a zero-margin solution (i.e., Rii → ∞) is not obtained.
Using either the ν-LP SVM or the LP SVM, we need to optimize the value of C_ρ or C_M by, e.g., cross-validation. Because with the LP SVM we do not need to worry about an upper bound on C_M, in the following we use the LP SVM for Mahalanobis kernel training.

4 Computer Experiment

We compared the performance of the proposed method with that of the Mahalanobis
kernels given by (1) and (2), RBF kernels, and the Mahalanobis distance proposed
in [6]. The RBF kernel is given by exp(−γ ‖x − x′‖² / m), where γ is a
positive constant. We used one-against-all fuzzy L1 SVMs [8].
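The two kernel forms compared above can be sketched as follows. The RBF form is taken directly from the text; the diagonal Mahalanobis form is only a plausible sketch, since equations (1) and (2) defining the paper's Mahalanobis kernel are not reproduced in this excerpt, and the parameter `delta` and the exact use of the diagonal metric R are assumptions:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    """RBF kernel exp(-gamma * ||x - x'||^2 / m), with m the input dimension."""
    m = len(x)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2) / m)

def diag_mahalanobis_kernel(x, x_prime, r_diag, delta=1.0):
    """Hypothetical diagonal-Mahalanobis kernel sketch:
    exp(-delta * sum_i R_ii * (x_i - x'_i)^2). Setting R_ii = 0 removes
    input variable i, which is how the LP SVM performs input selection."""
    d = x - x_prime
    return np.exp(-delta * np.sum(r_diag * d ** 2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 4.0])
print(rbf_kernel(x, x, 1.0))  # 1.0: identical inputs
# With R_33 = 0 the third input, the only one that differs, is ignored:
print(diag_mahalanobis_kernel(x, y, np.array([1.0, 1.0, 0.0])))  # 1.0
```

The second call illustrates the point made above: zeroing a diagonal entry of R makes the kernel blind to that input variable.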
Training Mahalanobis Kernels by Linear Programming 345

We determined the parameter values by fivefold cross-validation. For the margin
parameter C of the L1 SVM, we selected the value from {1, 10, 50, 100,
500, 1000, 2000}. For the proposed Mahalanobis kernel and the RBF kernel,
we selected the values of δ and γ from {0.1, 0.5, 1, 5, 10, 20, 50, 100, 200}, and for
the Mahalanobis kernel given by (1) and (2), we selected the value of δ from
{0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20}.
For the proposed Mahalanobis kernel, to speed up model selection, we determined
the value of CM in two stages. For single training and test data sets, we first
trained the LP SVM with CM = 2000 and determined the values of C and δ for
the SVM by cross-validation; we then determined the values of CM and C by
cross-validation, fixing δ at the previously selected value. For multiple training
and test data sets, we set CM = 1 or 2000, determined the C and δ values by
cross-validation, and selected the CM value giving the higher recognition rate on
the cross-validation data set.
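The two-stage selection described above can be sketched generically as follows. Here `cv_score` is a hypothetical stand-in for the fivefold cross-validation accuracy, and the toy score function and grids are purely illustrative, not the paper's actual values:

```python
import itertools

def two_stage_selection(cv_score, C_grid, delta_grid, CM_grid, CM_init=2000):
    """Two-stage model selection sketch: first fix C_M at CM_init and tune
    (C, delta); then fix delta at its selected value and tune (C_M, C)."""
    # Stage 1: with C_M fixed, pick C and delta by cross-validation score.
    C1, delta1 = max(itertools.product(C_grid, delta_grid),
                     key=lambda cd: cv_score(cd[0], cd[1], CM_init))
    # Stage 2: with delta fixed, pick C_M and C.
    CM2, C2 = max(itertools.product(CM_grid, C_grid),
                  key=lambda mc: cv_score(mc[1], delta1, mc[0]))
    return C2, delta1, CM2

# Toy score with a known optimum at C=10, delta=0.5, C_M=500 (illustrative only).
score = lambda C, d, CM: -((C - 10) ** 2 + (d - 0.5) ** 2 + (CM - 500) ** 2)
print(two_stage_selection(score, [1, 10, 100], [0.1, 0.5, 1], [1, 500, 2000]))
# -> (10, 0.5, 500)
```

The design rationale is the one stated in the text: a full grid over (C, δ, C_M) is cubic in the grid sizes, whereas the two-stage scheme searches two quadratic grids.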
Table 1 shows the results. In the table, the In/Cl/Tr/Te column shows the
numbers of inputs, classes, training data, and test data, the following four
columns show the recognition rates for the test data sets: the proposed method;
the Mahalanobis kernel calculated by (1) and (2); RBF kernels; and the Maha-
lanobis distance in [6]. The “Final” column shows the average number of inputs
per decision function selected by the proposed method.
The results for the balance and Pima Indians data sets were obtained by randomly
dividing the data into training and test sets 10 times. Among the first three
methods, the best recognition rate is shown in boldface, the second in Roman,
and the third in italic. The last row of the table gives a summary: e.g., the first
of the three numerals is the number of data sets for which the method gives the
best recognition rate. From the table, the proposed kernel performs best, and the
Mahalanobis and RBF kernels are comparable. For the last three data sets, all
three methods showed better recognition rates than the Mahalanobis distance of [6].
In the “Final” column, the reduction rates for the hiragana-50, hiragana-105,
and USPS data sets are high because they are gray scale images.

Table 1. Comparison of recognition rates and the number of selected inputs

Data In/Cl/Tr/Te Proposed Mahalanobis RBF [6] Final


Thyroid [9] 21/3/3772/3428 97.93 97.49 96.47 — 17.7
Blood [8] 13/12/3097/3100 93.77 93.32 93.52 — 12.3
Hiragana-50 [8] 50/39/4610/4610 98.85 99.31 99.26 — 24.0
Hiragana-13 [8] 13/38/8375/8356 99.86 99.80 99.77 — 10.4
Hiragana-105 [8] 105/38/8375/8356 100 100 100 — 39.2
Abalone [9] 8/3/3133/1044 66.38 67.63 65.04 — 7.0
Satimage [9] 36/6/4435/2000 91.20 91.00 91.90 — 31.8
USPS [10] 256/10/7291/2007 95.32 94.77 95.47 — 116
Letter [9] 16/26/16000/4000 98.23 97.58 97.83 96.54 15.8
Balance [9] 4/3/532/93 99.03±1.12 97.96±2.01 99.03±1.22 90.22±3.17 4
Pima Indians [9] 8/2/653/115 74.96±3.92 75.31±3.98 75.13±4.14 72.36±3.71 8
Ranking — 6:3:2 4:2:5 4:4:3 —
346 S. Abe

5 Conclusions

We discussed training Mahalanobis kernels with a diagonal covariance matrix by
linear programming support vector machines (LP SVMs). Training can be
formulated either by ν-LP SVMs or by regular LP SVMs. We clarified the
dependency of ν-LP SVMs and regular LP SVMs on the margin parameter and
proved that the ν-LP SVM may give a zero-margin solution for margin parameter
values close to 1, whereas this kind of solution does not appear in the regular LP
SVM. We also derived the lower bound of the margin parameter for the LP SVM
that gives a nonzero solution. Because of the zero-margin problem, we used regular
LP SVMs. Using several benchmark data sets, we showed that the generalization
ability of the proposed method is better than that of RBF kernels.

References
1. Lanckriet, G.R.G., El Ghaoui, L., Bhattacharyya, C., Jordan, M.I.: A robust min-
imax approach to classification. Journal of Machine Learning Research 3, 555–582
(2002)
2. Xue, H., Chen, S., Yang, Q.: Structural regularized support vector machine:
A framework for structural large margin classifier. IEEE Trans. Neural Net-
works 22(4), 573–587 (2011)
3. Grandvalet, Y., Canu, S.: Adaptive scaling for feature selection in SVMs. In: Neural
Information Processing Systems 15, pp. 569–576. MIT Press (2003)
4. Abe, S.: Training of Support Vector Machines with Mahalanobis Kernels. In: Duch,
W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp.
571–576. Springer, Heidelberg (2005)
5. Wang, D., Yeung, D.S., Tsang, E.C.C.: Weighted Mahalanobis distance kernels for
support vector machines. IEEE Trans. Neural Networks 18(5), 1453–1462 (2007)
6. Shen, C., Kim, J., Wang, L.: Scalable large-margin Mahalanobis distance metric
learning. IEEE Trans. Neural Networks 21(9), 1524–1530 (2010)
7. Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via
column generation. Machine Learning 46(1-3), 225–254 (2002)
8. Abe, S.: Support Vector Machines for Pattern Classification. Springer, Heidelberg
(2010)
9. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
10. USPS Dataset,
http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html
Correntropy-Based Document Clustering
via Nonnegative Matrix Factorization

Tolga Ensari1, Jan Chorowski2, and Jacek M. Zurada2


1 Computer Engineering Department, Istanbul University, Istanbul, Turkey
2 Electrical and Computer Engineering, University of Louisville, Louisville, USA
ensari@istanbul.edu.tr,
{jan.chorowski,jacek.zurada}@louisville.edu

Abstract. Nonnegative Matrix Factorization (NMF) is one of the popular
techniques for reducing the number of attributes of data. It has also been
widely used for clustering. Several types of objective functions have been
used for NMF in the literature. In this paper, we propose to maximize the
correntropy similarity measure to produce the factorization itself. Correntropy
is an entropy-based criterion defined as a nonlinear similarity measure.
Following the discussion of minimizing the negative of the correntropy function,
we use it to cluster a document data set and compare its clustering performance
with that of Euclidean Distance (EucD)-based NMF. The comparison is illustrated
on the 20-Newsgroups data set. The results show that our approach achieves
better clustering than methods which use EucD as the objective function.

Keywords: Nonnegative Matrix Factorization (NMF), Correntropy, Document Clustering.

1 Introduction

The large size of data is one of the central issues in data analysis research. Data
sizes keep increasing; therefore it becomes more and more important to reduce
them without losing the most essential features. There are several methods for
reducing the dimensionality of large data, such as Principal Component Analysis
(PCA), Singular Value Decomposition (SVD) and Independent Component Analysis
(ICA). The more recently defined Nonnegative Matrix Factorization (NMF)
approach also allows reducing the number of attributes of the data [1, 4, 25].
The NMF technique tries to approximate a data matrix V with the product of
low-rank matrices W and H, such that V ≈ WH and the elements of W and H are
nonnegative. If the columns of V are data samples, then the columns of W can be
interpreted as basis vectors or parts from which the data samples are formed, while
the columns of H give the positions of the encoded samples in the feature space.
It is also common to find a clustering of the samples based on the values of the H
matrix. Typically, the quality of the approximation is measured by the Euclidean
distance between the elements of V and the elements of WH; however, measures
derived from the Kullback-Leibler divergence have also been studied in the
literature [1-4, 9]. In this paper, we propose

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 347–354, 2012.
© Springer-Verlag Berlin Heidelberg 2012

to find the NMF factorization through maximizing the correntropy similarity
measure between the elements of V and WH. It should be noted that the use of
the correntropy similarity measure has been reported for numerous tasks other
than NMF [14-19], and this provided a motivation for its evaluation for this
specific application. We have used a document clustering task to compare the
results obtained with the correntropy objective function to those obtained with
minimization of the Euclidean distance.
Experimental results on the 20-Newsgroups¹ data set illustrate the advantages of our
approach.
The rest of the paper is organized as follows: Section 2 defines NMF in context of
related studies. Section 3 contains a brief explanation of correntropy. Section 4 shows
experimental results for different objective functions, and section 5 summarizes the
outcomes.

2 Nonnegative Matrix Factorization (NMF)

Given a data matrix V, we define its rank-r (r ≤ min(m, n), where V is m × n)
nonnegative factorization into matrices W and H as:

$$\begin{aligned}
\text{minimize:}\quad & D(V, WH) \\
\text{subject to:}\quad & W_{ij} \ge 0,\; H_{ij} \ge 0 \quad \forall i, j,
\end{aligned} \tag{1}$$

where the function D is a measure of the goodness of the approximation. We will
compare the following two suitable measures:

$$D_{\mathrm{EucD}}(V, W, H) = 0.5 \sum_{i,j} \bigl(V_{ij} - (WH)_{ij}\bigr)^2 \tag{2}$$

$$D_{\mathrm{Corr}}(V, W, H) = \sum_{i,j} \exp\Bigl(-\frac{(V_{ij} - (WH)_{ij})^2}{2\sigma^2}\Bigr) \tag{3}$$

where σ is a parameter of the correntropy measure. We note that we need to minimize
the negative of the correntropy (3), since it is a similarity and not a distance measure.
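The two measures can be sketched directly in code. The symbols V, W, H follow the factorization V ≈ WH assumed from context, and the toy matrices below are illustrative only:

```python
import numpy as np

def eucd_loss(V, W, H):
    """Euclidean NMF loss (2): half the sum of squared residues."""
    return 0.5 * np.sum((V - W @ H) ** 2)

def neg_correntropy_loss(V, W, H, sigma):
    """Negative of the correntropy measure (3); minimized because
    correntropy is a similarity, not a distance."""
    E = V - W @ H
    return -np.sum(np.exp(-E ** 2 / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
W, H = rng.random((6, 2)), rng.random((2, 5))
# A perfect factorization gives EucD loss 0 and negative correntropy -V.size:
print(eucd_loss(W @ H, W, H))                  # 0.0
print(neg_correntropy_loss(W @ H, W, H, 0.5))  # -30.0  (6*5 elements)
```

At a perfect fit every residue is zero, so each exp term equals 1 and the negative correntropy reaches its lower bound of minus the number of matrix elements.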
Various algorithms have been proposed to solve problem (1). Lee and Seung
first presented multiplicative update rules for NMF in 2001 [4], using EucD and the
generalized KL-divergence as objective functions. Newer algorithms include
projected gradient descent (PGD) [23] and the alternating least squares (ALS)
algorithm [2].
Different, often application-dependent, loss functions have also been studied: in
[11] the authors used the Itakura-Saito (IS) divergence as the objective function, and in
[7] the β-divergence. However, the correntropy similarity measure has not
been evaluated for NMF yet. Other studies on NMF concern sparseness
constraints, orthogonality, Bayesian approaches, weighted NMF and other conditions

¹ From http://people.csail.mit.edu/jrennie/20Newsgroups

[3, 8, 10, 12]. NMF has been also successfully applied to face recognition,
bioinformatics, text data mining and audio (speech) processing [2, 3, 5, 8-13].
Clustering is also one of the topics for NMF and has been extensively discussed in the
literature [9, 12, 21, 22].

3 Correntropy Similarity Measure


Correntropy was proposed as a localized similarity measure between two random
variables [15]. In this paper we use it to compute the element-wise similarity between
two matrices. From its definition given in formula (3) we can see that each term
is always nonnegative and bounded. This makes correntropy less sensitive to
outliers: when no good approximation can be obtained for a few elements of the
matrix V, the error for those elements saturates and they have less influence on the
global match. The saturation of errors is controlled by the parameter σ. In Fig. 1 we
show the error surface for a single element of the matrix V. By adjusting the parameter
σ we can change the radius of the inverse bell curve.
Correntropy and other entropy-based objective functions have been successfully
applied in several studies [14-19]. Correntropy can also be used as
a cost function for NMF.

Fig. 1. Correntropy objective function 1 − exp(−e²/(2σ²)) for a single residue e, with σ = 0.5 and σ = 1

4 Experimental Results
We have compared the Euclidean distance and correntropy based NMF formulations
by evaluating the quality of clusters computed from the factorizations. We have used
the 20-Newsgroups data set, one of the popular benchmarks for clustering and
classification of text data. It has approximately 11,000 documents taken from 20
different newsgroups pertaining to various subjects.

To generate factorizations based on the correntropy loss (3), we have used a
general-purpose second-order function minimization package that allows imposing
simple (upper and/or lower bound) constraints on variables [20]. It requires formulas
for the gradient of the loss function, which we have computed to be:

$$\frac{\partial D}{\partial W} = -\frac{1}{\sigma^2}\Bigl[(V - WH) \odot \exp\Bigl(-\frac{(V - WH)^{\odot 2}}{2\sigma^2}\Bigr)\Bigr] H^{T} \tag{4}$$

$$\frac{\partial D}{\partial H} = -\frac{1}{\sigma^2}\, W^{T} \Bigl[(V - WH) \odot \exp\Bigl(-\frac{(V - WH)^{\odot 2}}{2\sigma^2}\Bigr)\Bigr] \tag{5}$$

where ⊙ and the squared exponent denote element-wise operations and D is the
negative of the correntropy measure (3).

We have used the popular PGD algorithm to minimize the Euclidean distance based
loss (2). Both algorithms were run with the same random initial matrices W and H.
We have used the same stopping criteria, setting a relative tolerance of 10 and
allowing at most 1000 iterations. We refer to the results obtained when minimizing
correntropy as NMF-Corr and to those obtained while minimizing the Euclidean
distance as NMF-PGD (EucD).
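The paper uses the constrained second-order minimizer of [20]; as an illustrative substitute, the gradients (4)-(5) can be combined with a simple projected gradient step (stepping, then clipping to the nonnegative orthant). The matrix form of the gradients is derived here from (3), and the step size and iteration count are arbitrary choices for this sketch:

```python
import numpy as np

def corr_gradients(V, W, H, sigma):
    """Gradients of the negative-correntropy loss w.r.t. W and H,
    matrix form of (4)-(5): G = (1/sigma^2) * E * exp(-E^2/(2 sigma^2))."""
    E = V - W @ H
    G = E * np.exp(-E ** 2 / (2 * sigma ** 2)) / sigma ** 2  # element-wise
    return -G @ H.T, -W.T @ G

def nmf_corr_pgd(V, r, sigma=0.5, lr=0.01, iters=2000, seed=0):
    """Projected gradient descent sketch: gradient step, then clip to >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r))
    H = rng.random((r, V.shape[1]))
    for _ in range(iters):
        gW, gH = corr_gradients(V, W, H, sigma)
        W = np.maximum(W - lr * gW, 0.0)
        H = np.maximum(H - lr * gH, 0.0)
    return W, H
```

The clipping step is what keeps the iterates feasible for the nonnegativity constraints in (1); everything else is plain gradient descent on the negative correntropy.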

4.1 Evaluation of Clustering Quality


After the factorization process, we obtain W and H. H can be used to group the data
(the columns of V) into clusters by choosing the largest value in each column of H.
We evaluate the clustering performance with the entropy measure, computed by
comparing the true labels of the data set with the obtained cluster labels. The total
entropy for a set of clusters is calculated as the mean of the entropies of each
cluster, weighted by the size of each cluster. First, we calculate the distribution of
the data for each cluster: for each class i and cluster j we compute p_ij, the
probability that a member of cluster j belongs to class i, as p_ij = m_ij / m_j, where
m_j is the number of objects in cluster j and m_ij is the number of objects of class i
in cluster j. The entropy of each cluster j is defined as:

$$e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij} \tag{6}$$

where L is the number of classes. The entropy of the full data set is the sum of the
entropies of each cluster weighted by the size of each cluster:

$$e = \sum_{j=1}^{K} \frac{m_j}{m}\, e_j \tag{7}$$

where K is the number of clusters and m is the total number of data points [24].
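The cluster assignment from H and the entropy evaluation (6)-(7) can be sketched as follows (function and variable names are ours, not the paper's):

```python
import numpy as np
from collections import Counter

def clusters_from_H(H):
    """Assign each sample (column of H) to the row with the largest value."""
    return np.argmax(H, axis=0)

def clustering_entropy(true_labels, cluster_ids):
    """Total entropy (7): size-weighted mean of per-cluster entropies (6)."""
    m = len(true_labels)
    total = 0.0
    for c in set(cluster_ids):
        members = [t for t, k in zip(true_labels, cluster_ids) if k == c]
        m_j = len(members)
        probs = np.array(list(Counter(members).values())) / m_j
        e_j = -np.sum(probs * np.log2(probs))  # entropy of cluster c, eq. (6)
        total += (m_j / m) * e_j               # weighted contribution, eq. (7)
    return total

labels = [0, 0, 1, 1]
print(clustering_entropy(labels, [0, 0, 1, 1]))  # 0.0: pure clusters
print(clustering_entropy(labels, [0, 1, 0, 1]))  # 1.0: maximally mixed clusters
```

Lower entropy means purer clusters, which is why smaller values in Table 1 indicate better clustering.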

4.2 Clustering with NMF


Table 1 shows the entropy values for the NMF-PGD (EucD) and NMF-Corr approaches
on the 20-Newsgroups data set. We graph these values (NMF-PGD (EucD) and NMF-Corr
for σ = 1, σ = 0.5 and σ = 0.01) in Fig. 2. Here, r denotes the assumed number of
clusters and equals the rank of W and H. We change it from 2 to 20 to track the clustering
performance. We show all entropy values in Fig. 2, but for brevity we list only 10
data points in Table 1. Since lower entropy values indicate better clustering performance,
we see that NMF-Corr (σ = 0.5) demonstrates better clustering performance than
NMF-PGD (EucD) for every evaluated number of clusters.

Table 1. Entropy of 20-Newsgroups data set with NMF-PGD (EucD) and NMF-Corr
Number of NMF-PGD NMF-Corr NMF-Corr NMF-Corr
Clusters (r) (EucD) (σ = 1) (σ = 0.5) (σ = 0.01)

r=2 3.841439 3.856538 3.850511 4.298769


r=3 3.856632 3.791665 3.577477 4.269878
r=4 3.779671 3.494205 3.499493 4.271371
r=5 3.743832 3.603604 3.383459 4.241927
r=6 3.490545 3.356921 3.301818 4.230222
r=7 3.436018 3.281271 3.265505 4.201747
r=8 3.303913 3.260363 2.939666 4.191849
r=9 3.305351 3.336344 3.131309 4.182656
r = 10 3.162412 3.226913 2.935471 4.202036

Fig. 2. Entropy comparison for NMF-PGD (EucD) and NMF-Corr

Therefore, we can conclude that correntropy-based NMF (σ = 0.5) has
comparatively better clustering performance than EucD-based NMF for the evaluated
data set. However, NMF-Corr does not show improved performance for σ = 1 and
performs worst for σ = 0.01, as can be seen from Fig. 2 and Table 1.
The parameter σ controls the threshold of saturation of the correntropy loss for
the individual residues V_ij − (WH)_ij. When σ is large, no saturation will occur. On
the other hand, when σ is small nearly all residues saturate, and there may not be
enough information in the unsaturated ones to accurately learn the NMF
decomposition. The optimum value of σ depends on the distribution of the
residues. Determining the exact nature of this relationship warrants further studies.
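The saturation behaviour described above is easy to demonstrate on the per-element penalty 1 − exp(−e²/(2σ²)); the numeric values below are illustrative:

```python
import numpy as np

def corr_penalty(e, sigma):
    """Per-element correntropy penalty 1 - exp(-e^2 / (2 sigma^2))."""
    return 1.0 - np.exp(-e ** 2 / (2 * sigma ** 2))

# A large residue ("outlier") saturates: doubling it barely changes the penalty...
print(corr_penalty(3.0, 0.5), corr_penalty(6.0, 0.5))  # both very close to 1.0
# ...whereas the squared error keeps growing, letting outliers dominate:
print(0.5 * 3.0 ** 2, 0.5 * 6.0 ** 2)                  # 4.5 vs 18.0
# With a very small sigma even a tiny residue is already saturated,
# so its gradient vanishes and carries almost no information:
print(corr_penalty(0.1, 0.01))                         # very close to 1.0
```

This is exactly the trade-off the text identifies: large σ behaves like a (scaled) squared error with no robustness, while too small a σ saturates nearly everything.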

Table 2. Clustering results with NMF-Corr (σ = 0.5) for r = 8. Each row corresponds to
one of the eight clusters and each column to one of the 20 newsgroups (alt.atheism,
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware,
comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles,
rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space,
soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc,
talk.religion.misc); the entries give the number of documents of each newsgroup
assigned to each cluster, with up to three dominant newsgroups per cluster
highlighted in bold.

We show the document clustering results for NMF-Corr (σ = 0.5) in Table 2. Each
row represents a cluster and the numbers give the document counts in each cluster.
Highlighted in bold are up to three dominant newsgroups for each cluster. There are
fewer clusters than newsgroups, and we can see that the clusters often contain
documents from neighboring and related categories. This was expected, as the natural
number of clusters (20) is considerably higher than the selected number of clusters
(r = 8).

5 Conclusion

This paper introduced the correntropy measure for improved clustering via NMF. It
uses a correntropy-based objective function and compares the clustering performance
with that of the Euclidean Distance objective function. We applied this approach to
cluster the well-known 20-Newsgroups data set; the results show that minimizing the
negative correntropy yields better clustering quality than the EucD objective
function. We will also test the limits of our approach on other data sets in future
studies.

Acknowledgments. This work is supported by “The Council of Higher Education


(Turkey)” for the post-doctoral study, 2012.

References
1. Lee, D.D., Seung, H.S.: Learning the Parts of Objects with Nonnegative Matrix
Factorization. Nature 401, 788–791 (1999)
2. Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J.: Algorithms and
Applications for Approximate Nonnegative Matrix Factorization. Computational Statistics
and Data Analysis 52(1), 155–173 (2007)
3. Hoyer, P.O.: Non-negative Matrix Factorization with Sparseness Constraints. Journal of
Machine Learning Research 5, 1457–1469 (2004)
4. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. In: Proc. of
Advances in Neural Information Processing, vol. 13, pp. 556–562 (2001)
5. Schmidt, M.N., Olsson, R.K.: Single-channel Speech Separation Using Sparse Non-
negative Matrix Factorization. In: Proc. of Interspeech, pp. 2614–2617 (2006)
6. Xu, J.W., Bakardjian, H., Cichocki, A., Principe, J.C.: A New Nonlinear Similarity
Measure for Multichannel Biological Signals. In: Proc. of Int. Joint Conf. on Neural
Networks, Orlando, Florida, USA, August 12-17 (2007)
7. Fevotte, C., Idier, J.: Algorithms for Nonnegative Matrix Factorization with the β-
Divergence. Neural Computation 13(3), 1–24 (2010)
8. Choi, S.: Algorithms for Orthogonal Nonnegative Matrix Factorization. In: Proc. of the Int.
Joint Conf. on Neural Networks, Hong Kong, June 1-6, pp. 1828–1832 (2008)
9. Zhao, W., Ma, H., Li, N.: A Nonnegative Matrix Factorization Algorithm with Sparseness
Constraints. In: Proc. of the Int. Conf. on Machine Learning and Cybernetics, Guilin,
China
10. Schmidt, M.N., Winther, O., Hansen, L.K.: Bayesian Non-negative Matrix Factorization.
In: Adali, T., Jutten, C., Romano, J.M.T., Barros, A.K. (eds.) ICA 2009. LNCS, vol. 5441,
pp. 540–547. Springer, Heidelberg (2009)
11. Fevotte, C., Bertin, N., Durrieu, J.L.: Nonnegative Matrix Factorization with the Itakura-
Saito Divergence. Neural Computation 21, 793–830 (2009)
12. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J.: Document Clustering Using
Nonnegative Matrix Factorization. Int. Journal of Information Processing and
Management 42(2), 373–386 (2006)
13. Guillamet, D., Vitria, J., Schiele, B.: Introducing a Weighted Non-negative Matrix
Factorization for Image Classification. Pattern Recognition Letters 24, 2447–2454 (2003)
14. He, R., Zheng, W.S., Hu, B.G.: Maximum Correntropy Criterion for Robust Face
Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(8), 1561–
1576 (2011)
15. Liu, W., Pokharel, P.P., Principe, J.C.: Correntropy: Properties and Applications in Non-
Gaussian Signal Processing. IEEE Trans. on Signal Processing 55(11), 5286–5298 (2007)
16. Singh, A., Principe, J.C.: Using Correntropy as a Cost Function in Linear Adaptive Filters.
In: Proc. of Int. Joint Conference on Neural Networks, Atlanta, USA, June 14-19 (2009)
17. He, R., Hu, B.G., Zheng, W.S., Kong, X.W.: Robust Principal Component Analysis Based
on Maximum Correntropy Criterion. IEEE Trans. on Image Processing 20(6) (2011)
18. Chalasani, R., Principe, J.C.: Self Organizing Maps with Correntropy Induced Metric. In:
Proc. of Int. Joint Conf. on Neural Networks, Spain, pp. 1–6 (2010)
19. Jeong, K.H., Principe, J.C.: Enhancing the Correntropy MACE Filter with Random
Projections. Neurocomputing 72(1-3), 102–111 (2008)
20. Matlab Software by Mark Schmidt,
http://www.di.ens.fr/~mschmidt/Software/minConf.html

21. Berry, M.W., Gillis, N., Glineur, F.: Document Classification Using Nonnegative Matrix
Factorization and Underapproximation. In: Int. Symp on Circuits and Systems, Taiwan
(2009)
22. Zhao, W., Ma, H., Li, N.: A New Non-negative Matrix Factorization Algorithm with
Sparseness Constraints. In: Proc. of the 2011 Int. Conf. on Machine Learning and
Cybernetics, Guilin, July 10-13 (2011)
23. Lin, C.J.: Projected Gradient methods for Non-Negative Matrix Factorization. Neural
Computation 19, 2756–2779 (2007)
24. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Addison Wesley
(2006)
25. Paatero, P.: Least Squares Formulation of Robust Non-negative Factor Analysis.
Chemometrics and Intelligent Laboratory Systems 37, 23–35 (1997)
SOMM – Self-Organized Manifold Mapping

Edson Caoru Kitani1, Emilio Del-Moral-Hernandez1, and Leandro A. Silva2


1 University of Sao Paulo, Polytechnic School, Av. Prof. Luciano Gualberto,
travessa 3 nº 380, ZIP 05508-010, SP, Brazil
2 School of Computing and Informatics, Mackenzie Presbyterian University,
Rua da Consolação, 930, ZIP 01302-907, SP, Brazil
{ekitani,emilio}@lsi.usp.br,
prof.leandro.augusto@mackenzie.com.br

Abstract. The Self-Organizing Map (SOM) [1] proposed by Kohonen has
proved remarkable in its range of applications. It can be used for
high-dimensional space visualization, pattern recognition, input space
dimensionality reduction and for generating prototypes to extrapolate information.
Basically, the tasks conducted by the SOM method are closely related to input
space mapping that preserves the topological and metric relationships between
samples. These maps are meant to create a low-dimensional output representation
of a high-dimensional input space. Although SOM can create maps of more than two
dimensions, it is common to work with the limit of one or two
dimensions. This work presents a methodology named SOMM (Self-Organized
Manifold Mapping) that can be useful for discovering structures and clusters of an
input dataset, using the SOM map as a representation of the data distribution structure.

Keywords: Manifold Learning, Self Organizing Maps, Dimensionality Reduction.

1 Introduction
Based on the biological principles of organization, Kohonen, in developing his well-known
SOM [1], postulates that there are good reasons to have the following aspects
of organization: a) grouping similar stimuli minimizes neural wiring, b) it creates a
robust and logical structure in the brain, avoiding "crosstalk", c) from information
organized by attributes, a natural manifold structure can emerge from input patterns,
and d) it reduces dimensionality by creating representations (prototypes) that preserve
the neighborhood relationships between input patterns. Each representation, also known
as a prototype, retains the most important features that represent a group of input
patterns. It can be argued that patterns with high similarities are memorized and
retrieved from memory by the similarities with features of input patterns. The SOM map
thus tends to be a discretized nonlinear representation of the input distribution surface,
because the algorithm is based upon the Vector Quantization (VQ) [1] technique. In this
sense, SOM is characterized by mapping a high-dimensional input space to a low-dimensional
output space while preserving the input space topology. Although SOM can create maps of
more than two dimensions, it is usual to work with the limit of one or two

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 355–362, 2012.
© Springer-Verlag Berlin Heidelberg 2012

dimensions. However, there are some challenges in understanding the results of the SOM
map, since the neurons of the map have the same dimensionality as the original
space. Several visualization techniques to interpret and extract information from the
map have been proposed [2], [3], [4]. Nevertheless, most of them require some level of
human interpretation.
This paper proposes to address three aspects of SOM that have increasingly
appeared in the neural networks community and received attention: Manifold Learning,
Dimensionality Reduction and Intrinsic Dimension. Based on the assumption that the
map created by SOM is related to the input space distribution structure, we developed
a methodology that uses the neurons of the SOM map as support vectors for a
"walk" along the local manifold.
This paper is organized as follows: in Section 2, a brief review of manifold learning
and nonlinear dimensionality reduction techniques is presented. In Section 3, the
new proposal of how to interpret and understand the results of the SOM map based on
the input data manifolds is discussed. Section 4 presents some aspects of manifold
learning that are captured by the new methodology; Section 5 presents the computational
experiments using an artificial image database. Finally, Section 6 discusses the results
and concludes the paper.

2 Manifold Learning

Manifold learning algorithms can be considered a class of nonlinear dimensionality
reduction techniques. Fundamentally, manifold learning algorithms are based on the
assumption that the data set lies on a locally low-dimensional surface that is embedded
in a high-dimensional input space; the manifold can be defined as a topological
space that is locally Euclidean. The parameters related to the low-dimensional
manifold can be uncovered by a class of algorithms that explore the high-dimensional
space, projecting the data into a low-dimensional space. The projection can be
understood as principal surfaces on which the data points can lie.
Dimensionality reduction is embedded in manifold learning, because it is a tool
to explore the data set in the original space and then create a low-dimensional
representation. In the absence of previous knowledge of the original data structure, any
new assumption extracted from the data sample can be useful to indicate a possible
real structure. From the computational viewpoint, the field of dimensionality reduction
techniques and pattern prototyping is close to manifold learning methods. However,
from the Data Mining and Knowledge Discovery perspective, high-dimensional
data can be difficult to understand and interpret, due to our limited capability to
visualize data beyond three dimensions. Therefore, dimensionality reduction may be
necessary in order to discard redundancy and simplify further operations. Manifold
learning techniques such as Sammon's Nonlinear Mapping [5], ISOMAP [6], Locally
Linear Embedding (LLE) [7], as well as the Self-Organized Manifold Mapping
(SOMM) [8] addressed in this paper, belong to this category of nonlinear dimensionality
reduction methods. The main point about manifold learning methods is the
assumption that the input data lie on a low-dimensional manifold embedded in a high-
dimensional space. Therefore, we need to learn the underlying intrinsic manifold

geometry in order to address the problem of dimensionality reduction. Thus, instead
of seeking an optimum linear subspace, manifold learning methods try to discover
a locally defined embedding function that describes the intrinsic similarities of the
data.

3 Self-Organized Manifold Mapping

Several studies have provided insight into how to interpret the SOM
map [9], [10], [11]. One of the best-known tools in this regard is the U-Matrix [4],
which provides a quantitative summary of the topological relationships between similar
data samples. The resulting U-Matrix map is a complex image (colored or
monochromatic) whose peaks and valleys represent the Euclidean distances between
neighboring neurons. Essentially, the resulting map preserves the topological
distribution of the entire data sample in the input space.
However, to understand the relationship between the information captured in the
U-Matrix and the input samples, as well as to identify and explain the nature of the
groups or clusters defined by the manifolds, it would be helpful to represent all the
SOM neurons and their corresponding similarities and dissimilarities in the original
data space, instead of restricting such an analysis to neighboring neurons. Based on
the principle of the locally optimal pathway and the idea of navigating on the neurons
that compose the SOM, in [8] we proposed the SOMM algorithm, which seeks the
pathways or manifolds described by the standard SOM. This work extends the
understanding of how SOMM extracts information from the manifold.
Considering that the neurons are support vectors of the data distribution structure, we
assume the possibility of using these vectors to "walk" on the surface of the manifold
that they represent. The SOMM algorithm constructs a graph G = (V, E), taking the
neurons of the classical SOM map as the vertices V and assigning each edge E a
path-cost function f for connectivity. As pointed out above, a manifold can be defined
as a topological space that is locally Euclidean. Thus, the path-cost function f is
assigned a distance πk = d(i, j) from neuron i to neuron j.
Starting from an initial seed neuron, the idea is to find the locally shortest path
involving the seed neuron and all neurons rk of the SOM map, and to incrementally
create a list L = {r1, r2, ..., rk} that represents such a pathway. Basically, the SOMM
algorithm is similar to Dijkstra's algorithm with multiple sources [13]. However, a
main difference is that SOMM does not find the minimum sum of path-costs from
neuron i to m; SOMM finds the minimum local path-cost πk from neuron i to its
nearest neighbor j. The next path cost πk+1 will start from neuron j, and so on. The
SOMM algorithm considers that the pathway started from neuron i will always create
a path ahead under the constraint of not going back to the previous neuron. However,
the subgraph will end when path πk+1 indicates a neuron that already belongs to list L,
except via the last path πk.
The motivation for this proposal was the result obtained by the visual reconstruction
of the neurons of the SOM map from face images, as described in [12]. The
SOMM algorithm can be described through the steps presented in Table 1.

Table 1. SOMM Algorithm

a) Calculate the SOM composed of k neurons using Kohonen's algorithm.
Create a list A = {1, 2, 3, ..., k};
b) Calculate the Euclidean distance dij of all k neurons pairwise;
c) Create a k×k matrix with the pairwise distances between all k neurons;
d) Create list L = NULL. Set r = A[1], dmin = ∞, c = 0;
d.1) Insert L ← {r},
d.2) c = c + 1,
d.3) If A − L = ∅ go to (e);
d.4) Find d = min(dsr, s ∈ A − {L[c−1], L[c]}). Let s* be such that d = ds*r,
d.5) If s* ∈ L go to (e), else set r = s* and L ← {r}. Go to (d.2).
e) Loop:
e.1) If L[c] is already included in the list L go to (f),
e.2) Find d = min(dsr, s ∈ L − {L[c−1], L[c]}). Let s* be such that d = ds*r,
e.3) Insert L ← {s*}, c = c + 1.
f) Group neurons according to the order L[1], L[2], ...;
g) If A − L ≠ ∅ then:
g.1) A = (A − L) ∪ {L},
g.2) k = number of elements in A,
g.3) Go to step (b);
h) End.

A simple way to explain this algorithm is to view the output neurons, represented by the weights w_k computed by the SOM algorithm, as the nodes of a fully connected graph [13], [3] in the parameter space. Each edge in this graph has a cost given by the Euclidean distance between its ends. Therefore, the k×k matrix calculated in steps (b)–(c) is a symmetric matrix holding the edge costs of the graph.

Fig. 1. Two connected pathways with a common loop. The first one starts at neuron 1 and finally enters the loop (11) → (3) → (4) → (11), whereas the second path starts at neuron (13) and ends in the same loop.

SOMM – Self-Organized Manifold Mapping 359

More specifically, in step (d), a list L is created and, in step (d.1), the algorithm inserts each visited neuron in L. Given a neuron r, step (d.4) seeks the closest neuron s* such that s* ∉ L − {L[c−1], L[c]}; that is, s* does not belong to the last visited edge of the graph. This step implements a greedy algorithm that makes the locally optimal choice at each stage, generating a locally optimal pathway that connects a subset of SOM neurons. This is necessary because the idea is to generate a pathway that crosses different clusters without losing the notion of similarity in the parameter space. If s* ∈ {L}, then we have a loop, as exemplified in Figure 1. In this case, the pathway that starts at neuron 1 ends in the loop (11)→(3)→(4)→(11). Step (e) completes the pathway, which, in Figure 1, is composed of the sequence: L = (1)→(2)→(11)→(3)→(4)→(11).
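The pathway construction described above can be sketched in a few lines of Python. This is our own illustrative rendering of steps (b)–(e) for a single pathway, not the authors' code; the function and variable names are ours, and we assume the trained SOM weights are given as a NumPy array:

```python
import numpy as np

def somm_pathway(weights, seed=0):
    """Greedy, locally shortest walk over SOM neurons (our sketch of the
    SOMM pathway construction, steps (b)-(e); names are ours)."""
    k = len(weights)
    # Steps (b)-(c): symmetric k x k matrix of pairwise Euclidean distances.
    d = np.linalg.norm(weights[:, None, :] - weights[None, :, :], axis=-1)
    path = [seed]
    while True:
        r = path[-1]
        # Exclude the last visited edge {L[c-1], L[c]} so the walk moves ahead.
        banned = set(path[-2:])
        candidates = [s for s in range(k) if s not in banned]
        if not candidates:
            break
        s_star = min(candidates, key=lambda s: d[s, r])
        path.append(s_star)
        if s_star in path[:-1]:   # the pathway closed a loop: stop (step (e))
            break
    return path
```

On a chain of collinear prototypes the walk simply traverses the chain until it closes a loop, mirroring the behavior illustrated in Figure 1.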

4 Manifold Learning by SOMM

To clarify how the SOMM methodology works, and in particular two properties of the data that it can extract, we present the capability of SOMM to find clusters and the intrinsic dimensionality in an unsupervised way. Using two distinct sets of artificial images, 10 ellipses and 10 squares, with a rotation of 10 degrees between consecutive images in each set, a SOM map with 20 neurons in a hexagonal 5×4 layout was trained using the Matlab® SOM Toolbox [14]. Each artificial image has a resolution of 229×229 pixels, with a white background and black lines.
The idea is to show how the standard SOM map is useful to extract the relationship (through rotation) between the sample images in the training set. Indeed, at first, clusters are not directly visible in the SOM map. The SOMM methodology applies the concept of manifolds, in which the Euclidean metric is locally valid. PCA (Principal Component Analysis) was applied to the training set before the SOM processing, in order to reduce memory and computation time. The projection matrix of the data in the PCA space was used for training the SOM network. However, when we map the neurons back to the visual space (through reverse mapping from the PCA space to the image space), the clusters are not so obvious. Figure 2(B) shows the visual reconstruction of each neuron of the SOM map of Figure 2(A) using the Backward Visualization method described in [12]. The images that are not within a red square frame do not represent any sample of the input space; they are called interpolating units and were not associated with any input data at the end of the training phase.
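The PCA pre-processing and the reverse mapping used for Backward Visualization can be sketched as follows. This is a generic PCA-by-SVD sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA via SVD of the centered data matrix (a generic sketch of the
    pre-processing described in the text; not the authors' code)."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]          # principal axes as rows

def to_pca_space(X, mean, components):
    """Project data into the reduced PCA space used to train the SOM."""
    return (X - mean) @ components.T

def backward_visualization(w, mean, components):
    """Reverse mapping: take a neuron weight vector from PCA space back
    to the original (image) space, as in the Backward Visualization idea."""
    return w @ components + mean
```

Training the SOM on `to_pca_space(X, ...)` and applying `backward_visualization` to each neuron weight would yield images analogous to those in Figure 2(B).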
In Figures 2(A) and 2(B), the neurons 5, 9, 15, and 19 of the SOM map are observed to represent the squares, but the visual reconstruction does not convey this sense of similarity. As the SOM works by Vector Quantization (VQ), neurons are representations of the regions of the input space occupied by the samples of the training set; the projections are extrapolations of input data that do not exist in the training set. As mentioned before, manifold learning methods try to discover a locally defined embedding function that describes the intrinsic similarities of the data. Although the visual reconstruction in Figure 2(B) is illustrative, it does not provide clear visual evidence that the clusters were correctly determined. To better assess these clusters, we applied the SOMM methodology; the visual result of the clustering it finds can be evaluated in Figure 2(C). All neurons embedded in a red rectangle represent a prototype. The remaining neurons did not become prototypes at the end of the training phase and do not represent any sample. Yet, for certain applications, they can be considered interpolation units between two neurons. It can be observed that SOMM captured a dynamic of movement embedded in the training set; this dynamic is intrinsic to the manifold.

Fig. 2. In (A), the SOM map after training with the ellipse and square image database. In (B), the visual reconstruction of each neuron of the SOM map, and in (C), the clusters formed by the SOMM methodology using the SOM map. All the images inside a red square are considered prototypes. All the numbers and labels were placed manually.

5 Experimental Evaluation with Face Image Database

The computational experiment conducted with the artificial dataset showed that SOMM can extract information from the SOM map and reorder it considering local properties in terms of Euclidean distance. This information is related to some level of similarity and an adjacency relation between neighboring neurons. This new adjacency relation connects the neurons differently from the relation defined by the initial SOM training. In other words, SOMM cuts the initial neighborhood relationship between neurons and creates a new one, based on the final structure learned from the training set.
This means that SOMM seeks local pathways on the manifold formed by all neurons, based on the assumption that the neurons are distributed along the input data structure. In [6], a computational experiment with ISOMAP was proposed using a set of 698 face images artificially generated to represent different conditions in terms of pose, illumination, and lighting direction. The purpose was to evaluate the capability of ISOMAP to discover the intrinsic dimensionality embedded in that face database. This work uses the same database and conducts an experiment to evaluate the manifold found by SOMM, using the SOM map as a reduced embedded structure of the input data set. To reduce the computational effort, a PCA pre-processing was applied to the face database; the final projection in PCA space created a new 698×697 matrix. As all the techniques used in this work rely on distance measurements, the PCA pre-processing is important in order to obtain normalized data and avoid scale effects. The SOM map was defined to have 48 neurons in ℝ^697 and, after the training phase, Backward Visualization [12] is applied. Figure 3 (left) shows the nearest face image from the data set for each neuron of the SOM map. There is one neuron (39) that did not become a prototype; however, its “ghost” image can be considered an interpolation image for that region.

Fig. 3. On the left, the neurons of the SOM map (8×6) are replaced by their nearest face image from the training dataset. Six clusters found by the SOMM methodology are shown on the right. The white track lines in the SOM map represent the first pathway of the manifold, which corresponds to cluster 1, and the face images along that pathway. The remaining clusters represent different regions of the manifold.

Additionally, the remaining clusters represent different regions of the manifold and can be understood as regions where similar images lie. As each neuron represents a group of input samples, a cluster can be seen as a visualization of that region of the manifold surface, showing which input data samples are distributed along that surface and how. Not only was the direction of the pose clustered, but also illumination differences, up-down head position, and smooth changes across the images. However, cluster number six seems to have a discontinuity. Analyzing the first three principal components, which carry 60.1% of the global variance, we observed that the projections of that group of neurons belong to an overlap region; it can be understood as a region lying on some cluster borders. This is a different approach compared with ISOMAP [6] or equivalent methods that try to create a new, reduced mapping of the original manifold. It can be observed from the images that they are near a local manifold, based on their distribution.

6 Conclusion and Discussion

This research proposes a methodology (SOMM) to discover the structure of a multi-dimensional input space based on the manifold concept. No previous assumption about the structure of the data samples is made, and SOMM finds the manifold surface in an unsupervised way. The method is not only able to identify and explain the nature of the clusters defined by the SOM manifolds, but also to represent all the SOM neurons and their corresponding similarities and dissimilarities in the original data space. To describe the possible self-organized pathways to navigate in the high-dimensional and sparse image space (faces, in one of the data sets explored here), we constructed a neighborhood graph of the SOM neurons based on the principle of the locally optimal path. Such a graph visualization method explicitly provides information about the number of clusters that describe the sample data under investigation, as well as the specific features extracted and explained by them. We believe that the methodology proposed here might be a useful tool in SOM analysis, providing an intuitive explanation of the topologically constrained manifolds modeled by the SOM and highlighting latent variables such as left-right pose, up-down pose, and light direction, in all combinations of head pose.

Acknowledgements. The authors thank Dr. Gilson A. Giraldi and Dr. Carlos E. Thomaz for their important and helpful discussions and collaboration during the previous work in [8] and [12].

References
1. Kohonen, T.: Self-organization and associative memory. Springer-Verlag New York, Inc.,
New York (1989)
2. Mayer, R., Rauber, A.: Visualising Clusters in Self-Organising Maps with Minimum
Spanning Trees. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010, Part II.
LNCS, vol. 6353, pp. 426–431. Springer, Heidelberg (2010)
3. Pölzlbauer, G., Rauber, A., Dittenbach, M.: Graph projection techniques for self-organizing
maps. In: ESANN 2005 European Symposium on Artificial Neural Networks, Bruges, pp.
533–538 (2005)
4. Ultsch, A.: Maps for the visualization of high-dimensional data spaces. In: Proc. of Work-
shop of Self Organizing Maps, pp. 225–228 (2003)
5. Sammon Jr., J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on
Computers C-18(5), 401–409 (1969)
6. Tenenbaum, J.B., Silva, V.D., Langford, J.C.: A global geometric framework for nonlinear
dimensionality reduction. Science Magazine 290, 2319–2323 (2000)
7. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by local linear embedding.
Science Magazine 290, 2323–2326 (2000)
8. Kitani, E.C., Del-Moral-Hernandez, E., Giraldi, A.G., Thomaz, C.E.: Exploring and under-
standing the high dimensional and sparse image face space: A self organized manifold
mapping. New approaches to characterization and recognition of faces, pp. 225–238. In-
tech Open Access Publisher (2011)
9. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic cluster detection in Kohonen’s SOM.
IEEE Transactions on Neural Networks 19(3), 442–459 (2008)
10. Bauer, H.U., Pawelzik, K.R.: Quantifying the neighborhood preservation of Self-
Organizing Feature Maps. IEEE Transactions on Neural Networks 3(4), 570–579 (1992)
11. Kiviluoto, K.: Topology preservation in self-organizing maps. In: IEEE International Con-
ference on Neural Networks, vol. 1, pp. 294–299 (1996)
12. Kitani, E.C., Del-Moral-Hernandez, E., Thomaz, C.E., Silva, L.A.: Visual interpretation of
Self Organizing Maps. In: IEEE-CS 11th Brazilian Symposium on Neural Networks
(SBRN), pp. 37–42. São Bernardo do Campo (2010)
13. Cormen, T.H., et al.: Introduction to algorithms, 2nd edn. MIT Press (2001)
14. Vesanto, J., et al.: SOM Toolbox for Matlab 5. Helsinki University of Technology, Helsin-
ki, pp. 1-60. Report A57 (2000)
Self-Organizing Map and Tree Topology
for Graph Summarization

Nhat-Quang Doan, Hanane Azzag, and Mustapha Lebbah

Université Paris 13, Sorbonne Paris Cité


Laboratoire d’Informatique de Paris-Nord (LIPN)
CNRS (UMR 7030) 99, av. J-B Clément - F-93430 Villetaneuse
{nhat-quang.doan,Hanane.Azzag,
Mustapha.Lebbah}@lipn.univ-paris13.fr

Abstract. In this paper, we present a novel approach called SOM-tree to summarize a given graph into a smaller one, using a new decomposition of the original graph. The proposed approach simultaneously provides a topological map and a tree topology of the data using self-organizing maps. Unlike other clustering methods, the tree structure aims to preserve the strengths of the connections between graph vertices. The hierarchical nature of the summarization data structure is particularly attractive. Experiments on real data sets, evaluated by Accuracy and Normalized Mutual Information, demonstrate the good performance of SOM-tree.

Keywords: Self-Organizing Map, hierarchical clustering, tree topology, graph


summarization.

1 Introduction
Large graphs are used in several applications such as social networks, social relations, interactions between proteins, etc. Currently, large graphs correspond to real-world data with millions of nodes and edges. Thus, there is an increasing need for graph summarization, for reasons related to either privacy protection or space efficiency. It sometimes makes sense to replace the original graph with a summary, which removes some details from the graph. Most works in graph summarization reduce the number of nodes by grouping similar nodes into the same clusters [1]. Other works propose compression techniques that consist of choosing super-nodes and super-edges in the original graph [2]. In graph compression methods, nodes are grouped based on the similarity of their relationships to other nodes, not by their (direct) mutual relations [2]. The process of summarization removes some information from the original large graph to simplify its use. To address this problem, we propose SOM-tree to summarize large graphs into smaller graphs. Our idea is to depict graphs in a 2D map using a Self-Organizing Map (SOM), where each cell of our topological map consists of a tree-like structure. SOM-based methods try to preserve the topological properties of the input space; here, we are also interested in the tree-like topology of the data gathered in each cell [3]. Thus, it is possible to summarize large graphs into topological and tree-like organizations. The preliminary results encourage further evaluation with additional (and more sophisticated) graph datasets.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 363–370, 2012.

© Springer-Verlag Berlin Heidelberg 2012
364 N.-Q. Doan, H. Azzag, and M. Lebbah

There is a large literature on graph clustering algorithms that can be applied to graph summarization, because they tend to group similar nodes. We note important work on community detection [4,5], including approaches based on modularity [6,7]. These graph clustering algorithms have been used to detect community structures (dense subgraphs) in large networks and are based only on node connectivity. Other graph summarization methods [8,9] use simple statistical approaches to represent graph properties; this type of summary contains little information. In fact, statistical measures are not sufficient to understand the information lying within the graph structure and content.
In this paper, we present a new approach named SOM-tree, which combines a topological and a hierarchical organization for graph summarization. The proposed method provides a summarized graph that contains the same number of nodes but fewer edges than the original graph. Experts can observe not only the general topological view of the nodes of the graph, but also a hierarchical view of a particular part of the summarized graph.
The remainder of this paper is organized as follows: Section 2 presents the graph summarization model. Section 3 is dedicated to the experiments conducted on three graph datasets. The paper ends with Section 4, which contains conclusions and possible future work.

2 Self-Organizing Tree Model: SOM-Tree

Our model seeks an automatic clustering that provides a new summarization of a graph using a self-organizing map. The main idea of our algorithm is to cluster the nodes of the graph while creating a new, clear structure. In order not to lose the main structure of the graph, data in each cluster are arranged in a tree-like structure, where each node of the tree represents an observation, i.e., a vertex of the original graph. In other words, our algorithm simultaneously clusters and decomposes the input graph into a 2D map whose cells carry tree-like structures. The proposed structure is particularly attractive because it allows for an immediate comparison between the graph and the summarized graph. It can be explored by descending from the topological structure to the last level of the hierarchical structure, which provides information about the summarized graph as well as its topology.
As in a traditional self-organizing map, we assume that the lattice C has a discrete topology defined by an undirected graph; usually, this graph is a regular grid in one or two dimensions. We denote by K the number of cells in the map C. For each pair of cells (c, r) on the map, the distance δ(c, r) is defined as the length of the shortest chain linking cells r and c. If the cells are associated with trees, this leads to the concept of a hierarchical structure. In this lattice, each unit c ∈ C represents a tree structure denoted by tree_c. The mutual influence between cells is defined by the function K^T(δ(c, r)) = exp(−δ(c, r)/T), where T is the temperature that controls the size of the neighborhood.
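As a concrete sketch, the lattice distance δ and the neighborhood function K^T can be computed as below. Using the Manhattan distance for δ on a rectangular grid is our own simplifying assumption (the lattice variant is not fixed by the text), and the names are ours:

```python
import numpy as np

def grid_delta(rows, cols):
    """delta(c, r): length of the shortest chain between two cells of a
    rectangular lattice (Manhattan distance; our simplifying choice)."""
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)])
    return np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)

def neighborhood(delta, T):
    """K^T(delta(c, r)) = exp(-delta(c, r) / T); T controls the size of
    the neighborhood."""
    return np.exp(-delta / T)
```

For a 2×2 map, for example, diagonally opposite cells are at δ = 2, and their mutual influence decays as exp(−2/T).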
Let G(V, E) be an input graph, where V is a set of n vertices and E a set of edges. A direct link between two vertices v_i ∈ V and v_j ∈ V creates an edge (v_i, v_j) ∈ E. Following spectral graph theory [10,11], we take the eigenvectors associated with the d smallest eigenvalues of the Laplacian matrix. Instead of an n × n dimensional space, we project the n vertices into a d-dimensional space. An element x_i = (x_i^1, x_i^2, ..., x_i^d) ∈ X represents the vertex v_i ∈ V. Since the graph G is now mapped into the feature space X, G can be clustered and visualized using our self-organizing process.
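The spectral projection can be sketched with NumPy as follows. The use of the unnormalized Laplacian L = D − A is our assumption, since the text cites spectral graph theory [10,11] without fixing the variant, and the function name is ours:

```python
import numpy as np

def spectral_embedding(A, d):
    """Embed the n vertices of a graph with adjacency matrix A into R^d
    using the eigenvectors of the d smallest eigenvalues of the graph
    Laplacian L = D - A (unnormalized variant; our assumption)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so the first d columns of vecs are the sought eigenvectors.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :d]
```

For a connected graph, the first column is the constant eigenvector of eigenvalue 0; subsequent columns carry the structural information used by the clustering step.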

Within a cluster, a hierarchical tree is generated. The principle of the tree construction is inspired by the self-assembly rules of the AntTree method [3]. Connecting a node to another in the tree structure is based on their similarity. We use the following notation:

– tree_r: the tree associated with cell r. Each node of the tree represents a node x_i of the input graph.
– tree_{x_i}: the subtree rooted at x_i, together with all nodes recursively connected to it (tree_{x_i} ⊂ tree_r).
– w_r: the representative vector (or prototype) of cell r.

The objective function of the self-organizing map using the tree structure is written as follows:

R(φ, W) = Σ_{c=1}^{K} Σ_{x_i ∈ tree_c} Σ_{r=1}^{K} K^T(δ(φ(tree_{x_i}), r)) ‖x_i − w_r‖²    (1)

where W = ∪_{r=1}^{K} {w_r} and φ is the assignment function of a tree.


It is clear that using the statistical characteristics provided by the trees (associated with each cell) allows accelerating the convergence of the algorithm. Minimizing the cost function R(φ, W) is a combinatorial optimization problem. In this work we propose to minimize the cost function with an iterative process in three steps:
1. Assignment step: assuming that W is fixed, we minimize R(φ, W) with respect to φ. To take advantage of the tree structure, the assignment function is defined as follows:

∀x_j ∈ tree_{x_i},  φ(tree_{x_i}) = φ(x_j) = φ(x_i)

φ(x_i) = arg min_r Σ_{c=1}^{K} K^T(δ(r, c)) ‖x_i − w_c‖²    (2)
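For a fixed W, the assignment step (2) can be sketched in vectorized form. The function name and the matrix layout of K^T are our own assumptions:

```python
import numpy as np

def assign(X, W, K_T):
    """Assignment step (2): each datum (and hence the subtree rooted at it)
    goes to the cell r minimizing the neighborhood-weighted distortion.
    K_T[r, c] holds K^T(delta(r, c)); names and layout are ours."""
    # dist2[i, c] = ||x_i - w_c||^2
    dist2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)
    # cost[i, r] = sum_c K^T(delta(r, c)) * ||x_i - w_c||^2
    cost = dist2 @ K_T.T
    return cost.argmin(axis=1)
```

With K_T equal to the identity (no neighborhood influence), this reduces to nearest-prototype assignment, as expected from (2).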

2. Tree construction: In this step we seek the best position of a given datum x_i (v_i ∈ V) in the tree_c associated with cell c. We use connection/disconnection rules inspired by AntTree [3]. The particularity of the obtained tree is that each node, whether a leaf or an internal node, represents a datum x_i. Here, N_{x_i} denotes the node associated with the datum x_i, N_pos represents the current node of the tree, and N_{x_i+} the node connected to N_pos that is the most similar (closest by distance) to N_{x_i}. T_Dist(N_pos) is the highest distance value observed among the local neighborhood of N_pos; x_i is connected to N_pos if and only if the connection of N_{x_i} further increases this value:

T_Dist(N_pos) = max_{j,k} ‖N_{x_j} − N_{x_k}‖² = max_{j,k} ‖x_j − x_k‖²    (3)

In other words, the connection rules compare a node N_{x_i} to its nearest node N_{x_i+}. If the two nodes are sufficiently far apart (‖N_{x_i} − N_{x_i+}‖² > T_Dist(N_pos)), then the node N_{x_i} is connected to its current position N_pos.
366 N.-Q. Doan, H. Azzag, and M. Lebbah

Otherwise, the node N_{x_i} associated with the datum x_i is moved toward the nearest node N_{x_i+}. The value T_Dist thus increases with each node connected to the tree; in fact, each connection of a datum x_i implies a local minimization of the corresponding T_Dist value, and therefore of the cost function (1). At the end of the tree construction step, each cell c of the map C is associated with a tree_c. The connection rules are based on the nearest-neighbor approach: each datum is connected to its nearest neighbor.
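The connection/disconnection test can be sketched as follows. This is a simplified, AntTree-inspired rule under our own naming, with T_Dist passed in as a precomputed scalar rather than maintained incrementally:

```python
import numpy as np

def try_connect(x_i, pos_children, t_dist):
    """AntTree-inspired rule (simplified sketch): connect N_{x_i} under the
    current node N_pos when it is farther from its closest sibling than
    T_Dist(N_pos); otherwise move toward that sibling N_{x_i+}."""
    if not pos_children:                 # nothing to compare against yet
        return "connect", None
    d2 = [np.sum((x_i - c) ** 2) for c in pos_children]
    nearest = int(np.argmin(d2))
    if d2[nearest] > t_dist:             # dissimilar enough: new branch
        return "connect", None
    return "move", nearest               # too similar: descend toward N_{x_i+}
```

A full tree builder would call this repeatedly, updating the current position on a "move" result and updating T_Dist after each connection.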

3. Optimization step: assuming that φ is fixed, this step minimizes R(φ, W) with respect to W in the space R^d. It is easy to see that this minimization allows defining the prototype of each tree_c using expression (4). Instead of centroids, we choose leaders [12] as the representative vertices for all the vertices of a graph. The leader is the node v_i ∈ V (x_i ∈ X) that has the highest degree in the original graph G:

L_r = arg max_{x_i ∈ tree_r, v_i ∈ V} deg(v_i)    (4)

where the local degree deg(v_i) is the number of internal edges incident to v_i in the original graph G.
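The leader selection of expression (4) reduces to a maximum over vertex degrees. A minimal sketch (the function name and the degree mapping are our own assumptions):

```python
def pick_leader(tree_vertices, degree):
    """Optimization step (4): the representative (leader) of a cell is the
    member vertex with the highest degree in the original graph G.
    `degree` maps a vertex id to its degree; names are ours."""
    return max(tree_vertices, key=lambda v: degree[v])
```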

3 Experimental Results
In this section, we performed experiments to evaluate the performance of our method on real graph data. We first compare the efficiency of different clustering algorithms: MST (Minimum Spanning Tree), which builds tree structures [13], and SOM. In the case of SOM we adopt the same initial parameters. Here we fix d = √n for each dataset to reduce the number of dimensions. The map size is respectively 3 × 3 (K = 9 clusters) for ”Adjective and noun”, 5 × 3 (K = 15) for ”Football Teams”, and 5 × 5 (K = 25) for ”Political blogs”. The motivation is to choose a map size larger than the number of classes, so that the map is able to cover the graph space.
To evaluate the performance, we selected two criteria, each of which should be maximized: Accuracy and Normalized Mutual Information [14]. The studied databases are presented in Table 1 and are available at http://www-personal.umich.edu/~mejn/netdata/. Results have been averaged over 10 runs.
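Both criteria can be computed from the contingency table between true classes and found clusters. Below is a self-contained sketch in which Accuracy is taken as cluster purity and NMI uses the sqrt-normalization of [14]; these interpretations, and all names, are our assumptions:

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Class-by-cluster count matrix M."""
    _, ci = np.unique(labels_true, return_inverse=True)
    _, ki = np.unique(labels_pred, return_inverse=True)
    M = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(M, (ci, ki), 1)
    return M

def accuracy(labels_true, labels_pred):
    """Cluster purity: each cluster votes for its majority class."""
    M = contingency(labels_true, labels_pred)
    return M.max(axis=0).sum() / M.sum()

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information, I(U;V)/sqrt(H(U)H(V)) as in [14]."""
    M = contingency(labels_true, labels_pred)
    pij = M / M.sum()
    pi = pij.sum(axis=1, keepdims=True)
    pj = pij.sum(axis=0, keepdims=True)
    nz = pij > 0
    mi = (pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum()
    hu = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
    hv = -(pj[pj > 0] * np.log(pj[pj > 0])).sum()
    return mi / np.sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0
```

A perfect clustering (up to a relabeling of the clusters) yields Accuracy = NMI = 1.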

Table 1. Data Features

Datasets            # Nodes  # Edges  # Classes
Adjective and Noun  112      425      2
Football Teams      115      616      10
Political blogs     1490     19090    2

First of all, the quality of the clustering depends on the choice of d, but in this paper we do not attempt to optimize the value of d or the map size, nor to study their influence on the quality measures. Looking at the columns associated with SOM-tree (Self-Organizing Map-Tree) in Table 2,

Table 2. Performance of the quality criteria on labelled graph datasets

Method d Accuracy NMI


Adjective and noun (3 × 3)
SOM 11 0.576 0.072
MST 11 0.562 0.280
SOM-tree 11 0.560 0.134
Football Teams (5 × 3)
SOM 11 0.406 0.544
MST 11 0.400 0.313
SOM-tree 11 0.878 0.685
Political Blogs (5 × 5)
SOM 39 0.827 0.056
MST 39 0.512 0.141
SOM-tree 39 0.767 0.178

we can notice that our approach provides comparable or better results. Our purpose in this comparison is not to assert that our method is the best, but to show that SOM-tree can obtain results as good as other clustering algorithms, with the additional information provided by the topological and hierarchical organization.
We are interested in analyzing the connections between pairs of nodes in the new representation. For this purpose, we explore the structure of the graph by analyzing the added and/or deleted direct links between pairs of nodes of the original graph G. We show how our method simplifies the exploration of the original graph by offering a user-friendly summarization and visualization. We use Tulip [15] as the framework to visualize and analyze the graphs. Using the GEM layout, we provide two types of architecture for visualizing a graph:

1. Default: the graph is drawn from the original collection of edges and vertices. To discriminate leaders from other nodes, the size of a node depends on its degree.
2. Summarized: here we propose a new organization of the graph, which has fewer edges than the default organization and is therefore easier to interpret. The new graph is drawn from the graph nodes as well as the map nodes; its structure combines hierarchy and topology. The map nodes, in black, are located in the center, surrounded by trees. There are as many trees as cells in the map.

We note that a leader node is represented in the graph by a big node with the symbol L. Every cluster is represented by one leader and one color.

Case of ”Adjective and Noun”: In the multi-level representation of Figure 1, SOM-tree reveals six major trees and three single-node trees for the 112 nodes. This graph has a complicated set of edges, which leads to low purity values: a noun (blue color) does not always follow an adjective (pink color). Observing the results, three clusters (at the top of the map) contain less than 3% of the nodes of the original graph, and the other clusters are very big compared to these three smallest ones.

(a) Default graph (b) Summarized graph

Fig. 1. Graph visualization of ”Adjective and noun”. Each cluster is indicated by a color.

Case of ”Football”: Unlike the others, this dataset has a more balanced distribution of data. The ”Football” dataset has 115 nodes classified into 10 different classes; the obtained visualization is shown in Figure 2. The number of nodes is quite balanced across the cells, and most nodes belonging to the same class are grouped in one cluster. We also observe several pure clusters, or trees, represented by leaders L1, L2, and L6. In this case, the differences between the proposed decomposition and the ground truth are not important. As additional information, the SOM-tree organization and the original graph share several links: we observe that 80% of the direct edges in the SOM-tree organization are also direct links in the original graph.

(a) Default graph (b) Summarized graph

Fig. 2. Graph visualization of ”Football”. Each cluster is indicated by a color.

Case of ”Political Blogs”: In Figure 3, we illustrate the 1490 nodes in a 5 × 5 map with several major trees and some minor trees. A great reduction in graph connectivity (from 19090 edges in the original graph to 1439 edges in the SOM-tree organization) can lead, in this situation, to clusters that are better adapted to visual exploration. Indeed, the SOM-tree organization contains only 4% of the direct edges of the original graph, whereas the purity value is about 0.85. SOM-tree eliminates external direct links (between two clusters) and replaces internal direct links by building a path between the corresponding nodes in the same tree. The SOM-tree organization also provides a decomposition of the graph that allows better interaction with the data (Figure 3(b)). The external interactions are shown by the topological map; it is very difficult to visualize the interactions between the nodes in the original graph.

(a) Default graph (b) Summarized graph

Fig. 3. Graph visualization of ”Political blogs”. Each cluster is indicated by a color.

After studying the obtained summarization, we notice that the visual results given by our method lead to important insights into the graph content. Our approach improves the standard decomposition and visualization by building a topological map and a tree topology of the data. Atypical nodes are clearly pinpointed with this approach and can be studied further by the analyst. The summarization provides a clear visualization in which analysts can easily navigate. A hierarchical visual exploration, from the topological level down to the last level of the trees, provides useful graph information.

4 Conclusion
In this paper, we have proposed a summarization of graphs using a Self-Organizing Map and a new hierarchical clustering. This novel method provides a new look at self-organizing models, allowing a better visualization of the graph organization. The obtained graph determines both the hierarchical distribution of the nodes and its structured topology. The experiments show that SOM-tree works well on several real-world datasets. In practice, our model reduces the number of edges and provides a tree-like graph for every cell of the map (grid). Furthermore, our model offers a user-friendly visualization space consisting of both tree structures and a topological map. The benefit of our approach is to allow the user fast data analysis using a smaller graph.

As future work, we will study the influence of the number of selected eigenvectors and investigate incremental graph summarization, showing how sub-graphs evolve over time. Another perspective is to study biological and biomedical graphs, where several problems involving expressed genes can be addressed with the hierarchical structure.

Variable-Sized Kohonen Feature Map
Probabilistic Associative Memory

Hiroki Sato and Yuko Osana

Tokyo University of Technology,
1404-1 Katakura Hachioji, Tokyo, Japan
osana@cs.teu.ac.jp

Abstract. In this paper, we propose a Variable-sized Kohonen Feature Map Probabilistic Associative Memory (VKFMPAM). The proposed model can realize probabilistic association for training sets including one-to-many relations, and neurons can be added in the Map Layer if necessary. We carried out a series of computer experiments and confirmed the effectiveness of the proposed model.

Keywords: Self-Organizing Map (Kohonen Feature Map), Probabilistic Association, Successive Learning, One-to-Many Associations.

1 Introduction

In the real world, it is very difficult to obtain all the information to be learned in advance, so we need models that can realize successive (additional) learning. However, most conventional neural network models cannot realize successive learning. As models that can realize successive learning, some associative memories based on the Kohonen Feature Map (KFM)[1] have been proposed[2]–[6]. Although most of them can realize one-to-many associations[4]–[6] and probabilistic associations[5], their storage capacities depend on the number of neurons in the Map Layer, and they cannot learn more new patterns than their original storage capacities allow.

In this paper, we propose the Variable-sized Kohonen Feature Map Probabilistic Associative Memory (VKFMPAM). This model is based on the conventional Improved KFM Probabilistic Associative Memory based on Weights Distribution[5] and the Variable-sized KFM Associative Memory with Refractoriness based on Area Representation[6]. In the proposed model, probabilistic association for training sets including one-to-many relations can be realized. Moreover, weight-fixed and semi-fixed neurons are introduced so that new patterns can be memorized, and when unknown patterns are given, neurons can be added in the Map Layer if necessary.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 371–378, 2012.
© Springer-Verlag Berlin Heidelberg 2012

2 Variable-Sized Kohonen Feature Map Probabilistic Associative Memory

The proposed Variable-sized Kohonen Feature Map (KFM) Probabilistic Associative Memory is based on the conventional Improved KFM Probabilistic Associative Memory based on Weights Distribution[5] and the Variable-sized KFM Associative Memory with Refractoriness based on Area Representation[6].

2.1 Structure
Figure 1 shows the structure of the proposed model. As shown in Fig.1, the proposed model has two layers: (1) the Input/Output Layer and (2) the Map Layer, and the Input/Output Layer is divided into several parts. In the proposed model, neurons can be added in the Map Layer if necessary, so the distance between neurons in the Map Layer is not uniform.

2.2 Learning Process

In the proposed model, if enough area for a new learning pattern cannot be taken, some neurons are added in the Map Layer. The learning procedure of the proposed model is as follows.

(1) The connection weights are initialized randomly. In the proposed model, the
initial Map Layer has xmax × ymax neurons.
(2) The Euclidean distances between the input vector X (p) and the weight vector
W i , d(X (p) , W i ) are calculated for all neurons in the Map Layer.
(3) If d(X (p) , W i ) > θt is satisfied for all neurons in the Map Layer, the input
pattern X (p) is regarded as the unknown pattern, and go to (4). Otherwise,
the input pattern is regarded as one of the known patterns, and go to (8).
(4) The center neuron of the learning area is selected.
(4-1) If there is no weight-fixed neuron, the neuron c whose Euclidean distance
is shortest is selected as the center of the learning area, and go to (4-3).
Otherwise go to (4-2).
(4-2) Whether the area for the input pattern X (p) can be taken without over-
lapping to the areas for the stored patterns is checked. For the weight-
fixed neurons z, if

d^{max} D_{iz} + d^{min}_z D_{zi} < d_{iz}    (1)

[Figure 1: Input/Output Layer (divided into parts) connected to the Map Layer]
Fig. 1. Structure of Proposed Model



is satisfied, the neuron i can be a center of the learning area. Here, d^{max} is the maximum distance between adjacent neurons, and d^{max} = 1. d^{min}_z is the distance from the weight-fixed neuron z to the nearest neuron. And D_{iz} is the constant which decides the area size whose center is the neuron i in the direction of the neuron z. In the proposed model, the real moving radius of the area whose center is the neuron z is given by d^{min}_z D_{zi}. d_{iz} is the distance between the neuron i and the weight-fixed neuron z.
If there are some candidate neurons, go to (4-6). Otherwise, go to (4-3).
If there are some candidate neurons, go to (4-6). Otherwise, go to
(4-3).
(4-3) Whether the area for the input pattern X^{(p)} can be taken without overlapping the areas for the stored patterns, when the distance between neurons in the area for the stored patterns is reduced to φ_n(d^{min}_z), is checked. Here, φ_n(·) is given by

φ_n(d) = { d/√(2^n),  (d/√(2^n) ≥ d^{min})
         { d^{min},   (otherwise)                   (2)

where n is the number of checks in (4-3) and d^{min} is the minimum distance between adjacent neurons. In the area for the input pattern, the distance between adjacent neurons is set to φ_{n−1}(d^{max}).
If, for all weight-fixed neurons z,

φ_{n−1}(d^{max}) D_{iz} + φ_n(d^{min}_z) D_{zi} < d_{iz}    (3)

is satisfied, the neuron i can be a center of the learning area. The neurons
that satisfy the condition given by Eq.(3) for all weight-fixed neurons are
selected as the candidate of the center of the learning area. If there are
some candidate neurons, go to (4-6). Otherwise, go to (4-4).
(4-4) Whether the area for the input pattern X^{(p)} can be taken without overlapping the areas for the stored patterns, when the distance between neurons in the area for the stored patterns is reduced to φ_n(d^{min}_z) and the distance between neurons in the area for the input pattern is set to φ_n(d^{max}), is checked.
If, for all weight-fixed neurons z,

φ_n(d^{max}) D_{iz} + φ_n(d^{min}_z) D_{zi} < d_{iz}    (4)

is satisfied, the neuron i can be a candidate of the center of the learning area. The neurons that satisfy the condition given by Eq.(4) for all
weight-fixed neurons are selected as the candidate of the center of the
learning area. If there are some candidate neurons, go to (4-6). Other-
wise, go to (4-5).
(4-5) If, in (4-3) and (4-4),

φ_{n+1}(d^{max}) ≥ d^{min}    (5)

is satisfied, go back to (4-3). Otherwise, it is judged that the input pattern X^{(p)} cannot be learned as a new pattern.

(4-6) From the neurons selected as center candidates of the learning area in (4-2)∼(4-4), the neuron c whose weight vector W_c has the minimum Euclidean distance to the input vector X^{(p)} is selected.
(5) Some neurons are added to the Map Layer if necessary.
(5-1) If the center candidates are selected in (4-3) or (4-4), the distance be-
tween neurons in the areas for stored patterns is reduced, and some
neurons are added.
If the center candidates are selected in (4-3), for the area whose center
is the neuron z which satisfies

φ_{n−1}(d^{max}) D_{cz} + φ_{n−1}(d^{min}_z) D_{zc} < d_{cz},    (6)

the distance between neurons in the area is reduced, and neurons are
added. The neurons i which satisfy

d^{min}_z D_{zi} ≥ d_{iz}    (7)

are generated as new neurons. The neuron i′ corresponding to the neuron i at (x_i, y_i) is generated at (x_{i′}, y_{i′}). Here, x_{i′} and y_{i′} are given by

x_{i′} = (x_i − x_z) φ_n(d^{min}_z) + x_z    (8)
y_{i′} = (y_i − y_z) φ_n(d^{min}_z) + y_z.   (9)

If a neuron already exists at (x_{i′}, y_{i′}), no neuron is added there. The weight vector W_{i′} of the neuron i′ is set to W_i.
If the center candidates are selected in (4-4), the new neurons are
added in the area whose center z that satisfy

φ_n(d^{max}) D_{cz} + φ_{n−1}(d^{min}_z) D_{zc} < d_{cz}.    (10)

(5-2) If the center candidates are selected in (4-3) or (4-4), new neurons are
added in the area for the new pattern X (p) .
If the center candidates are selected in (4-3) and n ≥ 1, the neurons
are added in the area whose center is the neuron c. Here, the neurons
corresponding to the neurons which satisfy

φ_{n−1}(d^{max}) D_{ci} ≥ d_{ic}    (11)

are generated. The neuron i′ corresponding to the neuron i at (x_i, y_i) is generated at (x_{i′}, y_{i′}). Here, x_{i′} and y_{i′} are given by

x_{i′} = (x_i − x_c) φ_{n−1}(d^{max}) + x_c    (12)
y_{i′} = (y_i − y_c) φ_{n−1}(d^{max}) + y_c.   (13)

If a neuron already exists at (x_{i′}, y_{i′}), no neuron is added there. The weight vectors W_{i′} of the newly added neurons are generated randomly.

If the center candidates are selected in (4-4), new neurons are added in
the area whose center is the neuron c. Here, the neurons corresponding
to the neurons which satisfy

φ_n(d^{max}) D_{ci} ≥ d_{ic}    (14)

are generated. The neuron i′ corresponding to the neuron i at (x_i, y_i) is generated at (x_{i′}, y_{i′}). Here, x_{i′} and y_{i′} are given by

x_{i′} = (x_i − x_c) φ_n(d^{max}) + x_c    (15)
y_{i′} = (y_i − y_c) φ_n(d^{max}) + y_c.   (16)

If a neuron already exists at (x_{i′}, y_{i′}), no neuron is added there. The weight vectors W_{i′} of the newly added neurons are generated randomly.
(6) The input pattern X (p) is trained in the area whose center is the neuron c.
The connection weights which are not fixed are updated by
W_i(t+1) =
  { X^{(p)},                                 (d_{ci} ≤ d^{min}_c D_{ci} and
                                              d^x_{ci}/d^{min}_c ∈ Z and d^y_{ci}/d^{min}_c ∈ Z)
  { W_i(t) + H(d̄_{ci}) (X^{(p)} − W_i(t)),  (d^{min}_c D_{ci} < d_{ci} ≤ l (d^{min}_c D_{ci})
                                              and d^{min}_z D_{zi} < d_{zi} (∀z ∈ C_{fix}))
  { W_i(t),                                  (otherwise)                               (17)

where Z is the set of integers, l is the coefficient which decides the neighborhood area size, and C_{fix} is the set of the weight-fixed neurons. H(d̄_{ci}) is the neighborhood function, given by

H(d̄_{ij}) = 1 / (1 + exp((d̄_{ij} − D)/ε))    (18)

where ε is the steepness parameter, and d̄_{ci} is the distance between the neuron c and the neuron i normalized by the moving radius d^{min}_c D_{ci}. d^{min}_c is the distance between neurons in the learning area whose center is the neuron c, and is given by

d^{min}_c = { d^{max},           (center candidates are selected in (4-1) or (4-2))
            { φ_{n−1}(d^{max}),  (center candidates are selected in (4-3))
            { φ_n(d^{max}),      (center candidates are selected in (4-4)).        (19)

(7) The connection weights W_c of the neuron c are fixed.
(8) Steps (2)∼(7) are iterated when a new pattern set is given.
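To make the two helper quantities above concrete, here is a minimal Python sketch of the distance-reduction function φ_n of Eq. (2) and the neighborhood function H of Eq. (18). Reading the reduction factor in Eq. (2) as √(2^n), and the default values for D and ε, are assumptions for illustration, not values taken from this paper.

```python
import math

def phi(n, d, d_min):
    """Distance-reduction function of Eq. (2): shrink d by a factor of
    sqrt(2**n) (an assumed reading of the original), floored at the
    minimum neuron spacing d_min."""
    reduced = d / math.sqrt(2 ** n)
    return reduced if reduced >= d_min else d_min

def neighborhood(d_norm, D=0.5, eps=0.05):
    """Sigmoidal neighborhood function of Eq. (18): close to 1 for
    normalized distances below D, decaying smoothly to 0 beyond it."""
    return 1.0 / (1.0 + math.exp((d_norm - D) / eps))
```

The steepness parameter eps controls how sharply the learning rate drops at the edge of the area: smaller values approach a hard cutoff at distance D.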

2.3 Recall Process

In the proposed model, one of the neurons whose connection weights are similar to the input pattern is selected randomly as the winner neuron. So, the

probability that the pattern whose area is large is recalled is higher than the
probability that the pattern whose area is small. As a result, the probabilistic
association can be realized based on the weights distribution.
When the pattern X is given to the Input/Output Layer, the output of the
neuron i in the Map Layer, x^{map}_i, is calculated by

x^{map}_i = { 1, (i = r)
            { 0, (otherwise)    (20)
where r is selected randomly from the neurons which satisfy
(1/N^{in}) Σ_{k∈C^{in}} g(X_k − W_{ik}) ≥ θ^{map}    (21)

where N^{in} is the number of neurons in the Input/Output Layer, and C^{in} is the set of the neurons which receive the input in the Input/Output Layer. And g(·) is given by

g(u) = { 1, (|u| < θ^d)
       { 0, (otherwise)    (22)

where θ^d and θ^{map} are thresholds.
When the pattern X is given to the Input/Output Layer, the output of the neuron k in the Input/Output Layer, x^{io}_k, is given by

x^{io}_k = { W_{rk},  (X is an analog pattern)
           { 1,       (X is a binary pattern and W_{rk} ≥ 0.5)
           { 0,       (X is a binary pattern and W_{rk} < 0.5).    (23)
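As a minimal Python sketch of this recall procedure (Eqs. (20)–(23)) for analog patterns: the threshold values below, and the simplification of applying the match test to all input components rather than only the receiving set C^{in}, are assumptions for illustration.

```python
import random

def recall(x, weights, theta_d=0.1, theta_map=0.9):
    """Probabilistic recall sketch: every Map Layer neuron whose weight
    vector is close enough to the input is a candidate (Eqs. (21)-(22)),
    the winner r is drawn uniformly among the candidates (Eq. (20)),
    and the winner's weights are output (Eq. (23), analog case)."""
    candidates = []
    for i, w in enumerate(weights):
        # fraction of components within theta_d of the input
        match = sum(1 for xk, wk in zip(x, w) if abs(xk - wk) < theta_d) / len(x)
        if match >= theta_map:
            candidates.append(i)
    if not candidates:
        return None
    r = random.choice(candidates)
    return weights[r]
```

Because the winner is drawn uniformly among all matching neurons, a pattern stored over a larger area (more matching neurons) is recalled proportionally more often, which is exactly the weights-distribution mechanism described above.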

3 Computer Experiment Results


Here, we show the computer experiment results to demonstrate the effectiveness
of the proposed model.

3.1 Probabilistic Association


In this experiment, the analog patterns including one-to-many relations shown
in Fig.2(a) were memorized in the proposed model composed of 800 neurons in
the Input/Output Layer and 100 neurons in the initial Map Layer. Figure 2(b)
and (c) show a part of the association results. From these results, we can confirm
that the proposed model can recall the corresponding plural patterns correctly.
Figure 3 shows the Map Layer after the pattern pairs shown in Fig.2 (a) were
memorized. In Fig.3, green neurons show the center neuron in each area, sky-
blue neurons show the neurons in areas for the patterns. As shown in this figure,
the proposed model can learn each learning pattern with various size area.
Table 1 shows the number of times each pattern was recalled in the trials of Fig.2(b) (t=1∼500) and Fig.2(c) (t=501∼1000). In this table, normalized values are also shown in parentheses. From these results, we can confirm that the proposed model recalls each pattern with a probability according to the area size.

(a) Stored Patterns (b) "lion" was given (t=1, 2, 4) (c) "crow" was given (t=501, 504, 514)
Fig. 2. One-to-Many Associations

[Figure 3: Map Layer areas labeled lion-bear, lion-monkey, lion-mouse, crow-hen, crow-chick, crow-penguin]
Fig. 3. Areas in Map Layer

Table 1. The Number of Recall Times

Input  Output   Area Size  Recall Times
lion   bear     11 (1.0)    73 (1.0)
lion   monkey   23 (2.1)   174 (2.4)
lion   mouse    33 (3.0)   253 (3.5)
crow   hen      11 (1.0)    60 (1.0)
crow   chick    23 (2.1)   175 (2.5)
crow   penguin  33 (3.0)   265 (4.4)

3.2 Storage Capacity


Figure 4 shows the storage capacity of the proposed model. In this experiment, we used a network composed of 800 neurons in the Input/Output Layer and 100/400 neurons in the initial Map Layer, and 1-to-2 random pattern pairs were memorized as areas (a_i = 2.5 and b_i = 1.5). This figure shows the average of 100 trials, and the storage capacities of the conventional model[5] are also shown for reference. From these results, we can confirm that the storage capacity of the proposed model is large when d^{min} is small and the number of neurons in the initial Map Layer is large.

3.3 Robustness for Damaged Neurons/Noisy Input


Figure 5 shows the robustness to damaged neurons and noisy input. As shown in these figures, the proposed model is robust to damaged neurons and noisy input. The robustness to noisy input depends on the threshold θ^{map} (see Fig. 5(b)). If the threshold θ^{map} is large and the threshold θ^d is small, the proposed model is sensitive to noise.

3.4 Learning Speed


Table 2 shows the learning time of the proposed model. In this experiment, 10 random pattern pairs were memorized in networks composed of 800 neurons in the Input/Output Layer and 100/225/400 neurons in the initial Map Layer. As shown in the table, the learning time of the proposed model is short when the number of neurons in the initial Map Layer is small.

1.0 1.0
1.0 Proposed Model
 
# of Neurons in Map Layer : 100
  0.8 0.8  
 
0.8   


Recall Rate
Storage Capacity

Recall Rate
Conventional Model [5]
# of Neurons in Map Layer : 400 0.6 0.6 Conventioanl
  Model [5]
0.6    
  0.4 Proposed Conventional 0.4 
Conventional Model [5]
Model [5] 
0.4 Model  
0.2       0.2
     
0.2      
0.0 0.0
0 20 40 60 80 100 0 10 20 30 40 50 60
0.0 Rate of Damaged Neurons in Map Layer Noise Rate
0 100 200 300 400 500 600
The Number of Stored Patterns
(a) Damage (b) Noise
Fig. 4. Storage Capacity Fig. 5. Robustness for Damaged Neurons/Noise

Table 2. Learning Speed

                              Learning Time (msec)
                              Binary    Analog
Proposed Model (100)          1.53      1.63
Proposed Model (225)          2.80      2.87
Proposed Model (400)          5.23      5.23
Conventional Model[5] (400)   5.14      5.18

4 Conclusions
In this paper, we have proposed the Variable-sized Kohonen Feature Map Probabilistic Associative Memory, which can realize probabilistic association for training sets including one-to-many relations. We carried out a series of computer experiments and confirmed the effectiveness of the proposed model.

References
1. Kohonen, T.: Self-Organizing Maps. Springer (1994)
2. Ichiki, H., Hagiwara, M., Nakagawa, M.: Kohonen feature maps as a supervised
learning machine. In: ICNN, pp. 1944–1948 (1993)
3. Yamada, T., Hattori, M., Morisawa, M., Ito, H.: Sequential learning for associative
memory using Kohonen feature map. In: IJCNN, Washington D.C. (1999)
4. Abe, H., Osana, Y.: Kohonen feature map associative memory with area represen-
tation. In: IASTED AIA, Innsbruck (2006)
5. Noguchi, S., Osana, Y.: Improved Kohonen feature map probabilistic associative
memory based on weights distribution. In: IJCNN, Barcelona (2010)
6. Imabayashi, T., Osana, Y.: Variable-sized KFM associative memory with refractori-
ness based on area representation. In: SMC, San Antonio (2009)
Learning Deep Belief Networks
from Non-stationary Streams

Roberto Calandra^1, Tapani Raiko^2, Marc Peter Deisenroth^1, and Federico Montesino Pouzols^3

^1 Fachbereich Informatik, Technische Universität Darmstadt, Germany
^2 Department of Information and Computer Science, Aalto University, Finland
^3 Department of Biosciences, University of Helsinki, Finland

Abstract. Deep learning has proven to be beneficial for complex tasks such as classifying images. However, this approach has been mostly applied to static datasets. The analysis of non-stationary (e.g., concept drift) streams of data involves specific issues connected with the temporal and changing nature of the data. In this paper, we propose a proof-of-concept method, called Adaptive Deep Belief Networks, of how deep learning can be generalized to learn online from changing streams of data. We do so by exploiting the generative properties of the model to incrementally re-train the Deep Belief Network whenever new data are collected. This approach eliminates the need to store past observations and, therefore, requires only constant memory consumption. Hence, our approach can be valuable for life-long learning from non-stationary data streams.

Keywords: Incremental Learning, Adaptive Learning, Non-stationary Learning, Concept drift, Deep Learning, Deep Belief Networks, Generative model, Generating samples, Adaptive Deep Belief Networks.

1 Introduction

Machine learning typically assumes that the underlying process generating the data is stationary. Moreover, the dataset must be sufficiently rich to represent this process. These assumptions do not hold for non-stationary environments such as time-variant streams of data (e.g., video). In different communities, a number of approaches exist to deal with non-stationary streams of data: Adaptive Learning [4], Evolving Systems [2], Concept Drift [15], Dataset Shift [12]. In all these paradigms, incomplete knowledge of the environment is sufficient during the training phase, since learning continues during run time.
Within the adaptive learning framework, there is a new set of issues to be
addressed when dealing with large amounts of continuous data online: limita-
tions on computational time and memory. In fast changing environments, even
a partially correct classification can be valuable.
Since its introduction [8,3] Deep Learning has proven to be an effective method
to improve the accuracy of Multi-Layer Perceptrons (MLPs) [6]. In particular,

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 379–386, 2012.

c Springer-Verlag Berlin Heidelberg 2012

Deep Belief Networks (DBNs) have been well-established and can be considered
the state of the art for artificial neural networks. To the best of our knowledge,
DBNs have not been used to incrementally learn from non-stationary streams
of data. Dealing with changing streams of data with classical DBNs requires to
store at least a subset of the previous observations, similar to other non-linear
approaches to incremental learning [10,1]. However, storing large amounts of
data can be impractical.
The contributions of this paper are two-fold. Firstly, we study the generative
capabilities of DBNs as a way to extract and transfer learned beliefs to other
DBNs. Secondly, based on the possibility to transfer knowledge between DBNs,
we introduce a novel approach called Adaptive Deep Belief Networks (ADBN)
to cope with changing streams of data while requiring only constant memory
consumption. With the ADBN it is possible to use the DBN parameters to
generate observations that mimic the original data, thus reducing the memory
requirement to storing only the model parameters. Moreover, the data compres-
sion properties of DBNs already provide an automatic selector of the relevant
extracted beliefs. To the best of our knowledge, the proposed ADBN is the first
approach toward generalizing Deep Learning to incremental learning.

2 Deep Belief Networks


Deep Belief Networks are probabilistic models that are usually trained in an
unsupervised, greedy manner. DBNs have proven to be powerful and flexible
models [14]. Moreover, their capability of dealing with high-dimensional inputs makes them ideal for innately high-dimensional tasks such as image classification. The basic building block of a DBN is the Restricted Boltzmann
Machine (RBM) that defines a joint distribution over the inputs and binary
latent variables using undirected edges between them. A DBN is created by
repeatedly training RBMs stacked one on top of the previous one, such that the
latent variables of the previous RBM are used as data for the next one. The
resulting DBN includes both generative connections for modeling the inputs,
and recognition connections for classification (see [8] for details).
One possible use of DBNs is to initialize a classical MLP: While a DBN
works with activation probabilities of binary units, they can be re-interpreted as
continuous-valued signals in an equivalent MLP. This “pre-training” proves to
be better than random initialization and has been shown to lead to consistent
improvements in the final classification accuracy [7,6]. A second use for DBNs is
dimensionality reduction. In this case, a classifier can be trained “on top” of the
DBN, which means that the input of the classifier is nothing else but the data in
the reduced space (i.e., the DBN output). This second configuration (from now
on referred as DBN/classifier) has the advantage that the generative capabilities
of the DBN are maintained.
DBNs typically require the presence of a static dataset. To deal with changing
streams, the usage of DBNs would require storing all the previous observations
to re-train the DBN. For infinite-lasting streams of data this would translate into
an infinite memory requirement and increasing computational time for training.
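To illustrate the greedy layer-wise construction described above, here is a bare-bones NumPy sketch of stacking RBMs trained with one-step Contrastive Divergence (CD-1). Biases, momentum, and the specific training scheme used in the paper are omitted, so this is an illustrative skeleton under those assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """CD-1 for a single bias-free RBM: one Gibbs step gives the
    negative phase statistics."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        h_pos = sigmoid(data @ W)                                # positive phase
        h_state = (rng.random(h_pos.shape) < h_pos).astype(float)
        v_neg = sigmoid(h_state @ W.T)                           # reconstruction
        h_neg = sigmoid(v_neg @ W)                               # negative phase
        W += lr * (data.T @ h_pos - v_neg.T @ h_neg) / len(data)
    return W

def train_dbn(data, layer_sizes):
    """Greedy layer-wise stacking: each RBM's hidden activations
    become the training data for the next RBM."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)      # propagate up to feed the next layer
    return weights
```

The returned weight list can either initialize an equivalent MLP ("pre-training") or serve as a dimensionality reducer with a classifier on top, matching the two uses of DBNs described above.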


Fig. 1. Regenerative Chaining: Alternately learning a DBN from data and generating data from a DBN

Fig. 2. ADBN: DBN+classifier are trained from both the generated samples and newly incoming data

3 Adaptive Deep Belief Networks


To address the limitations of DBNs for non-stationary streams of data (memory consumption, training time), we propose a novel approach based on the generative capabilities of a DBN. Three novel contributions are presented in this section in a logical sequence, each extending upon the previous one. At first, we investigate the possibility of using samples generated from a DBN to transfer the learned beliefs (i.e., knowledge) to a second DBN. Then we show how to extend this approach to transfer not only unsupervised but also supervised knowledge (i.e., including labels). Finally, we present our novel approach, called Adaptive Deep Belief Networks (ADBN), which transfers supervised knowledge by means of generated samples and jointly learns new knowledge from the novel observations.

Belief Regeneration. An interesting feature of DBNs is their capability of generating samples. These samples can be considered a representation of the beliefs learned during the training phase [8]. Under this assumption, we can exploit the generated samples as an approximation of the knowledge of the DBN. We propose to train a second, regenerated DBN from the samples generated by a trained original DBN. From this procedure, which we call Belief Regeneration, we theoretically obtain a model equivalent to the original DBN.
This procedure can be iterated by training an nth DBN from the samples generated by the (n − 1)th regenerated DBN, as shown in Fig. 1. We call this procedure Regenerative Chaining, and a DBN obtained from n repetitions of Belief Regeneration is an nth-generation DBN.
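The chaining loop of Fig. 1 can be sketched generically as follows; `model_factory`, `fit`, and `sample` are hypothetical names for a DBN wrapper's interface, introduced here for illustration and not API from the paper.

```python
def regenerate_chain(model_factory, data, generations, n_samples=1000):
    """Regenerative Chaining: train a model on data, then repeatedly
    sample from the current model and train a fresh next-generation
    model on the generated samples alone."""
    model = model_factory()
    model.fit(data)
    for _ in range(generations):
        samples = model.sample(n_samples)   # generated beliefs
        model = model_factory()             # fresh nth-generation model
        model.fit(samples)                  # trained only on generated data
    return model
```

Each iteration discards the previous training set entirely, so any knowledge that survives the chain must have been carried by the generated samples.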

Classifier Regeneration. DBNs can generate unlabeled samples that mimic the distribution of the training inputs. When using the DBN/classifier configuration, it is still possible to generate unlabeled samples with the generative connections. Furthermore, these samples can be used as standard input for the recognition connections and, thus, be classified. Hence, this procedure allows the generation of datasets of labeled samples. Similarly to Belief Regeneration, we use this artificially generated dataset to train a second DBN/classifier, in what we call Classifier Regeneration. Chaining the Classifier Regeneration process is the building block for ADBNs.

Adaptive Deep Belief Networks. When dealing with non-stationary streams of data, two different aspects must be considered: past knowledge must be retained, but new knowledge must also be incorporated. We saw how generated labeled samples can be used to repeatedly reconstruct both the DBN and the classifier so that they approximate the original ones. The DBN/classifier regeneration can effectively keep acquired knowledge in our model even when discarding the past observations (i.e., the dataset). We propose to exploit this belief transfer to generalize DBNs to an incremental learning paradigm. In order to also incorporate new knowledge in the model, we can use both generated samples and novel data from the stream to re-train the DBN/classifier, as shown in Fig. 2. The use of such training data allows the DBN/classifier to incorporate new knowledge while retaining the old. Moreover, the memory consumption is constant, as after each training period all the previous data (both artificially generated and real) are discarded.
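A single ADBN re-training step (Fig. 2) can be sketched in the same spirit; `model_factory`, `fit`, and `sample` are a hypothetical model interface assumed for illustration, with the generated samples standing in for the discarded past observations so that memory use stays constant.

```python
def adbn_update(model_factory, model, new_batch, n_generated=1000):
    """One ADBN step: old knowledge re-enters through generated samples,
    new knowledge through the incoming batch; no past data are stored."""
    replay = list(model.sample(n_generated))   # approximates past knowledge
    fresh = model_factory()
    fresh.fit(replay + list(new_batch))        # retain old, incorporate new
    return fresh
```

Calling this once per incoming batch keeps memory bounded by the model parameters plus the fixed-size replay set, regardless of how long the stream runs.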

4 Experiments
To evaluate the properties of our models, we used the hand-written digit recognition MNIST dataset [11] in our experiments. The MNIST dataset consists of a training set of 60000 observations and a test set of 10000 observations, where every observation is an image of 28×28 binary pixels stored as a vector (no spatial information was used).
To train the RBMs, we used the algorithm introduced by Cho et al. [5] that
makes use of Contrastive Divergence learning (CD-1). We used Gibbs sampling
to generate samples from a trained DBN. The reconstruction error over a dataset
is defined as

R(X) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{D} (X_{ij} − X̂_{ij})²,    (1)

where N is the number of observations, D the dimensionality of the input X, and X̂ is the reconstructed input. For fine-tuning the neural network, we used the Resilient Propagation (Rprop) algorithm [13], in particular the IRprop variant [9]^1. We used the log-sigmoid transfer function and the mean squared error as the error function to train the network.
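Eq. (1) translates directly into a few lines of NumPy; this is a straightforward reading of the formula, not the authors' code.

```python
import numpy as np

def reconstruction_error(X, X_hat):
    """Reconstruction error of Eq. (1): squared error summed over the D
    input dimensions and averaged over the N observations."""
    X, X_hat = np.asarray(X, dtype=float), np.asarray(X_hat, dtype=float)
    return float(np.sum((X - X_hat) ** 2) / X.shape[0])
```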

Belief Regeneration. To experimentally evaluate the Belief Regeneration process, a DBN with topology [784-600-500-400] was trained. Fig. 3 shows how the number of samples used to train the regenerated DBN influences the reconstruction error in Eq. (1). A higher number of samples better approximates the original

^1 Our implementation is available at http://www.mathworks.com/matlabcentral/fileexchange/32445-rprop

[Figure 3: reconstruction error vs. number of samples used for regeneration (10^2 to 10^4), for the regenerated DBN (Reg. DBN) and the original DBN]

Fig. 3. Reconstruction error on the MNIST test set using a DBN regenerated with varying amounts of generated samples

Fig. 4. MNIST samples generated from DBNs: first row from the original DBN, then from DBNs regenerated with 10000, 2500, 750, 500, and 100 generated samples, respectively

DBN trained with the full dataset. However, it is also computationally expensive
to generate many samples. In our experiment, there seems to be a clear threshold
at 750 samples above which the original DBN can be considered well approxi-
mated. A further indication is given by the visual inspection of the generated
samples from the regenerated DBN in Fig. 4. Above 750 samples there is little difference, for a human observer, between the samples generated from the original and the reconstructed DBN (top row of Fig. 4).
Fig. 5 shows that for chained regenerations the reconstruction error gradually increases with the number of sequential reconstructions. Similar conclusions can be drawn visually from the generated samples in Fig. 6, where after 100 generations of regeneration (using 10000 samples at each generation) there is a visible degradation in the quality of the generated samples. The reason for this degradation is error propagation between sequential regenerations.
of classification accuracy compared to fine-tuning the original DBN, as shown
in Fig. 7. This result suggests that despite becoming humanly incomprehensible
(Fig. 6), the generated samples retain valuable features for training a DBN and
still prove to be useful during an eventual fine-tuning: Fine-tuning initialized
from a regenerated DBN led to a similar optimum as the original DBN.

Classifier Regeneration. Using a DBN/classifier allows us to generate labeled


samples. Examples of the generated samples and respective labels are shown in
Fig. 8. These artificially generated datasets are used to train subsequent DBNs/
classifiers, as described in Sec. 3. Fig. 9 shows the classification accuracies of the
regenerated DBNs/classifiers after n generations. While the decrease in perfor-
mance is consistent, we are using samples generated from our model. Further-
more, the number of samples is only a fraction of the original dataset size.

[Figure 5: reconstruction error vs. number of sequential regenerations (0 to 100), for the regenerated DBN (Reg. DBN) and the original DBN]

Fig. 5. Reconstruction error on the MNIST test set using a DBN regenerated a varying number of times. Despite the increase in reconstruction error, the classification accuracies after fine-tuning do not change (see Fig. 7).

Fig. 6. MNIST samples generated from DBNs: first row from the original DBN, then for the 1st, 50th and 100th-generation DBNs. Despite the degeneration, even after 100 generations the samples retain useful features to train a new DBN.

Adaptive Deep Belief Networks. We trained a DBN and classifier using 3 digits
(8,6,4) of the MNIST dataset. Every 50 fine-tuning iterations, we presented a
new batch of data containing samples from a novel digit to the ADBN. These
samples, together with the generated ones, were then used to re-train both the
DBN and the classifier, see Sec. 3. Fig. 10 shows the classification accuracy and
memory consumption of the ADBN on all 10 digits when adding new digits to the
data set. The accuracy increases, which means that the ADBN can successfully
learn from new data while at the same time retaining the knowledge previously
acquired. In Fig. 10, we also compare to a DBN that is trained on all the previous
observations, which leads to a higher classification accuracy but at the expense of
memory consumption. While the DBN's memory requirement increases linearly (as we store

[Fig. 7: plot of classification accuracy over fine-tuning iterations for the original and regenerated DBNs.]
Fig. 7. Classification accuracies on the MNIST test set during the fine-tuning. Using regenerated DBNs does not substantially decrease the classification accuracy compared to the original DBN.

Fig. 8. MNIST samples generated from DBNs, labeled and then used for regeneration: every row corresponds to labeled samples.
Learning Deep Belief Networks from Non-stationary Streams 385
[Fig. 9: plot of classification accuracy during training for the original and regenerated classifiers. Fig. 10: plot of classification accuracy (%) and memory (MB) of the ADBN and the DBN as digits are added: Init (8,6,4), add 0, add 3, add 5, add 2, add 7, add 1, add 9.]

Fig. 9. Comparison of the classification accuracies on the MNIST test set during the training of the classifier on top, using the original and regenerated classifiers after different N generations

Fig. 10. Classification accuracy and memory consumption of the DBN and the ADBN on all 10 digits during the training phase: initially trained with 3 digits (8,6,4), for every novel digit introduced the DBN/classifier is regenerated. A classical DBN achieves higher classification accuracy, but at the expense of memory consumption. In contrast, the ADBN requires only a constant memory.

more and more observations), the amount of memory required by the ADBN is
constant: Only the model parameters need to be stored.

5 Conclusions and Discussion


In this paper, we presented Adaptive Deep Belief Networks (ADBNs), a promising ap-
proach that generalizes DBNs to deal with changing streams of data (e.g., a growing
number of classes or a shift in distribution) while requiring only constant memory.
ADBNs can be incrementally trained by means of Belief Regeneration. Be-
lief Regeneration consists of iteratively transferring beliefs between DBNs by
training a second DBN using samples from the original DBN.
In ADBNs, new data can be integrated into the already acquired beliefs by re-
training the DBN with both the novel and the artificially generated samples. While
the classification accuracy suffers compared to fully trained DBNs, the ADBN
does not need to store past observations. Hence, the memory requirements are
constant since only the model parameters have to be stored. Moreover, unlike
other Adaptive Learning methods, the ADBN is a generative model and can deal
with high-dimensional data.
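As a concrete illustration of this regeneration-plus-integration loop, here is a minimal sketch in which a class-conditional Gaussian model stands in for the DBN; the class `ToyGenerativeModel`, the function `adbn_step` and all sample counts are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyGenerativeModel:
    """Stand-in for the DBN/classifier: one Gaussian per class label."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mean = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.std = {c: X[y == c].std(axis=0) + 1e-6 for c in self.classes}
        return self

    def generate(self, n_per_class):
        """Sample labeled data from the model's current beliefs."""
        X = np.vstack([rng.normal(self.mean[c], self.std[c],
                                  size=(n_per_class, self.mean[c].size))
                       for c in self.classes])
        y = np.repeat(self.classes, n_per_class)
        return X, y

def adbn_step(model, X_new, y_new, n_generated=200):
    """One adaptive step: retrain on generated 'memories' plus novel data.

    Past observations are never stored; the ratio n_generated / len(X_new)
    tunes the stability/plasticity trade-off.
    """
    X_gen, y_gen = model.generate(n_generated)
    return ToyGenerativeModel().fit(np.vstack([X_gen, X_new]),
                                    np.concatenate([y_gen, y_new]))
```

Each `adbn_step` regenerates the previously learned classes from samples and folds in the newly observed ones, so only the model parameters persist between steps.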
In our approach, the generative capabilities and the sampling method adopted
are of great importance. Although contrastive divergence (CD) learning
(used in this work) produces good classifiers, it does not work as well for
generative modeling. We therefore believe that the use of more appropriate training
and sampling schemes could be beneficial.
Many adaptive learning approaches make use of explicit mechanisms to forget
the past. In this regard, ADBNs have no explicit mechanism to forget selected
observations. Instead, the less representative observations are naturally forgotten
during the regeneration process. The choice of the number of samples to use for
each epoch of the ADBN training can be a sensitive parameter. In particular, the
ratio between generated samples and novel observations can be used to modify
the stability/plasticity trade-off.
Finally, an interesting extension of our approach would be to change the topology
of the network adaptively at run time, in order to match the model's capacity
to the complexity of the environment.

Acknowledgments. We thank Olli Simula and Jan Peters for invaluable dis-
cussions and the friendly environments they provided to realize this paper. The
research leading to these results has received funding from the European Commu-
nity’s Seventh Framework Programme (FP7/2007-2013) under grant agreement
#270327 and by the DFG within grant #SE1042/1.

References
1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: On demand classification of data
streams. In: Proceedings of KDD 2004, pp. 503–508 (2004)
2. Angelov, P., Filev, D.P., Kasabov, N.: Evolving Intelligent Systems: Methodology
and Applications. Wiley-IEEE Press (2010)
3. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Proceedings of NIPS 2006, vol. 19, pp. 153–160 (2006)
4. Bifet, A. (ed.): Proceeding of the 2010 Conference on Adaptive Stream Mining:
Pattern Learning and Mining from Evolving Data Streams (2010)
5. Cho, K., Raiko, T., Ilin, A.: Enhanced gradient and adaptive learning rate for
training restricted Boltzmann machines. In: Proceedings of ICML 2011, pp. 105–
112 (2011)
6. Erhan, D., Courville, A., Bengio, Y., Vincent, P.: Why does unsupervised pre-
training help deep learning? In: Proceedings of AISTATS 2010, pp. 201–208 (2010)
7. Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., Vincent, P.: The difficulty of
training deep architectures and the effect of unsupervised pre-training. In: Pro-
ceedings of AISTATS 2009, pp. 153–160 (2009)
8. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
9. Igel, C., Hüsken, M.: Improving the RPROP learning algorithm. In: Proceedings
of NC 2000, pp. 115–121 (2000)
10. Last, M.: Online classification of nonstationary data streams. Intelligent Data Anal-
ysis 6(2), 129–147 (2002)
11. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010),
http://yann.lecun.com/exdb/mnist/
12. Quiñonero Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset
Shift in Machine Learning. The MIT Press (2009)
13. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: the RPROP algorithm. In: IEEE International Conference on Neural
Networks, vol. 1, pp. 586–591 (1993)
14. Salakhutdinov, R.: Learning deep generative models. PhD thesis, University of
Toronto (2009)
15. Zliobaite, I.: Learning under concept drift: an overview. CoRR (2010)
Separation and Unification of Individuality
and Collectivity and Its Application to Explicit Class
Structure in Self-Organizing Maps

Ryotaro Kamimura

IT Education Center, 1117 Kitakaname Hiratsuka Kanagawa 259-1292, Japan


ryo@keyaki.cc.u-tokai.ac.jp

Abstract. In this paper, we propose a new type of learning method in which in-
dividuality and collectivity are separated and unified to control the characteristics
of neurons. This unification is expected to enhance the characteristics shared by
individual and collective outputs, while the characteristics specific to them are
weakened. We applied the method to self-organizing maps to demonstrate the
utility of unification. In self-organizing maps, the introduction of unification has
the effect of controlling cooperation among neurons. Experimental results on the
glass identification problem from the machine learning database showed that ex-
plicit class boundaries could be obtained by introducing the unification.

1 Introduction
In this paper, we propose a new type of learning method based upon the separation and
unification of individuality and collectivity of neurons. The utility of this unification can
be explained through self-organizing maps. The collectivity of neurons has received due
attention in the field of self-organizing maps [1], because SOMs have been concerned
with the collective behavior of neurons. One of the main principles for this interaction
is that neighboring neurons behave in the same way. We think that this property of
neighboring neurons has been related to the difficulty in visualizing final results by
the conventional SOMs. For example, conventional SOMs have been used to clarify
class boundaries. Naturally, class boundaries are based on some discontinuity between
neighboring neurons. In SOMs, the neighboring neurons cooperate with each other, and
discontinuity between neurons tends to be reduced. Because we have had difficulty in
representing class boundaries using conventional SOMs, many different kinds of
visualization techniques have been proposed, [2], [3], [4], [5], [6], to cite a few. However,
even with these visualization methods, class structure still cannot be easily
identified. This fact suggests that we need to weaken the collectivity of neurons
for improved visualization. Our method aims to control cooperation or collectivity by
introducing the individuality of neurons.

2 Theory and Computational Methods


2.1 Separation and Unification
We consider a neuron from two different points of view: we separate the individual-
ity and collectivity of the neuron. Output from an individually treated neuron becomes

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 387–394, 2012.

© Springer-Verlag Berlin Heidelberg 2012
388 R. Kamimura

[Fig. 1 diagram: panels (a) Enhancement and (b) Inhibition; in each, a collective output (a1/b1) and an individual output (a2/b2) are separated and then unified into a unified output (a3/b3).]

Fig. 1. Two examples of learning processes where individual outputs interact with collective outputs to produce unified outputs

individual output, while output from a collectively treated neuron becomes collective
output. We try to unify these two forms of output into a third form, namely unified out-
put. One of the main objectives of this unification is to enhance characteristics common
to individual and collective outputs, and to weaken the characteristics specific to each
output. For example, Figure 1(a) shows an example of the enhancement. Because the
individual and collective outputs are large in Figure 1(a1) and (a2), the unified output
becomes large in Figure 1(a3). This corresponds to the enhancement of characteristics
common to the two types of outputs. Concordantly, characteristics specific to the two
types of outputs should be weakened. This is an example of inhibition, as shown in
Figure 1(b). The collective output is small, see Figure 1(b1), while individual output
is large, as in Figure 1(b2). Because the two outputs are different from each other, the
unified output should be small, as in Figure 1(b3). Finally, we should note that in the
unification process, we try to weight individual outputs by collective outputs as shown
in Figure 1. Because we apply our method to self-organizing maps, collective output
plays a more important role in forming ordered maps.

2.2 Computational Methods


Let us define three types of outputs, namely, individual, collective, and unified
outputs. The sth input pattern of the total S patterns can be represented by
x^s = [x^s_1, x^s_2, \dots, x^s_L]^T, s = 1, 2, \dots, S. Connection weights into the
jth competitive unit of the total M units are denoted by
w_j = [w_{j1}, w_{j2}, \dots, w_{jL}]^T, j = 1, 2, \dots, M. Then, the jth individual
output can be computed by:

v_j^s \propto \exp\!\left( -\tfrac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right),   (1)
Separation and Unification of Individuality and Collectivity 389

where x^s and w_j represent L-dimensional input and weight column vectors,
and L denotes the number of input units. The L × L matrix Λ is called a
“scaling matrix,” and its klth element, denoted by (Λ)_{kl}, is defined by:
(\Lambda)_{kl} = \frac{p(k)}{\sigma_\beta^2} \, \delta_{kl}, \quad k, l = 1, 2, \dots, L,   (2)
where σ_β is a spread parameter, and p(k) represents the importance of the kth
input unit, initially set to 1/L because we have no preference among input units. We represent
relations between the jth neuron and the mth neuron by φ_{jm}. Then, the collective output
is defined by:

y_j^s \propto \sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\tfrac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right).   (3)
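For concreteness, Eqs. (1)–(3) can be written in a few lines of vectorized NumPy; the array names and shapes below are my own conventions, not the paper's:

```python
import numpy as np

def individual_outputs(X, W, p, sigma_beta):
    """Individual outputs v_j^s of Eq. (1), with the scaling matrix
    Λ = diag(p(k)) / σ_β² of Eq. (2).  X is (S, L), W is (M, L)."""
    Lam = np.diag(p) / sigma_beta ** 2           # Λ, shape (L, L)
    diff = X[:, None, :] - W[None, :, :]         # (S, M, L)
    dist = np.einsum('sml,lk,smk->sm', diff, Lam, diff)  # quadratic forms
    return np.exp(-0.5 * dist)                   # (S, M)

def collective_outputs(V, Phi):
    """Collective outputs y_j^s of Eq. (3): each unit pools its neighbours
    through the (M, M) relation matrix Phi (φ_{jm})."""
    return V @ Phi.T                             # (S, M)
```

With `Phi` set to the identity, each unit hears only itself and the collective outputs coincide with the individual ones.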

The unified output is based upon individual and collective outputs. For the first approx-
imation to the unified outputs, we consider the unified output as the individual outputs
weighted by the normalized collective outputs:

z(j \mid s) \propto q(j \mid s) \, \exp\!\left( -\tfrac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right).   (4)
We use the normalized collective outputs,
q(j \mid s) = \frac{ \sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\tfrac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right) }{ \sum_{j=1}^{M} \sum_{m=1}^{M} \phi_{jm} \exp\!\left( -\tfrac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right) },   (5)

because the collective outputs are the sum of all neighboring individual neurons’ outputs
and much larger than the individual outputs. This equation shows that when the jth
collective and individual outputs are larger, the unified output becomes larger in turn.
On the other hand, when the collective and individual outputs are different, the unified
output is weakened. Thus, this equation captures the idea explained above.
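A sketch of Eqs. (4)–(5), assuming `V` holds the individual outputs v_j^s as an (S, M) array and `Phi` the (M, M) relations φ_{jm}:

```python
import numpy as np

def normalized_collective(V, Phi):
    """q(j|s) of Eq. (5): collective outputs normalized over all M units."""
    Y = V @ Phi.T                                # unnormalized collective outputs
    return Y / Y.sum(axis=1, keepdims=True)

def unified_outputs(V, Phi):
    """z(j|s) of Eq. (4): the individual output of each unit weighted by
    its normalized collective output."""
    return normalized_collective(V, Phi) * V
```

The product enhances units whose individual and collective outputs are both large, and suppresses units where the two disagree, as described in the text.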
To obtain connection weights, we introduce the free energy function [7]:

F = -2 \sigma_\beta^2 \sum_{s=1}^{S} p(s) \log \sum_{j=1}^{M} q(j \mid s) \exp\!\left( -\tfrac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right).   (6)

This equation can be transformed as:


F = \sum_{s=1}^{S} p(s) \sum_{j=1}^{M} p^*(j \mid s) \, \|x^s - w_j\|^2 \;+\; 2 \sigma_\beta^2 \sum_{s=1}^{S} p(s) \sum_{j=1}^{M} p^*(j \mid s) \log \frac{p^*(j \mid s)}{q(j \mid s)},   (7)

where p^*(j \mid s) denotes the normalized unified outputs, computed by:

p^*(j \mid s) = \frac{ q(j \mid s) \exp\!\left( -\tfrac{1}{2} (x^s - w_j)^T \Lambda (x^s - w_j) \right) }{ \sum_{m=1}^{M} q(m \mid s) \exp\!\left( -\tfrac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right) }.   (8)

When the spread parameter σβ is larger, the unified outputs tend to be more similar
to the collective outputs. When the spread parameter is smaller, the quantization er-
rors tend to be much smaller. By differentiating the free energy, we can compute the
connection weights:
w_j = \frac{ \sum_{s=1}^{S} p^*(j \mid s) \, x^s }{ \sum_{s=1}^{S} p^*(j \mid s) }.   (9)
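Equations (8)–(9) then amount to an EM-like sweep; a sketch under the same assumed array conventions (`V` individual outputs and `Q` normalized collective outputs, both (S, M); `X` the (S, L) data):

```python
import numpy as np

def pstar(V, Q):
    """Normalized unified outputs p*(j|s) of Eq. (8)."""
    Z = Q * V                                    # z(j|s) up to normalization
    return Z / Z.sum(axis=1, keepdims=True)

def update_weights(X, P):
    """Free-energy minimizing update of Eq. (9): each weight vector w_j
    moves to the p*-weighted mean of the input patterns."""
    return (P.T @ X) / P.sum(axis=0)[:, None]    # (M, L)
```

When p*(j|s) is uniform, every weight vector collapses to the plain mean of the data, which is a quick sanity check on the update.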

2.3 Application to SOM


When we apply the method to self-organizing maps, the relation φjm must be replaced
by the neighborhood function. Let us consider the following neighborhood function that
is usually used in self-organizing maps:

h_{jc} \propto \exp\!\left( -\frac{\|r_j - r_c\|^2}{2 \sigma_\gamma^2} \right),   (10)

where rj and rc denote the position of the jth and the cth unit on the output space and
σγ is a spread parameter. Using this neighborhood function, we can compute collective
outputs:
y_j^s \propto \sum_{m=1}^{M} h_{jm} \exp\!\left( -\tfrac{1}{2} (x^s - w_m)^T \Lambda (x^s - w_m) \right).   (11)
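A sketch of the neighborhood computation of Eq. (10) for a small rectangular map; the grid layout and the value of σ_γ are illustrative:

```python
import numpy as np

def neighborhood(grid, sigma_gamma):
    """h_{jc} ∝ exp(-||r_j - r_c||² / (2 σ_γ²)), Eq. (10), for every pair
    of unit positions r_j, r_c on the output map."""
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma_gamma ** 2))

# unit positions of a small rectangular map, e.g. 3 x 2
grid = np.array([[i, j] for i in range(3) for j in range(2)], dtype=float)
H = neighborhood(grid, sigma_gamma=1.0)
```

Replacing φ_{jm} with this matrix `H` in the collective-output computation yields Eq. (11).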

3 Results and Discussion


In this study we used the glass identification problem of the machine learning database
[8]. The numbers of input variables and patterns were 9 and 216, respectively. The
number of units was 21 by 12. The data were mainly classified into window and
non-window glasses in an unsupervised way. We used the well-known SOM toolbox [9] for
the easy reproduction of our results. The performance evaluation was measured in terms
of quantization and topographic errors due to the simplicity of the measures. The quan-
tization error is the average distance from each data vector to its BMU (best-matching
unit). The topographic error is the percentage of data vectors for which the BMU and
the second-BMU are not neighboring units [10]. In this experiment, it was shown that
our method produced much clearer class boundaries than those producible by the con-
ventional SOM, with some degradation in the fidelity to input patterns.
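Both error measures can be computed directly from the data, the weight vectors and the unit positions. In this sketch I treat units at Chebyshev distance 1 on the grid as neighbours, which is an assumption and may differ slightly from the SOM toolbox convention:

```python
import numpy as np

def quantization_error(X, W):
    """Average distance from each data vector to its best-matching unit."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)  # (S, M)
    return d.min(axis=1).mean()

def topographic_error(X, W, grid):
    """Fraction of data vectors whose BMU and second BMU are not
    adjacent on the map (Chebyshev grid distance > 1)."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)
    bmu, second = nearest[:, 0], nearest[:, 1]
    cheb = np.abs(grid[bmu] - grid[second]).max(axis=1)
    return float((cheb > 1).mean())
```

A topographic error of 0 means every data vector's two best units sit next to each other on the map, i.e. the topology is well preserved.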
Figure 2(a) shows the U-matrix obtained by the conventional SOM. As can be seen
in the figure, no explicit class boundaries could be identified on the matrix. Figures 2(b)
and (c) show the results of the principal component analysis (PCA) on the connection
weights using the conventional SOM and the data itself. A cluster of connection weights
in Figure 2(b) and data points in Figure 2(c) were detected on the right hand side.
Figure 3 shows the U-matrices when the parameter β was increased from 5 (a) to
15 (f). When the parameter β was five, as in Figure 3(a), no regularity could be seen
on the U-matrix. When the parameter β was increased to seven, as in Figure 3(b), a
class boundary in light blue was detected on the upper side of the matrix. When the
parameter β was increased to nine, as in Figure 3(c), the class boundary became much
[Fig. 2: (a) U-matrix from the conventional SOM; (b) PCA of the SOM connection weights; (c) PCA of the data.]

Fig. 2. U-matrices (a) by the conventional SOM, the results of PCA applied to the weights (b) by the SOM and to actual data (c) for the glass identification problem

clearer, represented in warmer colors. When the parameter β was increased to 11, as in
Figure 3(d), the class boundary became slightly differentiated. When the parameter β
was further increased to 13 and 15, as in Figures 3(e) and (f), the main boundary was
decomposed into minor class boundaries.
Figure 4 shows the results of the PCA on the connection weights when the parameter
β was increased from 5 (a) to 15 (f). When the parameter β was five, as in Figure 4(a),
connection weights were uniformly distributed. When the parameter was increased to
seven, as in Figure 4(b), a clear class appeared on the right hand side. This class became
further clearer when the parameter was increased to nine, as seen in Figure 4(c). When
the parameter was further increased from 11 in Figure 4(d) to 15 in Figure 4(f), the class
was decomposed into several subclasses.
Figure 5 shows quantization and topographic errors for training and testing patterns.
As shown in Figure 5(a1) and (a2), quantization errors decreased gradually as the pa-
rameter β was increased for the training and testing data. Figure 5(b1) shows the topo-
graphic errors for the training patterns. As shown in the figure, the topographic errors
were below the level obtained by the conventional SOM when the parameter β was
small. They increased gradually when the parameter β was increased. Figure 5(b2)
shows topographic errors as a function of the parameter β for the testing data. The
[Fig. 3 panels: (a) β=5, (b) β=7, (c) β=9, (d) β=11, (e) β=13, (f) β=15.]

Fig. 3. U-matrices by the new method when the parameter β was increased from 5 (a) to 15 (f) for the glass identification problem

topographic errors were much lower than those obtained with the conventional SOM.
When the parameter β increased beyond six, the topographic errors increased rapidly.
The experimental results showed that very clear class boundaries could be detected
by our method. However, when the parameter β was increased, topographic errors in-
creased as well, meaning that fidelity to input patterns tended to decrease. Thus, we
must carefully choose the value for the parameter β and compromise between fidelity
and clear class boundaries.

4 Conclusion
In this paper, we separated the individuality and collectivity of neurons, which were rep-
resented in terms of the different outputs from the neurons. Then, we tried to unify the
individuality and collectivity. Unification has the effect of controlling the
individuality and collectivity of neurons and of enhancing their characteristics.

[Fig. 4 panels: PCA scatter plots of the connection weights for (a) β=5, (b) β=7, (c) β=9, (d) β=11, (e) β=13, (f) β=15.]

Fig. 4. The results of PCA applied to the connection weights when the parameter β was increased from 5 (a) to 15 (f) for the glass identification problem


For example, when individual and collective outputs are larger, the unified output should
be larger as well. This means that the characteristics shared by the individual and collec-
tive outputs can be enhanced. On the other hand, if individual and collective outputs are
different from each other, the characteristics should be weakened. This idea has been
implemented in self-organizing maps. In self-organizing maps, much attention has been
paid to the collectivity of neurons. This focus upon collectivity has made it difficult to
visualize class boundaries. We applied our method to the glass identification problem
of the machine learning database. Experimental results showed that much more explicit
class boundaries were obtained on the U-matrix and from the results of PCA. However,
topographic errors tended to increase when the parameter β was increased. Thus, we
must develop a method to compromise between explicit class structure and fidelity to
input patterns.

[Fig. 5: plots of quantization error (QE) and topographic error (TE) against β (5 to 15), for training (1) and testing (2) data.]

Fig. 5. Quantization errors (a) and topographic errors (b) for training (1) and testing (2) data for the glass identification problem.

References
1. Kohonen, T.: Self-Organizing Maps. Springer (1995)
2. Sammon, J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on Com-
puters C-18(5), 401–409 (1969)
3. Ultsch, A., Siemon, H.P.: Kohonen self-organization feature maps for exploratory data analy-
sis. In: Proceedings of International Neural Network Conference, pp. 305–308. Kluwer Aca-
demic Publishers, Dordrecht (1990)
4. Vesanto, J.: SOM-based data visualization methods. Intelligent Data Analysis 3, 111–126
(1999)
5. Kaski, S., Nikkila, J., Kohonen, T.: Methods for interpreting a self-organized map in data
analysis. In: Proceedings of European Symposium on Artificial Neural Networks, Bruges,
Belgium (1998)
6. Yin, H.: ViSOM-a novel method for multivariate data projection and structure visualization.
IEEE Transactions on Neural Networks 13(1), 237–243 (2002)
7. Kamimura, R.: Self-enhancement learning: target-creating learning and its application to
self-organizing maps. Biological Cybernetics, 1–34 (2011)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
9. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM toolbox for Matlab. tech.
rep., Laboratory of Computer and Information Science, Helsinki University of Technology
(2000)
10. Kiviluoto, K.: Topology preservation in self-organizing maps. In: Proceedings of the IEEE
International Conference on Neural Networks, pp. 294–299 (1996)
Autoencoding Ground Motion Data
for Visualisation

Nikolaos Gianniotis, Carsten Riggelsen, Nicolas Kühn, and Frank Scherbaum

University of Potsdam, Institute of Earth and Environmental Science,


Karl-Liebknecht-Str. 24-25, 14476 Potsdam-Golm, Germany
{ngianni,riggelsen,nico,frank.scherbaum}@geo.uni-potsdam.de

Abstract. We present a new visualisation method for physical data


based on the autoencoder that allows a transparent interpretation of the
induced visualisation. The autoencoder is a neural network that com-
presses high-dimensional data into low-dimensional representations. It
defines a fan-in fan-out architecture, with the middle layer composed of
a small number of neurons referred to as the ‘bottleneck’. When data are
propagated through the network, the bottleneck forces the autoencoder
to reduce the dimensionality of the data. Physical data are manifesta-
tions of physical models that express domain knowledge. Such knowledge
should be reflected in the visualisation in order to help the analyst un-
derstand why the data are projected to their particular locations. In this
work we endow the standard autoencoder with this capability by extend-
ing it with extra layers. We apply our approach on a dataset of ground
motions and discuss how the visualisation reflects physical aspects.

Keywords: Topographic Mapping, Autoencoder, Ground Motion.

1 Introduction
Neural networks have been widely used in visualisation of high-dimensional data
[8,9,6]. A particular instance is the autoencoder [10,1,13]. Its architecture typi-
cally defines three hidden layers, with the middle layer being smaller than the
others in terms of number of neurons, thus commonly referred to as the ‘bot-
tleneck’. The input and output layers have the same number of neurons. The
autoencoder learns an identity mapping by training on targets identical to the
inputs y. Training is hampered by the bottleneck that forces the autoencoder to
reduce the dimensionality of the inputs. Thus, output ỹ is only an approximate
reconstruction of input y. Training minimises the L2 norm between inputs and
reconstructions, ‖y − ỹ‖². If the number of neurons in the bottleneck is set to
two, the autoencoder can be used for visualisation. Inputs y induce activations
in the bottleneck that are taken as 2-D coordinates x in a two-dimensional space
where reduced representations of the inputs live. The autoencoder may be viewed
as the composition of an encoding fenc and a decoding fdec function. Encoding
maps inputs to coordinates, fenc (y) = x, while decoding approximately maps
coordinates back to inputs, fdec (x) = ỹ. The complete mapping is denoted as
f (y; w) = fdec (fenc (y)) = ỹ, where w are the weights of the autoencoder.
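The composition f = f_dec ∘ f_enc can be sketched as a plain feedforward pass; the layer sizes are illustrative, the weights random and untrained, and bias terms are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, B = 8, 16, 2          # input dim, hidden width, bottleneck size (illustrative)

# encoder and decoder weight matrices (untrained, for illustration only)
W_enc1, W_enc2 = rng.normal(size=(D, H)), rng.normal(size=(H, B))
W_dec1, W_dec2 = rng.normal(size=(B, H)), rng.normal(size=(H, D))

def f_enc(y):
    """Map input y through a tanh hidden layer to 2-D bottleneck coordinates x."""
    return np.tanh(y @ W_enc1) @ W_enc2

def f_dec(x):
    """Map coordinates x through a tanh hidden layer back to a reconstruction ỹ."""
    return np.tanh(x @ W_dec1) @ W_dec2

def f(y):
    """Full autoencoder f(y; w) = f_dec(f_enc(y)); training would
    minimise ||y - f(y)||² over the weights."""
    return f_dec(f_enc(y))
```

The 2-D activations returned by `f_enc` are exactly the coordinates used for visualisation once the weights have been trained.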

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 395–402, 2012.

© Springer-Verlag Berlin Heidelberg 2012
396 N. Gianniotis et al.

Our data of interest are ground motions (GMs), i.e. the acceleration of ground
movement during an earthquake. Such data are manifestations of an underlying
physical model. However, the autoencoder does not guarantee that reconstruc-
tions ỹ respect the underlying physics. Interpreting a visualisation plot where the
2-D coordinates x map to items ỹ that make no physical sense is problematic.
Moreover, physical data embody domain knowledge that should be reflected in
the visualisation. This helps the analyst inspecting the visualisation understand
why a data item is projected to a particular location x. Being able to physically
interpret the results is important because it increases the analyst’s confidence
in the visualisation. Equally important, physical interpretation may also reveal
inconsistencies, informing the analyst that the visualisation is not meaningful.
In this work we focus on the visualisation of GMs. We extend the autoencoder
by coupling its outputs to a physical model that generates GMs. Effectively, we
add an output layer that physically constrains the reconstructions ỹ. Currently,
visualisation of GMs has been addressed with the application of SOM [12]. How-
ever, the model-free nature of SOM [9] does not permit a physical interpretation
of the visualisation. This work takes a step towards this direction.

2 Physical Model for Ground Motion


Seismic ground motion (GM) is generated by an earthquake and recorded at a
certain distance from the source. Here we consider only the logarithmic peak
ground acceleration (log PGA), a measure of the acceleration of ground move-
ment, which is a parameter of engineering interest, important to seismic hazard
analysis. A simplified physical model for GM is the stochastic model [5]. It gen-
erates GMs by successive applications of filters which represent source, path and
site contributions to ground motion. Its operation is governed by several param-
eters; here we explore a small subset expected to have a large influence on the
properties of ground motion, known in the seismological community as vs30, κ
and stress drop (sd). We denote them as the vector θ = [vs30, κ, sd]. The stochastic
model may be seen as the convolution of shaped white noise with a series of
filters controlled, among others, by the parameters θ. sd operates as a high-
pass filter, κ as a low-pass filter, and vs30 describes amplification of the GM
due to site effects. This convolution produces a time-series with properties that
closely resemble physical GMs. log PGA is extracted from this time-series as the
maximum peak. The action of these filters is clear when looking at the Fourier
spectrum of the GM. However, log PGA is an aggregate measure over all fre-
quencies, and the effect of these filters on log PGA is not easily understood. In
this work we try to understand these effects on log PGA better via visualisation.
It is a general assumption that the logarithm of GM is normally distributed¹
around the prediction of the stochastic model with a known empirical variance
σ²_stoch. Six examples of GM prediction from the stochastic model are shown in
Fig. 4. Each GM prediction consists of six curves, each standing for a different
¹ For brevity, henceforth we say that GM is normally distributed, when actually its
logarithm is normally distributed.

magnitude. We note that log P GA increases with magnitude and decreases with
distance. Instead of working with continuous GM curves, we discretise them by
keeping only 60 log P GA values over the grid of distance-magnitude pairs in-
dicated with × markers in the GM examples in Fig. 4. Thus, each GM curve
becomes a discrete vector y of 60 values.
The non-analytical nature of the stochastic model [5] does not permit cal-
culating derivatives with respect to its inputs θ. We want to use this physical
model as an output layer for the autoencoder. Training the autoencoder via
backpropagation requires calculating derivatives of the activations at all layers
with respect to the weights [2], thus rendering the stochastic model unsuitable.
Our solution was to emulate the stochastic model with a differentiable surrogate
model. This was chosen to be a neural network² of 12 neurons, trained on a
dataset of pairs (θ, y) generated by the stochastic model. The trained neural
network was tested on independent data and its error was judged low enough
as to accept it as a replacement to the stochastic model. Its weights were fixed
and no further adaptation took place when coupled to the autoencoder. We de-
note the surrogate neural network by the function g(θ) = y. As aforementioned,
GM is normally distributed. Consequently, we define a noise model that outputs
observed GM curves y_obs according to y_obs ∼ N(g(θ), σ²_stoch).

3 Autoencoding Ground Motion Data


By coupling surrogate model g to the outputs of the autoencoder we can ensure
that the reconstructions ỹ are physically sound. Thus we need the autoencoder
to output valid physical parameters θ which are then fed to the surrogate model.
To that end, we introduce the following modifications:

i) we modify the linear output layer of the autoencoder so that it is a 3-D


output, f (y) = [f1 (y), f2 (y), f3 (y)]T , i.e. same dimension as the number of
physical parameters.
ii) the autoencoder’s linear output is fed to a vector-valued mapping ξ(·) =
Aψ(·) + β that transforms it into valid physical parameters:
– ψ is a vector-valued version of the logistic sigmoid, σ(f_i) = 1/(1 + exp(−f_i)),
which “squashes” its inputs into [0, 1]: ψ(f) = [σ(f_1), σ(f_2), σ(f_3)]^T.
– A is a 3 × 3 diagonal matrix that scales parameters to the appropriate
range. Its elements are the ranges (θ_i^max − θ_i^min) of the parameters, so that
A = diag((2800 − 630), (0.08 − 0.01), (100 − 1)).
– the 3 × 1 vector β holds the minimum value θ_i^min of each parameter, and
shifts parameters to the appropriate interval: β = [630, 0.01, 1]^T.
iii) the output of the vector-valued mapping is finally fed to surrogate model g.
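The mapping ξ in step ii) can be sketched directly from the ranges given above; this is a minimal illustration of the squash-scale-shift construction, not the authors' code:

```python
import numpy as np

# Sketch of the non-adaptable mapping xi(f) = A * sigmoid(f) + beta that
# turns the autoencoder's 3-D linear output into valid physical parameters
# (vs30, kappa, sd), using the ranges given in the text.
theta_min = np.array([630.0, 0.01, 1.0])
theta_max = np.array([2800.0, 0.08, 100.0])
A = np.diag(theta_max - theta_min)
beta = theta_min

def xi(f):
    psi = 1.0 / (1.0 + np.exp(-f))   # elementwise logistic sigmoid
    return A @ psi + beta

# Any real-valued 3-D output lands inside the valid parameter box:
theta = xi(np.array([-5.0, 0.0, 5.0]))
```

A zero input maps to the midpoint of each parameter range, which is a quick sanity check on A and β.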

The entire model, shown in Fig. 1, is summarised as the composition h(y) =
g(ξ(f(y; w))) = ỹ. In analogy to autoencoder f, we define an encoding part
h_enc(y) = f_enc(y), and a decoding part h_dec(x) = g(ξ(f_dec(x))), so that h(y) =
h_dec(h_enc(y)). The introduced modifications may be seen as extending the
autoencoder with extra non-adaptable layers.

² A feedforward neural network with 1 hidden layer of tanh neurons and linear outputs.

398 N. Gianniotis et al.

Fig. 1. Architecture of the extended autoencoder h. Input y is encoded to 2-D
coordinates x at the bottleneck. x is decoded to a 3-D output which is mapped
by the first non-adaptable additional layer to valid physical parameters θ. Finally θ
is fed to the second non-adaptable additional layer, surrogate model g, that produces
physically valid GMs ỹ. Bias neurons are not plotted for clarity.
As aforementioned, the surrogate model defines a noise model N(y, σ²_stoch).
The similarity of a reconstruction ỹ to its input y is judged by the likelihood
implied by the noise model. Thus, optimisation of the extended autoencoder
proceeds by optimising the log-likelihood function, where the contribution of
input y_n is defined as:

E_n = log N(ỹ_n | y_n, σ²_stoch) = log N(h(y_n) | y_n, σ²_stoch),   (1)
with the log-likelihood over all N inputs being E = Σ_{n=1}^{N} E_n. Discarding
constants, this simplifies to the negative L2 norm:

E_n = −(1/2) (y_n − h(y_n))^T (y_n − h(y_n)),   (2)

which is also the objective optimised by the standard autoencoder.
The only adaptable part in h is autoencoder f. Optimising cost function (2)
requires differentiating it with respect to the free parameters w. Cost function (2)
is a composition of f, ξ, g, thus its derivatives can be obtained by using the
chain rule. For f we need the gradient ∇f (y) of its outputs with respect to its
weights w. Since f is the standard autoencoder, the gradient ∇f (y) is obtained
via standard back-propagation [2]. Surrogate model g is also a neural network
and its Jacobian J (g) is also obtained via back-propagation. For mapping ξ, the
Jacobian J (ξ) is easily calculated as the diagonal 3 × 3 matrix:
J (ξ) = A diag (σ(f1 )(1 − σ(f1 )), σ(f2 )(1 − σ(f2 )), σ(f3 )(1 − σ(f3 ))) . (3)
Thus, the contribution of input y_n to the gradient is:

G(n) = (y_n − h(y_n))^T J(g) J(ξ) (∇f(y)),   (4)

with the gradient over all N inputs being G = Σ_{n=1}^{N} G(n). Having obtained
gradient G we use a gradient-based optimisation on cost function (2).
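Formula (3) for J(ξ) can be checked numerically. The sketch below compares the analytic Jacobian against central finite differences, using the parameter ranges stated earlier; it is a generic verification, not the paper's implementation:

```python
import numpy as np

# Check of formula (3): the Jacobian of xi is A times a diagonal matrix of
# sigmoid derivatives. Parameter ranges follow the text; the check itself
# is a standard finite-difference comparison.
theta_min = np.array([630.0, 0.01, 1.0])
theta_max = np.array([2800.0, 0.08, 100.0])
A = np.diag(theta_max - theta_min)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def xi(f):
    return A @ sigmoid(f) + theta_min

def J_xi(f):
    s = sigmoid(f)
    return A @ np.diag(s * (1.0 - s))

f = np.array([0.3, -1.2, 0.7])
J_analytic = J_xi(f)

# Central differences, one coordinate direction at a time.
eps = 1e-6
J_numeric = np.column_stack([
    (xi(f + eps * e) - xi(f - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(J_analytic, J_numeric, atol=1e-4)
```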

4 Magnification Factors
Since the 2-D coordinates x induced at the bottleneck are products of non-linear
mappings, the topographic map ‘stretches’ and ‘contracts’. Such manifestations
are known as magnification factors [3]. It means that as we traverse the map
by taking equally small steps, we may encounter regions where models change
slowly, but also regions where models change rapidly as we move. Thus, distances
between the projected data items are distorted and a correct interpretation
requires the calculation of magnification factors. Models such as the SOM
measure magnifications via the U-matrix [15], while the Generative Topographic
Mapping [3] uses tools from differential geometry. Here we also calculate
magnification factors

Fig. 2. A coordinate x that lives in the 2-D low dimensional space is perturbed
by Δx in predefined directions.

Fig. 3. Magnification as maximum D_KL. Bright/dark shades indicate high/low
magnifications.

by displacing coordinates x by small perturbations Δx in different directions
(see Fig. 2) in the fashion of [14]. An unperturbed coordinate x maps to a vector
h_dec(x) = ỹ, while the perturbed version x + Δx maps to a perturbed vector
h_dec(x + Δx) = ỹ′. The distance between x and x + Δx must be understood via
the change incurred in the output space when moving from ỹ to ỹ′. Recall that
the output of the surrogate model is modelled by the density N(g(θ), σ²_stoch),
and thus we measure this distance as the Kullback-Leibler divergence [7]:

D_KL( N(h_dec(x), σ²_stoch) ‖ N(h_dec(x + Δx), σ²_stoch) ).   (5)

This procedure is repeated for all coordinates x, each displaced by Δx in a set
of predefined directions. At each x we record the maximum D_KL.
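For two isotropic Gaussians with the same variance, the divergence in (5) has the closed form ‖μ1 − μ2‖² / (2 σ²_stoch), so the perturbation procedure can be sketched as below. The decoder h_dec is left abstract; the toy linear decoder in the usage line is purely illustrative:

```python
import numpy as np

# For two isotropic Gaussians with identical variance sigma^2, the KL
# divergence in (5) reduces to the squared distance between the decoded
# curves, scaled by 1/(2 sigma^2).
def kl_equal_var(mu1, mu2, sigma):
    diff = mu1 - mu2
    return float(diff @ diff) / (2.0 * sigma**2)

def magnification(h_dec, x, sigma, step=1e-2, n_dirs=16):
    """Maximum KL over small displacements of x in predefined directions."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    y0 = h_dec(x)
    return max(
        kl_equal_var(y0, h_dec(x + step * np.array([np.cos(a), np.sin(a)])), sigma)
        for a in angles
    )

# Illustrative usage with a toy linear "decoder":
toy_dec = lambda x: np.array([x[0], 2.0 * x[1]])
mag_at_origin = magnification(toy_dec, np.zeros(2), sigma=0.1)
```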

5 Experiments and Results


The purpose of our work is to investigate the behaviour of the stochastic model
via visualisation. For each parameter vs30, κ and sd we took 60 values at regular
intervals within its respective range. Taking all combinations of these values,
we generated via the stochastic model a dataset of 60³ = 216000 GMs. The
architecture of autoencoder f was (inputs=60) × (hidden=10) × (bottleneck=2)
× (hidden=10) × (out=3).
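The parameter grid described above can be reproduced schematically, using the ranges given earlier for A and β; this is a sketch, not the authors' generation code:

```python
import itertools
import numpy as np

# Parameter grid: 60 regularly spaced values per parameter, all
# combinations -> 60^3 = 216000 parameter vectors. The ranges are those
# stated earlier in the paper for (vs30, kappa, sd).
vs30 = np.linspace(630.0, 2800.0, 60)
kappa = np.linspace(0.01, 0.08, 60)
sd = np.linspace(1.0, 100.0, 60)

grid = np.array(list(itertools.product(vs30, kappa, sd)))
```

Each row of `grid` would then be pushed through the stochastic model to produce one GM curve of the dataset.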
Fig. 4 presents the visualisation obtained via the extended autoencoder. The
plot is annotated with GMs that are typical for their respective regions. The
main feature of the plot is the organisation of GMs in the 4 annotated regions.
In particular, as we move from region (1) towards region (2) we see that the
shapes of the encountered GMs change systematically in that the curves fall lower
and lower in terms of log PGA. We discern two trends: moving from region (1)
towards (2) along the solid line, we find GMs whose log PGA-curves show greater
slopes, while along the dashed line the log PGA-curves show lower slopes.

Fig. 4. Visualisation of GMs.

Fig. 5. From left to right, parameter plots for vs30, κ and sd. We stress that the
plots do not display the parameters of the visualised data items, but the parameters
obtained via ξ(f_dec(x)) for any x. The parameter plots show how the physical
model changes in order to accommodate the projections.
We also visualised the dataset of GMs using a state-of-the-art method, the
Gaussian Process Latent Variable Model (GPLVM) [11]. We empirically observed
that it produced the same level of topographic organisation as our approach (not
displayed here due to lack of space). This shows that a model-based approach like
ours is not always privileged in terms of topographic organisation. However, the
real advantage of a model-based over a model-free approach (e.g. autoencoder
[10], SOM [9], GTM [4], GPLVM [11]) is that it provides an interpretation of
how the visualisation arises. Any coordinate x in the visualisation plot maps via
ξ(f_dec(x)) to a valid GM with parameters θ. Fig. 5 displays plots for parameters
vs30, κ and sd corresponding to coordinates x in the visualisation plot. Bright
and dark shades indicate high and low values respectively.
A data item y is mapped to a coordinate x if x maps to a GM with parameters
θ that resembles y. Thus, the parameter plots in Fig. 5 help us understand
why data items are projected to their particular locations. Specifically: region
(1) corresponds to low values of vs30, which physically means more site ampli-
fications. Region (2) corresponds to low sd and high vs30, generally leading to
GMs of low log PGA. These two findings are not surprising and meet our prior
physical expectations. However, κ is in general a difficult parameter to interpret.
Parameter κ separates GMs into region (3) of low κ and region (4) of high κ,
which exhibit high and moderate slopes respectively. The visualisation informs us that κ
affects the distance-dependency of log PGA. This effect is not a priori expected;
it is rather an indirect effect that arises due to the complex behaviour of the
stochastic model.
Fig. 3 displays the magnification factors. Bright and dark shades show high
and low magnifications respectively. We see that magnification is overall fairly
moderate apart from the lower-right of region (4) where the space is stretched,
i.e. the stochastic model is sensitive to parameter change in this region. Such
changes are easily detected via magnification factors. Indeed, upon inspection
of the corner of region (4), we find GMs whose curves change rapidly towards
lower values in terms of log PGA, which explains the peaked lower-right corner.

6 Conclusions
We presented a new visualisation method for GMs based on the autoencoder that
allows interpreting the visualisation through a physical model which explains
why data are projected at their particular locations x. The parameter plots
clearly show how the physical model drives the visualisation which helps us better
understand the behaviour of the stochastic model. Magnification factors reveal
parameter sensitivities. Such interpretability can be achieved only by adopting
a model-based approach. Other data types can be accommodated by coupling a
suitable model to the autoencoder.

Acknowledgments. The reviewers' comments greatly helped improve the proposed work.

References
1. Baldi, P., Hornik, K.: Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks 2, 53–58 (1989)
2. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (1996)
3. Bishop, C.M., Svensén, M., Williams, C.K.I.: Magnification Factors for the GTM Algorithm. In: Gerstner, W., Hasler, M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 64–69. Springer, Heidelberg (1997)
4. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The generative topographic mapping. Neural Computation 10(1), 215–234 (1998)
5. Boore, D.M.: Simulation of ground motion using the stochastic method. Pure and Applied Geophysics 160, 635–676 (2003)
6. Hagenbuchner, M., Sperduti, A., Tsoi, A.C.: A self-organizing map for adaptive processing of structured data. IEEE Transactions on Neural Networks 14(3), 491–505 (2003)
7. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning. Springer (2001)
8. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006)
9. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480 (1990)
10. Kramer, M.A.: Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37, 233–243 (1991)
11. Lawrence, N.D.: Gaussian process latent variable models for visualisation of high dimensional data. In: NIPS 16 (2004)
12. Scherbaum, F., Kuehn, N.M., Ohrnberger, M., Koehler, A.: Exploring the proximity of ground-motion models using high-dimensional visualization techniques. Earthquake Spectra 26(4), 1117–1138 (2010)
13. Tan, C.C., Eswaran, C.: Autoencoder Neural Networks: A Performance Study Based on Image Reconstruction, Recognition and Compression. Lambert Academic Publishing (2009)
14. Tiňo, P., Gianniotis, N.: Metric properties of structured data visualizations through generative probabilistic modeling. In: IJCAI 2007, pp. 1083–1088 (2007)
15. Ultsch, A., Siemon, H.P.: Kohonen's self organizing feature maps for exploratory data analysis. In: INNC Paris, vol. 90, pp. 305–308 (1990)
Examining an Evaluation Mechanism
of Metaphor Generation with Experiments
and Computational Model Simulation

Asuka Terai¹, Keiga Abe², and Masanori Nakagawa¹

¹ Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo, Japan
² Gifu Shotoku Gakuen University, 1-1 Takakuwanishi, Yanaidu-cho, Gifu, Japan

Abstract. This study examines the metaphor generation process, in which an
expression consisting of a topic modified by certain features becomes a
metaphorical expression of the form "TOPIC like VEHICLE", through model
simulation and psychological experiments. A computational model of metaphor
generation has been developed that includes an evaluation mechanism[4]. The
model assumes two processes within metaphor generation. First, vehicle
candidates are generated within a candidate metaphor generation process, and
then, the candidates are evaluated within an evaluation process. This study
focuses specifically on examining the timing of the evaluation mechanism with a
simulation and psychological experiments. The results support the claim that
the evaluation mechanism follows the metaphor generation mechanism within the
overall metaphor generation process.

Keywords: Metaphor generation, Statistical language analysis.

1 Introduction
The purpose of this paper is to examine the evaluation mechanism within
metaphor generation using a computational model. Metaphor generation is
regarded as a process where an expression consisting of a TOPIC modified by
certain FEATURES (e.g. "a sad song and a song narrates"), referred to as the
input expression, becomes a metaphorical expression of the form "TOPIC like
VEHICLE" (e.g. "the song like tears"). Some computational models of metaphor
generation have already been developed, and, among these, some utilize linguistic
corpora. For instance, Kitada and Hagiwara[1] constructed a figurative compo-
sition support system that included a model of metaphor generation based on
an electronic dictionary. In contrast, Abe, Sakamoto and Nakagawa's model[2] is
based on the results of statistical language analysis, which is more objective than
existing dictionaries, which must be compiled through the considerable efforts of
language professionals. Moreover, Terai and Nakagawa[3][4] constructed a model

* This research is supported by MEXT's program "Promotion of Environmental
Improvement for Independence of Young Researchers" and a Grant-in-Aid for Young
Scientists (B) (23700160).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 403–410, 2012.
© Springer-Verlag Berlin Heidelberg 2012

that incorporates the dynamic interaction among features by employing the
results of statistical language analysis. In particular, Terai and Nakagawa's
model[4] has a mechanism of metaphor evaluation, and a simulation of that model
was slightly closer to human metaphor-generation performance than a model
without the evaluation mechanism[3].
The model[4] consists of a candidate metaphor generation process and a
metaphor evaluation process. Within the candidate metaphor generation pro-
cess, the model outputs candidate nouns for the vehicle. Within the metaphor
evaluation process, the vehicle candidate nouns are evaluated based on similar-
ities between the meaning of a metaphor including the candidate noun and the
meaning of an input expression, such that the metaphor that is most similar
to the input expression is output as the most adequate metaphor. Thus, the
model assumed a two-process structure where the evaluation mechanism starts
after the metaphor generation mechanism is complete. It is, however, feasible
that the generation mechanism and the evaluation mechanism operate simulta-
neously. Accordingly, the two-process assumption warrants further examination,
specifically with respect to the timing of the evaluation mechanism. Moreover,
previous versions of the model have not considered the degree of dis-similarity
between the topic and the vehicle within the metaphor generation mechanism,
even though such dissimilarities are taken into account in Kitada and Hagiwaras
model[1]. Thus, this paper improves the model of the metaphor generation mech-
anism by considering this dissimilarity, and subsequently investigates the timing
of the evaluation mechanism through a simulation of the improved model.

2 Metaphor Generation Models

2.1 Estimating Knowledge Structure Using Corpora


The metaphor generation model is constructed utilizing a knowledge structure
based on the results of statistical language analysis[5], which was also used in
previous studies[2][3][4]. The statistical language analysis estimates latent classes
among nouns and adjectives (or verbs) as a knowledge structure using four kinds
of modification frequency data (adjective-noun (Adj), noun(subject)-verb (S-V),
verb-noun(modification) (V-M), and verb-noun(object) (V-O)) which were
extracted from a Japanese newspaper for the period 1993-2002. The statistical
method assumes that the co-occurrence probabilities of these terms, P(n_i^r, a_j^r),
can be computed using the following formula (1):

P(n_i^r, a_j^r) = Σ_k P(n_i^r | c_k^r) P(a_j^r | c_k^r) P(c_k^r),   (1)

where c_k^r indicates the kth latent class assumed within this method for the r
type of modification data. The parameters (P(n_i^r | c_k^r), P(a_j^r | c_k^r), and
P(c_k^r)) are estimated as the values that maximize the log likelihood of the
co-occurrence frequency data between n_i^r and a_j^r using the EM algorithm. The
statistical language analysis is applied to each set of co-occurrence data, fixing
the number of latent classes at 200. P(c_k^r | n_i^r) and P(c_k^r | a_j^r) are
computed using Bayes' theorem. The 18,142 noun types (n_h^*) that are common to
all four types of modification data were used in the system. The meaning vector
of a noun and the meaning vector of a feature were estimated based on
P(c_k^r | n_i^r) and P(c_k^r | a_j^r).
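The latent-class decomposition in formula (1) is structurally a pLSA-style mixture, and its EM updates can be sketched on toy counts. The sizes below are illustrative, not the 18,142-noun, 200-class setting used by the authors:

```python
import numpy as np

# Minimal EM sketch for the latent-class model
# P(n_i, a_j) = sum_k P(n_i|c_k) P(a_j|c_k) P(c_k),
# i.e. a small pLSA-style decomposition of a noun-by-feature
# co-occurrence count matrix. Sizes and data are toy values.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(8, 6)).astype(float)  # toy co-occurrence counts
K = 3

p_c = np.full(K, 1.0 / K)
p_n_c = rng.dirichlet(np.ones(8), size=K).T   # P(n_i | c_k), shape (8, K)
p_a_c = rng.dirichlet(np.ones(6), size=K).T   # P(a_j | c_k), shape (6, K)

for _ in range(100):
    # E-step: responsibilities P(c_k | n_i, a_j)
    joint = np.einsum('ik,jk,k->ijk', p_n_c, p_a_c, p_c)
    resp = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate parameters from expected counts
    exp_counts = counts[:, :, None] * resp
    p_c = exp_counts.sum(axis=(0, 1))          # unnormalised class masses
    p_n_c = exp_counts.sum(axis=1) / p_c
    p_a_c = exp_counts.sum(axis=0) / p_c
    p_c = p_c / p_c.sum()

# P(c_k | n_i) via Bayes' theorem, as used for the meaning vectors:
p_c_n = (p_n_c * p_c) / (p_n_c * p_c).sum(axis=1, keepdims=True)
```

The rows of `p_c_n` play the role of the meaning vectors over latent classes described in the text.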

2.2 Metaphor Generation Mechanism


The metaphor generation mechanism is essentially realized using the metaphor
generation model[3][4]. The model outputs candidate nouns for vehicles from
inputs consisting of the topic and its features that are represented by adjectives
or verbs. The model consists of three layers: an input layer, a hidden layer, and
an output layer. The input layer consists of feature nodes, where each node
indicates either an adjective or a verb. Each feature node relating to a topic
has mutual and symmetric connections with other feature nodes relating to the
topic. The mutual connections represent dynamic interaction among the features.
The hidden layer consists of nodes that are the latent classes estimated using
the statistical language analysis. The output layer consists of noun nodes. An
input expression of the form L(n∗h0 , ar1 r2 r1 ∗ r2 ∗
j1 , aj2 , ...), such as aj1 -nh0 and aj2 -nh0 , is

input into the model. The output (Oa(nh )) for each noun node indicates how
strongly the nouns are associated with an input expression as the VEHICLE in
the metaphor ”n∗h0 ” like VEHICLE1 .
Moreover, the model has been improved to take into consideration the degree
of dissimilarity between a topic and a candidate noun for the vehicle, because
the topic should be classified into a different semantic category from that of the
vehicle in order to make the sentence metaphorical. The dissimilarity
(dSim(n_h^*, n_h0^*)) between a topic (n_h0^*) and a vehicle candidate noun
(n_h^*) is represented as −1 times the cosine of the angle between their respective
meaning vectors. Thus, the adequacy of a noun as a vehicle candidate (Og(n_h^*))
is computed using formula (2):

Og(n_h^*) = λ1 Oa(n_h^*) + λ2 dSim(n_h^*, n_h0^*),   (2)

where Oa(n_h^*) indicates the adequacy of the vehicle candidate given the input
expression, λ1 weights the influence of that adequacy, and λ2 weights the influence
of the dissimilarity between the topic and the vehicle candidate noun. Output
from the model without the evaluation mechanism is a ranking of the vehicle
candidate nouns according to their adequacy values.
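Formula (2) can be sketched with toy meaning vectors; the scores and vectors below are illustrative assumptions, not values from the trained model:

```python
import numpy as np

# Sketch of formula (2): a candidate's overall adequacy combines its
# association score Oa with the negative-cosine dissimilarity to the
# topic. Meaning vectors and scores here are toy placeholders.
def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def og(oa, v_cand, v_topic, lam1=1.0, lam2=1.0):
    d_sim = -cosine(v_cand, v_topic)     # dSim: -1 times cosine similarity
    return lam1 * oa + lam2 * d_sim

v_topic = np.array([1.0, 0.0, 0.0])
near = np.array([0.9, 0.1, 0.0])     # semantically close to the topic
far = np.array([0.0, 0.2, 1.0])      # semantically distant

# With equal association scores, the more dissimilar noun scores higher,
# reflecting the requirement that topic and vehicle differ in category:
assert og(0.5, far, v_topic) > og(0.5, near, v_topic)
```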

2.3 Metaphor Evaluation Mechanism


The metaphor evaluation mechanism functions based on the similarity between
the meaning of the metaphor including the candidate noun and the meaning of
the input expression, which is defined within the model's evaluation process[4].
¹ The model is not a Bayesian model. The weights from input to hidden nodes and
from hidden to output nodes are estimated using P(c_k^r | a_j^r) or P(c_k^r | n_h^*)
in order to represent the relationships between the latent classes and features or nouns.

The meaning of the metaphor and the meaning of the input expression are
estimated according to Kintsch's predication algorithm[6].
The meaning of a metaphor consisting of a topic and a candidate vehicle
(M(n_h0^*, n_h^*)), which indicates "topic (n_h0^*) like vehicle (n_h^*)", is
estimated using meaning vectors. First, the semantic neighborhood (N(n_h^*)) of
a candidate vehicle (n_h^*) of size S_nm is computed on the basis of similarity
to the vehicle, which is represented by the cosine of the angles between meaning
vectors. Next, S_m nouns are selected from the semantic neighborhood (N(n_h^*))
of the vehicle on the basis of their similarity to the topic (n_h0^*). Finally, a
vector (V(M(n_h0^*, n_h^*))) is computed as the centroid of the meaning vectors
for the topic, the vehicle and the selected S_m nouns. The computed vector
(V(M(n_h0^*, n_h^*))) indicates the assigned meaning of the topic (n_h0^*) as a
member of the ad-hoc category of the vehicle n_h^* in the metaphor M(n_h0^*, n_h^*).

The meaning of the input expression (L(n_h0^*, a_ju^{ru}, ...)) consisting of the
inputs for the topic (n_h0^*) and its features (a_ju^{ru}) is also estimated.
First, the semantic neighborhood (N(a_ju^{ru})) of a feature of size S_nl is
computed on the basis of the similarity to the feature (a_ju^{ru}), which is
represented by the cosine of the angles between feature vectors. Next, S_l
features are selected from the semantic neighborhood (N(a_ju^{ru})) of the
feature on the basis of their similarities to the topic (n_h0^*), which are
computed as the cosine of the angles between the meaning vectors of the features
and the topic using P(c_k^{ru} | a_ju^{ru}) and P(c_k^{ru} | n_h0^*). A vector
(V(L(n_h0^*, a_ju^{ru}))) is computed as the centroid of the meaning vectors for
the topic, an input feature and the selected S_l features. When the input
expression has more than one feature, namely u > 1, the centroid is first
computed for each feature separately and then a centroid of the centroids is
computed as the meaning vector of the input expression (V(L(n_h0^*, a_ju^{ru}, ...))).

The similarity between a metaphor including a candidate noun and the input
expression is represented as the cosine of the angle between the meaning vector
of the metaphor (V(M(n_h0^*, n_h^*))) and the meaning vector of the input
expression (V(L(n_h0^*, a_ju^{ru}, ...))). This similarity is denoted Oe(n_h^*).
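The neighbourhood-and-centroid construction above can be sketched as follows; the vocabulary of meaning vectors is a random stand-in and the neighbourhood sizes are small toy values, not the S_nm and S_m used in the paper:

```python
import numpy as np

# Sketch of the predication-style construction of a metaphor's meaning:
# take the s_nm nearest neighbours of the vehicle, keep the s_m of them
# closest to the topic, and average topic, vehicle and survivors.
def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def metaphor_vector(v_topic, v_vehicle, vocab, s_nm=4, s_m=2):
    # semantic neighbourhood of the vehicle
    by_vehicle = sorted(vocab, key=lambda v: cosine(v, v_vehicle), reverse=True)
    neighbourhood = by_vehicle[:s_nm]
    # keep the neighbourhood members most similar to the topic
    by_topic = sorted(neighbourhood, key=lambda v: cosine(v, v_topic), reverse=True)
    selected = by_topic[:s_m]
    return np.mean([v_topic, v_vehicle] + selected, axis=0)

rng = np.random.default_rng(1)
vocab = [rng.normal(size=5) for _ in range(20)]   # toy meaning vectors
v = metaphor_vector(rng.normal(size=5), rng.normal(size=5), vocab)
```

The same two-step selection, applied to feature neighbourhoods instead of noun neighbourhoods, yields the meaning vector of the input expression.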

2.4 Three Different Metaphor Generation Models


Two-Process Model. The top P nouns are evaluated based on their adequacy
values as vehicle candidates (Og(n_p^*)). The P nouns are sorted by the degree
of similarity between the metaphor including the noun and the meaning of the
input expression (Oe(n_p^*)). The ranking of other nouns is not changed. The
ranking of nouns indicates their degree of adequacy as vehicles in the metaphor.
The model is based on the two-process assumption: the evaluation mechanism
operates after the generation mechanism is complete. That is to say, a noun's
adequacy value as a vehicle is computed using the following formula (3):

Oet(n_h^*) = Oe(n_h^*) + c   if n_h^* ∈ N_P,
Oet(n_h^*) = Og(n_h^*)       otherwise,   (3)

where c is a constant value, which estimates the smallest value of Og(n_p^*),
and N_P indicates the set of the top P nouns.

Simultaneous Evaluation Model. The adequacy values of nouns as vehicles
are computed using the following formula (4):

Oes(n_h^*) = λ3 Oe(n_h^*) + Og(n_h^*),   (4)

where λ3 indicates the extent of the evaluation. The nouns are sorted according
to their adequacy values (Oes (n∗h )). The ranking indicates the adequacy of a
noun as a vehicle in the metaphor. In the model, the metaphor generation mech-
anism and the evaluation mechanism operate simultaneously. However, within
this model, all the nouns have to be evaluated. Thus, this model includes a higher
computational load than the two-process model.

Hybrid Model. The top P nouns are evaluated according to their adequacy
values as vehicle candidates (Og(n_h^*)). Only these P nouns are sorted according
to their adequacy values as a vehicle (Oes(n_h^*)), which are computed using
formula (4). The ranking of other nouns is not changed. That is to say, the
adequacy values for nouns as vehicles are computed using the following formula (5):

Oeh(n_h^*) = λ3 Oe(n_h^*) + Og(n_h^*)   if n_h^* ∈ N_P,
Oeh(n_h^*) = Og(n_h^*)                  otherwise.   (5)

The model incurs a similar computational load as the two-process model. After
the generation mechanism, the top P nouns are evaluated, but, within the
evaluation process, the generation mechanism and the evaluation mechanism
operate simultaneously for the top P nouns.
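The three ranking schemes can be sketched side by side. Here `og` and `oe` are toy score arrays over a small noun vocabulary, and the constant c of the two-process variant is simply set from the scores at hand rather than estimated as in the paper:

```python
import numpy as np

# Sketches of the three ranking schemes (formulas (3)-(5)). In the paper,
# oe is the costly metaphor-evaluation score and is computed only for the
# top-P candidates in the two-process and hybrid variants.
def rank_two_process(og, oe, p, c=0.0):
    scores = og.copy()
    top = np.argsort(og)[::-1][:p]
    scores[top] = oe[top] + c                 # formula (3)
    return np.argsort(scores)[::-1]

def rank_simultaneous(og, oe, lam3):
    return np.argsort(lam3 * oe + og)[::-1]   # formula (4), all nouns

def rank_hybrid(og, oe, p, lam3):
    scores = og.copy()
    top = np.argsort(og)[::-1][:p]
    scores[top] = lam3 * oe[top] + og[top]    # formula (5)
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(2)
og, oe = rng.random(10), rng.random(10)
r1 = rank_two_process(og, oe, p=3, c=og.min())
```

Each function returns a full ranking of the noun indices, most adequate first; only the top-P entries are affected by the evaluation step in the two-process and hybrid variants.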

3 Experiments
Two kinds of experiment were conducted in order to elucidate the timing of the
evaluation mechanism, using the same input expressions in Japanese: "a sad song
and a song narrates" and "hot words and severe words". One was an experiment
that included no time for thinking and the other was an experiment that included
thinking time. In the experiment without thinking time, the participants were 90
undergraduates. They were presented with the input expressions and were asked
to make a metaphorical expression of the form "TOPIC like VEHICLE" and
to respond with the vehicles within 1.5 minutes. In contrast, in the experiment
with thinking time included, the participants were 71 undergraduates. They were
presented with the input expressions and were asked to think about the vehicles
during a thinking time of 3 minutes before responding. Then, they were asked
to provide their vehicle responses within 1.5 minutes.
We hypothesized that the influence of the evaluation mechanism would be
more strongly reflected in the results from the experiment including think-
ing time than in the results of the experiment without thinking time. If the

two-process assumption is correct and the evaluation mechanism operates after
metaphor generation, the results from the experiment with the thinking time in
particular should be more consistent with the simulation results for the
two-process model.
mechanism operate simultaneously, then the simulation results for the simul-
taneous evaluation model should match more closely to the results from both
experiments with and without thinking time. Moreover, if the hybrid model is
the best model, it would indicate that the evaluation mechanism begins after
the generation mechanism but that they operate simultaneously.

4 Comparisons of the Experimental Results and the Model Simulation Results
The degrees of coincidence between the experimental results and the model
simulation results were computed using the rank correlation coefficient between
them. The computation of the rank correlation coefficient used the rankings of
nouns that more than one participant responded to and that are included in the
corpora. Thus, nouns that were provided as a vehicle by only one participant
in the respective experiments were omitted, in order to eliminate personal and
outlier responses and to reduce individual differences.
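This coincidence measure can be sketched as a rank correlation between model scores and response frequencies. The paper does not state which rank-correlation variant was used, so the sketch below assumes Spearman's ρ without ties, on toy data:

```python
import numpy as np

# Spearman's rho between two score vectors, assuming no ties:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
# difference between the ranks of item i under the two orderings.
def spearman_rho(x, y):
    rx = np.argsort(np.argsort(x))   # ranks of x (0-based)
    ry = np.argsort(np.argsort(y))   # ranks of y (0-based)
    d = rx - ry
    n = len(x)
    return 1.0 - 6.0 * float(d @ d) / (n * (n**2 - 1))

# Toy example: model scores vs. (tie-free) response frequencies.
model_scores = np.array([0.9, 0.7, 0.4, 0.2, 0.1])
freqs = np.array([5.0, 3.0, 4.0, 2.0, 1.0])
rho = spearman_rho(model_scores, freqs)
```

With tied response frequencies, a tie-corrected variant would be needed instead.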
In this study², the evaluation models were simulated using the parameters
S_nm = 50, S_m = 3, S_nl = 10, S_l = 3. The P parameter within the two-process
model was changed from 1000 to the number of nouns (18,142) in 1000 increments.
The parameters λ1 and λ2 in both models and λ3 in the model with the
simultaneous evaluation mechanism were changed from 0.1 to 3.0 in 0.1
increments, respectively. The hybrid model used the fixed value for P that
yielded the highest correlation in the two-process model. The set of parameters
adopted was that which provided the highest correlation coefficient. The results
for the highest rank correlation coefficients and the parameter sets are shown
in Table 1. When various sets of parameter values yielded the same highest
correlation coefficient, only one set is shown for each condition³.
When the input expression was "a sad song and a song narrates", the two-process
model was the most consistent with the results from both experiments, with
and without thinking time. However, when the input expression was "hot words
and severe words", the results from the experiment with thinking time are most
consistent with the simulation results for the hybrid model. The results from
the experiment without thinking time are most consistent with the two-process
model. Regardless of the input expression and the experimental condition, the
simulation results for the model without an evaluation mechanism are the least
consistent with the experimental results.
² The metaphor generation mechanism[3] was simulated using the parameters
α = ln(5), β = 0.01, γ = 10.
³ The sets of parameter values that yielded the highest correlation coefficients had
similar proportions for λ1, λ2 and λ3. Furthermore, there were no large differences
in the parameter P values within the sets.

Table 1. Rank correlation coefficients between the experimental results and the model
simulation results, together with the adopted parameter values. Significance at the
1% level (**), 5% level (*) and 10% level (†) is indicated.

"a sad song and a song narrates"

Experiment             Model                    Rank corr. coeff.   λ1   λ2   λ3   P
with thinking time     two-process              0.300**             2.4  0.1  —    5000
N = 43                 simultaneous evaluation  0.288**             2.8  0.3  0.3  —
                       hybrid                   0.295**             2.4  0.1  2.8  5000
                       without evaluation       0.233*              2.2  0.1  —    —
without thinking time  two-process              0.261*              2.0  0.1  —    16000
N = 35                 simultaneous evaluation  0.250*              2.7  0.7  0.6  —
                       hybrid                   0.246*              1.5  0.1  2.5  16000
                       without evaluation       0.146               2.3  0.1  —    —

"hot words and severe words"

Experiment             Model                    Rank corr. coeff.   λ1   λ2   λ3   P
with thinking time     two-process              0.172               0.1  0.2  —    16000
N = 41                 simultaneous evaluation  0.178               0.1  2.2  1.4  —
                       hybrid                   0.181†              0.1  2.2  1.4  16000
                       without evaluation       0.069               1.7  1.3  —    —
without thinking time  two-process              0.194†              2.4  0.1  —    15000
N = 41                 simultaneous evaluation  0.162               0.1  2.7  1.4  —
                       hybrid                   0.191†              2.4  0.1  1.9  15000
                       without evaluation       0.108               2.9  0.1  —    —

5 Discussion
The model without evaluation exhibited the lowest correlation coefficients for
all conditions. It is, therefore, clear that some form of evaluation mechanism
is involved in metaphor generation. The two-process model provides the most
comprehensive explanation of the experimental results. However, there are no
major differences among the three types of models that include an evaluation
mechanism. From a cognitive perspective, the simultaneous evaluation model
would appear to necessitate excessive cognitive loads in order to evaluate all
candidate nouns, and it is, thus, rather unlikely that people generate metaphors
in a manner that is similar to that assumed in that model. Based purely on the
results obtained from the present study, it is not possible to distinguish between
the two-process model and the hybrid model in terms of which is better, because
of the limited set of just two input expressions. However, both models assume
that vehicle candidates are initially generated, and then, some of the candidates
are subsequently evaluated. Essentially, the results indicate that an evaluation
mechanism is executed while the generation mechanism is continuing to operate.
In addition, no differences due to the experimental contrast were observed in the
present results. The participants could have evaluated candidate nouns as much
in the without-thinking-time condition as in the thinking-time condition. However,
the results suggest that the evaluation mechanism may start at an early stage
of the generation mechanism, even though it is initiated after the generation
mechanism has begun its operation.

In addition, different tendencies were observed in the parameter values depending
on the type of input expression. For "a sad song and a song narrates", a low
value for the λ2 parameter (the weight of the dissimilarity between the topic
and the vehicle candidate) led to the highest correlations, while in contrast, for
"hot words and severe words", the parameter values were relatively high in the
thinking-time condition. Within the metaphor generation from "hot words and
severe words", the degree of dissimilarity between the topic and the vehicle may
have played a more important role in the thinking-time condition. It is, therefore,
possible that the parameter values reflect the overall metaphor generation
process, for the most appropriate parameter values would seem to have been
determined by characteristics of the input expressions.
However, the rank correlation coefficients between the experimental and
the model results are all between 0.1 and 0.3, and there is no significant
difference among the correlation coefficients in each condition. Furthermore, for the
two-process model, the number of evaluated nouns (P) is too high from
a cognitive point of view. Because the experimental results include individual
differences, it is difficult for the model to estimate the experimental results. In
order to verify the validity of the model, other types of experiments, for example
rating of the model results, have to be conducted. In order to represent the cognitive
processing of conceptual metaphors [7], it is necessary to modify the model. However,
the model could be applied to a metaphor generation support system,
because it was constructed based on the corpus and is able to cover
the nouns and features that are used in metaphorical expressions.

References
1. Kitada, J., Hagiwara, M.: Figurative composition support system using electronic
dictionaries. Transactions of Information Processing Society of Japan 42(5), 1232–
1241 (2001)
2. Abe, K., Sakamoto, K., Nakagawa, M.: A computational model of metaphor genera-
tion process. In: Proc. of the 28th Annual Meeting of the Cognitive Science Society,
pp. 937–942 (2006)
3. Terai, A., Nakagawa, M.: A Neural Network Model of Metaphor Generation with
Dynamic Interaction. In: Alippi, C., Polycarpou, M., Panayiotou, C., Ellinas, G.
(eds.) ICANN 2009, Part I. LNCS, vol. 5768, pp. 779–788. Springer, Heidelberg
(2009)
4. Terai, A., Nakagawa, M.: A Computational System of Metaphor Generation with
Evaluation Mechanism. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN
2010, Part II. LNCS, vol. 6353, pp. 142–147. Springer, Heidelberg (2010)
5. Kameya, Y., Sato, T.: Computation of probabilistic relationship between concepts
and their attributes using a statistical analysis of Japanese corpora. In: Proc. of
Symposium on Large-scale Knowledge Resources: LKR 2005, pp. 65–68 (2005)
6. Kintsch, W.: Metaphor comprehension: A computational theory. Psychonomic Bul-
letin & Review 7(2), 257–266 (2000)
7. Lakoff, G., Johnson, M.: Metaphors We Live By. University of Chicago Press (1980)
Pairwise Clustering with t-PLSI

He Zhang, Tele Hao, Zhirong Yang, and Erkki Oja

Department of Information and Computer Science


Aalto University School of Science, Espoo, Finland
{he.zhang,tele.hao,zhirong.yang,erkki.oja}@aalto.fi

Abstract. In the past decade, Probabilistic Latent Semantic Indexing


(PLSI) has become an important modeling technique, widely used in
clustering or graph partitioning analysis. However, the original PLSI is
designed for multinomial data and may not handle other data types.
To overcome this restriction, we generalize PLSI to t-exponential family
based on a recently proposed information criterion called t-divergence.
The t-divergence enjoys more flexibility than KL-divergence in PLSI such
that it can accommodate more types of noise in data. To optimize the
generalized learning objective, we propose a Majorization-Minimization
algorithm which multiplicatively updates the factorizing matrices. The
new method is verified in pairwise clustering tasks. Experimental results
on real-world datasets show that PLSI with t-divergence can improve
clustering performance in purity for certain datasets.

Keywords: clustering, divergence, approximation, multiplicative update.

1 Introduction
Probabilistic clustering has been obtaining promising results in many applica-
tions (e.g. [1,3]). Especially, Probabilistic Latent Semantic Indexing (PLSI) [10]
that provides a nice factorizing structure for optimization and statistical in-
terpretation has attracted much research effort in the past decade. PLSI was
originally used for topic modeling and later also found a good application in
clustering (e.g. [5]).
Despite its success in many tasks, PLSI is restricted to multinomial data -
originally, word counts in documents. That is, it assumes that data is generated
from a multinomial distribution. Besides the nonnegative integer limitation, the
multinomial assumption may not hold for other data with different types of
noise.
In this paper we generalize PLSI with a more flexible formulation based on
nonnegative low-rank approximation. Maximizing the PLSI likelihood is equiv-
alent to minimizing the Kullback-Leibler (KL) divergence between the input
matrix and its approximation. KL-divergence was recently generalized to a fam-
ily called t-divergence for measuring the approximation error. The t-divergence

This work is supported by the Academy of Finland in the project Finnish Center of
Excellence in Computational Inference Research (COIN).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 411–418, 2012.

© Springer-Verlag Berlin Heidelberg 2012

is more flexible than KL-divergence in the sense that more types of noise model,
e.g. data with a heavy-tailed distribution, can be accommodated [8]. Here we
integrate t-divergence in our new PLSI formulation and name the generalized
method t-PLSI.
As the algorithmic contribution, we propose a Majorization-Minimization
algorithm to solve the t-PLSI optimization problem. The t-divergence is con-
structed through the Fenchel dual of the log-partition function of t-exponential
family distributions [8]. The resulting convexity facilitates developing convenient
multiplicative variational algorithms for t-PLSI.
We apply t-PLSI to pairwise clustering analysis. Fourteen real-world datasets
are selected for comparing PLSI and the generalized method. Experimental re-
sults show that PLSI based on KL-divergence (t = 1) is not always the best. For
many selected datasets, t-PLSI can achieve better purities with other t values.
The rest of the paper is organized as follows. Section 2 reviews the PLSI model
and gives its formulation for the pairwise clustering framework. In Section 3,
we present the generalized PLSI based on t-divergence, including its learning
objective and optimization algorithm. Section 4 gives the experimental settings
and results. Conclusions and some future work are given in Section 5.

2 Probabilistic Latent Semantic Indexing (PLSI)

PLSI [10] was originally developed for text document analysis. Let C be an m×n
word-document matrix, where Cij is the number of times the ith word appears in
the jth document. PLSI assumes that the data is generated from a multinomial
distribution and maximizes the likelihood

$$ \prod_{i=1}^{m}\prod_{j=1}^{n} P(w=i,\, d=j)^{C_{ij}}. \qquad(1) $$

The multinomial parameters P(w = i, d = j) are factorized by

$$ P(w=i,\, d=j) = \sum_{k=1}^{r} P(w=i\,|\,z=k)\, P(d=j\,|\,z=k)\, P(z=k), \qquad(2) $$

with conditional independence between the word variable w and the document
variable d given the latent class variable z. In the following, we rewrite the
probabilities in matrix notation for convenience: $\hat{X} = WSH^T$, where
$\hat{X}_{ij} = P(w=i,\, d=j)$, $W_{ik} = P(w=i\,|\,z=k)$, $H_{jk} = P(d=j\,|\,z=k)$,
and S is a diagonal matrix with $S_{kk} = P(z=k)$.

Given X as the normalized version of C, i.e. $X_{ij} = C_{ij}/\sum_{ab} C_{ab}$,
maximizing the likelihood in Eq. (1) is equivalent to minimizing the
Kullback-Leibler divergence

$$ D_{KL}(X\|\hat{X}) = \sum_{ij} X_{ij}\log\frac{X_{ij}}{(WSH^T)_{ij}}, \qquad(3) $$

subject to $W \ge 0$, $H \ge 0$, $\sum_{i=1}^{m} W_{ik} = 1$,
$\sum_{k=1}^{r} S_{kk} = 1$, $\sum_{j=1}^{n} H_{jk} = 1$ (for details, see [5]).
In this paper we focus on the symmetric form of PLSI for pairwise clustering,
where X is taken as the affinity matrix of a weighted undirected graph that
represents pairwise similarities between data samples. No distinction between
“words” and “documents” is made like in the original PLSI, but data can be of
any type. In this case, H = W T .
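As a concrete illustration of Eqs. (1)-(3), the following sketch fits X ≈ W S H^T with EM-style multiplicative updates under the simplex constraints above. The function and variable names are ours, not from the paper, and this is a minimal illustration rather than the authors' implementation.

```python
import numpy as np

def kl_divergence(X, X_hat, eps=1e-12):
    """Kullback-Leibler divergence of Eq. (3) between normalized matrices."""
    return float(np.sum(X * np.log((X + eps) / (X_hat + eps))))

def plsi_factorization(X, r, n_iter=100, seed=0, eps=1e-12):
    """EM-style multiplicative updates for X ≈ W S H^T with simplex constraints."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r)); W /= W.sum(axis=0)   # W_ik ~ P(w=i|z=k)
    H = rng.random((n, r)); H /= H.sum(axis=0)   # H_jk ~ P(d=j|z=k)
    s = np.full(r, 1.0 / r)                      # s_k  ~ P(z=k)
    for _ in range(n_iter):
        R = X / ((W * s) @ H.T + eps)            # ratio X / X_hat
        # M-step aggregates of the E-step responsibilities (all from old values)
        W_new = W * s * (R @ H)
        H_new = H * s * (R.T @ W)
        s_new = s * np.einsum('ik,ij,jk->k', W, R, H)
        W = W_new / (W_new.sum(axis=0) + eps)
        H = H_new / (H_new.sum(axis=0) + eps)
        s = s_new / (s_new.sum() + eps)
    return W, s, H
```

Each EM sweep is known not to increase the KL objective, so the divergence after many iterations is no larger than after one.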

3 PLSI with t-Divergence

Data in real-world applications is not necessarily multinomially distributed. Different
types of data may contain different types of noise (e.g. [9]). We therefore
argue that, for certain data, the approximation error in $X \approx \hat{X}$ should be
measured by more suitable divergences than the Kullback-Leibler.
To capture different statistical properties of data, e.g. for heavy-tailed distri-
butions, the exponential family can be generalized to the t-exponential family
by replacing the exponential function with the t-exponential function (see e.g.
[7]). Correspondingly, there are several ways to generalize the KL-divergence for
approximate inference. Here we adopt the t-divergence proposed by Ding et al.
[8] because it can maintain the Fenchel duality in t-exponential family.
The t-divergence for two densities p and p̃ is defined as

$$ D_t(p\|\tilde{p}) = \int \big( q(x)\log_t p(x) - q(x)\log_t \tilde{p}(x) \big)\, dx, \qquad(4) $$

where $q(x) = p(x)^t \big/ \int p(x)^t\, dx$ is a normalization term called the escort distribution of
p(x), and $\log_t$ is the inverse of the t-exponential function (see e.g. [7]):

$$ \log_t(x) = \frac{x^{1-t} - 1}{1-t} \qquad(5) $$

for $t \in (0,2)\setminus\{1\}$. When t = 0, $\log_t(x) = x - 1$. When t = 2, $\log_t(x) = 1 - 1/x$. When
t → 1, $\log_t(x) = \log(x)$ and the t-divergence (4) reduces to the KL-divergence. It
is worth noticing that $\log_t$ decays towards 0 more slowly than the usual log
function for 1 < t < 2, which leads to the heavy-tailed nature of the t-exponential
family and is desired for robustness.
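The definitions above translate directly into code. The sketch below computes the deformed logarithm of Eq. (5) and a discrete t-divergence with escort weights $q_{ij} \propto X_{ij}^t$ (the matrix form used later in the paper); function names are ours.

```python
import numpy as np

def log_t(x, t):
    """Deformed logarithm of Eq. (5); log_t -> log as t -> 1."""
    if abs(t - 1.0) < 1e-12:
        return np.log(x)
    return (x**(1.0 - t) - 1.0) / (1.0 - t)

def t_divergence(X, X_hat, t):
    """Discrete t-divergence with escort weights q_ij = X_ij^t / sum_ab X_ab^t."""
    q = X**t / np.sum(X**t)
    return float(np.sum(q * (log_t(X, t) - log_t(X_hat, t))))
```

At t = 1 and with X normalized to sum to one, the escort weights reduce to X itself and the expression recovers the ordinary KL-divergence.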

3.1 Learning Objective

We now generalize the PLSI model by using t-divergence. The discrete version of
the t-divergence between a normalized nonnegative matrix X and its approximation
$\hat{X}$ is given by

$$ D_t(X\|\hat{X}) = \sum_{i,j}\Big( q_{ij}\log_t X_{ij} - q_{ij}\log_t \hat{X}_{ij}\Big), \qquad(6) $$

where $q_{ij} = X_{ij}^t \big/ \sum_{ab} X_{ab}^t$. Recall that we use $\hat{X} = WSW^T$ for symmetric inputs. The
resulting t-PLSI optimization problem for pairwise clustering is

$$ \underset{W,S}{\text{minimize}}\quad D_t(X\|\hat{X}) = \frac{1 - \sum_{ij} X_{ij}^{t}\,\hat{X}_{ij}^{1-t}}{(1-t)\sum_{ab} X_{ab}^{t}} \qquad(7) $$

$$ \text{subject to}\quad \sum_i W_{ik} = 1 \text{ for all } k, \quad\text{and}\quad \sum_k S_{kk} = 1. \qquad(8) $$

3.2 Optimization

The t-divergence in Eq. (7) is convex in $\hat{X}_{ij}$ for t ∈ [0, 2]. We can therefore ap-
ply Jensen's inequality to develop a Majorization-Minimization algorithm which
iterates the following update rules:

$$ W_{ik} \propto \begin{cases} W_{ik}\Big\{\big[A \oslash (WSW^T)^{.t}\big]\,W\Big\}_{ik}^{\frac{1}{2t-1}}, & 0<t<1,\\[4pt] W_{ik}\Big\{\big[A \oslash (WSW^T)^{.t}\big]\,W\Big\}_{ik}, & 1<t<2, \end{cases} \qquad(9) $$

$$ S_{kk} \propto S_{kk}\Big\{W^T\big[A \oslash (WSW^T)^{.t}\big]\,W\Big\}_{kk}^{\frac{1}{t}}, \qquad(10) $$

where $A_{ij} = X_{ij}^t \big/ \sum_{ab} X_{ab}^t$, $(\cdot)^{.t}$ denotes the elementwise t-th power, and $\oslash$ denotes the element-wise division between two
matrices of equal size.
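The multiplicative updates of Eqs. (9)-(10) can be sketched as follows. This is our own reading of the rules (per-column constants are absorbed by the normalization steps), with our own helper names; treat it as a sketch rather than the authors' code.

```python
import numpy as np

def t_plsi(X, r, t, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative MM updates of Eqs. (9)-(10) for symmetric t-PLSI, X ≈ W S W^T."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = X**t / np.sum(X**t)                    # escort-normalized input
    W = rng.random((n, r)); W /= W.sum(axis=0)
    s = np.full(r, 1.0 / r)
    expo = 1.0 / (2.0 * t - 1.0) if 0 < t < 1 else 1.0
    for _ in range(n_iter):
        R = A / ((W * s) @ W.T + eps)**t       # elementwise A ⊘ X_hat^t
        W = W * (R @ W)**expo                  # Eq. (9); column constants absorbed
        W /= W.sum(axis=0)
        R = A / ((W * s) @ W.T + eps)**t
        s = s * np.einsum('ik,ij,jk->k', W, R, W)**(1.0 / t)   # Eq. (10)
        s /= s.sum()
    return W, s

def t_plsi_objective(X, W, s, t):
    """Eq. (7): the t-divergence between X and W S W^T (t != 1)."""
    X_hat = (W * s) @ W.T
    return float((1.0 - np.sum(X**t * X_hat**(1.0 - t)))
                 / ((1.0 - t) * np.sum(X**t)))
```

For 1 < t < 2 the W-update is a plain multiplicative rule; for 0 < t < 1 it carries the extra exponent 1/(2t−1) from the majorizer.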

Theorem 1. The t-PLSI objective in Eq. (7) monotonically decreases by using
the multiplicative update rules (9) and (10).

The proof is given in the appendix. It is a basic fact that a lower-bounded
monotonically decreasing sequence is convergent. Because the t-divergence is lower-
bounded by zero, the monotonicity shown in the above theorem guarantees that
the objective function in Eq. (7) converges to a local minimum.

4 Experiments
We have compared the clustering performances on a number of undirected graphs
between the original PLSI and the proposed PLSI based on t-divergence. The
graphs have ground-truth classes. The evaluation criterion that we adopt is the
clustering purity $= \frac{1}{n}\sum_{k=1}^{r}\max_{1\le l\le q} n_k^l$, where $n_k^l$ is the number of vertices in
partition k that belong to ground-truth class l. A larger purity in general
corresponds to a better clustering result.
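The purity criterion is straightforward to compute; a minimal sketch (our own function name) is:

```python
import numpy as np

def purity(pred, truth):
    """Clustering purity: (1/n) * sum over clusters k of max_l |cluster k ∩ class l|."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    total = 0
    for k in np.unique(pred):
        members = truth[pred == k]          # ground-truth labels inside cluster k
        total += np.bincount(members).max() # size of the dominant class
    return total / len(pred)
```

A perfect clustering yields purity 1; assigning everything to one cluster yields the frequency of the largest class.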
We have used fourteen datasets to evaluate the two compared methods. These
datasets can be retrieved from the Pajek database
(http://vlado.fmf.uni-lj.si/pub/networks/data/), Newman's collection
(http://www-personal.umich.edu/~mejn/netdata/), or the UCI

Table 1. Clustering performances measured by purity for the two methods


(a) t = 0.2 in the case 0 < t < 1
Dataset type # instance # classes PLSI t-PLSI
strike graph 24 3 0.96 1.00
sawmill graph 36 3 0.50 0.53
cities graph 46 4 0.57 0.67
dolphins graph 62 2 0.98 0.98
adjnoun graph 112 2 0.52 0.54
parkinsons multivariate 195 2 0.75 0.75
balance multivariate 625 3 0.68 0.63

(b) t = 1.5 in the case 1 < t < 2


Dataset type # instance # classes PLSI t-PLSI
highschool graph 63 6 0.89 0.87
scotland graph 108 8 0.39 0.42
football graph 115 12 0.93 0.93
journals graph 124 14 0.44 0.49
glass multivariate 214 6 0.88 0.89
ecoli multivariate 336 8 0.80 0.80
vowel multivariate 990 11 0.34 0.35

machine learning repository. Nine datasets are sparse graphs and the remaining
five are multivariate data. We preprocessed the latter type into sparse similarity
matrices by symmetrizing their K-Nearest-Neighbor graphs (K = 15).
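The K-NN preprocessing step can be sketched as follows. The paper does not state which symmetrization variant it uses; the version below keeps an edge when either endpoint lists the other among its K nearest neighbors (a common "OR" symmetrization), which is an assumption on our part, as is the binary weighting.

```python
import numpy as np

def knn_affinity(D, k=15):
    """Symmetrized K-NN graph affinity from a pairwise distance matrix D."""
    n = D.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]   # k nearest, excluding self
        A[i, neighbors] = 1.0
    return np.maximum(A, A.T)   # OR-symmetrization of the directed k-NN graph
```

The result is a sparse, symmetric, nonnegative matrix suitable as the input X of the pairwise clustering methods.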
We set the number of clusters as the true number of classes for the two
clustering algorithms on all datasets. The parameters W and S were initialized
by following the Fresh Start procedure proposed by Ding et al. [6]. Table 1 shows
the clustering performances of the compared algorithms.
From the results in Table 1, we can see that the original PLSI based on KL-
divergence, i.e. t = 1, is not always the best. For certain datasets, the clustering
purity can be improved by using t-divergence with other t values. For example
when t = 0.2 in the case 0 < t < 1 (Table 1a), the t-PLSI achieves a perfect
clustering result on “strike” dataset with purity 1. As for the “cities” dataset,
the clustering purity has been improved by 10% over that of the original PLSI.
Similar analysis can be made for the case 1 < t < 2 (Table 1b). Compared with
KL-divergence, t-divergence has an extra degree of flexibility given by the free
parameter t. Especially, as t → 2, the tail of the logt function becomes heavier
than that of the usual log function, which makes the new method more robust
against noise.

5 Conclusions
We have studied the generalization of Probabilistic Latent Semantic Indexing
with t-divergence family. The generalized PLSI was formulated as a nonnega-
tive low-rank approximation problem. The formulation is more flexible and can
accommodate more types of noise in data. We have proposed a Majorization-
Minimization algorithm for optimizing the constrained objective. Empirical com-
parison shows that clustering performance in purity can be improved by using
the generalized method with suitable t-divergences other than KL-divergence.
The proposed generalization is not restricted to PLSI. The t-divergence could
be used in other nonnegative approximation problems with, for example, other
matrix factorization/decomposition or other constraints, where the flexibility
might also help to improve the performance.
An interesting question raised with the t-exponential family is whether we can
find its conjugate prior. The multinomial distribution that underlies PLSI has
Dirichlet as its conjugate prior. This conjugacy mainly accounts for the success
of generative topic modeling in recent years. Inference based on t-exponential
family might also benefit from its conjugate prior, if one exists. That is, if we could
find a conjugate prior for t-PLSI, we may also apply the nonparametric Bayesian
treatment similar to Latent Dirichlet Allocation and thus avoid overfitting.
The t-divergence is related to two other divergence families, α-divergence and
Rényi divergence (see e.g. [3,4,13]). One of the major differences is normalization
on the input: α-divergence involves no normalization and could be problematic
when combined with prior information; Rényi divergence normalizes the input
before the power operation while t-divergence employs t-power before normal-
ization. The latter has the ability to smooth nonzero entries (for 0 < t < 1)
or exclude outliers (for 1 < t < 2). More thorough comparison among these
divergence families should be carried out in the future.
Another important and still open problem is how to select among various
t-divergences. This strongly depends on the nature of the data to be analyzed.
Usually a larger t leads to more exclusive approximation while a smaller t to more
inclusive approximation. Automatic selection among parameterized divergence
family generally requires extra information or criterion, for example, ground
truth data [2] or cross-validations with a fixed reference parameter [12].

Appendix: Proof of Theorem 1


The development follows the Majorization-Minimization steps proposed by Yang
and Oja [11,13]. In the derivation, we use W and S for the current estimates
and $\tilde{W}$ and $\tilde{S}$ for the variables.

Proof. Introducing Lagrangian multipliers $\{\lambda_k\}_{k=1}^{r}$,

$$ \tilde{J}(\tilde{W}, S) \equiv J(\tilde{W}, S) + \sum_k \lambda_k\Big(\sum_i \tilde{W}_{ik} - 1\Big). \qquad(11) $$

(Majorization)

$$ \tilde{J}(\tilde{W}, S) \le \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \varphi_{ijk}\big(\tilde{W}_{ik} S_{kk} \tilde{W}_{jk}\big)^{1-t} + \sum_k \lambda_k\Big(\sum_i \tilde{W}_{ik} - 1\Big), $$

where $A_{ij} = X_{ij}^t \big/ \sum_{ab} X_{ab}^t$ and $\varphi_{ijk} = \big(W_{ik} S_{kk} W_{jk}\big)^t \big/ \big(\sum_l W_{il} S_{ll} W_{jl}\big)^t$. Denote the r.h.s. by $G_0(\tilde{W}, W)$.

Case 1: when 0 < t < 1, we have $G_0(\tilde{W}, W) \le G_1(\tilde{W}, W)$, where

$$ G_1(\tilde{W}, W) = \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \varphi_{ijk}\, S_{kk}^{1-t} W_{jk}^{1-t}\, \frac{\tilde{W}_{ik}^{2(1-t)}}{W_{ik}^{1-t}} + \sum_k \lambda_k\Big(\sum_i \tilde{W}_{ik} - 1\Big). $$

Case 2: when 1 < t < 2, we have $G_0(\tilde{W}, W) \le G_2(\tilde{W}, W)$, where

$$ G_2(\tilde{W}, W) = \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \varphi_{ijk}\, S_{kk}^{1-t} W_{ik}^{1-t} W_{jk}^{1-t} \left(1 + \log\frac{\tilde{W}_{ik}^{1-t}\,\tilde{W}_{jk}^{1-t}}{W_{ik}^{1-t}\, W_{jk}^{1-t}}\right) + \sum_k \lambda_k\Big(\sum_i \tilde{W}_{ik} - 1\Big). $$

(Minimization)
For Case 1: when 0 < t < 1,

$$ \frac{\partial G_1}{\partial \tilde{W}_{pq}} = -2\sum_j A_{pj}\varphi_{pjq}\, S_{qq}^{1-t} W_{jq}^{1-t}\, \frac{\tilde{W}_{pq}^{2(1-t)-1}}{W_{pq}^{1-t}} + \lambda_q = 0. $$

This gives $\tilde{W}_{pq} = W_{pq}^{\frac{t-1}{2t-1}}\, \lambda_q^{-\frac{1}{2t-1}} \big(2 S_{qq}^{1-t}\big)^{\frac{1}{2t-1}} \Big(\sum_j A_{pj}\varphi_{pjq} W_{jq}^{1-t}\Big)^{\frac{1}{2t-1}}$. By using
$\sum_p \tilde{W}_{pq} = 1$, we obtain

$$ \lambda_q^{\frac{1}{2t-1}} = \big(2 S_{qq}^{1-t}\big)^{\frac{1}{2t-1}} \sum_a W_{aq}^{\frac{t-1}{2t-1}} \Big(\sum_j A_{aj}\varphi_{ajq} W_{jq}^{1-t}\Big)^{\frac{1}{2t-1}}. \qquad(12) $$

For Case 2: when 1 < t < 2,

$$ \frac{\partial G_2}{\partial \tilde{W}_{pq}} = -\sum_j A_{pj}\varphi_{pjq}\, S_{qq}^{1-t} W_{pq}^{1-t} W_{jq}^{1-t}\, \frac{1}{\tilde{W}_{pq}} + \lambda_q = 0. $$

This gives $\tilde{W}_{pq} = \lambda_q^{-1} S_{qq}^{1-t} \sum_j A_{pj}\varphi_{pjq} W_{pq}^{1-t} W_{jq}^{1-t}$. By using $\sum_p \tilde{W}_{pq} = 1$, we
obtain

$$ \lambda_q = S_{qq}^{1-t} \sum_a W_{aq}^{1-t} \sum_j A_{aj}\varphi_{ajq} W_{jq}^{1-t}. \qquad(13) $$

Inserting Eqs. (12) and (13) back to $\tilde{W}_{pq}$ gives the update rule (9).

Similarly we can prove the monotonicity over S. Let $\tilde{J}(W, \tilde{S}) \equiv J(W, \tilde{S}) + \lambda\big(\sum_k \tilde{S}_{kk} - 1\big)$, which is tightly upper bounded by

$$ G(\tilde{S}, S) \equiv \frac{1}{t-1}\sum_{ij} A_{ij}\sum_k \varphi_{ijk}\big(W_{ik}\tilde{S}_{kk}W_{jk}\big)^{1-t} + \lambda\Big(\sum_k \tilde{S}_{kk} - 1\Big). $$

Zeroing the derivative of $G(\tilde{S}, S)$ gives $\tilde{S}_{qq} = \lambda^{-\frac{1}{t}}\Big(\sum_{ij} A_{ij}\varphi_{ijq} W_{iq}^{1-t} W_{jq}^{1-t}\Big)^{\frac{1}{t}}$.
Using $\sum_q \tilde{S}_{qq} = 1$, we obtain $\lambda^{\frac{1}{t}} = \sum_a \Big(\sum_{ij} A_{ij}\varphi_{ija} W_{ia}^{1-t} W_{ja}^{1-t}\Big)^{\frac{1}{t}}$. Inserting $\lambda^{\frac{1}{t}}$
back to $\tilde{S}_{qq}$ gives the update rule (10).

References
1. Arora, R., Gupta, M., Kapila, A., Fazel, M.: Clustering by left-stochastic ma-
trix factorization. In: International Conference on Machine Learning (ICML), pp.
761–768 (2011)
2. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-
labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, pp. 14–19 (2010)
3. Cichocki, A., Lee, H., Kim, Y.D., Choi, S.: Non-negative matrix factorization with
α-divergence. Pattern Recognition Letters 29, 1433–1440 (2008)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley
(2009)
5. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factor-
ization and probabilistic latent semantic indexing. Computational Statistics and
Data Analysis 52(8), 3913–3927 (2008)
6. Ding, C., Li, T., Jordan, M.: Convex and semi-nonnegative matrix factorizations.
IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), 45–55
(2010)
7. Ding, N., Vishwanathan, S.: t-logistic regression. In: Advances in Neural Informa-
tion Processing Systems, vol. 23, pp. 514–522 (2010)
8. Ding, N., Vishwanathan, S., Qi, Y.A.: t-divergence based approximate inference.
In: Advances in Neural Information Processing Systems, vol. 24, pp. 1494–1502
(2011)
9. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the
Itakura-Saito divergence: With application to music analysis. Neural Computa-
tion 21(3), 793–830 (2009)
10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd
Annual International Conference on Research and Development in Information
Retrieval (SIGIR), pp. 50–57. ACM (1999)
11. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. The American Statisti-
cian 58(1), 30–37 (2004)
12. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the
minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
13. Yang, Z., Oja, E.: Unified development of multiplicative algorithms for linear and
quadratic nonnegative matrix factorization. IEEE Transactions on Neural Net-
works 22(12), 1878–1891 (2011)
Selecting β-Divergence for Nonnegative Matrix
Factorization by Score Matching

Zhiyun Lu, Zhirong Yang, and Erkki Oja

Department of Information and Computer Science


Aalto University, 00076, Finland
{zhiyun.lu,zhirong.yang,erkki.oja}@aalto.fi

Abstract. Nonnegative Matrix Factorization (NMF) based on the family
of β-divergences has been shown to be advantageous in several signal pro-
cessing and data analysis tasks. However, how to automatically select
the best divergence among the family for given data remains unknown.
Here we propose a new estimation criterion to resolve the problem of
selecting β. Our method inserts the point estimate of factorizing matri-
ces from β-NMF into a Tweedie distribution that underlies β-divergence.
Next, we adopt a recent estimation method called Score Matching for β
selection in order to overcome the difficulty of calculating the normal-
izing constant in Tweedie distribution. Our method is tested on both
synthetic and real-world data. Experimental results indicate that our se-
lection criterion can accurately estimate β compared to ground truth or
established research findings.

Keywords: nonnegative matrix factorization, divergence, score matching, estimation.

1 Introduction
Nonnegative Matrix Factorization (NMF) has been recognized as an important
tool for signal processing and data analysis. Ever since Lee and Seung’s pioneer-
ing work [13,14], many NMF variants have been proposed (e.g. [5,12]), together
with their diverse applications including text, images, music, bioinformatics, etc.
(see e.g. [11,6]). See [4] for a survey.
Divergence or approximation error measure is an important dimension in the
NMF learning objective. Originally the approximation error was measured by
squared Euclidean distance or generalized Kullback-Leibler divergence [14]. Later
these measures were unified and extended to a broader parametric divergence
family called β-divergence [12].
Compared to the vast number of NMF cost functions, our knowledge about
how to select the best one among them for given data is small. It is known that a
large β leads to more robust but less efficient estimation. Some more qualitative

Corresponding author.

This work is supported by the Academy of Finland in the project Finnish Center of
Excellence in Computational Inference Research (COIN).

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 419–426, 2012.

© Springer-Verlag Berlin Heidelberg 2012

discussion on the trade-off can be found in [2,3] and references therein. Currently,
automatic selection among a parametric divergence family generally requires
extra information, for example, ground truth data [1] or cross-validations with a
fixed reference parameter [15], which is infeasible in many applications, especially
for unsupervised learning.
In this paper we propose a new method for the automatic selection of β-
divergence for NMF which requires no ground truth or reference data. Our
method can reuse existing NMF algorithms for estimating the factorizing ma-
trices. We then insert the estimated matrices into a Tweedie distribution which
underlies the β-divergence family. The partition function or normalizing con-
stant in Tweedie distribution is intractable except for three special cases. To
overcome this difficulty, we employ a non-maximum-likelihood estimator called
Score Matching, which avoids calculating the partition function and has shown
to be both consistent and computationally efficient.
Evaluating the estimation performance is not a trivial task. We verify our
method by using both synthetic and real-world data. Experimental results show
that the estimated β conforms well with the underlying distribution that is
used to generate the synthetic data. Empirical study on music signals using our
method is consistent to previous research findings in music.
After this introduction, we recapitulate the essence of NMF based on β-
divergence in Section 2. Our main theoretical contribution, including the prob-
abilistic model, Score Matching background, and the selection criterion, is pre-
sented in Section 3. Experimental settings and results are given in Section 4.
Finally we conclude the paper and discuss some future work in Section 5.

2 NMF Based on β-Divergence

Denote $\mathbb{R}_+ = \{x \mid x \in \mathbb{R},\, x \ge 0\}$. Given $X \in \mathbb{R}_+^{m\times n}$, Nonnegative Matrix Factorization
finds a low-rank approximation $X \approx \hat{X} = WH$ such that $W \in \mathbb{R}_+^{m\times k}$ and $H \in \mathbb{R}_+^{k\times n}$,
with $k < \min(m, n)$.

The approximation error can be measured by various divergences, among
which a parametric family called β-divergence has shown promising results in
signal processing and data analysis. NMF using β-divergence, here called β-
NMF, solves the following optimization problem:

$$ \underset{W\ge 0,\, H\ge 0}{\text{minimize}}\ D_\beta(X\|\hat{X}), \qquad(1) $$

where the β-divergence between two nonnegative matrices is defined by

$$ D_\beta(X\|\hat{X}) = \frac{1}{\beta(\beta-1)}\sum_{ij}\Big( X_{ij}^{\beta} - \beta X_{ij}\hat{X}_{ij}^{\beta-1} + (\beta-1)\hat{X}_{ij}^{\beta}\Big) \qquad(2) $$

for $\beta \in \mathbb{R}\setminus\{0, 1\}$. For example, when β = 2, the β-divergence becomes the squared
Euclidean distance. When β → 1, the β-divergence reduces to the (generalized)
Kullback-Leibler (KL) divergence

$$ D_{KL}(X\|\hat{X}) = \sum_{ij}\left( X_{ij}\ln\frac{X_{ij}}{\hat{X}_{ij}} - X_{ij} + \hat{X}_{ij}\right). \qquad(3) $$

When β → 0, the β-divergence is called the Itakura-Saito (IS) divergence:

$$ D_{IS}(X\|\hat{X}) = \sum_{ij}\left( \frac{X_{ij}}{\hat{X}_{ij}} - \ln\frac{X_{ij}}{\hat{X}_{ij}} - 1\right). \qquad(4) $$

The performance of NMF can be improved by using the most suitable
β-divergence, not restricted to the squared Euclidean distance or Kullback-
Leibler divergence, as in the original NMF (e.g. [6]).
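The general form of Eq. (2) and its KL and IS limits can be computed as follows (a small sketch; the function name is ours):

```python
import numpy as np

def beta_divergence(X, X_hat, beta):
    """β-divergence of Eq. (2), with the KL (β -> 1) and IS (β -> 0) limit cases."""
    X = np.asarray(X, float)
    X_hat = np.asarray(X_hat, float)
    if beta == 0:                                   # Itakura-Saito, Eq. (4)
        R = X / X_hat
        return float(np.sum(R - np.log(R) - 1.0))
    if beta == 1:                                   # generalized KL, Eq. (3)
        return float(np.sum(X * np.log(X / X_hat) - X + X_hat))
    return float(np.sum(X**beta - beta * X * X_hat**(beta - 1)
                        + (beta - 1.0) * X_hat**beta) / (beta * (beta - 1.0)))
```

At β = 2 the expression equals half the squared Euclidean distance, matching the statement above.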
The β-NMF optimization problem is non-convex and usually only local
optimizers are available. A popular algorithm iterates the following multiplica-
tive update rules [12]:

$$ H \leftarrow H \odot \frac{W^T\big[(WH)^{.(\beta-2)} \odot X\big]}{W^T (WH)^{.(\beta-1)}}, \qquad(5) $$

$$ W \leftarrow W \odot \frac{\big[(WH)^{.(\beta-2)} \odot X\big] H^T}{(WH)^{.(\beta-1)} H^T}, \qquad(6) $$

where the division and the powers are taken elementwise, and $\odot$ refers to ele-
mentwise multiplication. The cost function $D_\beta(X\|WH)$ monotonically decreases
under the above updates for β ∈ [1, 2] and thus usually converges to a local min-
imum [12]. Empirical convergence for β outside this range is also observed in
most cases. Moreover, there are several other more comprehensive algorithms
for β-NMF (see e.g. [7]).
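The update rules (5)-(6) can be sketched directly in code (our own naming; a minimal sketch of the standard multiplicative rules, not the authors' implementation):

```python
import numpy as np

def beta_nmf(X, k, beta, n_iter=150, seed=0, eps=1e-9):
    """Multiplicative β-NMF updates of Eqs. (5)-(6)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (WH**(beta - 2) * X)) / (W.T @ WH**(beta - 1) + eps)
        WH = W @ H + eps
        W *= ((WH**(beta - 2) * X) @ H.T) / (WH**(beta - 1) @ H.T + eps)
    return W, H
```

For β ∈ [1, 2] the objective decreases monotonically, so a long run should fit no worse than a short one from the same initialization.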

3 Selecting β-Divergence for NMF

Different divergences usually correspond to different types of noise in data and
there is no universally best divergence. Despite significant progress in optimiza-
tion algorithms and in applications using certain β-divergences, there is little
research on how to choose the best β-divergence for given data. The objective in
Eq. (2) cannot be used for selecting β, as it favors β → ∞ if $X_{ij} > 1$ and $\hat{X}_{ij} > 1$,
or β → −∞ if $0 \le X_{ij} < 1$ and $0 \le \hat{X}_{ij} < 1$. In the following, we propose a new
criterion for determining the value of β.

3.1 Tweedie Distribution

It is known that the β-divergence is associated with the classical Tweedie distri-
butions which are a special case of an exponential dispersion model [16]. Mini-
mizing a β-divergence is usually equivalent to maximizing the likelihood of the

corresponding Tweedie distribution given the β, if the distribution exists. This


inspires us to develop a statistical method to select β.
Consider the Tweedie distribution with unitary dispersion

$$ p(X;\beta) = \frac{1}{Z(\beta)}\exp\{-D_\beta(X\|W^*H^*)\}, \qquad(7) $$

where

$$ (W^*, H^*) = \arg\min_{W\ge 0,\, H\ge 0} D_\beta(X\|WH). \qquad(8) $$

It is important to notice that W* and H* can be seen as functions of X and
β. Therefore, p(X; β) in Eq. (7) is a distribution of X parameterized by β. We
can then use certain estimation methods to select the best β.
Maximum-Likelihood (ML) estimator seems a natural choice. However, we
face the difficulty in calculating the partition function because Z(β) is intractable
except for β = 0, 1, 2. Moreover, no partition function exists for 1 < β < 2. For
β = 1, the partition function is only available for non-negative integers. To
avoid these problems, we turn to a non-maximum-likelihood estimator called
Score Matching (SM) introduced by Hyvärinen [9], which requires only the non-
normalized part of the distributions.

3.2 Score Matching for Nonnegative Data

Consider a non-normalized statistical model of d-dimensional nonnegative vec-
tors ξ: $p(\xi;\theta) = \frac{1}{Z(\theta)}\, q(\xi;\theta)$, where $Z(\theta) = \int q(\xi;\theta)\, d\xi$ is difficult to compute.
In score matching, the score of a density is the gradient of the log-likelihood with
respect to the data variable. For real vectorial data, SM minimizes the expected
squared Euclidean distance between the scores of the true distribution and the model
[9]. For nonnegative vectorial data, SM uses data-scaled scores to avoid the dis-
continuity of data around zero [10]. Under certain mild conditions, the resulting
score matching objective does not rely on the scores of the true distribution, by us-
ing integration by parts. In addition, the expectation can be approximated by
the sample average:

$$ J(\theta) = \frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{d}\left[ 2X_i(t)\,\psi_i + X_i(t)^2\,\partial_i\psi_i + \frac{1}{2}\psi_i^2 X_i(t)^2 \right], \qquad(9) $$

where X(1), ..., X(T) are observations of ξ, and

$$ \psi_i = \left.\frac{\partial \ln q(\xi;\theta)}{\partial \xi_i}\right|_{\xi=X(t)} \quad\text{and}\quad \partial_i\psi_i = \left.\frac{\partial^2 \ln q(\xi;\theta)}{\partial \xi_i^2}\right|_{\xi=X(t)}. \qquad(10) $$

Hyvärinen [9,10] showed that the score matching estimator $\theta^* = \arg\min_\theta J(\theta)$ is
consistent, i.e. θ* approaches the true value when T → ∞.

3.3 Selecting β by Score Matching


We now come back to the Tweedie distribution in Eq. (7). Here we have one
nonnegative observation matrix of size m × n, and β is the parameter to be
estimated. Denote $\hat{X} = W^*H^*$. Inserting

$$ \psi_{ij} = -\frac{X_{ij}^{\beta-1} - \hat{X}_{ij}^{\beta-1}}{\beta-1} \quad\text{and}\quad \partial_{ij}\psi_{ij} = -X_{ij}^{\beta-2}, \qquad(11) $$

we obtain the score matching objective (up to additive constants):

$$ L(\beta) = -\sum_{ij} X_{ij}^{\beta} + \frac{1}{2}\sum_{ij}\left( \frac{X_{ij}\big(X_{ij}^{\beta-1} - \hat{X}_{ij}^{\beta-1}\big)}{\beta-1} - 2 \right)^{2} \qquad(12) $$

for β ≠ 1. When β → 1, $L(1) = -\sum_{ij} X_{ij} + \frac{1}{2}\sum_{ij}\big( X_{ij}\ln\frac{X_{ij}}{\hat{X}_{ij}} - 2 \big)^{2}$. The best β
selected by using score matching is

$$ \beta^{*} = \arg\min_{\beta} L(\beta), \qquad(13) $$

which is easily computed by standard minimization.
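The selection criterion can be sketched in code. The objective below follows our own plug-in of Eq. (11) into Eq. (9), dropping additive constants; the helper names are ours, and the β-NMF fits `X_hats` are assumed to be produced separately (e.g. by the multiplicative updates of Section 2).

```python
import numpy as np

def sm_objective(X, X_hat, beta):
    """Score-matching criterion L(beta) of Eq. (12), up to an additive constant."""
    X = np.asarray(X, float)
    X_hat = np.asarray(X_hat, float)
    if beta == 1:
        a = X * np.log(X / X_hat)                       # beta -> 1 limit
    else:
        a = X * (X**(beta - 1) - X_hat**(beta - 1)) / (beta - 1)
    return float(-np.sum(X**beta) + 0.5 * np.sum((a - 2.0)**2))

def select_beta(X, X_hats, betas):
    """Grid search of Eq. (13): X_hats[i] is the beta-NMF fit obtained with betas[i]."""
    scores = [sm_objective(X, Xh, b) for Xh, b in zip(X_hats, betas)]
    return betas[int(np.argmin(scores))]
```

Note that each candidate β requires its own β-NMF fit before L(β) can be evaluated, exactly as in the paper's grid-search procedure.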

4 Experiments
Next we empirically evaluate the performance of the proposed selection method.
The estimated β using the criterion in Eq. (13) is compared with the ground
truth or established results in a particular domain. Because the β-NMF objective
in Eq. (1) is non-convex, only local optimizers are available. We repeat the
optimization algorithms multiple times from different starting points to get rid
of poor local optima.

4.1 Synthetic Data


There are three special cases of the Tweedie distribution which have closed form.
We have considered three popularly used β values: (1) NMF with squared Eu-
clidean distance (β = 2) corresponding to the approximation of X with additive
Gaussian noise; (2) KL-divergence (β = 1) corresponding to approximation of
X with Poisson noise; and (3) IS-divergence (β = 0) corresponding to approx-
imation of X in multiplicative Gamma noise [6]. We generated synthetic data
matrices X according to the above distributions. In evaluation, the estimated β
should be close to the one used in the ground truth distribution.
In detail, we first randomly generated matrices W 0 and H 0 from the uniform
distribution over (0,20). The input matrix X was then constructed by imposing a
certain type of noise (either additive standard normal, Poisson, or multiplicative
Gamma with unitary scale and shape) on the product W 0 H 0 . Next, we searched
for the β which minimizes L(β) in the interval [−2, 4] with step size 0.5. We
424 Z. Lu, Z. Yang, and E. Oja

Table 1. Estimation results for synthetic data

matrix size    k    Euclidean (β = 2)    KL (β = 1)     IS (β = 0)
30 × 20        5    2.06 ± 0.04          1.08 ± 0.03    −0.06 ± 0.06
50 × 30        8    2.02 ± 0.02          1.07 ± 0.02     0.15 ± 0.00
60 × 25       10    1.99 ± 0.04          1.10 ± 0.00     0.18 ± 0.02
100 × 20       5    1.99 ± 0.06          1.05 ± 0.00     0.12 ± 0.02

repeated the process ten times and recorded the mean and standard deviation
for the selected β’s. We have also tried different low ranks k and input matrices
of different sizes.
The results are shown in Table 1. We can see that our method can estimate
β quite accurately. For various matrix sizes and low ranks, the selected β’s are
very close to the ground truth values with small deviations.
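The synthetic experiment can be sketched end-to-end as follows. This is only an illustrative sketch: `beta_nmf` uses the common heuristic multiplicative updates for β-divergence (cf. [7]) rather than necessarily the authors' exact algorithm, and the sizes, seed, and iteration count are our own choices (Poisson case shown, ground truth β = 1):

```python
import numpy as np

def beta_nmf(X, k, beta, n_iter=200, seed=0):
    """Heuristic multiplicative updates for beta-NMF; returns W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, (X.shape[0], k))
    H = rng.uniform(0.1, 1.0, (k, X.shape[1]))
    eps = 1e-12
    for _ in range(n_iter):
        Xh = W @ H + eps
        H *= (W.T @ (X * Xh ** (beta - 2))) / (W.T @ Xh ** (beta - 1) + eps)
        Xh = W @ H + eps
        W *= ((X * Xh ** (beta - 2)) @ H.T) / (Xh ** (beta - 1) @ H.T + eps)
    return W @ H

def L(X, Xh, beta):
    """Score matching objective of Eq. (12), additive constants dropped."""
    if np.isclose(beta, 1.0):
        return np.sum(-1.0 / X + 0.5 * np.log(X / Xh) ** 2)
    psi = -(X ** (beta - 1) - Xh ** (beta - 1)) / (beta - 1)
    return np.sum(-X ** (beta - 2) + 0.5 * psi ** 2)

# Poisson case: ground truth beta = 1
rng = np.random.default_rng(1)
W0 = rng.uniform(0, 20, (30, 5))
H0 = rng.uniform(0, 20, (5, 20))
X = rng.poisson(W0 @ H0).astype(float) + 1.0  # keep entries strictly positive
grid = np.arange(-2.0, 4.5, 0.5)              # search interval [-2, 4], step 0.5
best_beta = min(grid, key=lambda b: L(X, beta_nmf(X, 5, b), b))
```

In a faithful replication one would, as in the paper, rerun each factorization from several starting points and average the selected β over repeated trials.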

4.2 Real-World Data: Music Signal

It is more difficult to evaluate the β selection method for real-world datasets


where there is no ground truth. Here we perform the evaluation by compar-
ing the estimated results to an established research finding for music signals.
Many researchers have found that a particular case of β-divergence, the Itakura-
Saito divergence (β → 0), is especially suitable for music signals (see e.g. [6] and
references therein). It is interesting to see whether our estimation method can
reveal the same finding for music data.
We have tested our method on piano and jazz signals. We extracted the power
spectrogram matrix from the music piece by the same procedure as Févotte et
al. [6]. In this way the matrices used as input in β-NMF are of sizes 513 × 676 for
piano and 129 × 9312 for jazz. We also followed their choice of the compression
rank: k = 5, 6 for piano and k = 10 for jazz. The search interval [−2, 4] was the
same as in the synthetic case. We first performed a coarse search using step size 0.5
and then a fine search in [−0.1, 0.15] with smaller steps. For each β, we repeated
the NMF algorithms ten times with different starting points and recorded the
mean and standard deviation of the score matching objectives.

Fig. 1. Score matching objectives for β-NMF on the piano signal: (top) coarse search,
where objectives are shown in log-scale; (bottom) fine search

Fig. 2. Score matching objectives for β-NMF on the jazz signal: (left) coarse search,
where objectives are shown in log-scale; (right) fine search
The results for piano and jazz are shown in Figures 1 and 2, respectively.
In coarse search, β = 0 is the only value that causes negative score matching
objective, which is also the minimum. The fine search results confirm that the
minimum or the elbow point is very close to zero for both piano and jazz. These
results are consistent with the previous finding that the Itakura-Saito divergence
is well suited to music signals (e.g. [6]).

5 Conclusions and Future Work


We have proposed an efficiently computable criterion for selecting the β-divergence
for nonnegative matrix factorization. Our method starts from the classical
Tweedie distribution. We have adopted score matching, a non-maximum-
likelihood estimation method, to avoid the theoretical and computational diffi-
culties caused by the partition function of the Tweedie distribution. Experimental
results on both synthetic data and music signals showed that our method can
estimate the β value accurately, as compared to ground truth or known research
findings.
The behavior of the selection method needs more thorough investigation. Al-
though our method builds on the Tweedie distribution, the proposed objective is
well defined for all β's, even for those where the Tweedie distribution does not exist
for nonnegative real matrices. It is also surprising that score matching works
in the selection with only one observation. Here the sample averaging required
by score matching might be done in an implicit or alternative way.
Our method could be generalized to other divergence families, such as α-
divergence. We could also develop a more efficient parameter selection algorithm
by alternately minimizing over β and the factorizing matrices. The connec-
tion to other automatic model selection methods such as Minimum Description
Length [8] and Bayesian Ying-Yang harmony learning [17] needs further study.

Several issues still need to be resolved in the future. The rank k was specified
in advance in our method, and its choice affects the result of the β-selection.
It would be interesting to see whether score matching also works for
the low-rank selection. In addition, β-divergence is a scale-dependent measure,
so scale tuning of the data matrix might be required.

References
1. Choi, H., Choi, S., Katake, A., Choe, Y.: Learning alpha-integration with partially-
labeled data. In: Proc. of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, pp. 14–19 (2010)
2. Cichocki, A., Amari, S.I.: Families of alpha- beta- and gamma- divergences: Flexible
and robust measures of similarities. Entropy 12, 1532–1568 (2010)
3. Cichocki, A., Cruces, S., Amari, S.I.: Generalized alpha-beta divergences and their
application to robust nonnegative matrix factorization. Entropy 13, 134–170 (2011)
4. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative Matrix and Tensor
Factorizations: Applications to Exploratory Multi-way Data Analysis. John Wiley
(2009)
5. Dhillon, I.S., Sra, S.: Generalized nonnegative matrix approximations with breg-
man divergences. In: Advances in Neural Information Processing Systems, vol. 18,
pp. 283–290 (2006)
6. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the
Itakura-Saito divergence: With application to music analysis. Neural Computa-
tion 21(3), 793–830 (2009)
7. Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the
β-divergence. Neural Computation 23(9), 2421–2456 (2011)
8. Grunwald, P.D., Myung, I.J., Pitt, M.A.: Advances in Minimum Description
Length: Theory and Applications. MIT Press (2005)
9. Hyvärinen, A.: Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research 6, 695–709 (2005)
10. Hyvärinen, A.: Some extensions of score matching. Computational Statistics &
Data Analysis 51(5), 2499–2512 (2007)
11. Kim, H., Park, H.: Sparse non-negative matrix factorizations via alternating non-
negativity-constrained least squares for microarray data analysis. Bioinformat-
ics 23(12), 1495 (2007)
12. Kompass, R.: A generalized divergence measure for nonnegative matrix factoriza-
tion. Neural Computation 19(3), 780–791 (2007)
13. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix fac-
torization. Nature 401, 788–791 (1999)
14. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Ad-
vances in Neural Information Processing Systems, vol. 13, pp. 556–562 (2001)
15. Mollah, M., Sultana, N., Minami, M.: Robust extraction of local structures by the
minimum of beta-divergence method. Neural Networks 23, 226–238 (2010)
16. Tweedie, M.: An index which distinguishes between some important exponential
families. In: Statistics: Applications and New Directions: Proc. Indian Statistical
Institute Golden Jubilee International Conference, pp. 579–604 (1984)
17. Xu, L.: Bayesian ying yang harmony learning. In: Arbib, M.A. (ed.) The Handbook
of Brain Theory and Neural Networks, 2nd edn., pp. 1231–1237. MIT Press (2002)
Neural Networks for Proof-Pattern Recognition

Ekaterina Komendantskaya and Kacper Lichota

School of Computing, University of Dundee, UK

Abstract. We propose a new method of feature extraction that makes it
possible to apply the pattern-recognition abilities of neural networks to
data mining of automated proofs. We propose a new algorithm to represent
proofs for first-order logic programs as feature vectors, and we present its
implementation. We test the method on a number of problems and implementation
scenarios, using three-layer neural nets with backpropagation learning.

Keywords: Machine learning, pattern-recognition, data mining, neural


networks, first-order logic programs, automated proofs.

1 Introduction
Automated theorem proving has been applied to solve sophisticated mathe-
matical problems (e.g., verification of the Four-Colour Theorem in Coq), and
for industrial-scale software and hardware verification (verification of micro-
processors in ACL2). However, such "computer-generated" proofs require
considerable programming skills and, overall, are time-consuming and hence expensive.
Programs in automated provers may contain thousands of theorems of variable
sizes and complexities. Some proofs will require the programmer's intervention. In
this case, a manually found proof for one problematic lemma may serve as a
template for several other lemmas needing a manual proof. Automated discovery
of common proof-patterns using tools of statistical machine learning, such as
neural networks, could potentially provide the much-sought automation for
statistically similar proof-steps, as was argued e.g. in [1,2,3,4,10,11,12].
As classified in [1], applications of machine-learning assistants to mech-
anised proofs can be divided into symbolic (akin to, e.g., inductive logic program-
ming), numeric (akin to neural networks or kernels), and hybrid. In this paper,
we focus on neural networks. The advantage of numeric methods over sym-
bolic ones is their tolerance to noise and uncertainty, as well as the availability
of powerful learning functions. For example, standard multi-layer perceptrons with
the error back-propagation algorithm are capable of approximating any function on a
finite-dimensional vector space with arbitrary precision. In this case, it is not
the power of the learning paradigm, but the feature selection and representation
method that sets the limits. Consider the following example.
Example 1. Let ListNat be a logic program defining lists of natural numbers:
1. nat(0) ← ; 2. nat(s(x)) ← nat(x);

The work was supported by EPSRC grants EP/F044046/2 and EP/J014222/1.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 427–434, 2012.

c Springer-Verlag Berlin Heidelberg 2012

3. list(nil) ← ; 4. list(cons(x, y)) ← nat(x), list(y)

For ListNat and a goal G0 = list(cons(x, cons(y, z))), SLD-resolution


produces a sequence of subgoals: G1 = nat(x),list(cons(y, z)), G2 =
list(cons(y,z)), G3 = nat(y),list(z), G4 = list(z), G5 = 2. If we con-
sider applications of each of the clauses 1-4 as proof tactics, then the proof steps
above could be presented as a training example (feature vector) 4,1,4,1 to a neu-
ral network. However, we cannot statistically generalise this to future examples,
since, with some frequency, the same sequence of "tactics" will fail; e.g., take the
goal G0 = list(cons(x,cons(y,x))). This happens because the given features do
not capture the essential proof context, given by the term structure, the unification
procedure, and other parameters.

Recursive logic programs, such as the program above, are traditionally problem-
atic for neural network representation, as they cannot be soundly proposition-
alised and represented by the vectors of truth values, but see e.g. [5,7].
The method we present here is designed to steer away from these problems.
It covers (co-)recursive first-order logic programs. To manage (co-)recursion ef-
ficiently, we use the formalism of coinductive proof trees for logic programs, see
[8]. Coinductive trees possess a more regular structure than e.g. SLD-trees.
In Section 2, we propose an original feature extraction algorithm for arbitrary
proof-trees. It captures intricate structural features of the proof-trees,
such as branching and dependencies between terms and predicates, as well as
internal dependencies between the structures of terms appearing at different stages
of the proof. We implement the feature extraction algorithm: see [6].
The main advantages of the feature selection method we propose are its accu-
racy, generality and robustness to changes in classification tasks. In Section 3, we
test this method on a range of classification tasks and possible implementation
scenarios, with very high rates of success. All our experiments involving neural
networks were made in the MATLAB Neural Network Toolbox (pattern-recognition
package), with a standard three-layer feed-forward network with sigmoid hid-
den and output neurons. The network was trained with scaled conjugate gradi-
ent back-propagation. Such networks can classify vectors arbitrarily well, given
enough neurons in the hidden layer; we tested their performance with 40, 50, 60,
70 and 90 hidden neurons in all experiments. All the software, datasets, and a detailed
reference manual are available in [6].
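For readers without MATLAB, an analogous setup can be sketched with scikit-learn. Note this is only an approximation: `MLPClassifier` offers the 'lbfgs'/'sgd'/'adam' solvers rather than scaled conjugate gradient, and the feature matrix X and labels y below are random placeholders standing in for the proof-tree feature vectors of Section 2:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 35))     # placeholder proof-tree feature vectors
y = rng.integers(0, 2, size=200)   # placeholder binary class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scores = {}
for hidden in (40, 50, 60, 70, 90):   # hidden-layer sizes tried in the paper
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                        solver="lbfgs", max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[hidden] = clf.score(X_te, y_te)
```

On random labels the accuracies will hover near chance; with the real feature vectors of [6] they correspond to the figures reported in Section 3.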

2 Feature Extraction Method and Algorithm

In this section, we assume familiarity with first-order logic programming [9].
For our experiments, we chose coinductive proof trees [8]: they resemble the
and-or trees used in concurrent logic programming, but take a more structured
approach to unification, restricting it to term-matching. Moreover, they allow
(co-)recursive derivations to be treated in lazy (finite) steps. Note, however, that
the feature extraction we present applies to many other kinds of proof trees.

[Figure: three coinductive trees with roots li(c(x, c(y, z))), li(c(s(w), c(s(w), nil)))
and li(c(s(0), c(s(0), nil))), each branching into nat(...) and li(...) subgoals; the
last tree ends in success leaves 2.]

Fig. 1. Coinductive trees for ListNat. We abbreviate cons by c and list by li. The
last tree is a success tree, which implies that the sequence of derivation steps succeeded.

[Figure: a coinductive derivation in three steps θ1, θ2, θ3, with successive roots
stream(x) → stream(scons(z, y)) → ... → stream(scons(0, scons(y1, z1))), each
tree branching into bit(...) and stream(...) subgoals.]

Fig. 2. Coinductive derivation for the goal G = stream(x) and the program Stream

Definition 1. Let P be a logic program and G =← A be an atomic goal. The


coinductive tree for A is a tree T satisfying the following properties.

– A is the root of T .
– Each node in T is either an and-node or an or-node: Each or-node is given
by •. Each and-node is an atomic formula.
– For every and-node A′ occurring in T , if there exist exactly m > 0 dis-
tinct clauses C1, . . . , Cm in P (a clause Ci has the form Bi ← B^i_1, . . . , B^i_{ni},
for some ni), such that A′ = B1 θ1 = . . . = Bm θm, for some substitu-
tions θ1, . . . , θm, then A′ has exactly m children given by or-nodes, such
that, for every i ≤ m, the i-th or-node has ni children given by and-nodes
B^i_1 θi, . . . , B^i_{ni} θi.

Example 2. The following corecursive program Stream defines infinite streams


of binary bits. The program would induce infinite derivations if the SLD-resolution
algorithm were applied, but it induces only finite coinductive proof trees; see Figure 2.
1. bit(0) ← ; 2. bit(1) ←; 3. stream(scons (x,y)) ← bit(x),stream(y)

We represent coinductive trees as feature vectors. Several ways of representing


graphs as matrices are known in the literature: e.g., incidence matrix and adja-
cency matrix. However, these traditional methods obscure some patterns found
in coinductive trees; but see [11] for adjacency matrix encoding of logic formulae
represented as trees. We propose a new method as follows.

Matrix M1                   list    nat   •   2
cons(x, cons(y, z))       -21211      0   2   0
cons(y, z)                  -211      0   2   0
x                              0     -1   0   0
y                              0     -1   0   0
z                             -1      0   0   0

Matrix M2                   list    nat   •   2
cons(s0, cons(s0, nil))  2562563      0   2   0
cons(s0, nil)               2563      0   2   0
s0                             0     56   1   0
0                              0      6   1   1
nil                            3      0   1   1

Fig. 3. Matrices M1 and M2 encode the left-most and right-most trees in Figure 1

1. Numerical Encoding of Terms. Define a one-to-one function ⌈·⌉ that
assigns a numerical value to each function symbol in T . Assign −1 to any vari-
able occurring in T . Gödel numbering is one of the classical ways to show that
a first-order language can be enumerated; but in our case, the signature of each
given program is restricted, and we use a simplified version of enumeration for
convenience. Complex terms are encoded by simple concatenation of the values
of their function symbols and variables. If a term t = f(x1, . . . , xn) contains vari-
ables x1, . . . , xn, the numeric values ⌈x1⌉, . . . , ⌈xn⌉ are negative. In this case, the
absolute values |⌈x1⌉|, . . . , |⌈xn⌉| are concatenated, but the value of the whole term
is assigned a negative sign.

Example 3. For the program ListNat, ⌈0⌉ = 6, ⌈s⌉ = 5, ⌈cons⌉ = 2, ⌈nil⌉ = 3,
⌈x⌉ = ⌈y⌉ = ⌈z⌉ = −1. Thus, ⌈cons(x,cons(y,x))⌉ = −21211.
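This encoding is straightforward to implement. In the sketch below (the representation is our own), a term is a nested tuple whose head is a function symbol; any symbol missing from the code table is treated as a variable with value −1:

```python
def encode_term(term, codes):
    """Concatenate the codes of symbols in prefix order; if the term
    contains a variable, the whole value becomes negative."""
    has_var = False
    digits = []

    def walk(t):
        nonlocal has_var
        head = t[0] if isinstance(t, tuple) else t
        code = codes.get(head, -1)     # unknown symbols act as variables
        has_var = has_var or code < 0
        digits.append(str(abs(code)))
        if isinstance(t, tuple):
            for arg in t[1:]:
                walk(arg)

    walk(term)
    value = int("".join(digits))
    return -value if has_var else value

codes = {"0": 6, "s": 5, "cons": 2, "nil": 3}
encode_term(("cons", "x", ("cons", "y", "x")), codes)   # -> -21211
```

The ground term of the right-most tree of Figure 1 likewise yields `encode_term(("cons", ("s", "0"), ("cons", ("s", "0"), "nil")), codes) == 2562563`, matching matrix M2 of Figure 3.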

2. Matrix Representation of the Proof Trees. For a given tree T , we build
a matrix M as follows. The size of M is (n + 2) × m, where n and m are the
numbers of distinct predicates and terms appearing in T . The entries of M are
computed as follows. For the predicate Ri and the term tj, the ij-th matrix
entry is ⌈tj⌉ if Ri(. . . , tj, . . .) is a node of T , and 0 otherwise. The columns
marked • and 2 trace tree-branching factors and patterns of success leaves:
for the (n + 1)-th column and the term tj, if every node containing tj has exactly k
children given by or-nodes, then the (n + 1)j-th entry in M is equal to k; if the
parameter k for tj varies, then the (n + 1)j-th entry is −1; and it is 0 otherwise.
For the (n + 2)-th column and the term tj, if all children of the node Q(t), for some
Q ∈ P, are given by or-nodes such that all these or-nodes have children nodes
2, then the (n + 2)j-th entry is 1; if some but not all such nodes are 2, then the
(n + 2)j-th entry is −1; and it is 0 otherwise.
3. Vector Representation. The matrix M is then flattened into a vector.
See Figure 3 for examples and [6] for implementation and manual.
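Steps 2 and 3 can be sketched as follows; for brevity this simplified version (our own) fills only the predicate columns and omits the two bookkeeping columns for branching (•) and success leaves (2):

```python
import numpy as np

def feature_vector(nodes, predicates, terms, value):
    """Rows are terms, columns are predicates; entry (j, i) holds the
    numeric code of term t_j when it appears under predicate R_i in
    the tree, and 0 otherwise. The matrix is then flattened (step 3)."""
    M = np.zeros((len(terms), len(predicates)))
    for pred, term in nodes:
        M[terms.index(term), predicates.index(pred)] = value[term]
    return M.flatten()

# (predicate, term) nodes of the left-most tree of Figure 1 (cf. matrix M1)
terms = ["cons(x,cons(y,z))", "cons(y,z)", "x", "y", "z"]
preds = ["list", "nat"]
value = {"cons(x,cons(y,z))": -21211, "cons(y,z)": -211,
         "x": -1, "y": -1, "z": -1}
nodes = [("list", "cons(x,cons(y,z))"), ("list", "cons(y,z)"),
         ("nat", "x"), ("nat", "y"), ("list", "z")]
v = feature_vector(nodes, preds, terms, value)
```

The resulting vector reproduces the predicate columns of matrix M1 in Figure 3, row by row.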

Proposition 1 (Properties of the encoding).


1. For a given matrix M , there may exist more than one corresponding coin-
ductive tree T ; i.e., the mapping from the set T of coinductive trees to the set
M of the corresponding matrices is not bijective.
2. If there exist two distinct root nodes F1 and F2 whose coinductive trees are
encoded by matrices M1 and M2 such that M1 ≡ M2, then F1 and F2 differ only
in variables; however, F1 and F2 are not necessarily α-equivalent.

             Neural Net - ListNat   Neural Net - Stream   SVM - Stream
Problem 1    76.4%                  84.3%                 89%
Problem 2    96.3%                  99.1%                 88%
Problem 3    86%                    n/a                   n/a
Problem 4    n/a                    85.7%                 90%
Problem 5    82.4%                  82.4%                 n/a

Fig. 4. Summary of the best-average results (positive recall) of classification of coinduc-


tive proof trees for the classification problems of Section 3 and proof trees for programs
ListNat and Stream, performed in neural networks and SVMs

3 Evaluation in Neural Networks


Here, we test generality, accuracy, and robustness of the method on a range of
classification tasks and implementation scenarios.
For the experiments of this section, we used data sets of various sizes, from
120 to 400 examples of coinductive trees per experiment; we sampled
trees produced for several distinct logic programs, such as ListNat and Stream
above; see [6]. Finally, we repeated all experiments using three-layer neural net-
works of various sizes, and compared the results with those given by SVMs
with kernel functions.
Note that we deliberately did not tune the learning functions in neural net-
works to fit our symbolic data; but see e.g. [1,11,12].

3.1 Tests on Various Classification Tasks


Problem 1. Classification of well-formed and ill-formed proofs. Figures 1 and 2
show well-formed trees. Trees that do not conform to Definition 1 are ill-formed;
see [6] for more examples. This task is one of the most difficult for pattern-
recognition, due to the wide range of possible erroneous proofs compared to
correct ones; the results are summarised in Figure 4.

Problem 2. Discovery of proof families is exceptionally robust; cf. Figure 4.


Definition 2. Given a logic program P and an atomic formula A, we say that
a tree T′ belongs to the family of coinductive trees determined by A if
– T′ is a coinductive tree with root A′, and
– there is a substitution θ such that Aθ = A′.
Example 4. The three trees in Figure 1 belong to the family of proofs determined
by list(cons(x,cons(y,z))); similarly for Figure 2.
Determining whether a given tree belongs to a certain proof family has practical
applications. For Figure 1, knowing that the right-hand side tree belongs to the
same family as the left-hand side tree would save the intermediate derivation
step. Note that, unlike in [11], it is not the shape of a formula but the proof patterns
it induces when the program is run that influence classification.

Problem 3. Discovery of potentially successful proofs. Significance of this classi-


fication problem is asserted in the proposition below.

Definition 3. We say that a proof-family F is a success family if, for all T ∈ F ,


T generates a proof-family that contains a success subtree (cf. [8]).

Proposition 2. Given a coinductive tree T , there exists a success family F such


that T ∈ F if and only if T has a successful derivation.

Trees like the ones given in Figure 1 were positive examples, and trees like those in
Figure 2 were negative. Bearing in mind the subtlety of the notion of a success family,
the accuracy of classification was astonishing (86%), cf. Figure 4.

Problem 4. Discovery of ill-typed proofs in a proof family. An example of an ill-


typed formula would be nat(cons(x,y)), as by definition, the term cons(x,y)
must be of type list. A tree is ill-typed if it contains ill-typed terms.
When it comes to coinductive logic programs like Stream, detection of success
families is impossible, see Figure 2. In such cases, detection of well-typed and
ill-typed proofs within a proof family will be an alternative.

Problem 5. Discovery of ill-typed proofs is similar to Problem 4; however, the


restriction that all proofs belong to the same proof family is lifted.

Problems 4 and 5 have conceptual significance for future applications, see [4];
our experiments show high accuracy in the recognition of such proofs. Overall, the
proposed method works well and applies to a variety of classification tasks.

3.2 Testing on a Range of Implementation Scenarios


In real-life extensions of this technique, the neural network proof assistant (NN-
tool) will be used on a wide variety of problems. How robust will neural-
network learning be when it comes to less regular and more varied data structures?
We designed the three Implementation Scenarios (IS) below to test the method:

IS 1. NN-tool can create a new neural network for every new logic program.

Example 5. Using our running examples, separate sets of feature vectors for
ListNat and Stream can be used to train two separate neural networks; see also
Figure 4 for experiments supporting this approach.

The obvious objection to such an approach is that creating a new neural network for
every new fragment of a big program development may be cumbersome; it will
not capture possible common patterns across different programs and fragments,
and it will handle badly the cases where some apparently disconnected
programs are bound together by a newly added definition. The next two implementation
scenarios address these problems.

IS 2. NN-tool can use only one neural network, and re-train it irrespective of
the changes in program clauses, new predicates or proof structures.

X-Y, Problem        Initial accuracy for X   Test 1 on Y   Test 2   Test 3   X-Y mixed data
List-Stream, P1     76.4%                    44.2%         51.9%    63.9%    67.1%
Stream-List, P1     84.3%                    36.7%         44%      67%      67.1%
List-Stream, P5     82.4%                    65.6%         80%      99%      80.1%
Stream-List, P5     79%                      43.5%         63.5%    85.9%    80.1%

Fig. 5. Gradual adaptation to new types of proofs. Letters X and Y stand for
logic programs ListNat and Stream interchangeably; P1 and P5 stand for Problems 1
and 5. First logic program X is taken, and neural network’s accuracy is shown in the
first column. Then these trained networks were used to classify examples of the proofs
for a new logic program Y. The accuracy drops at the start, see the “Test 1” column.
Further columns show how the neural network regains its accuracy as it is trained
and tested on more examples of type Y. For comparison, the last column shows batch
learning on mixed data without gradual adaptation.

In this case, the main question is how quickly the neural network will adapt
to new patterns determined by a new logic program. We designed an experiment
to test this; see Figure 5. It shows that gradual adaptation of a previously trained
neural network is at least as efficient as training on mixed data. In fact, for
Problem 5, it is more successful than training on mixed data!

IS 3. NN-tool re-defines and extends feature vectors to fit all available programs.

Example 6. Suppose the NN-tool was used to work with proofs constructed for
two programs – ListNat and Stream; and maintains two corresponding neural
networks. However, a new clause is added by the user:
listream(x,y) ← list(x), stream(y).
This new program Listream subsumes also ListNat and Stream.

The old neural networks will not accept the changed feature vectors for new proofs,
as the additional new predicate changes the size of the feature vectors. In
this case, it is possible to extend the feature matrices, and thus treat the proof
features from different programs as features of one meta-proof. An example of
an extended feature matrix and results of tests are given in Figure 6. Note

Matrix M3 (left)     listream   stream   bit   list   nat   •   2
cons(x, x)            -211411        0     0   -211     0   2   0
scons(y, z)           -211411     -411     0      0     0   2   0
x                           0        0     0     -1    -1   0   0
y                           0        0    -1      0     0   0   0
z                           0       -1     0      0     0   0   0

Accuracy (right)     Prob 1   Prob 5
Merged matrices      84.3%    82%
Listream             76.3%    88.6%
Merged-Listream      51.2%    64.9%

Fig. 6. Left: Feature matrix for coinductive tree for listream(cons(x,x),


cons(y,z)). Right: Accuracy of pattern recognition for feature matrices extended
to encode trees for both ListNat and Stream (“Merged matrix” row); proofs for ex-
tended program Listream (“Listream” row); and experiment on first training neural
networks on “Merged Matrix”, and testing the trained network on Listream (last row).

that the accuracy for Listream (Figure 6) exceeds the accuracy for Stream and ListNat
separately (cf. Figure 4), despite the growth of the feature vectors.
Finally, as Figure 6 shows, we tried to mix Scenarios 2 and 3. It is encour-
aging that for Problem 5, training on merged-matrix features outperformed
simple mixing of data sets (as in Figure 5). When working with extended fea-
ture vectors, Listream outperformed the simpler merged-matrix data training.
This shows that the feature-selection method we present allows extensions that
capture significant and increasingly intricate proof-patterns.

4 Conclusions
The advantage of the learning method presented here lies in its ability to capture
intricate relational information hidden in proof trees, such as patterns arising
from interdependencies of predicate type, term structure, branching of proofs
and ultimate proof success. This method makes it possible to apply neural networks
to a wide range of data-mining tasks; the universality of the method is another of
its advantages. We implemented it in [6]. Future work is to integrate the neural
network tool into one of the existing theorem provers. Another direction is to
apply the method to other kinds of contextually rich data, such as e.g. web pages.

References
1. Denzinger, J., Fuchs, M., Goller, C., Schulz, S.: Learning from previous proof ex-
perience: A survey. Technical report, Technische Universität München (1999)
2. Denzinger, J., Schulz, S.: Automatic acquisition of search control knowledge from
multiple proof attempts. Inf. Comput. 162(1-2), 59–79 (2000)
3. Duncan, H.: The use of Data-Mining for the Automatic Formation of Tactics. PhD
thesis, University of Edinburgh (2002)
4. Grov, G., Komendantskaya, E., Bundy, A.: A statistical relational learning chal-
lenge - extracting proof strategies from exemplar proofs. In: ICML 2012 Workshop
on Statistical Relational Learning, Edinburgh, July 30 (2012)
5. Hitzler, P., Hölldobler, S., Seda, A.K.: Logic programs and connectionist networks.
Journal of Applied Logic 2(3), 245–272 (2004)
6. Komendantskaya, E.: ML-CAP home page (2012),
http://www.computing.dundee.ac.uk/staff/katya/MLCAP-man/
7. Komendantskaya, E., Broda, K., d’Avila Garcez, A.: Neuro-symbolic Represen-
tation of Logic Programs Defining Infinite Sets. In: Diamantaras, K., Duch, W.,
Iliadis, L.S. (eds.) ICANN 2010, Part I. LNCS, vol. 6352, pp. 301–304. Springer,
Heidelberg (2010)
8. Komendantskaya, E., Power, J.: Coalgebraic semantics for derivations in logic pro-
gramming. In: CALCO. LNCS, pp. 268–282. Springer, Heidelberg (2011)
9. Lloyd, J.: Foundations of Logic Programming, 2nd edn. Springer (1987)
10. Lloyd, J.: Logic for Learning: Learning Comprehensible Theories from Structured
Data. Cognitive Technologies Series. Springer (2003)
11. Tsivtsivadze, E., Urban, J., Geuvers, H., Heskes, T.: Semantic graph kernels for
automated reasoning. In: SDM 2011, pp. 795–803. SIAM / Omnipress (2011)
12. Urban, J., Sutcliffe, G., Pudlák, P., Vyskocil, J.: MaLARea SG1 - Machine Learner for
Automated Reasoning with Semantic Guidance. In: Armando, A., Baumgartner,
P., Dowek, G. (eds.) IJCAR 2008. LNCS (LNAI), vol. 5195, pp. 441–456. Springer,
Heidelberg (2008)
Using Weighted Clustering and Symbolic Data
to Evaluate Institutes’s Scientific Production

Bruno Almeida Pimentel, Jarley P. Nóbrega, and Renata M.C.R. de Souza

Centro de Informática, Av. Jornalista Anibal Fernandes,


s/n - Cidade Universitária 50.740-560, Recife (PE), Brazil
{bap,jpn,rmcrs}@cin.ufpe.br
http://www.cin.ufpe.br

Abstract. It is increasingly common to use tools of Symbolic Data
Analysis to reduce a data set without losing much information. More-
over, symbolic variables can be used to preserve the privacy of indi-
viduals when their information is present in the data set. In this work,
we use information about researchers of institutions from Brazil through
the tools of Symbolic Data Analysis and a weighted clustering method
for interval data. The main goal is to analyze the scientific production
of Brazilian institutions. Results of the cluster analysis and concluding
remarks are given.

Keywords: Symbolic Data Analysis, Researcher Scientific Production,


Interval Data, Clustering Method.

1 Introduction
Nowadays the use of databases is increasingly more important and they grow
hugely. In the classical approach, these databases are described by arrays of
quantitative or qualitative values, where each column represents a variable. In
particular, each individual takes a single value for each variable [1]. However,
this representation can be restrictive to use large databases. A possible solution
is to use techniques of Symbolic Data Analysis (SDA) to summarize the data us-
ing symbolic variables that can assume intervals, histograms, and even functions
as values, in order to consider the variability and/or uncertainty innate to the
data [1]. The main goal of SDA is to model information using symbolic data and
extend classical data analysis and data mining techniques, such as, clustering,
factorial techniques, decision trees and another to symbolic data. Therefore, sym-
bolic objects can be used to reduce, improve the understandability and enhance
the recovery of the data.
In many situations, a basic concept of interest (second level of observation)
can be observed in the data set by aggregating observations (first level) that have
this concept in common; that is, a collection of individuals satisfies the concept of
interest [2]. Thus, to summarize the data set, categories of variables are selected
and considered as new statistical units. A widely used way to aggregate the observa-
tions is to use interval-valued variables. For this, SDA has tools to generalize the
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 435–442, 2012.
© Springer-Verlag Berlin Heidelberg 2012
436 B.A. Pimentel, J.P. Nóbrega, and R.M.C.R. de Souza

standard approach using symbolic data [1], in particular using interval-valued
variables. Standard descriptive statistics (such as the mean, variance and histogram)
and methods (such as K-Means, Fuzzy C-Means and Principal Component Analy-
sis) have been extended to symbolic data, and several works have shown the
importance of this extension [1].
A few works show applications of SDA to real databases. Neto and De
Carvalho [3] presented an application on the administrative management of
cities of the Pernambuco state (Brazil) that used interval-valued variables to
describe public services. Zuccolotto [4] presented the use of symbolic data analy-
sis on a database about job satisfaction of Italian workers through the principal
component analysis method. Silva et al. [5] made experiments using information
about web users, with the goal of grouping them so that individuals in the same
group had similar behavior; for this, a dynamic clustering method for interval-
valued variables was used. Giusti and Grassini [6] presented experiments that
used symbolic data analysis methods and tools to cluster local areas of Italy
based on their economic specialization, where the concept of interest was the
local labor systems and the first-level observations were the municipalities.
The purpose of this work is to investigate the use of symbolic data analysis
methods and tools to classify, through a weighted clustering method, Brazilian
research institutes on the basis of their scientific production. The advantage of
this method is that the clustering algorithm is able to recognize clusters of
different shapes and sizes. The analysis is carried out on data derived from
the National Council for Scientific and Technological Development over the years
2006, 2007 and 2008. These data are aggregated by sub-area of knowledge (such as
physics, mathematics, medicine, etc.) and institute in order to summarize the
original database and obtain new concepts of scientific production. Thus, a new
database is constructed in which each item represents a group of
researchers under the same institute and subject of research.
The paper proceeds as follows: Section 2 describes the scientific production
data considered in this paper and highlights the aggregation process adopted to
obtain symbolic data (second-level objects) from these data (first-level objects).
This section also presents the reasons for using symbolic data. Section 3 shows the
application of a weighted clustering method to the scientific production data.
Concluding remarks are given in Section 4.

2 Scientific Production Data

The data were extracted from the National Council for Scientific and Techno-
logical Development (http://www.cnpq.br), an agency of the Ministry
of Science, Technology and Innovation that promotes scientific and techno-
logical research and the training of human resources for research in the country.
Another important Brazilian agency is the Coordination for the Improvement of
Higher Level Personnel (http://www.capes.gov.br), whose main activity is to
evaluate the Brazilian research institutes. This agency evaluates Brazilian
post-graduate courses based on the scientific production of their researchers.
Weighted Clustering and Symbolic Data 437

The scientific production of each researcher is described by a set of 33 continu-
ous numerical and 3 categorical variables. The continuous variables are averages
of production values computed over three years (2006, 2007 and 2008) for each re-
searcher. They are: 1. National journal, 2. International journal, 3. Presentation
of papers, 4. Books, 5. Chapter of book, 6. Other publications, 7. Summary of
journal, 8. Summary of annals, 9. Publication, 10. PhD guidelines finished, 11.
Master guidelines finished, 12. Specialization guidelines finished, 13. Graduate
guidelines finished, 14. UR guidelines finished, 15. PhD guidelines unfinished, 16.
Master guidelines unfinished, 17. Specialization guidelines unfinished, 18. Grad-
uate guidelines unfinished, 19. UR guidelines unfinished, 20. Guidelines finished,
21. Guidelines unfinished, 22. Other intellectual productions, 23. Other types of
production, 24. Registered software, 25. Unregistered software, 26. Unregistered
product, 27. Registered techniques, 28. Unregistered techniques, 29. Technique
works, 30. Technique presentations, 31. Other production-related techniques,
32. Techniques and 33. Artworks. The categorical variables are: institute, area
of knowledge and sub-area of knowledge.
The database considers 141260 researchers from 410 institutions: fed-
eral, state, municipal and private universities, integrated colleges, colleges, insti-
tutes, schools, and technical education centers that have at least one masters or
doctorate course recognized by the Coordination for the Improvement of Higher
Level Personnel, as well as public institutes of scientific research, public techno-
logical institutes, federal centers of technological education, and research and
development laboratories of state enterprises. Each institution is organized into sev-
eral areas of knowledge such as Biological Sciences, Exact Sciences, Engineering,
Agricultural Sciences, Health Sciences, Applied Social Sciences, Humanities and
Linguistics-Literature-Arts. The areas of knowledge are divided into 76 sub-areas
of knowledge, and each researcher is related to only one sub-area of knowledge.
Let $\Omega$ be a data set of researchers indexed by i (i = 1, . . . , 141260). Each
researcher is described by a vector of 33 continuous numerical and 3 categorical
values $v_i = (v_i^1, \ldots, v_i^{33}, c_i^1, c_i^2, c_i^3)$, where $v_i^j \in \mathbb{R}$ (j = 1, . . . , 33) and $c_i^1, c_i^2, c_i^3$ are
the institute, area and sub-area of knowledge of researcher i. Tools of SDA
are applied to this researcher database in order to build new units. These new
units are modeled by interval symbolic data. Here, three reasons motivate
the use of SDA: 1) to reduce the size of the original base, since SDA starts as
a data mining process applied to a large database; 2) to ensure the privacy of
researchers; and 3) to use higher-level objects described by variables that
take variability and/or uncertainty into account.
The main goal of this work is to analyze the scientific production of Brazilian
institutes. In this context, the original data (first level of observation) are ag-
gregated by the institute and sub-area of knowledge categorical variables,
generating a new database of size 5630. These data represent new concepts of
scientific production (second level of observation). The symbolic data are avail-
able at http://www.cin.ufpe.br/~bap/ScientificProduction.
Let E be a data set of researcher groups indexed by h (h = 1, . . . , 5630). Each
researcher group is described by a vector of 33 interval data $x_h = (x_h^1, \ldots, x_h^p)$, p = 33.
Each interval is obtained in the following way: $x_h^j = [a_h^j, b_h^j]$ with $a_h^j < b_h^j$ and
$a_h^j, b_h^j \in \mathbb{R}$, where $a_h^j = \min\{v_i^j\}$ and $b_h^j = \max\{v_i^j\}$ over all researchers
$i \in \Omega$ belonging to group h, that is, all i sharing the same institute ($c_i^1$) and
sub-area of knowledge ($c_i^3$).
Example: consider a group of 4 researchers indexed by 1, . . . , 4 under the same
institute "UFPE" and sub-area of knowledge "Biophysical". Given the continuous
numerical variable "International journal", these researchers have the following
values for this variable, respectively: 1.75, 0.25, 0.75 and 0.00. In order to describe
these researchers by a symbolic interval variable, the original data are
aggregated and the minimum and maximum generalization tools are applied to the
continuous values. This group of researchers then has [0.00, 1.75] as the interval for
the interval variable "International journal".
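As a sketch, the min/max aggregation described above can be written in a few lines of Python; the group and values reproduce the paper's UFPE example, and the tuple layout is purely illustrative:

```python
from collections import defaultdict

# Toy first-level data: (institute, sub-area, value of one continuous
# variable) per researcher -- values taken from the UFPE example above.
researchers = [
    ("UFPE", "Biophysical", 1.75),
    ("UFPE", "Biophysical", 0.25),
    ("UFPE", "Biophysical", 0.75),
    ("UFPE", "Biophysical", 0.00),
]

# Second-level (symbolic) objects: each (institute, sub-area) group
# becomes an interval [min, max] of the observed values.
groups = defaultdict(list)
for inst, sub, v in researchers:
    groups[(inst, sub)].append(v)

intervals = {key: (min(vals), max(vals)) for key, vals in groups.items()}
print(intervals[("UFPE", "Biophysical")])  # (0.0, 1.75)
```

The same aggregation is applied independently to each of the 33 continuous variables, yielding one 33-dimensional interval vector per group.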

3 Application

In this work we used the dynamic clustering algorithm based on adaptive Haus-
dorff distances [8] to partition the data set into K clusters. The main goal of this
partition is to analyze groups of sub-areas of knowledge of different institutions
which have similar properties according to their scientific production. The follow-
ing section gives a brief description of this algorithm.

3.1 A Weighted Clustering Method

Let $P = (C_1, \ldots, C_K)$ be a partition of E into K clusters. Each cluster $C_k \in P$
is described by a prototype $y_k$, itself represented as a vector of intervals $y_k =
(y_k^1, \ldots, y_k^p)$, where $y_k^j = [\alpha_k^j, \beta_k^j]$ with $\alpha_k^j < \beta_k^j$, $\alpha_k^j, \beta_k^j \in \mathbb{R}$, and p is the number
of variables.
The dynamic clustering method for interval data with adaptive Hausdorff distances
searches for a partition $P = (C_1, \ldots, C_K)$ of E into K classes and its corresponding
set of K class prototypes $y_1, \ldots, y_K$ which locally minimize an adequacy criterion
stated as

$$J = \sum_{k=1}^{K} \sum_{i \in C_k} d_k(x_i, y_k). \qquad (1)$$

where $d_k$ is a weighted distance for cluster $C_k$:

$$d_k(x_i, y_k) = \sum_{j=1}^{p} \lambda_k^j \max\{|a_i^j - \alpha_k^j|, |b_i^j - \beta_k^j|\}. \qquad (2)$$

Let $m_i^j = (a_i^j + b_i^j)/2$ be the midpoint of the interval $x_i^j = [a_i^j, b_i^j]$ and
$l_i^j = (b_i^j - a_i^j)/2$ be half of its length. Consider $\mu_k^j = \mathrm{median}\{m_i^j : i \in C_k\}$
and $\rho_k^j = \mathrm{median}\{l_i^j : i \in C_k\}$. The lower and upper bounds $\alpha_k^j$ and $\beta_k^j$ are given by:

$$\alpha_k^j = \mu_k^j - \rho_k^j \quad \text{and} \quad \beta_k^j = \mu_k^j + \rho_k^j. \qquad (3)$$



The parameter $\lambda_k^j$ belongs to the weight vector $\lambda_k = (\lambda_k^1, \ldots, \lambda_k^p)$ and weights the
distance between the object $x_i$ and the prototype $y_k$ according to the cluster $C_k$
and variable j. The weight vectors minimize the clustering criterion J and can
be calculated using the method of Lagrange multipliers. After some algebra the
parameter $\lambda_k^j$ is calculated as:

$$\lambda_k^j = \frac{\left[\prod_{h=1}^{p}\left(\sum_{i \in C_k} \max\{|a_i^h - \alpha_k^h|, |b_i^h - \beta_k^h|\}\right)\right]^{1/p}}{\sum_{i \in C_k} \max\{|a_i^j - \alpha_k^j|, |b_i^j - \beta_k^j|\}} \qquad (4)$$

with the restrictions $\prod_{j=1}^{p} \lambda_k^j = 1$ and $\lambda_k^j > 0$.
The algorithm starts with an initial partition and alternates the representation
and allocation steps until convergence, when the criterion J reaches a stationary
value or the partition no longer changes (test = 0), representing a local minimum.
Schema of the weighted clustering algorithm

1. Initialization
   Randomly choose a partition $P = (C_1, \ldots, C_K)$ of E.
2. Representation step
   (the partition $P = (C_1, \ldots, C_K)$ of E being fixed)
   a) Compute the prototypes $y_k = ([\alpha_k^1, \beta_k^1], \ldots, [\alpha_k^p, \beta_k^p])$, where $\alpha_k^j$ and $\beta_k^j$ are
      calculated following equation (3);
   b) For j = 1, . . . , p and k = 1, . . . , K compute $\lambda_k^j$ with equation (4).
3. Allocation step: definition of the partition
   (the prototypes $y_k = ([\alpha_k^1, \beta_k^1], \ldots, [\alpha_k^p, \beta_k^p])$ and the weights $\lambda_k^j$
   (j = 1, . . . , p; k = 1, . . . , K) being fixed)
   Set test ← 0.
   For i = 1 to n: define $k^* = \operatorname{argmin}_{k=1,\ldots,K} d_k(x_i, y_k)$; if $i \in C_k$ and $k^* \neq k$, set
   test ← 1, $C_{k^*} \leftarrow C_{k^*} \cup \{i\}$ and $C_k \leftarrow C_k \setminus \{i\}$.
4. Stopping criterion
   If test = 0 then STOP; otherwise go to step (2).
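The schema above can be sketched in code as follows. This is an illustrative reimplementation built from equations (1)–(4), not the authors' original software; the `init` parameter, the fixed iteration cap, and the small guard against zero spread are conveniences added here:

```python
import numpy as np

def fit_adaptive_hausdorff(X, K, n_iter=100, init=None, seed=0):
    """Weighted dynamic clustering with adaptive Hausdorff distances (sketch).

    X: array of shape (n, p, 2) -- interval [a, b] per object and variable.
    Returns labels (n,), prototypes (K, p, 2) and weights (K, p).
    """
    rng = np.random.default_rng(seed)
    n, p, _ = X.shape
    labels = np.asarray(init) if init is not None else rng.permutation(np.arange(n) % K)
    mid = (X[:, :, 0] + X[:, :, 1]) / 2           # midpoints m_i^j
    half = (X[:, :, 1] - X[:, :, 0]) / 2          # half-lengths l_i^j
    lam = np.ones((K, p))
    proto = np.zeros((K, p, 2))
    for _ in range(n_iter):
        # Representation step: prototype bounds from medians (eq. 3)
        for k in range(K):
            idx = labels == k
            if not idx.any():
                continue
            mu, rho = np.median(mid[idx], axis=0), np.median(half[idx], axis=0)
            proto[k, :, 0], proto[k, :, 1] = mu - rho, mu + rho
            # Weight update (eq. 4): geometric-mean normalization so that
            # the product of the p weights equals 1.
            d = np.maximum(np.abs(X[idx, :, 0] - proto[k, :, 0]),
                           np.abs(X[idx, :, 1] - proto[k, :, 1])).sum(axis=0)
            d = np.maximum(d, 1e-12)              # guard against zero spread
            lam[k] = np.prod(d) ** (1.0 / p) / d
        # Allocation step: nearest prototype under the weighted distance (eq. 2)
        dist = np.array([
            (lam[k] * np.maximum(np.abs(X[:, :, 0] - proto[k, :, 0]),
                                 np.abs(X[:, :, 1] - proto[k, :, 1]))).sum(axis=1)
            for k in range(K)])
        new_labels = dist.argmin(axis=0)
        if (new_labels == labels).all():
            break                                 # partition unchanged: test = 0
        labels = new_labels
    return labels, proto, lam
```

In the study this procedure is restarted from many random initializations and the run with the lowest criterion J is kept.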

3.2 Results
The Coordination for the Improvement of Higher Level Personnel initially considered
7 levels of categories of institutes. Level 1 means institutes with a very poor
performance in terms of scientific production, below the minimum standard of
quality required, and level 7 means institutes that offer an excellent level
of scientific production. However, in this application only levels 3
to 7 were adopted, since levels 1 and 2 are below the minimum standard of quality
required. Thus, in this work, the clustering algorithm looks for a partition
into 5 clusters.
The weighted clustering method was run 200 times, each time until convergence
to a stationary value of the adequacy criterion, and the best result, according to the
adequacy criterion, was selected. The 5 clusters of the partition have the following
sizes, respectively: 935, 763, 1734, 600, 1598.
440 B.A. Pimentel, J.P. Nóbrega, and R.M.C.R. de Souza

In order to evaluate the spread of each variable, Figures 1 and 2 present the
visualization using the Zoom Star method [7] for the 5 prototypes found by the
weighted clustering algorithm. The Zoom Star method displays the area
between the upper-bound and lower-bound polygons. Here, each axis represents
the lower and upper bounds of an interval variable for a given class. Moreover,
the variables were standardized. The new boundaries are:

$$a_k^j = \frac{\alpha_k^j - m^j}{n^j - m^j} \quad \text{and} \quad b_k^j = \frac{\beta_k^j - m^j}{n^j - m^j}$$

where $m^j = \min\{\alpha_l^j : l = 1, \ldots, K\}$ and $n^j = \max\{\beta_l^j : l = 1, \ldots, K\}$.
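A minimal sketch of this standardization, assuming the prototypes are stored as an array of shape (K, p, 2) with lower/upper bounds in the last axis:

```python
import numpy as np

def standardize_prototypes(proto):
    """Rescale prototype interval bounds to [0, 1] per variable, as done
    before plotting the Zoom Star diagrams (sketch)."""
    m = proto[:, :, 0].min(axis=0)   # m^j: min over clusters of lower bounds
    n = proto[:, :, 1].max(axis=0)   # n^j: max over clusters of upper bounds
    return (proto - m[None, :, None]) / (n - m)[None, :, None]
```

After this rescaling, every variable's axis on the star plot runs over the same [0, 1] range, so the shapes of the five prototypes are directly comparable.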

Fig. 1. Prototypes of the clusters 1, 2 and 3 according to the Zoom Star method

Fig. 2. Prototypes of the clusters 4 and 5 according to the Zoom Star method

According to the figures above, cluster 4 has the highest scientific production.
However, this cluster represents only 10% of the symbolic data set. Cluster 3 has
the lowest production and is the largest cluster, representing 31% of the data set.
Cluster 5 is similar to cluster 2, and together they consist of 42% of the data
set. Cluster 1 has medium scientific production in comparison with the other
clusters and represents 17%. In conclusion, these results show that advances
in scientific production are needed, since 73% of the researcher groups present low
production.
The clustering algorithm based on adaptive distances calculates a weight vector
for each cluster. Each weight represents the variability of a variable in relation to the
prototype: low weight values mean variables with high variability, whereas
high values mean variables with low variability. Table 1 shows the weight vectors
for the 5 clusters provided by the clustering algorithm after convergence. The
registered software and registered techniques variables have very high weights in
clusters 1, 3 and 5. These values are outliers, because there are many values close to
0 for these variables, since the research institutes registered few softwares and
techniques in the years 2006, 2007 and 2008. However, these variables have the
highest contribution values in cluster 4, where they are not outliers. In cluster 2,
the highest contribution values are for the following variables: specialization
guidelines finished, unregistered techniques and summary of journal. There are
no weight outliers in cluster 2. Disregarding the weight outliers, the unregistered
product variable has the highest contribution in both clusters 1 and 3, and the
unregistered techniques variable has the highest weight in cluster 5.

Table 1. Weight values of the variables for the 5 clusters

Variable                                 λ1      λ2      λ3      λ4      λ5
1. National Journal 0.55 0.86 0.27 0.67 0.33
2. International Journal 0.57 0.39 0.53 0.40 0.29
3. Presentation of papers 0.34 0.27 0.13 0.36 0.22
4. Books 1.86 4.23 1.05 3.01 1.62
5. Chapter of book 0.69 1.35 0.45 0.79 0.46
6. Other publications 0.20 0.46 0.11 0.28 0.14
7. Summary of journal 2.56 10.75 506.39 1.14 0.96
8. Summary of annals 0.24 0.21 0.09 0.23 0.11
9. Publication 0.11 0.14 0.04 0.13 0.06
10. PhD guidelines finished 2.08 2.29 506.39 3.01 1.17
11. Master guidelines finished 1.15 1.18 0.62 2.18 0.58
12. Specialization guidelines finished 0.35 1.10 0.17 0.62 0.28
13. Graduate guidelines finished 0.27 0.39 0.07 0.60 0.16
14. UR guidelines finished 0.96 0.70 0.24 1.28 0.44
15. PhD guidelines unfinished 2.34 1.94 506.39 3.31 1.38
16. Master guidelines unfinished 1.71 1.63 0.92 3.38 1.05
17. Specialization guidelines unfinished 1.41 7.33 1.17 2.40 1.98
18. Graduate guidelines unfinished 0.96 1.72 0.53 2.63 0.75
19. UR guidelines unfinished 2.01 0.84 0.48 2.61 1.08
20. Guidelines finished 0.27 0.30 0.06 0.46 0.12
21. Guidelines unfinished 0.73 0.67 0.27 1.74 0.51
22. Other intellectual productions 0.11 0.15 0.03 0.19 0.07
23. Other types of production 0.11 0.14 0.03 0.18 0.07
24. Registered software 1277.71 9.65 506.39 16.10 1054.12
25. Unregistered software 3.63 1.32 3.92 3.98 1054.12
26. Unregistered product 3.72 4.00 6.91 1.34 1054.12
27. Registered techniques 1277.71 3.47 506.39 7.07 1054.12
28. Unregistered techniques 2.98 7.22 506.39 4.22 7.73
29. Techniques works 0.20 0.37 0.17 0.29 0.22
30. Techniques presentations 0.27 0.48 0.11 0.33 0.17
31. Other production-related techniques 0.29 0.44 0.10 0.40 0.19
32. Techniques 0.12 0.24 0.06 0.19 0.10
33. Artwork 0.75 4.79 0.77 1.47 0.99

4 Conclusion

This work presented a study of Brazilian scientific production based on tools of
Symbolic Data Analysis (SDA) and a weighted clustering algorithm. The
data set is originally formed by researchers of different research centers. Tools
of Symbolic Data Analysis were applied in order to model the information with
variables that take variability into account. New units were thus obtained,
described by interval data. Each unit represents data aggregated under
the same institute and subject of research. The aggregation process provided
the following advantages: reduction of the size of the base, assurance of the pri-
vacy of individuals, and utilization of a higher-level category generating research
profiles.
The clustering analysis was based on the Zoom Star approach applied to the
prototypes and on the vector of variable weights calculated for each cluster. The
Zoom Star plots showed that there are three levels of scientific production and that
further advances in scientific production are needed, since 73% of the research
groups present low scientific production. The weight study pointed out several
outlier values and the contribution of each variable to each cluster.

Acknowledgments. The authors would like to thank the Ministry of Science, Tech-
nology and Innovation and the CAPES, CNPq and FACEPE Brazilian agencies for
their financial support.

References
1. Diday, E., Noirhomme-Fraiture, M.: Symbolic Data Analysis and the SODAS Software.
Wiley Interscience, Chichester (2008)
2. Billard, L., Diday, E.: Symbolic Data Analysis: Conceptual Statistics and Data
Mining. Wiley Interscience, Chichester (2006)
3. Neto, E.A.L., De Carvalho, F.A.T.: Symbolic Approach to Analyzing Administrative
Management. The Electronic Journal of Symbolic Data Analysis 1(1), 1–13 (2002)
4. Zuccolotto, P.: Principal components of sample estimates: an approach through
symbolic data analysis. Applied and Metallurgical Statistics 16, 173–192 (2006)
5. Silva, A.D., Lechevallier, Y., De Carvalho, F.A.T., Trousse, B.: Mining web usage
data for discovering navigation clusters. In: IEEE Symposium on Computers and
Communications, pp. 910–915 (2006)
6. Giusti, A., Grassini, L.: Cluster analysis of census data using the symbolic data
approach. Adv. Data Analysis and Classification 2, 163–176 (2008)
7. Noirhomme-Fraiture, M.: Visualization of large data sets: The Zoom Star solution.
Electron. J. Symbol. Data Anal. 0, 26–39 (2002)
8. De Carvalho, F.A.T., Souza, R.M.C.R., Chavent, M., Lechevallier, Y.: Adaptive
Hausdorff distances and dynamic clustering of symbolic data. Pattern Recognition
Letters 27(3), 167–179 (2006)
Comparison
of Input Data Compression Methods
in Neural Network Solution of Inverse Problem
in Laser Raman Spectroscopy of Natural Waters

Sergey Dolenko1, Tatiana Dolenko2, Sergey Burikov2,
Victor Fadeev2, Alexey Sabirov2, and Igor Persiantsev1
1 D.V. Skobeltsyn Institute of Nuclear Physics, M.V. Lomonosov Moscow State
University, Leninskie Gory, Moscow, 119991 Russia
dolenko@srd.sinp.msu.ru
2 Physical Department, M.V. Lomonosov Moscow State University, Leninskie Gory,
Moscow, 119991 Russia

Abstract. In their previous papers, the authors of this study sug-
gested and realized a method for simultaneous determination of temper-
ature and salinity of seawater using laser Raman spectroscopy with the
help of neural networks. Later, the method was improved for de-
termination of temperature and salinity of natural water using Raman
spectra in the presence of fluorescence of dissolved organic matter, which
forms a pedestal under the Raman valence band. In this study, the method
has been further improved by compression of the input data. This paper
presents a comparison of various input data compression methods based on
feature selection and feature extraction, and their effect on the error of
determination of temperature and salinity.

Keywords: neural networks, inverse problems, input data compression,
feature selection, feature extraction, Raman spectroscopy.

1 Introduction
Knowledge of such parameters of seawater as temperature (T) and salinity (S)
is of great importance, because it helps to understand the evolution of climate
change and to study energy exchange between the water surface and the atmosphere.
The necessity of global monitoring of T and S arises from a tendency observed in
recent years: the decrease of the icecap in polar latitudes due to global warming.
Melting of ice leads to desalination of the surface layer of the ocean. This can
trigger a reorganization of the oceanic current system and cause considerable
climate changes, not only in polar areas but on a planetary scale.
It is obvious that for ecological monitoring of natural waters - for determination
of such key parameters as T and S - one needs rapid, non-contact methods of
diagnostics that can be implemented in real time. Such properties are inherent
in the non-contact radiometric method of determination of either S or T of the

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 443–450, 2012.
© Springer-Verlag Berlin Heidelberg 2012

surface of sea waters [1,2], which is the most widespread in oceanology. Measurement
of S with radiometers, based on the dependence of the absorbing capacity
of the water surface on the concentration of salts, allows its determination with an error
no better than one or a tenth of a practical salinity unit (psu) [3]. The error of
determination of sea surface T with the radiometric method, e.g. with the help
of the Advanced Very High Resolution Radiometer [4,5], is 1 ◦ C.
The inaccuracy of determination of T and S is due to the necessity to distinguish
small changes of thermal radiation caused by changes of T and S against the very
intensive signal caused by surface unevenness. One should also account for the influ-
ence of weather conditions on the absorbance of the surface layer of water. Methods
of laser spectroscopy represent a more convenient tool for determination of pa-
rameters of natural water. Using Raman and fluorescence spectroscopy, one can
determine water parameters remotely (using lidar or an optical fiber) in real time.
The influence of T and S on the water Raman spectrum was found by Walrafen [6,7,8].
Based on the dependence of the water Raman valence band on T and S, a method for
determination of these characteristics of sea water was elaborated in [9,10,11].
Using the dependence on T of the ratio of intensities of the high- and low-frequency
parts of the water Raman valence band, the authors of these papers obtained an
error of determination of T of 0.5 ◦ C in laboratory and 2 ◦ C in field conditions.
Vibrational (in particular, Raman) spectroscopy can be used for determina-
tion of parameters of water because of the sensitivity of Raman spectra to the type
and concentration of dissolved salts and to water temperature [12,13,14]. The influ-
ence of water temperature and dissolved salts on the spectra of the water Raman
valence band manifests itself in changes of the shape and position of this band
[12,13,14] (Figs. 1-2 in [15]). With increasing T and/or salt concentration, the
intensity of the high-frequency region of the spectrum increases, the intensity of
the low-frequency region decreases, and the band narrows and shifts to higher frequencies.
In [16], it was suggested to use decomposition of the water Raman valence band
into contours of Gauss or Voigt shape for determination of T. The linear section of
the temperature dependence of the ratio of the two most intense components (high- and
low-frequency) was used. The error of determination of T was 1 ◦ C [16].
The dependence of the same parameter on salt concentration was used in [17,18]
for determination of seawater salinity.
The authors of this paper have demonstrated that T and S can be measured
simultaneously using the water Raman valence band [15,19]. Determination of T and
S by the three-wavenumber method provided errors of determination of the required
parameters of 0.7 ◦ C and 1.0 psu (in vitro), and 1.1 ◦ C and 1.4 psu (in natural
conditions). Use of artificial neural networks (NNs) allowed decreasing the errors
of determination of T and S (to 0.5 ◦ C and 0.7 psu in laboratory conditions [15]).
In this study, determination of T and S was performed in the presence of fluores-
cence of dissolved organic matter (DOM) in a wide range of concentrations. DOM
is always present in natural waters; its concentration is variable and depends
on the measurement place (much more DOM is present in river mouths), season, etc.
[20]. Fluorescence of DOM overlaps with the Raman spectrum, thus providing an
additional source of errors.

2 Experiment

The solution of the stated problem (determination of T and S taking into account the
fluorescence of DOM) using NNs was performed via an experiment-based approach [21].
This means that only experimental spectra were used for NN training. In this case
one needs no a priori constructed model, and all specific features of the object
are automatically taken into consideration.
To perform this study, an array of experimental spectra with different values of
the parameters (temperature, salinity and concentration of DOM) was recorded.
Solutions were prepared from bidistilled water, river humus and sea salt. Salinity
was changed from 0 to 45 psu (step 5 psu), concentration of humus from 0 to
350 mg/l, and temperature from 0 to 35 ◦ C (step 5 ◦ C).

Fig. 1. Scheme of experimental setup: 1 - argon laser (488 nm), 2 - beam splitter,
3 - laser power meter, 4 - focusing lens, 5 - thermo-stabilized cuvette, 6 - system of
thermo-stabilization, 7 - system of lenses, 8 - edge-filter, 9 - monochromator, 10 -
photomultiplier, 11 - CCD-camera, 12 - computer

A diagram of the Raman spectrometer is presented in Fig. 1. Spectra were mea-
sured in the range 800-4000 cm−1 with a practical resolution of 2 cm−1. An argon
laser with output power near 500 mW at wavelength 488 nm was used for excitation
of Raman scattering. Spectra were recorded by a CCD camera. The system of ther-
mostabilization made it possible to set and control the temperature of the samples
with an accuracy of 0.1 ◦ C. Spectra were normalized to the laser power and the time
of data accumulation.
Fig. 2 presents panoramic Raman spectra of solutions in a wide spectral range
(800-3800 cm−1) and in a wide range of temperature, salinity and con-
centration of DOM in the solutions. Registration was performed by a photomul-
tiplier tube (PMT) (using a CCD, one can measure spectra only in a relatively
narrow spectral range). That is why the quality of these spectra is not as high
as of those obtained by CCD (e.g., Fig. 3). Hence, these spectra
were not used for NN training.

Fig. 2. Panoramic Raman spectra. 1 - 25 ◦ C, 0 psu, 0 mg/l; 2 - 25 ◦ C, 45 psu, 0 mg/l;
3 - 25 ◦ C, 45 psu, 175 mg/l; 4 - 25 ◦ C, 45 psu, 350 mg/l

In the preceding study [22], measurements of the valence band (2220-3870
cm−1, 1024 features, Fig. 3) were supplemented by an additional set of identi-
fication features - the low-frequency region of water Raman spectra (from 800 up
to 1800 cm−1, also 1024 features) - which depends on T, S and DOM too and
which can contain Raman bands of such anions as NO3−, SO42−, PO43−, and
HCO3−. It was expected that this would make it easier to determine salinity while
taking the DOM fluorescence into account as noise.

Fig. 3. Raman valence band spectra obtained by CCD: 1 - (0 ◦ C, 25 psu, 0 mg/l); 2 -
(25 ◦ C, 15 psu, 175 mg/l); 3 - (15 ◦ C, 45 psu, 350 mg/l)

All spectra used for work with NN were measured with 5 s camera exposure
time for valence bands and 10 s for low-frequency bands.

3 Methods
In the preceding study [22], the same experimental data array was used
for NN determination of T and S in the presence of DOM. The best results were
obtained with perceptrons with three hidden layers. Using only the Raman valence
band, the best results obtained were 1.2 ◦ C for the mean absolute error (MAE) of
temperature determination and 1.5 psu for the MAE of salinity determination.
Using both the valence band and the low-frequency region, it was possible to re-
duce the errors down to 0.8 ◦ C and 1.1 psu. Recall that the maximum MAE values
that can make a method interesting for practical applications are about 1 ◦ C and 1 psu.
The purpose of the present study was to achieve the same level of results or
better using only the valence band. Such an opportunity would be important, as
recording the low-frequency region of Raman spectra requires more sophisticated
and therefore more expensive experimental equipment.
It was planned to achieve this goal by reducing the initial dimensionality of
the input data (1024 features - spectral channels). It is quite obvious that the
actual dimensionality of the problem should be much lower. So, different methods
of feature selection and feature extraction were applied to achieve input data
compression.
For all NN experiments in this study, a fixed NN architecture was used: a
perceptron with a single 64-neuron hidden layer, a logistic activation function in
the hidden layer and a linear activation function in the output one; learning rate
r = 0.01, momentum m = 0.5. Training was stopped 1000 epochs after the minimum
error on the test set. The results were estimated on the examination (out-of-sample)
set. To account for random factors due to weight initialization, 5 NNs with
different initial weights were trained for each experiment.
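This fixed architecture can be approximated, for instance, with scikit-learn as a stand-in for the software actually used; the data below are random placeholders for the spectra and the (T, S) targets, not the experimental array:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder data: 200 "spectra" of 64 features, two targets (T and S).
rng = np.random.default_rng(0)
X = rng.random((200, 64))
y = rng.random((200, 2))

# One 64-neuron hidden layer, logistic hidden activation, linear output
# (MLPRegressor's output is identity), SGD with lr 0.01 and momentum 0.5.
net = MLPRegressor(hidden_layer_sizes=(64,), activation="logistic",
                   solver="sgd", learning_rate_init=0.01, momentum=0.5,
                   max_iter=300, random_state=0)
net.fit(X, y)
pred = net.predict(X)   # shape (200, 2): predicted T and S
```

Early stopping on a held-out test set, as described above, would be added on top of this (scikit-learn's `early_stopping` option uses a validation split rather than the paper's 1000-epochs-past-minimum rule).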
1) Cross-correlation. The values of cross-correlation (CC) of each of the in-
put features with the output ones were calculated. Then, only the input features
with CC exceeding a pre-defined threshold value (0.3) were used to solve the
problem. The main shortcoming of this method is that linear correlation can cap-
ture only linear relationships between variables, thus failing to find significant
input features with a nonlinear influence on the determined output variable. The
determined dependence of CC on the spectral shift corresponding to each feature is
presented in Fig. 4.
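A sketch of this selection rule as a plain Pearson-correlation filter (the study states only the 0.3 threshold, so the any-output keep rule below is an assumption):

```python
import numpy as np

def select_by_correlation(X, Y, threshold=0.3):
    """Keep input columns whose absolute Pearson correlation with at least
    one output variable exceeds `threshold` (sketch).

    X: (n_samples, n_features), Y: (n_samples, n_outputs).
    Returns the reduced matrix and the boolean keep mask."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov = Xc.T @ Yc / len(X)                          # (n_features, n_outputs)
    cc = cov / np.outer(X.std(axis=0), Y.std(axis=0)) # Pearson correlations
    keep = (np.abs(cc) > threshold).any(axis=1)
    return X[:, keep], keep
```

Applied to the 1024 spectral channels, this kind of filter retained the 375 features reported in Table 1.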

Fig. 4. Spectral dependences of cross-correlation and cross-entropy coefficients

2) Cross-entropy. The values of cross-entropy (CE) of each of the input fea-
tures with the output ones were calculated. Then, only the input features with
CE exceeding a pre-defined threshold value (0.2) were used to solve the problem.
While CE can capture non-linear relationships, the precision of its calculation is
poor for the small number of samples that can be provided by the experiment. The
determined spectral dependence of CE is presented in Fig. 4.
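The exact CE estimator is not spelled out in the text; a common histogram-based stand-in for such a nonlinear dependence measure is the mutual information between a feature and an output, sketched below (the bin count is an assumption, and the small-sample noise is exactly the limitation noted above):

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information between two 1-D samples
    (a stand-in for the paper's cross-entropy measure)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                        # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)
    nz = pxy > 0                            # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

A feature strongly (even nonlinearly) related to T or S scores high; an independent feature scores near zero, up to the finite-sample bias of the histogram estimate.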
3) General Regression NN (GRNN, [23]) with correcting coefficients for the
smoothing factor for each input feature, as implemented in the NeuroShell 2 soft-
ware package [24]. Only input features with a correcting coefficient exceeding a
pre-defined threshold value (0.5) were used to solve the problem. As there are
obviously interconnections among input features, and as the correction coeffi-
cients are determined using a genetic algorithm, the set of coefficients
determined from a single launch of the algorithm is strongly influenced by ran-
dom factors. Therefore, the procedure was applied recurrently several times, each
new launch producing a narrower set of significant features. Each of the itera-
tively obtained sets was used to solve the problem. The dependence of the MAE
for T and S on the number of selected features is presented in Fig. 5.

Fig. 5. Mean absolute error for T and S determination vs number of features selected
by GRNN

Fig. 6. Mean absolute error for T and S determination vs binary logarithm of the
number of features extracted by adjacent channel aggregation

4) Feature Aggregation. The simplest method of feature extraction is
aggregating adjacent spectral channels, thus simultaneously reducing the number of
input features and the level of noise. Summing up intensities in a fixed number of
adjacent channels corresponds to reducing spectral resolution of the device, thus
making the equipment simpler and cheaper. The studied problem was solved
for the following numbers of aggregated channels: 2, 4, 8, 16, 32, 64, 128, thus
producing 512, 256, 128, 64, 32, 16 or 8 aggregated input features, respectively.
The dependence of MAE for T and S on the binary logarithm of the number of
extracted features is presented in Fig. 6.
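The aggregation step itself is trivial to sketch (an illustration under the assumption of contiguous, non-overlapping blocks; names are ours):

```python
def aggregate_channels(spectrum, group_size):
    """Sum intensities over blocks of `group_size` adjacent spectral channels,
    reducing len(spectrum) raw channels to len(spectrum) // group_size features."""
    if len(spectrum) % group_size != 0:
        raise ValueError("spectrum length must be a multiple of group_size")
    return [sum(spectrum[i:i + group_size])
            for i in range(0, len(spectrum), group_size)]
```

For the 1024-channel spectra of this study, `group_size=16` yields the 64 aggregated features that gave the best results below.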
Comparison of Input Data Compression Methods in NN Solution 449

4 Results

The best results obtained in this study for different methods of input data com-
pression are summarized in Table 1. The presented values are mean absolute
error on the out-of-sample set of data.

Table 1. Mean absolute error of problem solution for T and S on the out-of-sample
set of data for different methods of input data compression

Method of feature selection/extraction   Number of input features   ΔT, °C       ΔS, psu
None                                     1024                       1.15±0.08    1.37±0.24
Cross-correlation                        375                        0.92±0.06    1.18±0.09
Cross-entropy                            694                        0.91±0.05    1.15±0.07
GRNN-GA                                  319                        0.97±0.04    1.02±0.07
Channel aggregation                      64                         0.69±0.02    0.76±0.04

5 Conclusion

This study was devoted to a comparison of various methods of feature selection
and extraction for the NN solution of the inverse problem of determining seawater
temperature and salinity from the valence band of the Raman spectrum, in the
presence of fluorescence of dissolved organic matter over a wide range of
concentrations.
The best results were obtained for feature extraction by aggregating every
16 adjacent spectral channels, producing 64 input features. This means that the
practical spectral resolution required to solve the problem is as large as 32 cm−1,
which can be easily achieved with inexpensive spectroscopy equipment.
The obtained values of mean absolute error on the out-of-sample set of data
are 0.69±0.02 °C and 0.76±0.04 psu, which are not much greater than the results
obtained by NN solution of the same problem with no dissolved organic matter.

Acknowledgments. This study was supported by RFBR grants No. 11-05-01160-a
and 12-01-00958-a. All NN calculations were performed with NeuroShell 2
software [24].

References

1. Font, J., Camps, A., Borges, A., et al.: SMOS: The challenging measurement of
sea surface salinity from space. In: P. IEEE, vol. 98 (5), pp. 649–665. IEEE Press,
New York (2010)
2. Turiel, A., Nieves, V., Garcia-Ladona, et al.: The multifractal structure of satellite
sea surface temperature maps can be used to obtain global maps of streamlines.
Ocean Sci. 5, 447–460 (2009)

3. Boutin, J., Waldteufel, P., Martin, N., et al.: Surface salinity retrieved from SMOS
measurements over the global ocean: Imprecisions due to sea surface roughness and
temperature uncertainties. J. Atmos. Ocean. Technol. 21, 1432–1447 (2004)
4. Eugenio, F., Marcello, J., Hernandez-Guerra, A., Rovaris, E.: Methodology to ob-
tain accurate sea surface temperature from locally received NOAA-14 data in the
Canary-Azores-Gibraltar area. Scientia Marina 65(1), 127–137 (2001)
5. Garcia-Santos, V., Valor, E., Caselles, V.: Determination of temperature by remote
sensing. J. of Mediterranean Meteorology and Climatology 7, 67–74 (2010)
6. Walrafen, G.E.: Raman Spectral Studies of Water Structure. J. Chem. Phys. 40,
3249–3256 (1964)
7. Walrafen, G.E.: Raman Spectral Studies of the Effects of Temperature on Water
and Electrolyte Solutions. J. Chem. Phys. 44, 1546–1558 (1966)
8. Walrafen, G.E.: Raman Spectral Studies of the Effects of Temperature on Water
Structure. J. Chem. Phys. 47, 114–126 (1967)
9. Chang, C.H., Young, L.A.: Seawater Temperature Measurement from Raman Spec-
tra. Avco Everett Research Laboratory, Inc., Interim technical report (1972)
10. Leonard, D., Chang, C., Yang, L.: Remote measurement of fluid temperature by
Raman scattered radiation. U.S. Patent 3.986.775, Class 356-75 (1974)
11. Leonard, D., Caputo, B., Hoge, F.: Remote sensing of subsurface water temperature
by Raman scattering. Applied Optics 18(11), 1732–1745 (1979)
12. Terpstra, P., Combes, D., Zwick, A.: Effect of salts on dynamics of water: A Raman
spectroscopy study. J. Chem. Phys. 92(1), 65–70 (1990)
13. Dolenko, T.A., Churina, I.V., Fadeev, V.V., Glushkov, S.M.: Valence band of liquid
water Raman scattering: some peculiarities and applications in the diagnostics of
water media. J. of Raman Spectroscopy 31(8-9), 863–870 (2000)
14. Sherer, J., Go, M., Kint, S.: Raman spectra and structure of water from 10 to 90.
J. Phys. Chem. 78(13), 1304–1313 (1974)
15. Burikov, S.A., Churina, I.V., Dolenko, S.A., et al.: New approaches to determina-
tion of temperature and salinity of seawater by laser Raman spectroscopy. In: 3rd
EARSeL Workshop on Remote Sensing of the Coastal Zone, pp. 298–305 (2003)
16. Karl, J., Ottmann, M., Hein, D.: Measuring water temperatures by means of linear
Raman spectroscopy. In: Proc. of the 9th International Symposium on Application
of Laser Techniques to Fluid Mechanics, vol. II, pp. 23.2.1–23.2.8 (1998)
17. Becucci, M., Cavalieri, S., Eramo, R., Fini, L., Materazzi, M.: Raman spectroscopy
for water temperature sensing. Laser Physics 9(1), 422–425 (1999)
18. Furić, K., Ciglenečki, I., Ćosović, B.: Raman spectroscopic study of sodium chloride
water solutions. J. Mol. Str., 550–551, 225–234 (2000)
19. Bekkiev, A., Gogolinskaya (Dolenko), T., Fadeev, V.: Simultaneous determination
of temperature and salinity of seawater by the method of laser Raman spectroscopy.
Soviet Physics Doklady 271(4), 849–853 (1983)
20. Shubina, D.M., Patsaeva, S.V., Yuzhakov, V.I., et al.: Fluorescence of organic mat-
ter dissolved in natural water. Water: Chemistry and Ecology 11, 31–37 (2009)
21. Gerdova, I.V., Churina, I.V., Dolenko, S.A., et al.: New Opportunities in Solution
of Inverse Problems in Laser Spectroscopy Due to Application of Artificial Neural
Networks. In: Proc. SPIE, vol. 4749, pp. 157–166 (2002)
22. Dolenko, T.A., Burikov, S.A., Sabirov, A.R., et al.: Remote determination of tem-
perature and salinity in consideration of dissolved organic matter in natural waters
using laser spectroscopy. In: EARSeL eProceedings, vol. 10(2), pp. 159–165 (2011)
23. Specht, D.: A General Regression Neural Network. IEEE Trans. on Neural Net-
works 2(6), 568–576 (1991)
24. NeuroShell 2, http://www.wardsystems.com/neuroshell2.asp
New Approach for Clustering Relational Data
Based on Relationship and Attribute
Information

João Carlos Xavier-Júnior1, Anne M.P. Canuto1, Luiz M.G. Gonçalves2, and Luiz A.H.G. de Oliveira2

1 Informatics and Applied Mathematics Department, Federal University of Rio Grande do Norte, Natal, RN, Brazil, 59072-970
{jcx1,anne}@dimap.ufrn.br
2 Computing and Automation Engineering Department, Federal University of Rio Grande do Norte, Natal, RN, Brazil, 59078-900
{affonso,lmarcos}@dca.ufrn.br

Abstract. A wide range of the database systems in use today are based
on the relational model. As a consequence, more of the information used by
those systems has been stored in multiple relational object types. However,
most of the traditional machine learning algorithms were not originally
proposed to handle this type of data. Aiming to propose better
ways of handling the relational particularities of the data, this paper
proposes a new relational clustering method based on relationship and
attribute information. In our method, attributes have weights associated
with their importance between the object types. An empirical analysis is
performed in order to evaluate the effectiveness of the proposed method,
comparing it with two traditional methods for relational clustering. Three
relational databases were used in the experiments.

Keywords: Relational Data, Cluster Validity, Similarity Measures.

1 Introduction
The vast majority of traditional clustering algorithms proposed in the literature
are focused on flat data, also known as single table. In this type of data, informa-
tion is organized into rows and columns, very similar to a spreadsheet. However,
the databases in the real world are much richer in structure, since they involve
multi-type objects and their relationships.
The relational databases are powerful because they require few assumptions
about how data is related or how it will be extracted from the database. As
a result, the same database can be viewed in many different ways [5]. These
characteristics have made it a popular choice of data storage. As a consequence,
different types of systems have used relational database. In a relational database,
all data is stored in tables (object types). They have the same structure repeated
in each row (like a spreadsheet), and also have links (foreign keys) that are used
to perform the relation between them [11].

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 451–458, 2012.

c Springer-Verlag Berlin Heidelberg 2012
452 J.C. Xavier-Júnior et al.

As a consequence of the worldwide use of relational databases, the use of data
mining methods to discover knowledge in this type of database has become an
interesting issue. However, conventional data clustering algorithms, for instance,
identify similar instances of a dataset based on the similarity of their attribute
values. In order to do this, the attributes need to be placed in a single table or
worksheet-format file. Relatively few studies have addressed the particularities
of relational data when discovering unknown patterns within a relational database
[9], [14].
In order to cluster relational data using both relationship and attribute
information, in this paper we propose a new method for Relational Clustering,
called RCRAI. In the RCRAI method, we apply weights to relationship information
between object types according to the number and similarity of attributes
of each object type. This paper also evaluates the effectiveness of the proposed
approach, comparing it with two other ways of clustering relational data:
a baseline approach and the relational clustering method proposed in [4].

2 Related Works
The main aim of clustering algorithms is to use attribute information to group
objects (instances) that have similar attribute values. However, when dealing
with relational data there are more types of information available that need to
be used to distinguish groupings, such as relationship information. Clustering in
relational data has been studied in some works [1], [9], [4], [8], [12].
In [1], the authors proposed a modelling approach that is capable of learning a
clustering model directly from a relational database. In this modelling approach,
they presented a way of generalizing or mapping data with one-to-many relationships
learned from the relational domain. The authors use DARA (Dynamic Aggregation
of Relational Attributes) to convert the data from a relational model into
a vector space model. DARA processes the relational data from tables, converting
it into instances of binary values. In [9], the authors adapted graph-cutting
algorithms to cluster only link attribute (relationships) information, only at-
tributes information, or both by using a hybrid approach (graph-partitioning
algorithm with an attribute similarity metric). Their algorithm was only applied
on synthetic datasets.
In [4], the authors proposed a two-stage clustering method for multi-type
relational data, called TSMRC. To improve clustering quality, the authors proposed
different similarity measures for the two stages. In TSMRC, only attribute values
were considered when clustering the tables separately (first stage), and all
relationships were considered during the second stage. The authors state that
the method improves clustering efficiency and accuracy. However, it is not clear
whether this method can cope with a considerably large number of tables, nor
how much time the two-stage clustering consumes on them.
In another work, [8], the authors proposed a probabilistic model for relational
clustering, which also provides a principled framework to unify various important
clustering tasks, including traditional attribute-based clustering, semi-supervised
clustering, co-clustering and graph clustering. The proposed model
seeks to identify cluster structures for each type of data object and interaction
patterns between different object types (tables).
Finally, in [12], the authors proposed a hierarchical approach for handling the
properties of relational data. In this approach, the attribute values of each
object type are structured in a hierarchical model. By doing this, they were able
to measure the distance between two instances of the dataset, as the instances
have different values according to the hierarchy. Different distance metrics were
applied for normal (categorical or numeric) and hierarchical attributes.
Unlike the aforementioned works, this work introduces a new method for
clustering relational data that uses both relationship and attribute information. In
our method, called RCRAI, we apply weights to the relationship information between
object types according to the number of attributes of each object type, and also
to the attribute information.

3 Relational Clustering Methods


The clustering methods for relational data mentioned in the related works
section handle the particularities of the data with a certain degree of efficiency.
However, they still have some limitations. By examining their clustering aspects,
we chose the clustering method proposed in [4] in order to compare the effectiveness
of our method. A baseline approach was also used for comparison. In this
baseline approach, only attribute information was considered and all relationship
information was discarded. Both methods are presented in the following
subsections.

3.1 The Proposed RCRAI Method


The Relational Clustering based on Relationship and Attribute Information
(RCRAI) method uses a similarity measure to compute the values of common
attributes (categorical and numeric) and relationships. The main idea of this
method is that both attributes and inter-relationships receive weights that
directly influence the distance between two instances. In this case, the
similarity measure used to calculate the distance between instances in RCRAI
is defined as follows:
sim(a, b) = \frac{\sum_{i=1}^{n} \delta(a_i, b_i) \, w_i}{\sum_{i=1}^{n} w_i}    (1)
where a and b are two instances, n is the number of attributes, \delta(a_i, b_i) is the
distance function between the values of attribute i in instances a and b, and w_i
is the weight associated with attribute i.
For the common attributes, the weights w_i are usually set to 1, since they
have the same impact on the similarity measure used. The distance
function \delta(a_i, b_i) is defined according to the type of the attribute to be
measured. For numeric and categorical attributes, respectively, it is defined as follows:

\delta_n(a_i, b_i) = |a_i - b_i|    (2)

\delta_c(a_i, b_i) = \begin{cases} 0, & \text{if } a_i = b_i \\ 1, & \text{if } a_i \neq b_i \end{cases}    (3)
The main aim of this method is to calculate and use weights for the
inter-relationships. In this case, we discard the attributes of the secondary
table of a relational dataset; in other words, only the attributes of the main
table are considered, along with its relational information. If two instances
point to different instances in the secondary table, their distance \delta for that
relationship is 1. It is then multiplied by the corresponding weight in Eq. 1.
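The weighted distance of Eq. (1) over mixed attribute types can be sketched as follows (a minimal illustration; the `schema` labelling, function names and toy values are ours, not the paper's implementation, and numeric attributes are assumed pre-normalized to [0, 1]):

```python
def rcrai_distance(a, b, schema, weights):
    """Weighted distance of Eq. (1). `schema` labels each position of an
    instance as 'numeric', 'categorical' or 'relational' (a foreign key into
    the secondary table); `weights` gives w_i per position (1 for common
    attributes, w_r from Eq. (4) for relational ones)."""
    num, den = 0.0, 0.0
    for ai, bi, kind, w in zip(a, b, schema, weights):
        if kind == 'numeric':
            d = abs(ai - bi)           # Eq. (2), values assumed in [0, 1]
        else:                          # categorical or relational: Eq. (3)
            d = 0.0 if ai == bi else 1.0
        num += d * w
        den += w
    return num / den
```

Note that, although the paper writes sim(a, b), the \delta terms are distances, so lower values mean more similar instances.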
Moreover, these weights can be considered a data-dependent parameter, and we
used a simple method to calculate them in RCRAI. It is based on the variation
of each attribute of the secondary tables of the relationship, according to the
following equation:
w_r = \sum_{i=1}^{m} MV_i    (4)

where m is the number of attributes in the secondary table and MV_i is the
mean variation of attribute i in the secondary table. MV_i is a value which lies
in the interval [0,1], where 0 means no variation of an attribute (all instances
of the secondary table have the same value for this attribute) and 1 means
complete variation (all instances of the secondary table have different values for
this attribute).
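The exact MV_i formula is not given in the text; one assumption-laden reading consistent with its stated endpoints (0 for all-identical values, 1 for all-distinct values) is the normalized distinct-value count below:

```python
def mean_variation(column):
    """Proxy for the paper's MV_i: (distinct - 1) / (n - 1), which is 0 when
    every value is identical and 1 when all values are distinct. This exact
    formula is our assumption, not the authors' definition."""
    n = len(column)
    if n <= 1:
        return 0.0
    return (len(set(column)) - 1) / (n - 1)

def relationship_weight(secondary_rows):
    """w_r of Eq. (4): sum of MV_i over the secondary table's attributes.
    `secondary_rows` is a list of rows; columns are attributes."""
    columns = list(zip(*secondary_rows))
    return sum(mean_variation(col) for col in columns)
```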
In this paper, we use the well-known agglomerative hierarchical clustering
algorithm as the base algorithm for RCRAI. Nevertheless, any clustering algorithm
(deterministic or probabilistic) could be used in the implementation of our method.

3.2 The TSMRC Method

The two-stage clustering algorithm for multi-type relational data (TSMRC)
is a two-stage method for relational data. In the first stage of TSMRC, all
object types are clustered separately using both attribute and relationship
information. In the second stage, all the resulting clusters of the first stage are
merged according to their interrelationships.
The similarity used in the first stage is defined as follows:
Sim_{Att}(X_{ij}, X_{ik}) = \sum_{r=1}^{p} \left| X_{ij}^{r} - X_{ik}^{r} \right| + \lambda \sum_{r=p+1}^{N} \delta(X_{ij}^{r}, X_{ik}^{r})    (5)
where X_{ij}^{r} and X_{ik}^{r} are the r-th attributes of X_{ij} and X_{ik}, N is the number of attributes,
p is the number of numeric attributes, and (N − p) is the number of categorical
attributes. The function \delta(a, b) is a difference function: if a and b are equal,
\delta(a, b) is 0; otherwise, it is 1.
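Eq. (5) can be sketched directly (an illustrative transcription under the convention that the first p positions of each instance are numeric; names are ours):

```python
def sim_att(x, y, p, lam=1.0):
    """First-stage dissimilarity of Eq. (5): L1 distance over the first p
    (numeric) attributes plus lambda times the mismatch count over the
    remaining (categorical) attributes."""
    numeric = sum(abs(x[r] - y[r]) for r in range(p))
    categorical = sum(0 if x[r] == y[r] else 1 for r in range(p, len(x)))
    return numeric + lam * categorical
```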

In the second stage, the similarity between clusters of two object types is
defined as follows:

Sim(C_{ip}, C_{jq}) = \begin{cases} \dfrac{|R_{inter}(C_{ip}, C_{jq})| \cdot |X_i| \cdot |X_j|}{|C_{ip}| \cdot |C_{jq}| \cdot |R_{inter}(X_i, X_j)|}, & i \neq j \\ 0, & i = j \end{cases}    (6)

where C_{ip} and C_{jq} are two clusters of X_i and X_j obtained in the first stage;
|C_{ip}| and |C_{jq}| are the numbers of objects in C_{ip} and C_{jq};
|R_{inter}(C_{ip}, C_{jq})| is the number of interrelationships between C_{ip} and C_{jq};
|X_i| and |X_j| are the numbers of objects in X_i and X_j; and |R_{inter}(X_i, X_j)|
is the number of interrelationships between X_i and X_j.
According to the authors, the agglomerative hierarchical clustering algorithm
was used in both stages.
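Since Eq. (6) operates only on counts, the second-stage similarity reduces to one expression (an illustrative sketch; argument names are ours):

```python
def sim_clusters(size_cip, size_cjq, r_inter_clusters, size_xi, size_xj,
                 r_inter_types, same_type=False):
    """Second-stage cluster similarity of Eq. (6). Arguments are counts:
    the two cluster sizes, the interrelationships between the two clusters,
    the two object-type sizes, and the interrelationships between the two
    object types. Returns 0 for clusters of the same object type (i = j)."""
    if same_type or r_inter_types == 0:
        return 0.0
    return (r_inter_clusters * size_xi * size_xj) / (size_cip * size_cjq * r_inter_types)
```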

4 Experimental Setting Up
Finding available relational databases is not an easy task. For this reason, we
used only three databases in our experiments. The first one is a movie database,
which is widely used in experiments related to relational clustering methods. It
is available at the UCI repository (http://archive.ics.uci.edu/ml/datasets/Movie).
This database is typically relational, with the data divided into multiple
tables (object types). It consists of five tables (Movie, Studio, Director, Actor
and Casts). The original database has more than 10,000 movies. However, we
selected only 500 movies divided into ten categories: action (50), adventure (50),
comedy (50), crime (50), drama (50), fiction (50), war (50), musical (50), romance
(50) and terror (50). For each movie, we chose only three actors who acted in it.
Thus, the table Casts has a total of 1,500 instances. Finally, the 10 different
categories represent the class attribute of the dataset.
The second one is called the Nursery Database, and it was derived from a
hierarchical decision model originally developed to rank applications for nursery
schools. It was used for several years in the 1980s, when there was excessive
enrolment in these schools in Ljubljana, Slovenia, and the rejected applications
frequently needed an objective explanation. This database contains three tables
(Employ, Finance and Health) which together contain a total of 12,960 instances.
The class attribute is composed of five unbalanced values. The database is available
at the UCI repository (http://archive.ics.uci.edu/ml/datasets/Nursery).
The last one is the NatalGIS database, which stores the accesses of users
who visualize geographic information provided by a system of the same name
[13]. This database is composed of eight tables which store the features of the
geographic information provided by the system. All available information belongs
to a coral reef area known as Parrachos de Maracaja, located in the Northeast of
Brazil. Finally, the tables together contain a total of 1,000 instances with no class
attribute.

For measuring the results of a clustering algorithm, several validity indices have
been proposed in the field of data mining [7]. These indices measure
the "goodness" of a clustering result by comparing it to other results obtained by
other clustering algorithms, or by the same algorithm with different parameters.
In this paper, we use three validity indices for measuring the clustering
results: two internal indices, Davies-Bouldin (DB) [2] and Silhouette [10],
and one external index, the Adjusted Rand index [6].
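As an example of these indices, the Davies-Bouldin index [2] can be computed as below (a straightforward stdlib sketch, not the evaluation code used in the paper; lower values indicate more compact, better-separated clusters):

```python
import math

def davies_bouldin(points, labels):
    """DB index: mean over clusters of the worst-case ratio (s_i + s_j) / d_ij,
    where s_i is the average distance of cluster i's points to its centroid
    and d_ij is the distance between the centroids of clusters i and j."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    cents, scat = {}, {}
    for c, pts in clusters.items():
        dim = len(pts[0])
        cents[c] = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        scat[c] = sum(math.dist(p, cents[c]) for p in pts) / len(pts)
    ids = list(clusters)
    total = 0.0
    for i in ids:
        total += max((scat[i] + scat[j]) / math.dist(cents[i], cents[j])
                     for j in ids if j != i)
    return total / len(ids)
```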
For experimental purposes, we proposed three different scenarios. Each
scenario represents a method for clustering relational data. The main purpose of
scenario 1 (very common in the literature) is to establish a baseline approach
necessary for comparison. In this scenario, a dataset containing all the attributes
of all tables is created. However, only numeric and categorical attributes were
considered in this scenario. We use the agglomerative hierarchical clustering
algorithm as the base clustering method.
In scenario 2, we use a method for clustering relational data proposed in [4].
The main idea of this method is to cluster the instances of the relational tables
separately and then merge them together according to their relationships. Again,
we used the agglomerative hierarchical clustering algorithm.
In the third scenario, we use our new method for clustering relational data,
called Relational Clustering based on Relationship and Attribute Information
(RCRAI). In our method, we use the similarity measure of Eq. 1 to compute the
values of normal attributes (categorical and numeric) and relationships. Both
attributes and interrelationships receive weights that directly influence the
distance between two instances. Again, in order to perform a fair evaluation, we
use the agglomerative hierarchical clustering algorithm to cluster the datasets.

5 Experimental Results

In this section, we present the experimental results for scenarios 1, 2 and 3. In
all three scenarios, the number of clusters varied from 2 to 20, and the type
of linkage used was a version of average linkage. It is important to emphasize
that the agglomerative hierarchical clustering algorithm was used in all three
scenarios.
As already mentioned, the clustering methods are evaluated in terms of
three indices, two internal (DB and Silhouette) and one external (Adjusted
Rand). In order to compare the effectiveness of the methods in the proposed
scenarios, a statistical hypothesis test (one-tailed Student's t-test) is applied,
with a confidence level of 95% (α = 0.05) [3].
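The test statistic behind such a comparison can be sketched as follows (a pooled-variance two-sample t statistic, shown as one common reading; the decision then compares t against the one-tailed critical value t_{0.05, df}, whose table lookup is omitted here):

```python
import math

def t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance, plus degrees of
    freedom; significance at 95% one-tailed requires t > t_crit(0.05, df)."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2
```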
Table 1 presents the average and standard deviation values of the DB and
Silhouette indices for scenarios 1, 2 and 3. It is important to emphasize that this
dataset (NatalGIS) does not have class labels; therefore, we cannot use the
Rand index on this dataset. In this table, the best value for each index and each
dataset is marked in bold. The best results that are statistically significant are
underlined. It can be observed that scenario 3 (the proposed method) obtained
the best results for both indices, but the results were not statistically significant.

Table 1. Experimental results for the NatalGIS dataset

NatalGIS Dataset (1,000 instances)
Scenario      Davies-Bouldin     Silhouette
1. Baseline   2.5027 ± 0.4256    0.3571 ± 0.0700
2. TSMRC      3.7608 ± 1.9262    0.3637 ± 0.1376
3. RCRAI      2.4926 ± 0.4066    0.3946 ± 0.0369

Table 2 presents the experimental results for the Movie dataset. Unlike the
previous dataset, Scenario 2 obtained the best result for the DB index while
scenario 3 obtained the best results for Silhouette and Rand indices. However,
the results were not statistically significant in any analysed case.

Table 2. Experimental results for the Movie dataset

Movie Dataset (1,500 instances)
Scenario      Davies-Bouldin      Silhouette        Rand
1. Baseline   7.3696 ± 1.6364     0.3270 ± 0.0671   0.0785 ± 0.0441
2. TSMRC      6.5510 ± 0.4170     0.3145 ± 0.0470   0.1225 ± 0.0302
3. RCRAI      11.0432 ± 1.0179    0.3469 ± 0.0442   0.1297 ± 0.1058

Table 3 presents the experimental results for the Nursery dataset. Note that
Scenario 2 obtained the best results for the DB and Silhouette indices, while
scenario 3 obtained the best result for the Rand index. Moreover, the results
obtained in scenario 3, as measured by the Rand index, were statistically significant.
This is an important result, since it indicates that our method clustered the
Nursery dataset well according to the class attribute of the dataset (Rand index).

Table 3. Experimental results for the Nursery dataset

Nursery Dataset (12,960 instances)
Scenario      Davies-Bouldin      Silhouette        Rand
1. Baseline   10.1384 ± 1.2160    0.0813 ± 0.0106   0.1960 ± 0.0252
2. TSMRC      8.3056 ± 2.0232     0.1986 ± 0.0338   0.2033 ± 0.0653
3. RCRAI      10.1449 ± 1.0939    0.1615 ± 0.0205   0.4552 ± 0.1655

6 Conclusion
The main contribution of this paper is to propose a new clustering method able to
handle the particularities of real-world relational databases. Our method is very
simple, and it computes the distances between instances based on common (numeric
and categorical) and relationship attributes. Based on the experimental results
obtained for scenarios 1, 2 and 3, we can conclude that the proposed method

for clustering relational databases had significant positive results on clustering
compact and separated groups in all three used datasets. Our proposal always
provided better values for the Rand index. In addition, the time consumed by
pre-processing algorithms is very low compared to other studies [4]. Finally, the
number of relational tables does not seem to be a problem for our approach, since
we used databases with the number of tables varying from 3 to 8, and the pattern
of behaviour of our proposal was basically the same for all three datasets.

References
1. Alfred, R., Kazakov, D.: Clustering approach to generalized pattern identification
based on multi-instanced objects with dara. In: ADBIS Research Communications,
pp. 38–49 (2007)
2. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence 1(2), 224–227 (1979)
3. Fisher, R.: Statistical methods for research workers. The Eugenics Review 18(2),
148–150 (1926)
4. Gao, Y., Liu, D.Y., Sun, C.M., Liu, H.: A two-stage clustering algorithm for multi-
type relational data. In: Proceedings of the 2008 Ninth ACIS International Con-
ference on Software Engineering, Artificial Intelligence, Networking, and Paral-
lel/Distributed Computing, pp. 376–380. IEEE Computer Society, Washington,
DC (2008)
5. Harrington, J.L.: Relational Database Design and Implementation: Clearly Ex-
plained. Morgan Kaufmann (2009)
6. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1),
193–218 (1985)
7. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Comput.
Surv. 31(3), 264–323 (1999)
8. Long, B., Zhang, Z., Yu, P.S.: A probabilistic framework for relational clustering.
In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 470–479 (2007)
9. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and
link information. In: Proceedings of the Text Mining and Link Analysis Workshop
(2003)
10. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math. 20(1), 53–65 (1987)
11. Seltzer, M.I.: Beyond relational databases. ACM Queue 3(3), 50–58 (2005)
12. Xavier-Júnior, J.C., Canuto, A., Freitas, A., Gonçalves, L., Silla-Jr., C.: A hi-
erarchical approach to represent relational data applied to clustering tasks. In:
International Joint Conference on Neural Networks, pp. 3055–3062. IEEE Press
(2011)
13. Xavier-Junior, J.C., Signoretti, A., Canuto, A.M.P., Campos, A.M., Gonçalves,
L.M.G., Fialho, S.V.: Introducing Affective Agents in Recommendation Systems
Based on Relational Data Clustering. In: Hameurlain, A., Liddle, S.W., Schewe, K.-
D., Zhou, X. (eds.) DEXA 2011, Part II. LNCS, vol. 6861, pp. 303–310. Springer,
Heidelberg (2011)
14. Yin, X., Han, J., Yu, P.S.: Cross-relational clustering with user’s guidance. In: Pro-
ceedings of the Eleventh ACM SIGKDD International Conference on Knowledge
Discovery in Data Mining, KDD 2005, pp. 344–353. ACM, New York (2005)
Comparative Study on Information Theoretic
Clustering and Classical Clustering Algorithms

Daniel Araújo1,2, Adrião Dória Neto2, and Allan Martins2

1 Federal Rural University of Semi-Árido, Campus Angicos, Angicos/RN, Brasil
daniel@ufersa.edu.br
2 Federal University of Rio Grande do Norte, Department of Computer Engineering and Automation, Natal/RN, Brasil
{daniel,adriao,allan}@dca.ufrn.br

Abstract. This paper proposes a comparative empirical study of
algorithms for clustering. We tested the method proposed in [2] using
distinct synthetic and real (gene expression) datasets. We chose synthetic
datasets with different spatial complexity to verify the applicability of the
algorithm. We also evaluated the IT algorithm on real-life problems by
using microarray gene expression datasets. Compared with the simple but still
widely used classical algorithms k-means, hierarchical clustering and
finite mixture of Gaussians, the IT algorithm proved to be more robust
in both proposed scenarios.

Keywords: Cluster Analysis, Information Theoretic Learning, Complex


Datasets, Entropy, Microarray Data.

1 Introduction

There are many fields in which clustering techniques can be applied, such as
marketing, biology, pattern recognition, image segmentation and text processing.
Clustering algorithms attempt to organize unlabeled data points into clusters
in such a way that samples within a cluster are "more similar" than samples in
different clusters [4]. To achieve this task, several algorithms were developed using
different heuristics. It is known that spatial distribution is a problematic issue
in clustering tasks, since most algorithms have some bias towards a specific
cluster shape. For example, single-linkage hierarchical algorithms are sensitive
to noise and outliers, tending to produce elongated clusters, while k-means tends
to yield elliptical clusters.
The incorporation of spatial statistics of the data gives a good measure of the
spatial distribution of the objects in a dataset. One way of doing this is by using
information-theoretic (IT) elements in the clustering process. In fact, Information
Theory involves the quantification of information in a dataset using
statistical measures. Recently, [11,10,14] achieved good results using elements
of information theory to help clustering tasks.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 459–466, 2012.

c Springer-Verlag Berlin Heidelberg 2012

Based on that, [2] proposed an information-theoretic heuristic for clustering.
They proposed an iterative algorithm that finds the best label configuration by
switching the labels of initial auxiliary regions according to a cost function based
on the cross information potential [12].
In this paper, we evaluate this IT algorithm in different contexts. For this
purpose, we collected several datasets used in the clustering literature: six synthetic
datasets with distinct spatial complexity, and two real gene expression
microarray datasets.
In order to assess the performance of the IT algorithm, we compared its results
with those of the three most used clustering algorithms, i.e., k-means [6], the
hierarchical single-linkage algorithm [6] and the finite mixture of Gaussians [6].
The rest of this paper is organized as follows: in Sect. 2 we make some con-
siderations about clustering analysis; in Sect. 3 we describe the data used and
how the experiments worked; Sect. 4 shows the results obtained and in Sect. 5
the conclusions and final considerations are made.

2 Clustering Analysis
Clustering analysis is an unsupervised way of grouping data using a specific sim-
ilarity measure. Clustering algorithms try to organize unlabeled feature vectors
in natural agglomerates in such a way that objects inside a cluster are more
similar than objects in different clusters. As mentioned before, there are some
issues with traditional clustering analysis, in which only the raw data are
considered in the process. Information-theoretic clustering appeared as a good
alternative that exploits the underlying statistics of the data for clustering
analysis.

2.1 Information Theoretic Clustering


Information theory involves the quantification of information in a dataset using a
statistical measure. One key measure in this context is known as entropy, which
measures the uncertainty about a stochastic event, or alternatively, it measures
the amount of information associated with an event [13].
The fact that entropy measures uncertainty about a stochastic event makes it
an ideal element for building a clustering cost function: when we assign a
sample to a specific cluster, we are subject to an entropy cost.
The authors of [2] proposed an algorithm that uses an entropy-based measure as
the cost function to define the clusters in a dataset. The goal of the algorithm
is to minimize the cross information potential (CIP) between all groups in the
partition. Used in the context of clustering, this measure takes into account
the relationship between clusters, expressed as the influence that one cluster
has over another. Thus, minimizing the cross information potential is
equivalent to maximizing the entropy of the dataset, making the clusters more
homogeneous so that the influence of one cluster over another is minimal.
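As an illustration (a sketch, not the authors' implementation), the cross information potential between two clusters can be estimated with a Gaussian Parzen window, following the Renyi quadratic-entropy framework of [12]:

```python
import numpy as np

def cross_information_potential(A, B, sigma=1.0):
    """Estimate the cross information potential (CIP) between two
    clusters A and B (arrays of shape (n, d)) using a Gaussian
    Parzen window. A lower CIP means the two clusters exert less
    influence on each other, i.e. a more homogeneous partition."""
    # Pairwise squared distances between every point of A and of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian kernel convolution: two kernels of variance sigma^2
    # convolve into one of variance 2*sigma^2
    k = np.exp(-d2 / (4.0 * sigma ** 2))
    norm = (4.0 * np.pi * sigma ** 2) ** (A.shape[1] / 2.0)
    return k.sum() / (len(A) * len(B) * norm)
```

Well-separated clusters yield a CIP near zero, while overlapping clusters yield a large CIP, which is why the algorithm of [2] minimizes it.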
Comparative Study on IT Clustering and Classical Clustering Algorithms 461

2.2 Preprocessing
Machine learning techniques, such as clustering algorithms, may not work
correctly with high-dimensional data. In general, the efficiency and accuracy of
these techniques degrade as the dimensionality of the data increases. Besides,
features related to special properties of the data often lie in a
lower-dimensional subspace of the original data [6].
In the context of this work, we want to cluster gene expression data, which
are characterized by few samples (only a few dozen), each of very high
dimension (up to tens of thousands of attributes). This type of data makes the
use of the proposed clustering unfeasible, both because it requires a large
runtime and because uninformative attributes disturb the clustering process.
A comparative study of dimension reduction techniques in the context of complex
(gene expression) datasets is presented in [1]. Since we need a tool to reduce
the original dimension of our real datasets, we chose the best method
identified in that paper: t-Distributed Stochastic Neighbor Embedding (t-SNE)
[9]. The t-SNE method is a nonlinear mapping from the original space to a
feature space; it performs feature extraction very efficiently, helping the
cluster analysis by reducing the runtime of the algorithms and increasing the
accuracy of the solutions created.
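A sketch of this preprocessing step using scikit-learn's TSNE; the exact t-SNE parameters used in [1] are not stated here, so defaults are assumed, and a preliminary PCA step (a common practice for microarray data) keeps the runtime manageable:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def reduce_expression_data(X, n_components=2, random_state=0):
    """Reduce high-dimensional gene expression data (samples x genes)
    with t-SNE. PCA first brings the thousands of genes down to at
    most 50 components before the nonlinear embedding."""
    X = PCA(n_components=min(50, X.shape[0] - 1)).fit_transform(X)
    return TSNE(n_components=n_components, init="pca",
                random_state=random_state).fit_transform(X)

# Example shaped like the Golub dataset: 72 samples, 7129 genes
X = np.random.RandomState(0).rand(72, 7129)
Y = reduce_expression_data(X)
```

In the experiments below all real data are reduced to two dimensions in exactly this spirit.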

3 Material and Methods


In order to verify the applicability of the information-theoretic clustering al-
gorithm, different datasets were selected and used in a controlled way, i.e.,
databases where we have access to the labels of all objects. This is crucial in
a study of clustering analysis: although the labels are not used during the
clustering process itself, they are essential for evaluating the performance of
the methods tested.
For this purpose, we use synthetic data designed to evaluate the performance
in distinct scenarios. These data represent different situations in the context of
clustering. The datasets selected were used in other studies of clustering [8,7].
We also want to verify the scalability of the method to real-life problems, so we
collected some gene expression datasets to represent this class of problem.

3.1 Synthetic Datasets


The artificial datasets were created to evaluate the performance of the
proposed method on several different spatial complexities.1 The datasets were
selected so that they contained groups with the highest diversity of
characteristics. Thus, we take into consideration the following features:
number of clusters, size of each cluster, cluster shapes (individually),
dispersion of the data, and overlapping elements. All synthetic datasets are
summarized in Table 1 and visualized in Figure 1.
1
All datasets were collected from the page:
http://pages.bangor.ac.uk/mas00a/activities/artificial_data.htm; some of them
were used in the work [7], as referenced.

Table 1. Synthetic Datasets

Dataset k n
boat 3 100
easy doughnut [7] 2 100
four gaussians [7] 4 100
half rings [7] 2 400
petals 4 100
spirals [7] 2 200

(a) boat (b) easy doughnut (c) four gaussians

(d) half rings (e) petals (f) spirals

Fig. 1. Synthetic datasets

3.2 Real Datasets


As mentioned earlier, gene expression data are of great interest to the
bioinformatics research field. Using such data, one can obtain information
about cancer at the molecular level and thereby improve the quality of medical
diagnoses and treatments.
Typically, gene expression datasets contain a large number of attributes (the
expression of thousands of genes), so it becomes a challenge to perform
cluster analysis of these data using classical algorithms.
We chose two datasets in this category for composing the material used in the
evaluation of the IT clustering algorithm: one dataset containing two different
types of cancer and the other with the expression of samples of the same cancer
(three different subtypes).
Table 2 summarizes the gene expression data used in this work. In the table,
n represents the number of objects present in the dataset, #C the number of
classes, and m the original number of attributes (dimension).

Table 2. Real Datasets - Gene Expression Data

Dataset Tissue n #C Class distribution m


Chowdary [3] Breast, Colon 104 2 62,42 22283
Golub [5] Bone marrow 72 3 38,9,25 7129

3.3 Experimental Design

The experiments were carried out by presenting each dataset (synthetic and
real) to each of the methods: k-means (KM), the hierarchical clustering
algorithm using single linkage (SL), the finite mixture of Gaussians (FMG),
and the IT algorithm proposed in [2] (IT).
With the exception of SL, all the algorithms are non-deterministic. To minimize
the effect of the random initial guesses of such algorithms, we ran each of
them 50 times and chose the best result of each for analysis.
In terms of parameter settings, we set all algorithms to cluster the datasets
with their real number of classes. We chose this value for k because we want to
evaluate the methods in terms of recovering the actual structure of the data.
The IT algorithm also has another parameter: the number of initial auxiliary
regions (ak). The performance of the algorithm depends entirely on this
parameter. Thus, we ran the algorithm varying the value of ak in the interval
[2, 40% of n]. The results are measured in terms of the corrected Rand index
(CR) [6].
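The corrected Rand index [6] (also known as the adjusted Rand index) can be computed from the contingency table of the two partitions; a minimal sketch:

```python
import numpy as np
from math import comb

def corrected_rand(labels_true, labels_pred):
    """Corrected (adjusted) Rand index [6]: 1.0 for a perfect match
    with the reference partition, close to 0.0 for a random
    labelling, and possibly negative for worse-than-random ones."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    # Contingency table between reference classes and found clusters
    n_ij = np.array([[int(np.sum((labels_true == c) & (labels_pred == k)))
                      for k in clusters] for c in classes])
    sum_ij = sum(comb(n, 2) for n in n_ij.ravel())
    sum_a = sum(comb(int(n), 2) for n in n_ij.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in n_ij.sum(axis=0))
    expected = sum_a * sum_b / comb(len(labels_true), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that the index is invariant to a permutation of the cluster labels, which is why it can compare an unsupervised labelling with the reference classes.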

4 Experimental Results

In this section, we show the results obtained with the datasets described in
Sect. 3. All results are expressed as values of the corrected Rand index (CR)
with respect to the already known labels of each object in the data. In order
to contextualize the analysis, the results are separated into two sections. The
CR values of the partitions created by each clustering method are displayed in
tables so that their performances can be compared.

4.1 Results for Synthetic Data

Table 3 shows the results of the tested algorithms: k-means (KM), hierarchical
single linkage (SL), finite mixture of Gaussians (FMG) and the IT learning al-
gorithm (IT) for all synthetic datasets. For the latter algorithm, we also show
the number of auxiliary regions related to the presented partition. These tech-
niques were chosen in order to show the performance of algorithms with different
clustering criteria.
Based on this, we first notice that KM achieved good results in only a few
datasets (four gaussians and petals). These two sets have the same
characteristics, i.e., all groups are well separated and the points are
clustered around a common center. This behavior is predictable, because KM
only works correctly

Table 3. Performance of clustering algorithms for synthetic datasets

Dataset KM SL FMG IT
boat 0.38 0.00 0.52 0.71(35)
easy doughnut 0.24 1.00 1.00 1.00(07)
four gaussians 0.95 0.69 0.97 0.97(05)
half rings 0.24 0.03 0.66 1.00(10)
petals 1.00 0.38 0.89 1.00(04)
spirals 0.01 1.00 0.06 1.00(25)

when clusters are elliptical. This happens due to the criterion it minimizes
(mean squared error), which favors groups with little dispersion. In the other
datasets, where the dispersion is greater or the separation between groups is
less obvious, KM did not deliver good results.
In practice, for datasets whose clusters are not linearly separable, the use of
k-means is inappropriate. Despite its low computational complexity, this
technique cannot separate the groups when they are, for example, nested one
inside the other (easy doughnut and spirals). This happens even when the
separation between the groups is clear.
The opposite occurs with SL, which, owing to its local nature, only achieves
good results when the groups are completely separated and the points are very
close to each other, favoring partitions with elongated clusters (easy
doughnut and spirals). This type of technique is very sensitive to noise and
outliers: if points of different groups are close, the probability that these
clusters become one is very high.
The finite mixture of Gaussians showed good results in different contexts. For
the datasets used in this work, this technique succeeded where KM and SL
failed. On the other hand, for datasets such as boat and spirals its
performance was very low. In the case of spirals its result was even worse
than that of SL (the worst performer overall), which obtained the maximum CR
value there. For such datasets it is natural that FMG cannot achieve good
results, since its clustering criterion favors partitions containing groups
with elliptical shapes.
The IT algorithm achieved excellent results (CR close to one) for most
datasets, both for data with low spatial complexity (four gaussians and
petals) and for those with a complex spatial distribution, i.e., easy
doughnut, half rings and spirals. The latter datasets have a data
distribution that prevents separation by minimizing the mean squared error,
preventing KM from achieving good results.
Even for the boat dataset, which mixes several difficulties (internal
clusters, distinct dispersions and different cluster sizes), the IT algorithm
delivered results far superior to those of the other algorithms.
It is important to notice that each classical algorithm was able to cluster
some dataset satisfactorily; however, none achieved good results for the
majority of the datasets. In contrast, the IT algorithm achieved a good
overall performance, reaching good results for the majority of the datasets.

4.2 Results for Real Data

For the real data, we used the two gene expression datasets described earlier.
We followed basically the same procedures as for the artificial data.
Additionally, due to the large
size of the gene expression data, a dimensionality reduction step was added as
a preprocessing procedure. We reduced all data to two dimensions.

Table 4. Performance of clustering algorithms for real datasets

Dataset KM SL FMG IT
chowdary 0.45 0.07 0.59 0.78(06)
golub 0.02 0.01 0.34 0.65(05)

From the results presented in Table 4 we can notice that the performance of
the IT algorithm is far superior to that of the classical clustering
algorithms. Among the latter, FMG obtains the best results.
It is worth noting that the results for Chowdary are, in general, better than
those for Golub. That difference is expected, since the first dataset contains
the gene expression of samples extracted from two different types of cancer,
while the second contains the expression of three subtypes of the same cancer.

5 Conclusions

This paper presented an evaluation of an information-theoretic clustering
algorithm. For this purpose, we ran experiments comparing its results with
three of the most widely used clustering algorithms: k-means, the hierarchical
single-linkage algorithm, and the finite mixture of Gaussians.
With regard to data, experiments were performed with six sets of artificial
data (each one with a different spatial complexity) and two datasets containing
real information about gene expression of cell samples from patients with cancer.
The IT algorithm was superior both in cases where the spatial complexity was
simple and in more complex cases. We also noticed that some “classical”
algorithms achieve good results; however, this was only true for some specific
datasets, showing that these techniques are not robust enough to be applied in
different fields with the same level of precision.
It is important to point out that the IT algorithm has a running time greater
than that of the other algorithms. This is expected, since it uses k-means
itself to find the initial auxiliary regions; besides, the cross information
potential is calculated over all pairs of objects in distinct clusters.
We can conclude that, in the explored domain, the IT algorithm can achieve
better results than classical clustering techniques in terms of recovering the
real structure of artificial and real data, regardless of the complexity of
such data.

References
1. Araujo, D., Dória Neto, A., Martins, A., Melo, J.: Comparative study on dimension
reduction techniques for cluster analysis of microarray data. In: The 2011 Inter-
national Joint Conference on Neural Networks (IJCNN), pp. 1835–1842 (August
2011)
2. de Araújo, D., Neto, A.D., Melo, J., Martins, A.: Clustering Using Elements of
Information Theory. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN
2010, Part III. LNCS, vol. 6354, pp. 397–406. Springer, Heidelberg (2010)
3. Chowdary, D., Lathrop, J., Skelton, J., Curtin, K., Briggs, T., Zhang, Y., Yu, J.,
Wang, Y., Mazumder, A.: Prognostic gene expression signatures can be measured
in tissues collected in RNAlater preservative. J. Mol. Diagn. 8(1), 31–39 (2006)
4. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley (2001)
5. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander,
E.S.: Molecular classification of cancer: class discovery and class prediction by gene
expression monitoring. Science 286(5439), 531–537 (1999)
6. Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice-Hall, Inc., Upper
Saddle River (1988)
7. Kuncheva, L., Hadjitodorov, S., Todorova, L.: Experimental comparison of cluster
ensemble methods. In: 2006 9th International Conference on Information Fusion,
pp. 1–7 (July 2006)
8. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-
Interscience (2004)
9. van der Maaten, L.J.P., Postma, E.O., van den Herik, H.J.: Dimensionality Reduc-
tion: A Comparative Review (2007),
http://www.cs.unimaas.nl/l.vandermaaten/dr/DR_draft.pdf
10. Martins, A.M., Dória Neto, A., Costa, J.D., Costa, J.A.F.: Clustering using neural
networks and kullback-leibler divergency. In: Proc. of IEEE International Joint
Conference on Neural Networks, vol. 4, pp. 2813–2817.
11. Principe, J.C.: Information theoretic learning, ch. 7. John Wiley (2000)
12. Prı́ncipe, J.: Information Theoretic Learning: Renyi’s Entropy and Kernel Perspec-
tives. Information Science and Statistics. Springer (2010)
13. Principe, J.C., Xu, D.: Information-theoretic learning using renyi’s quadratic en-
tropy. In: Proceedings of the First International Workshop on Independent Com-
ponent Analysis and Signal Separation, Aussois, pp. 407–412 (1999)
14. Rao, S., de Medeiros Martins, A., Prı́ncipe, J.C.: Mean shift: An information the-
oretic perspective. Pattern Recogn. Lett. 30(3), 222–230 (2009)
Text Mining for Wellbeing: Selecting Stories
Using Semantic and Pragmatic Features

Timo Honkela, Zaur Izzatdust, and Krista Lagus

Department of Information and Computer Science


Aalto University School of Science, P.O. Box 15400, FI-00076 Aalto, Finland

Abstract. In this article, we explore an application in an area of re-


search called wellbeing informatics. More specifically, we consider how
to build a system that could be used for searching stories that relate
to the interest of the user (content relevance), and help the user in his
or her developmental process by providing encouragement, useful expe-
riences, or otherwise supportive content (emotive relevance). The first
objective is covered through topic modeling applying independent com-
ponent analysis and the second by using sentiment analysis. We also use
style analysis to exclude stories that are inappropriate in style. We dis-
cuss linguistic theories and methodological aspects of this area, outline a
hybrid methodology that can be used in selecting stories that match both
the content and emotive criteria, and present the results of experiments
that have been used to validate the approach.

1 Introduction

Wellbeing informatics is an emerging area of research in which ICT method-


ologies are used to measure, analyze, and promote wellbeing of individuals. Ex-
amples of traditional applications include heart rate monitoring, tracking sports
activities, analyzing the nutritional content of diets, and analyzing sleeping pat-
terns with mobile technologies. In this article, we present analytical methodology
developed for a social media application in which users can find stories that are
potentially helpful in their individual life situations. The users may wish to de-
velop their wellbeing further, or need to solve some problem that prevents them
from achieving a satisfactory level of wellbeing. The novelty of this paper lies
in the overall framework and in its application domain. The specific methods
used, such as independent component analysis, are well known as such. Due to
the complexity of the overall setting, we evaluate the results in a qualitative
manner.
In addition to relying on experts, people have always been listening to advice
provided by their family members and trusted friends. This kind of peer support
can be very valuable as it is not possible to seek professional advice for all health
and wellbeing related questions. Finding a set of suitable stories for a person in
a particular situation resembles the basic information retrieval task in which a
user is provided with a number of search results.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 467–474, 2012.

c Springer-Verlag Berlin Heidelberg 2012
468 T. Honkela, Z. Izzatdust, and K. Lagus

In the wellbeing case, the information retrieval problem setting is multifaceted.


There are at least three criteria that can be used to evaluate stories: a) the match
between the problem or situation of the person and the topic or semantic content
of a story, b) the style of each story (whether it can be considered appropriate or
not), and c) the sentiment of each story (whether the person considers the story
encouraging, inspiring, or in general positive or not). These basic evaluation
criteria are illustrated in Fig. 1.

Fig. 1. The basic architecture of a system that conducts text mining in order to find
stories that can support users’ wellbeing

Fig. 1 also provides a wider system context for the present work. The con-
tent analysis can be divided into two main areas, i.e., semantic and pragmatic
analysis. Linguistic semantics is a research area where computational modeling
has traditionally taken place in the framework of symbolic logic. However,
adaptive and statistical methods are increasingly popular and there are numerous
approaches based on neural networks and statistical machine learning. Classi-
cal examples include latent semantic analysis [5] and self-organizing semantic
maps [14]. In this work, we apply independent component analysis (ICA) in the
semantic analysis [2,8]. This approach is described in detail in the next section.
Whereas semantics predominantly focuses on prototypical meaning, pragmatics is
concerned with communicative, contextual and subjective aspects of meaning [7].
From a computational point of view, research on prototypical semantics is much
more common than work on pragmatics, mainly due to the efforts invested in
knowledge representation and semantic web research. However, there is
increasing evidence that the area of computational pragmatics is gaining
ground. Research on the detection of antisocial behavior from texts [13],
and modeling the context of communication [3] can be mentioned as examples.
In the context of the present work, analysis of sentiments (see e.g. [15]) and style
(see e.g. [12]) are of particular interest.
Text Mining for Wellbeing 469

2 Methods

Here we describe in brief the methods that we later apply for extracting wellbeing-
related patterns and features from discussion forum stories. The components of
the analysis process, i.e. topic analysis with independent component analysis,
style analysis and sentiment analysis are explained in the following sections.

2.1 Topic Modeling with ICA


Independent component analysis (ICA) is a stochastic, unsupervised learning
method for blind source separation. The task in blind source separation is to
separate the original components (sources) from observed mixtures of
random variables without any or with very little a priori knowledge about the
components or the nature of the mixing process. In the classic version of the ICA
model [11,4,10], each observed random variable is represented as a weighted sum
of independent random variables. An example of an observed random variable
in our case is the frequency of a word in a story.
ICA can be seen as an extension to principal component analysis (PCA) and
factor analysis, which underlie latent semantic analysis. ICA is a more powerful
technique than these as it is capable of making underlying factors explicit under
certain conditions [4,10,8]. While PCA is clearly inferior to ICA in terms of
identifying the underlying factors, it is useful as a preprocessing technique
due to its ability to reduce the dimensionality of the data with minimum
mean-squared error. More detailed information on ICA is available in [10] and a detailed
presentation on using ICA on text data can be found in [8]. In this work, we use
ICA to model and extract topics of stories.
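The topic-extraction step can be sketched as follows. The experiments reported later use the FastICA Matlab package; scikit-learn's FastICA is used here as a stand-in (its symmetric "parallel" estimation with the logcosh contrast corresponds to symmetric orthogonalization with the tanh nonlinearity, and PCA whitening is done internally):

```python
import numpy as np
from sklearn.decomposition import FastICA

def extract_word_features(word_doc, n_features=20, seed=0):
    """Extract emergent features from a word-document matrix
    (rows = words, columns = documents) with FastICA. Returns one
    n_features-dimensional representation per word."""
    ica = FastICA(n_components=n_features, algorithm="parallel",
                  fun="logcosh", max_iter=500, random_state=seed)
    return ica.fit_transform(word_doc)

# Toy example: 100 "words" observed across 40 "documents"
rng = np.random.default_rng(0)
S = rng.laplace(size=(100, 40))
features = extract_word_features(S, n_features=20)
```

As noted below, the sign of each extracted feature is arbitrary, so only the magnitude pattern of a word's feature vector is interpretable.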

2.2 Sentiment Analysis


Sentiment analysis has gained an increasing amount of interest as a problem to
be solved. It is valuable, for instance, when companies wish to know how
customers are commenting on their services and products in social media and
what kind of direct feedback they receive as e-mail messages, through web
forms, etc. There are different kinds of methodological alternatives developed
for sentiment analysis (see, e.g., [9,1,6,15]).
The basic approach is to build a lexicon where each lexical item (word or
phrase) is associated with a negative or positive sentiment value. This lexicon can
be built fully by hand, or machine learning techniques can be used to associate
values automatically. Linguistic methods are applied for handling constructions
such as negation. The overall field of sentiment analysis is too varied to be fully
explored here. We rather consider it as one module in the overall system and
acknowledge that different choices can be made.
For the sentiment analysis, we used a method called SentiStrength, which has
been developed at the University of Wolverhampton, UK1 [15]. SentiStrength
1
http://sentistrength.wlv.ac.uk/

estimates the strength of positive and negative sentiment in short texts, even for
informal language. SentiStrength provides values both for positive and negative
sentiments, with scales from -1 (not negative) to -5 (extremely negative), and
from 1 (not positive) to 5 (extremely positive). This means that a document can
at the same time show both positive and negative sentiments, which provides us
with more useful information compared to the regular approach, in which only
the polarity of a text is determined.
SentiStrength is a lexicon-based classifier that uses negating words, emoticons,
spelling correction, punctuation and other kinds of linguistic information in an
attempt to achieve high precision in detecting sentiments [15].
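A toy illustration of this dual-scale output (the miniature lexicon is hypothetical and this is not SentiStrength itself, which additionally handles boosters, negations, emoticons and spelling correction):

```python
# Hypothetical miniature lexicon for illustration only
LEXICON = {"love": 3, "happy": 2, "relief": 1,
           "sad": -2, "anxious": -3, "terrible": -4}

def dual_sentiment(text):
    """Return (positive, negative) strengths in SentiStrength's
    style: positive on a 1..5 scale, negative on a -1..-5 scale.
    A text can score on both scales at the same time."""
    scores = [LEXICON.get(w, 0) for w in text.lower().split()]
    pos = max([s for s in scores if s > 0], default=0)
    neg = min([s for s in scores if s < 0], default=0)
    return (1 + pos if pos else 1, -1 + neg if neg else -1)

print(dual_sentiment("I was anxious but now I am happy"))  # (3, -4)
```

The example text scores on both scales at once, which is exactly the extra information the dual output provides over a single polarity value.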

2.3 Style Analysis


Stylistic differences of texts can be most easily defined either by describing the
source of the text, or in terms of the genre of the text. Style also deals with the
complexity, readability and trustworthiness of texts [12].
In the application context of this work, the most important style factor is
the inappropriate use of language such as swearing. When recommendations
about potentially useful stories are given, stylistically questionable texts should
be excluded. In this work, we applied a simple vocabulary-based approach that
is described in the following section.

3 Experiments

3.1 Data and Preprocessing


The data used in the experiments consists of stories that have been collected
from an internet-based service called Reddit (http://www.reddit.com). Reddit
is a social media site where registered users provide content in the form of texts,
normally written by the users themselves, or links. The contents cover all aspects
of life but we have naturally focused on items that deal with wellbeing. More
specifically, selected Reddit channels were “anxiety”, “depression”, “happy”, “off
my chest” and “self help”. These include stories that describe both positive (e.g.
happiness, accomplishment) and negative experiences (e.g. depression, anxiety).
In the preprocessing phase, punctuation marks were removed and all upper-
case letters were replaced by the corresponding lowercase letters. The resulting
corpus consists of 2570 documents, 2,886,772 tokens (words in the running text)
and 13,023 types (different unique words).
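The preprocessing phase can be sketched as follows (the exact tokenization rules used are not specified, so a simple regular expression is assumed):

```python
import re
from collections import Counter

def preprocess(text):
    """Replicate the preprocessing described above: remove
    punctuation marks, lowercase everything, and return tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

tokens = preprocess("I can't sleep -- Anxiety again, AGAIN!")
type_counts = Counter(tokens)          # type -> frequency
print(len(tokens), len(type_counts))   # token count, type count
```

Applied to the whole collection, counts of this kind yield the 2,886,772 tokens and 13,023 types reported above.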

3.2 Independent Component Analysis


In the ICA analysis of the Reddit data, we applied the FastICA algorithm on
the word-document matrix. The original dimensionality of the data was first
reduced by PCA. Symmetric orthogonalization and tanh as the nonlinearity
function were used [10].
Text Mining for Wellbeing 471

The vocabulary was manually selected to only cover words that are related to
the theme of wellbeing. The full list is too long to be included here but it can be
found at http://research.ics.tkk.fi/cog/data/icann12sp/wordlist.txt.
We used the FastICA Matlab package to extract a prespecified number of 20
features. In considering the feature distributions, it is good to keep in mind that
the sign of the features is arbitrary. This is due to the ambiguity of the sign: the
components can be multiplied by 1 without affecting the model [10,8].
Examples of the ICA results on the Reddit data are shown in Fig. 2 and
Fig. 3. The upper row of diagram in Fig. 2 shows how words “anxiety”, “stress”,
“nausea”, and “relief” are associated with the same emergent feature. Similarly,
when the lower row in the figure is considered, it is clear that the words “class”,
“book”, “exam” and “professor” have a shared feature. In each case, the repre-
sentation of the word is quite sparse, i.e., each word is mainly represented by
one or two distinguishable features.


Fig. 2. Upper row: Anxiety, stress, nausea and relief, lower row: Class, book, exam and
professor

Another set of results of emergent features based on the use of ICA is
presented in Fig. 3. The analysis of the words “problem”, “worry”, “health”
and “motivation” gives rise to a rich representation: these words are clearly
associated with several emergent features.


Fig. 3. Problem, worry, health and motivation



The categorical nature of each emergent feature can be illustrated by listing


the words that are strongest in relation to each component. This analysis is
presented in Table 1. The order of features is arbitrary but the words most
strongly associated with each feature show a clear emergent order. The content
of the table is quite self-explanatory, but it may be interesting to consider
some specific examples. For instance, the feature number 10 is clearly associated
with family members and number 8 with alcoholic and non-alcoholic beverages.
Feature number 15 appears to collect words that most likely are associated with
phenomena that promote wellbeing. This hypothesis receives concrete support
in further analysis, discussed later in this paper. If we consider the analysis of
the words “health” and “motivation” shown in Fig. 3 and compare them with
the features shown in Table 1, we can see that the emergent feature structures
clearly make sense. For instance, the word “health” is clearly associated with
features numbered 1 (“anxiety”, etc.), 12 (“fobia”, etc.) and 20 (“doctor”, etc.).
In general, it can be concluded that the ICA-based study of a rather small
number of documents gives rise to a meaningful representation of the relationship
between conceptual items related to the domain of wellbeing.

Table 1. Automatically extracted ICA features

Feature Associated words


1 anxiety, nausea, relief, stress, yoga, relaxing (also 19)
2 hate, dread
3 conversation, friends
4 facebook, happy, song
5 chronic, depressed, illness, suffering
6 boyfriend, relationship
7 job
8 alcohol, beer, drink, tea, water
9 friend, girl
10 family, dad, mom, father, mother, parents, brother, sister
11 disability, work
12 fobia, anger, progress, therapist(s), therapy, psychologist (also 20)
13 book (also 15), class, exam, professor (also 17)
14 adrenaline, panic, danger, dying
15 dog, god, love, laugh (also 3), music (also 18), wisdom (also others)
16 worry (also 1), symptom (also others), problem (also 8, 12, 13)
17 school, college, university, homework
18 social, pain
19 sleep, asleep, dream, nightmares, paralysis
20 doctor, drugs, hospital, treat, medical, prescription, google, psychiatrist (also 12)

3.3 Sentiment and Style Analysis


As mentioned earlier in this article, the SentiStrength method outputs values
that assess both positive and negative sentiments in a document. From the anal-
ysis, we can see that, overall, the selected subset of Reddit data is biased in the

direction of negative sentiments. The results are shown in more detail when they
are presented in relation to the results gained with other methods.
The style analysis approach we chose consists of determining the ratio of
obscene language to normal language used in the stories. The method simply
checks the texts against a dictionary of swear words and then normalizes the
results against the length of the corresponding stories.2

3.4 Relationships between Variables


In this section, the results of two types of analyses of the relationships between
different variables are presented. First, correlation coefficients between the up
and down votes, the positive and negative sentiments, the age of the stories, and
the ICA components are shown. Assuming that an up-vote indicates a preference
for reading a story leads us to look at the correlations of that feature with many
other features. Regarding correlations with the thematic features found by the ICA,
the strongest correlations are between the ICA features “chronic” and “anxiety”
and the negative sentiments expressed in the text (see Table 2 for further details).

Table 2. Correlation coefficients between variables

Up Down Days Posit. Negat. Style


Upvotes 1.0000 0.7982 0.0646 0.2217 0.0280 0.0099
Downvotes 0.7982 1.0000 0.0664 0.1521 0.0027 0.0046
Days 0.0646 0.0664 1.0000 0.0255 0.0614 -0.1245
Posit. sentiment 0.2217 0.1521 0.0255 1.0000 -0.0851 0.0456
Negat. sentiment 0.0280 0.0027 0.0614 -0.0851 1.0000 -0.0735
Style 0.0099 0.0046 -0.1245 0.0456 -0.0735 1.0000
ICA1: anxiety 0.0462 0.0431 -0.0300 -0.0478 0.1159 0.1138
ICA4: facebook -0.3625 -0.2591 -0.0305 -0.1482 0.0279 -0.0153
ICA5: chronic 0.0080 0.0235 0.1296 -0.0548 0.1601 0.0537
ICA9: friend 0.0624 0.0535 -0.0438 0.1061 0.0015 0.0805
ICA10: family -0.1355 -0.1578 0.0251 -0.1179 0.0703 -0.0406
ICA12: fobia -0.0313 -0.0183 0.0609 0.1010 -0.1184 -0.0552
ICA14: adrenaline 0.0312 0.0228 -0.0075 -0.0194 0.1025 0.0606
ICA15: dog -0.1356 -0.1047 -0.0592 -0.2692 0.0167 -0.0155

4 Conclusions and Discussion


The domain of research presented in this article is novel. It touches upon compu-
tational methods, linguistics, psychology and sociology; thus we do not claim
to have conclusive results but rather aim to pave the way for applications in the
area of wellbeing informatics. We see the overall framework as our main contribution.
The results highlight the importance of looking at various pragmatic features, in
addition to semantic ones, when selecting stories that users of wellbeing-related
discussion forums find useful.
2 The list of swear words is available at http://research.ics.aalto.fi/cog/data/icann12sp/
474 T. Honkela, Z. Izzatdust, and K. Lagus

References
1. Agarwal, A., Bhattacharyya, P.: Sentiment analysis: A new approach for effective
use of linguistic knowledge and exploiting similarities in a set of documents to be
classified. In: Proc. of the Int. Conf. on NLP (2005)
2. Bingham, E., Kuusisto, J., Lagus, K.: ICA and SOM in text document analysis. In:
Proceedings of the 25th ACM SIGIR Conference, pp. 361–362. ACM, New York
(2002)
3. Bleys, J., Loetzsch, M., Spranger, M., Steels, L.: The grounded color naming game.
In: Proceedings of the 18th IEEE International Symposium on Robot and Human
Interactive Communication (2009)
4. Comon, P.: Independent component analysis—a new concept? Signal Processing 36,
287–314 (1994)
5. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.:
Indexing by latent semantic analysis. Journal of the American Society of Informa-
tion Science 41, 391–407 (1990)
6. Devitt, A., Ahmad, K.: Sentiment analysis in financial news: A cohesion-based
approach. In: Proceedings of the Association for Computational Linguistics (ACL),
pp. 984–991 (2007)
7. Givón, T.: Mind, code, and context: essays in pragmatics. Lawrence Erlbaum As-
sociates (1989)
8. Honkela, T., Hyvärinen, A., Väyrynen, J.: WordICA - Emergence of linguistic
representations for words by independent component analysis. Natural Language
Engineering 16(3), 277–308 (2010)
9. Hurst, M., Nigam, K.: Retrieving topical sentiments from online document collec-
tions. In: Document Recognition and Retrieval XI, pp. 27–34 (2004)
10. Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis, vol. 26.
Wiley (2001)
11. Jutten, C., Hérault, J.: Blind separation of sources, part I: An adaptive algorithm
based on neuromimetic architecture. Signal Processing 24, 1–10 (1991)
12. Karlgren, J.: Textual stylistic variation: Choices, genres and individuals. In: Struc-
ture of Style, pp. 129–142. Springer (2010)
13. Munezero, M., Kakkonen, T., Montero, C.: Towards automatic detection of antiso-
cial behavior from texts. In: Proceedings of the Workshop on Sentiment Analysis
where AI meets Psychology (SAAIP 2011), pp. 20–27 (November 2011)
14. Ritter, H., Kohonen, T.: Self-organizing semantic maps. Biological Cybernet-
ics 61(4), 241–254 (1989)
15. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength
detection in short informal text. Journal of the American Society for Information
Science and Technology 61(12), 2544–2558 (2010)
Hybrid Bilinear and Trilinear Models
for Exploratory Analysis of Three-Way
Poisson Counts

Juha Raitio, Tapani Raiko, and Timo Honkela

Aalto University School of Science,


Department of Information and Computer Science,
P.O.Box 15400, FI-00076 AALTO, Finland
{juha.raitio,tapani.raiko,timo.honkela}@aalto.fi
http://ics.aalto.fi/

Abstract. We propose a probabilistic model class for the analysis of


three-way count data, motivated by studying the subjectivity of lan-
guage. Our models are applicable for instance to a data tensor of how
many times each subject used each term in each context, thus revealing
individual variation in natural language use. As our main goal is ex-
ploratory analysis, we propose hybrid bilinear and trilinear models with
zero-mean constraints, separating modeling the simpler and more com-
plex phenomena. While helping exploratory analysis, this approach leads
into a more involved model selection problem. Our solution by forward
selection guided by cross-validation likelihood is shown to work reliably
on experiments with synthetic data.

Keywords: tensor factorization, multilinear model, unsupervised learn-


ing, exploratory data analysis, text analysis, Grounded Intersubjective
Concept Analysis.

1 Introduction
As a generic task, the analysis of counts in relation to two categorical variables,
also known as factors, ways or modes, is encountered in a vast number of scientific
studies and engineering applications. In order to make the problem setting and
the evaluation of the present work more accessible, we concentrate on a concrete
example from text analysis where the counts of selected words in a set of doc-
uments are represented as a term-document matrix, with words indexed as rows
and documents as columns. Common analyses of this representation include relating
the documents to each other by the counts of the word occurrences in docu-
ments, or studying the relation of words by their co-occurrences in documents.
As such, this comprises an example of 2-way data analysis.
It may be of interest to additionally study how the counts of the term-
document matrix vary according to some other factor such as the author of
the document. In fact, if the variation between documents according to the au-
thor is included in analysis, one may be able to attribute some of the variation

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 475–482, 2012.

© Springer-Verlag Berlin Heidelberg 2012
in the word counts to the language use of the author, and consequently, give
more accurate inference on relations between words (or documents) in general.
The term-document data augmented with information about the authors has a
representation as 3-way array. 3-way data arrays, or in general multi-way data,
can be studied for example using methods of tensor data analysis. For a generic
introduction to the topic, see e.g. [13].
The present work is originally motivated by the finding that there can be sub-
stantial individual variation in how natural language expressions are used and
interpreted. In [6], a method called Grounded Intersubjective Concept Analysis
(GICA) has been introduced. The essence of the GICA method is to model in-
dividual variation in using natural language expressions and for this purpose, a
3-way analysis of Subject-Object-Context (SOC) tensors is needed. The analy-
sis of such tensors may reveal individual differences in style but, more impor-
tantly, indicate subjectivity in modeling the relationship between language and
the world. If this kind of subjectivity remains unrecognized, various kinds of
problems related to communication may arise.
A more specific motivation for the present work stems from the fact that in [6]
the analysis of Subject-Object-Context tensors was conducted by flattening the
3-way arrays to 2-way matrices. These matrices can then be straightforwardly
analyzed using traditional data analysis methods such as PCA, SVD or ICA.
Each direction of flattening introduces a point of view and may, as such, provide
important insights into the data when analyzed. However, the flattening of the
original data appears to be useful, but possibly inadequate approach. It appears
necessary to devise a methodology that would make it possible to analyze all the
relationships without first determining which modes of the array are in focus. As
discussed above, traditional term-document matrices are formed by counting the
number of instances of each term in each document and by storing this count in
the element that corresponds to the row associated with the term in question and
the column associated with the particular document. The GICA data is formed
following the same basic principle, but adding a third mode which is used to
include all the subjects being included in the analysis. Moreover, rather than
considering frequency counts in whole documents, the counts typically concern
some context window of a given length.
One might think that subjectivity of language would be an exception rather
than a rule, since semantics appear to be well defined through thesauri, ontologies,
and other knowledge representations. However, as natural language is immersed
in ambiguity, there is also a great amount of subjectivity and contextuality
involved. A more detailed account of this matter is provided in [6].
Here it may be sufficient to refer to two examples. For the basic color terms,
there seems to be a high degree of intersubjective agreement. Around the idea
of prototypical red, green or blue there is not much subjective variation even
though a particular context may shift the evaluation, like in the case of phrases
“red skin” or “red wine” [3]. However, a lot more subjective variation is to be
expected if less typical color names are considered, such as “purple”, “khaki”
or “orchid”. An even more convincing example is when abstract words are
considered. It is unlikely that all people mastering English would understand


words like “democracy/-tic”, “fair”, “love”, “wellbeing”, or even “computation”
in a mutually compatible manner. It should be obvious that there is variation in
the interpretation and use of these and many other words. Tools for the formal
modeling and systematic analysis of this kind of semantic variation are not
readily available and widely used, though. The Subject-Object-Context tensors
[6] and the methodological development presented in this paper aim to alleviate
the lack of tools and to provide a systematic framework for approaching this
common but mostly unexplored phenomenon.
In this paper, we thus propose an unsupervised method for analyzing 3-way
count data – including GICA data – where a 2-way analysis based on decom-
position models is extended to allow 3-way analysis in the same framework in a
probabilistic manner. The complexity of the original phenomenon is very high
and the same concerns the data in question. With the methodology presented
in this paper, it should become possible to explore the data so that useful con-
clusions can be drawn. In particular, such exploration can reveal a significant
level of variation in the interpretation of an expression even when the context is
the same.

2 Proposed Model
The count $x_{ijk}$, indexed by the levels $i$, $j$ and $k$ in the ranges $\{1, 2, \ldots, I\}$,
$\{1, 2, \ldots, J\}$ and $\{1, 2, \ldots, K\}$ of the three modes under consideration, is modeled
as Poisson distributed

$$P(x_{ijk}) = \mathrm{Pois}(\exp(l_{ijk})), \qquad (1)$$


where the trilinear predictor is

$$l_{ijk} = a_i^{(0)} + b_j^{(0)} + c_k^{(0)} + \sum_{m_1=1}^{h_1} a_{im_1}^{(1)} b_{jm_1}^{(2)} + \sum_{m_2=1}^{h_2} b_{jm_2}^{(1)} c_{km_2}^{(2)} + \sum_{m_3=1}^{h_3} c_{km_3}^{(1)} a_{im_3}^{(2)} + \sum_{m_4=1}^{h_4} a_{im_4}^{(3)} b_{jm_4}^{(3)} c_{km_4}^{(3)}. \qquad (2)$$

This specifies a model class that predicts the logarithm of the Poisson mean
count by a specially structured trilinear model consisting of
– bias parameters $a^{(0)}$, $b^{(0)}$, and $c^{(0)}$ for capturing the mean of each mode,
– all combinations of the bilinear factorizations with parameters $a_{i:}^{(q)}$, $b_{j:}^{(q)}$ and $c_{k:}^{(q)}$, $q = 1, 2$, for capturing interactions between pairs of modes,
– the trilinear factorization, or the PARAFAC model [4], with parameters $a_{i:}^{(3)}$, $b_{j:}^{(3)}$ and $c_{k:}^{(3)}$ for capturing 3-way interactions between modes, and
– hyperparameters $h_1$, $h_2$, $h_3$ and $h_4$ for adjusting the model complexity,
where the subscript “:” is used to denote all values of the index of summation
m within a factorization.
Without loss of generality, we assume that the vectors $a^{(q)}$, $b^{(q)}$, and $c^{(q)}$, $q =
1, 2, 3$, are zero-mean in the sense that $\sum_i a_{im}^{(q)} = 0$, $\sum_j b_{jm}^{(q)} = 0$ and $\sum_k c_{km}^{(q)} = 0$
for all $m$ (see Appendix for proof). These parameter vectors are also known as
loadings.
The proposed model class can be interpreted as statistical multiple regression
models, where a Poisson distributed count is regressed on three categorical (fac-
torial) independent variables. The dimension of the parameter space is the number of
parameters in a specific model, $I + J + K + h_1(I + J) + h_2(J + K) + h_3(I + K) +
h_4(I + J + K)$. In the special case of $h_1 = h_2 = h_3 = h_4 = 0$ our specification
is linear and equals that of a Generalized Linear Model [12] with the logarithmic,
canonical link function for Poisson distributed data. In its general form, however,
our model is nonlinear in the parameters.
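The structure of Eqs. (1)–(2) can be sketched numerically. The snippet below is an illustrative reconstruction, not the authors' implementation: it assumes a common rank h for all four factorizations (the model allows separate h_q), and all array names are ours.

```python
# Build the trilinear predictor l of Eq. (2) and sample Poisson counts (Eq. 1).
# center() enforces the zero-mean constraint on the loading vectors.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, h = 4, 3, 2, 2            # mode sizes and a shared rank (assumption)

def center(M):
    return M - M.mean(axis=0)      # make each loading column zero-mean

a0, b0, c0 = rng.normal(size=I), rng.normal(size=J), rng.normal(size=K)
A1, B1 = center(rng.normal(size=(I, h))), center(rng.normal(size=(J, h)))
B2, C2 = center(rng.normal(size=(J, h))), center(rng.normal(size=(K, h)))
A3, C3 = center(rng.normal(size=(I, h))), center(rng.normal(size=(K, h)))
A4, B4, C4 = (center(rng.normal(size=(n, h))) for n in (I, J, K))

l = (a0[:, None, None] + b0[None, :, None] + c0[None, None, :]   # biases
     + np.einsum('im,jm->ij', A1, B1)[:, :, None]                # a-b bilinear
     + np.einsum('jm,km->jk', B2, C2)[None, :, :]                # b-c bilinear
     + np.einsum('im,km->ik', A3, C3)[:, None, :]                # c-a bilinear
     + np.einsum('im,jm,km->ijk', A4, B4, C4))                   # trilinear (PARAFAC)

counts = rng.poisson(np.exp(l))    # x_ijk ~ Pois(exp(l_ijk))
print(counts.shape)                # (4, 3, 2)
```

The `einsum` calls make the correspondence to the sums in Eq. (2) explicit: each bilinear term is a rank-h matrix product broadcast over the third mode.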
2.1 Motivation: Exploratory Analysis
The reason we propose a combination of bilinear and trilinear terms, instead of
only the trilinear part, is the exploratory analysis of the results: we wish each
phenomenon in the data to be modeled with terms that are as simple as possible. Since
the trilinear part is often the most interesting but also the most difficult one to
analyze, we hope to clarify it by separating out the more trivial phenomena.
It is easy to see that the trilinear term could emulate the other terms by using
constant loadings 1 for parameter vectors a, b or c. However, since we introduce
the zero-mean constraint, we force the simpler terms to be used, too.
In the GICA context, the interpretation of the terms in Equation (2) is as
follows. I is the number of people (or subjects), J is the number of terms (or
objects) and K is the number of contexts. Biases describe how much text we have
from each subject, and how common each term and each context is. The first
bilinear term models how people prefer using some objects (or terms). This part
is comparable to collaborative filtering. The second bilinear term is about how
terms are used in contexts (or documents). This part is comparable to latent
Dirichlet allocation. The third bilinear term models how common particular
contexts are for different people, again comparable to collaborative filtering.
The trilinear term can model the subjectivity of context to the use of terms.

2.2 Parameter Estimation


The parameter vectors $a$, $b$, and $c$ are learned by fitting the model to i.i.d. data.
The log-likelihood of the parameters is

$$\sum_{i,j,k} \ln P(x_{ijk} \mid l_{ijk}) = \sum_{i,j,k} \ln \mathrm{Pois}(x_{ijk} \mid \exp(l_{ijk})) \qquad (3)$$

$$= \sum_{i,j,k} \ln \frac{\exp(l_{ijk})^{x_{ijk}} \exp(-\exp(l_{ijk}))}{x_{ijk}!} \qquad (4)$$

$$= \sum_{i,j,k} \left[ x_{ijk} l_{ijk} - \exp(l_{ijk}) - \ln(x_{ijk}!) \right] \qquad (5)$$
and its partial derivative w.r.t. $l_{ijk}$ is

$$\frac{\partial \ln P}{\partial l_{ijk}} = x_{ijk} - \exp(l_{ijk}). \qquad (6)$$

The gradient for fitting the parameters is further derived using the chain rule.
Finding maximum likelihood estimates is subject to the zero-mean constraints
of the parameter vectors.
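Eqs. (5)–(6) translate directly into code. The sketch below is an illustration, not the authors' implementation; it uses the log-gamma function for the ln(x!) term, a standard identity.

```python
# Poisson log-likelihood (Eq. 5) and its elementwise gradient w.r.t. l (Eq. 6).
import numpy as np
from scipy.special import gammaln   # gammaln(x + 1) == ln(x!)

def log_likelihood(x, l):
    # sum over i,j,k of [x*l - exp(l) - ln(x!)]
    return np.sum(x * l - np.exp(l) - gammaln(x + 1))

def grad_wrt_l(x, l):
    # d lnP / d l_ijk = x_ijk - exp(l_ijk)
    return x - np.exp(l)

x = np.array([[2.0, 0.0], [1.0, 3.0]])
l = np.zeros_like(x)                # Poisson mean exp(0) = 1 everywhere
print(grad_wrt_l(x, l))             # x - 1
```

The gradient with respect to the individual loadings then follows by the chain rule, as noted in the text.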

2.3 Model Selection

The proposed algorithm for model selection is as follows. First, we set the hyper-
parameters $h_1 = h_2 = h_3 = h_4 = 0$ to estimate the biases $a_i^{(0)}$, $b_j^{(0)}$, $c_k^{(0)}$. Then
model complexity is increased by incrementing the hyperparameters one at a
time, thus introducing new components into the model. The new parameters
are fitted while keeping the old ones fixed.
To avoid overfitting, proper hyperparameter values are determined by cross-
validation [14], i.e., by splitting the tensor elements randomly into a number
of equal-sized partitions and then, in turn, holding out each partition from the
parameter estimation as a validation set. We stop increasing a hyperparameter
whenever the probability of the validation set, that is, its evidence for the model,
stops increasing significantly. In cross-validation we compare the distribution of
changes in the model evidence of the validation sets before and after
adding new parameters. We apply a non-parametric test (Wilcoxon signed-rank)
to compare the significance level of the increase to a critical value.
After determining the hyperparameters, thus fixing the model complexity, the
model parameters are estimated without holding out any data, and at the end,
the whole model is fine-tuned by estimating all the parameters simultaneously.
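The forward selection loop above can be sketched as follows. This is a highly simplified illustration: `fit_and_score` is a placeholder that should fit the model with hyperparameters `h` and return one held-out log-likelihood per cross-validation fold, and the paper's detail of fitting new components while keeping old parameters fixed is omitted.

```python
# Forward selection over hyperparameters guided by a Wilcoxon signed-rank test
# on per-fold validation gains; a component is kept only if the gain is
# significantly positive at the chosen critical value.
from scipy.stats import wilcoxon

def select_hyperparameters(fit_and_score, n_hypers=4, alpha=0.15):
    h = [0] * n_hypers
    improving = True
    while improving:
        improving = False
        for q in range(n_hypers):
            old = fit_and_score(h)
            h[q] += 1                        # tentatively add one component
            new = fit_and_score(h)
            gains = [n_ - o_ for n_, o_ in zip(new, old)]
            _, p = wilcoxon(gains, alternative='greater')
            if p < alpha:                    # significant improvement: keep it
                improving = True
            else:
                h[q] -= 1                    # otherwise revert

    return h

# Toy score function: held-out likelihood improves until sum(h) reaches 2,
# then degrades slightly, so selection should stop at two components.
def toy_score(h):
    s = sum(h)
    base = s if s <= 2 else 2 - 0.01 * (s - 2)
    return [base + 0.001 * f for f in range(8)]

print(select_hyperparameters(toy_score))     # [1, 1, 0, 0]
```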

3 Simulation Experiment

In order to assess our contribution, comprising a statistical model class to-
gether with the parameter estimation, model selection and data analysis procedures
proposed in Sect. 2, we apply them to synthetic data generated using 100 ran-
dom models in the proposed model class. For this purpose we first draw each hy-
perparameter for a model uniformly from {0, 1, 2}, and then draw values for the
respective parameter vectors uniformly from [−1, 1] and remove their mean. The
drawn models are summarized in Table 1.A. Finally, we sample data tensors of size
40 × 25 × 15 (I × J × K) from these models.
In the simulation we identified each of the models independently, applying
the proposed model selection procedure using 10 folds in cross-validation and a
critical value of 0.15 for entering new components. In the models that generated
the 100 data tensors, we had in total 400 hyperparameter values to identify.
The method failed in 7 of the hyperparameter values for the 3-way components
(h4 ) and in 2 for the 2-way components (h2 and h3 ). In all but one case out of
Table 1. Summary of the experiment for identification of random models. Table A


(on the left) displays the count for each hyperparameter value k in the sample of 100
models. Table B (on the right) displays the accuracy of the method in the identification
test in terms of the error rate for each hyperparameter and value, and in total over the
types of factorizations and complexities.

k h1 h2 h3 h4 tot. k h1 h2 h3 h4 tot.
0 35 33 35 36 139 0 0 0 0 0 0
1 31 33 32 39 135 1 0 0 0.03 0.10 0.04
2 34 34 33 25 126 2 0 0.03 0 0.12 0.03
tot. 100 100 100 100 400 tot. 0 0.01 0.01 0.07 0.02

the 9, the error was that one true generating component was excluded from the
identified model. Once, for h2 , one extra component was included. In total 91
out of the 100 generating models were identified correctly. Table 1.B summarizes
the results in identification accuracy.
Overall, according to this experiment, the model selection procedure is
feasible. It seems that model estimation works surprisingly well despite the lack
of guarantees for finding the global optimum. Failures in the identification may
be due to suboptimal parameter values or to the sampling of cross-validation
data. Consequently, the improvement in the model evidence from the introduction
of some components has not been considered significant in our conservative
model selection procedure. It is interesting to note that off-by-one errors in the
identification do not seem to induce further errors in subsequently identified
components and hence the estimation procedure can be considered robust in
this respect.

4 Discussion

Our method estimates the trilinear predictor tensor as a sum of a finite number of
constrained rank-one tensors, i.e., as a constrained CANDECOMP [1]/PARAFAC [4]
trilinear decomposition. Once this representation has been found, it follows from
the properties of the decomposition that the trilinear components are unique up
to permutation and scaling of the parameter vectors under certain sufficient
conditions [11] that hold in most real-world data analyses.
It is known that in general the approximation of a tensor by the trilinear
decomposition is an ill-posed problem [15] that does not have a bounded solu-
tion for some degenerate tensors. Also, the greedy approach we apply (but do not
depend on) for fitting the decomposition incrementally does not result in the best
fit, in the sense that the optimal parameters of a less complex model are not
generally optimal in a more complex model [9]. Both of these results are de-
rived for approximations based on the Frobenius norm. We are not aware of
results that are valid for our probabilistic metric. Additionally, there are results
(e.g. [10]) showing that in real-world applications the trilinear decomposition, fitted in the
greedy manner, gives performance comparable to fitting all of the parameters
simultaneously, at a lower computational cost.
Our model is a probabilistic generative model as opposed to traditional ten-
sor factorization models. One benefit from this is the well-founded handling of
missing values. We used missing values for the model selection by holding out
validation elements in the tensor, but in general, the original data might contain
missing elements, too. As the proportion of the missing values increases, mod-
elling the posterior uncertainty of the parameters starts to become important.
Our approach resembles the basic 2-way model in [7] and could be extended to
the more advanced treatment such as variational Bayes.
A tensor of size I × J × K can be factorized in many different ways, see [8] for
a review on tensor factorization. We build upon the CANDECOMP/PARAFAC
model in the multilinear predictor, that is, using factors I × h and J × h and
K × h, except that we include the simpler factorizations for explaining the other
phenomena. Another difference is of course that we use it hierarchically as a
parameter for the Poisson distribution. Tucker decomposition [5] is the oldest
tensor factorization method, which uses h1 × h2 × h3 and I × h1 and J × h2 and
K × h3 . It can be used also in computing nonnegative tensor factorizations [2].
Recently, an algorithm for solving a set of factorization problems with possibly
coupled factors was given in [16].

References
1. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional
scaling via an n-way generalization of “Eckart-Young” decomposition. Psychome-
trika 35(3), 283–319 (1970), http://dx.doi.org/10.1007/BF02310791
2. Friedlander, M.P., Hatz, K.: Computing nonnegative tensor factorizations. Com-
putational Optimization and Applications 23(4), 631–647 (2008)
3. Gärdenfors, P.: Conceptual Spaces. MIT Press (2000)
4. Harshman, R.: Foundations of the PARAFAC procedure: Model and conditions
for an ’explanatory’ multi-mode factor analysis. In: UCLA Working Papers in
phonetics (16) (1970)
5. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products.
Journal of Mathematics and Physics 6, 164–189 (1927)
6. Honkela, T., Raitio, J., Nieminen, I., Lagus, K., Honkela, N., Pantzar, M.: Using
GICA method to quantify epistemological subjectivity. In: Proceedings of IJCNN
2012, International Joint Conference on Neural Networks (2012)
7. Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the
presence of missing values. Journal of Machine Learning Research (JMLR) 11,
1957–2000 (2010)
8. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Re-
view 51(3), 455–500 (2009)
9. Kolda, T.G.: Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis
and Applications 23(1), 243–255 (2001)
10. Kolda, T.G., Bader, B.W., Kenny, J.P.: Higher-order web link analysis using mul-
tilinear algebra. In: ICDM 2005: Proceedings of the 5th IEEE International Con-
ference on Data Mining, pp. 242–249 (November 2005)
11. Kruskal, J.B.: Three-way arrays: Rank and uniqueness of trilinear decompositions,
with application to arithmetic complexity and statistics. Linear Algebra and its
Applications 18, 95–138 (1977)
12. McCullagh, P., Nelder, J.A.: Generalized linear models, 2nd edn. Chapman & Hall,
London (1989)
13. Mørup, M.: Applications of tensor (multiway array) factorizations and decomposi-
tions in data mining (2011),
http://onlinelibrary.wiley.com/doi/10.1002/widm.1/full
14. Picard, R.R., Cook, R.D.: Cross-validation of regression models. Journal of the
American Statistical Association 79(387), 575–583 (1984),
http://www.jstor.org/stable/2288403
15. de Silva, V., Lim, L.H.: Tensor rank and the ill-posedness of the best low-rank
approximation problem. SIAM J. Matrix Analysis Applications 30(3), 1084–1127
(2008),
http://dblp.uni-trier.de/db/journals/siammax/siammax30.html#SilvaL08
16. Yilmaz, Y.K., Cemgil, A.T., Simsekli, U.: Generalised coupled tensor factorisation.
In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q.
(eds.) NIPS, pp. 2151–2159 (2011),
http://dblp.uni-trier.de/db/conf/nips/nips2011.html#YilmazCS11

Appendix: Removing Mean of Parameter Vectors


Without loss of generality, we can assume that $\sum_i a_{im}^{(q)} = 0$, $\sum_j b_{jm}^{(q)} = 0$ and
$\sum_k c_{km}^{(q)} = 0$ for all $m$, $q = 1, 2, 3$ in Equation (2). This is because any non-zero
mean in the parameter vectors could be moved to a simpler term. For instance, the
mean $\mu_m^{(a1)}$ of $a_{im}^{(1)}$ could be moved to $b_j^{(0)}$ by noting that for all $i$ and $k$

$$b_j^{(0)} + a_{im}^{(1)} b_{jm}^{(2)} = \left[b_j^{(0)} + \mu_m^{(a1)} b_{jm}^{(2)}\right] + \left[a_{im}^{(1)} - \mu_m^{(a1)}\right] b_{jm}^{(2)}. \qquad (7)$$

For removing the mean $\mu_m^{(a3)}$ of $a_{im}^{(3)}$, we can increase $h_2$ by 1 and set the new
part to

$$b_{jh_2}^{(1)} = \mu_m^{(a3)} b_{jm}^{(3)}, \qquad (8)$$

$$c_{kh_2}^{(2)} = \mu_m^{(a3)} c_{km}^{(3)}. \qquad (9)$$
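The identity in Eq. (7) can be checked numerically for all i and j at once. This is an illustrative verification with random loadings; the variable names are ours.

```python
# Numeric check of Eq. (7): shifting the per-component mean of the loading
# matrix a(1) into the bias b(0) leaves the bias-plus-bilinear term unchanged.
import numpy as np

rng = np.random.default_rng(1)
I, J, h = 5, 4, 3
a1 = rng.normal(size=(I, h))       # loadings a(1), not yet zero-mean
b2 = rng.normal(size=(J, h))       # loadings b(2)
b0 = rng.normal(size=J)            # bias b(0)

mu = a1.mean(axis=0)               # per-component means of a(1)
lhs = b0[None, :] + a1 @ b2.T      # original bias + bilinear term, all i, j
rhs = (b0 + b2 @ mu)[None, :] + (a1 - mu) @ b2.T
print(np.allclose(lhs, rhs))       # True
```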
Estimating Quantities:
Comparing Simple Heuristics
and Machine Learning Algorithms

Jan K. Woike1,2 , Ulrich Hoffrage1 , and Ralph Hertwig2


1 Faculty of Business and Economics, University of Lausanne, Switzerland
2 Faculty of Psychology, University of Basel, Switzerland

Abstract. Estimating quantities is an important everyday task. We an-


alyzed the performance of various estimation strategies in ninety-nine
real-world environments drawn from various domains. In an extensive
simulation study, we compared two classes of strategies: one included
machine learning algorithms such as general regression neural networks
and classification and regression trees; the other included two psychologically
plausible and computationally much simpler heuristics (QEst and Zig-QEst).
We report the strategies’ ability to generalize from training sets to new
data and explore the ecological rationality of their use; that is, how
well they perform as a function of the statistical structure of the envi-
ronment. While the machine learning algorithms outperform the heuris-
tics when fitting data, Zig-QEst is competitive when making predictions
out-of-sample.

Keywords: estimation, simple heuristics, general regression neural net-


works, QuickEst, ecological rationality.

1 Introduction

Being able to accurately estimate quantities can be of utmost importance – for


instance, for a company that needs to estimate the demand for a new product,
or for a government that wants to quantify the effects of legislative changes. An
abundance of estimation algorithms have been proposed, some as prescriptive
strategies that have been developed to minimize prediction errors, and some as
descriptive strategies that have been developed as models of human estimates.
Estimation strategies differ in computational complexity. On the one hand,
there are sub-symbolic strategies such as artificial neural nets that require heavy
computation. On the other hand, there are much simpler symbolic strategies that
could eventually even be executed solely on paper, such as estimation heuristics
that have been proposed in the context of the simple heuristics program [1,2].
This program posits that human rationality is bounded, but that cognitive lim-
itations do not necessarily have to be a disadvantage. To the extent that simple
heuristics are able to exploit the structure of information in the environment in
which humans have to function, they can still reach a high level of performance.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 483–490, 2012.

© Springer-Verlag Berlin Heidelberg 2012
The study of this match between the architecture of decision strategies and en-
vironmental structures is central to the study of ecological rationality [3]. In this
contribution, two heuristics that have been developed within the simple heuris-
tics framework are pitted against a selection of machine learning algorithms to
test for comparative strengths and weaknesses across a range of empirical (data)
environments with binary predictor variables (henceforth: cues) and a continuous
criterion.

2 Algorithms in the Competition


2.1 Simple Estimation Heuristics
The two heuristics considered here are based on the QuickEst-heuristic, which
exploits the fact that in most environments the distribution of criterion values
follows a power law [4,5]. The first variant (QEst) requires the following
preparatory steps:

1. For each of the $k$ binary cues $b_i$ with possible values 0 and 1, calculate $s_{1,i}$
and $s_{0,i}$ as the average criterion value for all cases $o$ in the learning set ($L$)
with $b_i(o) = 1$ and $b_i(o) = 0$, respectively.
2. Let $s_i^+ = \max(s_{1,i}, s_{0,i})$ and $s_i^- = \min(s_{1,i}, s_{0,i})$. Recode the cue values such
that $s_i^+ = s_{1,i}$.
3. Order the $k$ cues in ascending order of $s_i^-$, so that for $(c_1, \ldots, c_k)$:
$s_1^- \leq s_2^- \leq \ldots \leq s_k^-$.

QEst can be represented as a minimal binary tree (see Fig. 1, left side): the tree
has $k$ levels following the root node and one exit node on each level, except for
the last level, which has two exit nodes. The cues are assigned, in ascending order
of $s_i^-$, to the $k$ decision nodes beginning with the root node. To create an estimate
for a case $o$, cues are looked up in the order determined above, and once a negative
cue value $c_j$ is encountered, an exit node is reached and $s_j^-$ is returned as the
estimate. Only if no negative cue value is found, the mean of all cases in $L$ that
do not have a single negative cue value is predicted.
The only difference between QEst and the original QuickEst is that QuickEst
rounds the estimates to the next spontaneous number [4], as it has been designed
to model human inferences (and humans tend to generate “round” estimates).
Because in an environment in which the criterion distribution follows a power
law there are many cases with small criterion values (and these cases will tend
to have negative cue values), QEst reduces information search by design and will
likely be able to return estimates after inspecting only very few cues. Note that
neither of the two variants has free parameters.
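The preparatory steps and the exit structure above can be sketched in a few lines. This is our reading of the description, not the authors' code (which, as noted later, was written in Borland Delphi); the handling of the fallback case when every recoded cue value is positive is taken from the text.

```python
# Minimal sketch of QEst for binary cues: compute s1/s0 per cue, recode so
# that cue value 1 corresponds to the larger mean, order cues by ascending
# s-, and estimate by exiting at the first negative (recoded) cue value.
import numpy as np

def fit_qest(X, y):
    """X: n-by-k 0/1 cue matrix; y: criterion values of the learning set L."""
    exits = []
    for i in range(X.shape[1]):
        s1, s0 = y[X[:, i] == 1].mean(), y[X[:, i] == 0].mean()
        flip = s0 > s1                    # recode so that 1 means "large"
        exits.append((min(s1, s0), i, flip))
    exits.sort()                          # ascending s-
    # fallback: mean of all cases without a single negative recoded cue value
    mask = np.ones(len(y), dtype=bool)
    for _, i, flip in exits:
        vals = 1 - X[:, i] if flip else X[:, i]
        mask &= (vals == 1)
    fallback = y[mask].mean() if mask.any() else y.mean()
    return exits, fallback

def predict_qest(model, x):
    exits, fallback = model
    for s_minus, i, flip in exits:
        v = 1 - x[i] if flip else x[i]
        if v == 0:                        # first negative value: exit with s-
            return s_minus
    return fallback
```

On a toy learning set with two cues, the first exit returns the smallest s-, which is how the heuristic sorts out the many small-criterion cases early.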
The second variant that we introduce and test in this paper, the Zig-QEst-
heuristic (ZQ, see Fig. 1, right side), differs from QuickEst on yet another
dimension: it sorts out the extreme cases on both sides of the distribution first.
Estimating Quantities 485

[Figure 1: two binary trees with decision nodes c_1(t), c_2(t), ..., c_k(t); left: QEst, with exit nodes holding the s^- values; right: Zig-QEst, with exit nodes alternating between s^- and s^+ values]

Fig. 1. Structure of the QEst-heuristic (left) and the Zig-QEst-heuristic (right)

Cues are put into sequence by choosing the cue with the minimum s^- and the
maximum s^+, alternatingly (Fig. 1, right side, shows one of the two possible structures).
An exit node associated with an s^- is reached when the corresponding cue has
a negative value, and an exit node associated with an s^+ when the corresponding
cue has a positive value. The first exit node is placed based on the maximum
absolute deviation from the mean, max(|s_i^+ − ȳ|, |ȳ − s_i^-|), across all cues. As
a benchmark, the prediction of the mean observed in L was added as a third
heuristic. The heuristics were implemented in Borland Delphi 6.0.
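The alternating construction can be sketched as follows. This Python sketch reflects our own reading of the description; the starting side, tie-breaking, and the fall-through default (grand mean) are simplifying assumptions, not details from the paper:

```python
from statistics import mean

def fit_zigq(X, y):
    """Zig-QEst sketch: alternate between the cue with the smallest s^- and
    the cue with the largest s^+, starting on the side with the larger
    absolute deviation from the grand mean."""
    ybar = mean(y)
    stats = {}
    for i in range(len(X[0])):
        s1 = mean([yv for x, yv in zip(X, y) if x[i] == 1] or [ybar])
        s0 = mean([yv for x, yv in zip(X, y) if x[i] == 0] or [ybar])
        stats[i] = (min(s1, s0), max(s1, s0), s1 < s0)   # (s^-, s^+, flip)
    remaining = set(stats)
    # place the first exit node on the side with the larger |deviation from ybar|
    minus_side = max(abs(ybar - stats[i][0]) for i in stats) >= \
                 max(abs(stats[i][1] - ybar) for i in stats)
    nodes = []
    while remaining:
        if minus_side:
            i = min(remaining, key=lambda j: stats[j][0])
            nodes.append((i, 0, stats[i][0]))   # exit on a negative value -> s^-
        else:
            i = max(remaining, key=lambda j: stats[j][1])
            nodes.append((i, 1, stats[i][1]))   # exit on a positive value -> s^+
        remaining.remove(i)
        minus_side = not minus_side
    return nodes, stats, ybar

def predict_zigq(model, x):
    nodes, stats, ybar = model
    for i, exit_on, est in nodes:
        v = 1 - x[i] if stats[i][2] else x[i]   # recode so 1 is the 'high' value
        if v == exit_on:
            return est
    return ybar                                 # simplified fall-through default
```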

2.2 Machine Learning Algorithms


The following strategies were chosen from the wide spectrum of algorithms and
heuristics that have been proposed in statistics and machine learning as solu-
tions to estimation tasks that involve generalization to unknown quantities. The
general regression neural network (GRN) [6] earned the reputation of being able
to perform well even for small learning sets (in contrast to, for example, back-
propagation networks). This feature is particularly relevant for the present study
that uses real-world datasets, which typically do not contain huge numbers of
cases and cues. GRN’s topology consists of a layer of input units that distribute
information to a second radial basis layer, and a final linear layer connected to
output units. The radial basis layer consists of pattern nodes that represent the
exemplars found in L. Their activation is nonlinearly related to the distance of
new cases to these exemplars, and a new prediction is essentially constructed as
a blending of values known for exemplars with similar input features. In addi-
tion, GRN only has one free parameter, the smoothness parameter σ, which is
set to 1.0 in the simulation.
As a second representative of data-mining algorithms, we picked the method
of Classification and Regression Trees (CRT) [7]. CRT has been successfully ap-
plied to many estimation tasks in pattern classification and estimation. Unlike
the first algorithm, CRT does not belong to the class of exemplar-based models.
486 J.K. Woike, U. Hoffrage, and R. Hertwig

Any instance of a CRT model can be easily translated into rule lists, so that
the application of CRT can be considered as rule-based decision making. Trees
are constructed by a partitioning algorithm that recursively separates the set of
cases in L into subsets based on single cue values while maximizing their distinc-
tiveness: A common splitting criterion for estimation tasks is the minimization
of the predicted squared error, when the mean of cases in the subsets is used as
an estimate. Subsets that fall below a minimum size are not split any further.
The implemented CRT algorithm has one free parameter: the minimum size for
parent nodes that are considered for splits, which is set to 5 in the simulation.
We also included OLS multiple regression (LR) in this category, as it can be
considered a standard workhorse in statistical analysis that comes with a proven
track record of usefulness in estimation tasks [8]. Finally, as the cues in the task
are binary, we added estimation trees (EstT) to the set. EstT make estimations
based on a precise match of new cases with known cases: For any profile of cue
values that has been observed before, the mean of criterion values for known
cases with the exact cue profile is used as an estimate, otherwise the mean of all
known cases is predicted. These algorithms were all implemented in Matlab 7.9.
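The profile-matching logic of EstT amounts to a lookup table of per-profile means. The following is an illustrative Python version (the Matlab original is not reproduced here):

```python
from statistics import mean
from collections import defaultdict

def fit_estt(X, y):
    """Estimation tree: store the mean criterion value per exact cue profile,
    plus the grand mean as a fallback."""
    groups = defaultdict(list)
    for x, yv in zip(X, y):
        groups[tuple(x)].append(yv)
    return {p: mean(vs) for p, vs in groups.items()}, mean(y)

def predict_estt(model, x):
    table, grand_mean = model
    # unseen cue profile -> fall back to the mean of all known cases
    return table.get(tuple(x), grand_mean)
```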

3 Simulation Setup

3.1 Environments

The performance of the estimation heuristics and strategies mentioned above


has been tested in ninety-nine datasets from various data repositories, on-line
data collections, and textbook materials [1,9,10,11,12,13,14,15,16,17,18,19,20].
The datasets were chosen from a diverse range of fields, such as economics, com-
puter science, sports, medicine, social sciences, engineering, and biology. The
selection of datasets was completed prior to the first simulation run so that it
could not have been influenced by the desirability of results. Continuous predic-
tor variables were binarized by using the mean as threshold. On average, there
are 292.2 cases (Range: 11–3450) and 7.4 cues (Range: 2–22) in each dataset.

3.2 Accuracy Criterion

Because the criteria of the ninety-nine environments differed dramatically with


respect to their scales, we had to standardize the performance of the algorithms
before we could aggregate across environments. Specifically, the performance (A)
of a given strategy s in a given data environment d was standardized as

A(s) = 1 − (E(s) − E_t) / (E_m − E_t),   (1)

where E(s) represents the MSE of the algorithm's predictions:

E(s) = (1/m) Σ_{i=1}^{m} (ŷ(o_i, s) − y(o_i))².   (2)

E_t is the fitting MSE performance of the estimation tree in the full dataset,
that is, with a learning set that consisted of all cases. If U(x) is the subset of
cases in L with cue values identical to those of case x, then the prediction of the
estimation tree is the mean of criterion values in this subset.
E_t is the lower bound for E(s) if deterministic predictions have to be made for
the full dataset based on the cue values, as the mean minimizes the MSE for all cue
equivalence classes. As the upper bound for the prediction error, in contrast, we
take the variance of criterion values in the full dataset (E_m = σ²(y)), because this
variance is equivalent to the minimal MSE of a model that completely disregards
cue information. The criterion A(s) is a linear transformation of E(s) that maps
any MSE between these two extremes to the interval [0, 1], such that lower MSEs
correspond to higher values of A(s). There may be subsets of the full dataset
for which A(s) > 1, and ill-fitted models can generate values below 0.
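The standardization of Eqs. (1) and (2) amounts to a few lines; a sketch (the function name is ours):

```python
def accuracy_score(preds, targets, e_t, e_m):
    """A(s) = 1 - (E(s) - E_t) / (E_m - E_t), where E(s) is the MSE of the
    strategy's predictions and E_t, E_m are the environment-specific
    lower and upper bounds on the prediction error."""
    m = len(targets)
    e_s = sum((p - t) ** 2 for p, t in zip(preds, targets)) / m
    return 1.0 - (e_s - e_t) / (e_m - e_t)
```

A strategy that always predicts the grand mean obtains E(s) = E_m and hence A(s) = 0, while a strategy that reaches E(s) = E_t obtains A(s) = 1.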

3.3 Learning Conditions

Fitting and prediction accuracy were measured for each strategy in every envi-
ronment under two conditions: Algorithms were trained and fitted to a learning
set of either 50% or 75% of randomly chosen cases from the full dataset and
predicted the criterion values in the hold-out set that consisted of the remaining
50% or 25% of the cases in the datasets. All seven algorithms were tested on the
same randomly generated subsets, and there were 1000 trials for each combination
of dataset, algorithm, and learning set size (for a total of 1,386,000 fitting
and 1,386,000 prediction accuracy results).

4 Simulation Results

4.1 Fitting and Prediction

The results, averaged across the ninety-nine datasets, are presented in Fig. 2, on
the left side for fitting, and on the right side for prediction. When fitting known
data, the winners of the competition are the machine learning strategies: The
best performing strategy is EstT, which is optimally suited to fit data with binary
cues (the average accuracy is larger than 1, as it is easier for estimation trees
to fit smaller samples). Not far behind in the race are CRT and GRN, followed
by LR. In contrast, the two simple heuristics perform much worse, about half
way between the machine learning strategies and the mean model as the lower
benchmark. As expected, fitting results for the 50% condition are slightly better
than for the 75% condition, as it is easier to fit a smaller number of cases with
the same number of parameters.
[Figure 2: bar charts of the average accuracy A (y-axis, 0 to 1.1) for Mean, QEst, ZigQ, LR, CRT, GRN, and EstT, with bars for the 50% and 75% learning-set conditions; left panel: fitting, right panel: prediction]

Fig. 2. Average Accuracy-Score (A) for the fitting and prediction of strategies across
the ninety-nine data sets: the bar colors represent the learning condition (the size of
the learning sets)

For the prediction task the results are markedly different. First, and as ex-
pected, each algorithm performs worse than in fitting. Second, and as expected,
results for the 75% condition are better, because larger learning sets yield bet-
ter parameter estimates. Third, and more interestingly, the estimation tree and
CRT are no longer competitive. In the 50% condition, the best performance is
reached by GRN and ZQ, and in the 75% condition the winner is LR. Both the
complex GRN and the simple ZQ-heuristic perform on similar levels, while QEst
fails to make good predictions. The difference in accuracy loss between fitting
and prediction clearly signals that the complex models over-fit the data. Finally,
the difference in performance of LR between the 50% and 75% condition under-
scores the importance of large sample sizes for generalizable parameter estimates
in LR-models.

4.2 Ecological Rationality


The number of datasets used in this study allows for a more detailed analysis
of performance differences between strategies across environments, which can be
valuable from both a descriptive and prescriptive point of view. In this paper,
we will restrict ourselves to report results of a comparison between the most
complex algorithm and one of the simple heuristics: GRN and ZQ. Figure 3
shows the difference in predictive accuracy in the 50% condition across datasets,
ordered by difference.
Although both algorithms exhibit a similar average prediction accuracy, their
relative performance varies across environments. The differences between envi-
ronments that favor one of the two are not obvious, but some correlation results
may shed some light on their nature: the difference in favor of GRN is correlated
positively with the number of cases in the datasets (r=.32, p=.001) and nega-
tively with the number of cues (r=-.24, p=.016). The performance of both strate-
gies is positively correlated with the number of cases and negatively correlated
with the number of cues. This pattern suggests that GRN can reap larger bene-
fits than ZQ from bigger learning sets and also from a reduction in the number
of variables.

[Figure 3: per-environment predictive accuracy A(GRN) and A(ZQ), with values ranging from about 1.0 down to −2.5 across the environments]

Fig. 3. Predictive Accuracy for GRN and ZQ for each of the ninety-nine data sets
ordered by decreasing Δ(A) = A(GRN) − A(ZQ). The length of the vertical lines
corresponds to |Δ(A)|

Further, while the average point-biserial correlation between cues
and criterion in the full dataset is positively correlated with the performance of
both algorithms, it shows a positive correlation with the difference between the
algorithms (r=.292, p=.003).

5 Discussion
The results of the horse-race simulation clearly demonstrate that predictive ac-
curacy is not necessarily linked to the algorithmic complexity of the strategies. In
fact, ZQ, a simple non-compensatory heuristic, which is a plausible candidate for
modeling estimation by boundedly rational humans, compared favorably with
the machine learning algorithms when the performance across ninety-nine real-
world datasets was assessed in cross-validation. The ecological analysis further
suggests that the heuristics are less prone to over-fitting, as they suffer less from
a decrease in sample size and can cope better with a large number of variables
than the machine learning algorithms. The ZQ heuristic drastically outperformed
the QEst heuristic and should be more vigorously studied in future research.
These results are in line with previous simulation results for non-compensatory,
lexicographic heuristics in pair-comparison [21] and classification [22] tasks. This
study also supports the claim that the study of strategies (here, strategies for
estimation) cannot be separated from the study of environments in which these
strategies are applied [23].

References
1. Gigerenzer, G., Todd, P.M., ABC Research Group: Simple heuristics that make us
smart. Oxford UP, New York (1999)
2. Gigerenzer, G., Selten, R. (eds.): The adaptive toolbox. MIT Press, Cambridge
(2001)
3. Todd, P.M., Gigerenzer, G., ABC Research Group: Ecological rationality:
Intelligence in the world. Oxford UP, New York (2012)
4. Hertwig, R., Hoffrage, U., Martignon, L.: Quick estimation: Letting the environ-
ment do some of the work. In: Gigerenzer, G., Todd, P.M., The ABC Research
Group (eds.) Simple Heuristics that Make Us Smart, pp. 209–234. Oxford UP,
New York (1999)
5. Hertwig, R., Hoffrage, U., Sparr, R.: The QuickEst heuristic: How to benefit from
an imbalanced world. In: Todd, P.M., Gigerenzer, G., The ABC Research Group
(eds.) Ecological Rationality: Intelligence in the World, pp. 379–406. Oxford UP,
New York (2012)
6. Specht, D.E.: A general regression neural network. IEEE Transactions on Neural
Networks 2(6), 568–576 (1991)
7. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regres-
sion trees. Wadsworth, Monterey (1984)
8. Dawes, R.M.: The robust beauty of improper linear models in decision making.
American Psychologist 34(7), 571–582 (1979)
9. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of
California, School of Information and Computer Science, Irvine (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
10. Statlib on-line data base, http://lib.stat.cmu.edu/datasets
11. DASL - Data and Story Library, http://lib.stat.cmu.edu/DASL/
12. OzDASL - Australasian Data and Story Library, http://www.statsci.org/data/
13. Journal of Statistics Education Data Archive,
http://www.amstat.org/publications/jse/jse_data_archive.html
14. Swivel, http://www.swivel.com/data_sets/
15. Social Explorer, http://www.socialexplorer.com/
16. Inter-University Consortium for Political and Social Research (ICPSR),
http://dx.doi.org/10.3886/ICPSR02650
17. National Institute for Occupational Safety and Health (NIOSH) Mining Division,
http://www.cdc.gov/niosh/mining/data/
18. UCLA Statistics Data Sets, http://www.stat.ucla.edu/data/
19. Weisberg, S.: Applied linear regression. John Wiley and Sons, New York (1985)
20. Hettich, S., Bay, S.D.: The UCI KDD Archive. University of California, Department
of Information and Computer Science, Irvine (1999), http://kdd.ics.uci.edu
21. Czerlinski, J., Gigerenzer, G., Goldstein, D.G.: How good are simple heuristics?
In: Gigerenzer, G., Todd, P.M., The ABC Research Group (eds.) Simple Heuristics
that Make Us Smart, pp. 97–118. Oxford UP, New York (1999)
22. Martignon, L., Katsikopoulos, K.V., Woike, J.K.: Categorization with limited re-
sources: A family of simple heuristics. Journal of Mathematical Psychology 52(6),
352–361 (2008)
23. Todd, P.M., Gigerenzer, G.: Environments that make us smart: Ecological ratio-
nality. Current Directions in Psychological Science 16(3), 167–171 (2007)
Rademacher Complexity and Structural Risk
Minimization: An Application to Human Gene
Expression Datasets

Luca Oneto, Davide Anguita, Alessandro Ghio, and Sandro Ridella

DITEN – University of Genova, Via Opera Pia 11A, Genova, I-16145, Italy
{Luca.Oneto,Davide.Anguita,Alessandro.Ghio,Sandro.Ridella}@unige.it

Abstract. In this paper, we target the problem of model selection for


Support Vector Classifiers through in–sample methods, which are partic-
ularly appealing in the small–sample regime, i.e. when few high–dimen-
sional patterns are available. In particular, we describe the application
of a trimmed hinge loss function to Rademacher Complexity and Max-
imal Discrepancy based in–sample approaches. We also show that the
selected classifiers outperform the ones obtained with other state-of-the-
art in-sample and out–of–sample model selection techniques in classifying
Human Gene Expression datasets.

Keywords: Support Vector Machine, Structural Risk Minimization,


Rademacher Complexity, Gene Expression Datasets.

1 Introduction

The process of building an optimal Support Vector Classifier (SVC) [14] consists
of two phases: (i) the first one addresses the identification of a set of parameters,
which are found by solving a Quadratic Programming problem; (ii) the second
phase aims at tuning a set of additional variables, namely the hyperparameters.
The last step is also known as the model selection phase and is usually linked
to the problem of estimating the generalization ability of the classifier since,
usually, the best hyperparameters are selected as to minimize this quantity.
Out–of–sample techniques exploit a validation set for tuning the SVC hy-
perparameters, which is obtained by removing some samples from the original
dataset and saving them for the model selection phase. An example of an out–of–
sample technique is Cross Validation [8], a common choice among
practitioners. Unfortunately, these methods turn out to be unreliable when ap-
plied to the small–sample setting [6], where only few high–dimensional samples
are available for training the classifier. In this framework, in–sample techniques
based on complexity measures, such as the Rademacher Complexity (RC) [4]
or the Maximal Discrepancy (MD) [5], have been shown to be a suitable alterna-
tive [3]. The main advantage of in–sample methods, with respect to out–of–sample
approaches, is the use of the whole set of available data for both training and
model selection purposes, thereby increasing the reliability of the final classifier.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 491–498, 2012.

c Springer-Verlag Berlin Heidelberg 2012
492 L. Oneto et al.

Computing the RC or the MD of an SVC is not a trivial task, because these


methods require the use of a bounded loss function, which is not the case for the
Support Vector Machine (SVM) hinge loss [14]. The obvious choice in classification
problems would be the 0/1 loss, often called the hard loss [5], because
it simply counts the number of errors. Unfortunately, as will be better detailed
in Section 3, the hard loss cannot be applied, in practice, to the conventional
SVM formulation, therefore, as an alternative, a soft loss has been successfully
suggested [3], which is bounded and Lipschitz continuous. The soft loss considers
as errors not only the misclassified samples, but also the patterns lying inside
the SVM margin and can be considered a piecewise linear approximation of the
logistic function, thus providing an approximate value of the classifier probability
of error [3]. However, when the main interest of the user is solely in the
number of misclassifications, the soft loss is not an optimal choice because it
is only a loose upper bound of this quantity [3]. Our objective, in this work, is
then to identify an upper–bound of the hard loss function, tighter than the soft
loss, which satisfies the condition of boundedness and can be applied to the SVC
in–sample model selection.
In particular, we propose the application of the trimmed hinge loss, originally
introduced in [12], to the MD and RC techniques. This loss is characterized by
some of the appealing properties of the soft loss, but is a tighter upper bound
of the number of misclassifications. In Section 4, we present some experimental
results on real–world small–sample Human Gene Expression (HGE) problems,
showing the improvements in the model selection performance when using the
trimmed hinge loss.

2 The Maximal Discrepancy and the Rademacher


Complexity of a Classifier

Let X = {(x_i, y_i)}, i = 1 . . . n, be an input dataset of i.i.d. patterns, originated
from an unknown distribution μ(x, y), where x_i ∈ R^d and y_i belongs to an
output set Y = {±1}. We are interested in estimating the generalization ability
of a classifier h(x) : R^d → Y_h ⊆ R, chosen in a class of functions H. Let us
consider a (bounded) loss ℓ(h(x), y) : Y_h × Y → [0, 1]; then the generalization
error of h(x) is defined as L(h) = E_μ ℓ(h(x), y). Obviously, this value cannot be
computed in practice, because the distribution originating the data is unknown,
and we must resort to its empirical estimate L̂_n(h) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i), y_i). In-
sample methods analyze the behaviour of the generalization bias sup_{h∈H} [L(h) −
L̂_n(h)] with respect to the entire class H, so that, irrespective of the particular
classifier chosen by the learning algorithm, which exploits the training data, the
generalization error can be safely estimated.
In particular, the following complexity term of the class H is computed:

Ĉ(H) = E_σ sup_{h∈H} (2/n) Σ_{i=1}^{n} σ_i ℓ(h(x_i), y_i),   (1)

where σ_i are independent, uniformly distributed, {−1, +1}–valued random
variables¹. The above quantity is known as the Rademacher Complexity of the
class H while, if the combinations are constrained to the (n choose n/2) cases where
Σ_{i=1}^{n} σ_i = 0, the Maximal Discrepancy is obtained instead.
i=1 σi = 0, the Maximal Discrepancy is obtained, instead.
In both cases, once a user–defined confidence δ has been fixed, the following
bound for L(h), which holds with probability (1 − δ), can be obtained [3,5]:

L(h) ≤ L̂_n(h) + Ĉ(H) + 3 √( log(2/δ) / (2n) ),   ∀h ∈ H.   (2)

As this bound is valid for any function belonging to the class H, it will also be
valid for any classifier chosen by the learning and model selection procedures, so
giving a powerful tool for selecting the best performing one.
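As a concrete illustration of Eq. (1), the expectation over σ can be approximated by Monte-Carlo sampling whenever the supremum over H is computable, e.g. for a small finite class given as precomputed loss vectors. This toy setup is ours, not the paper's:

```python
import random

def rademacher_complexity(losses_per_h, n_draws=2000, seed=0):
    """Monte-Carlo estimate of the Rademacher Complexity of Eq. (1):
    losses_per_h holds one vector of bounded losses l(h(x_i), y_i) per
    hypothesis h in a finite class H."""
    rng = random.Random(seed)
    n = len(losses_per_h[0])
    total = 0.0
    for _ in range(n_draws):
        # draw one realization of the Rademacher variables sigma_i in {-1, +1}
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max((2.0 / n) * sum(s, ) if False else
                     (2.0 / n) * sum(s * l for s, l in zip(sigma, losses))
                     for losses in losses_per_h)
    return total / n_draws
```

A class containing only a constant-zero loss vector has complexity 0, and enlarging the class can only increase the per-draw supremum, hence the estimate; this matches the intuition that richer classes pay a larger complexity penalty in the bound (2).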

3 MD and RC for the SVC Model Selection

We consider in this paper linear classifiers, since small–sample problems are


usually linearly separable as d ≫ n. Let H be the class of functions, with h(x) ∈
H, so that h(x) = w · x + b, where w ∈ Rd and b ∈ R. The parameters of the
SVM are found by solving the following primal problem for any fixed value of
the hyperparameter C

w 2 n
min +C ξ (h(xi ), yi ), (3)
w,b 2 i=1

where ξ (h(xi ), yi ) = (1 − yi h(xi ))+ is the hinge loss, and (·)+ = max(0, ·). This
is the well–known Tikhonov formulation of the SVM, but the class H is better
defined through the Ivanov formulation [1]:

1
n
2
min ξ (h(xi ), yi ), subject to: w ≤ ρ, (4)
w,b n i=1

which is equivalent to the previous one for some values of the hyperparameters
[1,14]. In problem (4), the hyperparameter is ρ and H is defined as the class of
2
functions for which w ≤ ρ and b ∈ R. In other words, the hyperparameter ρ
controls directly the size of the class: the larger is ρ, the larger is H.
According to the Structural Risk Minimization (SRM) principle [14], the
model selection phase of the SVM consists in selecting the best class of func-
tions H∗ , and by consequence the best classifier h∗ (x) ∈ H∗ , by solving problem
(4) or, equivalently, problem (3), exploring a sequence of classes of increasing
¹ They are also known as Rademacher variables. Note that, as there are 2^n possible
combinations of such variables, the expectation with respect to σ is usually computed, in
practice, through a Monte-Carlo procedure, though novel computing architectures
are opening new perspectives (e.g. quantum computing [4]).

complexity H1 ⊆ H2 ⊆ . . .. Then, by exploiting the RC and MD theory, the


model selection problem becomes equivalent to searching for the class H∗ and
the classifier h∗ ∈ H∗ , which minimize C(H) ˆ and L̂n (h), respectively. Note, in
fact, that the last term of Eq. (2) depends only on the size of the training set
and the user–defined confidence, but not on the classifier choice.
It is well-known that the hinge loss cannot be used for this purpose, because it
is not bounded and the RC or MD theory cannot be applied [3]. Unfortunately,
the hard loss, which is defined as ℓ_H(h(x), y) = (sign(−y h(x)))_+ and simply
counts the number of errors [5], is not an option either. In fact, when ℓ_H is used,
the number of errors on the training set does not depend on ||w||², thus the final
classifier h(x) = β(w^T x + b) can be defined only up to a constant β > 0, and
the regularization advantage of the SVM over a general linear classifier is lost.
As an alternative, the soft loss ℓ_S(h(x), y) = min[ℓ_ξ(h(x), y)/2, 1] has been
proposed [3], since it is Lipschitz continuous and bounded. The soft loss possesses
some nice symmetry properties, in particular ℓ_S(h(x), y) = 1 − ℓ_S(h(x), −y), so
that the computation of Ĉ(H) can be performed quite easily through a mini-
mization procedure [3]. Unfortunately, it is easy to note that

ℓ_S(h(x), y) ≤ ℓ_H(h(x), y) ≤ 2 ℓ_S(h(x), y) ≤ ℓ_ξ(h(x), y),   (5)

therefore the soft loss is only a loose upper-bound of the number of misclassified
samples.
Then, we propose to use the trimmed hinge loss [12] (see also Fig. 1a)

ℓ_T(h(x), y) = { ℓ_H(h(x), y)  if y h(x) < 0;   ℓ_ξ(h(x), y)  if y h(x) ≥ 0 },   (6)

which is bounded, Lipschitz continuous and represents a tighter upper–bound of
the hard loss:

ℓ_H(h(x), y) ≤ ℓ_T(h(x), y) ≤ 2 ℓ_S(h(x), y).   (7)

The main issue, now, becomes the application of this loss to the primal SVM
learning problem, because the symmetry property does not hold anymore.
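The four losses, written as functions of the margin y·h(x), can be checked numerically against the chain in Eq. (7); this is a sketch with names of our own choosing:

```python
def hard_loss(m):            # l_H: 1 iff the sample is misclassified
    return 1.0 if m < 0 else 0.0

def hinge_loss(m):           # l_xi = (1 - y h(x))_+
    return max(0.0, 1.0 - m)

def soft_loss(m):            # l_S = min(l_xi / 2, 1)
    return min(hinge_loss(m) / 2.0, 1.0)

def trimmed_hinge_loss(m):   # l_T (Eq. 6): hard loss below 0, hinge loss above
    return hard_loss(m) if m < 0 else hinge_loss(m)

# l_T is bounded by 1 and squeezed between l_H and 2 l_S (Eq. 7),
# while 2 l_S never exceeds the (unbounded) hinge loss:
for k in range(-50, 51):
    m = k / 10.0
    assert hard_loss(m) <= trimmed_hinge_loss(m) <= 2 * soft_loss(m) <= hinge_loss(m)
    assert trimmed_hinge_loss(m) <= 1.0
```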
Let us consider a particular realization of the Rademacher variables, then, by
inserting the trimmed hinge loss in problem (4), we can write:

arg max_{w,b} (2/n) Σ_{i=1}^{n} σ_i ℓ_T(h(x_i), y_i) = arg min_{w,b} −(2/n) Σ_{i=1}^{n} σ_i ℓ_T(h(x_i), y_i)   (8)

subject to ||w||² ≤ ρ. The previous problem can be reformulated as:

[Figure 1: plots of loss versus y h(x): (a) the regularized hard loss function ℓ_T; (b) the concave part of ℓ_T, for S−; (c) the convex part of ℓ_T, for S+]

Fig. 1. Splitting of the trimmed hinge loss ℓ_T in a convex and a concave part

PC = min_{w,b,ζ,ζ̄} [ Σ_{i∈S+} ζ_i − Σ_{i∈S−} ζ_i ] − [ Σ_{i∈S+} ζ̄_i − Σ_{i∈S−} ζ̄_i ]   (9)

(the first bracket is J_convex(w, b), the second J_concave(w, b)), subject to:

||w||² ≤ ρ   (10)
i ∈ S+:   y_i (w · x_i + b) ≥ 1 − ζ_i,  ζ_i ≥ 0   (11)
          y_i (w · x_i + b) ≥ −ζ̄_i,  ζ̄_i ≥ 0   (12)
i ∈ S−:  −y_i (w · x_i + b) ≥ −1 + ζ_i,  ζ_i ≤ 1   (13)
         −y_i (w · x_i + b) ≥ 1 − ζ̄_i,  ζ̄_i ≥ 0   (14)

where S + = {i : −σi = +1}, S − = {i : −σi = −1}, S = S + ∪ S − , and |S| = n.


As the optimal solution of the problem above cannot be exactly found in poly-
nomial time, we propose to exploit the peeling technique, originally presented
in [3], to find a rigorous upper bound of the Primal Cost (PC), defined in Eq.
(9)2 . The peeling procedure, summarized in Algorithm 1, consists in iteratively:
(i) training a classifier, by introducing a convex relaxation of Problem (9) as
detailed below; (ii) based on this model, eliminating the Critical Samples (CSs)
which cause the loss function to be unbounded, i.e. the patterns for which ζ̄_i = 0.
Then, the first step of the peeling procedure consists in introducing a convex
relaxation of Problem (9) by splitting the trimmed hinge loss as shown in Fig. 1
and formulating the following learning problem, based on Ivanov regularization:

min_{w,b,ζ,ζ̄} Σ_{i∈S+} ζ_i − Σ_{i∈S−} ζ_i,   subject to: ||w||² ≤ ρ   (15)

and also subject to the constraints (11) and (13). Equivalently, the Tikhonov
formulation of the above problem is:
min_{w,b,ζ,ζ̄} (1/2) ||w||² + C ( Σ_{i∈S+} ζ_i − Σ_{i∈S−} ζ_i ),   (16)
² As an alternative, Convex-ConCave Programming (CCCP) techniques can be exploited as well [7].

Algorithm 1. Peeling procedure with the trimmed hinge loss


Require: X , σ, ρ
Ensure: Upper bound of PC (see Problem (9)), h(x)
1: S = {1, . . . , n}, CS+ = CS− = 0
2: loop
3: {h(x), PC} = Solve Problem (15) using the samples indexed by S
4: i = −1, v = 0
5: for j ∈ S do
6: if −σj == +1 then
7: if v < (ζj − 1) then
8: i = j, v = (ζj − 1)
9: end if
10: else
11: if v < |ζj − 1| − 1 then
12: i = j, v = |ζj − 1| − 1
13: end if
14: end if
15: end for
16: if i == −1 then
17: Break
18: end if
19: if −σi == +1 then
20: CS + = CS + + 1
21: else
22: CS − = CS − + 1
23: end if
24: S = S \ {i}
25: end loop
26: PC = PC + CS +
27: return h(x), PC

again subject to the constraints (11) and (13). The previous problem is equivalent
to (15) for some value of C [10]. Thus, Problem (15) can be solved through the
Tikhonov formulation of Problem (16), as shown in [1].
We can compute the dual formulation of Problem (16)³ and obtain:

min_α (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i · x_j − Σ_{i∈S+} α_i   (17)

i ∈ S+: 0 ≤ α_i ≤ C,   i ∈ S−: −C ≤ α_i ≤ 0,   Σ_{i=1}^{n} y_i α_i = 0.

Once a solution has been found [11], the final classifier is obviously defined as⁴
h(x) = Σ_{i=1}^{n} α_i y_i x_i · x + b. The model is then used to identify possible CSs,
i.e. the patterns which cause the unboundedness of the loss function: the procedure
for identifying and eliminating the critical data is detailed in Algorithm 1.
Analogously to [3], it is worth noting that the number of CSs can be used to
obtain a rigorous upper bound of the minimum of Problem (9), as also shown
in line 26 of Algorithm 1.
³ The proof is omitted because of space constraints.
⁴ It is also worth noting that the non-linear extension of the presented approach
through the kernel trick [14] becomes straightforward by simply applying a non-linear
mapping x → φ(x) and by defining the kernel function as K(x_i, x_j) = φ(x_i) · φ(x_j).

Table 1. Average number of errors on the test sets of the HGE datasets

Dataset d n KCV (k = 10) MD_S MD_T RC_S RC_T


Brain Tumor 1 5920 90 5.5 ± 0.5 6.0 ± 0.0 5.4 ± 0.3 6.0 ± 0.0 5.6 ± 0.3
Brain Tumor 2 10367 50 2.8 ± 0.5 0.4 ± 0.6 0.0 ± 0.0 0.4 ± 0.6 0.4 ± 0.6
Colon Cancer 1 22283 47 8.7 ± 0.7 6.0 ± 0.0 5.4 ± 0.3 6.0 ± 0.0 5.0 ± 0.0
Colon Cancer 2 2000 62 8.7 ± 0.0 6.0 ± 0.0 6.0 ± 0.0 6.0 ± 0.0 4.0 ± 1.0
DLBCL 5469 77 7.6 ± 0.3 7.0 ± 0.0 5.4 ± 0.3 7.0 ± 0.0 5.4 ± 0.3
Duke Breast Cancer 7129 44 6.6 ± 1.2 5.8 ± 1.5 5.0 ± 0.2 5.8 ± 1.5 5.0 ± 0.2
Leukemia 7129 72 5.0 ± 0.0 5.0 ± 0.0 5.0 ± 0.0 5.0 ± 0.0 5.0 ± 0.0
Leukemia 1 5327 72 9.8 ± 0.2 9.8 ± 0.5 9.0 ± 0.0 10.0 ± 0.0 8.2 ± 0.2
Leukemia 2 11225 72 9.0 ± 0.7 8.0 ± 0.0 8.0 ± 0.0 8.0 ± 0.0 8.0 ± 0.0
Lung Cancer 12600 203 7.2 ± 0.1 16.0 ± 0.0 14.4 ± 0.3 16.0 ± 0.0 14.4 ± 0.3
Myeloma 28032 105 0.0 ± 0.0 7.0 ± 0.0 6.8 ± 0.2 7.0 ± 0.0 6.8 ± 0.2
Prostate Tumor 10509 102 10.9 ± 1.7 7.6 ± 2.6 6.8 ± 0.2 7.8 ± 2.2 6.6 ± 0.2
SRBCT 2308 83 7.6 ± 0.5 9.0 ± 0.0 6.8 ± 0.2 9.0 ± 0.0 6.6 ± 0.2

4 Experimental Results
In order to verify whether the trimmed hinge loss improves the model
selection performance of RC- and MD-based bounds in the small–sample setting,
we make use of several Human Gene Expression datasets [2]5 . As in this kind
of setting a reference set of reasonable size is not available for evaluating the
performance of the entire procedure, we reproduce the methodology suggested
in [13], which consists in generating five different training/test pairs using a
random sampling approach. The model selection is performed by searching for
the optimal hyperparameter ρ in the range [10^−6, 10^2] among 30 values, equally
spaced in a logarithmic scale. As, in this paper, we are targeting two–class clas-
sification, we map multi-class datasets into two–class ones by simply grouping
classes so to obtain almost balanced problems.
In Table 1, we also present the average number of errors made on the five
test set replicas. In particular, we compare the results obtained with the RC and
the MD approaches, using both the soft loss (RC_S, MD_S) and the trimmed hinge
loss (RC_T, MD_T). Though in-sample methods based on the soft loss have
already been shown to outperform out-of-sample approaches on the same datasets [2], we
report here, for the sake of completeness, the misclassification rate for the well-
known K-fold Cross Validation (KCV) technique [9]. In particular, for the KCV,
the conventional Tikhonov SVM formulation is exploited, where the number of
folds k is set to 10 [8] and the hyperparameter C is searched within the range
[10^−6, 10^2] among 30 values, analogously to ρ. As expected, the trimmed hinge
loss is a tighter upper–bound of the number of errors, and the classifiers chosen
by RC_T and MD_T perform consistently better than the ones selected by RC_S,
MD_S, and the KCV.

⁵ We do not include here the original references for all the datasets because of space constraints; however, they can be retrieved in [2].
498 L. Oneto et al.

5 Conclusions
In this work we introduced the use of the trimmed hinge loss in Support Vector classifiers, which allows in-sample model selection to be performed rigorously through Rademacher Complexity and Maximal Discrepancy based bounds. In this framework, we presented a peeling technique that can be used to solve the resulting non-convex SVM learning problem. Thanks to its appealing properties, using the trimmed hinge loss for in-sample model selection allows the choice of classifiers that outperform both state-of-the-art in-sample and out-of-sample approaches when applied to real-world small-sample problems. Further improvements are possible, e.g. through the introduction of data-dependent strategies for the selection of the hypothesis space [2].

References
1. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: In-sample Model Selection for Support
Vector Machines. In: Proc. of the Int. Joint Conference on Neural Networks (2011)
2. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: Selecting the Hypothesis Space for
Improving the Generalization Ability of Support Vector Machines. In: Proc. of the
Int. Joint Conference on Neural Networks (2011)
3. Anguita, D., Ghio, A., Ridella, S.: Maximal Discrepancy for Support Vector Ma-
chines. Neurocomputing 74, 1436–1443 (2011)
4. Anguita, D., Ridella, S., Rivieccio, F., Zunino, R.: Quantum optimization for train-
ing support vector machines. Neural Networks 16(5-6), 763–770 (2003)
5. Bartlett, P.L., Boucheron, S., Lugosi, G.: Model selection and error estimation.
Machine Learning 48, 85–113 (2002)
6. Braga-Neto, U.M., Dougherty, E.R.: Is cross-validation valid for small-sample mi-
croarray classification? Bioinformatics 20(3), 374 (2004)
7. Collobert, R., Sinz, F., Weston, J., Bottou, L.: Trading convexity for scalability.
In: Proceedings of the 23rd International Conference on Machine Learning, pp.
201–208 (2006)
8. Hsu, C., Chang, C., Lin, C., et al.: A practical guide to support vector classification
(2003)
9. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation
and model selection. In: International Joint Conference on Artificial Intelligence,
vol. 14, pp. 1137–1145 (1995)
10. Pelckmans, K., Suykens, J.A.K., De Moor, B.: Morozov, Ivanov and Tikhonov regularization based LS-SVMs. Neural Information Processing 3316, 1216–1222 (2004)
11. Platt, J.: Sequential minimal optimization: A fast algorithm for training sup-
port vector machines. Advances in Kernel Methods-Support Vector Learning 208,
98–112 (1999)
12. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge
University Press (2004)
13. Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehen-
sive evaluation of multicategory classification methods for microarray gene expres-
sion cancer diagnosis. Bioinformatics 21(5), 631 (2005)
14. Vapnik, V.N.: Statistical learning theory. Wiley Interscience (1998)
Using a Support Vector Machine and Sampling
to Classify Compounds as Potential Transdermal
Enhancers

Alpa Shah1 , Gary P. Moss2 , Yi Sun1 , Rod Adams1 ,


Neil Davey1 , and Simon Wilkinson3
1
Science and technology research school, University of Hertfordshire,
United Kingdom, AL10 9AB
{a.shah8,comrys,r.g.adams,n.davey}@herts.ac.uk
2
School of Pharmacy, Keele University, United Kingdom
3
Medical Toxicology Centre, Wolfson Unit, Medical School,
University of Newcastle-upon-Tyne, UK

Abstract. Distinguishing good chemical enhancers of percutaneous absorption from poor enhancers is a difficult problem. Previously, discriminant analysis and other machine learning methods have been applied to this problem, and results showed that the ordinary SVM provided the best performance. In this work, we apply both SVMs with different error costs and sampling methods to improve the accuracy of classification. We show that a good classification is possible.

1 Introduction
Incorporation of certain chemicals into the drug delivery vehicle may lead to enhancement of drug release and a more rapid clinical response. Such chemicals have been variously labeled penetration enhancers, accelerants or sorption promoters. Investigation and development of suitable enhancers is, like other aspects of pharmaceutical development, limited by the time and expense of in vivo studies and even of a wide range of in vitro experiments. Therefore, mathematical relationships are often sought between the physical properties of pharmaceutical systems and their clinical performance.
In [1], Pugh et al. employed discriminant analysis to classify such enhancers into simple categories, based on the physicochemical properties of the enhancer molecules. One of the difficulties in this task is the imbalanced dataset: the data has two classes, labeled as either good enhancers or poor enhancers, with about 83% being poor enhancers. In [2], several machine learning methods, including K-nearest-neighbour (KNN) regression, single layer networks, Gaussian processes (GP) and the SVM classifier, were applied to an enhancer dataset; the best classification result was obtained by the ordinary SVM method. Here we investigate the effect of using two different SVM error costs, and whether a sampling method, including both under-sampling and over-sampling, can further improve the SVM classifier's accuracy.

2 Problem Domain
The percutaneous absorption of exogenous chemicals, particularly for therapeutic purposes, is limited by the nature of the skin barrier, specifically the properties of the

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 499–506, 2012.

© Springer-Verlag Berlin Heidelberg 2012
500 A. Shah et al.

outermost layer of the skin, the stratum corneum. A range of chemical and physical
methods have been employed, with varying degrees of success, to enhance the rate and
extent of absorption.
Chemical penetration enhancers are often classified based on their mechanism. Mathematical models have been developed and applied to investigate the enhancement effects of chemical compounds. Multiple regression analysis, often in the form of quantitative structure-activity relationships (QSARs), has been extensively employed to develop such relationships between the effect and a set of physical properties of pharmaceutical systems. For example, in [3] the authors used QSARs to examine the enhancement effects associated with a range of penetration enhancers.
Further, Pugh et al. [1] have applied discriminant analysis to identify compounds with potential as enhancers. They classified 73 potential enhancers of hydrocortisone permeation based on an enhancement ratio (ER): the amount of hydrocortisone transferred after 24 hours relative to a control. In their work, ER ≥ 10 was considered a good enhancer, and ER < 10 a poor enhancer. They found that discriminant analysis using the carbon chain length, the number of hydrogen-bonding atoms present on a molecule and the molecular weight resulted in a correct classification rate of 92% for good enhancers, but a relatively limited success in correctly classifying poor enhancers (72%).
The application of novel machine learning techniques has also highlighted their usefulness to the field of percutaneous absorption [2]. Results in [2] show that machine learning methods can provide more accurate classification of enhancer type with fewer false-positive results.
In the current work, we investigate whether the false positive rate can be further
decreased when using SVM with sampling methods.

3 A Description of the Data

The dataset employed in this study is based on [1]. Several changes have been made to that dataset, including the correction of several errors contained in the original.
The dataset consists of seventy-one compounds in which there are twelve compounds
belonging to the good enhancer class of samples and fifty-nine compounds belonging to
the poor enhancer class of samples. All data used refer to the transfer of hydrocortisone
across hairless mouse skin over 24 hours from propylene glycol solutions of enhancers.
As mentioned in reference [1], a range of calculable molecular features were considered as predictors, but most of them proved unsuccessful; the discussion was therefore limited to the five most successful predictors. These are log P (P denotes the octanol/water partition ratio), log S (S denotes solubility), Molecular Weight (MW), carbon chain length (CC) and the number of hydrogen bonds (HB), all of which are readily calculable molecular features. In [1], it was shown that log P and log S are highly correlated, with a correlation coefficient of −0.91, while MW, CC and HB can be considered another set of independent variables. So, in the following experiments, "3 features" refers to MW, CC and HB, while "5 features" refers to those 3 plus log P and log S.
A comparison between the good class and the poor class was undertaken by visual data exploration. For instance, Figure 1 shows a box plot of molecular weights by class. (We do not show figures for all features here due to space limits.) In summary, there are

statistical differences between the good and poor classes. For example, the interquartile
range for the good class is at a higher value than that of the poor class on M W , CC
and log P , while it is at a lower value on HB and log S. In addition, the interquartile
range for the good class is always narrower than that for the poor class on all 5 features.

Fig. 1. Box plot of Molecular Weight grouped as poor and good classes. Da is a unit of mass.

4 Performance Measures
It is obvious that, for a problem domain with an imbalanced dataset, classification accuracy alone is not a sufficient performance measure. To evaluate the classifiers used in this work, we apply several common performance metrics, such as Recall, Precision and F-score, which are calculated in order to fairly quantify the performance of the classification algorithm on the minority class.
Based on the confusion matrix (see Table 1) computed from the test results, several
common performance metrics can be defined as follows in Table 2.

Table 1. A confusion matrix: where TN is the number of true negative samples; FP is false
positive samples; FN is false negative samples; TP is true positive samples

TN FP
FN TP

Table 2. Common performance metrics

Recall = TP / (TP + FN),    (1)
Precision = TP / (TP + FP),    (2)
F-score = (2 · Recall · Precision) / (Recall + Precision),    (3)
FP rate = FP / (FP + TN).    (4)

In the context of identifying good enhancers, a high F-score and a low FP rate are particularly important, as a higher cost is associated with degraded performance on these metrics. The F-score also integrates the trade-off between Precision and Recall.
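The four metrics of Table 2 can be computed directly from the confusion-matrix counts; the following is a minimal illustrative sketch, not the paper's own code:

```python
def imbalance_metrics(tn, fp, fn, tp):
    """Compute Recall, Precision, F-score and FP rate (Eqs. 1-4)
    from the four confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * recall * precision / (recall + precision)
    fp_rate = fp / (fp + tn)
    return recall, precision, f_score, fp_rate

# Example with counts rounded from the under-sampling confusion matrix (Table 5)
r, p, f, fpr = imbalance_metrics(tn=55, fp=4, fn=0, tp=12)
```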

5 Techniques for Learning Imbalanced Datasets

In this paper we address the problem of our imbalanced data in two ways: firstly by
using data based sampling techniques [4] and secondly by using different SVM error
costs for the two classes [5].

5.1 Sampling Techniques

One way to address imbalance is simply to change the relative frequencies of the two classes by under-sampling the majority class and over-sampling the minority class. Under-sampling the majority class can be done by randomly selecting a subset of the class. Over-sampling the minority class is not so simple; here we use the Synthetic Minority Over-sampling Technique (SMOTE) [4]. For each member of the minority class, its nearest neighbours in the same class are identified and new instances are created, placed randomly between the instance and its neighbours.
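The SMOTE step just described can be sketched as follows; this is a simplified NumPy illustration (fixed number of neighbours, Euclidean distance), not the implementation used in the paper:

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic minority samples by interpolating each chosen
    sample with one of its k nearest minority-class neighbours (SMOTE [4])."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # random point on the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# e.g. over-sample the 12 good enhancers up to 59 by adding 47 synthetic points
```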

5.2 Different SVM Error Costs

In the standard SVM the primal Lagrangian that is minimized is:

Lp = ||w||^2 / 2 + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} αi [yi (w·xi + b) − 1 + ξi] − Σ_{i=1}^{n} ri ξi    (5)

subject to: 0 ≤ αi ≤ C and Σ_{i=1}^{n} αi yi = 0    (6)

Here C represents the trade-off between the empirical error, ξ, and the margin. The problem is that both the majority and minority classes use the same value for C, which, as pointed out by Akbani et al. [6], will probably leave the decision boundary too

near the minority class. Veropoulos et al. [5] suggest that having a different C value for the two classes may be useful, and propose that the primal Lagrangian be modified to:

Lp = ||w||^2 / 2 + C+ Σ_{i|yi=+1} ξi + C− Σ_{i|yi=−1} ξi − Σ_{i=1}^{n} αi [yi (w·xi + b) − 1 + ξi] − Σ_{i=1}^{n} ri ξi    (7)

subject to: 0 ≤ αi ≤ C+ if yi = +1, 0 ≤ αi ≤ C− if yi = −1, and Σ_{i=1}^{n} αi yi = 0    (8)

Here the trade-off coefficient C is split into C+ and C− for the two classes, allowing the decision boundary to be influenced by a different trade-off for each class. Thus the decision boundary can be moved away from the minority class by lowering C+ with respect to C−.
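To make Eq. (7) concrete, the sketch below trains a linear classifier by subgradient descent on the corresponding primal objective with per-class costs C+ and C−; it is an illustration only (linear kernel, fixed learning rate), not the LIBSVM solver used in the experiments:

```python
import numpy as np

def weighted_linear_svm(X, y, c_pos, c_neg, lr=0.001, epochs=2000):
    """Subgradient descent on the primal objective of Eq. (7) for a linear
    kernel: ||w||^2/2 + C+ sum(xi_i | y_i=+1) + C- sum(xi_i | y_i=-1),
    with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    C = np.where(y == 1, c_pos, c_neg)   # per-sample cost: C+ or C-
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1               # samples with nonzero hinge loss
        grad_w = w - X[viol].T @ (C[viol] * y[viol])
        grad_b = -np.sum(C[viol] * y[viol])
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Raising c_pos relative to c_neg penalises errors on the (minority) positive class more heavily and so pushes the boundary away from it, as described above.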
Akbani et al. [6] argue that using this technique should improve the position of the decision boundary but will not address the fact that it may be misshapen due to the relative lack of information about the distribution of the minority class. They therefore suggest that the minority class should also be over-sampled, using SMOTE, to produce a method they call SMOTE with Different Costs (SDC). This is one of the techniques we evaluate here.

6 Experiments
For each different dataset, the leave-one-out technique was applied: one chemical is used for testing, and all the others are employed for training. This was repeated for each compound in turn. Finally, performance metrics were computed over all predictions. The SVM experiments were completed using LIBSVM, which is available from the URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.
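The leave-one-out protocol can be written as a simple loop; the sketch below uses a generic `fit`/`predict` classifier interface rather than the LIBSVM bindings actually used in the paper:

```python
import numpy as np

def leave_one_out(X, y, make_classifier):
    """Train on all compounds but one, test on the held-out compound,
    and collect every prediction for computing the final metrics."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        clf = make_classifier()
        clf.fit(X[mask], y[mask])
        preds[i] = clf.predict(X[i:i + 1])[0]
    return preds
```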

6.1 Experiment 1
In the first experiment, the effect of different SVM error costs is investigated. A systematic search was performed, and the best results were obtained when the cost penalty assigned for misclassifying a poor enhancer was 10, whereas that for misclassifying a good

Table 3. Performances using SVM with cost penalties

Method Features Recall Precision F-Score FP Rate


Ordinary SVM 3 0.58 0.58 0.58 0.08
5 0.67 0.61 0.64 0.08
SVM+Cost Penalties 3 0.83 0.62 0.71 0.10
5 0.67 0.61 0.64 0.08

enhancer was 80. Results are shown in Table 3. For comparison, the ordinary SVM classification results, obtained with a radial basis function kernel, are also shown in the table.
It can be seen that using SVM with cost penalties does not improve classification performance on the dataset with all five features, but it improves F-score, recall and precision, though not FP-rate, when using only three features.

6.2 Experiment 2

In this experiment, we evaluate the classification effect produced by over-sampling the minority instances and under-sampling the majority instances, separately. The experiment consists of a random sampling process on the dataset followed by the classification. It is therefore conducted twenty times, and the averaged performance values are reported together with standard deviations.
First, the good enhancers are over-sampled using SMOTE. Three different numbers of samples were tried: 59, 24 and 36. The best result, obtained using 59 (i.e. a majority-to-minority ratio of 1), is presented in Table 4. Second, the poor enhancers are under-sampled to 12, 30 and 20, separately. The best result, obtained using 20 (i.e. a minority-to-majority ratio of three fifths), is also presented in Table 4.

Table 4. Performances using SVM with either oversampling or undersampling

Method Features Recall Precision F-Score FP Rate


SVM+over- 3 0.94 ± 0.04 0.66 ± 0.05 0.77 ± 0.04 0.1 ± 0.02
sampling 5 0.92 ± 0.02 0.62 ± 0.04 0.74 ± 0.03 0.12 ± 0.02
SVM+under- 3 0.97 ± 0.04 0.76 ± 0.05 0.85 ± 0.04 0.06 ± 0.02
sampling 5 0.96 ± 0.04 0.67 ± 0.08 0.81 ± 0.06 0.09 ± 0.03

Table 4 shows that under-sampling performs better than the over-sampling method. Compared with Table 3, it can be seen that, on average, the results from sampling achieve better F-score and precision with much higher recall. For the FP-rate, SVM with under-sampling provides a lower value than SVM with or without cost penalties on the corresponding dataset, while SVM with over-sampling has a higher (worse) FP-rate. Furthermore, Table 5 shows the confusion matrix obtained from under-sampling. It can be seen that

Table 5. A confusion matrix with 3 features

TN = 55.24 ± 1.14    FP = 3.76 ± 1.14
FN = 0.43 ± 0.51    TP = 11.57 ± 0.51

on average the number of true positives is almost the same as the actual number of
labeled good enhancers.
Figure 2 shows a principal component analysis (PCA) plot, where the label information is from the SVM with under-sampling on the three-feature dataset. One false negative and three false positive instances can be seen. The three false positives are located in the area occupied by the true positives, so their misclassification is not surprising.


Fig. 2. The PCA plot of the dataset with three features. The single false negative can be seen to
be a long way from the other positive samples.

6.3 Experiment 3

In the final experiment, we combine under-sampling and over-sampling, and we investigate whether SVM with different error costs can work with sampling to further improve the classifier's performance. The dataset was sampled to double the minority class by over-sampling and halve the majority class by under-sampling. Hence, the dataset consists of twenty-four good and thirty poor samples, totaling fifty-four samples. The experiment is undertaken with and without cost penalties. Results are shown in Table 6.
Results are shown in Table 6.
The results in Table 6 show that SVM with combined sampling performs very similarly with or without different error costs, though cost penalties provide a slight overall improvement. Furthermore, comparing Table 6 with Table 4, it can be seen that the combined sampling methods with cost penalties provide a result similar to under-sampling alone, but under-sampling remains the best result overall.

Table 6. Performances using SVM with combined sampling

Method Features Recall Precision F-Score FP Rate


SVM+Combined 3 0.97 ± 0.04 0.66 ± 0.09 0.78 ± 0.07 0.1 ± 0.05
Sampling 5 0.97 ± 0.04 0.66 ± 0.07 0.78 ± 0.05 0.11 ± 0.03
SVM+Combined 3 0.96 ± 0.06 0.7 ± 0.08 0.81 ± 0.07 0.09 ± 0.04
Sampling + Cost Penalties 5 0.98 ± 0.03 0.68 ± 0.07 0.8 ± 0.05 0.1 ± 0.03

7 Conclusions
The main results of this work have shown that, using suitable machine learning techniques, we can produce an effective classifier of potential transdermal enhancers. Moreover, this has been accomplished with a very small training set of only 70 compounds. The technical results in this paper have shown that sampling methods can usefully improve the SVM classifier's performance. Using SVM with different error costs can also improve on the ordinary SVM's performance, though not as much as sampling does. For this dataset, over-sampling, under-sampling and combined sampling were evaluated, with under-sampling giving the best results, followed by the combined sampling methods.

References
[1] Pugh, W., Wong, R., Falson, F., Michniak, B., Moss, G.: Discriminant analysis as a tool
to identify compounds with potential as transdermal enhancers. Journal of Pharmacy and
Pharmacology 57, 1389–1396 (2005)
[2] Moss, G., Shah, A., Adams, R., Davey, N., Wilkinson, S., Pugh, W., Sun, Y.: The applica-
tion of discriminant analysis and machine learning methods as tools to identify and classify
compounds with potential as transdermal enhancers. European Journal of Pharmaceutical
Sciences 45, 116–127 (2012)
[3] Ghafourian, T., Zandasrar, P., Hamishekar, H., Nokhodchi, A.: The effect of penetration en-
hancers on drug delivery through skin: a qsar study. J. Control. Release 99, 113–125 (2004)
[4] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
[5] Veropoulos, K., Cristianini, N., Campbell, C.: Controlling the sensitivity of support vector
machines. In: Proceedings of the International Joint Conference on Artificial Intelligence
(1999)
[6] Akbani, R., Kwek, S.S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced
Datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004.
LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
The Application of Gaussian Processes in the Predictions
of Permeability across Mammalian Membranes

Yi Sun1 , Marc B. Brown2 , Maria Prapopoulou2, Rod Adams1 ,


Neil Davey1, and Gary P. Moss3
1
Science and technology research school, University of Hertfordshire,
United Kingdom, AL10 9AB
{comrys,r.g.adams,n.davey}@herts.ac.uk
2
School of Pharmacy, University of Hertfordshire, United Kingdom
3
School of Pharmacy, Keele University, United Kingdom

Abstract. The problem of predicting the rate of percutaneous absorption of a drug is an important issue with the increasing use of the skin as a means of moderating and controlling drug delivery. The aim of the current study was to explore whether including skin data from another species in a training set can improve predictions of the human skin permeability coefficient. Permeability data for absorption across rodent skin was collected from the literature. The Gaussian process model was applied to the data and compared to two QSPR methods. The results demonstrate that data from non-human skin can provide useful information in the prediction of the permeability of human skin.

1 Introduction
The problem of predicting the rate at which various chemical compounds penetrate human skin is an important issue with the increasing use of skin as a means of achieving both local and systemic drug delivery. In [1] and [2], it is shown that advanced machine learning techniques, especially Gaussian Processes (GP), outperform quantitative structure-activity relationships (QSARs), which are widely used in the pharmacy community.
One key feature of predicting percutaneous absorption accurately is that the target, the skin permeability coefficient, may have a strongly non-linear relationship with the compound physicochemical descriptors (features); this has already been shown to be the case in [1] and [2], using a human skin dataset. In [3], GP is further evaluated on four different datasets, namely experimentally derived drug permeation data across human skin, pig skin, rodent skin, and a synthetic (Silastic®) membrane. The GPs with Matérn and neural network covariance functions give the best performance in [3]. It is found that the five compound features applied to human, pig and rodent membranes cannot represent the main characteristics of the Silastic dataset.
In the current work, given the number of animal experiments described in the literature, and the difficulties in obtaining human skin for experiments, a dataset consisting of permeability values from animal (rodent) experiments was investigated with GP methods in order to determine whether it could provide reasonable estimates of human skin permeability.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 507–514, 2012.

© Springer-Verlag Berlin Heidelberg 2012
508 Y. Sun et al.

2 Problem Domain
Predicting percutaneous absorption accurately has proven to be a major challenge, with substantial implications for the pharmaceutical and cosmetic industries, as well as for toxicological issues in fields such as pesticide usage and chemicals manufacture. Predictive modeling is a frequently used tool to increase the throughput of percutaneous absorption experiments. The use of animal models for percutaneous penetration is often considered essential, given the possible toxicity, cost, ethics and inconvenience of employing human skin during laboratory experiments. Human skin differs from that of many animals in numerous ways, including the thickness of the stratum corneum, the number of appendages per unit area and the amount of skin lipids present. Despite this, it is very surprising that no quantitative mathematical models had been developed for characterising permeation across non-human skin before [3] was published. This is perhaps due to the development of the Potts and Guy model [4], the first major quantitative model of percutaneous absorption, which was based on human skin data.
In using a model system, the researcher must take into account the inherent differences between the various species employed and the parameters affecting percutaneous penetration in each species. The model selected must therefore resemble human skin as closely as possible. Various models have been offered by many researchers. [5] and [6] investigated several potential models, including rabbit, miniature swine and rat, and concluded that rabbit skin, and then rat skin, were the most permeable membranes, and that flux (denoted J), i.e. the rate of permeant transfer from one side of the membrane to the other, through pig skin most closely resembled permeation across human skin.
By convention, Kp denotes the permeability coefficient. Kp is a concentration-corrected version of flux that allows comparison of permeation between different molecules; it is defined as Kp = J/ΔCm, where ΔCm denotes the concentration difference across the membrane. Several approaches have been used to try to quantify and predict skin absorption. One such method involves the use of quantitative structure-activity (or permeability) relationships (QSARs, or QSPRs). Usually, lipophilicity (P) and molecular weight (MW) appear to be the only significant features in QSAR forms, although subset analysis has shown the significance of other parameters [7]. P is the ratio of the solubility of a molecule between two phases: octanol, representing the lipid phase, and water (or a buffered aqueous solution), representing the aqueous phase. As some molecules strongly prefer one phase to the other, P can span a wide range, often from 10⁻⁷ to 10⁷; hence a log scale, log P, is used. For the same reason, log Kp is used for skin percutaneous absorption rather than Kp. It is important to note that log Kp is a completely different term from log P.
Recently, new approaches, for example artificial neural networks and fuzzy modelling, have been applied to predict percutaneous absorption. [1] employed Gaussian Processes to predict percutaneous absorption using a human skin dataset, showed the underlying non-linear nature of the dataset, and provided a substantial statistical improvement over existing models.
The novel contribution of this work is to explore another interesting issue: whether including skin data from another species in a training set can improve predictions of the human skin permeability coefficient.

3 A Description of the Data


The two datasets employed in this study have been collated with reference to a range of literature sources. The human and rodent membrane datasets consist of 140 and 103 chemical compounds, respectively. The two datasets share some chemical compounds tested on different membranes; for example, caffeine was tested through both human and rodent membranes, with log Kp (cm/h) values of −3.68 for human skin and −2.99 for rodent. The number of chemical compounds common to human and rodent membranes is 48. Therefore, the number of non-common compounds is 92 in the human skin dataset, and 55 in the rodent skin dataset.
In [1] and [2], it is shown that using five features, namely molecular weight (MW), solubility parameter (SP), log P, and counts of the numbers of hydrogen-bond acceptor (HA) and donor (HD) groups, can produce better predictions than using lipophilicity and molecular weight alone. Therefore, in this work, these five compound features are used.

3.1 Visualisation of the Skin Data


First, all the compounds from the two datasets were combined. The data was then normalised so that all five features had zero mean and unit variance. The normalised data points were visualised by applying principal component analysis (PCA), which maps data to a low-dimensional space with a linear transformation.
The compounds were plotted using the corresponding log Kp values against the first two principal components, which represent the variation in the five features of all chemical compounds (see Figure 1). The first principal component accounts for 44.8% of the total variance, and the second accounts for 29.8%. Figure 1 shows that there is no linear relation between log Kp and the compound features, suggesting that there may be more complex non-linear structure in the data. In addition, it can be seen that the rodent skin dataset has a distribution similar to that of the human skin dataset.
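The normalisation and PCA projection described above can be reproduced with a few NumPy operations; this is a sketch, not the authors' code:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Standardise features to zero mean / unit variance, then project onto
    the leading principal components via the SVD of the data matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T        # principal component scores
    explained = S**2 / np.sum(S**2)         # fraction of variance per PC
    return scores, explained[:n_components]
```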

4 Modelling Methods
Two QSPR methods were applied to the human skin data in order to provide a comparison between Gaussian Processes and previous approaches to this task. The first, denoted Potts, was proposed by [4] and derived from the Flynn dataset [8]; it is given by the equation log Kp (cm/s) = 0.71 log P − 0.0061 MW − 6.3. The second model, denoted Moss, is given by log Kp (cm/s) = 0.74 log P − 0.0091 MW − 2.39, which was derived from a slightly larger dataset [7].
Since there are no QSAR models for animal skin, a simple naïve model was used for comparison. In the naïve model, the prediction for any input is always the same value, namely the mean of log Kp in the training set.
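The two QSPR baselines are simple linear functions of log P and MW and translate directly into code; note that the paper states them in units of cm/s:

```python
def potts_guy_log_kp(log_p, mw):
    """Potts & Guy model [4]: log Kp (cm/s) = 0.71 log P - 0.0061 MW - 6.3."""
    return 0.71 * log_p - 0.0061 * mw - 6.3

def moss_log_kp(log_p, mw):
    """Moss model [7]: log Kp (cm/s) = 0.74 log P - 0.0091 MW - 2.39."""
    return 0.74 * log_p - 0.0091 * mw - 2.39
```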

4.1 The Gaussian Processes Regression


A Gaussian process (GP) is defined simply as a collection of random variables which have a joint Gaussian distribution. It is completely characterised by its mean and covariance function. Usually, the mean function is taken to be the zero-everywhere

Fig. 1. The relationship between log Kp and the PCA space of chemical compounds: a) the first
principal component; b) the second principal component

function. The covariance function, k(xi, xj), allows a priori knowledge from a training dataset to be specified; it defines the nearness or similarity between the values of f(x) at the two points xi and xj.
To make a prediction y∗ at a new input x∗, the conditional distribution p(y∗ | y1, . . . , yN), conditioned on the observed vector [y1, . . . , yN], needs to be computed. Since the model is a Gaussian process, this distribution is also Gaussian and is completely defined by its mean and variance. By applying standard linear algebra, the mean and variance at x∗ are given by

E[y∗ ] = kT∗ (K + σn2 I)−1 y , (1)

var[y∗ ] = k(x∗ , x∗ ) − kT∗ (K + σn2 I)−1 k∗ , (2)


where k∗ denotes the vector of covariances between the test point and the N training data; K denotes the covariance matrix of the training data; σn2 denotes the variance of independent, identically distributed Gaussian noise (i.e. the observations are noisy); y denotes the vector of training targets; and k(x∗, x∗) denotes the prior variance of y∗. As is normally the case, the mean values were used as predictions, and the variance was used to provide error bars on the predictions.
In [3], it shows that the GPs with Matérn and neural network covariance functions
give the best performance. The Matérn covariance function can be defined as a product
of an exponential and a polynomial of order p. More details can be found in [9]. In this
work, the Matérn covariance function is applied with p = 1.
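To make Eqs. (1)–(2) concrete, here is a minimal NumPy sketch of the predictive equations, using a Matérn covariance of order p = 1 (the ν = 3/2 form: a first-order polynomial in the scaled distance times an exponential). The function names and hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def matern32(X1, X2, length=1.0, sigma_f=1.0):
    """Matern covariance with p = 1 (nu = 3/2): an order-1 polynomial
    in the scaled distance multiplied by an exponential."""
    r = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    s = np.sqrt(3.0) * r / length
    return sigma_f ** 2 * (1.0 + s) * np.exp(-s)

def gp_predict(X, y, X_star, noise=0.1, **kw):
    """Zero-mean GP prediction: Eq. (1) for the mean, Eq. (2) for the variance."""
    K = matern32(X, X, **kw)               # covariance matrix of the training data
    K_s = matern32(X, X_star, **kw)        # columns are k_* for each test point
    A = K + noise ** 2 * np.eye(len(X))    # K + sigma_n^2 I
    mean = K_s.T @ np.linalg.solve(A, y)   # E[y_*] = k_*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(A, K_s)
    var = matern32(X_star, X_star, **kw).diagonal() - (K_s * v).sum(axis=0)
    return mean, var
```

As in the text, the mean would be used as the point prediction and the variance would provide error bars.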

4.2 Performance Measures


As mentioned in [1] and [2], the mean squared error (MSE), improvement over naïve
(ION), negative log estimated predictive density (NLL), and Pearson correlation
coefficient (CORR) were all used to evaluate the performance of each model.
The MSE measures the average squared difference between model predictions and
the corresponding targets. The ION measures the degree of improvement of the model
over the naïve predictor, that is,

  ION = (MSE_naïve − MSE) / MSE_naïve × 100% .

The NLL is defined as

  NLL = (1/N_tst) Σ_{n=1}^{N_tst} (− log p(y_n | x_n)) ,

where − log p(y_n | x_n) = (1/2) log(2πσ∗^2) + (y_n − E[y_n])^2 / (2σ∗^2), and σ∗^2 is
the predictive variance. The CORR measures the correlation between predictions and
targets. For comparison, a good model should have low values of both MSE and NLL,
as well as high values of both ION and CORR, on a given test dataset.

5 Experiments
For each dataset, the leave-one-out technique was applied: one chemical is used
for testing, and all the others are employed for training. This was repeated for
each compound in turn. Finally, the performance metrics were computed over all
the predictions.
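The leave-one-out procedure can be sketched as follows; `fit_predict` stands for any of the models above (the callback shape is our assumption).

```python
import numpy as np

def leave_one_out(X, y, fit_predict):
    """Hold out each compound in turn: fit_predict(X_train, y_train, x_test)
    returns the prediction for the single held-out compound."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        preds[i] = fit_predict(X[mask], y[mask], X[i:i + 1])
    return preds   # performance metrics are then computed over all predictions
```

For example, the naïve model is simply `lambda Xtr, ytr, xte: ytr.mean()`.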

5.1 Experiment 1
The first experiment investigated whether a GP model trained using the rodent skin
dataset provides reasonable predictions for the human skin dataset. It is obviously much
easier to obtain animal tissue than human tissue. The rodent skin dataset was used as the
training set and the trained GP model was tested on the complete human skin dataset.
Table 1 shows results for the complete human set. The best result for each column is
indicated in bold. It can be seen that the GP model outperforms the QSAR models, with
the model trained using the rodent dataset giving the best prediction result.

Table 1. Performances on the complete human skin dataset when trained with the rodent dataset.

Model                       MSE      ION      CORR   NLL
QSAR              Moss     20.09  -1171.00    0.14    -
QSAR              Potts     5.50   -248.22    0.10    -
                  Naïve     1.63      0       0.00    -
trained on rodent GP        1.24     24.14    0.56   2.61

5.2 Experiment 2
The possibility of using an animal model to predict human skin permeability was in-
vestigated in experiment 1. In experiment 2, a quantitative comparison of human skin
permeability predictions between GP models trained on a rodent and on a human
training dataset was undertaken. To make this comparison, the compounds common to
both the human and rodent datasets had to be used. A training set of 48 common
chemical compounds, for which a target value was known for both the human and
rodent data, was constructed. Two trained models were then produced. Previously
unseen human data were used as a test set for both models; this comprised the rest
of the human dataset, consisting of 92 compounds.
Table 2 shows the corresponding results, with the best results in bold. It can be seen
that both the human and rodent training sets give better predictions on the human test
set than either the naïve or the QSAR models, with the human training set giving the
best performance. Interestingly, the rodent naïve model is better than the human naïve
model at predicting human skin permeability. Moreover, GP predictions from the rodent
model are much better than those of the human naïve model. This suggests that the
rodent model could be more useful than is often thought for predicting human skin
permeability [6].

5.3 Experiment 3
The final experiment investigated how adding rodent examples into a human training
set may affect predictions on a human test set. To avoid inconsistent training examples,
that is, examples with the same features but different target values, the non-common
compounds were used as training examples. In this experiment, a human model was
trained on a human dataset, denoted trnH, consisting of the 92 non-common compounds, and

Table 2. Performances on the human test set using models trained on the rodent and human
training sets, separately. Note that the QSAR ION results were computed relative to the human
training set.

Model                       MSE      ION      CORR   NLL
QSAR              Moss     19.35  -1203.3     0.16    -
QSAR              Potts     6.01   -304.59    0.12    -
trained on human  Naïve     1.48      0       0       -
                  GP        1.05     29.40    0.52   1.47
trained on rodent Naïve     1.37      0       0       -
                  GP        1.13     17.22    0.46   1.47

tested using human data with the 48 common compounds. Results produced by the
human model are shown in the first two rows of Table 3.
To generate a mixed model using human and rodent data, a mixed dataset was
needed; moreover, it had to contain 92 compounds in order to be comparable with
trnH, which includes 92 compounds. The chemical compounds not among the common
compounds were extracted from the rodent dataset (55 compounds in total), denoted
trnR. Then 50 samples from trnH and 42 from trnR were randomly selected and
combined, denoted trnHR. A GP model was trained using trnHR. Finally, the model
was tested on the same human data as was used for the human model (that is, the 48
common compounds). This procedure was repeated 10 times, and the average results
are shown in Table 3.
It can be seen that including rodent examples in the training set can produce predic-
tions on average almost as accurate as using a human training set of the same size. In
addition, the model trained on the mixed training set produces a much lower, and more
reliable, NLL on this test set than the one trained using the human training set.
Finally, the actual permeability value for rodent skin was used as the prediction
for human skin. This works since the test set contains only those compounds
that are common to both rodent and human tissues. The final row of Table 3 shows the
result of this ‘prediction’. Interestingly, it gives the best result.

Table 3. Performances on the human test set using models trained on the human, rodent and
mixed training set, respectively. Performance using rodent values directly is also shown (see
text).

Model                       MSE          ION           CORR         NLL
trained on human  Naïve     2.15         0             0            -
                  GP        1.93         9.96          0.43         13.06
trained on mixed  GP        1.91 ± 0.09  11.78 ± 3.05  0.51 ± 0.05  2.10 ± 0.31
rodent values     GP        1.51         29.47         0.68         -

6 Conclusions
The main results in this paper have shown that data from rodent skin can provide
useful information for predicting the permeability of human skin. In particular, if
the permeability of rodent skin for a specific compound is used as an estimate of the
permeability of that compound for human skin, then quite a high quality prediction
is made; in fact, such a prediction is the best of all the models we have evaluated.
It is also interesting that a GP model trained with a mixture of human and rodent
data did, on average, a little better than a GP model trained with just human data.
The implication of this work could be important in that testing the permeability of
human skin is difficult and expensive, whereas the testing of animal skin is much easier.
It is known from the literature (for example, [6] and [10]) that the skin of other species
is more similar to human skin than that of the rodent. In particular, many researchers have
indicated the suitability of porcine (pig) skin, especially from weanling or stillborn
animals, as a model for percutaneous absorption ([11] and [12]). In future work, we
would therefore like to use porcine data as a predictor for human skin permeability.

References
[1] Moss, G., Sun, Y., Davey, N., Adams, R., Pugh, W., Brown, M.: The application of Gaussian
processes to the prediction of percutaneous absorption. Journal of Pharmacy & Pharmacol-
ogy 61, 1147–1153 (2009)
[2] Sun, Y., Brown, M., Prapopoulou, M., Davey, N., Adams, R., Moss, G.: The application
of stochastic machine learning methods in the prediction of skin penetration. Applied Soft
Computing 11(2), 2367–2375 (2010)
[3] Sun, Y., Moss, G.P., Prapopoulou, M., Adams, R., Brown, M.B., Davey, N.: The ap-
plication of Gaussian processes in the prediction of percutaneous absorption for mammalian
and synthetic membranes. In: ESANN 2010 Proceedings, Bruges, Belgium (2010)
[4] Potts, R., Guy, R.: Predicting skin permeability. Pharmaceutical Research 9, 663–669
(1992)
[5] Bartek, M., LaBudde, J., Maibach, H.: Skin permeability in vivo: comparison in rat, rabbit,
pig and man. Journal of Investigative Dermatology 58, 114–123 (1972)
[6] Wester, R., Noonan, P.: Relevance of animal models for percutaneous absorption. Interna-
tional Journal of Pharmaceutics 7, 99–110 (1980)
[7] Moss, G., Cronin, M.: Quantitative structure-permeability relationships for percutaneous
absorption: re-analysis of steroid data. International Journal of Pharmaceutics 238, 105–
109 (2002)
[8] Flynn, G.L.: Physicochemical Determinants of Skin Absorption, pp. 93–127. Elsevier, New
York (1990)
[9] Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press
(2006)
[10] Bronaugh, R., Stewart, R., Congdon, E., Giles, A.: Methods for in vitro percutaneous ab-
sorption studies i: Comparison with in vivo results. Applied Pharmacology 62, 481–488
(1982)
[11] Marzulli, F., Maibach, H.: Relevance of animal models: the hexachlorophene story. In: An-
imal Models in Dermatology. Churchill-Livingstone, Edinburgh (1975)
[12] Chow, C., Chow, A., Downie, R., Buttar, H.: Percutaneous absorption of hexachlorophene
in rats, guinea pigs and pigs. Toxicology 9, 147–154 (1978)
Protein Structural Blocks Representation
and Search through Unsupervised NN

Virginio Cantoni¹, Alessio Ferone², Ozlem Ozbudak³, and Alfredo Petrosino²

¹ University of Pavia, Department of Electrical and Computer Engineering,
Via A. Ferrata 1, 27100 Pavia, Italy
virginio.cantoni@unipv.it
² University of Naples Parthenope, Department of Applied Science,
Centro Direzionale Napoli - Isola C4, 80143 Napoli, Italy
{alessio.ferone,alfredo.petrosino}@uniparthenope.it
³ Istanbul Technical University, Department of Electronics and Communication
Engineering, Ayazaga Campus, 34469 Maslak, Istanbul, Turkey
ozbudak@itu.edu.tr

Abstract. This paper aims to develop methods for protein structural
representation and to implement structural block retrieval within a macro-
molecule or even within the entire Protein Data Bank (PDB). A first problem
to deal with is that of representing structural blocks. One proposal is to
exploit the Extended Gaussian Image (EGI), which maps onto the unit
sphere the histogram of the orientations of the object surface. In fact, we
propose to adopt a particular 'abstract' data structure, named the Protein
Gaussian Image (PGI), for representing the orientation of the protein
secondary structures (helices and sheets). The 'concrete' data structure is
the same as for the EGI; however, in this case the points of the Gaussian
sphere surface do not contain the area of the patches having that orientation,
but rather features of the secondary structures (SSs) having that direction.
Among these features we may include the versus (e.g., + origin towards
surface, or − vice versa), the length of the structure (number of amino
acids), biochemical properties, and even the sequence of amino acids (such
as in a list). We consider this representation very effective for a preliminary
screening when searching the PDB. We propose to employ the PGI in a
fold recognition problem in which the learning task is performed by means
of an unsupervised method for structured data.

1 Introduction
There are currently 80,041 (as of 13/3/2012) experimentally determined 3D
structures of proteins deposited in the Protein Data Bank (PDB) [1], with an
increment of about 700 new molecules per month. However, this set contains
many very similar (if not identical) structures. The study of structural building
blocks, their comparison and their classification, is instrumental to the study of
evolution and of functional annotation, and has brought about many methods
for their identification and classification in proteins of known
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 515–522, 2012.

© Springer-Verlag Berlin Heidelberg 2012
516 V. Cantoni et al.

structure. The procedures, often automatic or semi-automatic, for reliable
assignment are essential for the generation of the databases (especially as the
number of protein structures is increasing continuously), and the reliability and
precision of the taxonomy is a very critical subject. This is also because there is
no standard definition of what a structural motif, a domain, a family, a fold, a
sub-unit, a class, etc. really is, so that assignments have varied enormously, with
each researcher (and indeed each DB) using their own set of criteria.
There are several DBs for the structural classification of proteins; among them,
the most commonly used are the Structural Classification Of Proteins (SCOP) [13]
and Class, Architecture, Topology and Homologous superfamilies (CATH) [14].
The problem of the structural representation of a macromolecule has until now
been pursued with 'ad hoc' pattern descriptors, which are often point-based and
cumbersome to manage and process. Among these we can quote: the spin
image [5], [6], shape context [7] and harmonic shape [8]. For our purposes we
propose a new approach based on an appropriate extended Gaussian image
(EGI) [9], which represents the histogram of the surface orientations. The EGI,
introduced for applications in photometry by B.K.P. Horn [9], has been extended
by K. Ikeuchi [10], [11], [12] (the Complex-EGI). Despite the many variants of
the Gaussian Image, to our knowledge this type of representation has never been
applied to proteins; we intend to present a new version, suitable for this context,
and to employ it in a structural recognition task [4]. In particular, the PGI will
be used in a fold recognition problem in which the learning task is performed by
SOM–SD, a self-organizing map for structured data.
The remainder of the paper is as follows. In Section 2 the Protein Gaussian
Image is explained. In Section 3 a brief overview of the SOM–SD method is
given. Section 4 shows preliminary experimental results and Section 5 concludes
the paper.

2 The Protein Gaussian Image

There are several methods for defining protein secondary structure, but the
Dictionary of Protein Secondary Structure (DSSP) [2] approach is the most
commonly used. The DSSP defines eight types of secondary structure; never-
theless, the majority of secondary structure prediction methods simplify these
further to the three dominant states: helix, sheet and coil. Namely, the helices
include the 3/10 helix, α–helix and π–helix; sheets or strands include the extended
strand (in parallel and/or anti-parallel β–sheet conformation); finally, coils can
be considered just connections. In the sequel, the structural analysis for protein
recognition and comparison is conducted only on the basis of the two most
frequent components [3]: the α–helices and the β–strands. DSSP and STRIDE
both extract from the PDB the segments' 3D locations and attitudes, the positions
in the sequence of the SSs (in particular also of the strands constituting sheets),
and much other information easily integrated into the new data structure.
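The eight-to-three simplification can be written as a small lookup over the DSSP one-letter codes. The grouping below follows the text's description; assigning isolated β-bridges ('B') to the sheet state is our assumption, since this choice varies between methods.

```python
# DSSP one-letter codes -> three dominant states (grouping per the text;
# the placement of 'B' is an assumption).
DSSP_TO_3 = {
    'G': 'Helix',  # 3/10 helix
    'H': 'Helix',  # alpha-helix
    'I': 'Helix',  # pi-helix
    'E': 'Sheet',  # extended strand (parallel / anti-parallel beta sheet)
    'B': 'Sheet',  # isolated beta bridge
    'T': 'Coil',   # hydrogen-bonded turn
    'S': 'Coil',   # bend
    ' ': 'Coil',   # loop / irregular
}

def simplify(dssp_string):
    """Collapse an eight-state DSSP string to the H/S/C three-state alphabet."""
    return ''.join(DSSP_TO_3.get(c, 'Coil')[0] for c in dssp_string)
```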
Protein Structural Blocks Representation 517

The PGI is a representation on the Gaussian sphere in which each SS is mapped
to a unit vector from the origin of the sphere having the orientation of the SS.
Each point of the sphere surface contains the data (length, location of the starting
and ending residues, etc.) of the protein SSs having the corresponding orientation.
The chain sequence of SSs is recorded as a list which is mapped onto the sphere
surface.
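As a sketch, a PGI can be stored as a map from quantized orientations on the Gaussian sphere to lists of SS feature records. The angular discretization and the record fields below are illustrative assumptions, not the authors' exact layout.

```python
import numpy as np
from collections import defaultdict

def pgi_bin(direction, n_theta=18, n_phi=36):
    """Quantize a unit direction vector to a (theta, phi) cell of the sphere."""
    x, y, z = direction / np.linalg.norm(direction)
    theta = np.arccos(np.clip(z, -1.0, 1.0))       # polar angle in [0, pi]
    phi = np.arctan2(y, x) % (2.0 * np.pi)         # azimuth in [0, 2*pi)
    return (min(int(theta / np.pi * n_theta), n_theta - 1),
            min(int(phi / (2.0 * np.pi) * n_phi), n_phi - 1))

def build_pgi(secondary_structures):
    """Map each SS (direction, features) record onto its orientation cell."""
    pgi = defaultdict(list)
    for direction, features in secondary_structures:
        pgi[pgi_bin(np.asarray(direction, dtype=float))].append(features)
    return pgi
```

SSs with nearly parallel axes land in the same cell, so each cell accumulates the list of features (length, start/end residues, etc.) described in the text.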
The proposed data structure is complete (no information is lost for an analytic
analysis) and effective from a computational viewpoint (only two reference
coordinates are needed); moreover, as with needle maps for general object
representation, it effectively supports structural perception. In order to validate
the effectiveness of the PGI representation of the protein structure, we employ
an unsupervised framework for structured data in a practical structural learning
problem, where each protein is represented by a PGI.
Before showing the results, we give two practical examples of how proteins are
represented using the PGI data structure. Two motif-protein couples are shown:
1FNB and 4GCR. For the former couple, Figure 1(a) represents the molecule,
Figure 1(b) shows the PGI (with the Greek key motif highlighted in green), while
Figure 1(c) shows the PGI of the motif with the linked sequence of SSs. Figures
2(a)-2(c) show the analogous representations for the latter couple.


Fig. 1. 1(a) A picture generated by PyMOL on PDB file 1FNB rotated π/2 for format
reasons. In green the Greek key motif (residues 56-116). 1(b) Protein Gaussian Image
of protein 1FNB. Green arrows represent the Greek key motif. 1(c) Protein Gaussian
Image of Greek motif contained in protein 1FNB. Green arrows represent the Greek
key motif, while the green line shows the sequence of SS.


Fig. 2. 2(a) A picture generated by PyMOL on PDB file 4GCR rotated π/2 for format
reasons. In green the Greek key motif (residues 34-62). 2(b) Protein Gaussian Image
of protein 4GCR. Green arrows represent the Greek key motif. 2(c) Protein Gaussian
Image of Greek motif contained in protein 4GCR. Green arrows represent the Greek
key motif, while the green line shows the sequence of SS.

3 Learning Structured Data


In recent years, complex data structures have been employed to obtain a better
representation of data and possibly a better solution to a given problem. Many
methods are based on a preprocessing step that first maps the structured data
to a vector of reals, which can then be processed in the usual manner. However,
important information, such as the topological relationships between nodes, may
be lost during the preprocessing phase, so that the final result strongly depends
on the adopted algorithm. Neural methods are an example of techniques that
have evolved to handle structured data, where the original connectionist models
have been modified to process different data structures.
In order to prove the effectiveness of the PGI, we propose to perform a learning
task by employing the SOM–SD framework [15], an unsupervised method for
structured data. Throughout this section we will use the same notation as in [15].

3.1 SOM Framework for Structured Data


The SOM–SD framework can be defined using a computational framework
similar to that defined for recursive neural networks (RNN) [16]. The class of
functions which can be realized by an RNN can be characterized as the class of
functional DAG (Directed Acyclic Graph) transductions τ : I^# → R^k, which
can be represented in the form τ = g ∘ τ̂, where τ̂ : I^# → R^n is the encoding
function and g : R^n → R^k is the output function. τ̂ is defined recursively as:

  τ̂(D) = { 0 (the null vector in R^n),              if D = void graph
          { τ(y_s, τ̂(D^(1)), . . . , τ̂(D^(c))),     otherwise            (1)

where D is a DAG, y_s is the label of the supersource of D, and τ is defined as:

  τ : R^m × R^n × · · · × R^n → R^n     (2)

with the factor R^n repeated c times. A typical form for τ is:

  τ(u_v, x^(1), . . . , x^(c)) = F(W u_v + Σ_{j=1}^{c} Ŵ_j x^(j) + θ)     (3)

where F_i(v) = sgd(v_i) (a sigmoidal function), u_v ∈ R^m is a label, θ ∈ R^n is the
bias vector, W ∈ R^{n×m} is the weight matrix associated with the label space,
x^(j) ∈ R^n are the vectorial codes obtained by applying the encoding function τ̂
to the subgraphs D^(j) (i.e., x^(j) = τ̂(D^(j))), and Ŵ_j ∈ R^{n×n} is the weight
matrix associated with the j-th subgraph space. The output function g is
generally realized by a feedforward neural network (NN).
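Equations (1)–(3) amount to a bottom-up recursion over the DAG: children are encoded first, and their codes are combined with the node label. A minimal sketch follows; the node representation, matrix shapes and the tanh sigmoid are illustrative assumptions.

```python
import numpy as np

def encode(node, W, W_hat, theta, cache=None):
    """tau-hat of Eq. (1): encode the children, then combine them with the
    node label as in Eq. (3). node = (label vector, list of children)."""
    if cache is None:
        cache = {}
    if node is None:                        # void (sub)graph
        return np.zeros(theta.shape[0])     # the null vector in R^n
    if id(node) in cache:                   # a shared subgraph is encoded once
        return cache[id(node)]
    label, children = node
    s = W @ label + theta                   # contribution of the node label
    for j, child in enumerate(children):
        s = s + W_hat[j] @ encode(child, W, W_hat, theta, cache)
    x = np.tanh(s)                          # sigmoidal F, applied componentwise
    cache[id(node)] = x
    return x
```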
The key consideration in adapting this framework to the unsupervised SOM
approach [17] is that the function τ maps information about a node and its
children from a higher-dimensional space (i.e., m + c · n) to a lower-dimensional
space (i.e., n). The aim of the SOM learning algorithm is to learn a feature map

  M : I → A     (4)

which, given a vector in the input space I, returns a point in the output space A.
This is obtained in the SOM by associating each point in A with a different neuron.
Given an input vector v, the SOM returns the coordinates within A of the neuron
with the closest weight vector. Thus, the set of neurons induces a partition of the
input space I. In typical applications I ≡ R^m, where m ≥ 2, and A is given by
a two-dimensional lattice of neurons. In this way, input vectors which are close
to each other will activate neighboring neurons in the lattice. SOM–SD is an
extension of the SOM framework in which the τ function is implemented in an
unsupervised learning framework, in order to generalize (4) to the case
I ≡ Y^#(c), i.e., the input space is a structured domain with labels in Y. The
function

  M^# : Y^#(c) → A     (5)

is realized by defining (1) as shown in (6):

  M^#(D) = { nil_A,                                          if D = void graph
           { M_node(y_s, M^#(D^(1)), . . . , M^#(D^(c))),    otherwise        (6)

where s = source(D); D^(1), . . . , D^(c) are the subgraphs pointed to by the
outgoing edges leaving s; nil_A is a special coordinate vector in the discrete
output space A; and

  M_node : Y × A × · · · × A → A     (7)

with the factor A repeated c times, is a SOM, defined on a generic node, which
takes as input the label of the node and the "encoding" of the subgraphs
D^(1), . . . , D^(c) according to the M^# map. By "unfolding" the recursive
definition in (6), it turns out that M^#(D) can be computed by first applying
M_node to the leaf nodes (i.e., nodes with null outdegree), and then proceeding
with the application of M_node bottom-up from the frontier (sink) nodes to the
supersource of the graph D.
Details about the learning algorithm can be found in [15].
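The unfolding of Eq. (6) is a post-order traversal in which each child subgraph is replaced by the coordinates of its winning neuron. A sketch with a pluggable `m_node` follows; in SOM–SD this callback is the trained SOM, while here it is an arbitrary function we supply for illustration.

```python
def map_structure(node, m_node, nil=(-1, -1)):
    """Bottom-up computation of M# (Eq. 6): map the children first, then feed
    their lattice coordinates, together with the node label, to the node map.
    m_node(label, child_coords) returns the winning neuron's coordinates."""
    if node is None:                       # void graph -> nil_A
        return nil
    label, children = node
    child_coords = [map_structure(c, m_node, nil) for c in children]
    return m_node(label, child_coords)
```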

4 Experimental Results

This section shows preliminary results obtained by the SOM–SD using as input a
set of proteins represented as PGIs. In particular, for each secondary structure
only its direction is considered. The dataset is composed of 45 proteins classified
by SCOP as belonging to the class "Alpha and beta proteins (a/b)". Three folds
have been considered, namely "Flavodoxin-like", "Ribonuclease H-like motif" and
"TIM beta/alpha-barrel", and for each fold 15 proteins have been chosen. The
task consists in grouping proteins or side chains belonging to the same fold.
The network performance is reported in terms of clustering performance,
classification performance, and retrieval performance. The clustering performance
measures how well clusters are formed; this measure does not take into account
the desired clustering, i.e., it does not consider the target labels. The classification
performance measures how well the clustering corresponds to the desired clustering,
by comparing the target values of vertices mapped to the same location. The
retrieval performance is a measure of the confidence in the clustering result: if
many vertices with different targets are mapped to the same location, the confidence
in the clustering is low, hence producing a low retrieval performance. The SOM–SD
was trained on 200 × 200 neurons with a neighborhood spread of σ = 60, considering
different learning rates η = {1, 1.25, 1.5} and different numbers of iterations
λ = {40, 60, 80}. For each test the average performance over 50 runs is reported.
The dataset for each test was composed by randomly picking 70% of the patterns
for the training phase and the remaining 30% for the testing phase.
The first test was conducted considering the whole protein as a pattern,
i.e., each protein is represented by a PGI. It can be observed in Table 1 that
the results are quite good in terms of clustering performance. Even though this
measure does not take into account the desired clustering outcome, the result is
supported by the good retrieval performance, which reflects a reduced confusion
in the mappings of each pattern. The classification performance, reflecting the

Table 1. Performance of a 200 × 200 SOM–SD considering the whole proteins

                           Test Set
Learning Rate        1                    1.25                 1.5
Iterations      40    60    80      40    60    80       40    60    80
Retrieval      84.99 84.08 84.86   86.46 87.69 79.79    79.29 82.31 79.82
Classification 72.65 56.95 56.91   59.39 70.65 60.73    65.91 64.30 74.82
Clustering      0.79  0.85  0.83    0.82  0.81  0.85     0.83  0.79  0.80

performance with respect to the desired clustering outcome, shows less accurate
results, but with an interesting peak at 74.82%.
The second test was performed considering as patterns the single side chains
of each protein, i.e., each side chain is represented by a PGI. From Table 2 it
can be noted that this "reduced" representation yields better results in terms
of classification and retrieval performance, while performing slightly worse with
respect to clustering performance. In particular, the clustering performance is
almost the same, but with a higher confidence, reflected by the higher retrieval
performance. An interesting result concerns the classification performance, which
is much higher when considering only the side chains.

Table 2. Performance of a 200 × 200 SOM–SD considering the single side chains of
each protein

                           Test Set
Learning Rate        1                    1.25                 1.5
Iterations      40    60    80      40    60    80       40    60    80
Retrieval      74.39 81.67 79.63   92.72 92.35 93.40    92.36 92.34 93.85
Classification 75.58 76.37 77.64   85.11 84.14 84.17    85.03 85.78 86.42
Clustering      0.80  0.80  0.80    0.79  0.80  0.79     0.79  0.80  0.79

5 Conclusion
A new data structure has been introduced that supports both artificial and
human analysis of protein structure. Preliminary tests have been performed
employing the PGI in a structural learning task, obtaining very good and
promising results. We are now planning an intensive quantitative analysis of the
effectiveness of this new representation for practical problems such as alignment,
or even structural block retrieval, at different levels of complexity: from basic
motifs composed of a few SSs, to domains, up to units. Moreover, in future
work we will try to exploit other features besides the orientation, such as the
length of the structure in terms of the number of amino acids, biochemical
properties, the sequence of amino acids, etc.

References
1. Protein Data Bank, http://www.pdb.org/
2. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recogni-
tion of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637
(1983)
3. Eisenberg, D.: The discovery of the alpha-helix and beta-sheet, the principal struc-
tural features of proteins. Proceedings of the National Academy of Sciences USA
100, 11207–11210 (2003)
4. Cantoni, V., Ferone, A., Petrosino, A.: Protein Gaussian Image (PGI) - A Protein
Structural Representation Based on the Spatial Attitude of Secondary Structure.
New Tools and Methods for Pattern Recognition in Complex Biological Systems
(in press)
5. Shulman-Peleg, A., Nussinov, R., Wolfson, H.: Recognition of Functional Sites in
Protein Structures. J. Mol. Biol. 339, 607–633 (2004)
6. Bock, M.E., Garutti, C., Guerra, C.: Spin image profile: a geometric descriptor for
identifying and matching protein cavities. In: Proc. of CSB, San Diego (2007)
7. Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing Objects in
Range Data Using Regional Point Descriptors. In: Pajdla, T., Matas, J(G.) (eds.)
ECCV 2004. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004)
8. Glaser, F., Morris, R.J., Najmanovich, R.J., Laskowski, R.A., Thornton, J.M.: A
Method for Localizing Ligand Binding Pockets in Protein Structures. PROTEINS:
Structure, Function, and Bioinformatics 62, 479–488 (2006)
9. Horn, B.K.P.: Extended Gaussian images. Proc. IEEE 72(12), 1671–1686 (1984)
10. Kang, S.B., Ikeuchi, K.: The complex EGI: a new representation for 3-D pose
determination. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 707–721 (1993)
11. Shum, H., Hebert, M., Ikeuchi, K.: On 3D shape similarity. In: Proceedings of the
IEEE-CVPR 1996, pp. 526–531 (1996)
12. Kang, S., Ikeuchi, K.: The complex EGI, a new representation for 3D pose deter-
mination. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 707–721 (1993)
13. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural clas-
sification of proteins database for the investigation of sequences and structures. J.
Mol. Biol. 247, 536–540 (1995)
14. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thorn-
ton, J.M.: CATH–a hierarchic classification of protein domain structures. Struc-
ture 5(8), 1093–1108 (1997)
15. Hagenbuchner, M., Sperduti, A., Tsoi, A.H.: A self-organizing map for adaptive
processing of structured data. IEEE Transactions on Neural Networks 14(3), 491–
505 (2003)
16. Sperduti, A.: Knowledge-Based Neurocomputing. In: Cloete, I., Zurada, J.M. (eds.),
pp. 117–152. MIT Press, Cambridge (2000)
17. Kohonen, T.: Self-Organization and Associative Memory, 3rd edn. Springer, New
York (1990)
Evolutionary Support Vector Machines
for Time Series Forecasting

Paulo Cortez¹ and Juan Peralta Donate²

¹ Centro Algoritmi, Departamento de Sistemas de Informação,
Universidade do Minho, 4800-058 Guimarães, Portugal
pcortez@dsi.uminho.pt
http://www3.dsi.uminho.pt/pcortez
² Computer Science Department, University Carlos III of Madrid,
Av. de la Universidad 30, 28911 Leganes, Spain
jperalta@inf.uc3m.es

Abstract. Time Series Forecasting (TSF) uses past patterns of an event


in order to predict its future values and is a key tool to support decision
making. In the last decades, Computational Intelligence (CI) techniques,
such as Artificial Neural Networks (ANN) and more recently Support
Vector Machines (SVM), have been proposed for TSF. The accuracy
of the best CI model is affected by both the selection of input time
lags and the model’s hyperparameters. In this work, we propose a novel
Evolutionary SVM (ESVM) approach for TSF based on the Estimation
Distribution Algorithm to search for the best number of inputs and SVM
hyperparameters. Several experiments were conducted, using a set of six time
series from distinct real-world domains. Overall, the proposed ESVM is
competitive when compared with an Evolutionary ANN (EANN) and
the popular ARIMA methodology, while consuming less computational
effort when compared with EANN.

Keywords: Evolutionary Computation, Support Vector Machines, Time


Series, Forecasting.

1 Introduction

Forecasting the future using past data is an important tool to reduce uncertainty
and support both individual and organization decision making. In particular,
multi-step ahead predictions (e.g., issued several months in advance) are useful
to aid tactical decisions, such as planning production resources or evaluating
alternative economic strategies [2]. The field of Time Series Forecasting (TSF)
deals with the prediction of a given phenomenon (e.g., umbrella sales) based
on the past patterns of the same event. TSF has become increasingly used in
distinct areas such as Agriculture, Finance, Production or Sales [14].

The research reported here has been supported by FEDER (program COMPETE
and FCT) under project FCOMP-01-0124-FEDER-022674.

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 523–530, 2012.

© Springer-Verlag Berlin Heidelberg 2012
524 P. Cortez and J.P. Donate

Several Operational Research TSF methods have been proposed, such as Holt-
Winters (in the sixties) or the ARIMA methodology [14] (in the seventies). More
recently, several Computational Intelligence (CI) methods have been applied to
TSF, such as Artificial Neural Networks (ANN) [12,6], and Support Vector Ma-
chines (SVM) [15,9,3]. CI models such as ANNs and SVMs are natural solutions
for TSF, since they are more flexible (i.e., no a priori restriction is imposed)
when compared with classical TSF models, presenting nonlinear learning capa-
bilities. When compared with ANN, SVM presents theoretical advantages, such
as the absence of local minima in the learning phase.
When applying these CI methods to TSF, variable and model selection are
critical issues. A sliding time window is often used to create a set of training
examples from the series. A small time window will provide insufficient infor-
mation, while using a large number of time lags will increase the probability of
having irrelevant inputs. Thus, variable selection is useful to discard irrelevant
time lags, leading to simpler models that are easier to interpret and that usually
give better performances [4,9,6]. On the other hand, CI models such as ANN
and SVM have hyperparameters that need to be adjusted (e.g., number of ANN
hidden nodes or kernel parameter) [8]. Complex models may overfit the data,
losing the capability to generalize, while a model that is too simple will present
limited learning capabilities.
Several hybrid systems, which combine two or more CI techniques, have also
been proposed for TSF, such as Evolutionary ANN (EANN) [4]. Most EANNs
use the standard Genetic Algorithm (GA). More recently, the Estimation Dis-
tribution Algorithm (EDA) was proposed [13]. Such algorithm uses exploitation
and exploration properties to find a good solution. In [16], EDA was used as
the search engine of an EANN, outperforming a GA based EANN. Following
this result, in this paper we propose a novel Evolutionary SVM (ESVM) ap-
proach based on the EDA engine, in order to automatically select the best SVM
multi-step ahead forecasting model. Moreover, we also compare ESVM with the
EANN proposed in [16] and the popular ARIMA methodology.
The paper is organized as follows. First, Section 2 describes the ESVM ap-
proach. Next, Section 3 presents the experimental setup and the obtained results.
Finally, the paper is concluded in Section 4.

2 Evolutionary Support Vector Machine


2.1 Time Series Forecasting
The problem of TSF using CI models, such as ANN [16] or SVM [3], is considered as obtaining the relationship between the value y_t at period t and the values of previous elements of the time series, using several time lags {t−1, t−2, ..., t−I}, to obtain a function as shown in [16]:

ŷ_t = f(y_{t−1}, y_{t−2}, ..., y_{t−I})    (1)
where ŷt denotes the estimated forecast as given by the function f (e.g., SVM),
and I the number of input values.
ESVM for Time Series Forecasting 525

In order to obtain a single model to forecast time series values, an initial step has to be done with the original values of the time series, i.e., normalizing
the data. The original values (yt ) are normalized into the range [0, 1] (leading
to the Nt values). Once the model outputs the resulting values, the inverse
process is carried out, rescaling them back to the original scale. Only one output
was chosen and multi-step ahead forecasts are built by iteratively using 1-ahead
predictions as inputs [4]. Therefore, the time series is transformed into a pattern set depending on the I inputs of a particular model, each pattern consisting of:
– I input values: the I normalized previous values N_{t−1}, N_{t−2}, ..., N_{t−I};
– one output value: N_t (the desired target).
This pattern set will be used to train and validate each model generated during the evolutionary execution. Thus, the pattern set is split into two subsets,
training (with the first 70% elements of the series) and validation (with the
remaining ones), under a timely ordered holdout scheme.
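For illustration, the normalization, sliding-window pattern construction and timely ordered holdout described above can be sketched as follows (a minimal sketch under our reading of the text, not the authors' C implementation; all names are ours):

```python
def make_patterns(series, I):
    """Normalize a series into [0, 1] and build (inputs, target) patterns
    using a sliding time window of I lags, then split them in time order."""
    lo, hi = min(series), max(series)
    norm = [(y - lo) / (hi - lo) for y in series]          # the N_t values
    patterns = [(norm[t - I:t], norm[t]) for t in range(I, len(norm))]
    split = int(0.7 * len(patterns))                       # timely ordered holdout
    return patterns[:split], patterns[split:]              # train, validation

train, valid = make_patterns([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], I=3)
```

At forecast time, multi-step ahead predictions would then be produced iteratively, feeding each 1-ahead prediction back as an input and rescaling the outputs to the original scale at the end.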

2.2 Evolutionary Support Vector Machine Design

The problem of designing SVM for TSF can be seen as a search problem into the
space of all possible solutions. While several Evolutionary Computation methods
could be used for this search, we adopt the EDA algorithm, since it has outperformed the standard GA in our previous work [16]. When designing an ESVM,
there are three crucial issues: i) setting the solution’s space, i.e., what informa-
tion of the SVM is previously set and what is included into the chromosome; ii)
how each solution is codified into a chromosome, i.e., encoding schema; and iii)
what are we optimizing, i.e., defining the fitness function.
When designing an SVM, there are three crucial issues that should be taken into account: the type of SVM to use, the selection of the kernel function and the tuning of the parameters associated with the two previous selections. Since TSF is a
particular regression case, for the SVM type and kernel, we selected the popular
ε-insensitive loss function (known as ε-SVR) and Gaussian kernel combination,
as implemented in the LIBSVM tool [1]. In SVM regression [17], the input y = (y_{t−k_I}, ..., y_{t−k_2}, y_{t−k_1}), for an SVM with I inputs, is transformed into a high m-dimensional feature space by using a nonlinear mapping (φ) that does not need to be explicitly known but that depends on a kernel function. Then, the SVM algorithm finds the best linear separating hyperplane, tolerating a small error (ε) when fitting the data, in the feature space:

ŷ_t = w_0 + Σ_{j=1}^{m} w_j φ_j(y)    (2)

This model requires setting three parameters: γ – the Gaussian kernel parameter, exp(−γ‖x − x′‖²), γ > 0; C – a trade-off between fitting the errors and the flatness of the mapping; and ε – the width of the ε-insensitive tube.
In this paper, an evolving hybrid system that uses EDA and SVM, is adopted.
Following the suggestion of the LIBSVM authors [1], SVM parameters are

searched on an exponentially growing scale. We also take into account the number of input values of the time series (I) used to train the SVM. Therefore, we adopt a direct encoding scheme, using a numeric representation with 8 genes, according to the chromosome g_1 g_2 g_3 g_4 g_5 g_6 g_7 g_8, such that:

I = round(α · n · (10·g_1 + g_2 + 1)/100),   g_1 ∈ {0, 1, ..., 9} ∧ g_2 ∈ {0, 1, ..., 9}
γ = 2^{(g_3 + g_4/10) − 5},                  g_3 ∈ {−9, −8, ..., 9} ∧ g_4 ∈ {−9, −8, ..., 9}
C = 2^{(g_5 + g_6/10) + 5},                  g_5 ∈ {−9, −8, ..., 9} ∧ g_6 ∈ {−9, −8, ..., 9}    (3)
ε = 2^{(g_7 + g_8/10) − 8},                  g_7 ∈ {−9, −8, ..., 9} ∧ g_8 ∈ {−9, −8, ..., 9}
where g_i denotes the i-th gene of the chromosome. The input search range includes only integer numbers, due to the use of the round function, and depends on n, the length of the time series (#in-samples or training data), scaled by a constant α factor. Based on our previous research using EANN for TSF [16], we set α = 0.45. Thus, the ranges for the search space are: I depends on the series length (e.g., I ∈ {3, 7, 10, 13, 17, ..., 331} for Quebec); γ ∈ 2^[−14.9, 4.9]; C ∈ 2^[−4.9, 14.9]; and ε ∈ 2^[−17.9, 1.9].
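Under our reading of Eq. (3), decoding a chromosome into an SVM configuration can be sketched as follows (hypothetical helper; gene ranges and α = 0.45 as stated in the text):

```python
def decode(genes, n, alpha=0.45):
    """Decode a chromosome g1..g8 into (I, gamma, C, epsilon) as in Eq. (3);
    n is the number of in-samples (training data length)."""
    g1, g2, g3, g4, g5, g6, g7, g8 = genes
    I = round(alpha * n * (10 * g1 + g2 + 1) / 100)   # number of time lags
    gamma = 2 ** ((g3 + g4 / 10) - 5)                 # Gaussian kernel parameter
    C = 2 ** ((g5 + g6 / 10) + 5)                     # error/flatness trade-off
    eps = 2 ** ((g7 + g8 / 10) - 8)                   # epsilon-insensitive tube width
    return I, gamma, C, eps
```

For Quebec (n = 735), this decoding reproduces the stated range I ∈ {3, ..., 331}.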
The evolutionary process consists of the following steps:
1. First, a randomly generated population (composed of 50 individuals), i.e.,
a set of randomly generated chromosomes, is obtained.
2. The phenotypes (SVM models) and the fitness value of each individual of the current generation are obtained. This includes the following steps:
(a) The phenotype of an individual of the current generation is first obtained (using LIBSVM [1]).
(b) Then, for each model, training and validation subsets are obtained from
time series data depending on the number of inputs nodes, as it was
explained in Section 2.1.
(c) The model is fitted using the Sequential Minimal Optimization algo-
rithm, as implemented in LIBSVM. The fitness for each individual is
given by the Mean Squared Error (MSE) validation error, during the
learning process. The aim is to reduce extreme errors (e.g., outliers)
that can highly affect multi-step ahead forecasts. Moreover, preliminary
experiments (with time series not present in Table 1), have shown that
this choice leads to better forecasts when compared with absolute errors.
3. Once the fitness values for the whole population have been obtained,
UMDA-EDA (with no dependencies between variables) [16] operators (se-
lection, estimation of the empirical probability distribution and sampling
solutions) are applied in order to generate the population of the next gener-
ation, i.e., a new set of chromosomes.
4. Steps 2 and 3 are iteratively executed until a maximum number of genera-
tions is reached.
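The loop above can be sketched with a minimal UMDA-style EDA (truncation selection, per-gene empirical marginals, elitism); the fitness function stands in for the LIBSVM training and validation MSE of step 2, and all parameter choices here are illustrative:

```python
import random

def umda(fitness, gene_values, n_genes=8, pop_size=50, generations=100, seed=0):
    """Minimal UMDA-style EDA: select the better half, estimate each gene's
    empirical distribution independently (no dependencies between variables),
    and sample the next population from those marginals."""
    rng = random.Random(seed)
    pop = [[rng.choice(gene_values) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                  # lower fitness (MSE) is better
        elite = pop[:pop_size // 2]            # truncation selection
        pop = [[rng.choice([ind[g] for ind in elite]) for g in range(n_genes)]
               for _ in range(pop_size)]       # sample per-gene marginals
        pop[0] = elite[0]                      # elitism: keep the best solution
    return min(pop, key=fitness)
```

With a toy fitness such as the squared distance of each gene to a target value, the marginals concentrate quickly, which mirrors the fast convergence reported in Fig. 1.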
The parameters of the EDA (e.g., population size of 50) were set to the same
values used by the EANN proposed in [16]. Since the EDA works as a second
order optimization procedure, the tuning of its internal parameters is not a
critical issue (e.g., using a population of 48 does not substantially
change the results). Also, this EDA has a fast convergence (shown in Fig. 1).

3 Experiments and Results

3.1 Time Series

We selected a total of six time series, with different characteristics and from
distinct domains (Table 1). Five series were selected from the well-known Hyn-
dman’s time series data library repository [10]. These are named Passengers,
Temperature, Dow-Jones, Quebec and Abraham12. We also adopt the Mackey-
Glass series, which is a common nonlinear benchmark. It should be noted that these six time series were also adopted by the NN3 and NN5 forecasting competitions [5]. Except for Mackey-Glass, all datasets are from real-world domains
and such data can be affected by external issues (e.g., strikes), which make them
interesting datasets and more difficult to predict.

Table 1. Time series data

Series        Description                       #in-samples  #out-of-samples (H)
Passengers    monthly int. airline passengers       125              19
Temperature   monthly air temp. at Nottingham       206              19
Dow-Jones     monthly index closings (1968-81)      129              19
Abraham12     gasoline demand at Ontario            168              24
Quebec        daily number of births at Quebec      735              56
Mackey-Glass  chaotic series                        735              56

3.2 Evaluation

The Symmetric Mean Absolute Percentage Error (SMAPE) is given by [11]:

SMAPE = (1/H) · Σ_{t=T+1}^{T+H} |e_t| / ((|y_t| + |ŷ_t|)/2) × 100%    (4)

where e_t = y_t − ŷ_t, T is the current time period and H is the forecasting horizon, i.e., the number of multi-step ahead forecasts. SMAPE is a popular forecasting metric that has the advantage of being scale independent when compared with MSE, and thus can more easily be used to compare methods across different series; it ranges from 0% to 200%. SMAPE was also the error metric used in the NN3, NN5 and
NNGC1 forecasting competitions [5].
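Eq. (4) translates directly into code (a sketch; the function name is ours):

```python
def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error of Eq. (4), in [0%, 200%]."""
    assert len(actual) == len(forecast)
    H = len(actual)
    total = sum(abs(y - f) / ((abs(y) + abs(f)) / 2)
                for y, f in zip(actual, forecast))
    return total / H * 100.0
```

Because the error is divided by the magnitude of the values, the same function can be compared across series with very different scales.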

3.3 Results

For the comparison, we have chosen the EANN presented in [16], which is similar to ESVM except that it uses a multilayer perceptron trained with the RPROP algorithm, as implemented in the SNNS tool [18]. EANN optimizes the number of inputs (I ∈ {1, ..., 100}), the number of hidden nodes (from 0 to 99) and the RPROP parameters (Δ_0 ∈ {1, 0.01, 0.001, ..., 10^−9} and Δ_max ∈ {0, 1, ..., 99}). Both the

Table 2. Forecasting errors (%SMAPE, best values in bold) and best SVM models

Time Series   ARIMA  EANN   ESVM     I    γ        C       ε
Passengers     4.50   3.39   5.35    14   2^−4.1   2^12.7  2^−8.6
Temperature    3.42   3.51   4.42    79   2^−1.9   2^−0.5  2^−8.8
Dow-Jones      4.78   6.28   6.35    19   2^−13.3  2^10.5  2^−5.3
Abraham12      6.20   4.71   4.86    28   2^−13.1  2^14.9  2^−15.0
Quebec        10.36  10.83   9.31   301   2^−13.6  2^12.7  2^−16.3
Mackey-Glass  26.20   7.06   1.32    10   2^2.0    2^1.7   2^−16.0
Average        9.24   5.96   5.27
Median         5.49   5.49   5.11

ESVM and EANN experiments were conducted using code written in the C lan-
guage by the authors. As the stopping criterion, we used a maximum of 100
generations for ESVM and EANN. For a baseline comparison, we have also cho-
sen the ARIMA methodology, as computed by the Forecast Pro© tool [7]. The
rationale is to use a popular benchmark that can easily be compared and that
does not require expert model selection capabilities from the user. The obtained
results are shown in Table 2 (SMAPE errors and best SVM models).


Fig. 1. Evolution of the ESVM best fitness value for all series (left, includes a zoom of
the bottom left area) and example of the ESVM forecasts for Quebec (right)

An analysis of the tables shows that the proposed ESVM provides interesting forecasts when compared with EANN and ARIMA. Each forecasting approach is
the best option for two series. Yet, ESVM provides the lowest average and median
(over all series, last two rows) results. The left of Fig. 1 plots the evolution of

Table 3. Comparison of computational effort required by ESVM and EANN

Time Series   EANN time (min)  ESVM time (min)   R_t
Passengers          165               5         96.7%
Temperature         315               6         98.1%
Dow-Jones           161               3         98.2%
Abraham12           420               6         98.6%
Quebec             6603              95         98.5%
Mackey-Glass       5649             105         98.1%

the best fitness values for ESVM, showing a fast convergence of the EDA
algorithm for all series. For demonstration purposes, the right of Fig. 1 plots the
ESVM forecasts for Quebec, showing a good fit.
The experimentation was carried out with an exclusive access to a server (Intel
Xeon 2.27 GHz processor using Linux). Table 3 shows the computational time
(in minutes) required by each evolutionary method and series. The last column
shows, in percentage, the reduction of computational effort obtained by ESVM when compared with EANN, where R_t = 1 − (t_ESVM/t_EANN) and t_M is the time required by model M. As shown by the table, ESVM consumes much less computational effort for all the time series tested: a reduction rate of at least 96% is observed in all cases.

4 Conclusions

This paper proposes a novel Evolutionary Support Vector Machine (ESVM) method for multi-step ahead TSF, which automatically searches for the best number of
inputs and SVM hyperparameters. The proposed ESVM was compared against
two approaches, an Evolutionary Artificial Neural Network (EANN) and the
popular ARIMA methodology. The experiments, held using six time series from distinct domains, revealed the proposed ESVM as the best forecasting method.
Moreover, when compared with EANN, ESVM requires much less computational
effort, with a reduction rate (in terms of time elapsed) greater than 96%. Thus,
ESVM is a good tool to quickly obtain accurate forecasts and this is a useful
characteristic for several TSF domains, such as critical or control systems. In the
future, we intend to address other SVM kernels (e.g., Polynomial) and extend
the experimentation to include time series from more domains (e.g., Medical
data).

References

1. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
2. Chatfield, C.: Time-series forecasting. CRC Press (2001)

3. Cortez, P.: Sensitivity Analysis for Time Lag Selection to Forecast Seasonal Time
Series using Neural Networks and Support Vector Machines. In: Proceedings of the
International Joint Conference on Neural Networks (IJCNN 2010), pp. 3694–3701.
IEEE, Barcelona (2010)
4. Cortez, P., Rocha, M., Neves, J.: Time Series Forecasting by Evolutionary Neural
Networks. In: Artificial Neural Networks in Real-Life Applications, ch. III, pp.
47–70. Idea Group Publishing, USA (2006)
5. Crone, S.: Time series forecasting competition for neural networks and computational intelligence (2011), http://www.neural-forecasting-competition.com (accessed on January 2011)
6. Crone, S.F., Kourentzes, N.: Feature selection for time series prediction - a com-
bined filter and wrapper approach for neural networks. Neurocomputing 73, 1923–
1936 (2010)
7. Goodrich, R.L.: The Forecast Pro methodology. International Journal of Forecast-
ing 16(4), 533–535 (2000)
8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, NY (2008)
9. He, W., Wang, Z., Jiang, H.: Model optimizing and feature selecting for sup-
port vector regression in time series forecasting. Neurocomputing 72(1-3), 600–611
(2008)
10. Hyndman, R.: Time Series Data Library (2011), http://robjhyndman.com/TSDL/
(accessed on January 2011)
11. Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy.
International Journal of Forecasting 22(4), 679–688 (2006)
12. Lapedes, A., Farber, R.: Non-Linear Signal Processing Using Neural Networks:
Prediction and System Modelling. Technical Report LA-UR-87-2662, Los Alamos
National Laboratory, USA (1987)
13. Larrañaga, P., Lozano, J.A.: Estimation of distribution algorithms: A new tool for
evolutionary computation, vol. 2. Springer (2002)
14. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting methods and ap-
plications, 3rd edn. John Wiley & Sons, USA (2008)
15. Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.:
Predicting Time Series with Support Vector Machines. In: Gerstner, W., Hasler,
M., Germond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–
1004. Springer, Heidelberg (1997)
16. Peralta, J., Gutierrez, G., Sanchis, A.: Time series forecasting by evolving artificial
neural networks using genetic algorithms and estimation of distribution algorithms.
In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8
(July 2010)
17. Smola, A., Schölkopf, B.: A tutorial on support vector regression. Statistics and
Computing 14, 199–222 (2004)
18. Zell, A., Mache, N., Huebner, R., Mamier, G., Vogt, M., Schmalzl, M., Herrmann,
K.U.: SNNS (Stuttgart Neural Network Simulator). In: Neural Network Simulation
Environments, pp. 165–186 (1994)
Learning Relevant Time Points
for Time-Series Data in the Life Sciences

Frank-Michael Schleif, Bassam Mokbel, Andrej Gisbrecht, Leslie Theunissen,
Volker Dürr, and Barbara Hammer

CITEC centre of excellence, University of Bielefeld, 33615 Bielefeld, Germany
fschleif@techfak.uni-bielefeld.de

Abstract. In the life sciences, short time series with high dimensional
entries are becoming more and more popular such as spectrometric data
or gene expression profiles taken over time. Data characteristics rule out
classical time series analysis due to the few time points, and they pre-
vent a simple vectorial treatment due to the high dimensionality. In this
contribution, we successfully use the generative topographic mapping
through time (GTM-TT) which is based on hidden Markov models en-
hanced with a topographic mapping to model such data. We propose an
extension of GTM-TT by relevance learning which automatically adapts
the model such that the most relevant input variables and time points
are emphasized by means of an automatic relevance weighting scheme.
We demonstrate the technique in two applications from the life sciences.

1 Introduction
Due to improved sensor technology, many data sets occurring in the biomedical
domain are very high dimensional such as mass spectra or gene expression pro-
files. At the same time, more and more data display temporal characteristics,
e.g. when investigating the development of an organism over time or the success
of a therapy. In these scenarios, classical time series analysis cannot be applied
due to comparably few time points (often less than 10). In addition, a direct
vectorial treatment is prohibited by the high dimensionality of the data.
A few machine learning techniques exist to investigate high dimensional time
series: Topographic mapping such as the self-organizing map (SOM) is extended
by a recursive context which accounts for the temporal dynamics in the approach
[15]. A probabilistic counterpart is offered by the Generative Topographic Map-
ping Through Time (GTM-TT) which combines hidden Markov models with a
constraint mixture model induced by a low dimensional latent space. This ap-
proach is extended to better take the relevance of the feature components into
account in [13], but relying on an unsupervised model. A supervised relevance
weighting scheme which singles out relevant features in a wrapper approach
based on hidden Markov models has been proposed in [12]. In [6], a similar approach introduces class-wise constraints in the hidden Markov model. The
approach [5] deals with time series data and feature selection relying on support
vector machines in combination with a Kalman filter. In [12], applications to life

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 531–539, 2012.


science data are presented, resulting in 85% prediction accuracy on a multiple sclerosis (MS) data set, but the approach relies on strong assumptions about the
underlying HMM structure. The approach in [5] improves this result by about
3% but it results in a black box scenario without feature selection. The approach
[6] is evaluated in the same scenario achieving improved performance for the MS
data set. There is further ongoing work in this field, reflecting the high demand
for effective methods to deal with short high dimensional time series data. The
application field is not limited to the bio-medical domain [12,6,8] but covers a
broader field of applications in industry and geo-science [13,15].
GTM and SOM crucially rely on the Euclidean distance which becomes more
and more meaningless for high dimensional data and which suffers from an in-
appropriate scaling of the dimensions [11]. Because of this observation, distance
based learning has been extended to automatic relevance adaptation which auto-
matically adapts metric parameters according to given auxiliary information, see
e.g. [9,7]. In this contribution, we are interested in the question of how relevance learning can be transferred to the temporal domain, thereby weighting both features and time points of the model according to their relevance as specified
by given auxiliary information. The identification of relevant dimensions is very
important as outlined e.g. in [13,12] to obtain a better understanding of the data,
to reduce the processing complexity, and to improve the overall prediction accu-
racy. We propose a relevance learning scheme for GTM-TT and we demonstrate
the suitability of this approach in two applications from the life sciences.

2 Generative Topographic Mapping through Time

The Generative Topographic Mapping (GTM) as introduced in [4] models a data
set X with xi ∈ RD , i = 1, . . . , N by means of a mixture of Gaussians induced
by a lattice of K points wi in a low dimensional latent space which can be used
for visualization.
The lattice points are mapped via w_i ↦ t_i = y(w_i, W) to the data space, where the mapping is parametrized by W ∈ R^{m×D}; usually, a generalized linear regression model is chosen, y(w) = Φ(w) · W, with a fixed, m-dimensional set of base functions Φ given by equally spaced Gaussians. The resulting prototypes y(w_i, W) should represent the data space as accurately as possible.
Every latent point induces a Gaussian

p(x|w_i, W, β) = (β/(2π))^{D/2} · exp(−(β/2) · ‖x − y(w_i, W)‖²)    (1)

with variance β^{−1}. This gives the data distribution as a mixture of K modes, p(x|W, β) = Σ_{i=1}^{K} (1/K) · p(x|w_i, W, β). Training optimizes the data log-likelihood Σ_{n=1}^{N} ln[ Σ_{i=1}^{K} p(w_i) · p(x_n|w_i, W, β) ] by means of an expectation maximization (EM) approach with respect to the parameters W and β.
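For concreteness, the mixture log-likelihood that EM maximizes can be evaluated as follows (a sketch over already mapped prototypes t_i = y(w_i, W), with equal mixing weights 1/K; names are ours, not the original implementation):

```python
import math

def gtm_log_likelihood(X, prototypes, beta):
    """Log-likelihood of data X under K equally weighted spherical Gaussians
    with centres `prototypes` (the mapped lattice points) and variance 1/beta."""
    D, K = len(X[0]), len(prototypes)
    log_norm = (D / 2) * math.log(beta / (2 * math.pi))   # Gaussian normalizer
    ll = 0.0
    for x in X:
        comps = [log_norm - (beta / 2) * sum((a - b) ** 2 for a, b in zip(x, t))
                 for t in prototypes]
        m = max(comps)                                    # log-sum-exp trick
        ll += m + math.log(sum(math.exp(c - m) for c in comps)) - math.log(K)
    return ll
```

The log-sum-exp formulation keeps the computation stable when β is large and the squared distances dominate.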
The GTM through time (GTM-TT) [3] extends the topographic mapping to time series data of the form x = (x(1), ..., x(T)) ∈ (R^D)^∗, where T ≥ 1 is the length of the time series. A data point of the training set is referred to by x_i. Consecutive entries x_i(t) and x_i(t+1) are strongly correlated. While the space
of observations over time is represented by a topographic mapping as before, the
temporal dependencies are modeled by a hidden Markov model (HMM) with
hidden states characterized by the lattice points wi .
The HMM is parametrized by initial state probabilities π = (π_j)_{j=1}^{K}, where π_j = p(z(1) = w_j), and transition probabilities P = (p_ij)_{i,j=1}^{K}, where p_ij = p(z(t) = w_j | z(t−1) = w_i). The data probability is p(x|Θ) = Σ_{z ∈ {w_1,...,w_K}^T} p(x, z|Θ) with parameters Θ = (W, β, π, P), the conditional probability p(x(t)|z(t)) := p(x(t)|z(t), W, β) as before (1), and p(x, z|Θ) = p(z(1)) · Π_{t=2}^{T} p(z(t)|z(t−1)) · Π_{t=1}^{T} p(x(t)|z(t)) for any sequence z of hidden states [4].
As for HMMs, a forward-backward procedure allows to determine the hidden
parameters, the responsibilities of states for a given sequence, in an efficient way
[16], based on which the parameters W and β can be determined as before.
We obtain the probability of being in state w_k at time t, given the observation sequence x_n:

r_k^n(t) = p(z(t) = w_k | x_n, Θ) = A_kt · B_kt / p(x_n|Θ)    (2)

with forward variables A_kt = p(x_n(1) ... x_n(t), z(t) = w_k | Θ) and backward variables B_kt = p(x_n(t+1) ... x_n(t_n), z(t) = w_k | Θ).
For an input time series x_n(1) ... x_n(T), GTM-TT gives rise to a time series of responsibilities r_k^n(1) ... r_k^n(T) of neuron k. Based on these responsibilities, a winner can be determined for every time step t as the neuron argmax_k r_k^n(t).
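The responsibilities of Eq. (2) follow from a standard forward-backward pass, which can be sketched as below (an unscaled variant for readability; a practical implementation would rescale A and B to avoid underflow on long series; names are ours):

```python
import math

def responsibilities(obs_logprob, pi, P):
    """Forward-backward responsibilities r_k(t) = p(z(t) = w_k | x, Theta).
    obs_logprob[t][k] is log p(x(t) | z(t) = w_k); pi holds the initial state
    probabilities and P the transition matrix of the HMM."""
    T, K = len(obs_logprob), len(pi)
    A = [[0.0] * K for _ in range(T)]      # forward variables A_kt
    B = [[1.0] * K for _ in range(T)]      # backward variables B_kt
    for k in range(K):
        A[0][k] = pi[k] * math.exp(obs_logprob[0][k])
    for t in range(1, T):
        for k in range(K):
            A[t][k] = sum(A[t - 1][i] * P[i][k] for i in range(K)) \
                      * math.exp(obs_logprob[t][k])
    for t in range(T - 2, -1, -1):
        for k in range(K):
            B[t][k] = sum(P[k][i] * math.exp(obs_logprob[t + 1][i]) * B[t + 1][i]
                          for i in range(K))
    px = sum(A[T - 1][k] for k in range(K))   # p(x | Theta)
    return [[A[t][k] * B[t][k] / px for k in range(K)] for t in range(T)]
```

The winner per time step is then the argmax over each returned row of responsibilities.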
Based on this observation, a supervised variant of GTM-TT (SGTM-TT) can
be determined as follows: Assume that the time series x is equipped with label
information l, which is an element of a finite set of different labels 1, ..., L. Then, we
train a separate GTM-TT for every class, whereby the models are coupled by
choosing the same bandwidth β and the same underlying topological structure
in the latent space, i.e. the same base functions Φ and prototypes wi . The pa-
rameters Wl are trained individually for every model representing label l. The
same holds for the initial state probability πl and the transition probabilities Pl .
When processing a novel time series x, we thus obtain L time series of responsibilities according to the labels. We denote the responsibility of model l for input x at time point t by r_l^k(x(t)). This gives rise to an aggregated value of responsibilities r_l(x) := Σ_{k=1}^{K} Σ_{t=1}^{T} r_l^k(x(t))/(K·T). One can pick the label as the value l for which this quantity is largest. However, to take optimum prior class probabilities into account, we use an additional linear classifier with inputs given by the vectors (r_l(x))_{l=1}^{L}, which is trained using a standard SVM.

3 Relevance Learning for SGTM-TT

The principle of relevance learning has been introduced in [9] as a particularly simple and efficient method to adapt the underlying metric of prototype-based

classifiers according to the given situation at hand. Besides an improved data representation, it allows one to interpret the relevance of the considered features
to the given task. Here, we propose two different relevance weighting schemes:
relevance learning of the input dimensions to change the topographic mapping
according to the given class labels, and relevance learning of the time points to
improve the interpretability of the results.

Relevance Adaptation of the Features: The squared Euclidean metric used to describe the data is substituted by the weighted form

d_λ(x, t) = Σ_{d=1}^{D} λ_d² (x_d − t_d)².    (3)

Relevance learning for GTM has been introduced in [7] for i.i.d. data. For SGTM-
TT, a few modifications are necessary. We use the weighted metric (3) to define
the Gaussians (1). This gives rise to a data log-likelihood which takes into ac-
count the dimensions according to their relevance and, hence, a topographic
mapping which mirrors the relevance weighting scheme.
The question is how to set the relevance parameters λ in such a way that the classification accuracy of the resulting mapping is as high as possible. We proceed
as in [7] and train the relevance parameters based on priorly given class informa-
tion in a separate step which is interleaved with the standard adaptation of the
SGTM-TT. We rely on the cost function as introduced in generalized learning
vector quantization which refers to the hypothesis margin of the classifier [14]:
E(λ) = Σ_n sgd( (d_λ(x_n, t⁺) − d_λ(x_n, t⁻)) / (d_λ(x_n, t⁺) + d_λ(x_n, t⁻)) )    (4)

where t⁺ corresponds to the closest prototype with a correct label, whereas t⁻ corresponds to the closest prototype with an incorrect label, given input x_n; sgd is the logistic function. For SGTM-TT, both data points x_n and prototypes t are
time series, the latter given by the winning prototypes of the GTM-TT model per
time step. Therefore, we use a metric dλ which constitutes a sum of the functional
metric of time series components as proposed by Lee and Verleysen in [10], taking
the relevance weights as parameters. Since this metric is differentiable, we can
optimize this objective by means of a gradient technique.
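For a single summand of Eq. (4), the gradient with respect to the relevance weights λ_d can be sketched as follows (simplified to plain vectors under the weighted metric of Eq. (3), rather than the functional time-series metric of Lee and Verleysen used in the paper; names are ours):

```python
import math

def sgd(x):
    """Logistic function used in the GLVQ cost."""
    return 1.0 / (1.0 + math.exp(-x))

def relevance_gradient(x, t_plus, t_minus, lam):
    """Gradient of one summand sgd(mu) of Eq. (4) w.r.t. the relevance weights
    lambda_d, with mu = (d+ - d-)/(d+ + d-) and the weighted metric of Eq. (3)."""
    dp = sum(l**2 * (a - b)**2 for l, a, b in zip(lam, x, t_plus))
    dm = sum(l**2 * (a - b)**2 for l, a, b in zip(lam, x, t_minus))
    mu = (dp - dm) / (dp + dm)
    s = sgd(mu)
    outer = s * (1 - s)                       # derivative of the logistic
    grad = []
    for l, a, bp, bm in zip(lam, x, t_plus, t_minus):
        ddp = 2 * l * (a - bp)**2             # d d+ / d lambda_d
        ddm = 2 * l * (a - bm)**2             # d d- / d lambda_d
        dmu = (2 * dm * ddp - 2 * dp * ddm) / (dp + dm)**2
        grad.append(outer * dmu)
    return grad
```

A gradient step then decreases λ_d for dimensions that pull the input toward the wrong prototype and increases it for discriminative ones.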

Relevant Time Points: Since SGTM-TT relies on HMMs, every time point
depends on its predecessor only. Thus, it is not reasonable to adapt the relevance
of time points to obtain a better representation of data in the GTM-TT models.
However, it is reasonable to judge the relevance of time points resulting from
the GTM-TT models for the final classification, in particular if time series are of
the same or a similar length. This method offers insights into which time points
are particularly discriminative for the given task at hand.
We obtain a relevance profile in the following way: denote by r_l(x(t)) := Σ_{k=1}^{K} r_l^k(x(t))/K the accumulated responsibility of the GTM-TT model l for data point x_n at time point t. Based on this value, a classification can be based on the maximum responsibility r_l(x(t)) at time point t. For every time point
t, we simply count the number of data points which are classified correctly as
belonging to class l based on the classification for time point t only, averaged
over all data. A global relevance profile results thereof as sum over all labels.
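The per-time-point relevance profile can be sketched as follows (hypothetical data layout: resp[n][l][t] holds the accumulated responsibility r_l(x_n(t)) of model l for sequence n at time t):

```python
def time_relevance_profile(resp, labels):
    """For every time point t, count how many sequences are classified
    correctly by the maximum responsibility at t alone; return the
    per-time-point accuracies (the relevance profile)."""
    T = len(resp[0][0])
    profile = []
    for t in range(T):
        correct = sum(
            1 for n, label in enumerate(labels)
            if max(range(len(resp[n])), key=lambda l: resp[n][l][t]) == label
        )
        profile.append(correct / len(labels))
    return profile
```

Peaks in the returned profile mark the time points that are most discriminative for the classification task.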

4 Experiments

We consider two data sets from the biomedical domain:

Multiple Sclerosis Data: The multiple sclerosis (MS) data set is taken from
[2] (IBIS) in the prepared form, given in [6]. The data are taken from a clinical
study analyzing the response of MS patients to the treatment. Blood samples enriched with mono-nuclear cells from 52 relapsing-remitting MS patients were
obtained 0, 3, 6, 7, 12, 18 and 24 months after initiation of IFNβ therapy. This
resulted in 7 measurements over 2 years on average. Expression profiles were ob-
tained using one-step kinetic reverse-transcription PCR over 70 genes selected
by the specialists to be potentially related to IFNβ treatment. Overall, 8% of
the measurements were missing due to patients missing the appointments. After
the two year endpoint, patients were classified as either good or bad responders,
depending on strict clinical criteria. Bad responders were defined as having suf-
fered two or more relapses or having a confirmed increase of at least one point on
the expanded disability status scale (EDSS). From 52 patients, 33 were classified
as good and 19 as bad responders, see [2].
We use an SGTM-TT with 9 hidden states and 4 basis functions. A 4-fold cross-
validation with 5 repetitions is used. We compare the results with the general
HMM classifier (HMM-Lin) and the discriminative HMM classifier (HMM-Disc-Lin) proposed in [12]. We also included the results of [2], who originally proposed
the MS study, the analysis of [1], employing a Kalman Filter combined with
an SVM approach and [6] proposing a semi-supervised analysis coupled with a
wrapper and cut-off technique to identify discriminating features.
In Table 1 we summarize the prediction results for the MS data set in com-
parison to the results given in [2]. As expected, results improve by integration of

Table 1. Prediction accuracies (test data) for different models using the MS data.
Improved prediction accuracy employing relevance learning is observed.

Method        Number of genes   Test accuracy (%)
SGTM-TT              70           85.66 ± 8.3
SGTM-TT-R             7           93.43 ± 5.8
IBIS                  3           74.20
Kalman-SVM            -           87.80
Lin-Best              7           85.00
Costa-Best           17           92.70 ± 6.1
536 F.-M. Schleif et al.

Fig. 1. Relevance profile as obtained using SGTM-TT with relevance learning. The
plot shows the average relevance (blue/dark), minimal relevance (green/bright) and
the standard deviation, flipped to the negative part of the relevance axis. [Figure:
mean, minimum and −std relevance over the 70 genes; labeled peaks include MAP3K1,
NfKBiB, Caspase 10, IRF8, RIP, Jak2, Flip, IL-4Ra, Stat4 and BAX.]


Fig. 2. Time point relevance profiles for the MS data (top) and the insect data set
(bottom). For the insect profile one can clearly identify a peak in the first third of the
experiment, which is when the insects climbed the first of two steps. For the MS data,
the most relevant time points are t = 2 and t = 7, with relevances of ≈ 0.75 and ≈ 0.87;
this may support a prognosis of the therapy outcome already at the second time point.

relevance learning compared to the full feature set. Overall, the SGTM-TT with
relevance learning achieves an accuracy of 93.43%, which is comparable to the
best reported model but relies on a smaller number of features. Furthermore,
the integrated relevance learning avoids multiple time-consuming runs within a
wrapper approach like those used in [12,6]. The obtained relevance profile is
depicted in Figure 1 and provides direct access to an interpretation of the
relevant features, or marker candidates, pruning irrelevant or noisy dimensions.
The five most significant genes found by relevance learning cover three genes
found by [2] and four genes found by [12,6].
Analyzing the relevance of time points (Fig. 2) for the MS data, we observe
a peak at time point two, indicating that a partial prognosis of the therapy
outcome is possible already at an early stage of the therapy.

Insect Locomotion Data: We investigated motion-captured whole-body kinematics
of insect locomotion, using data from stick insects (Carausius morosus).
69 sequences were recorded in two walking conditions: a straight walk (class 1,
36% of the data) and a climbing task (class 2, 64% of the data; climbing consists
of two consecutive steps of 48 mm height each). Every time step is characterized
by 36 joint angles expressed in local coordinate systems of 18 leg- and 3 thorax-
segments. Sequences were down-sampled from 200 Hz to 20 Hz and normalized
to a standard length of 800 time steps per sequence.

Fig. 3. Relevance profile of the joint angle features of the insect data. [Figure:
horizontal bar chart of relevance values (0 to 0.35) for the 36 joint-angle features:
Head Y/Z; for each leg L1–L3 and R1–R3 the cox Z/X/Y, fem Y and tib Y angles;
and T1/T2 Y/Z.]
An analysis of the insect data with SGTM-TT with relevance learning, 9
latent points and 4 basis functions results in a prediction accuracy of 91%
in a 4-fold cross-validation. Relevance learning enables a 3% increase of the
accuracy compared to the plain SGTM-TT with 88% accuracy. The most relevant
features are shown in Figure 3.
It is clearly visible that the pitch angle of the first thorax segment (T1-Y)
and the levation angles of the left legs are emphasized (cox-Y and fem-Y
act synergistically). These angles display a much stronger variance under the
climbing condition (class 2). The relevance profile of time points (see
Fig. 2) indicates that the most relevant time range occurs in the first third of
the dynamics, corresponding to the first ascension.

5 Conclusion

We have presented a novel approach for the analysis of short temporal sequences.
It is based on the idea of introducing supervision and relevance learning into
the Generative Topographic Mapping through time. Our results show that we are
able to achieve improved or similar performance compared to alternative methods in the
literature for a typical biomedical data set. In addition, the prototype concept
of the underlying method permits a direct inspection of the model and extended
visualization performance. We also obtain a direct ranking of the individual
features employing the relevance profile as well as a ranking of the relevance of
time points. This information opens the way towards a more detailed analysis
of relevant parts of the data set and the resulting model.

Acknowledgment. The authors thank: Peter Tino, University of Birmingham, for


interesting discussions about probabilistic modeling and Falk Altheide, University of
Bielefeld, and Tien-ho Lin, Carnegie Mellon University, USA for support with the sim-
ulation data. We would also like to thank Ivan Olier, University of Manchester,
UK; Iain Strachan, AEA Technology, Harwell, UK and Markus Svensen, Microsoft
Research, Cambridge, UK for providing code for GTM and GTM-TT.

Funding: This work was supported by the DFG project HA2719/4-1 to BH, by the
EU project EMICAB (FP7-ICT, No. 270182) to VD, and by the Cluster of Excellence
277 CITEC funded in the framework of the German Excellence Initiative.

References

1. Altman, R.B., Murray, T., Klein, T.E., Dunker, A.K., Hunter, L. (eds.): Biocom-
puting 2006, Proceedings of the Pacific Symposium, Maui, Hawaii, USA, January
3-7. World Scientific (2006)
2. Baranzini, S.E., Mousavi, P., Rio, J., Caillier, S.J., Stillman, A., Villoslada, P., Wy-
att, M.M., Comabella, M., Greller, L.D., Somogyi, R., Montalban, X., Oksenberg,
J.R.: Transcription-based prediction of response to IFNβ using supervised computa-
tional methods. PLoS Biol. 3(1), e2 (2004),
http://dx.doi.org/10.1371%2Fjournal.pbio.0030002
3. Bishop, C.M.: GTM through time. In: IEE Fifth International Conference on Arti-
ficial Neural Networks, pp. 111–116 (1997)
4. Bishop, C.M., Svensén, M., Williams, C.K.I.: GTM: The generative topographic
mapping. Neural Computation 10(1), 215–234 (1998)
5. Borgwardt, K.M., Vishwanathan, S.V.N., Kriegel, H.P.: Class prediction from time
series gene expression profiles using dynamical systems kernels. In: Altman, et al.
[1], pp. 547–558
6. Costa, I.G., Schönhuth, A., Hafemeister, C., Schliep, A.: Constrained mixture es-
timation for analysis and robust classification of clinical time series. Bioinformat-
ics 25(12) (2009)
7. Gisbrecht, A., Hammer, B.: Relevance learning in generative topographic mapping.
Neurocomputing 74(9), 1359–1371 (2011)
8. Hafemeister, C., Costa, I.G., Schönhuth, A., Schliep, A.: Classifying short gene
expression time-courses with bayesian estimation of piecewise constant functions.
Bioinformatics 27(7), 946–952 (2011)
9. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization.
Neural Networks 15(8-9), 1059–1068 (2002)

10. Lee, J., Verleysen, M.: Generalizations of the lp norm for time series and its applica-
tion to self-organizing maps. In: Cottrell, M. (ed.) 5th Workshop on Self-Organizing
Maps, vol. 1, pp. 733–740 (2005)
11. Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer (2010)
12. Lin, T., Kaminski, N., Bar-Joseph, Z.: Alignment and classification of time series
gene expression in clinical studies. In: ISMB, pp. 147–155 (2008)
13. Olier, I., Vellido, A.: Advances in clustering and visualization of time series using
gtm through time. Neural Networks 21(7), 904–913 (2008)
14. Schneider, P., Biehl, M., Hammer, B.: Distance learning in discriminative vector
quantization. Neural Computation 21, 2942–2969 (2009)
15. Strickert, M., Hammer, B.: Merge SOM for temporal data. Neurocomputing 64,
39–72 (2005)
16. Welch, L.R.: Hidden Markov Models and the Baum-Welch Algorithm. IEEE Infor-
mation Theory Society Newsletter 53(4) (December 2003),
http://www.itsoc.org/publications/nltr/it_dec_03final.pdf
A Multivariate Approach to Estimate
Complexity of FMRI Time Series

Henry Schütze 1,2, Thomas Martinetz 1, Silke Anders 2, and Amir Madany Mamlouk 1

1 Institute for Neuro- and Bioinformatics,
2 Department of Neurology and Neuroimage Nord,
University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
schuetze@inb.uni-luebeck.de

Abstract. Modern functional brain imaging methods (e.g. functional


magnetic resonance imaging, fMRI) produce large amounts of data. To
adequately describe the underlying neural processes, data analysis meth-
ods are required that are capable to map changes of high-dimensional
spatio-temporal patterns over time. In this paper, we introduce Multi-
variate Principal Subspace Entropy (MPSE), a multivariate entropy ap-
proach that estimates spatio-temporal complexity of fMRI time series.
In a temporally sliding window, MPSE measures the differential entropy
of an assumed multivariate Gaussian density, with parameters that are
estimated based on low-dimensional principal subspace projections of
fMRI images. First, we apply MPSE to simulated time series to test how
reliably it can differentiate between state phases that differ only in their
intrinsic dimensionality. Secondly, we apply MPSE to real-world fMRI
data of subjects who were scanned during an emotional task. Our find-
ings suggest that MPSE might be a valid descriptor of spatio-temporal
complexity of brain states.

Keywords: spatio-temporal complexity estimation, multivariate entropy,


fMRI data.

1 Introduction

Traditionally, data analysis in fMRI (functional magnetic resonance imaging) has


heavily relied on mass univariate approaches that consider the time course of
each volume element (voxel) in isolation [4]. However, these methods may fail to
detect neural processes that lead to significant changes in widely distributed and
heavily interconnected neural networks, but are too subtle to cause detectable
changes in the local signal. Hence, multivariate approaches seem indispensable
in thorough fMRI data analysis.
Dhamala et al. (2002) used the correlation dimension to estimate spatio-
temporal complexity of brain activity in a task-driven fMRI study. The authors

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 540–547, 2012.

c Springer-Verlag Berlin Heidelberg 2012
Multivariate Complexity of FMRI Time Series 541

showed that their measure of complexity was directly related to task difficulty
and mental load during task performance [3]. However, the measure proposed
by Dhamala et al. (2002) requires large sample sizes and depends on manual
inspection steps, two requirements that constrain the usefulness of this measure
for the analysis of fMRI data.
Here, we propose a new method to estimate spatio-temporal complexity of
brain states that is based on multivariate entropy. Unlike in [3], our aim is to
describe changes of global brain states over time rather than globally describe
complexity. One common assumption is that complexity is strongly related to
information content. Hence, entropy functions, which are by definition informa-
tion measures [7], are well-suited candidates to estimate complexity. An intuitive
approach to estimate complexity of brain states would be to compute the multi-
variate differential entropy with respect to the corresponding functional images.
Unfortunately, we do not know the probability density function (pdf) that is nec-
essary for this estimation. If we assume that the functional images are Gaussian
distributed multivariate samples, the differential entropy can be computed by
evaluating a closed-form expression. However, estimating the required Gaussian
distribution parameters from high-dimensional, low sample size data prohibits a
straightforward entropy computation.
In the following, we will introduce Multivariate Principal Subspace Entropy
(MPSE) and show how it is derived from the differential entropy of multivariate
Gaussian distributed data. Then, we apply MPSE to a data set simulated with
a simple model to illustrate the main characteristics of MPSE. Subsequently, we
present results that were obtained by applying MPSE to task-driven fMRI time
series. Finally, we aim to explain our empirical findings by our simple model.

2 Methods
Let X = [x1 , ..., xn ] ∈ R^{d×n} denote a data matrix representing an fMRI time se-
ries. Each column xt of X represents an fMRI image corresponding to a discrete
time index t ∈ {1, ..., n}. In order to obtain spatio-temporal complexity estimates
at individual time indices, a temporally sliding window of an odd size w ∈ N+
is employed so that 1 < w < n. Let X_τ = [x_{τ−⌊w/2⌋}, ..., x_{τ+⌊w/2⌋}] ∈ R^{d×w}
denote the data matrix that columnwise contains the w windowed images corresponding
to a fixed central window position τ ∈ T_w = {⌈w/2⌉, ..., n − ⌊w/2⌋}. The
corresponding sample covariance matrix Ĉ_{X_τ} is defined as

    Ĉ_{X_τ} = (1/(w−1)) Σ_{i=τ−⌊w/2⌋}^{τ+⌊w/2⌋} (x_i − μ̂_{X_τ})(x_i − μ̂_{X_τ})^T ,    (1)

where the sample mean μ̂_{X_τ} is defined as

    μ̂_{X_τ} = (1/w) Σ_{i=τ−⌊w/2⌋}^{τ+⌊w/2⌋} x_i .    (2)
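The windowed estimates (1) and (2) can be sketched as follows; this is a minimal illustration with our own variable names, not code from the paper:

```python
import numpy as np

def window_stats(X, tau, w):
    """Sample mean (2) and covariance (1) over the w columns of X
    centered at 0-based index tau (w odd)."""
    h = w // 2
    X_tau = X[:, tau - h : tau + h + 1]      # windowed data, shape (d, w)
    mu = X_tau.mean(axis=1, keepdims=True)   # sample mean, eq. (2)
    Z = X_tau - mu
    C = (Z @ Z.T) / (w - 1)                  # sample covariance, eq. (1)
    return mu, C

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))            # d = 50 voxels, n = 20 images
mu, C = window_stats(X, tau=10, w=5)
# With only w = 5 samples in d = 50 dimensions, C is rank deficient:
# rk(C) <= w - 1 = 4 < d
```

Because w ≪ d in fMRI settings, the covariance is necessarily rank deficient, which motivates the subspace construction that follows in the text.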

Since an fMRI time series usually comprises much fewer images (samples) than
voxels (dimensions), a sample covariance matrix Ĉ_{X_τ} is necessarily rank defi-
cient, i.e. rk(Ĉ_{X_τ}) ≤ w − 1 < d. Hence, the eigenvalue spectrum Λ(Ĉ_{X_τ}) =
(λ1 , ..., λd ) contains at least d − (w − 1) zero eigenvalues.
The aim of the current paper is to estimate the spatio-temporal complexity
of a given data matrix X_τ by employing multivariate differential entropy. It is
assumed that the samples X_τ are observations drawn from a continuous random
vector x ∈ R^d that has some pdf p. The corresponding differential entropy H[p]
is defined as

    H[p] = − ∫_{x∈R^d} p(x) ln p(x) dx .

If we assume that p is a Gaussian pdf with a mean vector μ and a covariance
matrix C, i.e. p(x) is given by N(x | μ, C), then the corresponding differential
entropy is given by the closed-form expression (see e.g. [2])

    H[p] = (1/2) ln |C| + (d/2) (1 + ln(2π)) .    (3)

Note that (3) essentially depends on the determinant |C| and thereby on the
eigenvalue spectrum Λ(C), since |C| = ∏_{λj ∈ Λ(C)} λj . In order to compute (3),
the true Gaussian density parameters μ and C have to be estimated. This can
be done by computing the maximum likelihood estimates of μ and C based on
the data matrix X_τ, which are given by (2) and (1) [2]. As mentioned above, the
sample covariance matrix Ĉ_{X_τ} is rank deficient. Simply setting C = Ĉ_{X_τ} in
(3) would lead to a singularity, since |Ĉ_{X_τ}| = 0 and lim_{x→0} ln x = −∞. In order
to elude the singularity we evaluate (3) in a lower, k-dimensional principal
subspace of X_τ, chosen so that k = rk(Ĉ_{X_τ}).
Let X̃^k_τ ∈ R^{k×w} denote the data matrix containing the k-dimensional princi-
pal subspace projections of the mean-subtracted samples of X_τ, i.e.

    X̃^k_τ = U_k(Ĉ_{X_τ})^T (X_τ − μ̂_{X_τ} 1_w^T) ,

where U_k(Ĉ_{X_τ}) = [u1 , ..., uk] ∈ R^{d×k} denotes the matrix that contains the unit
eigenvectors of Ĉ_{X_τ} corresponding to its k largest eigenvalues. By construction,
the sample covariance matrix Ĉ_{X̃^k_τ} is diagonal and contains the k leading
eigenvalues of Ĉ_{X_τ} as diagonal elements. Note that diag(Ĉ_{X̃^k_τ}) = Λ(Ĉ_{X̃^k_τ}).
Since k is chosen so that k = rk(Ĉ_{X_τ}), the rank sufficiency of Ĉ_{X̃^k_τ} follows,
i.e. rk(Ĉ_{X̃^k_τ}) = k. Let x̃^k ∈ R^k denote the random vector representing the k-
dimensional principal subspace projection of x subject to U_k(Ĉ_{X_τ}). Then the
pdf q(x̃^k) is given by N(x̃^k | 0_k, Ĉ_{X̃^k_τ}) and the corresponding differential
entropy is

    H[q] = (1/2) ln |Ĉ_{X̃^k_τ}| + (k/2) (1 + ln(2π)) .    (4)

Note that the differential entropy (4) is well-defined due to the rank sufficiency
of Ĉ_{X̃^k_τ}. We call this quantity Multivariate Principal Subspace Entropy (MPSE).
MPSE(X_τ) can be computed as follows:

    MPSE(X_τ) = H[q] = (1/2) ln ∏_{j=1}^{k} λj + (k/2) (1 + ln(2π))

                     = (1/2) Σ_{j=1}^{k} ln λj + (k/2) (1 + ln(2π)) .    (5)

A multivariate Gaussian distribution is a unimodal distribution, i.e. its entropy


depends on its generalized variance, which is given by the determinant of the
covariance matrix (see e.g. [6]). If the generalized variance in the principal sub-
space increases (decreases), MPSE given by (5) also increases (decreases). Note
that in contrast to the discrete Shannon entropy the differential entropy and
thus MPSE is not bounded and not necessarily nonnegative [2].
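The last remark is easy to verify with the closed form (3): a sufficiently concentrated Gaussian has negative differential entropy. A small numerical check (the helper name is ours):

```python
import numpy as np

def gaussian_entropy(C):
    """Closed-form differential entropy (3) of N(mu, C)."""
    d = C.shape[0]
    return 0.5 * np.log(np.linalg.det(C)) + 0.5 * d * (1.0 + np.log(2 * np.pi))

h_wide = gaussian_entropy(np.eye(2))           # |C| = 1: entropy > 0
h_narrow = gaussian_entropy(0.01 * np.eye(1))  # variance << 1/(2*pi*e): entropy < 0
```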

3 Data
3.1 Simulated Data
To assess the behavior of the spatio-temporal complexity estimator MPSE, we
simulated a number of d-variate time series that comprise phases with states of
different complexity. To this end, we modeled temporally alternating off-state
phases (off-state a) and more complex on-state phases (on-state b). We assume
that these two phases are represented by distinct temporally changing processes
that lie in orthogonal subspaces of different intrinsic dimensionalities da , db ∈
N+ , and also that the on-state phase process is higher dimensional than the
off-state phase process, i.e. db > da .
Let X_a ∈ R^{da×n} (X_b ∈ R^{db×n}) be randomly generated by drawing n ∈ N+
samples from the da-variate (db-variate) Gaussian distribution N(0_{da}, I_{da})
(N(0_{db}, I_{db})). We furthermore introduce two binary task functions s_a, s_b ∈
{0, 1}^n that are used to model the time course of off-state and on-state phases.
Finally, a matrix V is used to embed the randomly generated low-dimensional
samples into a much higher, d-dimensional target space. Let V ∈ R^{d×(da+db)} be a
randomly generated projection matrix consisting of (da + db) pairwise orthonor-
mal d-dimensional vectors, i.e. V^T V = I_{(da+db)}. Hence, an entire simulated time
series X is generated as

    X = V [ X_a ⊙ (1_{da} s_a^T) ; X_b ⊙ (1_{db} s_b^T) ] ,

where ⊙ denotes the element-wise matrix product operator and [· ; ·] stacks the
two blocks row-wise.
We chose a target space dimensionality of d = 5000. For each simulated time
series we modeled five off-state phases of length na ∈ N+ and four on-state phases
of length nb ∈ N+. Two settings for the difference in dimensionality between state b
and state a were chosen: (1) a rather large difference (da = 20, db = 60) and
(2) a rather small difference (da = 55, db = 60).

In the first simulation setting (exclusive setting), we assume that both states
a and b are temporally exclusive, i.e. either off-state a or on-state b is active. For
this setting, simulation parameters are chosen as follows: na = 15, nb = 10 (i.e.
n = 5na + 4nb = 115). In the second simulation setting (transition setting), we
additionally model a transition phase c for each alternation of states. A transition
phase is modeled as the concurrency of off-state a and on-state b, i.e. there are
short phases of length nc ∈ N+ in which both states are simultaneously active.
In other words, a transition phase does not enforce the states to be temporally
exclusive, and reflects that one process shuts down with a certain delay while
another process already becomes active. Note that the dimensionality of this
transition phase is dc = da + db. For this setting, we chose na = 13, nb = 8 and
nc = 2 (i.e. n = 5na + 4nb + 8nc = 113).
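The exclusive setting can be sketched as follows; variable names are ours, and this is a minimal, non-optimized illustration of the generative model:

```python
import numpy as np

rng = np.random.default_rng(2)
d, da, db, na, nb = 5000, 20, 60, 15, 10
n = 5 * na + 4 * nb                           # 115 time steps

# Binary task functions: four on-state phases of length nb between
# five off-state phases of length na
sb = np.zeros(n)
for p in range(4):
    start = (p + 1) * na + p * nb
    sb[start:start + nb] = 1.0
sa = 1.0 - sb                                 # exclusive setting: a or b active

Xa = rng.standard_normal((da, n))             # off-state process ~ N(0, I_da)
Xb = rng.standard_normal((db, n))             # on-state process  ~ N(0, I_db)

# Projection with (da + db) pairwise orthonormal columns, V^T V = I
V, _ = np.linalg.qr(rng.standard_normal((d, da + db)))

X = V @ np.vstack([Xa * sa, Xb * sb])         # mask states, embed into R^d
```

The transition setting differs only in that the task functions overlap for nc steps at each alternation.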

3.2 FMRI Data

In this section we introduce fMRI data from a task-driven fMRI study that
has been published elsewhere [1]. Six female subjects participated in an fMRI
experiment that comprised ten runs. Each run comprised four emotional periods
(20s) alternating with four resting periods (18s - 24s). During an emotional
period subjects were asked to submerge themselves into an emotional situation
(joy, anger, disgust, fear, sadness) and to facially express their emotional feelings.
A single word cue (e.g. ’joy’) on a black screen signaled an emotional period.
During resting periods subjects were asked to relax. A neutral fixation cross
signaled a resting period. There were two runs for each emotion per subject, i.e.
60 fMRI time series in total (6 subjects × 2 runs × 5 emotions). Each fMRI
time series comprises 80 whole-brain functional images that were acquired with
a TR (repetition time) of 2 s.
FMRI time series were preprocessed including slice acquisition time correc-
tion, concurrent spatial realignment and correction of image distortions by use
of individual static field maps, normalization into standard MNI space and spa-
tial smoothing (10 mm Gaussian kernel) using SPM5 [9]. Only voxels within a
standard anatomical gray matter brain mask [8] were considered in our analy-
sis (i.e. 46,556 voxels per image). Furthermore, from each voxelwise fMRI time
series the temporal mean was subtracted and the linear trend was removed. A
detailed description of the fMRI experiment can be found in [1].

4 Results

4.1 Results on Simulated Data

Fig. 1 shows MPSE time courses of the simulated data described in section 3.1
using a window of size w = 5.
For the exclusive setting in our simulation and a rather large difference in
state dimensionality (da = 20, db = 60) the MPSE time course shows distinctly
higher values during on-state phases than during off-state phases (Fig. 1 (a)).

[Figure: four panels of average MPSE time courses: (a) da = 20, db = 60, exclusive
setting; (b) da = 55, db = 60, exclusive setting; (c) da = 20, db = 60, transition
setting; (d) da = 55, db = 60, transition setting.]

Fig. 1. Average MPSE time courses of simulated time series with standard error
(N=30) using a window size w = 5. From each single MPSE time course the tempo-
ral mean was subtracted before averaging. Light gray (white) bars illustrate simulated
on-state (off-state) phases of dimensionality db (da ). Dark gray bars indicate transition
phases of dimensionality (da + db ).

If in the same simulation setting the difference in state dimensionality is chosen


rather small (da = 55, db = 60) the MPSE time course still attains high values
during on-state phases and low values during off-state phases (Fig. 1 (b)). The
on-state to off-state level difference in Fig. 1 (b), however, is smaller compared
to Fig. 1 (a). Furthermore, the MPSE time course in Fig. 1 (b) shows stronger
high-frequency fluctuations and noise than the MPSE time course in Fig. 1 (a).
For the transition setting in our simulation and a rather large difference in
state dimensionality (da = 20, db = 60) the average MPSE time course shows
again a high on-state and a low off-state level. In addition, an even higher third
level can be observed during transition phases (Fig. 1 (c)). If in the same simula-
tion setting the difference in state dimensionality is chosen rather small (da = 55,
db = 60) the transition level can still be observed (Fig. 1 (d)), but the MPSE
level difference between transition phase and on-state phase increases as the
difference in state dimensionality decreases (compare Fig. 1 (c) to Fig. 1 (d)).

4.2 Results on FMRI Data


Fig. 2 shows the average MPSE of the fMRI data set that was described in
section 3.2 for windows of size w ∈ {3, 5}. Applying MPSE to the fMRI data

[Figure: two panels of average MPSE time courses: (a) w = 3; (b) w = 5.]

Fig. 2. Average MPSE time courses of 60 fMRI time series with standard error across
subjects (N=6) using window size w. The temporal mean of each single time course was
subtracted before averaging. Gray (white) bars indicate emotional (resting) periods.

obviously results in a time course that reflects the alternation between emotional
and resting periods in the fMRI experiment. For both window sizes the average
MPSE time course has higher values during emotional periods and lower values
during resting periods, i.e. there are task-condition specific levels. Interestingly,
the obtained time courses show distinct peaks at task transitions. Increasing the
value w from w = 3 to w = 5 seems to temporally smooth the average MPSE
time course. For the larger window w = 5 the off-state transition peaks become
smaller, but are still clearly visible.

5 Discussion
For simulated data with a large difference in state dimensionality, we showed that
MPSE is capable of successfully detecting on-state and off-state phases. MPSE con-
siders only w samples at a time and, hence, (w − 1) non-zero eigenvalues due to
w < da < db . Although the population eigenvalues are equally sized (isotropic
Gaussian model) the estimated eigenvalue spectra give different MPSE levels for
different intrinsic dimensionalities (i.e. high MPSE level for high dimensionality
and low MPSE level for low dimensionality). This could be explained by the
fact that under these high-dimensional, low sample size simulation settings esti-
mated eigenvalues are Marčenko–Pastur distributed [5]. Roughly speaking, as the
ratio α between sample size and dimensionality decreases, estimated non-zero
eigenvalues and variance of the corresponding spectra increase. Note that the
Marčenko–Pastur law actually holds for the asymptotic case of infinite sample
sizes for some fix α. Even though we are dealing with very small sample sizes,
MPSE increases, due to larger eigenvalue estimates, as the intrinsic dimension-
ality increases. The same argument explains the high third MPSE level during
transition phases. The transition-peaks are rather high compared to the on-state
and off-state levels when difference in state dimensionalities is small. This is not
surprising, as – by construction – the intrinsic dimensionality of a transition
phase is almost twice as high as the dimensionality of an on- or off-state.
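The small-sample eigenvalue inflation invoked here is easy to reproduce: with isotropic unit-variance data, the w − 1 nonzero sample eigenvalues concentrate around d/(w − 1), so a higher intrinsic dimensionality yields larger eigenvalues and thus a higher MPSE level. A sketch under these assumptions (our own function):

```python
import numpy as np

def mean_nonzero_eig(d_intr, w, reps=200, seed=4):
    """Average nonzero eigenvalue of the sample covariance of w samples
    from N(0, I_{d_intr}); all population eigenvalues equal 1."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(reps):
        Z = rng.standard_normal((d_intr, w))
        Z = Z - Z.mean(axis=1, keepdims=True)
        lam = np.linalg.svd(Z, compute_uv=False) ** 2 / (w - 1)
        acc += lam[: w - 1].mean()
    return acc / reps

low = mean_nonzero_eig(20, 5)    # off-state dimensionality, ~ 20/4 = 5
high = mean_nonzero_eig(60, 5)   # on-state dimensionality,  ~ 60/4 = 15
```

Although the population variances are identical in both cases, the estimated nonzero eigenvalues are roughly three times larger for the higher-dimensional state.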

For the real-world data, we assumed that during emotional periods brain ac-
tivity of subjects shows an increased complexity, due to the neural processing
induced by the experimental task. During resting periods, in contrast, complex-
ity was expected to decrease. Thus, we expected a correspondence between the
time course of a spatio-temporal complexity estimator and the time course of
the experimental task. MPSE estimates the time course of the experimental task
with high precision. There were significantly higher MPSE levels during emo-
tional than during resting periods. Furthermore, the MPSE time course shows
distinct peaks at the state transitions.
In sum, we have introduced MPSE as a measure to estimate the spatio-temporal
complexity of fMRI time series. We have shown that MPSE is capable of detecting
different experimental conditions for a real fMRI data set. Entropy levels were
higher during emotional periods and lower during resting periods. Furthermore,
we found evidence that during transitions between emotional and resting periods
complexity increases. Employing a simple model, we could reproduce (1) MPSE
level differences and (2) MPSE task transition peaks in simulated time series
that comprise state phases of different intrinsic dimensionalities.

Acknowledgement. This work was supported by the DFG, grant number


MA 2401/2-1.

References
1. Anders, S., Heinzle, J., Weiskopf, N., Ethofer, T., Haynes, J.D.: Flow of affective
information between communicating brains. NeuroImage 54(1), 439–446 (2011)
2. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York
(2006)
3. Dhamala, M., Pagnoni, G., Wiesenfeld, K., Berns, G.S.: Measurements of brain ac-
tivity complexity for varying mental loads. Physical Review E - Statistical, Nonlinear
and Soft Matter Physics 65(4), 041917-1–041917-7 (2002)
4. Holmes, A.P., Poline, J.B., Friston, K.J.: Characterizing brain images with the gen-
eral linear model. In: Human Brain Function, pp. 59–84. Academic Press, USA
(1997)
5. Hoyle, D.C.: Automatic PCA Dimension Selection for High Dimensional Data and
Small Sample Sizes. Journal of Machine Learning Research 9, 2733–2759 (2008)
6. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice
Hall, New Jersey (1998)
7. Shannon, C.E.: A Mathematical Theory of Communication. The Bell System Tech-
nical Journal 27, 379–423, 623–656 (1948)
8. Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello, F., Etard, O., Del-
croix, N., Mazoyer, B., Joliot, M.: Automated anatomical labeling of activations in
spm using a macroscopic anatomical parcellation of the mni mri single-subject brain.
NeuroImage 15(1), 273–289 (2002)
9. Wellcome Trust Centre for Neuroimaging, Statistical Parametric Mapping, SPM5,
http://www.fil.ion.ucl.ac.uk/spm/software/spm5/
Neural Architectures for Global Solar
Irradiation and Air Temperature Prediction

Pierrick Bruneau, Laurence Boudet, and Cécilia Damon

CEA, LIST, Information, Models & Learning Laboratory,


91191 Gif sur Yvette CEDEX, France
{pierrick.bruneau,laurence.boudet,cecilia.damon}@cea.fr

Abstract. This paper presents a study on neural architectures for the


prediction of global solar irradiation and air temperature time series, a
useful task for thermal energy management systems. In this contribution,
the highly cyclic nature of the variables is carefully considered in the nor-
malization step and the neural architecture design. The standard neural
approach is confronted to the absolute daily and the absolute tri-hourly
architectures for the prediction of the next 24 hours. For generalization
purpose, models are assessed and compared on data from two sites in
France. Results show that the absolute models outperform the reference
model and some naive models. A complexity analysis also highlights the
interest of the proposed approach.

Keywords: Meteorological time series prediction, Neural architectures,


MLP, Solar energy.

1 Introduction
Accurate predictions of global solar irradiation and air temperature are neces-
sary for anticipative control in autonomous energy management systems that
use solar energy. For example, the system could then anticipate future produc-
tion and needs, and adapt its behavior accordingly. This information may be
provided by an external numerical weather prediction system, but to prevent
any service versatility, or its excessive cost, some system designers require au-
tonomous predictions. Naive models such as persistence or monthly means can
then be effective [3]. Yet, weather prediction can be seen as a time series extrap-
olation problem: machine learning algorithms can then be adapted to computing
a predictor for future values of meteorological time series.
Among other models, the Multi-Layer Perceptron (MLP) architecture, and
some of its variants (e.g. Time Delayed Neural Networks, Recurrent Wavelet
Neural Networks), have already been advocated for meteorological time series
predictions [7], and specifically for solar irradiation prediction [2], [8], [9]. Hourly
or daily cumulated irradiation predictions were considered. Some authors pro-
posed a combination of MLP and ARMA models, either as an additive model
[9], or switching their respective usage according to recent observed errors [8].
Most works assume that data inputs come from a stationary process; real input

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 548–556, 2012.

c Springer-Verlag Berlin Heidelberg 2012
Neural Architectures for Global Solar Irradiation 549

meteorological data is strongly cyclic, and thus has to be pre-processed. Deriva-
tive filters [8], potentially combined with deterministic detrending [9], can be
applied to meteorological data until the stationarity hypothesis is validated by
a statistical test.
In [2], the authors perform a correlation analysis to define the recent time
slots of interest for prediction inputs. They use external cloud cover forecasts as
an additional input. Some authors found better performances when augmenting
the inputs with a time index (e.g. current time or current day in year, see [2]),
but this practice is not systematic [8]. Moreover, it may be seen as an attempt
to take linear and cyclic deterministic trends into account, which can be done
alternatively with proper pre-processing [6]. We choose the latter approach.
In this paper, new architectures for global solar irradiation and air tempera-
ture prediction are proposed. Tri-hourly inputs and predictions are considered.
Computational complexity issues aside, the extension to hourly data and
predictions would be straightforward. To avoid depending on an external
service, future values for the meteorological time series of interest are computed
using only recent local meteorological measurements as inputs. This includes
solar irradiation, temperature, air pressure and hygrometry measurements ob-
tained from local sensors. The latitude may be used too, as this information is
inherent to the location where the system is installed. Most existing models aim
at predicting the meteorological time series for the next time slot only [2], [8],
[9]. Selected subsets of the data from the last 24 hours allow some models
to predict a few further slots [2], [8], but these predictions may not be updated with
the most recently observed information. Alternatively, we propose to predict all
admissible time slots in the upcoming 24 hours, and update these predictions as
new data becomes available. Additionally, previous works suggest that the recent
24 hours of data are sufficient for short-term meteorological regression [1]. Then
irradiation or temperature for all time slots in the 24 upcoming hours have to be
predicted, using only the most recent 24 hours of data. From a computational
point of view, using the smallest number of input variables is preferable. This
selection is usually performed in the literature by autocorrelation analysis [2], [8].
Alternatively, we choose to use all available inputs, and prevent over-fitting us-
ing cross-validation. The main contribution of this paper is to take advantage of
the 24 hour-cyclic nature of meteorological data, and decompose the prediction
problem accordingly. MLP models are specialized to a current time slot or a pre-
dicting horizon, and become building blocks of an architecture of models for time
series prediction. We will show that it allows more accurate predictions, saves
computational time, and opens perspectives for variable selection. First we recall
some basics about the MLP with one hidden layer. Then we discuss a specific
usage of this model for meteorological time series prediction. Several architec-
tures are proposed. The available data sets, and the pre-processing procedures
thereof, are described in section 4. The baseline strategy is first compared to
a set of reference naive models, and then to other neural architectures. Performance
and complexity points of view are illustrated and discussed. The interest
of this approach is finally outlined, and perspectives for future work are drawn.
550 P. Bruneau, L. Boudet, and C. Damon

2 Multi-layer Perceptron Basics


Let us consider a regression problem with D-dimensional inputs x = {x1 , . . . ,
xd , . . . , xD }, and S-dimensional outputs y = {y1 , . . . , ys , . . . , yS }. The regression
MLP with S outputs and K hidden neurons is completely determined by the
following formula:

$$f_s(\mathbf{x}) = \sum_{k} w_{sk}\, \sigma\Big(\sum_{d} w_{kd}\, x_d\Big) \qquad (1)$$

where w_kd (respectively w_sk) denotes the weight from the d-th input node (respectively
the k-th hidden node) to the k-th hidden node (respectively the s-th output node),
and f = {f_s} is the model output. The full set of weights is generally summarized
as w = ({w_kd}, {w_sk}). Hidden and output layer nodes also have bias weights,
implicitly associated with an additional input dimension clamped at 1 in formula
(1). σ(·) is a non-linear function, chosen as the logistic function in this paper.
Learning an MLP then amounts to setting w such that f(x) correctly approximates y.
Values for w are fitted to a set {xn , yn }n=1...N using the back-propagation
algorithm [4]. This algorithm optimizes the quadratic loss of the model output
with respect to target yn vectors. The estimated model is then able to predict
y for an unknown x. Let us note that the cardinality of the weights |w| in the
MLP roughly equals K(D + S).
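As a concrete reading of formula (1), including the bias inputs clamped at 1, a minimal forward pass might look as follows (our sketch, not the authors' code):

```python
import numpy as np

def logistic(z):
    # sigma(.): the logistic activation used in the paper
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, w_hidden, w_out):
    """One-hidden-layer regression MLP, as in formula (1).

    w_hidden has shape (K, D + 1): the weights w_kd plus a bias column.
    w_out    has shape (S, K + 1): the weights w_sk plus a bias column.
    """
    x1 = np.append(x, 1.0)                 # extra input clamped at 1 (bias)
    hidden = logistic(w_hidden @ x1)       # K hidden activations
    h1 = np.append(hidden, 1.0)            # bias input for the output layer
    return w_out @ h1                      # the S linear outputs f_s(x)

# D = 29 inputs and S = 8 outputs, as in the paper's irradiation setting
rng = np.random.default_rng(0)
D, K, S = 29, 5, 8
f = mlp_forward(rng.normal(size=D),
                rng.normal(size=(K, D + 1)),
                rng.normal(size=(S, K + 1)))
```

Ignoring the bias columns, the weight count of such a model is K(D + S), as stated above.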

3 Application to Meteorological Time Series


The variables in a time series are indexed by chronologically ordered timestamps.
The time series length may be infinite (e.g. when new values are continuously
recorded). In this work, we restrict ourselves to tri-hourly time series, i.e. the time
between two timestamps t and t+1 is 3 hours. The difference between prediction
and current time slots is called the predicting horizon. Inputs and outputs are
also limited to be up to 24 hours from the current time: timestamps can then
be designated relatively to the current time slot hc and prediction time slot hp
without any ambiguity. With tri-hourly data, time slots range in {0, 3, . . . , 21}.
Note that a modulo 24h function is applied to associated horizon values (e.g. if
hc = 21h and hp = 3h, the predicting horizon hp − hc is 6 hours).
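The modulo-24 h convention can be captured in a one-liner (our code; mapping an equal slot to a 24-hour horizon is our reading of "all admissible time slots in the upcoming 24 hours"):

```python
def predicting_horizon(h_c, h_p):
    """Horizon in hours between current slot h_c and prediction slot h_p,
    both in {0, 3, ..., 21}, wrapped modulo 24 h."""
    h = (h_p - h_c) % 24
    return h if h != 0 else 24  # the same slot means one full day ahead
```

For the paper's example, `predicting_horizon(21, 3)` gives the stated 6-hour horizon.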
Various strategies may be proposed to predict the values associated to all ad-
missible horizons up to 24h. This problem amounts to decide how every possible
(hc , hp ) couple is handled. A single MLP model is used in most existing works
[2], [8], [9]. Instead, let us define a combination of strategies for input and output
treatment (see figure 1):
– The predictor is composed of one model (i.e. MLP) that predicts all the
upcoming time slots, i.e. for all possible hp values (daily), or it is composed
of 8 MLP’s, each of them being specialized to one hp value (tri-hourly).
– The predictor is composed of one MLP that handles all possible hc values
(relative), or it is composed of 8 MLP’s, each of them being specialized to
one hc value (absolute).

Fig. 1. Alternatives for input and output strategy: a) the relative predictor uses the
same MLP for all possible hc values, b) the absolute predictor specializes one MLP for
each admissible hc value, c) the daily predictor uses the same MLP for all possible
hp values, and d) the tri-hourly predictor specializes one MLP for each admissible hp
value.

The number of MLPs composing the predictor depends on the chosen strategies:
it can range from 1 (relative-daily) to 64 (absolute-tri-hourly).
Most propositions in the literature are relative-daily predictors. This
implementation will thus be used for comparison purposes in our experiments.
The relative-tri-hourly strategy will not be considered in the remainder of this
paper: we choose to put an emphasis on our main contribution, the absolute
architectures.
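The resulting model counts can be tabulated with a small helper (our naming, not the paper's):

```python
N_SLOTS = 8  # tri-hourly time slots per day: {0, 3, ..., 21}

def n_models(input_strategy, output_strategy):
    """Count the MLPs composing an architecture.

    input_strategy:  'relative' (one MLP for all h_c) or 'absolute' (one per h_c)
    output_strategy: 'daily' (one MLP for all h_p) or 'tri-hourly' (one per h_p)
    """
    n_hc = 1 if input_strategy == "relative" else N_SLOTS
    n_hp = 1 if output_strategy == "daily" else N_SLOTS
    return n_hc * n_hp
```

This reproduces the range above: 1 MLP for relative-daily, up to 64 for absolute-tri-hourly.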

4 Experiments
4.1 Data Sets Description and Pre-processing
The data sets used in this experimental section were recorded in Saclay (France,
48.72°N, 2.15°E) and in Marcoule (France, 44.14°N, 4.71°E). These data sets
are sampled according to different time steps (10 minutes for Saclay’s data, 30
minutes for Marcoule). Tri-hourly data sets are first built for each location with
a moving average. Solar irradiation measurements were converted to an instan-
taneous scale (W m⁻²) when needed. For Saclay, the available meteorological
time series span from January 1, 1996 to December 31, 2004 (9 years of data,
3.3% of missing values), and for Marcoule the span is from January 1, 1999 to
December 31, 2007 (9 years of data, 6.1% of missing values).
Time series data should be converted to an acceptable stationary process
before being used in a learning procedure [6], [8], [9]. For meteorological time
series, strong yearly and daily cycles have to be taken into account. To this aim,
let us define the monthly-hourly sets of a time series data set:

$$x_{m,h} = \{x_t \mid \text{month}(t) = m \wedge \text{time\_slot}(t) = h\}. \qquad (2)$$

This results in 96 sets (12 months and 8 time slots). Then the time series can
be normalized in the following way:

$$x_t^{\text{norm}} = \frac{x_t - \text{mean}(x_{m,h})}{\text{standard deviation}(x_{m,h})} \qquad (3)$$
$$\forall t,\ \text{s.t. } m = \text{month}(t) \wedge h = \text{time\_slot}(t).$$

The transformation (3) aims at making the time series' mean and variance
approximately constant ∀t, which is the weak stationarity definition. Note that
irradiation measurements are 0 everywhere for 21h, 0h and 3h time slots. The
transformation (3) is then ignored in these cases.
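The monthly-hourly normalization of equations (2) and (3) can be sketched as follows (our NumPy implementation; zero-variance sets, such as the night irradiation slots, are skipped, as the paper does):

```python
import numpy as np

def monthly_hourly_normalize(x, months, slots):
    """Standardize each value by the mean/std of its monthly-hourly set
    x_{m,h}, following equations (2)-(3). Sets with zero standard deviation
    (e.g. irradiation at 21h, 0h, 3h, identically 0) are left untouched."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    for m in np.unique(months):
        for h in np.unique(slots):
            mask = (months == m) & (slots == h)   # indices of the set x_{m,h}
            mu, sd = x[mask].mean(), x[mask].std()
            if sd > 0:
                out[mask] = (x[mask] - mu) / sd
    return out

# tiny demo: four values from one (month, slot) set
months = np.array([1, 1, 1, 1])
slots = np.array([6, 6, 6, 6])
z = monthly_hourly_normalize([10.0, 20.0, 30.0, 40.0], months, slots)
```

After the transformation, each monthly-hourly set has (approximately) zero mean and unit variance, which is the weak stationarity targeted above.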

4.2 MLP Learning Protocol


Input data for a single MLP comes from 4 meteorological time series. Training
sets are defined by their chronological bounds. For each admissible time t in
the training set (i.e. each t matching the admissible hc values for this MLP), an input
vector is made with the time series restricted to [t − 21; t]. Matching output data
entries are either scalars or vectors, depending on admissible hp values. Input
and output vectors with missing values are ignored.
In the MLP definition in section 2, the determination of K, the number of
neurons in the hidden layer, was left open. A cross-validation scheme is used to
learn a model with good generalization capabilities (thus preventing over-fitting).
More specifically:
– the training set is randomly divided into 5 equal-sized parts,
– each of these 5 parts is used for the validation of an MLP, which is first
learnt on the remaining parts,
– the observed validation errors are averaged, and serve as a criterion for selecting
K: starting from 1, K is increased until the validation error stops diminishing.
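The K-selection scheme above can be sketched as follows (our code; `train_and_score` is a hypothetical stand-in for back-propagation training followed by validation-error measurement):

```python
import numpy as np

def select_hidden_size(X, Y, train_and_score, n_folds=5, max_k=50):
    """Grow K from 1 and stop when the averaged cross-validation error
    stops diminishing. `train_and_score(k, train, valid)` must return the
    validation error of an MLP with k hidden neurons."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, n_folds)          # 5 equal-sized parts
    best_k, best_err = 1, np.inf
    for k in range(1, max_k + 1):
        errs = []
        for f in range(n_folds):
            va = folds[f]
            tr = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            errs.append(train_and_score(k, (X[tr], Y[tr]), (X[va], Y[va])))
        err = float(np.mean(errs))                # averaged validation error
        if err >= best_err:
            break                                 # error stopped diminishing
        best_k, best_err = k, err
    return best_k
```

Any training routine with the stated signature can be plugged in; the loop itself only implements the stopping rule described above.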
For each site, 3 years of consecutive data are used for training, and the 2 following
years are used for testing. Transformed data (see equation (3)) is used for learning
and validation, but testing errors are measured with back-transformed outputs.
The results are presented using the following quality metric:

$$\text{nRMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\frac{y_n - \hat{y}_n}{\max(y_{m,h}) - \min(y_{m,h})}\right)^2}$$

where y_n is the value to predict, ŷ_n the prediction, and y_{m,h} the monthly-hourly
set associated with y_n (see equation (2)). Errors are thus scaled
to the domain they relate to, and can be read as a percentage of the maximal
possible error on the original scale.
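The metric can be computed as follows (our sketch; `y_range` carries, for each sample, the max − min of its monthly-hourly set):

```python
import numpy as np

def nrmse(y, y_hat, y_range):
    """Normalized RMSE: each residual is divided by the max-min range of
    the monthly-hourly set its target belongs to, then RMS-averaged."""
    y, y_hat, y_range = map(np.asarray, (y, y_hat, y_range))
    return float(np.sqrt(np.mean(((y - y_hat) / y_range) ** 2)))
```

A prediction off by half of its set's range on one of two samples, and exact on the other, yields nRMSE = 0.5/√2 ≈ 0.354.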
As noted at the end of section 4.1, observed irradiation time series are 0 every-
where on 3 slots. These slots can then be ignored when forming the input vectors;
i.e. D = 29 instead of 32. Moreover, predictions for these slots are trivial: this
means that associated models can be removed from tri-hourly architectures for
irradiation (e.g. absolute-tri-hourly irradiation predictor uses 40 models, instead
of 64). Consequently, these trivial predictions should be ignored when computing
or aggregating quality results.

4.3 Comparison to Naive Models


In this section, the relative-daily predictor is compared to the set of models
presented in table 1. The mixed model uses the clear-sky maximal theoretic
irradiation function [5]:
$$f_{\text{csmax}}(t) = A\,(\cos\theta_t)^B \qquad (4)$$
with A, B coefficients specific to clear-sky predictions, and θ_t the zenith angle
of the sun at time t.

Table 1. Description of the reference naive models

Reference model name | Pre-requisites                        | Predicted variable         | Details
Persistence          | Latest 24 hours of the time series    | Temperature or irradiation | Uses the latest 24 hours as the evolution of the next 24 hours
Monthly means        | Several years of data history         | Temperature or irradiation | Uses the monthly-hourly set means as a predictor (see equation (2))
Mixed                | Latitude where the system is located  | Irradiation                | Uses half the clear-sky maximal theoretic irradiation function (4) as a predictor
Adjusted persistence | Latest 24 hours of the time series    | Temperature                | The evolution of the latest 24 hours of temperature is translated to the current temperature
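For concreteness, the persistence baselines of Table 1 amount to a few lines (our reading of the table; in particular, the "translated to current temperature" rule is our interpretation, with arrays holding one value per tri-hourly slot):

```python
import numpy as np

def persistence(last_24h):
    """Use the latest 24 hours as the forecast for the next 24 hours."""
    return np.asarray(last_24h, dtype=float).copy()

def adjusted_persistence(last_24h):
    """Shift the evolution of the latest 24 hours so that it connects to
    the most recent observation (temperature only; our interpretation)."""
    last = np.asarray(last_24h, dtype=float)
    return last + (last[-1] - last[0])
```

The monthly-means and mixed baselines would similarly just look up the set means of equation (2) and evaluate half of f_csmax(t) from equation (4).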

Let us note that expression (4) depends on the latitude of the measurements
through the zenith angle. The mixed predictor of irradiation in table 1 is the best
estimator under the assumption of an even distribution around f_csmax(t)/2. Experimental
evaluations of these reference models are reported in table 2, along with
a comparison to the relative-daily predictor, the simplest among the proposed
architectures. 5 independent experiments are used to estimate the variability of
learning the neural architecture. Results are averaged according to each admis-
sible horizon value.
With the exception of the adjusted persistence, naive predictors do not get up-
dated by recently observed data. Thus their nRMSE values remain independent
of the predicting horizon. Among naive predictors, monthly means (respectively
persistence) perform best for irradiation (respectively temperature) prediction.
Temperature prediction errors are reduced by the adjusted persistence predic-
tor for horizons less than or equal to 6 hours, e.g. up to 46.0%, for Saclay and
3-hour horizon. Best naive irradiation prediction errors for the 3-hour horizon
are largely reduced by the neural model (up to 27.7%). This reduction monotonically
decreases as the horizon increases, e.g. down to 4.7% for
Marcoule and the 24-hour horizon.
These observations almost hold for temperature predictions, with the notable
exception of the 3- and 6-hour horizons. Indeed, the neural predictor's error improvement
over the adjusted persistence rises from 28.9% for the 3-hour horizon, to

Table 2. nRMSE performance measures of naive and neural predictors. Results are
averaged according to predicting horizons. Bold-faced results indicate the best predictor
for each experimental setting.

                    hp − hc =   3h     6h     9h     12h    15h    18h    21h    24h
Irradiation
Persistence     Saclay       25.2%  25.2%  25.2%  25.2%  25.2%  25.2%  25.2%  25.2%
                Marcoule     24.1%  24.1%  24.1%  24.1%  24.1%  24.1%  24.1%  24.1%
Monthly means   Saclay       23.5%  23.5%  23.5%  23.5%  23.5%  23.5%  23.5%  23.5%
                Marcoule     21.2%  21.2%  21.2%  21.2%  21.2%  21.2%  21.2%  21.2%
Mixed           Saclay       25.1%  25.1%  25.1%  25.1%  25.1%  25.1%  25.1%  25.1%
                Marcoule     32.6%  32.6%  32.6%  32.6%  32.6%  32.6%  32.6%  32.6%
Relative        Saclay       17.0%  19.4%  20.3%  20.6%  20.9%  21.2%  21.3%  21.5%
daily                        ±0.3%  ±0.4%  ±0.3%  ±0.3%  ±0.4%  ±0.4%  ±0.4%  ±0.3%
                Marcoule     15.6%  18.0%  18.9%  19.2%  19.7%  20.0%  20.2%  20.2%
                             ±0.4%  ±0.3%  ±0.3%  ±0.2%  ±0.2%  ±0.2%  ±0.2%  ±0.2%
Absolute        Saclay       15.8%  18.8%  19.9%  20.5%  20.9%  21.1%  21.3%  21.4%
daily                        ±0.2%  ±0.1%  ±0.2%  ±0.2%  ±0.2%  ±0.2%  ±0.3%  ±0.1%
                Marcoule     14.4%  17.2%  18.2%  18.6%  19.2%  19.4%  19.6%  19.7%
                             ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%
Absolute        Saclay       15.5%  18.8%  19.9%  20.4%  21.5%  21.1%  21.3%  21.5%
tri-hourly                   ±0.2%  ±0.1%  ±0.2%  ±0.1%  ±1.4%  ±0.2%  ±0.2%  ±0.3%
                Marcoule     14.0%  17.0%  18.1%  18.6%  19.1%  19.5%  19.6%  20.0%
                             ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.2%  ±0.2%  ±0.1%  ±0.2%
Temperature
Persistence     Saclay       14.8%  14.8%  14.8%  14.8%  14.8%  14.8%  14.8%  14.8%
                Marcoule     14.6%  14.6%  14.6%  14.6%  14.6%  14.6%  14.6%  14.6%
Monthly means   Saclay       18.5%  18.5%  18.5%  18.5%  18.5%  18.5%  18.5%  18.5%
                Marcoule     18.0%  18.0%  18.0%  18.0%  18.0%  18.0%  18.0%  18.0%
Adjusted        Saclay        8.0%  12.6%  15.3%  17.1%  18.5%  19.8%  21.3%  22.5%
persistence     Marcoule      9.0%  14.2%  16.9%  18.4%  19.3%  20.1%  21.4%  22.7%
Relative        Saclay        6.0%   8.9%  10.7%  11.8%  12.4%  12.8%  13.1%  13.3%
daily                        ±0.1%  ±0.1%  ±0.1%  ±0.2%  ±0.2%  ±0.2%  ±0.2%  ±0.1%
                Marcoule      6.4%   9.6%  11.1%  11.7%  12.1%  12.3%  12.5%  12.8%
                             ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%
Absolute        Saclay        5.9%   8.2%   9.8%  10.9%  11.7%  12.3%  12.7%  13.1%
daily                        ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.2%  ±0.1%  ±0.1%  ±0.1%
                Marcoule      6.4%   9.1%  10.8%  11.6%  12.0%  12.3%  12.5%  12.9%
                             ±0.3%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%
Absolute        Saclay        5.4%   8.2%   9.8%  10.9%  11.6%  12.2%  12.7%  13.1%
tri-hourly                   ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%
                Marcoule      5.9%   9.0%  10.6%  11.5%  12.0%  12.3%  12.5%  12.8%
                             ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%  ±0.1%

32.4% for the 6-hour horizon. As for the irradiation prediction, this reduction
decreases down to 10.1% as the horizon increases.
The relative-daily predictor thus largely improves on the naive approaches for
short-term prediction. This improvement becomes gradually less important as the
predicting horizon is increased. This can be seen as a decay of the information
provided by recently observed data.

4.4 Architectures Evaluation


In table 2, the relative-daily architecture is also compared to the absolute-daily
and absolute-tri-hourly architectures.
The relative-daily architecture error, either for irradiation or temperature
prediction, is improved at most by 11.4% for the 3-hour horizon. This reduction

Table 3. Comparative analysis of architecture complexities

                          # of   avg(K)  avg(|w|)  total # of ops ∝  # of ops ∝ / MLP  total # of ops ∝
                          MLPs   / MLP   / MLP     (prediction)      (learning)        (learning)
Irradiation
Relative daily   Saclay      1   12.00     444        444              197136             197136
                 Marcoule    1   10.80     400        400              160000             160000
Absolute daily   Saclay      8    2.13      79         79                6241              49928
                 Marcoule    8    1.40      52         52                2704              21632
Absolute         Saclay     40    1.05      32        160                1024              40960
tri-hourly       Marcoule   40    1.01      30        150                 900              36000
Temperature
Relative daily   Saclay      1   17.40     644        644              414736             414736
                 Marcoule    1   10.80     400        400              160000             160000
Absolute daily   Saclay      8    2.73     101        101               10201              81608
                 Marcoule    8    2.25      83         83                6889              55112
Absolute         Saclay     64    1.08      32        256                1024              65536
tri-hourly       Marcoule   64    1.09      33        264                1089              69696

becomes almost 0% for horizons greater than 12 hours. Also, with the exception
of the 3-hour horizon, the absolute daily and tri-hourly architectures perform
similarly well.
The average hidden layer size K is reported for each experimental setting in
table 3. The average numbers of learning and prediction operations associated with
each architecture are also reported, using the fact that the back-propagation
algorithm is O(|w|²), and prediction with MLP models is O(|w|). Predictions
by absolute architectures are less costly (from 1.5 to 8 times). The decomposition
implied by the absolute architectures thus leads to more parsimonious MLP’s.
The required number of operations for learning is also dramatically reduced by
the absolute architectures (up to 7.4 times). The learning costs of the absolute-daily
and tri-hourly architectures are of the same order of magnitude: the preference may
be influenced by the site where the data was recorded (i.e. climate conditions).
However predictions by the absolute-daily architecture are much cheaper in pro-
cessing time (up to 3 times), which may be a much more decisive criterion in
the context of embedded systems.
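The counts in Table 3 follow from the approximation |w| ≈ K(D + S) together with the stated cost orders; e.g. for the relative-daily irradiation predictor at Saclay (K = 12, with D = 29 inputs and S = 8 outputs), the following helper reproduces the tabulated per-MLP figures (our arithmetic sketch):

```python
def mlp_weights(k, d, s):
    """|w| roughly equals K * (D + S), neglecting bias weights."""
    return k * (d + s)

def per_mlp_costs(k, d, s):
    """(prediction ops, learning ops) per MLP, up to constant factors:
    prediction is O(|w|), back-propagation learning is O(|w|^2)."""
    w = mlp_weights(k, d, s)
    return w, w ** 2
```

Multiplying the learning cost by the number of MLPs in the architecture gives the total learning columns of Table 3.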

5 Conclusion

In this paper, a general view of neural architectures for time series prediction
was proposed, and applied to tri-hourly meteorological data. This view unifies
previous works with the absolute strategy, which decomposes the prediction problem
according to the current time slot under consideration. In addition, the
tri-hourly strategy specializes each MLP to a specific predicted time slot.
After validating the performance of neural prediction against reference naive
models, several architectures were compared. Maximal gains are obtained for the
shortest-term predictions: when the horizon is greater than 9 hours, all architectures
perform equally. The architectures were also compared from a computational
point of view: absolute architectures involve a much lower number of operations

for learning and prediction. Besides, absolute architectures open perspectives for
variable selection: the Bayesian scheme proposed in [4] could be used to derive
a well-founded variable selection procedure on each MLP of an architecture.
Consequences on performance and complexity need further investigations, but
the predictor would gain in interpretability. Indeed, each current time slot or
predicting horizon could then be associated with its own set of relevant inputs.

References
1. Aguiar, R., Collares-Pereira, M.: Statistical properties of hourly global radiation.
Solar Energy 48(3), 157–167 (1992)
2. Cao, J., Lin, X.: Study of hourly and daily solar irradiation forecast using diagonal
recurrent wavelet neural networks. Energy Conversion and Management 49, 1396–
1406 (2008)
3. Giebel, G., Kariniotakis, G.: Best practice in short-term forecasting: a user's guide.
In: European Wind Energy Conference & Exhibition (2007)
4. Nabney, I.T.: Netlab: Algorithms for Pattern Recognition. Springer (2002)
5. Perrin de Brichambaut, C.: Météorologie et énergie: l’évaluation du gisement solaire.
La Météorologie 6(5), 129 (1976)
6. Qi, M., Zhang, G.P.: Trend time series modeling and forecasting with neural net-
works. IEEE Transactions on Neural Networks 19(5), 808–816 (2008)
7. Smith, B.A., McClendon, R.W., Hoogenboom, G.: Improving air temperature pre-
diction with artificial neural networks. IJCI 3, 179–186 (2006)
8. Voyant, C., Muselli, M., Paoli, C., Nivet, M.-L.: Numerical weather prediction (NWP)
and hybrid ARMA/ANN model to predict global radiation. Energy 39, 341–355 (2012)
9. Wu, J., Chan, C.K.: Prediction of hourly solar radiation using a novel hybrid model
of ARMA and TDNN. Solar Energy 85, 808–817 (2011)
Sparse Linear Wind Farm Energy Forecast

Carlos M. Alaı́z, Alberto Torres, and José R. Dorronsoro

Departamento de Ingenierı́a Informática & Instituto de Ingenierı́a del Conocimiento,


Universidad Autónoma de Madrid, 28049 Madrid, Spain
{carlos.alaiz,jose.dorronsoro}@uam.es, alberto.torres@iic.uam.es

Abstract. In this work we will apply sparse linear regression methods to


forecast wind farm energy production using numerical weather prediction
(NWP) features over several pressure levels, a problem where pattern
dimension can become very large. We shall place sparse regression in the
context of proximal optimization, which we shall briefly review, and we
shall show how sparse methods outperform other models while at the
same time shedding light on the most relevant NWP features and on
their predictive structure.

Keywords: Sparse Methods, Lasso, Group Lasso, Elastic–Net, Group


Elastic–Net, Wind Farm Energy Forecast.

1 Introduction

Most modelling problems of interest in practical machine learning involve high


dimensional data, sometimes coupled with large samples. The overall goal in
those problems is to achieve good predictive models but large sizes and dimen-
sions often preclude the straight use of strong but complex methods such as
neural networks or support vector machines. The simplest choice is to use linear
models, which, while probably inferior to other alternatives, offer a first quality
standard against which other approaches can be benchmarked. Moreover, linear
models can also help to pinpoint which variables are most influential. This can
be exploited by selecting the most important features to be used as inputs of
stronger methods and, also, to gain further knowledge of the problem under
study.
In the past decade the Stanford school of Breiman, Friedman, Hastie and
Tibshirani has proposed a series of sparse enforcing linear models, such as the
Lasso [8], Group Lasso [9], and Elastic–Net [10]. They add to the standard square
error loss function an 1 penalty (Lasso), a mixed 2,1 penalty (Group Lasso) or
combine the 1 penalty with an 2 regularizer term (Elastic–Net). The resulting
optimization problems are still convex but no longer differentiable, and a series
of ad–hoc methods have been proposed to solve them. From a general point of
view the combined criterion function has the form

$$\min_{W}\ \frac{1}{2}\hat{L}_S(W) + \lambda_1\,\hat{R}(W) + \frac{\lambda_2}{2}\,\|W\|_2^2, \qquad (1)$$
A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 557–564, 2012.

© Springer-Verlag Berlin Heidelberg 2012
558 C.M. Alaı́z, A. Torres, and J.R. Dorronsoro

with L̂_S(W) denoting the quadratic loss of a linear model with weights W over
a sample S, and R̂(W) the non-differentiable but still convex ℓ1 or ℓ2,1 norm
of W. This general formulation places the above problems under the scope of
Proximal Optimization [3], which exploits the concept of proximal operators to
arrive at a general optimization procedure. Moreover, it makes it quite easy to
extend the previous methods. For instance, we will consider here a group version
of Elastic–Net, simply by mixing in (1) the ℓ2,1 and ℓ2 norms of W. We shall
apply this set up to the problem of predicting wind farm energy production, of
considerable interest nowadays and fitting squarely in the previous set up.
The usual approach uses historical wind energy production data and fore-
casts derived from numerical weather prediction (NWP) systems, in our case,
the Agencia Española de Meteorologı́a (AEMET; [1]). These systems provide
forecasts for the nodes of a geographical grid, typically at resolutions that start
at 0.16° or even finer, for either a surface level derived from a smooth
orographical model or for several constant pressure levels that go from sea level
to a height of about 20 Km. These forecasts are usually given every three hours.
Moreover, many meteorological variables are available at each level and the num-
ber of possible features may clearly become very large. To manage this, a first
obvious approach is to fix a square of grid points centred at the wind farm and
consider for each grid point a number of surface variables. However, these are
typically given 10 m above the grid point height, which may bear little relationship
to the actual altitude of a wind farm. The alternative is to consider
NWP forecasts for a number of pressure levels but this will of course augment
feature dimension and, thus, make the sparse methods attractive modelling tools.
This approach will be applied to the study of wind energy forecasting at the
Sotavento wind farm, situated in the Galicia region of north–western Spain. We
shall work with a 6 × 6 grid with a 0.25° resolution, 6 pressure layers
and 5 meteorological variables. Total dimension is thus 1,080. We will consider
a one-year long training sample, whose size is thus 2,920, i.e., about three times
pattern dimension and well below the linear regression rule of thumb of having 10
patterns per dimension. Regularization is thus mandatory but sparse regression
also comes in very naturally. In fact, and as we shall see, sparse models beat ridge
regression using a quite small number of the features available. Moreover, sparse
models also shed light on the predictive structure of NWP variables, something
that could be exploited when considering stronger, more complex methods than
standard regression. The paper is organized as follows. In Sect. 2 we will briefly
review the theory of Proximal Optimization and its training algorithms, and
describe how the previous sparse regression problems fit in this set up. In Sect.
3 these models will be applied to wind energy prediction for the Sotavento farm
and the paper ends with a discussion and conclusions section.

2 Proximal Methods for Regularized Linear Models


Assume a training set S = {(X^(p), y^(p))}_{p=1}^{N}, with (X^(p), y^(p)) ∈ R^D × R, for
which we want to build a model f_W : R^D → R in a certain Hilbert space, f_W ∈ H,
Sparse Linear Wind Farm Energy Forecast 559

and parametrized by a weight vector W, such that f_W(X^(p)) ≈ y^(p), ∀p. To make
this more precise, we introduce a convex loss function L_S : H → R and look in
principle for an f_W that minimizes L_S(f_W). However, we may also want to control
model complexity, for which we may introduce a sparsity-controlling convex term
R(f_W) as well as a regularization term ||f_W||²_H. These considerations lead to the
general optimization problem

$$\min_{f_W \in \mathcal{H}}\ \frac{1}{2}L_S(f_W) + \lambda_1 R(f_W) + \frac{\lambda_2}{2}\|f_W\|_{\mathcal{H}}^2, \qquad (2)$$
where λ1 and λ2 are the parameters which determine the relative importance
of the regularization terms against the error term. Notice that if λ2 ≠ 0,
the objective function is strictly convex. While the first and third terms are dif-
ferentiable, R(fW ) will usually not be so. To deal with this, we will consider the
problem (2) under the framework of Proximal Methods, a set of techniques to
solve non–differentiable optimization problems in an iterative way. The starting

point is the fact [6] that the solution f_W^* of (2) satisfies, for any η > 0, the fixed
point equation

$$f_W^* = \operatorname{prox}_{\frac{\lambda_1}{\eta};R}\!\left(\Big(1 - \frac{\lambda_2}{\eta}\Big) f_W^* - \frac{1}{2\eta}\nabla L_S(f_W^*)\right), \qquad (3)$$
where the proximity operator prox_{λ;F}(x) of a function F at a point x ∈ R^D
with step λ is defined as

$$\operatorname{prox}_{\lambda;F}(x) = \arg\min_{y}\ \lambda F(y) + \frac{1}{2}\|x - y\|_2^2. \qquad (4)$$

The solution of (4) is problem dependent and finding it is the main issue when
applying proximal optimization. If it is known, (3) justifies an iterative algorithm
based on the steps
  
$$f_W^{(t)} = \operatorname{prox}_{\frac{\lambda_1}{\eta_t};R}\!\left(\Big(1 - \frac{\lambda_2}{\eta_t}\Big) f_W^{(t-1)} - \frac{1}{2\eta_t}\nabla L_S\big(f_W^{(t-1)}\big)\right).$$
There are several general purpose algorithms that apply this iterative scheme.
Here we will use the Fast Iterative Shrinkage–Thresholding Algorithm (FISTA;
[2]) which automatically determines the step length ηt .
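As an illustration of the iterative scheme above, a minimal FISTA for the Lasso case can be sketched as follows (our code with a fixed step 1/L, not the backtracking step-length rule of [2]):

```python
import numpy as np

def soft_threshold(x, t):
    # proximal operator of t * ||.||_1 (soft thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista_lasso(X, Y, lam, n_iter=500):
    """FISTA for min_W (1/N)||XW - Y||^2 + lam * ||W||_1."""
    n, d = X.shape
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    W = np.zeros(d)
    Z, t = W.copy(), 1.0                      # momentum point and scalar
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ Z - Y)
        W_new = soft_threshold(Z - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = W_new + (t - 1.0) / t_new * (W_new - W)   # momentum extrapolation
        W, t = W_new, t_new
    return W
```

Swapping the soft-thresholding call for the appropriate proximal operator (and adding the (1 − λ2/η) shrinkage factor) yields the other models discussed below.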
Sparse regularized linear regression fits nicely in this set-up. In that case,
H = R^D, the basic model is f_W(X) = X · W, and the loss function is
L_S(f_W) = (1/N)||XW − Y||²₂, where X is the matrix collecting all the inputs X^(p)
in its rows, and Y is the vector of all the desired outputs y^(p). All the mentioned linear

sparse models can be derived for particular choices of the functional R and the
parameters λ1 and λ2 , as summarized in Table 1. The simplest case is to fix
λ1 = λ2 = 0, which leads to the Ordinary Least Squares (OLS) model. The
resulting optimization problem can be easily solved analytically (see Table 1),
but if no regularization is included, OLS models are likely to over–fit the sample
when the feature dimension D is comparable with sample size. The simplest way

Table 1. Correspondence between the regularized linear models and problem (2)

Model   L_S(f_W)               R(f_W)             ||f_W||_H^2        Solution
OLS     (1/N)||XW − Y||_2^2    ×                  ×                  W° = (X^T X)^{-1} X^T Y
RLS     (1/N)||XW − Y||_2^2    ×                  (1/D)||W||_2^2     W° = (X^T X + (N λ2 / D) I)^{-1} X^T Y
LA      (1/N)||XW − Y||_2^2    (1/D)||W||_1       ×                  FISTA
GL      (1/N)||XW − Y||_2^2    (1/D)||W||_{2,1}   ×                  FISTA
ENet    (1/N)||XW − Y||_2^2    (1/D)||W||_1       (1/D)||W||_2^2     FISTA
GENet   (1/N)||XW − Y||_2^2    (1/D)||W||_{2,1}   (1/D)||W||_2^2     FISTA

to avoid this is just to take some λ2 > 0 while keeping λ1 = 0. This leads to
Regularized Least Squares (RLS; [4]), which also has a closed-form solution.
A first choice for the functional R is to use the ℓ1 norm, R_LA(f_W) = (1/D)||W||_1.
W 1 .
Setting λ1 > 0 and λ2 = 0 and using this functional, we recover the Lasso (LA;
[8]) algorithm. The ℓ1 norm encourages sparse models, something that can be
seen as an implicit feature selection, because the inputs associated with zero
coefficients are simply discarded. Because of its non-differentiability, LA models
will be trained using FISTA, as explained above. The proximal operator of the
ℓ1 norm is given by soft thresholding [5] as (prox_{λ;||·||_1}(x))_i = x_i (1 − λ/|x_i|)_+. Notice
that in LA all coefficients are treated individually. In certain circumstances we
may want to have a grouping effect in the features, so as to detect relevant
groups. A way to achieve this is to enforce that all the coefficients in a group
should be active or inactive at the same time. This is what the Group Lasso (GL;
[9]) algorithm obtains, using a mixed ℓ2,1 norm as regularizer, i.e.,

$$R_{GL}(f_W) = \frac{1}{D}\|W\|_{2,1} = \frac{1}{D}\sum_{i=1}^{D/V}\sqrt{\sum_{v=1}^{V} w_{i,v}^2},$$

where w_{i,v} is the component corresponding to the v-th variable of the i-th group,
and the space R^D is decomposed into D/V groups of V variables. Notice that
this ℓ2,1 norm is just the ℓ1 norm of the ℓ2 group norms; thus it precisely
enforces group sparsity. Again, this problem will be solved using FISTA, with
the proximal operator now being (prox_{λ;||·||_{2,1}}(x))_{i,v} = x_{i,v} (1 − λ/||x_{i,·}||_2)_+.
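Both proximal operators have simple closed forms, which keeps each FISTA iteration cheap; a vectorized sketch (ours, with groups assumed contiguous in the weight vector):

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: (prox_{lam;||.||_1}(x))_i = x_i (1 - lam/|x_i|)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l21(x, lam, group_size):
    """Group soft thresholding: each group is shrunk by its l2 norm,
    (prox)_{i,v} = x_{i,v} (1 - lam/||x_{i,.}||_2)_+, zeroing whole groups."""
    groups = np.asarray(x, dtype=float).reshape(-1, group_size)
    norms = np.linalg.norm(groups, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-300), 0.0)
    return (groups * scale).ravel()
```

Groups whose ℓ2 norm falls below λ are set to zero entirely, which is exactly the grouped feature-selection effect described above.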
While the ℓ1 norm enforces sparseness, the regularizing effect of the ℓ2 norm
has its own advantages, and combining both seems sensible. This is what Elastic–Net
(ENet; [10]) does, setting λ1 > 0, λ2 > 0 and using the R_LA regularizer of
LA. On the other hand, the Proximal Optimization approach easily allows one to
define a group version of this, the Group Elastic–Net (GENet), that combines
the ℓ2 norm with the R_GL regularizer. Both ENet and GENet will also be solved
by FISTA.

3 Numerical Experiments
In this section we will apply the previous algorithms to study the prediction of
the energy production of a wind farm. We will work with the Sotavento wind
farm [7], located at 43.34◦ N, 7.86◦ W and that makes production data publicly
Sparse Linear Wind Farm Energy Forecast 561
available. The usual features in wind power forecasting are surface predictions
of meteorological variables. We shall work first with the following: V , the norm
of the wind speed, Vx , its x component, Vy , its y component, the temperature
T and pressure P . They will be considered over a rectangular, 0.25◦ resolution,
6 × 6 point grid surrounding Sotavento. The dimension for surface prediction
is thus 180 = 6 × 6 × 5, large but not excessively so. We will work with a 1
year training set and a 2 month test period. Meteorological forecasts are only
available every three hours; thus, we have eight patterns per day and the training
sample size is 2,920. We normalize the wind energy production target values to
the [0, 1] interval as percentages of the total installed wind power in Sotavento.
In any case, many more variables are available on the 17 constant pressure
levels for which AEMET gives NWP forecasts, although obviously not all of them
will have an effect on the energy. These levels have a 50 hPa resolution and over
them pressure is constant and no longer a predictive variable; we substitute it
by geopotential heights. A first level selection can be done using Sotavento’s
elevation, with an average of about 600 m. The first 11 levels are consistently
located much higher; moreover, correlation plots with wind farm production
(not included) show that they do not contain useful information and we have
discarded them outright. We are left with the lowest 6 layers and total feature
dimension is then 6 × 6 × 6 × 5 = 1, 080. Thus, sample size is about 3 times the
dimension. However, it is not clear that all of these features have the same effect
(if any) on the wind energy production and they may handicap full regression
models even if they are regularized. Sparse methods can thus help us first to
find better models and, second, to better understand which grid points, pressure
levels and variables are the most useful to improve predictions.
To do so, we will use the models described in Sect. 2. For the case of group
algorithms (GL and GENet), we consider as a group the 5 meteorological vari-
ables evaluated over a grid point. As usually done in wind energy, the mod-
els are evaluated using the Mean Absolute Error over the test set,
MAE = (1/N) Σ_{p=1}^{N} |X^{(p)} · W − y^{(p)}|. We will also report the standard deviation σ_AE of
the absolute errors, although they are rather conservative as we do not perform
any sample size correction (assuming independence for these errors would lead to
dividing the values given by √N and, hence, much smaller values). An important
issue for most of the algorithms used is the estimation of the hyper–parameters
λ1 and λ2 that configure each model. This is done as a search over a grid representation
of the parameter space, working on a logarithmic scale from 10^{−3}
to 10^{3} with steps of 10^{0.10}. For the algorithms that involve a bi–dimensional
grid (ENet and GENet), the step size is increased to 10^{0.20}. At each point of
the parameter grid, a 5–fold cross validation is used to evaluate a given model
using as fitness the MAE and discarding models above a predefined sparseness
level ρ, fixed as the percentage of non–zero weights. Three different values of ρ,
30, 50 and 100 (i.e., no restrictions), are considered for all models.
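This validation loop can be sketched as follows (our own simplified illustration: the plain ISTA Lasso solver, the single λ1 grid and the 10⁻⁸ tolerance used to count non-zero weights are all assumptions, not the authors' code):

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=200):
    """Plain ISTA for the Lasso; enough for a grid-search sketch."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = w - X.T @ (X @ w - y) / L      # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

def grid_search(X, y, rho=0.5, k=5):
    """Log-spaced search over lambda1: k-fold CV on the MAE, discarding
    models whose fraction of non-zero weights exceeds the level rho."""
    folds = np.array_split(np.arange(len(y)), k)
    best_mae, best_lam = np.inf, None
    for lam in 10.0 ** np.arange(-3.0, 3.0 + 1e-9, 0.1):  # 10^-3 .. 10^3, step 10^0.1
        maes, act = [], []
        for f in folds:
            tr = np.setdiff1d(np.arange(len(y)), f)
            w = lasso_ista(X[tr], y[tr], lam)
            maes.append(np.mean(np.abs(X[f] @ w - y[f])))
            act.append(np.mean(np.abs(w) > 1e-8))      # fraction of active weights
        if np.mean(act) <= rho and np.mean(maes) < best_mae:
            best_mae, best_lam = np.mean(maes), lam
    return best_mae, best_lam
```

The sparseness filter simply discards any grid point whose average fraction of active weights exceeds ρ, exactly mirroring the three ρ levels considered in the text.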
The comparison of the different algorithms is summarized in Table 2. A first
reference are surface models, for which we recall that feature dimension is 180;
therefore, we only consider the 100% sparsity level. Their performance is given
562 C.M. Alaı́z, A. Torres, and J.R. Dorronsoro
Table 2. Results and parameters for 6 pressure levels and minimum 30%, 50% and
100% sparseness, and surface data. Models ordered by MAE.

(a) 6 levels (ρ = 30%):

Method  MAE   σAE   Act W    λ1      λ2
ENet    7.09  6.5    11.7%  −0.40   −2.80
RLS     7.11  6.6   100.0%    ×     +1.20
LA      7.14  6.5    10.9%  −0.30     ×
GENet   7.52  6.7    17.1%  +0.20   −2.00
GL      7.61  6.8    13.9%  +0.30     ×
OLS     8.65  8.6   100.0%    ×       ×

(b) 6 levels (ρ = 50%):

Method  MAE   σAE   Act W    λ1      λ2
LA      7.08  6.5    14.4%  −0.60     ×
ENet    7.10  6.5    16.8%  −0.60   +0.00
RLS     7.11  6.6   100.0%    ×     +1.20
GENet   7.31  6.6    23.6%  +0.00   +0.00
GL      7.32  6.6    22.7%  +0.00     ×
OLS     8.65  8.6   100.0%    ×       ×

(c) 6 levels (ρ = 100%):

Method  MAE   σAE   Act W    λ1      λ2
GL      7.05  6.5    53.2%  −0.60     ×
ENet    7.07  6.5    39.6%  −1.00   +0.60
LA      7.10  6.6    34.4%  −1.00     ×
GENet   7.11  6.5    75.5%  −0.60   +0.80
RLS     7.11  6.6   100.0%    ×     +1.20
OLS     8.65  8.6   100.0%    ×       ×

(d) Surface (ρ = 100%):

Method  MAE   σAE   Act W    λ1      λ2
GL      7.12  6.7   100.0%  −2.80     ×
LA      7.13  6.7    91.7%  −2.80     ×
RLS     7.14  6.7   100.0%    ×     −2.80
OLS     7.21  6.8   100.0%    ×       ×
GENet   7.26  6.6   100.0%  −2.00   −2.60
ENet    7.32  6.6    66.7%  −2.20   −1.80
in Subtable 2d and the best models are GL and LA, in that order, although
they essentially do not enforce any sparsity, as their active weights are
100% and 91.7% respectively. Subtables 2a, 2b and 2c give the performance of
the multi–level variables. Recall that feature dimension is now 1,080, making
the use of regularized or sparse models mandatory (notice that unregularized
linear regression, OLS, performs very badly due to a clear case of over–fitting).
As can be expected, the best results are obtained at the 100% sparsity level,
with the best algorithm being GL, which uses about half of the features. If more
sparsity is imposed, model performance is just slightly worse, but sparsity greatly
increases. At the 50% level, LA is the second best model, with only 14.4% of
active weights. At the strictest 30% sparsity level, ENet is the best model, with
a sparsity of just 11.7%. Moreover, in all cases the results using pressure level
variables are better than the ones obtained using surface variables. As mentioned
before, a reason for this is that the 10 m height of surface variables may not
be representative of actual wind farm altitude. In any case, the use of sparse
methods such as LA and ENet over pressure layers is justified.
We turn now our attention to the structure identified by the sparse models.
Figure 1a shows the percentage of the total active weights per variable. The non–
sparse algorithms obviously do not perform any kind of variable selection and
the same is true for the group methods, the reason being that they essentially
select all the variables at a given grid point. On the other hand, LA and ENet
favour the V and Vx variables and discard almost completely the geopotential
height. This is reasonable as it has a much smaller correlation with respect
to wind energy production. Figure 1b shows the percentage of the total active
[Figure 1: bar charts of the percentage of active weights per variable (V, Vx, Vy, T, H) and per pressure level (12–17) for the GL, GENet, OLS, RLS, LA and ENet models.]

Fig. 1. Active weight % per variable (left) and level (right) for ρ = 50%
weights per pressure level. Now all the sparse methods perform some kind of
level selection, favouring the highest and lowest layers. The reason for this is
clear, as all levels have high correlations with respect to wind energy while the
extreme levels are the most independent. This effect is particularly strong for
the group models GL and GENet, as they must focus on actually selecting levels
instead of variables. We also point out that sparse methods define some grid
structure as they select points which are mostly located in either the centre of
the grid (closest to the wind farm) or the grid extremes (points least correlated
with the grid centre but still correlated with the wind energy).
Summing things up, it is clear that taking into account different pressure levels
yields better models than considering only surface variables. Sparse models help
on this and, moreover, automatically select the feature structure better suited
for modelling.

4 Conclusions

Most modelling problems of interest in practical machine learning involve high
dimensional data. The goal in these problems is not only to achieve good predictive
models but also to do so in an economic way using as few features as possible
and, moreover, to identify some structure in the predictive variables. A natural
approach to this task is to use linear regression models upon which sparsity is
enforced; this is the case of the Lasso, Group Lasso and Elastic–Net methods,
to which we have added a group version of Elastic–Net. In this work, we have
reviewed them under the unifying point of view of Proximal Optimization. This
makes it possible to apply efficient algorithms and to consider extensions of
previous models, as is the case of the Group Elastic–Net model. Wind energy
prediction is a natural field of application for these methods, as NWP makes
available a large number of predictive variables. We have analyzed wind energy
prediction for the Sotavento wind farm, located in the Galicia region of north–
western Spain, considering NWP values for points in a 6 × 6 grid over 6 different
564 C.M. Alaı́z, A. Torres, and J.R. Dorronsoro

pressure layers. As our results show, sparse models built over several pressure
layers outperform those built using just NWP surface values, even when a strict
degree of sparsity is required. Moreover, sparse models also identify the predic-
tive structure in the NWP features and discriminate among the levels considered,
thus improving our problem understanding.
In any case, stronger models could clearly yield better predictions, which
makes it natural to exploit sparse linear regression to select the most relevant
features upon which more advanced models can be built. For example, better
models can be obtained using standard RLS over features selected by the Lasso.
Moreover, the sparse linear methods also have strong theoretical foundations
that could be brought to bear on feature selection. We are currently studying
these and other related issues.
Acknowledgement. With partial support from grant TIN2010-21575-C02-01


of Spain’s Ministerio de Economı́a y Competitividad and the UAM–ADIC Chair
for Machine Learning in Modelling and Prediction. The first author is supported
by the FPU–MEC grant AP2008-00167. We thank our colleague Álvaro Barbero
for the software used in this work.

References
1. Agencia española de meteorologı́a (2012), http://www.aemet.es
2. Beck, A., Teboulle, M.: A fast iterative shrinkage–thresholding algorithm for linear
inverse problems. SIAM Journal on Imaging Sciences 2(1), 183–202 (2009)
3. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing.
Recherche 49, 1–25 (2009)
4. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics 12(1), 55–67 (1970)
5. Kowalski, M., Torrésani, B.: Structured sparsity: from mixed norms to structured
shrinkage. In: Gribonval, R. (ed.) SPARS 2009 – Signal Processing with Adap-
tive Sparse Structured Representations. Inria Rennes – Bretagne Atlantique, Saint
Malo, France (2009)
6. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity
regularization with proximal methods. In: ECML/PKDD (2), Berlin, Heidelberg,
pp. 418–433 (2010)
7. Sotavento (2012), http://www.sotaventogalicia.com
8. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Statist.
Soc. Ser. B 58(1), 267–288 (1996)
9. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society – Series B: Statistical Method-
ology 68(1), 49–67 (2006)
10. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society – Series B: Statistical Methodology 67(2), 301–320
(2005)
Diffusion Maps and Local Models
for Wind Power Prediction

Ángela Fernández Pascual, Carlos M. Alaı́z, Ana Ma González Marcos,
Julia Dı́az Garcı́a, and José R. Dorronsoro

Departamento de Ingenierı́a Informática and Instituto de Ingenierı́a del Conocimiento
Universidad Autónoma de Madrid, 28049, Madrid, Spain
{a.fernandez,carlos.alaiz,ana.marcos,julia.diaz,jose.dorronsoro}@uam.es
Abstract. In this work we will apply Diffusion Maps (DM), a recent
technique for dimensionality reduction and clustering, to build local mod-
els for wind energy forecasting. We will compare ridge regression models
for K–means clusters obtained over DM features, against the models ob-
tained for clusters constructed over the original meteorological data or
principal components, and also against a global model. We will see that
a combination of the DM model for the low wind power region and the
global model elsewhere outperforms other options.

1 Introduction

Local models are a natural and attractive option when approaching processes
with high variance data or whose underlying phenomena may correspond to
quite different settings. However, identifying the appropriate local feature areas
may be quite difficult, particularly for high dimensional data that do not lend
themselves easily to such a task. Unsupervised clustering methods, such as
K–means, appear as an attractive option. However, clustering is often more an
art than a technology and, while many methods have been proposed, simple
approaches are usually followed in practice, in particular K–means, which is
applied assuming a Euclidean distance in the feature space. Besides
fixing the number K of clusters, an adequate sampling is also an important issue
when working with high dimensional data as samples are then bound to be very
sparse. Moreover, the features to be used may not be homogeneous, something
probably better to be handled outside the chosen clustering procedure.
In this paper we will address the above issues in the context of wind energy
prediction. Wind power clearly presents wide, fast changing fluctuations, cer-
tainly at the individual farm level but also when the production of much larger
areas is considered. This is the case of Spain, the world’s fourth biggest producer
of wind power, where wind is currently the third source of electricity. The well
known, sigmoid–like structure of wind turbine power curves clearly shows differ-
ent regimes at low, medium and high wind speeds. Compounded with this are
wind speed frequencies, which follow a Weibull distribution, that is, a stretched
exponential with low winds having large frequencies. While the above does not

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 565–572, 2012.
© Springer-Verlag Berlin Heidelberg 2012
566 Á. Fernández Pascual et al.

directly apply when a wide area is considered, different regimes also appear.
Wind energy forecasting for large areas also implies high dimensional features,
as the predictive variables are the outputs of numerical weather prediction
(NWP) models, such as the ECMWF or GFS ones, given over large grids that
cover the areas under study. Global models may find it difficult to handle these
regimes and local models are natural alternatives [1,9].
This high dimension suggests preceding clustering with some dimensionality
reduction (DR) technique, preferably one that is likely to yield a Euclidean
metric for the new features. Diffusion Maps (DM) [5], a novel spectral technique
for DR, is particularly suited to these requirements. In fact, there is a natural
diffusion metric in the original feature space that corresponds to the Euclidean
metric in the embedded space. This means that clustering methods that rely on
Euclidean metrics, particularly K–means, should work well on the new features.
DM also allows one to control to some extent the effects of the underlying data
distribution and, moreover, it makes it possible to work with heterogeneous variables. In
other words, DM can be a powerful tool for finding informative clusters in high
dimensional, heterogeneous data.
Of course, DM is not the only option. Straight K–means clustering can cer-
tainly be used. Moreover, NWP variables for a large area usually show high
correlation among different grid points. This may suggest that variance–based
DR methods such as Principal Component Analysis (PCA) may be a useful
alternative. We shall consider these three options here in order to, first, iden-
tify local clusters and then to construct local models to be compared against a
global one. Many paradigms can be considered for model building but here we
will concentrate on the simplest alternative, ridge regression, i.e., regularized lin-
ear least squares, certainly not the strongest possible method but a good option
to measure the usefulness of local methods against a global one.
The paper is organized as follows. We will review in Sect. 2 DM from a general
point of view, as well as its use over heterogeneous data. In Sect. 3 we will
consider K–means on DM, PCA and the original features, we will compare local
ridge regression models on these clusters, we will discuss their effectiveness and
we will conclude on how to combine local and global models for better predictors.
Section 4 ends this paper with a brief discussion and conclusions.
2 Diffusion Maps Review

The key assumption in Diffusion Maps (DM) is that the data to be studied
lie in a low–dimensional manifold whose geometry can be described through
a Markov chain diffusion metric. To capture this intrinsic geometry, the first
step is to build a connectivity graph using the sample points S = {x1 , . . . , xn }
as graph nodes and defining a symmetric weight matrix Wij = w(xi , xj ). The
most common way to build this matrix is to use the Gaussian Kernel and define
w(x_i, x_j) = exp(−||x_i − x_j||²/σ²), where σ determines the radius of the
neighborhoods centered at individual sample points. We start with this matrix towards
defining a Markov chain over this graph. We first choose a parameter α ∈ [0, 1]
Diffusion Maps and Local Models for Wind Power Prediction 567
Algorithm 1. Diffusion Maps Algorithm.

Input: S = {x_1, . . . , x_n}, the original data set.
Output: {Ψ^t(x_1), . . . , Ψ^t(x_n)}, the embedded data set.
1: Construct G = (S, W) where W is a symmetric distance matrix, W_ij = w(x_i, x_j).
2: Define the initial density function as q(x_i) = Σ_{j=1}^{n} w(x_i, x_j).
3: Normalize the weights by the density, w^{(α)}(x, y) = w(x, y)/(q(x)^α q(y)^α).
4: Let g^{(α)}(x_i) = Σ_{j=1}^{n} w^{(α)}(x_i, x_j) be the graph degree. Define the transition probability P_ij = p^{(α)}(x_i, x_j) = w^{(α)}(x_i, x_j)/g^{(α)}(x_i).
5: Obtain the eigenvalues {λ_r^t}_{r≥0} and eigenfunctions {ψ_r}_{r≥0} of P^t.
6: Compute the embedding dimension using a threshold: d = max{l : |λ_l^t| > δ|λ_1^t|}.
7: Form the Diffusion Map, Ψ^t(x) = (λ_1^t ψ_1(x), . . . , λ_d^t ψ_d(x))^τ.
that is used to control the combined effects of manifold geometry and sample
distribution and define w^{(α)}(x_i, x_j) = w(x_i, x_j)/(q(x_i)^α q(x_j)^α), where
q(x_i) = Σ_{j=1}^{n} w(x_i, x_j) is the degree at the i–th node of the W matrix.
We now define the new α–degree at x_i as g^{(α)}(x_i) = Σ_{j=1}^{n} w^{(α)}(x_i, x_j)
and arrive at the transition probability p^{(α)}(x_i, x_j) = w^{(α)}(x_i, x_j)/g^{(α)}(x_i).
Notice that when α = 0, we are essentially defining
the weight matrix typically used in spectral dimensionality reduction [3]. In this
case, the infinitesimal generator L0 of the resulting Markov chain acts on an f
as L_0(f) = Δ(f q)/q − (Δ(q)/q) f [5], with Δ the manifold’s Laplace–Beltrami operator.
However, when α = 1 the infinitesimal generator L1 verifies L1 (f ) = Δf and
it is not influenced by the underlying density q (this will not be the case for
α = 0 unless q is uniform). We will consider here the case α = 1 and write just
p^t(x_i, x_j) if a t–step Markov chain is used. We will denote by P^t the matrix of
transition probabilities in t steps (P^t_{i,j} = p^t(x_i, x_j)).
Let λi , ψi (x), i = 0, . . . , n − 1, be the eigenvalues and eigenvectors of P , where
we assume 1 = λ_0 ≥ · · · ≥ λ_{n−1}; P^t then has eigenvalues λ_i^t and the same
eigenvectors ψ_i(x). To select, for a given t, the embedding dimension d = d(t) we
may fix a precision δ and choose d = max{l : |λ_l^t| > δ|λ_1^t|}. The embedding
projection is then Ψ^t(x) = (λ_1^t ψ_1(x), . . . , λ_d^t ψ_d(x))^τ, with τ the transpose operator.
The previous steps are summarized in Algorithm 1.
The Euclidean distance ||Ψ^t(x) − Ψ^t(z)||² = Σ_j λ_j^{2t} (ψ_j(x) − ψ_j(z))² in the
embedding coincides with the diffusion distance D_t²(x, z) = ||p^t(x, ·) − p^t(z, ·)||²_{L²(1/φ_0)},
where φ_0 is the stationary distribution of the P–Markov process. In other words,
if the diffusion distance Dt approximates the manifold metric, we get the orig-
inal data embedded in a lower dimension space for which Euclidean distance
captures the original local geometry, something very convenient if we want to
apply K–means. Once we have obtained K clusters {C1 , . . . , Ck } over the em-
bedded features, they can be projected back into clusters {A1 , . . . , Ak } in the
original space S defined as Ai = {xj |Ψt (xj ) ∈ Ci }.
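Algorithm 1 can be sketched in a few lines (an illustrative NumPy implementation of our own, not the authors' code; the dense eigendecomposition is only practical for small samples):

```python
import numpy as np

def diffusion_map(X, sigma, alpha=1.0, t=1, delta=0.1):
    """Diffusion Maps embedding following Algorithm 1 (illustrative sketch)."""
    # 1. Gaussian kernel weight matrix
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / sigma ** 2)
    # 2-3. density q and alpha-normalised weights
    q = W.sum(axis=1)
    W_alpha = W / np.outer(q ** alpha, q ** alpha)
    # 4. row-normalise to a Markov transition matrix P
    P = W_alpha / W_alpha.sum(axis=1, keepdims=True)
    # 5. eigendecomposition; P is similar to a symmetric matrix, so the
    #    spectrum is real (we drop numerical imaginary noise)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # 6. embedding dimension from the spectral decay threshold delta
    lam_t = vals ** t
    d = max(l for l in range(1, len(vals)) if abs(lam_t[l]) > delta * abs(lam_t[1]))
    # 7. diffusion map, skipping the trivial constant eigenvector psi_0
    return vecs[:, 1:d + 1] * lam_t[1:d + 1]
```

Running K–means on the returned coordinates then amounts to clustering with the diffusion metric of the original space.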
A limitation of the above scheme is that it implicitly assumes the attributes
to be homogeneous; however, real–life datasets are frequently heterogeneous,
something that often cannot be handled just by normalizing the data. In [10] a
method is proposed to adapt DM to work with heterogeneous features just by
dealing separately with groups of attributes that are deemed to be homogeneous.
More precisely, assume that we have M such groups; we then split each pattern
x_i into M new, lower dimensional ones x_i^m and build the corresponding sample
sets {S_m}_{m=1}^{M}. We now apply DM as described before to each S_m, obtaining
M embeddings {Ψ_m}_{m=1}^{M} that capture the geometry associated with each feature
subset. Now, these Ψ_m are given by eigenvalue–eigenvector products, with the
eigenvectors being comparable across the embeddings since they have unit norm.
We can also make the eigenvalues comparable if we re–scale them as
λ_{m,i}^{new} = λ_{m,i} / Σ_j λ_{m,j}.
Thus the union of the normalized features gives a set of homogeneous features
that still represent the intrinsic geometry of our original data and we can simply
apply DM again to this new dataset to get the final lower dimensional embedding.
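The re-scaling and concatenation step can be sketched as follows (a hypothetical helper of our own; it assumes the per-group DM eigenvalues and eigenvectors have already been computed):

```python
import numpy as np

def combine_heterogeneous(embeddings):
    """Merge per-group DM outputs (eigenvalues, eigenvectors) into one
    homogeneous feature set: rescale each group's eigenvalues so that they
    sum to one, then concatenate the rescaled features column-wise."""
    feats = []
    for vals, vecs in embeddings:
        vals = np.asarray(vals, dtype=float)
        feats.append(np.asarray(vecs) * (vals / vals.sum()))  # scale each column
    return np.hstack(feats)
```

The resulting matrix is what a second, final DM pass would then be applied to.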
In summary, DM makes possible low dimensional embeddings of heteroge-
neous data while transforming the original space metric into an Euclidean one.
However, they require proper choices for the parameters σ and δ. Moreover, a
main drawback (as also happens in spectral DR) is the difficulty of applying the
computed DM projection to new, unseen patterns. There are several proposals
for this such as Nyström formulae [4] or Laplacian Pyramids [10], but this is still
an area where further work is needed.
3 Experiments
In this section we will apply K–means clustering based on DM to build local
models for predicting the wind energy production in Spain and compare it with
the results of K–means applied to either the original full dimensional data or
to PCA lower dimensional features. Once clusters are defined, we will use Ridge
Regression (RR) [7] for model building. Recall that RR adds an ℓ2 regularization
term to an Ordinary Least Squares (OLS) regression, so the optimization problem
becomes min_w ||Xw − y||²_2 + γ||w||²_2. This prevents over–fitting in plain OLS but
requires a procedure to compute the penalty term γ. While stronger models
could be considered [2,8], our primary interest here is whether DM–based local
models improve on either other local models or global ones. If so, stronger models
should also benefit from this, although we will not consider them in this work.
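A minimal sketch of the closed-form RR solution and of one model per cluster (our own illustration; the function names are hypothetical, and γ and the cluster labels are assumed given):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge regression: w = (X^T X + gamma I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

def fit_local_models(X, y, labels, gamma):
    """One ridge model per cluster label (e.g. K-means assignments)."""
    return {c: ridge_fit(X[labels == c], y[labels == c], gamma)
            for c in np.unique(labels)}
```

At prediction time, each pattern is routed to the weight vector of its assigned cluster.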
We will use as inputs NWPs from the European Centre for Medium–Range
Weather Forecasts (ECMWF) [6] and consider five surface variables: wind speed
(V ), its horizontal and vertical components (Vx and Vy ), pressure (p) and temper-
ature (T ), which we normalize component–wise to zero mean and unit variance.
These variables are available over a rectangular 522–point, 0.5◦ resolution grid
that covers the Iberian peninsula. Pattern dimension is thus 2, 610 = 5 × 522.
Two years of data will be considered, the first one for training purposes and the
second for testing. Since eight forecasts are given daily, the training sample size is
thus 2,920 = 365 × 8, close to the feature dimension, hence making regularization
mandatory.
[Figure 1: wind power histograms (production % on the x axis) for the three clusters obtained with each approach: (a) LMDM clusters, (b) LMOr clusters, (c) LMPC clusters.]

Fig. 1. Wind power histograms for the clusters obtained using the 3 approaches
[Figure 2: scatter plots of average wind module (m/s) versus wind power (%), actual values on the top row and model predictions on the bottom row, for the clusters C1, C2 and C3.]

Fig. 2. Average wind versus actual power (top) and average wind versus predicted
power (bottom) for the clusters C1 (left), C2 (center) and C3 (right)
We will use NWPs and wind energy production data y for cluster definition.
Wind power is obviously unknown for the test dataset so we will use as a proxy
the wind power forecast of a global model. We already mentioned the difficulties
associated with the application of DM to test patterns. We will sidestep them by
building the DM features and clusters, as well as the plain and PCA clusters,
over the full two year dataset. This confers some advantage to the local models
over the global one, partially compensated by the global model influencing cluster
definition. In any case, and as mentioned before, the computation of DM features
for new patterns is an area of active research.
We consider wind power production and the NWP variables as heterogeneous
and build first DM features separately on the y, V , Vx , Vy , T and p variables.
In all of them we define the graph’s weight matrix using a Gaussian Kernel with
bandwidth σ equal to the dataset diameter. We arrived at this value heuristi-
Table 1. (a) Errors per cluster (top). (b) Global errors (bottom).

(a) Errors per cluster (local model vs. global model):

         MAE           RMAE           stdAE          stdRAE
       LMDM   GM     LMDM    GM     LMDM   GM      LMDM    GM
C1     2.29  2.53    19.54  22.98    1.88  1.94    31.49  40.66
C2     4.01  3.83    19.93  18.34    3.20  3.16    72.67  65.63
C3     5.94  5.73    15.67  13.93    4.76  4.94    23.13  21.20

       LMOr   GM     LMOr    GM     LMOr   GM      LMOr    GM
C1     2.52  2.69    18.76  20.10    1.95  2.14    22.94  23.78
C2     3.72  3.65    20.52  19.37    3.11  3.15    38.75  40.73
C3     5.40  4.77    24.14  20.36    4.39  4.26    94.33  93.43

       LMPC   GM     LMPC    GM     LMPC   GM      LMPC    GM
C1     2.53  2.66    18.99  20.01    1.95  2.10    23.24  23.84
C2     3.68  3.64    20.39  19.62    3.08  3.12    37.37  40.48
C3     5.21  4.78    22.91  20.11    4.31  4.28    93.32  92.68

(b) Global errors:

          GM   LMDM   LMOr   LMPC  CMDM;G  CMOr;G  CMPC;G
MAE     3.48   3.47   3.56   3.53   3.37    3.40    3.42
RMAE   19.89  19.24  20.52  20.34  18.35   19.32   19.46
stdAE   3.16   3.19   3.21   3.17   2.80    2.87    2.88
stdRAE 51.60  52.80  51.25  50.87  44.98   44.12   44.37
cally after visually analyzing the structure of the resulting embeddings. We also
work with t = 1, i.e., considering the one–step diffusion distance on the original
feature space, and the final embedding dimension was obtained using a δ = 0.1 pre-
cision parameter. Embedding dimensions for the above variables were 1, 6, 3, 5,
1 and 1 respectively and the final dimension for the DM embedding is 5. There-
fore, we also considered a 2, 610 to 5 dimension reduction for PCA. Finally, the
choice of K is always difficult. We will consider 3 clusters that hopefully capture
high, medium and low ranges of wind power. While initial centroids are ran-
domly chosen in K–means, we found that the DM parameters used lead to very
stable cluster structures that are essentially independent of centroid initializa-
tion. Figure 1 gives the cluster histograms of the local wind power distributions
for each approach. As can be seen, the 3 DM clusters offer a more clear–cut
structure while the other two methods seem to differentiate less between wind
energy regimes.
Once DM, PCA and original feature clusters are defined, we build a global
model and also three local RR models, one per cluster, that we denote as GM,
LMDM , LMPC and LMOr respectively. Prior to model building we select the
optimal regularization parameters for all the RR models by grid search for γ in
the interval [10^{−2}, 10^{4}], with a logarithmic step of 0.1 and using as validation
set the last 20% patterns of the first year data clusters. As usually done in wind
energy, we measure model performance by the mean absolute error (MAE) and
the relative mean absolute error (RMAE). The MAE is defined as the mean of
the absolute differences between the predictions and the real values. The RMAE
computes the mean of the ratios of the absolute errors over the actual wind power. Table 1a
contains local model errors per cluster as well as the cluster errors of the global
model. As we can see, the local models beat the global one in the first, low
wind power cluster C1 but GM beats them in C2 and particularly in the high
wind power cluster C3. A reason for this can be seen in Fig. 2, which depicts for
the 3 LMDM clusters the relationships between average wind and power (top)
and between average wind and predicted power (bottom). Cluster C3 has the
fewest points but presents several marked outliers; these two facts
clearly penalize the local C3 models. Table 1a also gives values for the standard
deviations of MAE and RMAE, although they are rather conservative (assuming
independence for these errors would lead to dividing the values given by the square
root of the sample size and, hence, much smaller values).
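The two error measures used above can be sketched as follows (our own helper functions; the eps guard against zero production and reporting RMAE as a percentage are assumptions):

```python
import numpy as np

def mae(pred, actual):
    """Mean absolute error between predicted and actual wind power."""
    return np.mean(np.abs(pred - actual))

def rmae(pred, actual, eps=1e-12):
    """Relative MAE, in %: mean of the absolute errors over actual production.
    eps guards against division by zero production (our own assumption)."""
    return 100.0 * np.mean(np.abs(pred - actual) / np.maximum(actual, eps))
```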
These facts suggest building predictors that combine a local model on the C1 cluster
and the global one on the other two. Table 1b contains the MAE and RMAE
errors of the individual GM, LMDM , LMOr and LMPC , and of the combined
models CMDM;G , CMOr;G and CMPC;G . It shows that there is a clear advantage
of the combined models over the global one and that the gain is largest for the
CMDM;G model. While modest at first sight (a MAE of 3.37% against 3.48% for
GM), such gains may have a large economic impact, as wind energy represents
about 16% of Spain’s electricity demand.
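Such a combined predictor can be sketched as follows (a hypothetical helper of our own; cluster_of, the weight vectors and the choice of local clusters are illustrative assumptions):

```python
import numpy as np

def combined_predict(X, cluster_of, local_w, global_w, local_clusters=(0,)):
    """Predict with the local ridge model on patterns assigned to the low
    wind power cluster(s) and with the global model everywhere else."""
    labels = np.array([cluster_of(x) for x in X])
    pred = X @ global_w                        # global model by default
    for c in local_clusters:
        mask = labels == c
        if mask.any():
            pred[mask] = X[mask] @ local_w[c]  # override with the local model
    return pred
```

This mirrors the CM construction: the DM local model on the low power cluster C1 and the global model on C2 and C3.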
4 Conclusions

Local models are obviously useful in many applied problems, wind energy
forecasting being a clear example. The main obstacle to their construction is usually
how to define the local regions on which models will be built. A natural option
is K–means clustering, which requires choosing an adequate metric, something
always difficult and more so when we also have to deal with the high dimensional,
heterogeneous features that arise in wide area wind energy forecasting. In this
work we have applied to this task Diffusion Maps (DM), a novel dimensionality
reduction technique that lends itself naturally to work with heterogeneous data
and that has the very important property that Euclidean metric in the projected
space is naturally related to a diffusion distance on the original features. This
distance is in turn related to a Markov process whose infinitesimal generator is
just the Laplace–Beltrami operator of the underlying manifold. We can expect
that the Euclidean metric in the reduced features captures the original space
metric and, thus, standard K–means on the embedding results in meaningful
clusters for the original features.
We have compared this approach with clusters obtained by straight Euclidean
K–means on the full features and on PCA features with the same dimension as
the DM ones, building local ridge regression models that, in turn, are compared
with a global one. The local models beat the global one over a low wind power
cluster, with DM a clear winner, but the global model performs better on the
other medium and high wind power clusters. This suggests defining a mixed
model that uses the DM local model for the low wind power cluster and the global
one for the other two. This mixed model outperforms all the others.
We can conclude that DM dimensionality reduction and clustering is an effective
tool for local model building, although further work is needed. In fact,
DM features are derived from a spectral analysis of the sample distance matrix.
As is also the case with spectral dimensionality reduction and clustering, this
makes it costly to assign new, unseen patterns to already defined clusters. Tools
to alleviate this are the Nyström formulae or Laplacian Pyramids. We are currently
doing research on this for wind energy and other applied problems.

Acknowledgement. With partial support from grant TIN2010-21575-C02-01


of Spain's Ministerio de Economía y Competitividad and the UAM–ADIC Chair
for Machine Learning in Modelling and Prediction. The first author is also sup-
ported by an FPI-UAM grant and kindly thanks the Applied Mathematics De-
partment of Yale University for receiving her during a visit. The second author is
supported by the FPU-MEC grant AP2008-00167. We also thank Red Eléctrica
de España, Spain’s TSO, for providing historic wind energy data.

References
1. Alaíz, C., Barbero, A., Fernández, A., Dorronsoro, J.: High wind and energy specific models for global production forecast. In: Proceedings of the European Wind Energy Conference and Exhibition (EWEC 2009), Marseille, France (March 2009)
2. Barbero, A., López, J., Dorronsoro, J.: Kernel methods for wide area wind power forecasting. In: Proceedings of the European Wind Energy Conference and Exhibition (EWEC 2008), Brussels, Belgium (April 2008)
3. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
4. Bengio, Y., Delalleau, O., Roux, N.L., Paiement, J., Vincent, P., Ouimet, M.: Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation 16(10), 2197–2219 (2004)
5. Coifman, R., Lafon, S.: Diffusion maps. Applied and Computational Harmonic Analysis 21(1), 5–30 (2006)
6. European Centre for Medium-Range Weather Forecasts (2005), http://www.ecmwf.int/
7. Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
8. Monteiro, C., Bessa, R., Miranda, V., Botterud, A., Wang, J., Conzelmann, G.: Wind power forecasting: State-of-the-art 2009. Tech. rep., INESC Porto and Argonne National Laboratory (2009)
9. Pinson, P., Nielsen, H., Madsen, H., Nielsen, T.: Local linear regression with adaptive orthogonal fitting for the wind power application. Statistics and Computing 18(1), 59–71 (2009)
10. Rabin, N., Coifman, R.: Heterogeneous datasets representation and learning using diffusion maps and Laplacian pyramids. In: Proceedings of the 12th SIAM International Conference on Data Mining (SDM 2012), Anaheim, California, USA (April 2012)
A Hybrid Model for S&P500 Index Forecasting

Ricardo de A. Araújo1,2 , Adriano L.I. Oliveira1 , and Silvio R. L. Meira1


1
Informatics Center, Federal University of Pernambuco, Recife, PE, Brazil
2
Informatics Department, Federal Institute of Sertão Pernambucano,
Ouricuri, PE, Brazil
ricardo.araujo@ifsertao-pe.edu.br, {raa,alio,srlm}@cin.ufpe.br

Abstract. This paper presents a morphological-linear model, called the
dilation-erosion-linear perceptron (DELP), for financial forecasting. It
is a hybrid model composed of morphological operators in the context
of lattice theory and a linear operator. A gradient-based method
is presented to design the proposed DELP (learning process). An
automatic phase fix procedure is also included to adjust the time phase
distortions observed in financial phenomena. Furthermore, an experimental
analysis is conducted with the proposed model using the S&P500 Index,
where five well-known performance metrics and an evaluation function
are used to assess the forecasting performance.

Keywords: Hybrid Perceptrons, Mathematical Morphology, Lattice


Theory, Gradient-based Learning, Financial Forecasting.

1 Introduction
Several linear and nonlinear statistical models have been proposed in the literature
to solve the problem of financial phenomena forecasting [1]. However, they
require a problem specialist to validate their forecasts, which limits the development
of automatic forecasting systems [1]. Alternatively, artificial neural networks
(ANNs) [2, 3] have been applied in an attempt to overcome this drawback.
However, due to many complex features frequently present in these phenomena,
such as irregularities, volatility, trends, and noise, all these models face a
limitation in financial forecasting known as the random walk dilemma (RWD),
which has been reported in the literature [2–6]. In this context, forecasts
generated by arbitrary models have a characteristic one-step-ahead delay with
respect to the actual values, so that there is a time phase distortion in the
reconstruction of financial phenomena [2–6]. This behavior has led some researchers
to argue that financial phenomena are unpredictable [4, 5].
In this work we present a hybrid model to overcome the RWD in the financial
forecasting problem. The proposed model is generically called the dilation-
erosion-linear perceptron (DELP) and is composed of morphological operators
in the context of lattice theory and a linear operator. The proposed learning
process of the DELP employs a gradient-based method built on ideas from
the back-propagation (BP) algorithm, using a systematic approach to overcome
the non-differentiability of morphological operations, based

A.E.P. Villa et al. (Eds.): ICANN 2012, Part II, LNCS 7553, pp. 573–581, 2012.

c Springer-Verlag Berlin Heidelberg 2012
on ideas from Pessoa and Maragos [7] and Sousa [8]. We have also included
in the learning process of the DELP a procedure to overcome the RWD, called
the automatic phase fix procedure (APFP) [2, 3, 6], a correction step
geared toward eliminating the time phase distortions that occur in financial
phenomena.
Furthermore, an experimental analysis is conducted with the DELP using
the S&P500 Index. Five metrics are used to assess the forecasting performance,
and an evaluation function (EF) is further used as a global forecasting performance
indicator. The obtained results are discussed and compared with those of models
previously reported in the literature.
The obtained results show better performance of the proposed model when
compared to the random walk model [4, 5], the time-delay added evolutionary
forecasting (TAEF) method [3] and the morphological-rank-linear time-lag added
evolutionary forecasting (MRLTAEF) method [6]. The last two models, whose
accurate and precise forecasts were recently presented in the literature, have
shown substantial performance gains over classical forecasting models.

2 The Time Series Forecasting


A time series is a sequence of observations of a given phenomenon observed
in a discrete or continuous space. In this work a time series is considered
discrete in time and equidistant, and formally defined by

x = {xt ∈ R | t = 1, 2, . . . , N }, (1)

where t is the temporal index, which is called time and defines the granularity
of observations of a given phenomenon, and N is the number of observations.
The aim of forecasting techniques applied to a given time series is to provide
a mechanism that allows, with certain accuracy, forecasting the future
values of x, given by x_{t+h}, h = 1, 2, . . . , H, where h represents the forecasting
horizon of H steps ahead. These techniques try to identify regular patterns
present in the data set, creating a model capable of generating the next
temporal patterns. In this context, the most relevant factor for an accurate
forecasting performance is the correct choice of the past window, or the time
lags, used to represent the given time series.
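The time-lag representation described above can be sketched as follows (an illustrative helper, not code from the paper):

```python
def lag_patterns(x, lags):
    """Build (input pattern, target) pairs from a series: the pattern for
    time t collects x[t - l] for each chosen lag l, the target is x[t]."""
    m = max(lags)
    return [([x[t - l] for l in lags], x[t]) for t in range(m, len(x))]

series = [1, 2, 3, 4, 5, 6]
pairs = lag_patterns(series, lags=[1, 2])
# first pair: inputs [x[1], x[0]] = [2, 1], target x[2] = 3
```

With lags 2–151, as used later for the S&P500 series, each pattern would collect 150 past values per target.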
In a mathematical sense, the relationship that involves the historical data of a
time series defines a d-dimensional phase space, where d is the minimum dimension
capable of representing such a relationship. Therefore, a d-dimensional phase space
can be built so that it is possible to unfold the corresponding time series. Takens
[9] proved that if d is sufficiently large, such a phase space is homeomorphic to the
phase space that generated the series. Takens' Theorem [9] is the theoretical
justification of phase space reconstruction using time lags.
2.1 The Random Walk Dilemma


A naive forecasting strategy is to take the last observation of a time series as
the best forecast of its next value (x_{t+1} = x_t). This kind of model is
known as the random walk (RW) model [4], which is defined by

xt = xt−1 + zt , (2)

where xt is the current observation, xt−1 is the immediate observation before xt ,


and zt is a noise term with a Gaussian distribution of zero mean and constant
standard deviation. The model above clearly implies that, as the information set
consists of past time series data, the future data is unpredictable. Therefore, on
average, the value x_{t-1} is indeed the best forecast of the value x_t; a proof of
this statement is given in [3, 6].
It is possible to verify that the use of an arbitrary model to make forecasts
has an intrinsic limitation, since the generated forecasts have a characteristic
one-step-ahead delay with respect to the original time series values. This
behavior is common in finance and economics and is called the random walk
dilemma, or random walk hypothesis [4]. Under these conditions, escaping
the random walk dilemma is therefore a hard task [3, 6].
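A quick numerical illustration of Eq. (2) and the naive strategy (a sketch, not from the paper): for a simulated random walk with unit-variance Gaussian noise, the mean squared error of the last-value forecast approaches the noise variance, which is the irreducible error floor any lag-based model faces on average.

```python
import random

def random_walk(n, sigma=1.0, seed=0):
    """Simulate x_t = x_{t-1} + z_t with z_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, sigma)
        out.append(x)
    return out

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

x = random_walk(5000)
naive_error = mse(x[:-1], x[1:])   # forecast x_t with x_{t-1}
# naive_error is close to sigma^2 = 1, the irreducible noise variance
```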

3 The Dilation-Erosion-Linear Perceptron


The proposed dilation-erosion-linear perceptron (DELP) consists of a linear
combination of nonlinear operators (dilation and erosion) and a linear
operator (a finite impulse response filter). Next we present the definition and
the proposed training algorithm to design the DELP.
Let x = (x_1, x_2, . . . , x_n) ∈ R^n be a real-valued input signal inside an n-point
moving window and let y be the model output. Then, the DELP is defined as a
morphological-linear system with local transformation rule x → y, given by

y = λα + (1 − λ)β, λ ∈ [0, 1], (3)

where
β = x · pT = x1 p1 + x2 p2 + . . . + xn pn , (4)
and
α = θϕ + (1 − θ)ω, θ ∈ [0, 1], (5)
in which

ϕ = δ_a(x) = ⋁_{i=1}^{n} (x_i + a_i),  (6)

and

ω = ε_b(x) = ⋀_{i=1}^{n} (x_i +′ b_i),  (7)

where the term n denotes the dimensionality of the input signal (x), the terms
λ, θ ∈ R and a, b, p ∈ R^n. The vector p ∈ R^n represents the linear operator weights.
The term β represents the output of the linear operator. The term α represents
the linear combination of the morphological operators of dilation and erosion
(the mixture term is defined by θ). The terms ϕ and ω represent the outputs
of the morphological operators of dilation and erosion, respectively. The vectors a
and b represent the structuring elements (weights) of the dilation (δ_a(x)) and
erosion (ε_b(x)) operators employed in the nonlinear module of the DELP.
The terms ⋁ and ⋀ represent the supremum and the infimum operations. Note
that the output y is given by a linear combination of the linear operator and
another linear combination of the morphological operators of dilation and erosion
(the mixture term is defined by λ). The main differences between "+" and "+′"
are given by the following rules:

(−∞) + (+∞) = (+∞) + (−∞) = −∞,  (8)

and

(−∞) +′ (+∞) = (+∞) +′ (−∞) = +∞.  (9)
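For finite-valued inputs the rules above reduce to ordinary addition, and the DELP forward pass of Eqs. (3)–(7) can be sketched directly (an illustrative implementation, not the authors' code):

```python
def delp_forward(x, a, b, p, lam, theta):
    """DELP output y = lam*alpha + (1-lam)*beta, following Eqs. (3)-(7)."""
    beta = sum(xi * pi for xi, pi in zip(x, p))     # linear (FIR) module, Eq. (4)
    phi = max(xi + ai for xi, ai in zip(x, a))      # dilation: sup of x_i + a_i
    omega = min(xi + bi for xi, bi in zip(x, b))    # erosion: inf of x_i + b_i
    alpha = theta * phi + (1 - theta) * omega       # morphological mixture, Eq. (5)
    return lam * alpha + (1 - lam) * beta

y = delp_forward([0.2, 0.5], a=[0.0, 0.0], b=[0.0, 0.0],
                 p=[0.5, 0.5], lam=0.5, theta=0.5)
# dilation = 0.5, erosion = 0.2, alpha = 0.35, beta = 0.35, so y = 0.35
```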

3.1 Learning Process


The design of the DELP model requires the adjustment of the parameters
a, b, p ∈ R^n and λ, θ ∈ R. Therefore, the weight vector w ∈ R^{3n+2} of the
DELP model is given by

w = (a, b, p, λ, θ).  (10)
During the proposed learning process, all parameters of the DELP model are
iteratively adjusted according to an error criterion until convergence. Therefore,
it is necessary to define an objective function J(w) to be minimized during
the learning process, given by


J(w) = Σ_{m=1}^{M} e²(m),  (11)

in which M is the number of input patterns used in the learning process and
e(m) is the instantaneous error for the m-th input pattern, given by
e(m) = t(m) − y(m), (12)
where t(m) and y(m) are the target and the model output, respectively.
The learning process updates the weight vector w based on the gradient steep-
est descent method. The adjustment of vector w for the m-th input training
pattern is given by the following iterative formula:

w(i + 1) = w(i) − μ∇J(w), (13)

where μ > 0 and i ∈ {1, 2, . . .}. The term ∇J(w) is the gradient, which is given by

∇J(w) = ∂J/∂w = (∂J/∂a, ∂J/∂b, ∂J/∂p, ∂J/∂λ, ∂J/∂θ),  (14)
in which

∂J/∂w = −2e(m) (∂y/∂a, ∂y/∂b, ∂y/∂p, ∂y/∂λ, ∂y/∂θ).  (15)

Note that the existence of the gradient of J with respect to w depends on the
existence of the gradients ∂y/∂a, ∂y/∂b, ∂y/∂p, ∂y/∂λ and ∂y/∂θ. Next, we present
the formulas to calculate them.
The term ∂y/∂λ is given by

∂y/∂λ = α − β.  (16)
The term ∂y/∂p is given by

∂y/∂p = (∂y/∂β)(∂β/∂p),  (17)

in which

∂y/∂β = 1 − λ,  (18)

and

∂β/∂p = x,  (19)
where x represents the input signal (m-th input training pattern).
The term ∂y/∂θ is given by

∂y/∂θ = (∂y/∂α)(∂α/∂θ),  (20)

in which

∂y/∂α = λ,  (21)

and

∂α/∂θ = ϕ − ω.  (22)
The terms ∂y/∂a and ∂y/∂b are estimated using the concept of the smoothed rank
indicator vector [7, 8] (because the dilation and erosion operators can be seen as
particular cases of the rank function), where we choose the smoothed unit sample
function Q_σ(x) = [q_σ(x_1), q_σ(x_2), . . . , q_σ(x_n)], in which

q_σ(x_i) = sech²(x_i / σ), ∀ i = 1, . . . , n.  (23)
Note that the choice of the scale factor σ directly affects the estimation and
interpolation of the gradients ∂y/∂a and ∂y/∂b. However, the learning process of the
DELP model works even with σ → 0, since in this particular case the gradient
is given in terms of the usual rank indicator vector [7, 8].
The term ∂y/∂a is given by

∂y/∂a = (∂y/∂α)(∂α/∂ϕ)(∂ϕ/∂a) = λ (∂α/∂ϕ)(∂ϕ/∂a),  (24)
in which

∂α/∂ϕ = θ,  (25)

and

∂ϕ/∂a = Q_σ(ϕ·1 − (x + a)) / (Q_σ(ϕ·1 − (x + a)) · 1ᵀ).  (26)

In the same way, the term ∂y/∂b is given by

∂y/∂b = (∂y/∂α)(∂α/∂ω)(∂ω/∂b) = λ (∂α/∂ω)(∂ω/∂b),  (27)

in which

∂α/∂ω = 1 − θ,  (28)

and

∂ω/∂b = Q_σ(ω·1 − (x + b)) / (Q_σ(ω·1 − (x + b)) · 1ᵀ).  (29)
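A sketch of Eq. (26), the smoothed rank-indicator gradient of the dilation with respect to its structuring element (illustrative only, using the sech² smoothing of Eq. (23)):

```python
import math

def dphi_da(x, a, sigma=1.5):
    """Gradient of phi = max_i(x_i + a_i) w.r.t. a, smoothed with
    q_sigma(u) = sech(u / sigma)^2 and normalized to sum to one."""
    phi = max(xi + ai for xi, ai in zip(x, a))
    q = [1.0 / math.cosh((phi - (xi + ai)) / sigma) ** 2
         for xi, ai in zip(x, a)]
    total = sum(q)
    return [qi / total for qi in q]

g = dphi_da([0.2, 0.5], [0.0, 0.0], sigma=0.01)
# for small sigma this approaches the hard argmax indicator [0, 1]
```

The erosion gradient of Eq. (29) is analogous, with the minimum in place of the maximum.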
Besides, in order to automatically adjust time phase distortions in financial
time series representation, we have included an automatic phase fix procedure
(APFP) [3, 6] in the proposed learning process of the DELP model. Figure 1
presents the APFP.

Fig. 1. Automatic phase fix procedure

According to Figure 1, in the first step an input pattern x is presented to the
DELP, generating the output y1. This first output y1 is used to rebuild the input
pattern in the second step. The reconstructed pattern is presented to the same
DELP, generating the second output y2, which is the phase-fixed forecast.
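One plausible reading of the two-step procedure (the exact pattern-rebuilding rule is given in [3, 6]; here the window is assumed to shift by one with the first forecast appended, and the predictor is a hypothetical stand-in, not a trained DELP):

```python
def apfp_forecast(window, model):
    """Two-pass forecast: obtain y1, rebuild the input pattern with y1,
    and forecast again; y2 is taken as the phase-fixed output."""
    y1 = model(window)
    rebuilt = window[1:] + [y1]
    return model(rebuilt)

mean_model = lambda w: sum(w) / len(w)   # stand-in predictor, not the DELP
y2 = apfp_forecast([1.0, 2.0, 3.0], mean_model)
# first pass gives y1 = 2.0; second pass over [2.0, 3.0, 2.0] gives y2 = 7/3
```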

4 Simulations and Experimental Results


A financial time series (the S&P500 Index) was used as a test bed for the evaluation
of the proposed DELP model. This series was normalized to lie within the range
[0, 1] and divided into three sets according to Prechelt [10]: a training set (50%),
a validation set (25%) and a test set (25%).
For the experiments, the entries of the DELP weight vectors a, b and p are
randomly initialized within the range [−1, 1]. The initial DELP mixture coefficients,
λ and θ, are randomly chosen in the interval [0, 1]. Based on exhaustive
experiments to determine the best learning rate (μ) and the scale factor (σ), we
use μ = 0.01 and σ = 1.5. It is worth mentioning that three stop conditions
are used in the learning process [10]: i) the maximum number of epochs reaches
10^4; ii) the decrease of the training error (Pt) of the cost function falls below
10^-6; iii) the increase of the validation error, or generalization loss (Gl), of the
cost function exceeds 5%.
In order to establish a performance study, results with the random walk
(RW) model [4, 5], which represents the results generated by classical forecasting
models, with the time-delay added evolutionary forecasting (TAEF)
method [3] and with the morphological-rank-linear time-lag added evolutionary
forecasting (MRLTAEF) method [6] are employed in our comparative analysis, where
we investigate the same time series under the same conditions. Additionally, we
have used five well-known evaluation metrics formally defined in [3, 6] to assess
the forecasting performance: mean square error (MSE), mean absolute percentage
error (MAPE), Theil's U statistic (UTS), prediction of change in direction
(POCID) and average relative variance (ARV). We also use an evaluation
function (EF) defined in [6] to serve as a global forecasting performance indicator.
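The standard formulations of three of these metrics can be sketched as follows (the exact definitions used in the paper are those of [3, 6]):

```python
def mape(pred, true):
    """Mean absolute percentage error."""
    return sum(abs((t - p) / t) for p, t in zip(pred, true)) / len(true)

def pocid(pred, true):
    """Percentage of correctly predicted changes of direction."""
    hits = sum((true[i] - true[i - 1]) * (pred[i] - pred[i - 1]) > 0
               for i in range(1, len(true)))
    return 100.0 * hits / (len(true) - 1)

def uts(pred, true):
    """Theil's U: squared error of the model relative to the random walk."""
    num = sum((t - p) ** 2 for p, t in zip(pred[1:], true[1:]))
    den = sum((true[i] - true[i - 1]) ** 2 for i in range(1, len(true)))
    return num / den

series = [1.0, 2.0, 1.5, 2.5]
lagged = [series[0]] + series[:-1]   # pure random-walk forecast
# uts(lagged, series) == 1: a model must reach UTS < 1 to escape the RWD
```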

4.1 S&P500 Index Series

The Standard & Poor's 500 Index (S&P500) is a free-float capitalization-weighted
index given by the prices of five hundred large-cap common stocks traded in the
United States stock market. The S&P500 series corresponds to daily records of
the S&P500 Index from 2008/04/25 to 2012/04/12. For the S&P500 series forecasting
(with a forecasting horizon of one step ahead, H = 1), we use the time lags
2–151 (note that here d = 150) to create the input patterns. Table 1 shows
the results obtained with the RW, TAEF, MRLTAEF and DELP models for all
evaluation metrics.

Table 1. Results obtained with the RW, TAEF, MRLTAEF and DELP models for the S&P500 series (test set)

Evaluation Metrics
Models MSE MAPE UTS ARV POCID EF
RW 5.5775e-004 2.2676e-002 1.0000e-000 5.4492e-002 55.65 26.8336
TAEF 4.8706e-004 1.4576e-002 3.6437e-001 1.3946e-002 92.93 66.6935
MRLTAEF 1.3429e-004 1.2904e-002 1.0046e-001 3.8450e-003 95.96 85.8818
DELP 1.3217e-006 1.2819e-003 2.3614e-003 1.2965e-004 100.00 99.6248

According to Table 1, we can see that in all experiments the POCID metric
is greater than 50%, indicating that the DELP model performs much better
than a "coin-tossing" experiment. The obtained UTS value (2.36e-003)
indicates that the DELP model was able to overcome the random walk dilemma.
Note that the MAPE value (1.28e-003) is very small, that is, without
high percentage deviations. According to the ARV value (1.30e-004), we
can see a much better performance of the proposed model with respect to a naive
forecasting model. Also, we can verify a small value of the MSE metric (1.32e-
006), which means that the forecasts are very close to the real values. The EF metric

Fig. 2. Forecasting results of S&P500 series (last twenty points of the test set): actual
values (solid line) and predicted values (dashed line)

value (99.6) shows that the DELP has a good global forecasting performance.
Moreover, we can see that the proposed DELP model outperformed the RW,
TAEF and MRLTAEF models for all evaluation metrics and for the evaluation
function. Finally, we present in Figure 2 a comparative plot of the real
(solid line) and predicted (dashed line) values generated by the DELP model for the
last twenty points of the S&P500 series test set. Note that the predicted values
are very close to the real values of the S&P500 series, and the one-step delay
of the forecast values did not occur.

5 Conclusion
In this work we presented a morphological-linear model to overcome the random
walk dilemma in the financial forecasting problem. The proposed model,
generically called the dilation-erosion-linear perceptron (DELP), consists
of nonlinear morphological operators in the context of lattice theory and a
linear operator. We also presented a learning process for the DELP employing
a gradient-based method built on ideas from the back-propagation (BP) algorithm,
and we included in this learning process a procedure to overcome the RWD,
a correction step geared toward eliminating the time phase distortions that occur
in financial phenomena. The performance of the proposed DELP model with
respect to the random walk (RW), the time-delay added evolutionary forecasting
(TAEF) and the morphological-rank-linear time-lag added evolutionary forecasting
(MRLTAEF) models was assessed in terms of five well-known performance measures
using the S&P500 Index series. In addition, an evaluation function served as a
global indicator of the quality of the solutions achieved by the investigated models.
The experimental results demonstrated a consistently better performance of
the proposed DELP model, with which we succeeded in overcoming the random
walk dilemma in a particular financial forecasting problem. It is possible to
verify that our forecasts do not have any one-step delay with respect to the real time series
values. Further studies must be developed to better formalize and explain the
properties of the DELP model and to determine its possible limitations on
other time series with components such as trends, seasonalities, impulses, steps
and other non-linearities. Further studies, in terms of risk and financial return,
must also be carried out to determine the additional economic benefits, for an
investor, of using the DELP model in stock market applications. Finally, a
theoretical study of the complexity of the DELP model must be carried out in
order to establish a complete cost-performance evaluation of the DELP.

References
1. Clements, M.P., Franses, P.H., Swanson, N.R.: Forecasting economic and financial time-series with non-linear models. International Journal of Forecasting 20, 169–183 (2004)
2. de A. Araújo, R.: Swarm-based hybrid intelligent forecasting method for financial time series prediction. Learning and Nonlinear Models 5(2), 137–154 (2007)
3. Ferreira, T.A.E., Vasconcelos, G.C., Adeodato, P.J.L.: A new intelligent system methodology for time series forecasting with artificial neural networks. Neural Processing Letters 28, 113–129 (2008)
4. Sitte, R., Sitte, J.: Neural networks approach to the random walk dilemma of financial time series. Applied Intelligence 16(3), 163–171 (2002)
5. Malkiel, B.G.: A Random Walk Down Wall Street, Completely Revised and Updated Edition. W. W. Norton & Company (April 2003)
6. de A. Araújo, R., Ferreira, T.A.E.: An intelligent hybrid morphological-rank-linear method for financial time series prediction. Neurocomputing 72(10-12), 2507–2524 (2009)
7. Pessoa, L.F.C., Maragos, P.: Neural networks with hybrid morphological rank linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition 33, 945–960 (2000)
8. Sousa, R.P., Carvalho, J.M., Assis, F.M., Pessoa, L.F.C.: Designing translation invariant operations via neural network training. In: Proc. of the IEEE Intl. Conference on Image Processing, Vancouver, Canada (2000)
9. Takens, F.: Detecting strange attractors in turbulence. In: Dold, A., Eckmann, B. (eds.) Dynamical Systems and Turbulence. Lecture Notes in Mathematics, vol. 898, pp. 366–381. Springer, New York (1980)
10. Prechelt, L.: Proben1: A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe (1994)
Author Index

Abe, Keiga II-403 Bonath, Werner I-97


Abe, Shigeo II-339 Borisyuk, Roman I-427
Abraham, Nobi I-459 Boudet, Laurence II-548
Adams, Rod I-355, II-499, II-507 Braga, Antônio P. II-100, II-314
Agelidis, Vassilios G. II-33 Brakel, Philemon II-92
Agnes, Everton J. I-145 Braun, Hans Albert I-97
Ahmad, Imran Shafiq II-330 Brodeur, Simon I-547
Alaı́z, Carlos M. II-557, II-565 Brown, Marc B. II-507
Al-Busaidi, Asiya M. I-304 Bruce, Ian I-89
Alippi, Cesare II-305 Bruneau, Pierrick II-548
Allende, Héctor I-710 Brunnet, Leonardo G. I-145
Al-Yahmadi, Amer S. I-304 Burikov, Sergey II-443
Anders, Silke II-540 Burles, Nathan I-49
Anguita, Davide II-156, II-491 Butz, Martin V. I-499
Antonelli, Marco I-322
Araújo, Daniel II-459 Cabral, George G. I-693
Araújo, Ricardo de A. II-573 Calandra, Roberto II-379
Arleo, Angelo I-296 Cantoni, Virginio II-515
Arostegui, Manuel Moreno I-129 Canuto, Anne M.P. I-701, II-180, II-451
Arriero, Javier II-271 Caputi, Angel A. I-217
Asai, Yoshiyuki I-272 Casey, Matthew I-255, I-339
Atsumi, Masayasu I-419 Castelló, Vicente I-322
Austin, James I-49 Castro, Cristiano L. II-100, II-314
Azevedo, Carlos R.B. II-148 Chérif, Farouk I-17
Azizi, Nabiha II-189 Cheung, Kit I-113
Azzag, Hanane II-363 Cho, KyungHyun I-81
Azzopardi, George I-395 Chorowski, Jan II-347
Choudhary, Swadesh I-121
Bacciu, Davide I-57 Cobos, Pedro L. II-221
Barrera, Andres Gaona I-129 Coghill, George M. I-1
Bayer, Justin I-531 Cohen, Shimon I-482
Bedregal, Benjamn R.C. II-180 Conradt, Jörg I-313
Behnke, Sven I-620 Corchado, Emilio II-132
Bello, Rafael I-718 Cortez, Paulo II-523
Berger, Michael I-579 Coyle, Damien I-645
Bertol, Douglas W. I-330 Cui, Nanyi I-280
Beste, Christian I-185
Beuler, Marcel I-97 Damon, Cécilia II-548
Bezerra, Byron Leite Dantas II-246 da Silva Gomes, João I-427
Bhattacharya, Basabdatta Sen I-645 Davey, Neil I-355, II-499, II-507
Blachnik, Marcin II-255, II-263, II-296 Dávila-Chacón, Jorge I-239
Boahen, Kwabena I-121 de Alencar, Maycol I-330
Bogdan, Martin I-169, I-669 de Andrade, Vinı́cius Braga II-246
Bohte, Sander M. I-443 de Carvalho, Luı́s Alfredo V. I-379

Deisenroth, Marc Peter II-379 Freire, Luı́s II-238


Del-Moral-Hernandez, Emilio II-355 Futagi, Daiki I-247
del Pobil, Angel P. I-322
Denizdurduran, Berat I-177, I-474 Gao, Peiran I-121
de Oliveira, João Fausto Lorenzato Garcia, Christophe II-58, II-172
II-66 Garcı́a-Hernández, Laura II-132
de Oliveira, Luiz A.H.G. II-451 Gerrard, Claire I-1
Depaire, Benoı̂t I-718 Gewaltig, Marc-Oliver I-209
De Pieri, Edson R. I-330 Ghio, Alessandro II-156, II-491
de Sousa, Giseli I-355 Gianniotis, Nikolaos II-395
de Souza, Renata M.C.R. II-435 Giese, Martin A. I-288
Dı́az Garcı́a, Julia II-565 Gisbrecht, Andrej II-531
Dieleman, Sander II-92 Giugliano, Michele I-193
Dı́ez-Fernández, Marta II-140 Glodek, Michael I-677
Doan, Nhat-Quang II-363 Gonçalves, André R. II-148
Dolenko, Sergey II-443 Gonçalves, Luiz M.G. II-451
Dolenko, Tatiana II-443 Gonzalez, Javier II-271
Dominey, Peter F. I-596 González Marcos, Ana Ma II-565
Donate, Juan Peralta II-523 Gorse, Denise II-140
Dorronsoro, José R. II-557, II-565 Gouveia, Ana Isabel Rodrigues II-238
Dozono, Hiroshi I-563 Grèzes, Félix I-296
Duch, Wlodzislaw II-255, II-296 Groß, Horst-Michael I-637, II-50
Duivesteijn, Wouter II-25 Grüning, André I-137, I-255, I-264
Durán, Boris I-25 Grzyb, Beata J. I-322
Dürr, Volker II-531 Guan, Ling II-330
Günther, Manuel I-411
Gurney, Kevin I-185
Eggert, Julian P. I-637
Güttler, Frank I-169
Elagouni, Khaoula II-172
Eliasmith, Chris I-121
Häming, Klaus I-435
Elibol, Rahmi I-177
Hammer, Barbara II-531
El-Laithy, Karim I-169
Hao, Tele II-124, II-411
Ennaji, Abdel II-189 Hara, Atsushi I-515
Ensari, Tolga II-347 Hara, Kazuyuki II-9
Entner, Doris II-84 Haufe, Dennis I-411
Erichsen Jr., Rubem I-145 Hayashi, Yoichi I-515
Heinrich, Stefan I-239, I-555
Fabisch, Alexander II-108 Hermans, Michiel I-612
Fadeev, Victor II-443 Hertwig, Ralph II-483
Farah, Nadir II-189 Hinaut, Xavier I-596
Faubel, Christian I-579 Hobson, Stephen I-49
Fernández Pascual, Ángela II-565 Hock, Howard I-579
Ferone, Alessio II-515 Hoffmann, Jörn I-169
Fiasché, Maurizio I-604, I-653 Hoffrage, Ulrich II-483
Filho, Alberto N.G. Lopes II-229 Honkela, Timo II-467, II-475
Filho, Humberto F. I-330 Horimoto, Narutoshi I-161
Filho, Isaac de L. Oliveira II-180 Hoyer, Patrik O. II-84
Fok, Sam I-121 Huang, Wei I-73
Fonarev, Anatoliy II-197 Humphries, Mark I-185
Fouad, Shereen II-322 Hyvärinen, Aapo I-491

Ichisugi, Yuuji I-726 Lombardi, Warody C. I-330


Ilin, Alexander I-81, II-124 Lu, Zhiyun II-419
Ioannou, Panagiotis I-255 Luciw, Matthew II-279
Ishida, Yuta I-223 Ludermir, Teresa B. I-685, II-66
Ishii, Shin II-116 Luk, Wayne I-113
Izzatdust, Zaur II-467 Lukoševičius, Mantas I-587

Ji, Donghong I-73 Maass, Wolfgang I-209


Jitsev, Jenia I-459 Macleod, Christopher I-1
Jolivet, Renaud I-105 Madany Mamlouk, Amir II-540
Madeiro, Salomão II-148
Kamimura, Ryotaro II-387 Maeda, Shin-ichi II-116
Karandashev, Iakov I-9, I-41 Maex, Reinoud I-355
Karhunen, Juha II-124 Magistretti, Pierre J. I-105
Kasabov, Nikola I-604 Maguire, Liam P. I-645
Kassahun, Yohannes II-108 Mamalet, Franck II-58, II-172
Katahira, Kentaro II-9 Manoonpong, Poramate I-451
Khan, Naimul Mefraz II-330 Markram, Henry I-209
Kim, DaeEun I-347 Martens, Jean-Pierre I-629
Kindermans, Pieter-Jan I-661 Martinetz, Thomas II-540
King, Jean-Rémi I-296 Martins, Allan II-459
Kirikawa, Shota I-153 Martins, Nardênio A. I-330
Kitani, Edson Caoru II-355 Martos, Gabriel II-271
Kitano, Katsunori I-247 Mascher, Peter I-89
Klein, Stefan II-238 Maszczyk, Tomasz II-296
Knobbe, Arno II-25 Matsubara, Takashi I-231
Komendantskaya, Ekaterina II-427 Matsuda, Yoshitatsu II-205
Koprinkova-Hristova, Petia I-571 Matuszak, Michal I-33
Koprinska, Irena II-33 McCall, John I-1
Kordos, Miroslaw II-255, II-263 Meins, Nils I-403
Koryakin, Danil I-499 Meira, Silvio R.L. II-573
Krömer, Pavel II-132 Melo, Jorge II-164
Kryzhanovskiy, Vladimir II-197 Metin, Selin I-177
Kryzhanovsky, Boris I-9, I-41 Metz, Coert II-238
Ksantini, Riadh II-330 Micheli, Alessio I-57
Kühn, Nicolas II-395 Miȩkisz, Jacek I-33
Kuriya, Yasutaka I-387 Mkrtchyan, Lusine I-718
Kůrková, Věra II-17 Mokbel, Bassam II-531
Kurokawa, Hiroaki I-223 Morrison, Abigail I-459
Moss, Gary P. II-499, II-507
Lagus, Krista II-467 Muchungi, Kendi I-339
Layher, Georg I-288, I-677 Müller, Georg R. I-313
Lebbah, Mustapha II-363 Muñoz, Alberto II-271
Lee, Jong-Seok II-213
León, Maikel I-718 Nachstedt, Timo I-451
Lichota, Kacper II-427 Nakagawa, Masanori II-403
Lima, Priscila I-482 Nakano, Daichi II-116
Litinskii, Leonid I-9 Nakano, Ryohei II-1
Liu, Jindong I-239 Neckar, Alexander I-121
Løvlid, Rikke Amilde I-507 Neto, Adrião Dória II-164, II-459

Neumann, Heiko I-288 Roelfsema, Pieter R. I-443


Neves, Renata F.P. II-229 Rombouts, Jaldert O. I-443
Niina, Gen I-563 Rosenstiel, Wolfgang I-669
Ninomiya, Hiroshi II-74 Rouat, Jean I-547
Nobili, Lino I-653 Roveri, Manuel II-305
Nóbrega, Jarley P. II-435 Ruan, Da I-718
Norman, Joseph I-579
Notley, Scott I-137
Sabirov, Alexey II-443
Saito, Toshimichi I-153, I-161
O’Donnell, Fionntán I-629 Sandamirskaya, Yulia I-25
Ogawa, Takashi I-153, I-161 Santana, Laura E.A. I-701
Oja, Erkki II-411, II-419 Satizábal, Héctor F. I-201
Okada, Masato II-9 Sato, Hiroki II-371
Okanoya, Kazuo II-9 Sato, Yasuomi D. I-387
O’Keefe, Simon I-49 Schatten, Carlotta II-156
Oliveira, Adriano L.I. I-693, II-573 Scherbaum, Frank II-395
Oneto, Luca II-156, II-491 Schleif, Frank-Michael II-531
Osana, Yuko II-371 Schliebs, Stefan I-604, I-653
Osendorfer, Christian I-531 Schmidhuber, Juergen II-279
Ozbudak, Ozlem II-515 Schneider, Petra II-322
Schöner, Gregor I-25, I-579
Schrauwen, Benjamin I-612, I-629,
Padilha, Carlos II-164
I-661, II-92
Palm, Günther I-677, II-42
Schultz, Simon R. I-113, I-280
Perez-Uribe, Andres I-201
Schulz, Hannes I-620
Persiantsev, Igor II-443
Schütze, Henry II-540
Peters, Gabriele I-435
Schwenker, Friedhelm I-677
Petkov, Nicolai I-395
Sébillot, Pascale II-172
Petrosino, Alfredo II-515
Sengor, Neslihan Serap I-177, I-474
Pimentel, Bruno Almeida II-435
Sergio, Anderson Tenório I-685
Pinkas, Gadi I-482
Shah, Alpa II-499
Platoš, Jan II-132
Shaposhnyk, Vladyslav I-371
Postnova, Svetlana I-97
Sheynikhovich, Denis I-296
Pouzols, Federico Montesino II-379
Shimizu, Shohei I-491
Pragathi Priyadharsini, Balasubramani I-467 Silks, Isabella I-363
Silva, Leandro A. II-355
Prapopoulou, Maria II-507
Singh, Anand I-105
Probst, Dimitri I-209
Sloan, Steven I-121
Snášel, Václav II-132
Raiko, Tapani I-81, II-124, II-379, II-475 Soares, Carlos II-25
Sperduti, Alessandro I-57
Raitio, Juha II-475 Sporea, Ioana I-264
Ramírez, Felipe I-710 Spüler, Martin I-669
Rana, Mashud II-33 Srinivasa Chakravarthy, V. I-467
Ravindran, Balaraman I-467 Steege, Frank-Florian II-50
Raychaudhury, Somak II-322 Steuber, Volker I-355
Ribeiro, Geraldina II-25 Stewart, Jill I-645
Ridella, Sandro II-156, II-491 Stewart, Terry I-121
Riggelsen, Carsten II-395 Stocker, Alan A. I-523
Sun, Yi II-499, II-507 Wang, Dan I-73
Suzumura, Shinya II-1 Washio, Takashi I-491
Weber, Bruno I-105
Tabie, Marc II-108 Weber, Cornelius I-403, I-539, I-555
Tang, Jiaying I-280 Wedemann, Roseli S. I-379
Tashiro, Tatsuya I-491 Wei, Xue-Xin I-523
Tchaptchet, Aubin I-97 Wermter, Stefan I-239, I-403, I-539, I-555
Teleña, Sergio Alvarez II-140
Terai, Asuka II-403 Wilkinson, Simon II-499
Theunissen, Leslie II-531 Wöhrle, Hendrik II-108
Tino, Peter II-322 Woike, Jan K. II-483
Tittgemeyer, Marc I-459 Wood, Richard I-89
Tokic, Michel II-42 Wörgötter, Florentin I-451
Tomkins, Adam I-185 Würtz, Rolf P. I-411
Tontchev, Nikolay I-571
Torikai, Hiroyuki I-231 Xavier-Júnior, João Carlos II-451
Torres, Alberto II-557
Torres, Luiz C.B. II-100 Yamaguchi, Kazunori II-205
Trautmann, Eric I-121 Yang, Zhirong II-411, II-419
Triefenbach, Fabian I-629 Yim, Hyungu I-347
Trovò, Francesco II-305 Yonekawa, Masato I-223
Yucelgen, Cem I-177
van der Smagt, Patrick I-531 Yusoff, Nooraini I-137
Vanhoof, Koen I-718
van Ooyen, Arjen I-443 Zaamout, Khobaib II-288
Varona-Moya, Sergio II-221 Zaier, Riadh I-304
Vasilaki, Eleni I-185, I-193 Zanchettin, Cleber II-229, II-246
Veroneze, Rosana II-148 Zdunek, Rafal I-65
Verschore, Hannes I-661 Zhang, He II-411
Verstraeten, David I-661 Zhang, John Z. II-288
Villa, Alessandro E.P. I-272, I-371 Zhelavskaya, Irina II-197
Vollmer, Christian I-637 Zhong, Junpei I-539
Von Zuben, Fernando J. II-148 Zurada, Jacek M. II-347
