
Data Mining and Neural Networks

Prof. Dr. ir. Johan Suykens


Katholieke Universiteit Leuven
Department of Electrical Engineering, ESAT-STADIUS
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 Fax: 32/16/32 19 70
E-mail: johan.suykens@esat.kuleuven.be
http://www.esat.kuleuven.be/stadius/members/suykens.html

Academic year 2015-2016


Foreword
For this course on Data Mining and Neural Networks the course material consists of lecture notes, together with copies of the slides. Additional background material, such as books and more specialized referenced journal papers, is available if needed. All information about the course and exercise sessions is available in Toledo.

How to study this course? It is not necessary to be able to reproduce all the given mathematical equations. Instead it is important to obtain insight into these equations and to understand how to apply the methods to several data sets during the exercise sessions, being aware of the possibilities and limitations of the methods. The exam consists of an oral discussion based upon the (individually) written reports about the exercise sessions. The exam is open book.

Enjoy the fascinating world of datamining and neural nets!

Johan Suykens
KU Leuven
September 2015

The course Data Mining and Neural Networks consists of

• 10 lectures and 5 computer exercise sessions if you attend course number Hxxxxx in your study program.

• 8 lectures and 3 exercise sessions if you attend course number Gxxxxx in your study program. (Master of Statistics: chapters 1 to 5.4, not 5.5 and 6)

Abstract

In many application areas massive and growing volumes of data are available which can be further explored and analysed in order to obtain improved models, extract knowledge and automate processes. Typical examples include pattern recognition, biomedical applications and bio-informatics, signal processing and system identification, industrial processes, fraud detection, webmining, e-commerce, financial engineering etc. For each of these areas artificial neural networks constitute an important methodology for system analysis and design. Neural networks are universal approximators, possess a parallel architecture, can be trained either in batch mode or on-line from given patterns and are a powerful class of methods for nonlinear modelling. There exist both methods of supervised and unsupervised learning.

In this course a number of important classical and advanced methods for datamining and neural networks are discussed. Popular techniques in neural networks (such as multilayer perceptrons and radial basis function networks) are presented with aspects of architectures, learning, optimization, on-line versus batch training, generalization, validation, feedforward and recurrent networks, statistical interpretations, pruning, variance reduction, discriminant functions, density estimation and regularization theory. Special attention is paid to efficient and reliable methods for classification and function estimation and for mining large data sets. Emphasis is also put on preprocessing, feature selection, dimensionality reduction and incorporation of expert knowledge. Besides classical techniques in neural networks, also more advanced methods such as Bayesian inference, statistical learning theory and support vector machines (kernel methods) are explained. With respect to unsupervised learning methods, cluster algorithms (and related methods based on expectation-maximization), vector quantization and self-organizing maps are presented.
Contents

Foreword

1 Introduction

2 Neural Networks and Modelling
   2.1 Multilayer perceptrons and radial basis function networks
      2.1.1 Biological neurons and McCulloch-Pitts model
      2.1.2 Multilayer perceptrons
      2.1.3 Radial basis function networks
   2.2 Model structures and parameterizations
      2.2.1 State space models
      2.2.2 Input/output models
      2.2.3 Time-series prediction models
   2.3 Universal approximation theorems
      2.3.1 Neural nets are universal approximators
      2.3.2 The curse of dimensionality
   2.4 Backpropagation training
      2.4.1 Generalized delta rule
      2.4.2 Application to nonlinear system identification
   2.5 Learning and optimization
      2.5.1 From steepest descent to Newton method
      2.5.2 Levenberg-Marquardt method
      2.5.3 Quasi-Newton methods
   2.6 Methods for large scale problems
   2.7 Recurrent networks and dynamic backpropagation
   2.8 Model validation

3 Neural Networks and Classification
   3.1 Single neuron case: perceptron algorithm
   3.2 Linear versus non-linear separability
   3.3 Multilayer perceptron classifiers
   3.4 Classification and Bayes rule
      3.4.1 Bayes rule - discrete variables
      3.4.2 Bayes rule - continuous variables
      3.4.3 Decision making
      3.4.4 Discriminant functions
      3.4.5 Minimizing risk
   3.5 Gaussian density assumption
      3.5.1 Normal density
      3.5.2 Discriminant functions
      3.5.3 Logistic discrimination
   3.6 Density estimation
   3.7 Mixture models and EM algorithm
   3.8 Preprocessing and dimensionality reduction
      3.8.1 Preprocessing
      3.8.2 PCA dimensionality reduction
   3.9 ROC curves

4 Learning and Generalization
   4.1 Interpretation of network outputs
   4.2 Bias and variance
   4.3 Regularization and early stopping
      4.3.1 Regularization
      4.3.2 Early stopping and validation set
      4.3.3 Cross-validation
   4.4 Pruning
   4.5 Committee networks and combining models
   4.6 Complexity criteria
   4.7 Bayesian learning
      4.7.1 Bayes theorem and model comparison
      4.7.2 Probabilistic interpretations of models
      4.7.3 Levels of inference
      4.7.4 Practical implementations and automatic relevance determination

5 Unsupervised Learning and Regularization Theory
   5.1 Dimensionality reduction and nonlinear PCA
   5.2 Cluster algorithms
   5.3 Vector quantization
   5.4 Self-organizing maps
   5.5 Regularization theory
      5.5.1 RBF networks and regularization theory
      5.5.2 Learning by a separation principle
      5.5.3 Link with fuzzy models

6 Support Vector Machines
   6.1 Motivation
   6.2 Maximal margin classifiers and linear SVMs
      6.2.1 Margin
      6.2.2 Linear SVM classifier: separable case
      6.2.3 Linear SVM classifier: non-separable case
   6.3 Kernel trick and Mercer condition
   6.4 Nonlinear SVM classifiers
   6.5 SVMs for function estimation
      6.5.1 SVM for linear function estimation
      6.5.2 SVM for nonlinear function estimation
   6.6 Least squares SVM (LS-SVM) classifiers
   6.7 LS-SVM for nonlinear function estimation
   6.8 Tuning parameter selection

7 Conclusions
Chapter 1

Introduction

Although there is currently no universally accepted definition of Data Mining, depending on the field it usually refers to the discovery of new information from given facts of a database and the automatic or semi-automatic exploration and analysis of voluminous data to discover meaningful patterns and rules. Often it is also one step within a larger process of application understanding, target data set creation, corrupted data removal and correction, data reduction, and interpretation of patterns. It includes techniques such as clustering, rule discovery, classification, pattern recognition and outlier detection [7, 34].
In this course we take a broad perspective on datamining in view of a wide range of application fields and mainly consider techniques which originated from the field of neural networks. We emphasize reliability and efficiency of designs: given data within a certain application, how to apply neural network methods in order to derive good models which can result in better knowledge and automation of processes?
In Fig.1.1 an overview is given of the several steps of a KDD (Knowledge Discovery and Datamining) process according to the book [7]. Given a large amount of data one often selects smaller target data sets that are first preprocessed (data cleaning, missing values) and often transformed (linear scaling or sometimes a nonlinear transformation applied to certain input variables) to bring them in a suitable form for data analysis, exploration and modelling. Several datamining and neural network techniques can then be applied at this point, either to model relationships between input and output variables or to discover patterns, information and knowledge from the data. In Fig.1.2 an outline of the complete KDD process is given according to [7].

Figure 1.1: Typical steps in the KDD process: DATA → Target Data (selection) → Preprocessed Data (preprocessing) → Transformed Data (transformation) → Patterns (data mining) → Knowledge (interpretation).

The techniques used within datamining are related to several fields including neural networks, statistics, pattern recognition, machine learning and artificial intelligence (Fig.1.3).
In real-life datamining problems often massive data sets need to be processed. This poses serious computational challenges. Thanks to the continuously increasing computer power the computation of more sophisticated nonlinear models has become feasible. This is in comparison with linear models, which are often a rough (but nevertheless often useful) approximation of reality. Also many interdisciplinary studies have been made (systems, circuits and control theory, signal processing, statistics, optimization, physics and other areas) which led to a better understanding of nonlinear black-box modelling methods in general. While these were often applied in a kind of miracle way in the past, presently there exist many reliable techniques which can lead to interesting results [18].
In order to get a first impression of neural networks and datamining and become more familiar with the field it can be interesting to visit a number of sites related to journals, international conferences and other links, which are shown here:

Figure 1.2: The complete KDD process, relating tasks (task discovery, goal discovery, data cleaning, model development, data analysis, output generation, report, monitoring, action) to tools (query tools, statistics and neural net tools, visualization tools, transformation tools) operating on the database, with arrows indicating process flow, data flow and tool usage.

Figure 1.3: Data Mining and Knowledge Discovery and its related disciplines: artificial intelligence, pattern recognition, machine learning, statistics and neural networks.



Journals

• Neural Networks
  http://www.elsevier.com/locate/neunet

• IEEE Transactions on Neural Networks
  http://cis.ieee.org/

• Neural Computation
  http://neco.mitpress.org/

• Neurocomputing
  http://www.journals.elsevier.com/neurocomputing/

• Neural Processing Letters
  http://link.springer.com/journal/11063

• Machine Learning
  http://link.springer.com/journal/10994

• Journal of Machine Learning Research
  http://jmlr.csail.mit.edu/

• Data Mining and Knowledge Discovery
  http://link.springer.com/journal/10618

Conferences

• IJCNN International Joint Conference on Neural Networks
  http://www.wcci2016.org/

• NIPS Neural Information Processing Systems
  http://nips.cc/

• ICANN International Conference on Artificial Neural Networks
  http://e-nns.org/2015/09/icann-2016/

• ESANN European Symposium on Artificial Neural Networks
  https://www.elen.ucl.ac.be/esann/

• KDD International Conference on Knowledge Discovery and Data Mining
  http://kdd.org/kdd2016/

• PKDD Principles and Practice of Knowledge Discovery in Databases
  http://www.ecmlpkdd2016.org/

• IEEE BigData Conference
  http://cci.drexel.edu/bigdata/bigdata2014/

• SIAM International Conference on Data Mining
  http://www.siam.org/meetings/sdm16/

Other useful links

• Guide to Data Mining, Web Mining, Knowledge Discovery
  http://www.kdnuggets.com/

• SIGKDD Special Interest Group on KDD
  http://www.acm.org/sigs/sigkdd/

• UCI benchmark data sets
  https://archive.ics.uci.edu/ml/datasets.html

• Delve Data for Evaluating Learning in Valid Experiments
  http://www.cs.toronto.edu/~delve/

• Bayesian inference
  http://wol.ra.phy.cam.ac.uk/mackay/

• Support vector machines and kernel based methods
  http://www.kernel-machines.org/

• Self organizing maps
  http://www.cis.hut.fi/research/som-research/teuvo.html

• Machine Learning Resources
  http://www.sciencemag.org/site/feature/data/compsci/machine learning.xhtml

• NATO Advanced Study Institute on Learning Theory and Practice
  http://www.esat.kuleuven.be/sista/natoasi/ltp2002.html

When we talk about neural networks in this course it refers to artificial neural networks [1, 11]. These are inspired by biological neural networks but should be considered as quite strong mathematical abstractions of reality.

Figure 1.4: Multilayer perceptron with one hidden layer, output y and input vector x = (x_1, ..., x_n).

In the context of this course they should be interpreted as a general class of nonlinear models. It has indeed been mathematically proven that neural networks such as multilayer perceptrons (MLPs) (Fig.1.4) are universal approximators. They are able to approximate any continuous nonlinear function arbitrarily well on a compact interval. Neural networks can be trained from given input-output training patterns either off-line (batch mode) or on-line. The MLP architecture is also parallel, which is an interesting property for simulation on parallel computers and chip implementations. Neural network models also typically have many inputs and outputs, which makes them attractive for modelling multivariable systems and establishing nonlinear relationships between several variables in databases.
The training of neural networks, and of models in general, can be either supervised or unsupervised. In supervised learning there is an external teacher, who tells what the desired output value is for each input pattern of the training set. Based upon the difference between the actual output and its desired value the model is trained by minimizing the errors. In unsupervised learning there is no such external teacher. These considerations are relevant both for function estimation and classification problems. This distinction is illustrated in Fig.1.5.
Although neural nets are powerful models, one should carefully avoid applying them in a miracle-approach fashion. Important and critical design issues are e.g. the choice of the number of neurons, learning

Figure 1.5: (a) Supervised versus (b) unsupervised learning. In the supervised case the error between the output of the adaptive network and the desired output drives the adaptation; in the unsupervised case no desired output is given.

and generalization issues, avoiding bad local minima solutions, how to deal with noise etc., such that a reliable solution can be guaranteed.
A huge list of successful real-life neural network applications exists nowadays, and many new datamining problems pose serious challenges, such as in fraud detection, bioinformatics, biomedical applications, textmining, time-series prediction, financial engineering, traffic analysis, modelling and control in the process industry and decision making in general [7, 24, 26, 34, 36]. In many datamining problems one encounters massive data sets.

Here we give a number of examples of challenging datamining problems:

• Bioinformatics:
A new term has been coined for the communities of molecular biology, engineering and computer science: bioinformatics. The term bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems inspired by the management and analysis of biological data. The explosion in the rate of acquisition of biomedical data and advances in molecular genetics technologies, such as DNA microarrays, now allow one to obtain a "global" view of the cell. For example, the biological molecular state of a cell can now be investigated by measuring the simultaneous expression of tens of thousands of genes using DNA microarrays.

• Microarray data analysis:
Microarrays have opened the possibility of creating data sets of molecular information to represent many systems of biological or clinical interest. Gene expression profiles can be used as inputs to large-scale data analysis, for example, to serve as fingerprints to build more accurate molecular classifiers, to discover hidden taxonomies or to increase our understanding of normal and disease states.

Figure 1.6: Microarray data analysis.

Analysis of microarrays presents a number of unique challenges for data mining. Typical data mining applications in domains like banking or the web have a large number of records (thousands and sometimes millions), while the number of fields is much smaller (at most several hundred). In contrast, a typical microarray data analysis study may have only a small number of records (less than a hundred), while the number of fields, corresponding to the number of genes, is typically thousands. Given the difficulty of collecting microarray samples, the number of samples is likely to remain small in many interesting cases. Such a setting creates a high likelihood of finding "false positives" that are due to chance, both in finding differentially expressed genes and in building predictive models. Robust methods to validate the models and assess their likelihood are thus required. The main types of data analysis include:

– Gene Selection: in data mining terms this is a process of attribute selection, which finds the genes that are most strongly related to a particular class.

– Classification: classifying diseases or predicting outcomes based on gene expression patterns, and perhaps even identifying the best treatment for a given genetic signature.

– Clustering: finding new biological classes or refining existing ones.

The first generation of microarray analysis methodologies has demonstrated that expression data can be used in a variety of class discovery or class prediction bio-medical problems, including those relevant to tumor classification.

• Protein data analysis:
Genomics-based approaches have made it possible to identify changes in gene expression as markers of certain diseases. However, gene expression does not always correlate with protein expression, which is more closely linked to the actual functional state of cells. A single gene can lead to a large number of protein products via a complex process. In contrast to the genome, the proteome (the ensemble of protein forms expressed in a biological sample at a given point in time) reflects both the intrinsic genetic program of the cell and the impact of its immediate environment. Clinical proteomics aims at investigating changes in protein expression in order to discover new disease markers and drug targets. A variety of proteomics workflows have been developed, the common denominator of which is the use of mass spectrometry.

Figure 1.7: Proteomics.

Mass spectrometry has become a tool of choice in the search for biomarkers which may lead to new methods of disease diagnosis, prognosis and therapy. Body fluids such as serum or urine can be routinely used to generate protein profiles (signal intensities with respect to m/z ratios). By comparing mass spectra of diseased and control samples, discriminatory patterns reflecting disease-correlated alterations in the expression levels of proteins can be extracted.

To discover these biomarker patterns, the data miner must face a number of technical challenges, foremost among which is the extremely high dimensionality of mass spectra. A typical mass spectrum will have several thousands of attributes that exhibit a high degree of spatial redundancy. Most approaches have been actively investigated in view of the early detection of cancer.

• Biomedicine:
Ovarian cancer is the most lethal cancer of the female reproductive system [61]. According to the American Cancer Society, almost 15,000 women died of ovarian cancer in the US in 2004. Only four types of cancer were more lethal. On the other hand, many women present with benign ovarian tumours, which could be managed conservatively or removed with minimally invasive surgery. Therefore, an accurate prediction about the nature of the tumour (i.e. benign or malignant) is crucial for optimal patient treatment. Many approaches for ovarian tumour classification are based on simple scoring systems or logistic regression models, typically based on small data sets collected at one single medical centre. Therefore, the International Ovarian Tumour Analysis (IOTA) group collected data on over 1,000 women with ovarian tumours from nine centres across Europe (Belgium, Sweden, Italy, France, UK). For each woman, over 40 measurements were recorded. The aim of the study is the development of basic and advanced mathematical models to classify the tumours. Both binary classification (benign versus malignant) and multi-class classification (benign versus three types of malignant tumours) are targeted. A new data set of around 2,000 women has been collected to obtain a thorough prospective evaluation of the models developed.

Figure 1.8: Ovarian cancer data analysis.

• Power load forecasting:
Electricity cannot be efficiently stored in large quantities, meaning that the amount generated at any given time always has to cover all the demands from the final consumers, including grid losses [32]. Forecasts of the load are used to decide whether extra generation has to be provided by increasing the output of online generators or by committing one or more extra units. Similarly, forecasts are also used to decide whether an already running generation unit should be decreased in output or even switched off. On the other hand, the liberalization of the electric energy markets has led to the development of energy exchanges, where consumers, generators, and traders can interact, leading to price settings.
Short-term load forecasting (STLF) concerns the prediction of power-system loads over an interval ranging from one hour (or less) to one week. Load forecasting has become a major field of research in electrical engineering. The power industry requires forecasts not only from the production side, but also from a financial perspective. System identification techniques for modeling and forecasting are used for STLF, where the main goal is to generate a model that captures the dynamics and interactions among possible explanatory variables for the load. This is typically a large scale problem, with approximately 300 time series of 40,000 datapoints each from different substations within the Belgian grid (long time series provided by the Belgian Transmission Operator ELIA).
• High energy physics:
A data thunderstorm is gathering on the horizon with the next generation of particle physics experiments. The prime data from the next-generation CERN CMS detector will amount to over a petabyte (10^15 bytes) per year, which is to be archived, with subsequent analysis to find rare events resulting from the decays of new particles. CMS and other experiments to run at CERN's Large Hadron Collider expect to accumulate on the order of 100 petabytes.
• Remote-sensing and geographical systems:
A significant range of scientific data is geospatial data that is associated with a region on the surface of the earth. Large quantities of earth-observation data have been acquired in many wavelengths by remote sensing space missions, and accompanying software systems have been built. Techniques such as interferometric SAR and hyperspectral imaging provide increasingly comprehensive views of our planet.
• Fraud detection:
Several types of fraud happen in GSM networks. User profiling and feature selection are done towards the design of rule-based or neural network based solutions for fraud detection [44]. Fraud detection indicators are e.g. based on the duration and frequency of calls, national or international calls, and changes in behaviour (in comparison to normal behaviour). The problem involves the analysis of massive data sets with millions of data points. The classification problems are highly unbalanced in the sense that there are many more examples of non-fraud than of fraud.
• Big data:
In the "3Vs" model of big data, one considers the increase of Volume, Velocity, and Variety (other models consider additional Vs). In the survey paper [28] the following examples of big data cases are mentioned:

Figure 1.9: Electric load forecasting: normalized load over one week (hours 0 to 168, Monday through Sunday).



– The Internet of Things (IoT) is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

– By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB.

– One sequencing of a human gene may generate 100 to 600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million.

– GenBank is a nucleotide sequence database maintained by the U.S. National Center for Biotechnology Information (NCBI). Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34].

– The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2014 the data volume generated per night will surpass 20 TB.

– In the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generated raw data at 2 PB/s and stores about 10 TB of processed data per year.

Notations
The notations in this course will be defined locally within each chapter
or section of a chapter.
Chapter 2

Neural Networks and Modelling

In this chapter we discuss basic neural network architectures, model structures for system identification and time-series prediction and their parameterization by neural networks, universal approximation theorems, backpropagation learning and optimization algorithms. This chapter is partially based on [11, 17, 18, 22].

2.1 Multilayer perceptrons and radial basis function networks

2.1.1 Biological neurons and McCulloch-Pitts model
One estimates that the human brain contains over 10^11 neurons and 10^14 synapses in the human nervous system. On the other hand the neuron's switching time is much slower than that of transistors in computers, but the connectivity is higher than in today's supercomputers. Biological neurons basically consist of three main parts: the neuron cell body, branching extensions called dendrites for receiving input, and axons that carry the neuron's output to the dendrites of other neurons (Fig.2.1) [40].
Figure 2.1: Biological neurons.

A simple and popular model for neurons is the McCulloch-Pitts model (Fig.2.2). However, one should be aware that this is a strong mathematical abstraction of reality. The neuron is modelled in this case as a simple static nonlinear element which takes a weighted sum of incoming signals x_i multiplied with interconnection weights w_i. After adding a bias term b (or threshold) the resulting activation a = ∑_i w_i x_i + b is sent through a static nonlinearity f(·) (the activation function), yielding the output y such that

   y = f( ∑_i w_i x_i + b ).   (2.1)

The nonlinearity is typically of the saturation type, e.g. tanh(·). Biologically this corresponds to the firing of a neuron depending on gathered information of incoming signals that exceeds a certain threshold value.
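As a small illustration, the following is a minimal Python sketch of the model (2.1); the weights, inputs and bias below are arbitrary example values:

    import numpy as np

    def mcculloch_pitts(x, w, b, f=np.tanh):
        # static neuron: y = f(sum_i w_i x_i + b), eq. (2.1)
        a = np.dot(w, x) + b          # activation: weighted sum plus bias
        return f(a)

    # example: three inputs with a tanh saturation nonlinearity
    x = np.array([0.5, -1.0, 2.0])
    w = np.array([0.3, 0.8, -0.5])
    y = mcculloch_pitts(x, w, b=0.1)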

2.1.2 Multilayer perceptrons

A single neuron model is not very powerful, as one may expect. However, if one organizes the neurons into a layered network with several layers, one obtains models which are able to approximate general continuous nonlinear functions. Such a network consists of one or more hidden layers and an output layer. A multilayer perceptron (MLP) with one hidden layer is shown in Fig.2.3. Mathematically it is described as follows.

Figure 2.2: McCulloch-Pitts model of a neuron: a = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b, y = f(a).

In matrix-vector notation one has

   y = W σ(V x + β)   (2.2)

with input x ∈ R^m, output y ∈ R^l and interconnection matrices W ∈ R^{l×n_h}, V ∈ R^{n_h×m} for the output layer and hidden layer, respectively. The bias vector is β ∈ R^{n_h} and consists of the threshold values of the n_h hidden neurons. This notation is more compact than the elementwise notation:

   y_i = ∑_{r=1}^{n_h} w_{ir} σ( ∑_{j=1}^{m} v_{rj} x_j + β_r ),   i = 1, ..., l.   (2.3)

In these descriptions a linear activation function is taken for the output layer. Depending on the application one might choose other functions as well (Fig.2.4). For problems of nonlinear function estimation and regression one takes a linear activation function in the output layer.

Figure 2.3: Multilayer perceptron with one hidden layer.

Sometimes a neural network with two hidden layers (Fig.2.5) is chosen, although a single hidden layer is sufficient to have a universal approximator (provided a sufficient number of hidden units is taken). In matrix-vector notation one has:

   y = W σ(V_2 σ(V_1 x + β_1) + β_2)   (2.4)

with input x ∈ R^m, output y ∈ R^l, interconnection matrices W ∈ R^{l×n_{h2}}, V_2 ∈ R^{n_{h2}×n_{h1}}, V_1 ∈ R^{n_{h1}×m} and bias vectors β_2 ∈ R^{n_{h2}}, β_1 ∈ R^{n_{h1}}, where n_{h1} and n_{h2} denote the numbers of neurons in the hidden layers. In elementwise notation this becomes:

   y_i = ∑_{r=1}^{n_{h2}} w_{ir} σ( ∑_{s=1}^{n_{h1}} v_{rs}^{(2)} σ( ∑_{j=1}^{m} v_{sj}^{(1)} x_j + β_s^{(1)} ) + β_r^{(2)} ),   i = 1, ..., l   (2.5)

where the upper indices indicate the layer numbers. Sometimes the inputs are considered to be part of a so-called input layer. However, in order to specify the number of layers of a network and avoid confusion it is preferred to mention the number of hidden layers and define the number of layers to be the sum of the number of hidden layers plus the output layer.
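A minimal sketch of the forward pass (2.2) in Python follows; the dimensions and random weights below are arbitrary example values:

    import numpy as np

    def mlp_forward(x, W, V, beta):
        # one-hidden-layer MLP of eq. (2.2): y = W sigma(V x + beta),
        # with a linear activation function in the output layer
        return W @ np.tanh(V @ x + beta)

    # example dimensions: m = 3 inputs, nh = 5 hidden units, l = 2 outputs
    rng = np.random.default_rng(0)
    V = rng.normal(size=(5, 3)); beta = rng.normal(size=5)
    W = rng.normal(size=(2, 5))
    y = mlp_forward(rng.normal(size=3), W, V, beta)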

2.1.3 Radial basis function networks

While MLPs contain saturation-type nonlinearities, another important class of neural networks, called Radial Basis Function (RBF) networks, makes use of localized basis functions, typically with Gaussian activation functions organized within one hidden layer (Fig.2.6).
The network description is

   y = ∑_{i=1}^{n_h} w_i h(‖x − c_i‖).   (2.6)

For a Gaussian activation function this becomes

   y = ∑_{i=1}^{n_h} w_i exp(−‖x − c_i‖_2^2 / σ_i^2)

Figure 2.4: Some typical activation functions: sigmoid, tanh, saturation and sign.

Figure 2.5: Multilayer perceptron with two hidden layers.



Figure 2.6: Radial basis function network with Gaussian activation function.

Here x ∈ R^m is the input, y ∈ R the output, w ∈ R^{n_h} the output weight vector, c_i ∈ R^m the centers and σ_i ∈ R the widths (i = 1, ..., n_h), where n_h denotes the number of hidden neurons.
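A minimal sketch of the network description (2.6) with Gaussian activations; the example dimensions and values are arbitrary:

    import numpy as np

    def rbf_forward(x, w, centers, sigmas):
        # RBF network of eq. (2.6) with Gaussian activation functions:
        # y = sum_i w_i exp(-||x - c_i||_2^2 / sigma_i^2)
        d2 = np.sum((centers - x)**2, axis=1)   # squared distances to centers
        return np.dot(w, np.exp(-d2 / sigmas**2))

    # example: nh = 4 Gaussians in a 2-dimensional input space
    rng = np.random.default_rng(1)
    centers = rng.normal(size=(4, 2)); sigmas = np.ones(4); w = rng.normal(size=4)
    y = rbf_forward(np.array([0.2, -0.4]), w, centers, sigmas)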

2.2 Model structures and parameterizations

2.2.1 State space models
In order to explain how neural networks can be used within a context of dynamical systems and how one comes from feedforward to recurrent networks, we first discuss some elements of systems theory in order to put these problems within the right perspective [17, 18, 55].

It is well-known that discrete-time linear systems with input vector u ∈ R^m, output vector y ∈ R^l and state vector x ∈ R^n can be represented in state space form as

   x_{k+1} = A x_k + B u_k
   y_k     = C x_k.   (2.7)

In the context of nonlinear systems one can consider nonlinear state space descriptions

   x_{k+1} = f(x_k, u_k)
   y_k     = g(x_k)   (2.8)

where f(·): R^n × R^m → R^n and g(·): R^n → R^l are nonlinear mappings. When one parameterizes these nonlinear functions by means of a feedforward neural network (such as an MLP or RBF network) one obtains a recurrent neural network.
While the stability of linear systems can be completely understood by checking the eigenvalues of the system matrix A (for discrete-time systems all eigenvalues should belong to the open unit disk in the complex plane), stability issues of nonlinear systems are much more complicated. They can possess e.g. multiple equilibrium points, limit cycles, chaotic behaviour etc. Let us consider a simple example to illustrate this for an autonomous system with x_k ∈ R^3

   x_{k+1} = f(x_k)   (2.9)

parameterized by an MLP as

   x_{k+1} = W tanh(V x_k)   (2.10)

with W, V ∈ R^{3×3}. By taking random choices for these matrices several kinds of behaviour can be obtained even in such a simple neural network. As shown in Fig.2.7, depending on the random choices of W and V, one might obtain global asymptotic stability, multiple equilibria, quasi-periodic behaviour or chaos. This example shows that dynamical models which are parameterized by neural nets can represent complex behaviour, but on the other hand it also means that the analysis of such systems is highly non-trivial [58].
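A minimal sketch of this experiment (the random seed and trajectory length are arbitrary; different weight draws give different dynamics):

    import numpy as np

    def simulate(W, V, x0, steps):
        # iterate the autonomous recurrent network x_{k+1} = W tanh(V x_k), eq. (2.10)
        X = [x0]
        for _ in range(steps):
            X.append(W @ np.tanh(V @ X[-1]))
        return np.array(X)

    # random 3x3 interconnection matrices as in the example above
    rng = np.random.default_rng(2)
    W, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    trajectory = simulate(W, V, x0=rng.normal(size=3), steps=100)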
The systems that have been discussed here are deterministic. However, in many real-life situations the system is corrupted by noise.

Figure 2.7: Example of several kinds of behaviour in a simple recurrent network: (Top-Left) global asymptotic stability; (Top-Right) multiple equilibria; (Middle) quasi-periodic behaviour; (Bottom) chaos.

System representations that take this into account are

   x_{k+1} = A x_k + B u_k + w_k
   y_k     = C x_k + v_k   (2.11)

for the linear case and

   x_{k+1} = f(x_k, u_k, w_k)
   y_k     = g(x_k, v_k)   (2.12)

for the nonlinear case, where w_k, v_k denote the process noise and observation (or measurement) noise, respectively.

2.2.2 Input/output models

Let us now take a look at input/output (I/O) models instead of state space models. Usually one employs a NARX (Nonlinear ARX, Auto-Regressive with eXogenous input) model structure

   ŷ_k = f(y_{k-1}, y_{k-2}, ..., y_{k-p}, u_{k-1}, u_{k-2}, ..., u_{k-p})   (2.13)

where y_k denotes the measured output at discrete time instant k, u_k the input at time k and ŷ_k the estimated output at time k. The number p corresponds to the order of the system. A parameterization of the nonlinearity f(·) by neural networks leads to a static feedforward model from the viewpoint of neural networks, despite the fact that this is used in a dynamical systems context. The reason is that the equation of this model does not have a recursion on the variable ŷ_k (it appears only at the left hand side of the equation). The parameterization by an MLP with one hidden layer gives

   ŷ_k = w^T σ(V z_{k|k-p} + β)   (2.14)

with z_{k|k-p} = [y_{k-1}; y_{k-2}; ...; y_{k-p}; u_{k-1}; u_{k-2}; ...; u_{k-p}].
In the so-called NARMAX (Nonlinear ARMAX, Auto-Regressive Moving Average with eXogenous input) model structure one also tries to model the noise influence by taking

   y_k = f(y_{k-1}, y_{k-2}, ..., y_{k-p}, u_{k-1}, u_{k-2}, ..., u_{k-p}, ε_{k-1}, ε_{k-2}, ..., ε_{k-q}) + ε_k   (2.15)

with error ε_k = y_k − ŷ_k. Another model structure, which leads to recurrent networks instead of feedforward networks, is the NOE (Nonlinear OE, Output Error) model structure

   ŷ_k = f(ŷ_{k-1}, ŷ_{k-2}, ..., ŷ_{k-p}, u_{k-1}, u_{k-2}, ..., u_{k-p}).   (2.16)

Note that one now has a recursion on the variable ŷ_k, in contrast with the NARX model.
In system identification one has given input/output data available in order to model the system (Fig.2.8). After choosing a model structure and parameterizing it by means of neural networks, a cost function is formulated in the unknown interconnection weights and optimized by a certain optimization algorithm (Fig.2.9).

2.2.3 Time-series prediction models

Model structures for time-series prediction (Fig.2.10) are obtained by omitting the external input signals u_k from the previous NARX models. One has

   ŷ_{k+1} = f(y_k, y_{k-1}, ..., y_{k-p})   (2.17)

which is parameterized as

   ŷ_{k+1} = w^T tanh(V [y_k; y_{k-1}; ...; y_{k-p}] + β).   (2.18)

It is not necessary that the past values y_k, y_{k-1}, ..., y_{k-p} are subsequent in time; certain values could be omitted or values at different time scales could be taken. In order to generate predictions, the true values y_k are replaced by the estimated values ŷ_k and the iterative prediction is generated by the recurrent network

   ŷ_{k+1} = w^T tanh(V [ŷ_k; ŷ_{k-1}; ...; ŷ_{k-p}] + β)   (2.19)

for a given initial condition.
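A minimal sketch of this iterative prediction scheme (2.19); the names are illustrative, and w, V, beta are assumed to come from an already trained model of the form (2.18):

    import numpy as np

    def iterative_prediction(w, V, beta, y_init, horizon):
        # eq. (2.19): feed predictions back as inputs for a multi-step forecast;
        # y_init holds the p+1 most recent true values, newest first
        window = list(y_init)
        preds = []
        for _ in range(horizon):
            y_next = w @ np.tanh(V @ np.array(window) + beta)
            preds.append(y_next)
            window = [y_next] + window[:-1]   # shift to [y_k; ...; y_{k-p}]
        return np.array(preds)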

2.3 Universal approximation theorems

2.3.1 Neural nets are universal approximators

In fact the history of artificial neural networks dates back to the beginning of the previous century, around 1900, when Hilbert formulated a list of 23 challenging mathematical problems for the century to come.

Figure 2.8: In system identification one aims at estimating models from given input/output data measurements {u_k, y_k} on systems.

In his famous 13th problem he formulated the conjecture that there are analytical functions of three variables which cannot be represented as a finite superposition of continuous functions of only two variables. This conjecture was refuted by Kolmogorov and Arnold in 1957.

Kolmogorov Theorem (1957) Any continuous function f(x_1, ..., x_n) defined on [0, 1]^n (n ≥ 2) can be represented in the form

   f(x) = ∑_{j=1}^{2n+1} χ_j( ∑_{i=1}^{n} φ_{ij}(x_i) )

where χ_j, φ_{ij} are continuous functions and φ_{ij} are also monotone.

This theorem was later refined as follows.

Sprecher Theorem (1965) There exists a real monotone increasing function φ(x): [0, 1] → [0, 1], depending on n (n ≥ 2), having the following property: ∀δ > 0, ∃ε (0 < ε < δ) such that every real continuous function f(x): [0, 1]^n → R can be represented as

   f(x) = ∑_{j=1}^{2n+1} χ[ ∑_{i=1}^{n} λ^i φ(x_i + ε(j − 1)) + j − 1 ]

where χ is a real continuous function and λ is a constant.

Figure 2.9: Several stages in nonlinear modelling and system identification: model structure, parameterization, cost function in the unknown weights, learning and optimization, testing.

Figure 2.10: Time-series prediction with neural networks: (Top) identification with a NARX model structure (training mode); (Middle) iterative prediction as a recurrent network by replacing the true values y_k with the estimated values ŷ_k at the input of the network; (Bottom) illustrative example of time-series prediction.

A link between the Sprecher Theorem and MLP neural networks was
made by Hecht-Nielsen.

Hecht-Nielsen Theorem (1987) Any continuous mapping f(x): [0, 1]^n → R^m can be represented by a neural net with two hidden layers.

The proof of this Theorem relies on the Sprecher Theorem. Later, in 1989, new theorems were proven which showed that for universal approximation it is sufficient to have MLPs with one hidden layer.

Hornik Theorem (1989) [39] Regardless of the activation function (which can be discontinuous) and the dimension of the input space, a neural network with one hidden layer can approximate any continuous function arbitrarily well (in a certain metric).

These results also hold for networks with multiple outputs. The following theorem is more specific about the kind of activation functions that are allowed.

Leshno Theorem (1993) A standard multilayer feedforward neural network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial.

In addition to these universal approximation theorems for MLP neural networks, theorems have been proven for RBF networks as well: e.g. Park and Sandberg in 1991 showed that it is possible to approximate any continuous nonlinear function by means of an RBF network.

These results mathematically guarantee us that when we parameterize nonlinear functions and nonlinear model structures we obtain universal tools for nonlinear modelling. However, these universal approximation results are not constructive, in the following sense. First, given a nonlinear function to be approximated, the proof does not provide us with an algorithm for the determination of the interconnection weights of the neural nets. Secondly, it is not clear how many hidden neurons are sufficient to approximate a given nonlinear function on a compact interval. These issues will be discussed in the sequel of this course.
Though neural networks consisting of one hidden layer are universal approximators, in recent years one also employs neural networks consisting of many more layers. This is due to new developments and insights in the area of deep learning [25, 41]. It has been shown how different hierarchical sets of features, extracted from different layers, can improve the performance with respect to one-hidden-layer networks. It is also known that functions that can be compactly represented with k layers may require exponential size with k − 1 layers [25].

2.3.2 The curse of dimensionality

Universal approximation for neural networks is a nice property. However, one could argue that polynomial expansions also possess this property. So is there really a reason why we should use neural nets instead of these? The answer is yes. The reason is that neural networks are better able to cope with the curse of dimensionality. Barron [23] showed in 1993 that neural networks can avoid the curse of dimensionality in the sense that the approximation error becomes independent of the dimension of the input space (under certain conditions), which is not the case for polynomial expansions. The approximation error for MLPs with one hidden layer is of order of magnitude O(1/n_h), but O(1/n_p^{2/n}) for polynomial expansions, where n_h denotes the number of hidden units, n the dimension of the input space and n_p the number of terms in the expansion. Consider for example y = f(x_1, x_2, x_3). A polynomial expansion with terms up to degree 7 would contain very many terms:

   y = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_{11} x_1^2 + a_{22} x_2^2 + a_{33} x_3^2 + a_{12} x_1 x_2 + a_{13} x_1 x_3 + a_{23} x_2 x_3 + a_{111} x_1^3 + a_{222} x_2^3 + a_{333} x_3^3 + ... + a_{1111111} x_1^7 + a_{2222222} x_2^7 + ....

This means that for a given number of training data, the number of parameters to be estimated is huge (which should be avoided, as we will discuss later in the Chapter on learning and generalization). For MLPs

   y = w^T tanh(V [x_1; x_2; x_3] + β)

the number of interconnection weights grows less dramatically when the dimension of the input space grows.
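To make this concrete, here is a small sketch counting parameters; the number of monomials in n variables up to degree d is C(n + d, d) (the constant term is excluded below), and the hidden-layer size 20 is an arbitrary example:

    from math import comb

    def poly_terms(n, d):
        # monomials in n variables up to total degree d, without constant term
        return comb(n + d, d) - 1

    def mlp_weights(n, nh, l=1):
        # one-hidden-layer MLP of eq. (2.2): V (nh x n), beta (nh), W (l x nh)
        return nh * (n + 1) + l * nh

    print(poly_terms(3, 7))     # 119 coefficients for n = 3, degree 7
    print(mlp_weights(3, 20))   # 100 weights for nh = 20 hidden units

    # the gap explodes with the input dimension, e.g. n = 10:
    print(poly_terms(10, 7))    # 19447
    print(mlp_weights(10, 20))  # 240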

2.4 Backpropagation training

2.4.1 Generalized delta rule

In the previous Section we discussed model structures and their parameterizations, meaning that the problem becomes reduced to the determination of the interconnection weights of the neural network. The first algorithm invented for the training of multilayer perceptrons (and multilayer feedforward networks in general) is the backpropagation method [53]. The invention of this supervised learning method caused a worldwide boom in neural networks research and applications within a large variety of areas. We present the method here in its general form.
Consider a multilayer feedforward neural network with L layers (L − 1 hidden layers) (index l = 1, ..., L), P input patterns and corresponding desired output patterns (index p = 1, ..., P) and N_l neurons in layer l. For the network description we have the following relation between layer l and layer l + 1 (Fig.2.11):

   x_{i,p}^{l+1} = σ(ξ_{i,p}^{l+1}),   ξ_{i,p}^{l+1} = ∑_{j=1}^{N_l} w_{ij}^{l+1} x_{j,p}^l   (2.20)

where the upper index is the layer index and the lower indices indicate the neuron within a layer and the pattern index. x_{i,p}^l denotes the i-th component of the output vector of layer number l for pattern p and w_{ij}^l the ij-th entry of the interconnection matrix of layer l. Eventually, the activation function σ(·) might also change from layer to layer. Before we are in a position to formulate the backpropagation algorithm we first have to define the so-called δ variables

   δ_{i,p}^l = ∂E_p / ∂ξ_{i,p}^l   (2.21)

Figure 2.11: One layer from a feedforward network consisting of L layers.

where E_p = (1/2) ∑_{i=1}^{N_L} (x_{i,p}^d − x_{i,p}^L)^2 is the error for pattern p and x_{i,p}^d denotes the desired output (note that this method is supervised). The objective function (sometimes called energy function) that one usually optimizes for the neural network is the mean squared error (MSE) on the training set of patterns:

   min_{w_{ij}^l} E = (1/P) ∑_{p=1}^{P} E_p,   E_p = (1/2) ∑_{i=1}^{N_L} (x_{i,p}^d − x_{i,p}^L)^2.   (2.22)

The backpropagation algorithm (or generalized delta rule) is then given by the following rule:

   Δw_{ij}^l = η δ_{i,p}^l x_{j,p}^{l-1} = −η ∂E_p/∂w_{ij}^l
   δ_{i,p}^L = (x_{i,p}^d − x_{i,p}^L) σ'(ξ_{i,p}^L)
   δ_{i,p}^l = ( ∑_{r=1}^{N_{l+1}} δ_{r,p}^{l+1} w_{ri}^{l+1} ) σ'(ξ_{i,p}^l),   l = 1, ..., L − 1   (2.23)

with learning rate η. Note that the last equation is a backward recursive relation on the δ_{i,p}^l variable in the layer index l. The backpropagation algorithm is an elegant method to obtain analytic expressions for the gradient of the cost function defined on a feedforward network with many layers. One could imagine that obtaining expressions for the gradient in the case of one hidden layer is straightforward. However, suppose one has a network with e.g. 100 layers; it is clear then that obtaining an expression for the gradient becomes far from trivial, while by applying this generalized delta rule it remains straightforward. The special structure appearing in this generalized delta rule is due to the layered structure of the network. In order to fix the ideas, an application of the generalized delta rule is shown for an MLP with one hidden layer (L = 2) in Fig.2.12. In this case the equations become

   Δw_{ij}^2 = η δ_{i,p}^2 x_{j,p}^1
   Δw_{ij}^1 = η δ_{i,p}^1 x_{j,p}^0
   δ_{i,p}^2 = (x_{i,p}^d − x_{i,p}^2) σ'(ξ_{i,p}^2)
   δ_{i,p}^1 = ( ∑_{r=1}^{N_2} δ_{r,p}^2 w_{ri}^2 ) σ'(ξ_{i,p}^1).   (2.24)

This backpropagation method can be used either off-line (batch mode) or on-line. Off-line means that one first presents all training patterns before updating the weights after calculating Δw_{ij}^l. Presenting all the training data before doing this update is called one epoch. This corresponds in fact to one iteration step if one takes an optimization theory point of view. On-line updating means that one updates Δw_{ij}^l each time a new pattern of the training data set is presented. The latter opens the possibility for its application to adaptive signal processing (in fact one can also show that backpropagation is an extension of the well-known LMS algorithm towards layered nonlinear models). More advanced on-line learning algorithms have also been developed which are based on extended Kalman filtering. Often a momentum term is used in backpropagation in order to speed up the method. One then adapts the interconnection weights according to

   Δw_{ij}^l(k + 1) = η δ_{i,p}^l x_{j,p}^{l-1} + α Δw_{ij}^l(k)   (2.25)

where k is the iteration step and 0 < α < 1. Often also an adaptive learning rate η is taken. The previous change in the interconnection weights is taken into account in this learning rule.
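As an illustration, here is a minimal sketch of one on-line update according to (2.24), assuming tanh activations in both layers and omitting bias terms and the momentum term for brevity (the names are illustrative):

    import numpy as np

    def backprop_step(x0, xd, W1, W2, eta):
        # forward pass through the two layers, eq. (2.20)
        xi1 = W1 @ x0; x1 = np.tanh(xi1)        # hidden layer
        xi2 = W2 @ x1; x2 = np.tanh(xi2)        # output layer
        # backward pass: delta variables of eq. (2.24); tanh'(xi) = 1 - tanh(xi)^2
        d2 = (xd - x2) * (1.0 - x2**2)
        d1 = (W2.T @ d2) * (1.0 - x1**2)
        # weight updates Delta w^l = eta * delta^l * x^{l-1} (in place)
        W2 += eta * np.outer(d2, x1)
        W1 += eta * np.outer(d1, x0)
        return 0.5 * np.sum((xd - x2)**2)       # pattern error E_p for monitoring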

Figure 2.12: Illustration of backpropagation for an MLP with one hidden layer: a forward pass through x^0, x^1, x^2 via the weights w^1, w^2, and a backward pass through δ^2, δ^1.

2.4.2 Application to nonlinear system identification

Let us now apply the backpropagation algorithm to a NARX model

   ŷ_{k+1} = f(z_{k|k-n}),   z_{k|k-n} = [y_k; y_{k-1}; ...; y_{k-n}; u_k; u_{k-1}; ...; u_{k-n}]   (2.26)

where f(·) is parameterized by an MLP. The training set consists of input patterns {z_{k|k-n}}_{k=1}^N and output patterns {y_{k+1}}_{k=1}^N with N given data points. For the objective function

   min_{w_{ij}^l} (1/2N) ∑_{k=1}^{N} (y_{k+1} − ŷ_{k+1})^2   (2.27)

one has

   E = (1/N) ∑_{k=1}^{N} E_k,   E_k = (1/2) (y_{k+1} − ŷ_{k+1})^2   (2.28)

and application of the generalized delta rule (the index p becomes the discrete time index k) gives

   Δw_{ij}^l = η δ_{i,k}^l x_{j,k}^{l-1}
   δ_{i,k}^L = (y_{k+1} − ŷ_{k+1}) σ'(ξ_{i,k}^L)   (2.29)
   δ_{i,k}^l = ( ∑_{r=1}^{N_{l+1}} δ_{r,k}^{l+1} w_{ri}^{l+1} ) σ'(ξ_{i,k}^l),   l = 1, ..., L − 1

with boundary conditions:

   x_k^L = ŷ_{k+1},   x_k^0 = z_{k|k-n}.

For most practical applications one takes L = 2.


When minimizing this objective function one should be careful not to obtain overfitting. This typically occurs when too many hidden units are chosen and the interconnection weights are optimized to a local minimum. If one checks the objective function on an independent test set (fresh data), one will observe that the error on this set starts increasing at a certain moment. In such a case one should do early stopping, i.e. stop when the minimal error is obtained on the test set instead of on the training set (Fig.2.13). Otherwise the neural network will fail to obtain a good generalization (this is in fact the intelligence of the neural network) and would just memorize the presented training patterns. These issues will be discussed in more detail in the Chapter on learning and generalization.

2.5 Learning and optimization

2.5.1 From steepest descent to Newton method

When we consider the case of off-line learning, the minimization of the objective function can be studied from the viewpoint of optimization theory, where many efficient and reliable methods have been developed. For the sake of notational convenience, let us denote the problem

   min_{w_{ij}^l} E = (1/P) ∑_{p=1}^{P} E_p,   E_p = (1/2) ∑_{i=1}^{N_L} (x_{i,p}^d − x_{i,p}^L)^2
as the unconstrained nonlinear optimization

   min_{x∈R^n} f(x)

where f and x correspond to the cost function and the unknown interconnection weights, respectively.

Figure 2.13: In neural networks with too many hidden units overfitting will take place during the optimization process, meaning that the error on an independent test set will increase while the training set error keeps decreasing (MSE error versus epoch; training should be stopped at the minimum of the test set error).
The simplest optimization algorithm is of course a steepest descent algorithm

   x_{k+1} = x_k − α_k ∇f(x_k)

where x_k denotes the k-th iterate. In this case the search direction corresponds to minus the gradient of the cost function, i.e. backpropagation without momentum term. However, in the area of local optimization algorithms more advanced methods exist which are much faster [8].
Let us consider a current point x_0 in the search space and consider a Taylor expansion of the cost function around this point:

   f(x) ≃ f(x_0) + g^T Δx + (1/2) Δx^T H Δx   (2.30)

where Δx = x − x_0 denotes the step, g = ∇f(x_0) the gradient at x_0 and H = ∇²f(x_0) the Hessian at x_0. Locally, the optimal step is given by

   ∂f/∂(Δx) = g + H Δx = 0  →  Δx = −H^{-1} g.   (2.31)

Figure 2.14: Off-line learning can be considered as an optimization problem. Starting from the initial random guess x_0 for the interconnection weights, a sequence of points {x_0, x_1, x_2, ...} is generated in the search space until convergence to a local minimum. For cost functions defined on neural networks there exists a huge number of local minima. Nevertheless, many of these local minima will be good solutions to the problem.
One can see that the optimal step is determined by the gradient and the inverse of the Hessian matrix. A sequence of points {x_0, x_1, x_2, ...} in the search space is generated by applying these search directions and calculating the optimal stepsize along these directions. It is well-known that the Newton method converges quadratically, which is much faster than the steepest descent algorithm.

Unfortunately, one encounters a number of problems when trying to apply the Newton method to the training of neural networks. A first problem is that it often occurs that the Hessian has zero eigenvalues, which means that one cannot take the inverse of the matrix. A second problem is that computing the second order derivatives analytically for neural networks is very complicated. We have seen that even the calculation of the gradient by means of the backpropagation method is not that simple. One can overcome these two problems by considering Levenberg-Marquardt and quasi-Newton methods.

One also has to be aware of the fact that there are very many local minima solutions when training neural networks. MLP architectures contain many symmetries (sign flip symmetries of the weights and permutations of the hidden units). In general, for n_h hidden units one has n_h! 2^{n_h} weight symmetries that lead to the same input/output mapping of the network. One also starts from small random interconnection weights in order to avoid (too) bad local minima solutions.
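As a small numerical illustration (a sketch with an arbitrary quadratic test function, not taken from the notes): for a quadratic cost the Newton step (2.31) reaches the minimum in a single step, while one steepest descent step only moves a short distance along the negative gradient:

    import numpy as np

    # quadratic test cost f(x) = 0.5 x^T H x + g0^T x, H symmetric positive definite
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    g0 = np.array([1.0, -1.0])

    x = np.zeros(2)
    g = H @ x + g0                          # gradient at x
    x_newton = x - np.linalg.solve(H, g)    # eq. (2.31), solving H dx = -g
    x_sd = x - 0.1 * g                      # one steepest descent step

    print(H @ x_newton + g0)                # ~ [0, 0]: x_newton is the minimum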

2.5.2 Levenberg-Marquardt method

The Levenberg-Marquardt method is obtained by imposing an additional constraint ‖Δx‖_2 = 1 on the problem

   min_{Δx} f(x) = f(x_0) + g^T Δx + (1/2) Δx^T H Δx.   (2.32)

This leads to the Lagrangian function

   L(Δx, λ) = f(x_0) + g^T Δx + (1/2) Δx^T H Δx + (λ/2)(Δx^T Δx − 1)   (2.33)

and the solution follows from ∂L/∂(Δx) = 0, ∂L/∂λ = 0, where

   ∂L/∂(Δx) = g + H Δx + λ Δx = 0  →  Δx = −[H + λI]^{-1} g.   (2.34)

Note that for λ = 0 this corresponds to the Newton method and for λ → ∞ this becomes a steepest descent algorithm. By adding a positive definite matrix λI to H with λ positive, one can always make this matrix invertible by taking λ sufficiently large. At every iteration step a suitable value for λ is selected. Note that for small λ the method will converge faster. In the case of a sum squared error cost function (as used for neural networks) the Hessian H has a special structure which can be exploited. Often an approximation of H based on a Jacobian matrix is then taken.
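A minimal sketch of the resulting update (2.34); as noted above, for the sum squared error costs used with neural networks one would typically plug in a Jacobian-based approximation H ≈ J^T J (Gauss-Newton):

    import numpy as np

    def lm_step(g, H, lam):
        # eq. (2.34): Delta x = -(H + lambda I)^{-1} g;
        # lambda = 0 gives the Newton step, large lambda approaches a
        # (scaled) steepest descent step and guarantees invertibility
        return np.linalg.solve(H + lam * np.eye(len(g)), -g)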

2.5.3 Quasi-Newton methods


In quasi-Newton methods one aims at building up an approximation
for the Hessian based upon gradient information only during the it-
erative learning process. From f(x) = f(x_0) + g^T ∆x + (1/2) ∆x^T H ∆x
it follows that ∇_x f = g + H(x − x_0), or

H d_k = y_k   (2.35)
with dk = xk+1 − xk and yk = gk+1 − gk in general. This means that
there is a linear mapping between changes in the gradient and changes
in position. Assume now that the function f is nonquadratic and Bk
is an estimate for H. We have then
B_{k+1} = B_k + ∆B   (Hessian update)
B_{k+1} d_k = y_k   (quasi-Newton condition).   (2.36)
Consider now rank 1 and rank 2 updates of the Hessian with the
goal of building up curvature information. When taking the rank 1
update
∆B = q zz^T   (2.37)

q and z follow from the quasi-Newton condition (B_k + q zz^T) d_k = y_k,
which gives z = y_k − B_k d_k and q = 1/(z^T d_k). One then obtains the
symmetric rank 1 update formula

B_{k+1} = B_k + [(y_k − B_k d_k)(y_k − B_k d_k)^T] / [(y_k − B_k d_k)^T d_k]   (2.38)

which starts with B0 = I as steepest descent.


For the rank 2 update

∆B = q_1 z_1 z_1^T + q_2 z_2 z_2^T   (2.39)

the quasi-Newton condition becomes (B_k + q_1 z_1 z_1^T + q_2 z_2 z_2^T) d_k = y_k.
A possible choice is then z_1 = y_k, q_1 = 1/(z_1^T d_k), z_2 = B_k d_k,
q_2 = −1/(z_2^T d_k). This results in the BFGS (Broyden, Fletcher,
Goldfarb, Shanno) formula

B_{k+1} = B_k + (y_k y_k^T)/(y_k^T d_k) − [(B_k d_k)(B_k d_k)^T] / [(B_k d_k)^T d_k].   (2.40)
Instead of the Quasi-Newton condition Bk+1 dk = yk one can also
consider the condition
dk = Rk+1 yk (2.41)
in order to avoid direct inversion of the Hessian H in ∆x = −H −1 g.
In other words, one updates an approximation for the inverse Hessian
instead of the Hessian itself. A well-known method that one can obtain
then is the DFP (Davidon, Fletcher, Powell) formula

R_{k+1} = R_k + (d_k d_k^T)/(d_k^T y_k) − (R_k y_k y_k^T R_k)/(y_k^T R_k y_k).   (2.42)
These modifications to the Newton method result in a superlinear speed
of convergence. Unfortunately, when the neural network contains many
interconnection weights, it becomes hard to store these matrices in
computer memory. Therefore, for large scale neural networks conjugate
gradient methods are to be preferred [1, 48, 62].
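A minimal sketch of the BFGS recursion (2.40), assuming a generic
smooth cost with a hypothetical gradient function grad(x); the Hessian
estimate is started at B_0 = I and the step is damped by a fixed
learning rate for simplicity (in practice a line search is used):

    import numpy as np

    def bfgs_minimize(x, grad, lr=0.1, n_iter=100):
        """Quasi-Newton minimization with the BFGS update (2.40)."""
        B = np.eye(len(x))               # B_0 = I: first step is steepest descent
        g = grad(x)
        for _ in range(n_iter):
            dx = -np.linalg.solve(B, g)  # Newton-like step with estimate B
            x_new = x + lr * dx
            g_new = grad(x_new)
            d = x_new - x                # d_k = x_{k+1} - x_k
            y = g_new - g                # y_k = g_{k+1} - g_k
            if d @ y > 1e-12:            # curvature condition keeps B positive definite
                Bd = B @ d
                B = B + np.outer(y, y) / (y @ d) - np.outer(Bd, Bd) / (d @ Bd)
            x, g = x_new, g_new
        return x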

2.6 Methods for large scale problems


A theory of conjugate gradient algorithms has been developed origi-
nally for solving linear systems Ax = y with A = AT > 0 in an itera-
tive way. In this case it is related to the optimization of a quadratic
cost function f(x) = c + b^T x + (1/2) x^T A x, with x ∈ R^n. One has
the algorithm

p_0 = −g_0
x_{k+1} = x_k + α_k p_k,   α_k = (g_k^T g_k)/(p_k^T A p_k)
g_{k+1} = g_k + α_k A p_k   (2.43)
p_{k+1} = −g_{k+1} + β_k p_k,   β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k)

where p_k denotes the search direction at iteration k and g_k the
gradient of the cost function. The method starts as a steepest descent
algorithm.
This conjugate gradient method possesses some nice properties. One can
show that the search directions are conjugate with respect to the
matrix A, i.e. p_i^T A p_j = 0 for i ≠ j. The factors α_k that
determine the steplength are optimal in the sense that
(d/dα) f(x_k + α p_k)|_{α=α_k} = 0, meaning that [g(x_{k+1})]^T p_k = 0
(the vectors p_k and g(x_{k+1}) are orthogonal). One can also show that
the algorithm converges in at most n steps; the practical convergence
speed depends on the condition number of the matrix A. This conjugate
gradient algorithm will also be used for solving least squares support
vector machines, which will be discussed in the last Chapter.
The application of the conjugate gradient algorithm to non-quadratic
smooth cost functions (which is the case for neural networks) is done
as follows:

p_0 = −g_0
x_{k+1} = x_k + α_k p_k,   α_k s.t. min_α f(x_k + α p_k)   (line search)
p_{k+1} = −g_{k+1} + β_k p_k
   (2.44)

with

β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k)   (Fletcher-Reeves)
β_k = (g_{k+1}^T (g_{k+1} − g_k))/(g_k^T g_k)   (Polak-Ribiere).   (2.45)

Note that in the equation pk+1 = −gk+1 + βk pk one also has a momen-
tum effect, but a difference with the backpropagation with momentum
term is that βk is now automatically adjusted during the iterative pro-
cess. In these algorithms often a restart procedure is applied after n
steps, i.e. one resets the search direction again to p0 = −g0 . Modi-
fied versions of these algorithms have been successfully applied to the
optimization of neural networks [48]. The advantages of conjugate
gradient methods are that they are faster than backpropagation and
that no storage of matrices is required.
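A compact sketch of the linear conjugate gradient iteration (2.43); the
sign convention g = Ax − b is chosen here so that the iteration solves
Ax = b for a symmetric positive definite A:

    import numpy as np

    def conjugate_gradient(A, b, x=None, tol=1e-10):
        """Solve A x = b (A = A^T > 0) by the iteration of Eq. (2.43)."""
        n = len(b)
        x = np.zeros(n) if x is None else x
        g = A @ x - b                    # gradient of (1/2) x^T A x - b^T x
        p = -g                           # p_0 = -g_0: start as steepest descent
        for _ in range(n):               # at most n steps in exact arithmetic
            if np.linalg.norm(g) < tol:
                break
            Ap = A @ p
            alpha = (g @ g) / (p @ Ap)   # optimal steplength along p_k
            x = x + alpha * p
            g_new = g + alpha * Ap       # recursive gradient update
            beta = (g_new @ g_new) / (g @ g)
            p = -g_new + beta * p        # new A-conjugate search direction
            g = g_new
        return x

    # Usage sketch:
    # A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
    # x = conjugate_gradient(A, b)       # x solves A x = b

Note that only matrix-vector products and a few vectors are stored,
which is why the method scales to networks with many weights.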

2.7 Recurrent networks and dynamic backpropagation
Previously we have discussed the backpropagation algorithm for feed-
forward networks. In fact it is basically a method for analytically com-
puting the gradient of a cost function which is defined on a static net-
work (static nonlinear function or feedforward neural network). When
the cost function is defined on a recurrent neural network (i.e. a net-
work that contains feedback interconnections) then the computation
becomes (even) more complicated.
Let us start and illustrate the ideas by a simple example. Consider
a nonlinear state space model

ẋ(t) = f [x(t), α, u(t)], x(t0 ) = x0 (2.46)

with state vector x(t) ∈ Rn , input vector u(t) ∈ Rm . Suppose that


α ∈ R is some unknown parameter to be adjusted in order to minimize
the following objective
J(α) = ∫_0^T [x(t) − d(t)]² dt   (2.47)

where d(t) ∈ Rn is a desired reference trajectory in state space. When


one applies one of the previously discussed optimization algorithms
the gradient of the cost function is needed, which is given by
∂J/∂α = ∫_0^T 2 [x(t) − d(t)] (∂x/∂α) dt.   (2.48)
Clearly we then need to find an expression for ∂x/∂α. This follows
from a so-called sensitivity model

∂ẋ(t)/∂α = f_x(t) (∂x(t)/∂α) + f_α(t),   ∂x(t_0)/∂α = 0   (2.49)

where the Jacobian f_x(t) = ∂f/∂x and f_α(t) = ∂f/∂α are evaluated
around the nominal values. This sensitivity model is obtained by
differentiating the left hand side and right hand side of the state
space model with respect to α. The right hand side contains both an
implicit and an explicit dependency on α and the sensitivity model then
follows from application of

the chain rule. Finally, one can see that one has to simulate the
augmented system consisting of the state space model and the
sensitivity model, with state vectors x and ∂x/∂α, respectively.
A similar reasoning holds for input/output representations of sys-
tems. Consider for example a second order scalar nonlinear differential
equation
ÿ + F (α, ẏ) + y = u (2.50)
where F(·) is some nonlinear function depending on ẏ and α. The
initial conditions are y(0) = y_10, ẏ(0) = y_20, and α is some scalar
parameter to be adjusted. Let us take a cost function

J(α) = ∫_0^T [y(t) − d(t)]² dt   (2.51)

where d(t) is a reference trajectory. The gradient of the cost function
is given by

∂J(α)/∂α = ∫_0^T 2 [y(t) − d(t)] (∂y(t)/∂α) dt.   (2.52)
The variable ∂y(t)/∂α follows from the sensitivity model

z̈ + (∂F/∂ẏ) ż + z = −∂F/∂α   (2.53)

with z = ∂y/∂α and initial state z(0) = ż(0) = 0.


This method of computing sensitivity models is applicable both to
discrete time and continuous time systems and to models that depend
on a vector of unknown interconnection weights instead of a single
adjustable parameter α.
The procedure can be applied now to the training of recurrent neu-
ral networks either in input-output form (like the previously discussed
NOE model) or in state space form. Other procedures instead of sen-
sitivity models also exist such as backpropagation through time [64].
The use of a sensitivity model in combination with a gradient based lo-
cal optimization algorithm is called dynamic backpropagation [50]. We
illustrate this for a nonlinear state space model that is parameterized
by MLPs (neural state space models)

x̂k+1 = WAB tanh(VA x̂k + VB uk + βAB ) ; x̂0 = x0
(2.54)
ŷk = WCD tanh(VC x̂k + VD uk + βCD )

where W∗ , V∗ , β∗ are interconnection matrices and bias vectors with


dimensions compatible with the input vector uk , estimated state vec-
tor x̂k and estimated output vector ŷk . Given a training set of N
input/output data the cost function is given by

J(θ) = (1/(2N)) Σ_{k=1}^{N} ǫ_k(θ)^T ǫ_k(θ)   (2.55)

with ǫ_k = y_k − ŷ_k(θ) and unknown parameter vector
θ = [WAB(:); VA(:); VB(:); βAB; WCD(:); VC(:); VD(:); βCD], where the
Matlab notation ':' denotes columnwise scanning of a matrix into a
vector. The gradient of the cost function is

∂J/∂θ = (1/N) Σ_{k=1}^{N} (∂ǫ_k(θ)/∂θ)^T ǫ_k(θ)   (2.56)

where

∂ǫ_k(θ)/∂θ = ∂[y_k − ŷ_k(θ)]/∂θ = −∂ŷ_k(θ)/∂θ   (2.57)

and ∂ŷ_k(θ)/∂θ follows from the sensitivity model. The neural state
space model is of the form

x̂_{k+1} = Φ(x̂_k, u_k, ǫ_k; α),   x̂_0 = x_0 given
ŷ_k = Ψ(x̂_k, u_k; β)   (2.58)

with α, β elements of the parameter vector θ. The sensitivity model
becomes (Fig.2.15)

∂x̂_{k+1}/∂α = (∂Φ/∂x̂_k)(∂x̂_k/∂α) + ∂Φ/∂α
∂ŷ_k/∂α = (∂Ψ/∂x̂_k)(∂x̂_k/∂α)   (2.59)
∂ŷ_k/∂β = ∂Ψ/∂β.

From an elementwise notation of the model

x̂^i := Σ_j wAB_{ij} tanh( Σ_r vA_{jr} x̂^r + Σ_s vB_{js} u^s + βAB_j )
ŷ^i = Σ_j wCD_{ij} tanh( Σ_r vC_{jr} x̂^r + Σ_s vD_{js} u^s + βCD_j ),   (2.60)

Figure 2.15: In dynamic backpropagation of recurrent networks a
sensitivity model (producing ∂ŷ/∂α and ∂ŷ/∂β from ∂Φ/∂α and ∂Ψ/∂β) is
simulated simultaneously with the model in order to compute the
gradient of the cost function.

and after defining φ^l = Σ_r vA_{lr} x̂^r + Σ_s vB_{ls} u^s + βAB_l and
ρ^l = Σ_r vC_{lr} x̂^r + Σ_s vD_{ls} u^s + βCD_l, one obtains the
derivatives

∂Φ/∂α :
  ∂Φ^i/∂wAB_{jl} = δ_{ij} tanh(φ^l)
  ∂Φ^i/∂vA_{jl} = wAB_{ij} (1 − tanh²(φ^j)) x̂^l
  ∂Φ^i/∂vB_{jl} = wAB_{ij} (1 − tanh²(φ^j)) u^l   (2.61)
  ∂Φ^i/∂βAB_j = wAB_{ij} (1 − tanh²(φ^j))

∂Ψ/∂β :
  ∂Ψ^i/∂wCD_{jl} = δ_{ij} tanh(ρ^l)
  ∂Ψ^i/∂vC_{jl} = wCD_{ij} (1 − tanh²(ρ^j)) x̂^l
  ∂Ψ^i/∂vD_{jl} = wCD_{ij} (1 − tanh²(ρ^j)) u^l
  ∂Ψ^i/∂βCD_j = wCD_{ij} (1 − tanh²(ρ^j))

∂Φ/∂x̂_k :   ∂Φ^i/∂x̂^r = Σ_j wAB_{ij} (1 − tanh²(φ^j)) vA_{jr}

∂Ψ/∂x̂_k :   ∂Ψ^i/∂x̂^r = Σ_j wCD_{ij} (1 − tanh²(ρ^j)) vC_{jr}

where δij denotes the Kronecker delta.



2.8 Model validation


Once a model has been trained it is important to check the validity
of the derived model. As discussed earlier it can very well be that
the performance on the training set is good with a low cost function
value but that the performance on an independent test set is very bad.
These important issues of learning and generalization will be discussed
in a following Chapter. We briefly discuss here some other relevant
tests which can be used in order to check whether the obtained models
are statistically valid.
One approach is to check whiteness of residuals, meaning that the
residuals ǫk = yk −ŷk must be uncorrelated with all linear and nonlinear
combinations of past inputs and outputs. This means in fact that the
residuals do not contain any remaining or relevant information that
could be further incorporated in the model. According to [29, 30] one
should check
φ_ǫǫ(k) = impulse function
φ_uǫ(k) = 0, ∀k
φ_ǫ(ǫu)(k) = 0, k ≥ 0   (2.62)
φ_{u²′ǫ}(k) = 0, ∀k
φ_{u²′ǫ²}(k) = 0, ∀k

where φ_xz indicates the cross-correlation function between x_k and
z_k, and u²′(k) = u²(k) − ū²(k), with ū²(k) the time average or mean of
u²(k). The first two tests are common in the area of linear system
identification. In practice one works with normalized correlations
(−1 ≤ φ_{ψ1ψ2}(τ) ≤ 1)

φ_{ψ1ψ2}(τ) = Σ_{k=1}^{N−τ} ψ_1(k) ψ_2(k+τ) / [Σ_{k=1}^{N} ψ_1²(k) Σ_{k=1}^{N} ψ_2²(k)]^{1/2}   (2.63)

for two sequences ψ_1(k), ψ_2(k) with discrete time index k. One then
defines 95% confidence bands as ±1.96/√N, with N the number of data
points. The identified model should be rejected if the correlations do
not stay within these bands. In that case one should try to look for
other inputs, or other past input and output values, to incorporate
into the model.
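As a sketch (assuming residuals eps and inputs u are given as NumPy
arrays, and checking only non-negative lags for brevity), the
normalized correlation test (2.63) with its 95% confidence band might
be implemented as follows:

    import numpy as np

    def norm_xcorr(psi1, psi2, max_lag):
        """Normalized cross-correlation (2.63) for lags 0..max_lag."""
        denom = np.sqrt(np.sum(psi1**2) * np.sum(psi2**2))
        return np.array([np.sum(psi1[:len(psi1) - t] * psi2[t:]) / denom
                         for t in range(max_lag + 1)])

    def whiteness_test(eps, u, max_lag=20):
        """Check phi_ee(k) ~ impulse and phi_ue(k) ~ 0 within 95% bands."""
        N = len(eps)
        band = 1.96 / np.sqrt(N)                   # 95% confidence band
        phi_ee = norm_xcorr(eps, eps, max_lag)
        phi_ue = norm_xcorr(u, eps, max_lag)
        ok_ee = np.all(np.abs(phi_ee[1:]) < band)  # lag 0 equals 1 by construction
        ok_ue = np.all(np.abs(phi_ue) < band)
        return ok_ee and ok_ue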
A second well-known method is to apply a hypothesis test. Sup-
pose Ω(k) is an s-dimensional vector valued function of past inputs,
outputs and prediction errors. A convenient choice is Ω(k) = [ω(k); ω(k−
1); ... ; ω(k − s + 1)] with ω a function of input or output data. As the
50

null hypothesis one could take “The data are generated by the obtained
model”. If this hypothesis is true then the statistic ζ defined by

ζ = N µ^T (Γ^T Γ)^{-1} µ   (2.64)

with Γ^T Γ = (1/N) Σ_{k=1}^{N} Ω(k) Ω^T(k) and
µ = (1/N) Σ_{k=1}^{N} Ω(k) ǫ(k)/σ_ǫ, is asymptotically chi-squared
distributed with s degrees of freedom. Here σ_ǫ² denotes the variance
of the residuals ǫ. The model is then regarded as adequate if

ζ < χ²_s(α)   (2.65)

where χ²_s(α) is the critical value of the chi-squared distribution
with s degrees of freedom given a significance level α (e.g. 0.05,
i.e. an acceptance region of 95%).
Chapter 3

Neural Networks and Classification

In this chapter we discuss basic notions of classification and pattern


recognition, especially in relation to neural networks, including linear
versus nonlinear separability, Bayes rule and classification problems,
density estimation, preprocessing, dimensionality reduction and ROC
curves. This chapter is partially based on [1, 3, 5, 6, 14].

3.1 Single neuron case: perceptron algorithm
The perceptron algorithm was one of the early methods in the area
of neural networks. Especially in the sixties this algorithm has been
investigated. A perceptron consists of a single neuron with a sign acti-
vation function (hardlimiter). Given a set of features x the perceptron
is able to realize a linear classification into two halfspaces.
The perceptron network (Fig.3.1):

y = sign(v^T x + b) = sign(w^T z)   (3.1)

is trained in a supervised way for a given training set
{x^(i), d^(i)}_{i=1}^{N}. The following notation is used here:
augmented input vector z = [x; 1] ∈ R^{n+1}, w = [v; b], input x ∈ R^n,
output y ∈ R and desired outputs d.


Figure 3.1: Training of a perceptron: the inputs x_1, ..., x_n are
weighted by v_1, ..., v_n, a bias b is added, and the error with
respect to the desired output d drives the weight updates; the decision
boundary is v^T x + b = 0.



Perceptron algorithm
1. Choose c > 0

2. Initialize small random weights w. Set k = 1, i = 1 and set cost


function E = 0.

3. The training cycle begins here. Present the i-th input and com-
pute the corresponding output

y (i) = sign(w T z (i) ).

4. Update the weights

w := w + (c/2) [d^(i) − y^(i)] z^(i).

5. Compute the cycle error

E := E + (1/2) [d^(i) − y^(i)]².

6. If i < N, then i := i + 1 and k := k + 1 and go to step 3.


Otherwise, go to step 7.

7. If E = 0 then stop the training. If E > 0 then set E = 0, i = 1


and enter a new training cycle by going to step 3.
Note that the weights w are updated by taking into account the errors
[d(i) − y (i) ] which is also the case in the backpropagation algorithm
where the errors are backpropagated by means of the δ variables. A
proof of convergence exists for this perceptron algorithm.
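A minimal sketch of this training loop (assuming inputs X of shape
(N, n) and desired outputs d with values in {−1, +1}):

    import numpy as np

    def train_perceptron(X, d, c=1.0, max_cycles=100, seed=0):
        """Perceptron algorithm on augmented inputs z = [x; 1], w = [v; b]."""
        Z = np.hstack([X, np.ones((len(X), 1))])     # augmented inputs z = [x; 1]
        w = 0.01 * np.random.default_rng(seed).standard_normal(Z.shape[1])
        for _ in range(max_cycles):
            E = 0.0
            for z_i, d_i in zip(Z, d):
                y_i = 1.0 if w @ z_i >= 0 else -1.0  # y = sign(w^T z)
                w = w + 0.5 * c * (d_i - y_i) * z_i  # update only on errors
                E += 0.5 * (d_i - y_i) ** 2          # cycle error
            if E == 0.0:                             # perfect cycle: stop training
                break
        return w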

3.2 Linear versus non-linear separability


The use of a perceptron model has serious limitations. A simple ex-
ample of the well-known XOR problem shows that the perceptron is
unable to find a decision line which correctly separates the two classes
for this problem (Fig.3.2). Indeed the perceptron can only realize a
hyperplane in the input space of given features (or straight line in the
case of two inputs). For this reason there was little research in the
area of neural networks in the seventies, until the introduction

Figure 3.2: Exclusive-OR (XOR) problem.

of multilayer perceptrons with the backpropagation algorithm. By


adding a hidden layer (Fig.3.3) the XOR problem can be solved be-
cause nonlinear decision boundaries can be realized. By means of a
hidden layer one can realize convex regions, and furthermore by means
of two hidden layers non-connected and non-convex regions can be re-
alized. The universal approximation ability of neural networks makes
it also a powerful tool in order to solve classification problems.
An interesting theorem about separability is Cover’s Theorem. It
considers the case of continuous input variables in a d-dimensional
space and addresses the question of what is the probability that a
random set of patterns that is generated in this input space is linearly
separable. One randomly assigns points to classes C_1, C_2 with equal
probability, where each possible assignment for the complete data set
is called a dichotomy. For N points there are 2^N possible dichotomies.
The fraction F(N, d) of these dichotomies which is linearly separable
is given by:

F(N, d) = 1,   N ≤ d + 1
F(N, d) = (1/2^{N−1}) Σ_{i=0}^{d} C(N−1, i),   N ≥ d + 1   (3.2)

with C(N, M) = N!/((N−M)! M!) the number of combinations of M objects
selected from a total of N. Some important conclusions at this point

Figure 3.3: From perceptron to multilayer perceptron (MLP) classifiers.

are that if N ≤ d + 1 any labelling of the points will always lead to
linear separability, and for larger d it becomes more likely that more
dichotomies are linearly separable. If one takes e.g. N = 4 and d = 2,
one of the
possible dichotomies corresponds to the XOR problem configuration.
According to Fig.3.4 one can indeed see that not all problems are
linearly separable. However, if one considers the same problem e.g. for
d = 5 instead of d = 2 the problem becomes linearly separable. The
problem of how to select a good decision boundary will be discussed
later. Cover’s theorem should rather be considered as a theorem about
existence of hyperplanes and not about whether this hyperplane is
good or bad in terms of generalization.
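As a sketch, F(N, d) from (3.2) can be evaluated directly, e.g. to
check that for N = 4, d = 2 only a fraction of the dichotomies (the
XOR-type labellings being among the non-separable ones) is linearly
separable:

    from math import comb

    def cover_fraction(N, d):
        """Fraction of the 2^N dichotomies of N points in d dimensions
        that is linearly separable, Eq. (3.2)."""
        if N <= d + 1:
            return 1.0
        return sum(comb(N - 1, i) for i in range(d + 1)) / 2 ** (N - 1)

    # F(4, 2) = (C(3,0) + C(3,1) + C(3,2)) / 8 = 7/8;
    # F(4, 5) = 1: in 5 dimensions the same 4 points are always separable.
    print(cover_fraction(4, 2), cover_fraction(4, 5))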

3.3 Multilayer perceptron classifiers


A popular approach for applying MLPs to classification problems is
to consider it as a regression problem (which has been explained in
the previous Chapter) with the class labels as target outputs.
Consider an MLP with input vector x ∈ Rm and output y ∈ Rl :

y = W tanh(V x + β). (3.3)


Figure 3.4: Fraction F(N, d) of dichotomies which is linearly separable
according to Cover's Theorem, shown as a function of N/(d+1) for d = 1,
d = 20 and d → ∞, where N is the number of points and d is the
dimension of the input space.

The output values {−1, +1} (or {0, 1}) denote then the two classes.
In the case of multiple outputs, one can encode 2^l classes by means of
l outputs, e.g. for l = 2 one has the following option:
y1 y2
Class 1 +1 +1
Class 2 +1 −1
Class 3 −1 +1
Class 4 −1 −1
but on the other hand one can also utilize one output per class:
y1 y2 y3 y4
Class 1 +1 −1 −1 −1
Class 2 −1 +1 −1 −1
Class 3 −1 −1 +1 −1
Class 4 −1 −1 −1 +1
which is better from the viewpoint of information theory. For example
in trying to recognize the letters of the alphabet by an MLP one can
take 26 outputs (Fig.3.5). Training can be done then in the same way
as for regression with the class labels as target values for the outputs.
After training of the classifier, decisions are made by the classifier as
follows:
y = sign[W tanh(V x + β)] (3.4)

Figure 3.5: Illustration of a regression approach of MLPs to a
multiclass classification problem: a 7 × 5 pixel image of a letter
yields an input vector of dimension 35, and the output vector of
dimension 26 contains one output per letter of the alphabet.

where the sign function is a hardlimiter for obtaining a binary decision.


If the output does not correspond to one of the defined class codes,
one can assign it to the class which is closest in terms of Hamming
distance.

3.4 Classification and Bayes rule


3.4.1 Bayes rule - discrete variables
In the previous Section we discussed a pragmatic use of neural net-
works for classification problems. Now, we will analyse the problem of
classification in a more theoretical way in relation to Bayesian decision
theory.
In order to fix the ideas let us consider a problem of fraud detection
in mobile communications. A decision made by a classifier system is
done on the basis of a number of defined features (Fig.3.6). Some
relevant features are e.g. the duration of a phone call, the destination
of the call, the frequency of the phone calls relative to the user
profile, etc. When a mobile phone is stolen one might discriminate
between fraud and non-fraud by checking such features.
Now, typically the number of examples of non-fraud (data of class
C1 ) is much larger than the examples of fraud (data of class C2 ),
suppose e.g. that there would be 1000 times more examples of non-
fraudulent calls than fraudulent ones. In a probabilistic setting one
could say then that the prior class probabilities are P (C1 ) = 1000/1001,
P (C2 ) = 1/1001. At this stage the best classification rule would be
P (C1 ) > P (C2 ) meaning that we would always classify a new case as
belonging to class C1 according to this classification rule. The question
is then whether there exists a formalism which can combine this prior
knowledge with information of a training data set. The answer is Yes:
Bayesian decision theory can help us at this point.
Suppose that we consider fraud detection based on the frequency of
phone calls and assume that we characterize this feature as x_1 (the
total feature vector x is d-dimensional) in terms of discrete values
X_l for
l = 1, ..., L where L denotes the total number of discrete values that
this variable can take. One can then consider the joint probability
P (Ck , Xl ), the conditional probability P (Xl |Ck ) (i.e. the probability
that an observation takes value Xl given it belongs to class Ck ), the
conditional probability P (Ck |Xl ) (i.e. the probability that the class
is Ck given that an observation takes value Xl ) and the prior class
probability P (Ck ). For the case of a binary classification problem one
has k = 1, 2.
One can write then

P (Ck , Xl ) = P (Xl |Ck )P (Ck )

but also
P (Ck , Xl ) = P (Ck |Xl )P (Xl ).
Hence
P (Xl |Ck )P (Ck )
P (Ck |Xl ) = (3.5)
P (Xl )
or conceptually

Class conditional Prob. × Prior Prob.


Posterior Prob. = (3.6)
Normalization
which is Bayes’ Theorem (Thomas Bayes 1702-1761).

The normalization is done by P(X_l), which can be re-expressed from

P(C_1|X_l) + P(C_2|X_l) = 1
P(C_k|X_l) = P(X_l|C_k)P(C_k)/P(X_l),   k = 1, 2

giving

P(X_l) = P(X_l|C_1)P(C_1) + P(X_l|C_2)P(C_2).
Hence Bayes' rule can be written as

P(C_k|X_l) = P(X_l|C_k)P(C_k) / [P(X_l|C_1)P(C_1) + P(X_l|C_2)P(C_2)],   k = 1, 2.   (3.7)

Using this Bayes theorem, the classification can be based now on the
posterior probability P (Ck |Xl ) instead of the prior probabilities P (Ck )
only. The probability of misclassification is minimized by selecting the
class for which the posterior P (Ck |Xl ) is maximal.
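As a small numeric sketch of (3.7) for the fraud example; the
class-conditional probabilities below are invented purely for
illustration:

    # Priors from the fraud example: P(C1) = non-fraud, P(C2) = fraud.
    P_C1, P_C2 = 1000 / 1001, 1 / 1001

    # Hypothetical class-conditional probabilities of observing a very
    # high call frequency X_l given each class (illustrative values only).
    P_X_given_C1, P_X_given_C2 = 0.001, 0.5

    P_X = P_X_given_C1 * P_C1 + P_X_given_C2 * P_C2   # normalization
    P_C2_given_X = P_X_given_C2 * P_C2 / P_X          # posterior, Eq. (3.7)
    print(P_C2_given_X)   # ~0.33: far larger than the prior 1/1001

The posterior fraud probability rises from the prior 1/1001 to about
one third once the unusual observation is taken into account.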

3.4.2 Bayes rule - continuous variables


Let us consider now continuous variables in the feature space instead of
discrete variables X_l. Recall that the probability that x lies in
[a, b] is P(x ∈ [a, b]) = ∫_a^b p(x)dx, with probability density p(x)
and P(x ∈ R) = ∫_R p(x)dx = 1. The average value of a function Q(x)
with respect to the density p(x) is given by E[Q] = ∫_R Q(x)p(x)dx.
Suppose a finite data set x_1, ..., x_N is drawn from p(x); then the
approximation to this integral E[Q] ≃ (1/N) Σ_{n=1}^{N} Q(x_n) makes
sense. After these preliminaries we are now in a position to formulate
Bayes' theorem for densities and a number of c classes. Bayes' rule
becomes

P(C_k|x) = p(x|C_k)P(C_k)/p(x),   k = 1, ..., c,   x ∈ R^d

Σ_{k=1}^{c} P(C_k|x) = 1   (normalization)   (3.8)

p(x) = Σ_{k=1}^{c} p(x|C_k)P(C_k)   (unconditional density).

Figure 3.6: Several steps in a pattern recognition process: data →
preprocessing → definition of feature space → classifier design
(training) → categorization into pattern classes.



The class-conditional density p(x|Ck ) is e.g. obtained by probability


density estimation methods. If p(x|Ck ) has a parameterized form,
then it is called a likelihood function. One can state then

Likelihood × Prior
Posterior = . (3.9)
Normalization

3.4.3 Decision making


In order to make decisions based upon the Bayes rule one assigns x to
class k ∗ where
k ∗ = arg max P (Ck |x). (3.10)
k=1,...,c

Hence, the class with maximum posterior probability is selected. In-


deed, a feature vector x is assigned to Ck if

P(C_k|x) > P(C_j|x)   ∀j ≠ k

or

p(x|C_k)P(C_k) > p(x|C_j)P(C_j)   ∀j ≠ k.

Selecting the class Ck with largest posterior probability minimizes


the probability of misclassification. In order to understand this, let us
write down the probability for a correct and wrong classification. In
the case of two classes one has

P(error) = P(x ∈ R_2, C_1) + P(x ∈ R_1, C_2)
         = P(x ∈ R_2|C_1)P(C_1) + P(x ∈ R_1|C_2)P(C_2)
         = ∫_{R_2} p(x|C_1)P(C_1)dx + ∫_{R_1} p(x|C_2)P(C_2)dx
   (3.11)
P(correct) = Σ_{k=1}^{c} P(x ∈ R_k|C_k)P(C_k)
           = Σ_{k=1}^{c} ∫_{R_k} p(x|C_k)P(C_k)dx.

The shaded area in Fig.3.7 is minimal if one minimizes the probability


of misclassification. This classification leads to regions R1 , ..., Rc in
the feature space. These regions can also be disjoint.

Figure 3.7: A minimal shaded area (the overlap of p(x|C_1)P(C_1) and
p(x|C_2)P(C_2) over the regions R_1, R_2) corresponds to minimizing the
probability of misclassification, when considering a moving threshold
(dashed vertical line).

3.4.4 Discriminant functions


The Bayes decision rule can be interpreted as assigning x to class C_k
if

y_k(x) > y_j(x)   ∀j ≠ k   (3.12)

where y_k(x), y_j(x) are discriminant functions. The case

yk (x) ∝ p(x|Ck )P (Ck ) (3.13)

corresponds then to minimizing the probability of misclassification.


Only relative magnitudes of discriminant functions are important.
One may apply a monotonic function g(·) as g(yk (x)), e.g.:

yk (x) = log[p(x|Ck )P (Ck )]


(3.14)
= log p(x|Ck ) + log P (Ck ).

The decision boundaries yk (x) = yj (x) are not influenced by the choice
of the monotonic function.
In the case of a binary classification problem one may take a re-
formulation by a single discriminant function:

y(x) = y1 (x) − y2 (x) (3.15)

with class C1 if y(x) ≥ 0 and class C2 if y(x) < 0. Instead of using two
discriminant functions one can take a single one.

3.4.5 Minimizing risk


Minimizing the probability of misclassification may not be the best
criterion in some circumstances: e.g. it might be more serious when
a fraudulent phone call is classified as normal than a non-fraudulent
call classified as fraud.
For this purpose one can define a so-called loss matrix L where an
element Lkj denotes the penalty associated with assigning a pattern
to class Cj when it belongs to Ck . The expected loss is then defined as
R_k = Σ_{j=1}^{c} ∫_{R_j} L_kj p(x|C_k) dx   (3.16)

and the overall expected loss or risk equals

R = Σ_{k=1}^{c} R_k P(C_k)
  = Σ_{j=1}^{c} ∫_{R_j} { Σ_{k=1}^{c} L_kj p(x|C_k)P(C_k) } dx.   (3.17)

The region R_j is then chosen if

Σ_{k=1}^{c} L_kj p(x|C_k)P(C_k) < Σ_{k=1}^{c} L_ki p(x|C_k)P(C_k)   ∀i ≠ j.   (3.18)

In the case of two classes this means that one selects class 1 if
L11 p(x|C1 )P (C1 ) + L21 p(x|C2 )P (C2 )
(3.19)
< L12 p(x|C1 )P (C1 ) + L22 p(x|C2 )P (C2 )
or

(L21 − L22 )p(x|C2 )P (C2 ) < (L12 − L11 )p(x|C1 )P (C1 ). (3.20)

Usually one has L_ij > L_ii. One then obtains the likelihood ratio test

l_12(x) = p(x|C_1)/p(x|C_2) > (P(C_2)/P(C_1)) (L_21 − L_22)/(L_12 − L_11) = θ_12.   (3.21)
The classification is made as follows: class C1 if l12 (x) > θ12 and class
C2 if l12 (x) < θ12 . A typical choice for the loss matrix is

Lkj = 1 − δkj (3.22)



with δ_kj the Kronecker delta. For c = 2 one has L_11 = L_22 = 0,
L_12 = L_21 = 1. This means that the loss is equal to 1 if the pattern
is placed in the wrong class and zero if the pattern is placed in the
correct class. This then corresponds again to the minimal
misclassification decision rule.

3.5 Gaussian density assumption


3.5.1 Normal density
Let us now investigate what happens if we assume that certain densi-
ties appearing in the Bayes rule are Gaussian.
Recall that the normal density in a single variable x ∈ R is

p(x) = (2πσ²)^{-1/2} exp{−(x − µ)²/(2σ²)}   (3.23)

with ∫_{−∞}^{∞} p(x)dx = 1, µ the mean and σ² the variance (σ is the
standard deviation). One has

µ = E[x] = ∫_{−∞}^{∞} x p(x)dx
σ² = E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x)dx.   (3.24)

In higher dimensions one has the multivariate normal probability
density:

p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp{−(1/2)(x − µ)^T Σ^{-1}(x − µ)}   (3.25)

with ∫ p(x)dx = 1, µ ∈ R^d the mean and Σ ∈ R^{d×d} the covariance
matrix (Σ = Σ^T > 0). One has

µ = E[x]
Σ = E[(x − µ)(x − µ)^T].   (3.26)

The so-called Mahalanobis distance from x to µ is given by
∆² = (x − µ)^T Σ^{-1}(x − µ). Surfaces of constant probability density
(∆² = const.) are characterized by hyperellipsoids with principal axes
u_i related to the eigenvalue decomposition of the covariance matrix

Σ u_i = λ_i u_i   (3.27)

Figure 3.8: Multivariate normal density: eigenvectors u_1, u_2 and
eigenvalues λ_1, λ_2 of the covariance matrix determine the principal
axes (of lengths λ_1^{1/2}, λ_2^{1/2}) of the constant-density
ellipses.

with u_i, λ_i eigenvectors and eigenvalues of Σ, respectively
(Fig.3.8). Because Σ is symmetric positive definite the eigenvalues are
real and positive. Note that when Σ is diagonal, the components of x
are statistically independent because p(x) = Π_{i=1}^{d} p(x_i) holds
in that case.

The normal density has some interesting properties. It is a simple


assumption which usually leads to analytically computable results.
The central limit theorem states that under rather general circum-
stances the mean of a number of M random variables tends to be
distributed normally, in the limit as M tends to infinity. In many
real-life datamining applications this is a reasonable assumption (when
outliers are removed) because one usually has large data sets. Also,
related marginal densities (densities obtained by integrating out some
of the variables), have a normal distribution then as well. The related
conditional densities are also normally distributed. Another property
is that by coordinate transformation one can diagonalize the covari-
ance matrix, and one can prove that among all possible densities the
normal density maximizes the entropy S = −∫ p(x) log p(x) dx for a
given mean and covariance matrix.

3.5.2 Discriminant functions


Let us now assume that the densities p(x|Ck ) are normally distributed
for all k and consider the discriminant functions
yk (x) = log p(x|Ck ) + log P (Ck ). (3.28)
One then obtains

y_k(x) = −(1/2)(x − µ_k)^T Σ_k^{-1}(x − µ_k) − (1/2) log |Σ_k| + log P(C_k).   (3.29)
The decision boundaries yk (x) = yj (x) are given by quadratic forms
in Rd . For the binary classification case one obtains the following
important special cases (Fig.3.9):
1. Σ1 = Σ2 = Σ:
For equal covariance matrices the decision boundary becomes
linear, either for distributions having a small or large overlap.
2. Σ1 = Σ2 = Σ = σ 2 I:
Under this assumption one obtains

y_k(x) = −‖x − µ_k‖_2²/(2σ²) + log P(C_k),   k = 1, 2.   (3.30)
If the prior probabilities are equal, then the mean vectors µ1 , µ2
act as prototypes. Under these assumptions the classification
can be done by simple calculation of the Euclidean distance to
µk . When applying this kind of classification rule in general one
should be aware of the assumptions under which the procedure
is optimal.

3.5.3 Logistic discrimination


For normally distributed class-conditional densities, the posterior prob-
abilities can be obtained by a logistic one-neuron network:
y = σ(a) = σ(w^T x + b)   (3.31)

with logistic activation function

σ(a) = 1/(1 + exp(−a)) ∈ [0, 1].

Figure 3.9: Optimal decision boundaries under different assumptions for
the covariance matrices of the classes: the case Σ_1 = Σ_2 = Σ gives
linear boundaries, the case Σ_1 ≠ Σ_2 gives quadratic ones.

Similar arguments hold for the case of binary variables with Bernoulli
distribution.
For class-conditional densities with Σ_1 = Σ_2 = Σ we had

p(x|C_k) = (2π)^{-d/2} |Σ|^{-1/2} exp{−(1/2)(x − µ_k)^T Σ^{-1}(x − µ_k)}.

This leads to the following posterior

P(C_1|x) = σ(w^T x + b)   (3.32)

with

w = Σ^{-1}(µ_1 − µ_2)
b = −(1/2) µ_1^T Σ^{-1} µ_1 + (1/2) µ_2^T Σ^{-1} µ_2 + log[P(C_1)/P(C_2)].   (3.33)

Note that the bias term b depends on the prior class probabilities
(which are often unknown in practice). Hence, different prior class
probabilities will lead to a translational shift of the hyperplane (or
straight line in the case of a two dimensional feature space).
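A brief sketch of computing this discriminant (3.32)-(3.33) from
estimated class means and a shared covariance matrix:

    import numpy as np

    def logistic_discriminant(mu1, mu2, Sigma, P1, P2):
        """Posterior P(C1|x) = sigma(w^T x + b) for equal-covariance
        Gaussian classes, Eqs. (3.32)-(3.33)."""
        Sinv = np.linalg.inv(Sigma)
        w = Sinv @ (mu1 - mu2)
        b = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(P1 / P2)
        return lambda x: 1.0 / (1.0 + np.exp(-(w @ x + b)))

    # Usage sketch: changing the priors P1, P2 only changes the bias b,
    # i.e. it translates the decision hyperplane w^T x + b = 0.
    posterior = logistic_discriminant(np.array([1.0, 0.0]),
                                      np.array([-1.0, 0.0]),
                                      np.eye(2), P1=0.5, P2=0.5)
    print(posterior(np.array([0.5, 0.0])))   # > 0.5: closer to class 1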

3.6 Density estimation


When we want to calculate the posterior from
P(C_k|x) = p(x|C_k)P(C_k)/p(x),   k = 1, ..., c,   x ∈ R^d

we should try to find ways for estimating p(x|Ck ). In the previous


subsection certain Gaussian assumptions were taken, but in general
one could try to estimate the underlying density itself. For this pur-
pose there are several possibilities: parametric, non-parametric and
semi-parametric methods.
In parametric methods the density is parameterized by means of
some unknown parameter vector θ (often denoted as p(x|θ) or p(x; θ)).
This parameterized function p(x|θ) is called the likelihood. In maxi-
mum likelihood estimation methods one assumes then that the data
are drawn independently from p(x|θ):

p(χ|θ) = Π_{n=1}^{N} p(x_n|θ) = L(θ)   (3.34)

with likelihood L(θ) of θ = [θ_1; ...; θ_M] ∈ R^M for the data
χ = {x_1, ..., x_N}. The maximum likelihood solution is given by

min_θ E = −log L(θ) = −log[ Π_{n=1}^{N} p(x_n|θ) ] = −Σ_{n=1}^{N} log p(x_n|θ).   (3.35)

For example, when parameterizing p(x|θ) as a normal density with
parameters θ = [µ; σ], one finds as maximum likelihood solution that
µ̂ = (1/N) Σ_{n=1}^{N} x_n and σ̂² = (1/N) Σ_{n=1}^{N} (x_n − µ̂)², with
E[σ̂²] = ((N−1)/N) σ².
Another method used in order to estimate θ for such a parameter-
ized density p(x|θ) is Bayesian inference. In this case a distribution
is considered on θ (rather than considering it as a point in the param-
eter space). Bayes’ Theorem applies as follows

p(χ|θ)p(θ)
p(θ|χ) = . (3.36)
p(χ)

For independent data points x_n (n = 1, ..., N) one has the likelihood
p(χ|θ) = Π_{n=1}^{N} p(x_n|θ). The normalization factor

p(χ) = ∫ p(θ′) Π_{n=1}^{N} p(x_n|θ′) dθ′

ensures ∫ p(θ|χ)dθ = 1. Fig.3.10 illustrates the process of obtaining a
sharper estimate of the posterior p(θ|χ) by combining the prior p(θ)
with the data χ.
In non-parametric methods typically a Gaussian function is placed at
each data point x_n (n = 1, ..., N):

p̃(x) = (1/N) Σ_{n=1}^{N} (2πh²)^{-d/2} exp{−‖x − x_n‖_2²/(2h²)}   (3.37)

where a careful choice of the width h should be made (Fig.3.11). The
problem in this case is, however, that all data points need to be
stored in order to represent p̃(x).
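A compact sketch of this Parzen window estimator (3.37), assuming the
stored data points are given as an array of shape (N, d):

    import numpy as np

    def parzen_density(x, data, h):
        """Kernel density estimate p~(x) of Eq. (3.37): one Gaussian of
        width h centred at each stored data point; data has shape (N, d)."""
        data = np.asarray(data)
        N, d = data.shape
        sq_dists = np.sum((data - x) ** 2, axis=1)
        kernels = np.exp(-sq_dists / (2 * h**2)) / (2 * np.pi * h**2) ** (d / 2)
        return kernels.mean()

    # Usage sketch: h too small gives a spiky estimate, h too large
    # oversmooths (cf. Fig.3.11); h is typically tuned by validation.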

Figure 3.10: Bayesian inference: starting from a prior p(θ) with a
large uncertainty on the estimate θ, the data χ are used in order to
generate a posterior p(θ|χ) which becomes more accurate.

Figure 3.11: Non-parametric density estimation: the width h should be
chosen neither too large nor too small.

3.7 Mixture models and EM algorithm


In this Section we discuss the problem of density estimation more
specifically for so-called mixture models. A mixture distribution takes
the form

p(x) = Σ_{j=1}^{M} p(x|j) P(j)   (3.38)

with

Σ_{j=1}^{M} P(j) = 1,   0 ≤ P(j) ≤ 1   (3.39)

and normalization of the component densities p(x|j) such that
∫ p(x|j)dx = 1, where x ∈ R^d and N data points are given. The mixing
parameters P(j) correspond to the prior probability that a data point
has been generated from component j of the mixture.
A network interpretation of the mixture model is given in Fig.3.12.
It is closely related to RBF networks if one takes a Gaussian mixture

Figure 3.12: Network interpretation of a mixture model: the inputs
x_1, ..., x_d feed the component densities p(x|1), ..., p(x|M), which
are combined with mixing weights P(1), ..., P(M) into p(x).

model with component densities

p(x|j) = (2πσ_j²)^{-d/2} exp{−‖x − µ_j‖_2²/(2σ_j²)}   (3.40)

and adjustable parameters P(j), µ_j, σ_j for j = 1, ..., M, which are
then to be considered as elements of the unknown parameter vector θ for
the density estimation.
In a maximum likelihood approach to determine these parameters,
the following cost function is optimized
min E = −log L = −log Π_{n=1}^{N} p(x_n) = −Σ_{n=1}^{N} log{ Σ_{j=1}^{M} p(x_n|j)P(j) }.   (3.41)
This can be minimized with a gradient based optimization method.
However, one should be able to guarantee that Σ_{j=1}^{M} P(j) = 1 and
0 ≤ P(j) ≤ 1. Instead of using a constrained optimization method, one
might re-parameterize the problem in other unknowns which guarantee
that the constraints are automatically satisfied. This can be done by
taking γ_j for j = 1, ..., M such that

P(j) = exp(γ_j) / Σ_{m=1}^{M} exp(γ_m),   (3.42)

which is the so-called softmax function. There are then no constraints
on the values γ_j, which can take any real value.

Another approach is by means of the so-called EM algorithm
(Expectation-Maximization) (Dempster, 1977). Conceptually this method
works as follows. First a reformulation of the cost function in terms
of hidden variables z_nj is done:

E = −Σ_{n=1}^{N} log{ Σ_{j=1}^{M} p(x_n|j)P(j) }
  = −Σ_{n=1}^{N} Σ_{j=1}^{M} z_nj log p(x_n|z_n, j)P(z_n)   (3.43)

in order to convert log Σ into Σ log. The EM algorithm then works as
follows:

EM algorithm for Gaussian mixture model

For k = 1, 2, ... until convergence:

1. E-step (Expectation step)

π_nj = E[z_nj | x_n]
     = P(j)(k) σ_j(k)^{-d} exp{−‖x_n − µ_j(k)‖_2²/(2σ_j²(k))} /
       Σ_{m=1}^{M} P(m)(k) σ_m(k)^{-d} exp{−‖x_n − µ_m(k)‖_2²/(2σ_m²(k))}

2. M-step (Maximization step)

P(j)(k+1) = (1/N) Σ_{n=1}^{N} π_nj
µ_j(k+1) = Σ_{n=1}^{N} π_nj x_n / Σ_{n=1}^{N} π_nj
σ_j²(k+1) = (1/d) Σ_{n=1}^{N} π_nj ‖x_n − µ_j(k+1)‖_2² / Σ_{n=1}^{N} π_nj.
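A compact NumPy sketch of these two steps for an isotropic Gaussian
mixture (a didactic version, not an optimized implementation):

    import numpy as np

    def em_gaussian_mixture(X, M, n_iter=50, seed=0):
        """EM for a mixture of M isotropic Gaussians; X has shape (N, d)."""
        N, d = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(N, M, replace=False)]      # init means at data points
        sigma2 = np.full(M, X.var())                 # init variances
        P = np.full(M, 1.0 / M)                      # init mixing weights
        for _ in range(n_iter):
            # E-step: responsibilities pi[n, j] = E[z_nj | x_n]
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            log_k = np.log(P) - 0.5 * d * np.log(sigma2) - sq / (2 * sigma2)
            pi = np.exp(log_k - log_k.max(axis=1, keepdims=True))
            pi /= pi.sum(axis=1, keepdims=True)
            # M-step: re-estimate P(j), mu_j and sigma_j^2
            Nj = pi.sum(axis=0)
            P = Nj / N
            mu = (pi.T @ X) / Nj[:, None]
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = (pi * sq).sum(axis=0) / (d * Nj)
        return P, mu, sigma2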

3.8 Preprocessing and dimensionality reduction
3.8.1 Preprocessing
It is always desirable to scale the input/output data to a suitable
format before one starts training neural networks or applying any other

method. Input normalization is done by componentwise linear scaling


or by whitening.
Given training data x^(n) ∈ R^d for n = 1, ..., N one computes

x̄_i = (1/N) Σ_{n=1}^{N} x_i^(n)
σ̂_i² = (1/(N−1)) Σ_{n=1}^{N} (x_i^(n) − x̄_i)²   (3.44)

where i denotes the i-th component. The rescaled variables

x̃_i^(n) = (x_i^(n) − x̄_i)/σ̂_i   (3.45)
have then zero mean and unit standard deviation.
In the whitening method (Fig.3.13) one computes
x̄ = (1/N) Σ_{n=1}^{N} x^(n)
Σ̂ = (1/(N−1)) Σ_{n=1}^{N} (x^(n) − x̄)(x^(n) − x̄)^T   (3.46)

and from the eigenvalue decomposition of the covariance matrix
Σ̂ u_j = λ_j u_j one finds the transformed input variables

x̃^(n) = Λ^{-1/2} U^T (x^(n) − x̄)   (3.47)

where U = [u_1 ... u_d], Λ = diag{λ_1, ..., λ_d}.


Often the variables are not real-valued (e.g. categorical variables).
It is then better to scale these component values to a maximal value
of 1 by a linear scaling.
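Both preprocessing steps in a brief NumPy sketch:

    import numpy as np

    def standardize(X):
        """Componentwise scaling (3.44)-(3.45) to zero mean, unit std."""
        mean = X.mean(axis=0)
        std = X.std(axis=0, ddof=1)       # ddof=1 gives the 1/(N-1) estimator
        return (X - mean) / std

    def whiten(X):
        """Whitening (3.46)-(3.47): decorrelate and scale to unit variance."""
        mean = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)   # sample covariance matrix
        lam, U = np.linalg.eigh(Sigma)    # Sigma u_j = lambda_j u_j
        return (X - mean) @ U / np.sqrt(lam)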

3.8.2 PCA dimensionality reduction


Often dimensionality reduction is done on the feature space of input
variables. For example when classifying microarray data in the area of
bioinformatics one has a huge dimensional input space (e.g. dimension
Neural Networks and Classification 75

Figure 3.13: Illustration of scaling the data by whitening (original
versus whitened distribution) after computing the eigenvalues and
eigenvectors of the covariance matrix.

7000) that corresponds to measured expression levels of genes, but on
the other hand only a small number of given data points (expensive
experiments). Suppose one would take an MLP classifier with 5 hidden
units and 1 output (binary classification, e.g. discriminating between
two types of cancer): one then has more than 5 × 7000 = 35000
parameters to estimate, but only on the order of 50 data points
available. Therefore, if one wants to apply MLPs in this case a serious
reduction of the input space dimension is needed first.
A well-known and frequently used technique for dimensionality re-
duction is linear PCA analysis. Suppose one wants to map vectors
x ∈ Rd into z ∈ RM with M < d. One proceeds then by computing
the covariance matrix:

Σ̂ = (1/(N−1)) Σ_n (x^(n) − x̄)(x^(n) − x̄)^T   (3.48)

and compute the eigenvalue decomposition:

Σ̂ uj = λj uj . (3.49)

By selecting the M largest eigenvalues and the corresponding eigen-


vectors, one obtains the transformed variables

z_(i) = u_i^T x.   (3.50)



Figure 3.14: (Top) Schematic interpretation of microarray data in
bioinformatics as a G × N matrix, with G the number of genes and N the
number of experiments: the input space is huge dimensional and consists
of the activity of several thousands of genes, and the columns are the
data experiments; (Bottom) gene expression levels from a gene chip.

         Negative   Positive
True     TN         TP
False    FN         FP

Figure 3.15: Confusion matrix.

One has to note, however, that these transformed variables are no
longer real physical variables. The error resulting from the
dimensionality reduction is characterized as follows:

Error = Σ_{i=M+1}^{d} λ_i,   (3.51)

meaning that the error made can be expressed in terms of the smallest
eigenvalues, i.e. those of the neglected components.
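A short sketch of this PCA reduction:

    import numpy as np

    def pca_reduce(X, M):
        """Project data X (shape (N, d)) onto the M largest principal
        components; also return the truncation error Sum_{i>M} lambda_i."""
        Xc = X - X.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False)   # covariance matrix (3.48)
        lam, U = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
        order = np.argsort(lam)[::-1]      # sort descending
        lam, U = lam[order], U[:, order]
        Z = Xc @ U[:, :M]                  # z_(i) = u_i^T x, Eq. (3.50)
        return Z, lam[M:].sum()            # reduced data, error (3.51)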

3.9 ROC curves


For binary classification problems one can consider the following so-
called confusion matrix shown in Fig.3.15 with
TP  number of correctly classified positive cases
TN  number of correctly classified negative cases
FP  number of cases wrongly classified as positive
FN  number of cases wrongly classified as negative
and where negative/positive (class 1/class 2) means e.g.
benign/malignant tumor in a biomedical application or non-fraud/fraud
in a fraud detection datamining problem. True/false means
correctly/wrongly classified data, respectively.
It is convenient then to define
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN) (3.52)
False positive rate = 1 − Specificity = FP/(FP + TN).

Figure 3.16: ROC curve (sensitivity versus false positive rate) for
three classifiers A, B, C. The classifier C has the best performance:
it has the largest area under the ROC curve. An ROC curve is obtained
by moving the threshold of a classifier; the points θ_1, θ_2, θ_3
indicate different operating points.

The receiver-operating characteristic (ROC)¹ curve then shows the
sensitivity with respect to the false positive rate (Fig.3.16). The
larger the area under the ROC curve, the better the classifier. These
performances should be tested not only on the training data but also on
the test data. In Fig.3.16 classifier C is better than B, and the
classifier with curve A has no discriminatory power at all. The points
θ_1, θ_2, θ_3 illustrated on ROC curve B in Fig.3.16 are examples of
different operating points on the ROC curve. These correspond e.g. to
varying decision threshold values of an output unit (Fig.3.17). Note
that in the case of logistic discrimination this corresponds in fact to
varying the prior class probabilities, because the bias term explicitly
depends on these probabilities.

¹This ROC curve dates back to the second world war, where it was used
on radar signals. After a paper by Swets was published in Science it
became popular in the medical world.
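A minimal sketch of computing an ROC curve and the area under it by
sweeping the decision threshold, assuming continuous classifier outputs
scores and labels y in {0, 1} (both classes present):

    import numpy as np

    def roc_curve(scores, y):
        """Sweep the threshold over all scores and collect the
        (false positive rate, sensitivity) operating points."""
        thresholds = np.sort(np.unique(scores))[::-1]
        fpr, tpr = [], []
        P, N = np.sum(y == 1), np.sum(y == 0)
        for th in thresholds:
            pred = scores >= th
            tpr.append(np.sum(pred & (y == 1)) / P)   # sensitivity
            fpr.append(np.sum(pred & (y == 0)) / N)   # 1 - specificity
        return np.array([0.0] + fpr + [1.0]), np.array([0.0] + tpr + [1.0])

    def auc(fpr, tpr):
        """Area under the ROC curve by trapezoidal integration."""
        return np.trapz(tpr, fpr)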

Figure 3.17: Varying the threshold (between 0 and 1, e.g. around 0.5)
in a classifier for generating the ROC curve. For the case of logistic
discrimination this can be explicitly related to varying the prior
class probabilities.
Chapter 4

Learning and Generalization

In this Chapter we discuss aspects of learning and generalization and


tackle important questions such as how we can design networks so as to
obtain good performance on unseen data. We present interpretations
of network outputs, bias-variance trade-off, regularization and weight
decay, effective number of parameters, early stopping, pruning, com-
mittee networks, cross-validation, complexity criteria, Bayesian learn-
ing and automatic relevance determination. This chapter is mainly
based on [1, 14, 45, 46].

4.1 Interpretation of network outputs


Consider a training set {x_n, t_n}_{n=1}^{N}, where x_n ∈ R^m denotes
the input
data and tn ∈ R the target (output) data. The (static) model is
denoted as y(xn ; w) with output y ∈ R and parameters w. When
deriving models in general the goal is not to memorize data but rather
to model the underlying generator of the data, characterized by p(x, t)
which is the joint probability density of inputs x and targets t. One
has
p(x, t) = p(t|x)p(x) (4.1)

with p(t|x) the probability density of t given a particular value of x


and p(x) the unconditional density of x.
According to [1] let us consider now the cost E in the limit N → ∞,


i.e. for an infinite data set size:


E = lim_{N→∞} (1/(2N)) Σ_{n=1}^{N} {y(x_n; w) − t_n}²
  = (1/2) ∫∫ {y(x; w) − t}² p(t, x) dt dx   (4.2)
  = (1/2) ∫∫ {y(x; w) − t}² p(t|x) p(x) dt dx.

Defining the conditional averages

⟨t|x⟩ = ∫ t p(t|x)dt
⟨t²|x⟩ = ∫ t² p(t|x)dt   (4.3)
one has

{y − t}² = {y − ⟨t|x⟩ + ⟨t|x⟩ − t}²
        = {y − ⟨t|x⟩}² + 2{y − ⟨t|x⟩}{⟨t|x⟩ − t} + {⟨t|x⟩ − t}².   (4.4)

As a result one obtains

E = (1/2) ∫ {y(x; w) − ⟨t|x⟩}² p(x)dx + (1/2) ∫ {⟨t²|x⟩ − ⟨t|x⟩²} p(x)dx.   (4.5)
The first term means that at the local minimum w ∗ of the error func-
tion, we have
y(x; w*) = ⟨t|x⟩,   (4.6)
meaning that the output approximates the conditional average of the
target data. The second term represents the intrinsic noise on the
data and sets a lower limit on the achievable error.

4.2 Bias and variance


In practice we often have only one specific and finite data set D.
In order to eliminate the dependency on a specific data set D one
considers
E_D[{y(x) − ⟨t|x⟩}²]

where ED denotes the ensemble average. The question is then how


close the estimated mapping is to the true one. In order to answer
this question we write

{y(x) − ⟨t|x⟩}² = {y(x) − E_D[y(x)] + E_D[y(x)] − ⟨t|x⟩}²
              = {y(x) − E_D[y(x)]}² + {E_D[y(x)] − ⟨t|x⟩}²
                + 2{y(x) − E_D[y(x)]}{E_D[y(x)] − ⟨t|x⟩}.   (4.7)

By taking the expectation over the ensemble of the data sets one gets

E_D[{y(x) − ⟨t|x⟩}²] = {E_D[y(x)] − ⟨t|x⟩}² + E_D[{y(x) − E_D[y(x)]}²].   (4.8)
The bias and variance then follow from these two terms. The first term
gives

(bias)² = (1/2) ∫ {E_D[y(x)] − ⟨t|x⟩}² p(x)dx.   (4.9)

The second term gives

variance = (1/2) ∫ E_D[{y(x) − E_D[y(x)]}²] p(x)dx.   (4.10)
A first illustration of the bias-variance trade-off is given in terms of
two extreme cases shown in Fig.4.1. Suppose

tn = h(xn ) + ǫn (4.11)

with true function h(x) and y(x) an estimate of h(x). Consider the
following two extremes:
1. Fix y(x) = g(x) independent of any data set. Then:

ED [y(x)] = g(x) = y(x)

which gives zero variance but a large bias.

2. Consider an exact interpolant of the data. Then

E_D[y(x)] = E_D[h(x) + ǫ] = h(x) = ⟨t|x⟩

which gives zero bias, but large variance according to

ED [{y(x) − ED [y(x)]}2 ] = ED [{y(x) − h(x)}2 ] = ED [ǫ2 ].



Figure 4.1: Illustration of the bias-variance trade-off by means of two
extreme cases: (Top) fixing a function g(x) independent of the data:
zero variance but large bias; (Bottom) an exact interpolant of the data
under all circumstances: zero bias but large variance. The function
h(x) is the true function.

In order to make the theoretical arguments more specific, consider


e.g. a situation where one generates 100 data sets by sampling a true
underlying function h(x) and adding noise. Note that h(x) is known in
this experiment, but in a real situation it would of course be unknown.
Estimate the mappings yi(x) for i = 1, 2, ..., 100, e.g. 100 MLPs where
each MLP results from a generated data set. One gets

ȳ(x) = (1/100) Σ_{i=1}^{100} y_i(x)
(Bias)² = Σ_n {ȳ(x_n) − h(x_n)}²
Variance = Σ_n (1/100) Σ_{i=1}^{100} {y_i(x_n) − ȳ(x_n)}².
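This experiment in a brief sketch, using the noisy sinusoid of the
example below and a polynomial fit as a stand-in for the MLPs:

    import numpy as np

    h = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)   # true function
    x = np.linspace(0.1, 1.0, 10)
    rng = np.random.default_rng(0)

    # Estimate y_i(x) on 100 noisy data sets (degree-3 polynomial fits)
    Y = np.array([np.polyval(np.polyfit(x, h(x) + 0.05 * rng.standard_normal(10), 3), x)
                  for _ in range(100)])

    y_bar = Y.mean(axis=0)                            # ensemble average
    bias2 = np.sum((y_bar - h(x)) ** 2)
    variance = np.sum(((Y - y_bar) ** 2).mean(axis=0))
    print(bias2, variance)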

Let us consider now a second example to further illustrate the


bias-variance phenomenon. Assume that training data are generated
from h(x) = 0.5 + 0.4 sin(2πx) with training data x generated in the
interval [0.1, 1] with steps of 0.1. Test data are taken in [0.01, 1] with
steps of 0.01 (except the points that belong to the training set). Then
Gaussian noise with standard deviation 0.05 is added to the data.
From the training data polynomials of degree d with d ∈ {1, 2, ..., 10}
are estimated. For example, for the polynomial model of degree 3

y = a1 x + a2 x2 + a3 x3 + b (4.12)

an overdetermined set of linear equations is constructed from the given
data {x_n, y_n}_{n=1}^{10}:

[ x_1   x_1²   x_1³   1 ]   [ a_1 ]   [ y_1  ]
[ x_2   x_2²   x_2³   1 ]   [ a_2 ]   [ y_2  ]
[ ...                   ] · [ a_3 ] = [ ...  ]   (4.13)
[ x_10  x_10²  x_10³  1 ]   [ b   ]   [ y_10 ]

This overdetermined system is of the form

Aθ = B, e = Aθ − B (4.14)

with A ∈ Rp×q (p > q), B ∈ Rp and unknown parameter vector θ ∈ Rq .


By taking a cost function in the linear least squares sense

min_θ J_LS(θ) = (1/2) e^T e = (1/2)(Aθ − B)^T (Aθ − B),   (4.15)

the condition for optimality

∂J_LS/∂θ = A^T Aθ − A^T B = 0   (4.16)
gives the solution

θLS = (AT A)−1 AT B = A† B (4.17)

where A† denotes the pseudo inverse matrix. The results are shown
on Fig.4.2 and Fig.4.3 for the training and test set and for different
degrees of the polynomials. One can observe that for the higher order
polynomial (order 7) the solution starts oscillating. Let us now modify
the least squares cost function by an additional term which aims at
keeping the norm of the solution vector small. This technique is called
regularization or ridge regression in the context of linear systems.
One solves

min_θ J_ridge(θ) = J_LS(θ) + (λ/2) ‖θ‖_2²,   λ > 0.   (4.18)
The condition for optimality is given by

∂J_ridge/∂θ = A^T Aθ + λθ − A^T B = 0   (4.19)
with solution
θridge = (AT A + λI)−1 AT B. (4.20)
This technique is useful when AT A is ill conditioned. The results of
ridge regression for the order 7 polynomial model is shown on Fig.4.4
for different values of the regularization constant λ. Fig.4.5 concep-
tually shows the bias-variance trade-off in terms of the regularization
constant λ. A large value of λ decreases the variance but leads to a
larger bias. The value of λ is chosen as a trade-off solution by
minimizing the sum of the variance and the squared bias contributions.
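The toy experiment in a brief sketch (polynomial degree and λ as in the
figures):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(0.1, 1.01, 0.1)                  # training inputs
    y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(len(x))

    degree = 7
    A = np.column_stack([x**k for k in range(1, degree + 1)] + [np.ones_like(x)])

    theta_ls = np.linalg.pinv(A) @ y               # least squares (4.17)

    lam = 1e-6                                     # ridge regression (4.20)
    theta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)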

4.3 Regularization and early stopping


In general there are several ways to handle the problem of bias-variance
trade-off. First of all it is better to have more data because one can af-
ford more complex models in that case (which can reduce the bias) and

Figure 4.2: Comparison of the polynomial models on the toy problem of
estimating a noisy sinusoidal function: training set (-) and test set
(- -) performance (MSE) is shown as a function of the order of the
polynomial.

the model becomes more constrained by the data (which reduces the
variance). However, in daily practice, the data sets are given and one
has to do the best possible given the available amount of information.
In order to obtain models with good generalization ability we discuss
now methods of regularization, early stopping and cross-validation.

4.3.1 Regularization
For a given data set and model one often applies regularization
Ẽ(w) = E(w) + νΩ(w) (4.21)
to the original cost function E
E(w) = (1/N) Σ_{n=1}^{N} [t_n − y(x_n; w)]²,   (4.22)

with ν a positive real regularization constant. Usually a weight decay
term is taken (similar to ridge regression)

Ω(w) = (1/2) Σ_i w_i²   (4.23)

Figure 4.3: Estimation of a noisy sinusoidal function: polynomial
models of order 1, 3 and 7.

Figure 4.4: Estimation of a noisy sinusoidal function: ridge regression
applied to the polynomial model of order 7: (Top) λ = 10^{-6}, which
avoids the oscillation; (Bottom) λ = 0.1 results in too much
regularization.

Figure 4.5: Bias-variance trade-off when adding a regularization term
to the cost function: (bias)², variance and their sum as a function of
log λ. The regularization constant λ is chosen such that the sum of the
bias and variance contributions is minimized.

where the vector w contains all interconnection weights and bias terms
of the neural network model. This additional term aims at keeping
the interconnection weight values small.
Let us now analyse the influence of this weight decay term. Consider
the case of a quadratic cost function, which can be related to a Taylor
expansion at a local point in the weight space. We have

E(w) = E_0 + b^T w + (1/2) w^T H w   (4.24)

with E_0 a constant, and the regularized version

Ẽ(w̃) = E(w̃) + (ν/2) w̃^T w̃.   (4.25)
Setting the gradients of these cost functions to zero gives

∂E/∂w = b + Hw = 0
∂Ẽ/∂w̃ = b + Hw̃ + νw̃ = 0.   (4.26)

Consider the eigenvalues and eigenvectors of H:

Huj = λj uj (4.27)

and expand w and w̃ as follows:

w = Σ_j w_j u_j,   w̃ = Σ_j w̃_j u_j.   (4.28)

One can then write Hw − Hw̃ − νw̃ = 0, or
Σ_j Hw_j u_j − Σ_j Hw̃_j u_j − ν Σ_j w̃_j u_j = 0, which gives
Σ_j (λ_j w_j − λ_j w̃_j − ν w̃_j) u_j = 0. Finally this results in

w̃_j = (λ_j/(λ_j + ν)) w_j,
meaning that if λj ≫ ν then w̃j ≃ wj and if λj ≪ ν the components
|w̃j | ≪ |wj | are suppressed. Hence thanks to the regularization mech-
anism one can implicitly work with less parameters than the number
of unknown interconnection weights. The so-called effective number
of parameters

γ = Σ_j λ_j/(λ_j + ν)   (4.29)

is then characterized by the number of eigenvalues λ_j that are larger
than the regularization constant ν. In general a large regularization
constant ν will lead to a smaller model structure, giving smaller
variance but larger bias. A small value for ν will keep the model
structure larger, which gives larger variance but smaller bias.
For MLPs one may in principle also take different penalty factors for
each layer, e.g.

Ω(w) = (ν_1/2) Σ_{w∈W_1} w^T w + (ν_2/2) Σ_{w∈W_2} w^T w   (4.30)

where W_1, W_2 denote the sets of weights for layer 1 and layer 2,
although this is not often done in practice.

4.3.2 Early stopping and validation set


Instead of applying regularization to the cost function, one might
minimize the original cost function E(w) rather than the regularized
Ẽ(w), but stop the training process early when the minimal error on a
separate validation set is reached.
sets: a training set for estimation of the model, a validation set to
stop the training process earlier and a test set which contains data
that are not used to derive the model (Fig.4.6).

Figure 4.6: Training, validation and test set (training and validation
error versus iteration step, and the partitioning of the data into the
three sets), where the validation set is used for early stopping of the
training process. The designer is responsible for making a good choice!

Somewhat surprisingly, it is possible to show that early stopping is


closely related to regularization. Conceptually this can be understood
as follows according to [1]. A quadratic approximation of the cost
function at the minimum w ∗ gives
E = E_0 + (1/2)(w − w*)^T H (w − w*)   (4.31)
with constant E0 and positive definite Hessian H. Consider a simple
gradient descent
w (τ ) = w (τ −1) − η∇E (4.32)
with iteration step τ, learning rate η and w^(0) = 0 for simplicity.
One can then show that

w_j^(τ) = {1 − (1 − ηλ_j)^τ} w_j*   (4.33)

where w_j = w^T u_j, with u_j, λ_j eigenvectors and eigenvalues of H
respectively. As τ → ∞ one has w^(τ) → w*, provided |1 − ηλ_j| < 1.
Assume that the training is stopped after τ steps; one then obtains

w_j^(τ) ≃ w_j*   when λ_j ≫ 1/(ητ)
|w_j^(τ)| ≪ |w_j*|   when λ_j ≪ 1/(ητ).   (4.34)

Hence 1/(ητ) plays a similar role as the regularization parameter ν.

4.3.3 Cross-validation
Working with a specific validation set has two main disadvantages.
First the results might be quite sensitive with respect to the specific
data points belonging to that validation set. Secondly when working
with a training, validation and test set a part of the training data can
no longer be used for training as it belongs then to the validation set.
A good procedure is then to apply cross-validation (Wahba, 1975)
(Fig.4.7). One divides the training set into S segments and trains in
each run on S − 1 segments. The error summed over the segments that
were left out in the S runs then serves as the validation set
performance. A typical choice (which is both compu-
tationally attractive and of good statistical quality) is S = 10 (called
10-fold cross validation). In the extreme case one can take S = N
meaning that one has N runs with N − 1 data points (called leave-
one-out cross-validation). This is only recommended for small data
sets and certainly not for datamining applications with millions of
data points.
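A brief sketch of S-fold cross-validation, assuming hypothetical
train(X, y) and error(model, X, y) functions supplied by the user:

    import numpy as np

    def cross_validate(X, y, train, error, S=10, seed=0):
        """S-fold cross-validation: train on S-1 segments, validate on
        the held-out one, and sum the errors over all S runs."""
        N = len(y)
        idx = np.random.default_rng(seed).permutation(N)
        folds = np.array_split(idx, S)
        total_error = 0.0
        for s in range(S):
            val = folds[s]                               # omitted segment
            trn = np.concatenate([folds[r] for r in range(S) if r != s])
            model = train(X[trn], y[trn])
            total_error += error(model, X[val], y[val])
        return total_error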

Figure 4.7: In the cross-validation method S runs are done, where in
each run a different one of the segments D_1, ..., D_S is omitted for
training, and the validation error is finally checked as the sum of the
error costs on the sets that were omitted.

4.4 Pruning
In order to improve the generalization performance of the trained mod-
els one can remove interconnection weights that are irrelevant. This
procedure is called pruning. We discuss here methods of optimal brain
damage, optimal brain surgeon and weight elimination.
In Optimal Brain Damage (Le Cun, 1990) one considers the
error cost function change due to small changes in the interconnection
weights:
δE ≃ Σ_i (∂E/∂wi) δwi + (1/2) Σ_i Σ_j Hij δwi δwj + O(δw³)    (4.35)

where Hij = ∂²E/∂wi∂wj is the Hessian. After convergence one assumes that the gradient term vanishes and that the off-diagonal Hessian terms can be neglected, so that

δE ≃ (1/2) Σ_i Hii δwi²    (4.36)

and one measures the relative importance of the interconnection weights by the quantities Hii wi²/2, which are also called saliency values. The algorithm looks as follows:

Pruning algorithm (optimal brain damage):

1. Choose a relatively large initial network architecture.

2. Train the network in the usual way until some stopping criterion
is satisfied.

3. Compute the saliencies Hii wi²/2.

4. Sort the weights by saliency and delete the low-saliency weights.

5. Go to 2 and repeat until some overall stopping criterion is reached.

Optimal brain damage has been applied to the recognition of handwritten zip codes, where networks with 10,000 interconnection weights have been pruned by a factor of 4.
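A minimal sketch of one optimal brain damage pruning step; it assumes that the diagonal Hessian elements Hii have already been estimated (random placeholders are used here for illustration only).

```python
import numpy as np

# One optimal brain damage step: rank weights by saliency, prune the lowest.
rng = np.random.default_rng(2)
w = rng.normal(size=100)                    # trained interconnection weights
H_diag = rng.uniform(0.1, 2.0, size=100)    # Hii, e.g. from a Gauss-Newton approx.

saliency = 0.5 * H_diag * w**2              # Hii wi^2 / 2, cf. eq. (4.36)
n_prune = 25                                # delete the 25 lowest-saliency weights
prune_idx = np.argsort(saliency)[:n_prune]
w[prune_idx] = 0.0                          # then retrain and repeat (steps 2-5)
```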

Another pruning method is Optimal Brain Surgeon (Hassibi and Stork, 1993). This method can be understood as follows. By neglecting higher order terms one has after convergence

δE = (1/2) δwᵀ H δw.    (4.37)

Setting a weight wi to zero corresponds to δwi = −wi or

eiᵀ δw + wi = 0    (4.38)

where ei is a unit vector. One then considers the optimization problem

min_{δw} δE = (1/2) δwᵀ H δw   s.t.   eiᵀ δw + wi = 0    (4.39)

with Lagrangian

L(δw, λ) = (1/2) δwᵀ H δw − λ(eiᵀ δw + wi)    (4.40)

and conditions for optimality:

∂L/∂(δw) = Hδw − λei = 0  →  δw = λH⁻¹ei
∂L/∂λ = eiᵀ δw + wi = 0  →  λ eiᵀH⁻¹ei = λ[H⁻¹]ii = −wi.    (4.41)

This results into

δw = −(wi/[H⁻¹]ii) H⁻¹ei    (4.42)

and

δEi = (1/2) wi²/[H⁻¹]ii.    (4.43)
The pruning algorithm looks then as follows:

Pruning algorithm (optimal brain surgeon):

1. Train a relatively large network to a minimum of the error func-


tion.

2. Evaluate the inverse Hessian H −1 .



3. Evaluate δEi for each value of i and select the value of i which
gives the smallest increase in error.

4. Update all the weights according to δw = −(wi/[H⁻¹]ii) H⁻¹ei.

5. Go to 3 and repeat until some stopping criterion is reached.

The performance of this algorithm is generally better than that of optimal brain damage, since the full inverse Hessian is used instead of only its diagonal.
In order to eliminate weights one may also take a different regular-
ization term instead of weight decay. The following technique is called
Weight Elimination (Weigend, 1990):

Ẽ = E + ν Σ_i (wi/c)² / (1 + (wi/c)²).    (4.44)

This method is more likely to eliminate weights (i.e. set weights to zero) than the weight decay method. A drawback is the choice of the additional tuning parameter c.
In recent years, related to sparse linear modelling and compressed
sensing, it is popular to use L1 regularization to achieve sparse solu-
tions. This is done by solving
Ẽ = E + ν Σ_i |wi|    (4.45)

for the case of a linear model, with ν > 0 the regularization constant. Hence instead of taking a regularization of the form Σ_i wi² one uses Σ_i |wi|, which leads to zero elements in the solution vector w [2, 9, 10].
Another scheme is to use so-called elastic net regularization [10] which corresponds to

Ẽ = E + ν1 Σ_i |wi| + ν2 Σ_i wi²    (4.46)

with regularization constants ν1, ν2 > 0. The use of L1 regularization leads to a non-smooth optimization problem. However, in the case of linear models this can be solved by convex optimization [2].
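A minimal sketch of L1-regularized linear regression solved by iterative soft-thresholding, one standard convex optimization scheme for (4.45); the data are illustrative assumptions.

```python
import numpy as np

# Iterative soft-thresholding (ISTA) for Etilde = 0.5||Xw - y||^2 + nu ||w||_1.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
w_true = rng.normal(size=20) * (rng.random(20) < 0.3)   # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=50)

nu, L = 0.5, np.linalg.norm(X, 2) ** 2   # step size 1/L, L = Lipschitz constant
w = np.zeros(20)
for _ in range(500):
    grad = X.T @ (X @ w - y)
    z = w - grad / L
    w = np.sign(z) * np.maximum(np.abs(z) - nu / L, 0.0)  # soft-thresholding
print(np.sum(w == 0), "zero elements in w")   # L1 yields exact zeros
```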

4.5 Committee networks and combining models
A common approach is to train several models and select the best
individual model. However, one might improve the results in view of
the bias-variance trade-off by forming a committee of networks and
combining the models.
Let us consider L trained networks yi(x), i = 1, ..., L. These could be either L trained MLPs or L totally different kinds of static models. Assume the true regression function h(x) is such that

yi(x) = h(x) + ǫi (x) i = 1, ..., L (4.47)

where ǫi (x) is the error function related to the i-th network. The
average sum-of-squares error for the individual model yi (x) is then

Ei = E[{yi(x) − h(x)}²] = E[ǫi²].    (4.48)

In 1993 Perrone showed that the performance of the committee net-


work can be better than the performance of the best single network.
One can take a simple averaged committee network or a weighted
average committee network.
A simple Average Committee Network is given by

yCOM(x) = (1/L) Σ_{i=1}^{L} yi(x).    (4.49)

The error of the committee network is

ECOM = E[ ((1/L) Σ_{i=1}^{L} yi(x) − h(x))² ] = E[ ((1/L) Σ_{i=1}^{L} ǫi)² ].    (4.50)

One can show (by the Cauchy-Schwarz inequality) that

(Σ_{i=1}^{L} ǫi)² ≤ L Σ_{i=1}^{L} ǫi²   ⇒   ECOM ≤ EAV    (4.51)

where

EAV = (1/L) Σ_{i=1}^{L} Ei = (1/L) Σ_{i=1}^{L} E[ǫi²].    (4.52)
Hence the results improve by this averaging network.
One can further improve the committee method by taking a Weighted Average Committee Network (Fig. 4.8):

yCOM(x) = Σ_{i=1}^{L} αi yi(x) = h(x) + Σ_{i=1}^{L} αi ǫi(x)    (4.53)

where Σ_{i=1}^{L} αi = 1. One considers the correlation matrix

Cij = E[ǫi(x) ǫj(x)]    (4.54)

where in practice one works e.g. with a finite-sample approximation on the training data

Cij = (1/N) Σ_{n=1}^{N} [yi(xn) − tn][yj(xn) − tn].    (4.55)

The committee error equals

ECOM = E[{yCOM(x) − h(x)}²] = E[ (Σ_{i=1}^{L} αi ǫi)(Σ_{j=1}^{L} αj ǫj) ] = Σ_{i,j=1}^{L} αi αj Cij = αᵀCα.    (4.56)

An optimal choice of α is made as follows:

min_α (1/2) αᵀCα   s.t.   Σ_{i=1}^{L} αi = 1.    (4.57)

From the Lagrangian

L(α, λ) = (1/2) αᵀCα − λ(Σ_{i=1}^{L} αi − 1)

one obtains the following conditions for optimality:

∂L/∂α = Cα − λ~1 = 0
∂L/∂λ = ~1ᵀα − 1 = 0  →  λ = 1/(~1ᵀC⁻¹~1)    (4.58)
[Figure: L networks y1(x), ..., yL(x) with combination weights α1, ..., αL summed into yCOM(x)]

Figure 4.8: Combining models using a committee network.

with optimal solution

α = C⁻¹~1 / (~1ᵀC⁻¹~1)    (4.59)

and corresponding committee error

ECOM = 1/(~1ᵀC⁻¹~1)    (4.60)

where ~1 = [1; 1; ...; 1]. In case of an ill-conditioned matrix C one can


apply additional regularization to the C matrix or also impose αi ≥ 0
to avoid large negative and positive weights. Other popular methods
which are related to committees are bagging and boosting1 [31, 35].

¹ At http://www.boosting.org/ much material such as tutorials, papers and software is available.
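A minimal sketch of the weighted average committee (4.55)-(4.60); the residuals of the L models are random placeholders, and a small ridge term is added against ill-conditioning of C, as suggested above.

```python
import numpy as np

# Optimal committee weights alpha = C^{-1} 1 / (1^T C^{-1} 1).
rng = np.random.default_rng(4)
L_models, N = 5, 200
eps = rng.normal(size=(L_models, N)) * rng.uniform(0.5, 2.0, size=(L_models, 1))

C = eps @ eps.T / N                  # finite-sample Cij, cf. eq. (4.55)
C += 1e-8 * np.eye(L_models)         # regularization for ill-conditioned C
ones = np.ones(L_models)
alpha = np.linalg.solve(C, ones)
alpha /= ones @ alpha                # normalize so that sum_i alpha_i = 1
E_com = 1.0 / (ones @ np.linalg.solve(C, ones))   # committee error (4.60)
print(alpha, E_com)
```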

4.6 Complexity criteria


So far we have stressed that it is dangerous to train solely on the basis of a training set without looking at the performance on independent sets; otherwise an overfitted solution will be the result.
In fact this is also the message of complexity criteria, which state that one should not only try to minimize training errors but also keep the model complexity as low as possible. This is basically Occam's razor². For nonlinear models the following complexity criterion holds (Moody, 1992)

GPE = Training Error + (γ/N) σ²

where GPE is the generalized prediction error and γ = Σ_i λi/(λi + ν) is the effective number of parameters. λi denote the eigenvalues of the Hessian of the unregularized cost function and ν the regularization constant. σ² is the variance of the noise on the data and N the number of training data. Eigenvalues λi ≪ ν do not contribute to the sum.
This criterion is an extension of the well-known Akaike information criterion, which states the following for the prediction error (PE):

PE = Training Error + (W/N) σ²

where W is the number of adjustable parameters of the model. This Akaike criterion is only valid for linear models.

4.7 Bayesian learning


An important class of methods for reliable training of neural networks is based on Bayesian learning; these methods were proposed by MacKay [45, 46].

4.7.1 Bayes theorem and model comparison


Recall that Bayes’ Theorem states that
P(B|A,H) = P(A|B,H) P(B|H) / P(A|H)    (4.61)
² William Occam (1280-1349) was a monk who claimed that "No more things should be presumed to exist than are absolutely necessary" (Entia non sunt multiplicanda praeter necessitatem).

for events A, B and a given model assumption H. For a model H that


is parameterized by the parameter vector w and data D one has

P(w|D,H) = P(D|w,H) P(w|H) / P(D|H)    (4.62)

meaning that

Posterior = (Likelihood × Prior) / Evidence.
The Bayes rule can also be used for model comparison purposes. Con-
sider two alternative models H1 , H2 and data D

P (D|H1)P (H1 )
P (H1 |D) =
P (D)
(4.63)
P (D|H2)P (H2 )
P (H2 |D) = .
P (D)

One obtains

P(H1|D)/P(H2|D) = (P(H1)/P(H2)) · (P(D|H1)/P(D|H2))    (4.64)

which means in fact that Bayes' Theorem automatically embodies Occam's razor, as illustrated in Fig. 4.9. Indeed, suppose that equal prior probabilities P(H1) = P(H2) hold. Assume that the model H1 makes a limited range of predictions, characterized by the evidence P(D|H1). The more powerful model H2 with more free parameters is then able to predict a larger variety of data sets. If the data fall in region C1 in Fig. 4.9, the simpler model H1 becomes more probable (larger P(H1|D)) according to Bayes' rule.

4.7.2 Probabilistic interpretations of models


Let us now investigate probabilistic interpretations for regression and
classification within the Bayesian learning framework.
Consider a regression problem with N given training data D = {x^(n), t^(n)}_{n=1}^{N}. Assume the model is parameterized by w and the objective function is

min_w M(w) = β ED(w) + α EW(w)    (4.65)
[Figure: evidence P(D|H1) and P(D|H2) as functions of the data set D, with region C1 where P(D|H1) is larger]

Figure 4.9: Illustration of model comparison using Bayes' rule which embodies Occam's razor according to MacKay.

with

ED(w) = (1/2) Σ_n Σ_i [ti^(n) − yi(x^(n); w)]²,   EW(w) = (1/2) Σ_i wi².    (4.66)

An important interpretation is that one can relate M(w), ED(w) and EW(w) to the log posterior, log likelihood and log prior respectively:

P(w|D,α,β,H) = P(D|w,α,β,H) P(w|α,β,H) / P(D|α,β,H) = (1/ZM) exp(−M(w))    (4.67)

with

P(D|w,α,β,H) = (1/ZD(β)) exp(−β ED),   P(w|α,β,H) = (1/ZW(α)) exp(−α EW)    (4.68)

where ZM, ZW, ZD are normalization factors. Gaussian noise on the targets is assumed here.
For binary classification networks one has targets t(n) ∈ {0, 1}.
One can consider the neural network output y(x; w) as a probability

P(t = 1|x, w), with an output neuron having a logistic activation function (with output in the interval [0,1]). One takes the cost function

M(w) = −G(w) + α EW(w)    (4.69)

with the cross-entropy

G(w) = Σ_n [ t^(n) log y(x^(n); w) + (1 − t^(n)) log(1 − y(x^(n); w)) ]    (4.70)

without a β factor, in contrast with the regression problem.


For multiclass classification networks one can take as many
outputs as classes and use a softmax function in order to ensure that
the l outputs can be considered as probabilities

yi = P (ti = 1|x, w). (4.71)

This is realized by taking

yi = exp(ai) / Σ_{j=1}^{l} exp(aj)    (4.72)

where ai is the activation of output unit i, such that 0 ≤ yi ≤ 1 and Σ_i yi = 1. For the multiclass case one has

G(w) = Σ_n Σ_i ti^(n) log yi(x^(n); w).    (4.73)
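A minimal sketch of the softmax outputs (4.72) and the multiclass cross-entropy G(w) of (4.73); the activations and one-hot targets are illustrative inputs.

```python
import numpy as np

# Softmax outputs and multiclass cross-entropy.
def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)       # rows sum to 1, entries in [0,1]

a = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])  # output activations a_i
t = np.array([[1, 0, 0], [0, 0, 1]])               # one-hot targets t_i
y = softmax(a)
G = np.sum(t * np.log(y))                          # to be maximized, cf. (4.73)
print(y, G)
```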

4.7.3 Levels of inference


In order to determine now the parameters w of the model together
with the hyperparameters α, β one can apply the Bayes rule now at
several levels of inference.
At Level 1 one infers the parameters w for given α, β:

P (D|w, α, β, H)P (w|α, β, H)


P (w|D, α, β, H) = . (4.74)
P (D|α, β, H)

At Level 2 one infers the hyperparameters α, β:

P (D|α, β, H)P (α, β, H)


P (α, β|D, H) = . (4.75)
P (D|H)

At Level 3 one compares several models by applying


P (H|D) ∝ P (D|H)P (H) (4.76)
where H is a certain model, e.g. with a fixed number of hidden units.
Models with different number of hidden units can then be compared
at Level 3.
More specifically, for Level 1 a Taylor expansion of the log posterior is considered at the maximum posterior (denoted as MP) solution wMP, which is a local minimum of M(w). One writes

P(w|D,α,β,H) ≃ (1/ZF(α,β)) exp[−M(wMP) − (1/2)(w − wMP)ᵀA(w − wMP)]    (4.77)

where A = −∇² log P(w|D,α,β,H)|_{wMP}.
The evidence for α, β at Level 1 becomes the likelihood at Level 2:

log P(D|α,β,H) = log [ZF(α,β)/(ZW(α)ZD(β))] = −M(wMP) − (1/2) log det(A/(2π)) − log ZW(α) − log ZD(β)    (4.78)

by assuming a uniform prior P(α,β|H). After some calculations one can show that the maximum evidence is obtained for

αMP = γ/(2EW(wMP)),   βMP = (N − γ)/(2ED(wMP))    (4.79)

which are re-estimation formulas for αMP, βMP. These equations depend implicitly on αMP, βMP. The effective (or well-determined) number of parameters equals

γ = k − αMP trace(A⁻¹) = Σ_{i=1}^{k} λi/(λi + αMP)    (4.80)

with k the total number of parameters and λi the eigenvalues of β∇²ED.
At Level 3 one has the following if the posterior is well approximated by a Gaussian:

P(D|Hi) ≃ P(D|wMP, Hi) × P(wMP|Hi) det^{−1/2}(A/2π)    (4.81)
or

Evidence ≃ Best fit likelihood × Occam factor


with Hessian A = −∇2 log P (w|D, Hi). Hence the Bayesian model
selection is a simple extension of maximum likelihood selection and
one can compare different models Hi at this third level of inference,
e.g. MLPs with different number of hidden units.
Finally, error bars can also be obtained on the predictions. Suppose N training data are given. For the case of regression with a single output one can take the following approximation at a new point x^{N+1}:

y(x^{N+1}; w) ≃ y(x^{N+1}; wMP) + gᵀ(w − wMP)    (4.82)

with sensitivity g = ∂y/∂w |_{x^{N+1}, wMP}. One then obtains a predictive distribution with mean

y(x^{N+1}; wMP)    (4.83)

and variance

σ²_{t|α,β} = gᵀA⁻¹g + 1/β    (4.84)

with A = −∇² log P(w|D,α,β,H). Hence A⁻¹ measures the size of the error bars on w.

4.7.4 Practical implementations and automatic relevance determination
From the results at the second level of inference one can see that this
Bayesian learning approach requires the knowledge of the Hessian or
at least an approximation for it. Therefore one can use a Levenberg-
Marquardt based approach to the Bayesian learning (Foresee
& Hagan, 1997)3.
The following algorithm has been proposed which works success-
fully for the training of MLPs:

Bayesian learning with Levenberg-Marquardt approximation


1. Initialize α, β and parameter vector w.

2. Take one step of the Levenberg-Marquardt algorithm to mini-


mize the objective function βED + αEW .
³ This algorithm is implemented in the Matlab 5.3 Neural Networks Toolbox (function trainbr for Bayesian regularization).

3. Compute the effective number of parameters γ with the Gauss-Newton approximation to the Hessian available from the Levenberg-Marquardt algorithm.

4. Compute new estimates for the hyperparameters: α = γ/(2EW(w)) and β = (N − γ)/(2ED(w)).

5. Iterate steps 2-4 until convergence.
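A minimal sketch of the hyperparameter re-estimation of steps 3-4, assuming the eigenvalues λi of β∇²ED and the current values EW, ED are given (placeholders here); in the full algorithm these quantities would be recomputed after every Levenberg-Marquardt step.

```python
import numpy as np

# Fixed-point re-estimation of alpha, beta via eqs. (4.79)-(4.80).
def update_hyperparameters(lam, E_W, E_D, N, alpha):
    gamma = np.sum(lam / (lam + alpha))     # effective number of parameters
    return gamma / (2 * E_W), (N - gamma) / (2 * E_D), gamma

lam = np.random.default_rng(5).uniform(0.01, 10.0, size=30)  # eigenvalues of beta*grad^2 E_D
alpha, beta, N, E_W, E_D = 1.0, 1.0, 500, 3.0, 40.0
for _ in range(20):   # simplified: lam, E_W, E_D kept fixed during the iteration
    alpha, beta, gamma = update_hyperparameters(lam, E_W, E_D, N, alpha)
print(alpha, beta, gamma)
```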

A more sophisticated method of Automatic Relevance Deter-


mination (ARD)4 can be used in order to automatically determine
which inputs of the model are most important (in a relative sense).
This is done by taking separate hyperparameters αi for each set of
weights associated with the inputs of the network (instead of a single
hyperparameter α) (Fig.4.10). After training one obtains different αi
values related to each input of the network. From these values one
can investigate the relative importance of the inputs with respect to
each other, provided all input training data were normalized. One can
gradually prune the least relevant inputs and retrain the network.
Another important issue is the incorporation of expert knowl-
edge. A first option is to let an expert generate additional data which
can be utilized then to augment the given measured training data
(Fig.4.11). There exist also methods of Bayesian networks [38] and
graphical models which are well suited for incorporating prior knowl-
edge.

⁴ Netlab software available from http://www.ncrg.aston.ac.uk/netlab/
[Figure: network inputs, each with an associated hyperparameter α1, α2, ..., αm]

Figure 4.10: Automatic relevance determination where each set of interconnection weights associated with a network input has a separate hyperparameter αi. In this way one can select the most relevant inputs.

[Figure: an EXPERT block supplying additional data: training data + expert training data]

Figure 4.11: Including prior knowledge by augmenting the original data set with data from the expert.
Chapter 5

Unsupervised Learning and Regularization Theory

In this Chapter we discuss aspects of dimensionality reduction, non-


linear PCA, cluster algorithms, vector quantization, self-organizing
maps and regularization theory. This chapter is mainly based on
[1, 3, 12, 14, 51]. In unsupervised learning, operations are done only on the input space data and not on target output data. Often these methods are applied in order to get a better idea of the underlying structure and density of the data and to extract knowledge from them. In methods like self-organizing maps one has the additional objective of visualizing the data.

5.1 Dimensionality reduction and nonlinear PCA

Often one tries to reduce large dimensional input spaces to lower dimensions such as in PCA analysis. A reduction of the input space for parametric models usually leads to fewer parameters to be estimated, which is desirable from the viewpoint of complexity criteria.
Consider patterns x ∈ Rd in the original space and transformed
inputs z ∈ Rm in a lower dimensional space with m ≪ d. Conceptually
this problem can be understood as an encoding/decoding problem
where one tries to minimize the reconstruction error. For the encoder
mapping
z = G(x) (5.1)


[Figure: x → G(x) → z → F(z) → x̂]

Figure 5.1: Information bottleneck in dimensionality reduction. Linear PCA is obtained for linear mappings F(·), G(·). Nonlinear PCA analysis is obtained by considering nonlinear mappings that can be parameterized by MLPs.

and decoder mapping

x̂ = F (z) (5.2)

one has the objective of minimizing the squared distortion error

min E = (1/N) Σ_{i=1}^{N} ‖xi − x̂i‖₂² = (1/N) Σ_{i=1}^{N} ‖xi − F(G(xi))‖₂².    (5.3)

An important special case is when F(·), G(·) are linear mappings. It can be proven that this corresponds to PCA analysis. However, one can also take these mappings nonlinear and parameterize them e.g. by means of MLPs. In that case this leads to so-called nonlinear PCA analysis. The variable z in the reduced dimension leads to an information bottleneck (Fig. 5.1). These problems have also been studied in rate-distortion theory within information theory. Some examples about dimensionality reduction and linear versus nonlinear PCA analysis are shown in Fig. 5.2, Fig. 5.3 and Fig. 5.4.
[Figure: curved one-dimensional data set in the (x1, x2) plane with principal directions u1, u2]

Figure 5.2: The data set in two dimensions has intrinsic dimensionality 1. The data can be explained in terms of a single parameter η, while linear PCA is unable to detect the lower dimensionality.

[Figure: noisy one-dimensional curve parameterized by η in the (x1, x2) plane]

Figure 5.3: Addition of a small level of noise to data in two dimensions having an intrinsic dimensionality of 1 can increase its dimensionality to 2. Nevertheless, the data can be represented to a good approximation by a single variable η.
[Figure: two Gaussian classes C1 and C2 in the (x1, x2) plane with principal directions u1, u2]

Figure 5.4: In this simple classification problem linear PCA analysis would discard the discriminatory information. Suppose data are taken from two Gaussian distributed classes C1 and C2. Dimensionality reduction to one dimension would give a projection of the data onto the vector u1, which would remove all ability to discriminate between the two classes.

5.2 Cluster algorithms


An important class of unsupervised learning methods are cluster al-
gorithms. In this case one aims at finding groups of points which are
located close to each other according to a chosen distance measure.
A well-known method is the K-means algorithm which works as
follows.

K-means cluster algorithm

1. Choose K initial cluster centers z1 (1), ..., zK (1).

2. At iteration step k, distribute the samples {x} among the K


cluster domains by

x ∈ Sj(k)   if   ‖x − zj(k)‖ < ‖x − zi(k)‖,   ∀i ≠ j,  i = 1, 2, ..., K

where Sj(k) denotes the set of samples whose nearest cluster center is zj(k) at iteration step k.

3. Compute new cluster centers:

zj(k + 1) = (1/Nj) Σ_{x∈Sj(k)} x,   j = 1, 2, ..., K

with Nj the number of samples in Sj(k). This minimizes the performance index

Jj = Σ_{x∈Sj(k)} ‖x − zj(k + 1)‖²,   j = 1, 2, ..., K.

4. If zj (k + 1) = zj (k) for j = 1, 2, ..., K the algorithm has con-


verged. Otherwise go to step 2.

Note that in this method one has to choose the number of centers
K (Fig.5.5). The method also depends on the initial choice of the
clusters. The performance of the method is characterized by the per-
formance indices Jj for each of the clusters j = 1, 2, ..., K. These
indices can be combined into a single performance measure. The K-
means algorithm can also be considered as a rough approximation to
the E-step of the EM algorithm for a mixture of Gaussians. Density
estimation methods such as mixture models can indeed also be consid-
ered as unsupervised learning. There also exist many other clustering
methods, e.g. isodata algorithm, hierarchical clustering, agglomera-
tive clustering, divisive clustering etc. [6, 9, 14].
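A minimal sketch of the K-means algorithm above on an illustrative 2-dimensional data set with K = 3.

```python
import numpy as np

# K-means sketch: alternate nearest-center assignment and mean updates.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 2.0, 4.0)])
K = 3
z = X[rng.choice(len(X), K, replace=False)]    # step 1: initial centers

while True:
    d = np.linalg.norm(X[:, None, :] - z[None, :, :], axis=2)
    labels = d.argmin(axis=1)                  # step 2: nearest-center assignment
    z_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else z[j] for j in range(K)])   # step 3: cluster means
    if np.allclose(z_new, z):                  # step 4: convergence test
        break
    z = z_new
print(z)
```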

5.3 Vector quantization


Vector quantization is a method which is somewhat related to cluster-
ing but while the precise objective of clustering is sometimes rather
vague (finding interesting groups of samples), in vector quantization
one optimizes a quantization error (distortion measure) for a fixed
number of prototype vectors.
Given data x(k) for k = 1, 2, ... and initial prototype centers cj (0)
for j = 1, 2, ..., m (m centers) the vector quantization method (com-
petitive learning or stochastic approximation version) is given by the
following algorithm.

Vector quantization algorithm


[Figure: data points and cluster centers in a 2-D feature space, at initialization and after convergence]

Figure 5.5: K-means clustering algorithm.

1. Determine the nearest center cj(k) to data point x(k). For the squared error loss function the nearest neighbour rule is

j = arg min_l ‖x(k) − cl(k)‖.

Finding the nearest center is often called competition among the centers.

2. Update the centers:

cj(k + 1) = cj(k) + γ(k)[x(k) − cj(k)],   k := k + 1

where γ(k) should meet the conditions for stochastic approximation, i.e.

lim_{k→∞} γ(k) = 0,   Σ_{k=1}^{∞} γ(k) = ∞,   Σ_{k=1}^{∞} γ²(k) < ∞.

The vector quantizer Q is a mapping Q : Rᵈ → C where C = {c1, c2, ..., cm} (winning units). The space Rᵈ is partitioned into the regions R1, ..., Rm where Rj = {x ∈ Rᵈ : Q(x) = cj}. The algorithm minimizes a distortion measure of the form

∫ Σ_j ‖x − cj‖² I(x ∈ Rj) p(x) dx    (5.4)

where Rj corresponds to the j-th partition region and I(x ∈ Rj) = 1 if x ∈ Rj and 0 otherwise. The partition regions of a vector quantizer are non-overlapping and cover the entire input space Rᵈ. The optimal vector quantizer has the so-called nearest-neighbor partition (or Voronoi partition) illustrated in Fig. 5.6.
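A minimal sketch of the on-line vector quantization update; the choice γ(k) = 1/k satisfies the stochastic approximation conditions, and the data stream is an illustrative assumption.

```python
import numpy as np

# On-line vector quantization: competition plus stochastic approximation update.
rng = np.random.default_rng(7)
m, d = 4, 2
c = rng.normal(size=(m, d))                         # initial prototype centers
for k in range(1, 5001):
    x = rng.normal(size=d)                          # new data point x(k)
    j = np.argmin(np.linalg.norm(x - c, axis=1))    # competition: nearest center
    gamma = 1.0 / k                                 # gamma(k) -> 0, sum = inf, sum of squares < inf
    c[j] += gamma * (x - c[j])                      # move winning center towards x(k)
print(c)
```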

5.4 Self-organizing maps


Self-organizing maps are largely based on vector quantization, but in addition they aim at a visual representation on a low-dimensional map. Hence, self-organizing maps (SOM) try to represent the underlying density of the input data by means of prototype vectors and at the same time one projects the higher dimensional input data to a map of neurons (also called nodes or units) such that the data can be visualised. Typically one has a projection to a 2-dimensional grid of neurons. In this way the SOM compresses information while preserving the most important topological and metric relationships of the primary data items on the display.

[Figure: prototype vectors c1, c2, c3 with Voronoi regions R1, R2, R3]

Figure 5.6: Vector quantization leading to Voronoi partition with prototype vectors c1, c2, c3 and regions R1, R2, R3.
Consider input training data xi ∈ Rd for i = 1, ..., N, prototype
vectors cj ∈ Rd for j = 1, ..., b and map coordinates zj ∈ R2 for
j = 1, ..., b where N denotes the number of training data and b the
number of neurons. One can say that the neurons have in fact two
positions: in the ‘input’ space one has the prototypes cj ∈ Rd while
in the ‘output’ space one has the map coordinates zj ∈ R2 both for
j = 1, ..., b.
Dimensionality reduction is done by projecting to a 2-dimensional
space with coordinates z ∈ R2 . One typically takes a 2-dimensional
grid of neurons: ψ = {ψ1 , ψ2 , ..., ψb }, where ψ(j) denotes the j-th ele-
ment of ψ. A simple illustration is given for b = 16 neurons in Fig.5.7.
One can take several possible grid choices, e.g. a hexagonal grid, a rectangular grid or others (Fig. 5.8). A typical choice for the number of neurons is b = 5√N, where the computational load increases quadratically with b.
For the SOM algorithm there exist both batch versions and on-line
[Figure: 4×4 grid of neurons ψ1, ψ2, ..., ψ16 in map coordinates (z1, z2)]

Figure 5.7: Simple illustration of a 2-dimensional grid of neurons in SOM with 16 neurons.

Figure 5.8: Some possible grid choices for the SOM.



adaptive versions which we discuss now. The batch algorithm is given


by

SOM Batch algorithm (off-line):


Repeat until convergence:
1. Projection:

j∗ = arg min_j ‖xi − cj‖₂²,   ẑi = ψ(j∗),   i = 1, ..., N.

2. Update the centers:

cj = F(ψ(j), σ),   j = 1, ..., b

where

F(z, σ) = Σ_{i=1}^{N} Kσ(z, ẑi) xi / Σ_{i=1}^{N} Kσ(z, ẑi)

with

Kσ(z, z′) = exp(−‖z − z′‖²/(2σ²)).

3. Decrease σ:

σ(k) = σinitial (σfinal/σinitial)^{k/kmax}

at iteration step k. The initial value of σ is chosen such that the neighborhood covers all the units. The final value controls the smoothness of the mapping.

The online SOM algorithm is given by:

SOM On-line algorithm:


1. Determine the winning unit (also called best matching unit)

z(k) = ψ(arg min_j ‖x(k) − cj(k − 1)‖)

2. Update all units

cj(k) = cj(k − 1) + β(k) Kσ(k)(ψ(j), z(k)) [x(k) − cj(k − 1)],   j = 1, ..., b,
k := k + 1.
3. Decrease the learning rate β(k) and the width σ(k).

Figure 5.9: Examples of SOM neighborhood size functions (sizes 1, 2, 3). In a similar way the sizes can be controlled by σ when using the Kσ(z, zi) = exp(−‖z − zi‖²/(2σ²)) function.
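A minimal sketch of the on-line SOM algorithm on a rectangular grid; the grid size, the decay schedules and the data are illustrative assumptions.

```python
import numpy as np

# On-line SOM: best matching unit plus Gaussian-neighborhood update of all units.
rng = np.random.default_rng(8)
d, side = 3, 5
b = side * side
psi = np.array([[i, j] for i in range(side) for j in range(side)], float)  # map coords
c = rng.normal(size=(b, d))                          # prototypes in input space

kmax = 2000
for k in range(kmax):
    x = rng.normal(size=d)                           # data point x(k)
    win = np.argmin(np.linalg.norm(x - c, axis=1))   # best matching unit
    beta = 0.5 * (0.01 / 0.5) ** (k / kmax)          # decreasing learning rate
    sigma = 3.0 * (0.5 / 3.0) ** (k / kmax)          # decreasing neighborhood width
    Ksig = np.exp(-np.sum((psi - psi[win]) ** 2, axis=1) / (2 * sigma**2))
    c += beta * Ksig[:, None] * (x - c)              # update all units
```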

The batch version is usually faster than the on-line version and is
often preferred. The initialization of the SOM can be done either
at random or based upon the two principal eigenvectors from PCA
analysis. Missing values within the data set are usually excluded from
the distance calculations.
For the neighborhood functions several choices are possible. In the
case of the choice Kσ (z, zi ) = exp(−kz − zi k2 /2σ 2 ) the neighborhood
size is controlled by σ.

One also often works with neighborhood sizes 1, 2 or 3 for the neurons,
which is illustrated in Fig.5.9.
Nice visualizations can be made by SOMs. It is important, how-
ever, to carefully interpret the results after the training of the SOM.
One gets insight by looking at the color or black/white maps. De-
pending on color definition by the user, dark areas might mean that
the data are very dense and clustered in that region (many data close
to each other). This information is obtained by calculating distances
between the prototype vectors. In case the class labels of the data
are given (supervised information) one can also show these on the

SOM map. In the so-called WEBSOM, the SOM method has been
applied to problems of webmining where millions of documents have
been processed (Fig.5.10) [43].

5.5 Regularization theory


5.5.1 RBF networks and regularization theory
In this Section we establish a link between dimensionality reduction and clustering algorithms from unsupervised learning on the one hand and supervised regression problems on the other hand. This is done based upon insights obtained from regularization theory. The following results have been described in [51] (Poggio & Girosi, 1990). In a previous chapter we already discussed the importance of parametric regularization in the cost function (ridge regression, weight decay for MLPs etc.). On the other hand, regularization can also be done in a non-parametric sense, by expressing that the derivatives of the estimated function should be small such that the obtained solution does not oscillate too much.
Consider a given data set S = {(xi , yi ) ∈ Rn × R | i = 1, ..., N}
(supervised learning) and the following problem in which a functional
is minimized
min_f H[f] = Σ_{i=1}^{N} (yi − f(xi))² + λ‖Pf‖²    (5.5)

where λ > 0 denotes the regularization parameter, P a constraint operator (usually a differential operator) and ‖·‖ usually corresponds to the l₂ norm defined on the function space. In order to solve this problem and find the optimal function one can apply the theory of calculus
of variations. Regularization theory finds its origin in so-called inverse
problems which are ill-posed. In such problems the reconstruction is not unique, does not exist, or does not depend continuously on the data, or noise effects are important. Some examples are surface reconstruc-
tion in vision, recovery of motion and optical flow, edge detection,
radioastronomy, seismic exploration, medical diagnosis and others. A
unified theory on regularization of ill-posed problems was studied by
Tikhonov (1960) and is closely related to splines. The obtained models
are also very relevant for datamining problems in general.
[Figure: SOM map with colored/shaded regions indicating the clustering of the data]

Figure 5.10: Example of a SOM map after training where insight about the clustering of the data is obtained from the color or black/white regions. This figure illustrates a result from webmining by means of the WEBSOM http://websom.hut.fi/websom/.

Calculus of variations with application of the Euler-Lagrange equations leads to the partial differential equation

P̂P f(x) = (1/λ) Σ_{i=1}^{N} (yi − f(x)) δ(x − xi)    (5.6)

where P̂ is the adjoint of the differential operator P . The solution


can be expressed in terms of the Green’s function G satisfying the
distributional differential equation

P̂ P G(x; y) = δ(x − y). (5.7)

The optimal solution can be written as

f(x) = Σ_{i=1}^{N} ci G(x; xi).    (5.8)

The solution f lies in an N-dimensional subspace of the space of


smooth functions. The basis for this subspace is given by the N func-
tions G(x; xi ). G(x; xi ) is the Green’s function centered at the point
xi and the points xi are the centers of the expansion. The coefficients
ci are the solution to the following linear system:

(G + λI)c = y (5.9)

where (y)i = yi , (c)i = ci and (G)ij = G(xi ; xj ). A pure interpolation


problem (λ = 0) corresponds to the linear system Gc = y. The matrix
G is symmetric (hence real eigenvalues) because the Green’s function
is symmetric G(x; y) = G(y; x). G + λI is of full rank unless −λ is
equal to one of the eigenvalues.
Depending on the properties of the operator P one obtains specific
solutions:
• P translationally invariant:
In this case the Green’s function is translationally invariant

G(x; xi ) = G(x − xi ). (5.10)

• P rotationally and translationally invariant:


In this case the Green’s function is a radial function

G(x; xi ) = G(kx − xi k). (5.11)


For a stabilizer P

‖Pf‖² = Σ_{m=0}^{M} am ‖Oᵐf‖²,   ‖Oᵐf‖² = Σ_{i1...im} ∫_{Rⁿ} (∂ᵐf(x)/∂xi1...∂xim)² dx    (5.12)

one obtains as optimal solution the Radial Basis Function neural networks (RBF) with Gaussian activation function. An example for n = 2 is

‖O²f‖² = ∫_{R²} [ (∂²f/∂x1²)² + 2(∂²f/∂x1∂x2)² + (∂²f/∂x2²)² ] dx1 dx2.    (5.13)

In [51] the following learning algorithm is given for the training of RBF networks. The original theory with optimal solution

f(x) = Σ_{i=1}^{N} ci G(‖x − xi‖)    (5.14)

is modified at two points. One takes a number of hidden neurons nh (nh < N) instead of N, which will lead to moving centers. Also a weighted norm ‖x − xi‖_W is utilized instead of ‖x − xi‖, which will lead to dimensionality reduction.
One then proposes a solution of the form

f∗(x) = Σ_{i=1}^{nh} ci φi(x),   nh < N    (5.15)

with {φi}_{i=1}^{nh} linearly independent functions. The optimal solution is obtained by

∂H[f∗]/∂ci = 0,   i = 1, ..., nh    (5.16)

and is equal to

f(x) = Σ_{α=1}^{nh} cα G(x; tα)    (5.17)

with centers tα. Remark that in the special case nh = N one has {tα}_{α=1}^{N} = {xi}_{i=1}^{N}. In the weighted norm case one seeks to minimize

min_f HW[f] = Σ_{i=1}^{N} (yi − f(xi))² + λ‖Pf‖²_y    (5.18)
[Figure: surface plots of a Gaussian and a thin-plate-spline basis function]

Figure 5.11: Regularization theory: (Top) Gaussian activation function h(r) = exp(−r²/c²); (Bottom) thin plate spline h(r) = r² ln r (−h on the figure).

where P is radially symmetric in the variable y = Wx. Assuming moving centers one obtains

f∗(x) = Σ_{α=1}^{nh} cα G(‖x − tα‖²_W).    (5.19)

Finding the best stabilizer P corresponds to min_W HW[f].
Consider now a solution of the following form

f = Σ_{α=1}^{nh} cα G(‖x − tα‖²_W).    (5.20)

The goal is then to minimize

min_{cα,tα,W} H[f] = Σ_{i=1}^{N} (yi − Σ_{α=1}^{nh} cα G(‖xi − tα‖²_W))² + λ‖Pf‖²_y    (5.21)

with y = Wx. The conditions for optimality are given by ∂H[f∗]/∂cα = 0, ∂H[f∗]/∂tα = 0, ∂H[f∗]/∂W = 0 for α = 1, ..., nh. In the case λ = 0 the gradient is given by

∂H[f∗]/∂cα = −2 Σ_{i=1}^{N} Δi G(‖xi − tα‖²_W)
∂H[f∗]/∂tα = 4 cα Σ_{i=1}^{N} Δi G′(‖xi − tα‖²_W) WᵀW (xi − tα)
∂H[f∗]/∂W = −4 W Σ_{α=1}^{nh} Σ_{i=1}^{N} cα Δi G′(‖xi − tα‖²_W) Qi,α    (5.22)

where Δi = yi − f∗(xi) and Qi,α = (xi − tα)(xi − tα)ᵀ. Remark that Σ_{i=1}^{N} Qi,α is an estimate for the correlation matrix of the samples xi relative to the center tα. The following interpretation can be made for the gradients. The variation of H with respect to cα means that the adaptation of cα is proportional to the error on an example and to the activity of the unit whose center represents that example.
In the case of fixed tα, W the optimal solution is

c = (GᵀG + λg)⁻¹Gᵀy    (5.23)

with (y)_i = yi, (c)_α = cα, (G)_{iα} = G(xi; tα), g_{αβ} = G(tα; tβ). The variation of H with respect to tα can be understood as follows. For fixed cα and W (W = I) the optimal solution is

tα = Σ_{i=1}^{N} P_{iα} xi / Σ_{i=1}^{N} P_{iα},   P_{iα} = Δi G′(‖xi − tα‖²),   α = 1, ..., nh    (5.24)

which can be interpreted as task-dependent clustering with optimal centers equal to a weighted sum of the data, with the weights P_{iα} proportional to the errors Δi. Finally, the variation of H with respect to W leads to dimensionality reduction by finding an optimal metric.
Based upon these insights the following practical learning algo-
rithm has been proposed in [51].

Learning algorithm RBF networks

1. Choose nh < N.

2. Set the center positions tα to a subset of the examples xi.

3. Choose span_row(W) ⊥ span{eigenvectors of Σ_α Σ_i Qi,α with largest eigenvalues}.

4. Compute c:

c = (GᵀG + λg)⁻¹Gᵀy,   y = Wx.

5. Use these values cα, tα, W as initial values for a gradient descent algorithm

ċα = −η ∂H[f∗]/∂cα,   ṫα = −η ∂H[f∗]/∂tα,   Ẇ = −η ∂H[f∗]/∂W

with learning rate η.

5.5.2 Learning by a separation principle


The insights on the links between clustering and RBF networks have
been further exploited in the work of Chen & Billings (1992) in [29, 30]
for nonlinear modelling.
Consider the nonlinear input/output model

ŷ_{k+1} = f(z_{k|k−p}),   z_{k|k−p} = [yk; y_{k−1}; ...; y_{k−p}; uk; u_{k−1}; ...; u_{k−p}]    (5.25)

where f(·) is parameterized by an RBF network

ŷ_{k+1} = Σ_{i=1}^{nh} wi φ(‖z_{k|k−p} − ci‖)    (5.26)

with centers ci , output weights wi and nh number of hidden units.


Suppose the norm k · k is unweighted. Remark that for fixed centers
the parameterization is linear in the output weights. This property is
utilized in the following algorithm:
1. First determine the centers ci using a cluster algorithm (e.g. K-
means) or a self-organizing map.
2. In a second step find the output weights wi by solving a linear
least squares problem, for given fixed ci values.
Hence, one can train RBF networks by a combination of unsupervised
learning and linear regression. In this way the training of the hidden
layer and the output layer are separated from each other. However,
one should be aware that this procedure is only suboptimal despite
the nice interpretation of the method.
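A minimal sketch of this separation principle for a static regression problem (instead of the NARX model above): the centers are obtained by a few K-means iterations, the output weights by linear least squares. The data and the kernel width are illustrative assumptions.

```python
import numpy as np

# Two-step RBF training: unsupervised centers, then linear least squares.
rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(x).ravel() + 0.05 * rng.normal(size=200)

nh = 10
c = x[rng.choice(len(x), nh, replace=False)].copy()  # step 1: centers via K-means
for _ in range(20):
    lab = np.argmin(np.abs(x - c.T), axis=1)
    c = np.array([x[lab == j].mean(axis=0) if np.any(lab == j) else c[j]
                  for j in range(nh)])

sigma = 0.5
Phi = np.exp(-(x - c.T) ** 2 / sigma**2)             # hidden layer outputs
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # step 2: output weights
yhat = Phi @ w
```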

5.5.3 Link with fuzzy models


The use of Gaussian activation functions in neural networks is also very much related to the theory of fuzzy systems [65]. For the Gaussian

µ_{Aij}(xj; cij, σij) = exp(−‖xj − cij‖₂²/σij²)    (5.27)

µ then denotes the membership function and Aij is a fuzzy set.
fuzzy logic and fuzzy modelling one works with “vague statements”
such as “It is cold” instead of a crisp statement like “The tempera-
ture is 7 degrees”. In general, several choices of membership functions
are possible, e.g. trapezoidal, Gaussian and triangular (Fig.5.12). The
membership function also expresses the uncertainty that one has about
facts. Classically, uncertainty is often expressed in terms of probabil-
ity. However, note that probabilities and membership grades are not
the same. A simple example of a fuzzy system could be
[Figure: membership grades for "cold", "warm" and "hot" versus temperature]

Figure 5.12: Illustration of the concept of a membership function in fuzzy systems.

If the heating power is high


then the temperature will increase fast.

This is a system with functional form y = f (u) where u corresponds to


the first sentence and y to the second sentence. In fuzzy systems one
can have inputs/outputs that are fuzzy/crisp, crisp/fuzzy or fuzzy/fuz-
zy.
Fuzzy logic, as originally introduced by Lofti Zadeh, is the theory
of how to compute with such vague statements, instead of with the
usual binary 0/1 logic. In this way it can also be considered as a
theory for computing with words (linguistic variables). The concepts
of fuzzy logic have been used and developed within nonlinear modelling
(fuzzy models) and control (fuzzy control). For control applications
it has the desirable property that models and control strategies can
be understood and translated into words such that human operators
can understand what the system is doing and, moreover, that human
knowledge can be incorporated within these systems.
Chapter 6

Support Vector Machines

In this Chapter we discuss Support Vector Machines (SVM)1 for linear


and nonlinear classification and function estimation. The chapter is
mainly based on [4, 16, 20, 21, 27, 56, 59, 60].

6.1 Motivation
Despite the fact that classical neural networks (MLPs, RBF networks) have nice properties such as universal approximation, and that reliable algorithms presently exist for this class of techniques, they still have a number of persistent drawbacks.
many local minima solutions. Although many of these local solutions
actually can be good solutions, it is often inconvenient, e.g. from a
statistical perspective. Another problem is how to choose the number
of hidden units (Fig.6.1).
The theory of Support Vector Machines (SVMs) sheds a new light
on these problems. Support vector machines have been introduced
by Vapnik. In fact the original idea of linear SVMs already dates back to the sixties, but it became more important and popular in recent years when extensions to general nonlinear SVMs were made [20, 21]. In SVMs one works with kernel based representations
of the network allowing linear, polynomial, splines, RBF and other
kernels. Several operations on kernels are allowed and for specific
applications such as textmining string kernels can be used. The so-
¹ At http://www.kernel-machines.org/ much material such as tutorials, papers and software is available.

[Figure: (Top) network with an unknown number of hidden neurons; (Bottom) non-convex cost function over the weight space with several local minima]

Figure 6.1: Drawbacks of classical neural networks: (Top) problem of number of hidden units; (Bottom) existence of many local minima solutions.

lution is characterized by a convex optimization problem (typically


quadratic programming) which has a unique solution in contrast with
MLPs. Moreover also the model complexity (e.g. number of hidden
neurons) follows from the solution to this convex optimization prob-
lem. The support vectors can be interpreted as informative points.
SVM models also work well in huge dimensional input spaces. SVMs
have been originally proposed within the context of statistical learning
theory, where also probabilistic bounds on the generalization error of
models have been derived. These bounds are expressed in terms of
the VC (Vapnik-Chervonenkis) dimension, which can be considered
as a combinatorial measure for model complexity. A drawback, however, is that SVMs have been mainly developed for static problems such as classification, function and density estimation. Also a kernel version of PCA and kernel versions of cluster algorithms exist, which will not be treated in this course.
to many real-life problems [47] including text categorisation, image
recognition, handwritten digit recognition, bioinformatics, protein ho-
mology detection and financial engineering.

6.2 Maximal margin classifiers and linear SVMs

6.2.1 Margin

In Fig.6.2 an illustrative example is given of a separable problem in


a two-dimensional feature space. One can see that there exist several
separating hyperplanes that separate the data of the two classes (data
depicted by ‘x’ and ‘+’). Towards the development of SVM theory it
is important to define a unique separating hyperplane. This is done
by maximizing the distance to the nearest points of the two classes.
According to Vapnik, one can then do a rescaling of the problem such that min_i |wᵀxi + b| = 1, i.e. the scaling is done such that the point closest to the hyperplane has a distance 1/‖w‖₂. The margin between the classes is then equal to 2/‖w‖₂. Maximizing the margin then corresponds to minimizing ‖w‖₂.
[Figure: two classes ('x' and '+') in the (x1, x2) plane; several separating lines are possible (top); the unique hyperplane maximizing the distance to the nearest points (bottom)]

Figure 6.2: Linear classification: (Top) separable problem where the separating hyperplane is not unique; (Bottom) definition of a unique hyperplane which is maximizing the distance to the nearest points.
[Figure: separating hyperplane wᵀx + b = 0 with margin boundaries wᵀx + b = ±1]

Figure 6.3: Linear classification: definition of the unique separating hyperplane. The margin is the distance between the dashed lines.

6.2.2 Linear SVM classifier: separable case

After introducing the margin concept, we are now in a position to formulate the linear SVM classifier, which was originally proposed by Vapnik (1964).
Consider a given training set {xk, yk}_{k=1}^{N} with input patterns xk ∈ Rⁿ and output patterns yk ∈ R with class labels yk ∈ {−1, +1}. Assume

wᵀxk + b ≥ +1,  if yk = +1
wᵀxk + b ≤ −1,  if yk = −1    (6.1)

which is equivalent to

yk [wᵀxk + b] ≥ 1,  k = 1, ..., N.    (6.2)

One then formulates the optimization problem:

min_{w,b} (1/2) wᵀw   s.t.   yk [wᵀxk + b] ≥ 1,  k = 1, ..., N.    (6.3)

The Lagrangian for this problem is

L(w, b; α) = (1/2) wᵀw − Σ_{k=1}^{N} αk {yk [wᵀxk + b] − 1}    (6.4)

with Lagrange multipliers αk ≥ 0 for k = 1, ..., N (later called the support values). The solution is given by the saddle point of the Lagrangian

max_α min_{w,b} L(w, b; α).    (6.5)

One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^{N} αk yk xk
∂L/∂b = 0  →  Σ_{k=1}^{N} αk yk = 0    (6.6)

with resulting classifier

y(x) = sign[ Σ_{k=1}^{N} αk yk xkᵀx + b ].    (6.7)
By replacing the expression for w in the Lagrangian one obtains the following Quadratic Programming (QP) problem (dual problem), which solves for the Lagrange multipliers:

max_α Q(α) = −(1/2) Σ_{k,l=1}^{N} yk yl xkᵀxl αk αl + Σ_{k=1}^{N} αk    (6.8)

such that

Σ_{k=1}^{N} αk yk = 0,   αk ≥ 0.    (6.9)

Note that this problem is solved in α, not in w. One can prove that the solution to the QP problem is global and unique. The data related to nonzero αk are called support vectors; in other words, these data points contribute to the sum in the classifier model. A drawback, however, is that the QP problem matrix size grows with the number of data N, e.g. when one has 1,000,000 data points the size of the matrix involved in the QP problem will be 1,000,000 × 1,000,000, which is too huge for computer memory storage.

6.2.3 Linear SVM classifier: non-separable case

For most real-life problems, when taking a linear classifier, not all the data points of the training set will be correctly classified (Fig. 6.4), unless the true underlying problem is perfectly linearly separable. However, often the distributions of the two classes will have a large overlap such that misclassifications have to be tolerated.
Therefore one modifies the inequalities into

yk [wᵀxk + b] ≥ 1 − ξk,  k = 1, ..., N    (6.10)

with slack variables ξk ≥ 0 such that the original inequalities can be violated for certain points if needed (for ξk > 1 the original inequality is violated for that data point). The optimization problem becomes

min_{w,b,ξ} J(w, ξ) = (1/2) wᵀw + c Σ_{k=1}^{N} ξk    (6.11)

subject to

yk [wᵀxk + b] ≥ 1 − ξk,  k = 1, ..., N
ξk ≥ 0,  k = 1, ..., N.    (6.12)
[Figure: two overlapping classes in the (x1, x2) plane; some points cannot be correctly separated by a line]

Figure 6.4: Problem of non-separable data, due to overlapping distributions.

with Lagrangian

L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{k=1}^{N} αk {yk [wᵀxk + b] − 1 + ξk} − Σ_{k=1}^{N} νk ξk    (6.13)

and Lagrange multipliers αk ≥ 0, νk ≥ 0 for k = 1, ..., N. The solution is given by the saddle point of the Lagrangian:

max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν).    (6.14)

One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^{N} αk yk xk
∂L/∂b = 0  →  Σ_{k=1}^{N} αk yk = 0
∂L/∂ξk = 0  →  0 ≤ αk ≤ c,  k = 1, ..., N    (6.15)

which gives the quadratic programming problem (dual problem):

max_α Q(α) = −(1/2) Σ_{k,l=1}^{N} yk yl xkᵀxl αk αl + Σ_{k=1}^{N} αk    (6.16)

such that

Σ_{k=1}^{N} αk yk = 0
0 ≤ αk ≤ c,  k = 1, ..., N.    (6.17)

This problem now has additional box constraints. The computation of b follows from the KKT (Karush-Kuhn-Tucker) conditions.
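A minimal sketch of solving the dual problem (6.16)-(6.17) numerically, assuming the cvxopt QP solver is available; note that cvxopt minimizes, so the sign of Q(α) is flipped. The toy data and the value of c are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

# Linear SVM dual QP: min (1/2) a^T P a + q^T a  s.t.  G a <= h, A a = b.
rng = np.random.default_rng(10)
N = 40
X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(1, 1, (N // 2, 2))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])
c = 1.0

K = X @ X.T                                   # linear kernel Gram matrix
P = matrix(np.outer(y, y) * K)                # P_kl = y_k y_l x_k^T x_l
q = matrix(-np.ones(N))                       # minus sign: maximize Q(alpha)
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))   # 0 <= alpha_k <= c
A = matrix(y.reshape(1, -1))
b = matrix(0.0)                               # sum_k alpha_k y_k = 0
solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol['x']).ravel()

sv = alpha > 1e-6                             # support vectors
w = (alpha * y) @ X
bias = np.mean(y[sv] - X[sv] @ w)             # b from (approximate) KKT conditions
```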

6.3 Kernel trick and Mercer condition


Important progress in SVM theory has been made thanks to the fact
that the linear theory has been extended to nonlinear models (Vapnik,
1995) [20, 21]. In order to achieve this, one maps the input data
into a high dimensional feature space which can be potentially infinite
dimensional. A construction of the linear separating hyperplane is
done then in this high dimensional feature space2 , after a nonlinear
mapping ϕ(x) of the input data to the feature space (Fig.6.5).
Surprisingly, no explicit construction of the nonlinear mapping
ϕ(x) is needed. One makes use of the so-called Mercer theorem
(often called the kernel trick). This states that there exists a map-
ping ϕ and an expansion
K(x, y) = Σ_{i=1}^{M} ϕi(x) ϕi(y),   x, y ∈ Rⁿ,    (6.18)

if and only if, for any g(x) such that

∫ g(x)² dx is finite    (6.19)

one has

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.    (6.20)

Hence the kernel should be positive definite. By applying this theorem


one can avoid computations in the huge dimensional feature space.
Instead one chooses a kernel function. For RBF kernels M → ∞
while for linear and polynomial kernels M is finite.
² In fact it would be better to call this a high-dimensional hidden layer, because in pattern recognition the term feature space is frequently used with another meaning, namely the input space. Nevertheless we will use the term feature space in the sequel.
[Figure: nonlinear mapping φ(x) from the input space to a feature space where the classes become linearly separable; kernel evaluation K(x, y) = φ(x)ᵀφ(y)]

Figure 6.5: (Top) Mapping of the input space to a high dimensional feature space where a linear separation is made, which corresponds to a nonlinear separation in the original input space; (Bottom) Illustration of using a positive definite kernel function K.

6.4 Nonlinear SVM classifiers

The extension from linear SVM classifiers to nonlinear SVM classifiers is straightforward. One starts formulating the problem in the primal space, i.e. in the w vector (which could be infinite dimensional). Assume

wᵀϕ(xk) + b ≥ +1,  if yk = +1
wᵀϕ(xk) + b ≤ −1,  if yk = −1    (6.21)

which is now equivalent to

yk [wᵀϕ(xk) + b] ≥ 1,  k = 1, ..., N.    (6.22)

No explicit construction of ϕ(·) : Rⁿ → R^{nh} (nh not specified) is needed at this point. In principle nh can also be infinite dimensional. The optimization problem becomes

min_{w,b,ξ} J(w, ξ) = (1/2) wᵀw + c Σ_{k=1}^{N} ξk    (6.23)

subject to

yk [wᵀϕ(xk) + b] ≥ 1 − ξk,  k = 1, ..., N
ξk ≥ 0,  k = 1, ..., N.    (6.24)

One constructs the Lagrangian:

L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{k=1}^{N} αk {yk [wᵀϕ(xk) + b] − 1 + ξk} − Σ_{k=1}^{N} νk ξk    (6.25)

with Lagrange multipliers αk ≥ 0, νk ≥ 0 (k = 1, ..., N). The solution is given by the saddle point of the Lagrangian:

max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν).    (6.26)

One obtains

∂L/∂w = 0  →  w = Σ_{k=1}^{N} αk yk ϕ(xk)
∂L/∂b = 0  →  Σ_{k=1}^{N} αk yk = 0
∂L/∂ξk = 0  →  0 ≤ αk ≤ c,  k = 1, ..., N.    (6.27)
The quadratic programming problem (dual problem) becomes

max_α Q(α) = −(1/2) Σ_{k,l=1}^{N} yk yl K(xk, xl) αk αl + Σ_{k=1}^{N} αk    (6.28)

such that

Σ_{k=1}^{N} αk yk = 0
0 ≤ αk ≤ c,  k = 1, ..., N.    (6.29)

Note that w and ϕ(xk) are never calculated but all calculations are done in the dual space. We make use of the Mercer condition by choosing a kernel

K(xk, xl) = ϕ(xk)ᵀϕ(xl).    (6.30)

Finally, the nonlinear SVM classifier takes the form

y(x) = sign[ Σ_{k=1}^{N} αk yk K(x, xk) + b ]    (6.31)

with αk positive real constants and b a real constant, which follow as the solution to the QP problem. Non-zero αk are called support values and the corresponding data points are called support vectors. The bias term b follows from the KKT conditions.
Several choices are possible for the kernel K(·,·):

K(x, xk) = xkᵀx   (linear SVM)
K(x, xk) = (xkᵀx + η)^d   (polynomial SVM of degree d), η ≥ 0
K(x, xk) = exp{−‖x − xk‖₂²/σ²}   (RBF kernel)
K(x, xk) = tanh(κ xkᵀx + θ)   (MLP kernel).

The Mercer condition holds for all σ values in the RBF case, but not for all possible choices of κ, θ in the MLP case (therefore the use of an MLP kernel is not popular in SVM methods). In the case of an RBF and MLP kernel, the number of hidden units corresponds to the number of support vectors; e.g. for the RBF kernel one has

y(x) = sign[ Σ_{k=1}^{N} αk yk exp{−‖x − xk‖₂²/σ²} + b ] = sign[ Σ_{k∈S_SV} αk yk exp{−‖x − xk‖₂²/σ²} + b ]    (6.32)
[Figure: two classes with encircled points near the decision boundary indicating the support vectors]

Figure 6.6: In the abstract figure the encircled points are support vectors. These points have a non-zero support value αk. The decision boundary can be expressed in terms of these support vectors (which explains the terminology). In standard QP type support vector machines all support vectors are located close to the decision boundary.

where SSV denotes the set of support vectors. It means that each
hidden unit corresponds to a support vector (non-zero support values
αk ) and the number of hidden units equals the number of support
values. The support vectors also have a nice geometrical meaning
(Fig.6.6). They are located close to the decision boundary and the
decision boundary can be expressed in terms of these support vectors
(which explains the terminology).
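A minimal sketch of evaluating the nonlinear SVM classifier (6.31)-(6.32) with an RBF kernel; the support values, labels and bias are illustrative placeholders that would normally come from the dual QP.

```python
import numpy as np

# Evaluate y(x) = sign( sum_k alpha_k y_k K(x, x_k) + b ) with an RBF kernel.
def rbf_kernel(x, Xk, sigma):
    return np.exp(-np.sum((Xk - x) ** 2, axis=1) / sigma**2)

def svm_predict(x, alpha, y, b, X_sv, sigma):
    return np.sign(np.sum(alpha * y * rbf_kernel(x, X_sv, sigma)) + b)

# illustrative placeholder values (support vectors and their support values)
X_sv = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])
alpha = np.array([0.7, 0.3, 1.0]); y = np.array([1.0, 1.0, -1.0]); b = -0.1
print(svm_predict(np.array([0.5, 0.5]), alpha, y, b, X_sv, 1.0))
```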

6.5 SVMs for function estimation


6.5.1 SVM for linear function estimation
Consider regression in the set of linear functions

f(x) = wᵀx + b    (6.33)
[Figure: loss zero inside the interval [−ε, +ε], increasing linearly outside]

Figure 6.7: Vapnik ǫ-insensitive loss function for function estimation.

with N training data xk ∈ Rᵐ and output values yk ∈ R. The empirical risk minimization is defined as

R_emp = (1/N) Σ_{k=1}^{N} |yk − wᵀxk − b|_ǫ.    (6.34)

For standard SVM function estimation one employs the so-called Vapnik ǫ-insensitive loss function³

|y − f(x)|_ǫ = 0,  if |y − f(x)| ≤ ǫ;   |y − f(x)| − ǫ,  otherwise    (6.35)

shown in Fig. 6.7.
By taking such a cost function one can formulate the following optimization problem

min (1/2) wᵀw    (6.36)

subject to |yk − wᵀxk − b| ≤ ǫ or

yk − wᵀxk − b ≤ ǫ
wᵀxk + b − yk ≤ ǫ.    (6.37)

³ SVM theory can be extended to any convex cost function. Historically, the SVM results were first derived for Vapnik's ǫ-insensitive loss function. In general, the choice of a 1-norm in the cost function is more robust than a 2-norm, e.g. with respect to outliers and non-Gaussian noise on the data.
[Figure: function f(x) with a tube of width ±ε around it; points outside the tube have slack ζ]

Figure 6.8: Tube of ǫ-accuracy and points which cannot meet this accuracy, motivating the use of slack variables.

Here ǫ denotes the required accuracy as demanded by the user. However, a priori not all points will be able to meet this requirement. Therefore one introduces slack variables:

min (1/2) wᵀw + c Σ_{k=1}^{N} (ξk + ξk*)    (6.38)

subject to

yk − wᵀxk − b ≤ ǫ + ξk
wᵀxk + b − yk ≤ ǫ + ξk*
ξk, ξk* ≥ 0.    (6.39)

The constant c > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ǫ are tolerated, which is illustrated in Fig. 6.8.
The Lagrangian is

L(w, b, ξ, ξ*; α, α*, η, η*) = (1/2) wᵀw + c Σ_{k=1}^{N} (ξk + ξk*) − Σ_{k=1}^{N} αk (ǫ + ξk − yk + wᵀxk + b) − Σ_{k=1}^{N} αk* (ǫ + ξk* + yk − wᵀxk − b) − Σ_{k=1}^{N} (ηk ξk + ηk* ξk*)    (6.40)
with positive Lagrange multipliers αk, αk*, ηk, ηk* ≥ 0. The saddle point of the Lagrangian is characterized by

max_{α,α*,η,η*} min_{w,b,ξ,ξ*} L(w, b, ξ, ξ*; α, α*, η, η*)    (6.41)

with conditions for optimality:

∂L/∂w = 0  →  w = Σ_{k=1}^{N} (αk − αk*) xk
∂L/∂b = 0  →  Σ_{k=1}^{N} (αk − αk*) = 0
∂L/∂ξk = 0  →  c − αk − ηk = 0
∂L/∂ξk* = 0  →  c − αk* − ηk* = 0.    (6.42)

The dual problem becomes

max_{α,α*} Q(α, α*) = −(1/2) Σ_{k,l=1}^{N} (αk − αk*)(αl − αl*) xkᵀxl − ǫ Σ_{k=1}^{N} (αk + αk*) + Σ_{k=1}^{N} yk (αk − αk*)    (6.43)

subject to

Σ_{k=1}^{N} (αk − αk*) = 0
αk, αk* ∈ [0, c].    (6.44)

The resulting SVM for linear function estimation is

f(x) = wᵀx + b    (6.45)

with w = Σ_{k=1}^{N} (αk − αk*) xk such that

f(x) = Σ_{k=1}^{N} (αk − αk*) xkᵀx + b.    (6.46)

The support vector expansion is sparse in the sense that many support values will be zero.
6.5.2 SVM for nonlinear function estimation

Consider again a nonlinear mapping to the feature space:

f(x) = wᵀϕ(x) + b    (6.47)

with given data {xk, yk}_{k=1}^{N}. The optimization problem in the primal weight space, which could be infinite dimensional, becomes

min (1/2) wᵀw + c Σ_{k=1}^{N} (ξk + ξk*)    (6.48)

subject to

yk − wᵀϕ(xk) − b ≤ ǫ + ξk
wᵀϕ(xk) + b − yk ≤ ǫ + ξk*
ξk, ξk* ≥ 0    (6.49)

with as resulting dual problem

max_{α,α*} Q(α, α*) = −(1/2) Σ_{k,l=1}^{N} (αk − αk*)(αl − αl*) K(xk, xl) − ǫ Σ_{k=1}^{N} (αk + αk*) + Σ_{k=1}^{N} yk (αk − αk*)    (6.50)

subject to

Σ_{k=1}^{N} (αk − αk*) = 0
αk, αk* ∈ [0, c].    (6.51)

One applies the Mercer condition K(xk, xl) = ϕ(xk)ᵀϕ(xl), which gives

f(x) = Σ_{k=1}^{N} (αk − αk*) K(x, xk) + b.    (6.52)

6.6 Least squares SVM (LS-SVM) classifiers

As we have discussed, SVMs have many nice properties but on the other hand they are computationally heavy when the data sets become larger. Many optimization methods have been studied for the training of SVMs. Interior point methods (Vanderbei, Smola) have been applied to data sets of about 1,000 data points using Vanderbei's LOQO software. In order to handle larger data sets one applies chunking or decomposition methods with subset selection (Vapnik, Osuna), with e.g. SVMlight software applicable to larger datasets. An extreme form of chunking is Platt's SMO (Sequential Minimal Optimization), which selects subsets of two points such that parts of the solution can be found analytically. Successive OverRelaxation (SOR) methods (Mangasarian) have been applied to massive data sets with millions of points (on a supercomputer) and also Linear Programming (LP) SVM formulations (Smola) have been made for which large scale routines exist.
Another approach, introduced in [59, 60], is to simplify the formulations so as to obtain a linear system without losing the advantages of the standard SVM formulation. This is done within the LS-SVM (Least Squares Support Vector Machine) framework (Suykens et al.).
For LS-SVM classifiers one starts from the optimization problem
N
1 1X 2
min J (w, b, e) = w T w + γ e (6.53)
w,b,e 2 2 k=1 k

subject to the equality constraints

yk [w T ϕ(xk ) + b] = 1 − ek , k = 1, ..., N. (6.54)

Hence a squared error (2-norm) cost function is taken with equality


instead of inequality constraints. The Lagrangian becomes
N
X
L(w, b, e; α) = J (w, b, e) − αk {yk [w T ϕ(xk ) + b] − 1 + ek } (6.55)
k=1

where the $\alpha_k$ are Lagrange multipliers. The conditions for optimality are
$$
\left\{
\begin{array}{lll}
\dfrac{\partial L}{\partial w} = 0 & \rightarrow & w = \sum_{k=1}^{N} \alpha_k y_k \varphi(x_k) \\[6pt]
\dfrac{\partial L}{\partial b} = 0 & \rightarrow & \sum_{k=1}^{N} \alpha_k y_k = 0 \\[6pt]
\dfrac{\partial L}{\partial e_k} = 0 & \rightarrow & \alpha_k = \gamma e_k, \quad k = 1, \ldots, N \\[6pt]
\dfrac{\partial L}{\partial \alpha_k} = 0 & \rightarrow & y_k [w^T \varphi(x_k) + b] - 1 + e_k = 0, \quad k = 1, \ldots, N
\end{array}
\right.
$$

which gives the following set of linear equations:
$$
\begin{bmatrix}
I & 0 & 0 & -Z^T \\
0 & 0 & 0 & -Y^T \\
0 & 0 & \gamma I & -I \\
Z & Y & I & 0
\end{bmatrix}
\begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix}
\tag{6.56}
$$
with $Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N]$, $Y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $e = [e_1; \ldots; e_N]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$. After elimination of $w, e$ one obtains
$$
\begin{bmatrix}
0 & Y^T \\
Y & \Omega + \gamma^{-1} I
\end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}
\tag{6.57}
$$
where
$$
\Omega = Z Z^T
\tag{6.58}
$$
and Mercer's condition is applied:
$$
\Omega_{kl} = y_k y_l \, \varphi(x_k)^T \varphi(x_l) = y_k y_l \, K(x_k, x_l).
\tag{6.59}
$$
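A minimal numpy sketch (not the notes' own code; the RBF kernel width convention K(x, z) = exp(−‖x−z‖²/σ²) is an assumption) that builds and solves (6.57) and evaluates the classifier sign(Σk αk yk K(x, xk) + b):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)  (assumed width convention)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

def lssvm_classifier_fit(X, y, gamma, sigma):
    # Build and solve the (N+1) x (N+1) linear system (6.57)
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)   # (6.59)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, np.ones(N)])
    return sol[0], sol[1:]                             # b, alpha

def lssvm_classifier_predict(Xt, X, y, b, alpha, sigma):
    # Classifier: sign( sum_k alpha_k y_k K(x, x_k) + b )
    return np.sign(rbf_kernel(Xt, X, sigma) @ (alpha * y) + b)
```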

An example is shown in Fig. 6.9 of an LS-SVM classifier with RBF kernel applied to a double spiral classification problem. Extensions of LS-SVM classifiers can also be made to multi-class problems, either by taking additional output variables or by treating a multi-class problem as a collection of binary subproblems. In order to solve large scale problems one can apply conjugate gradient algorithms (discussed in Chapter 2). To this end one represents the original problem of the form

$$
\begin{bmatrix}
0 & Y^T \\
Y & H
\end{bmatrix}
\begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \end{bmatrix}
\tag{6.60}
$$
with $H = \Omega + \gamma^{-1} I$, $\xi_1 = b$, $\xi_2 = \alpha$, $d_1 = 0$, $d_2 = \vec{1}$, as
$$
\begin{bmatrix}
s & 0 \\
0 & H
\end{bmatrix}
\begin{bmatrix} \xi_1 \\ \xi_2 + H^{-1} Y \xi_1 \end{bmatrix}
=
\begin{bmatrix} -d_1 + Y^T H^{-1} d_2 \\ d_2 \end{bmatrix}
\tag{6.61}
$$
with $s = Y^T H^{-1} Y > 0$ (since $H = H^T > 0$). This transformed problem has a positive definite matrix, such that the conjugate gradient method for linear systems can be applied to it. When solving larger problems, e.g. with 10,000 to 50,000 data points, the matrices are not stored in memory.
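The following sketch illustrates this route, assuming scipy's conjugate gradient routine is used for the two positive definite solves with H implied by (6.57); for truly large N one would replace the dense H by a matrix-free LinearOperator. Here Y is the (float) label vector and Omega the kernel matrix of (6.59):

```python
import numpy as np
from scipy.sparse.linalg import cg

def lssvm_classifier_fit_cg(Omega, Y, gamma):
    # Solve (6.57) through two conjugate gradient solves with the
    # positive definite matrix H = Omega + I/gamma (a sketch).
    N = len(Y)
    H = Omega + np.eye(N) / gamma
    nu, _ = cg(H, Y)             # nu = H^{-1} Y
    u, _ = cg(H, np.ones(N))     # u  = H^{-1} 1
    s = Y @ nu                   # s = Y^T H^{-1} Y > 0
    b = (Y @ u) / s              # from the first block row of (6.57)
    alpha = u - b * nu           # alpha = H^{-1}(1 - b Y)
    return b, alpha
```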
[Figure 6.9: LS-SVM classifier with RBF kernel applied to a double spiral classification problem with the two classes indicated by 'o' and '*' and 180 training data for each class.]
6.7 LS-SVM for nonlinear function estimation
The LS-SVM model for nonlinear function estimation starts from the primal weight space formulation
$$
y(x) = w^T \varphi(x) + b
\tag{6.62}
$$
with $x \in \mathbb{R}^n$, $y \in \mathbb{R}$. The nonlinear mapping $\varphi(\cdot)$ is similar to the classifier case. Given a training set $\{x_k, y_k\}_{k=1}^{N}$ one formulates the optimization problem
$$
\min_{w,b,e} \; \mathcal{J}(w, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2
\tag{6.63}
$$
subject to the equality constraints
$$
y_k = w^T \varphi(x_k) + b + e_k, \quad k = 1, \ldots, N.
\tag{6.64}
$$

This is a form of ridge regression (but in a possibly infinite dimensional space). The Lagrangian becomes
$$
L(w, b, e; \alpha) = \mathcal{J}(w, e) - \sum_{k=1}^{N} \alpha_k \{ w^T \varphi(x_k) + b + e_k - y_k \}
\tag{6.65}
$$
with Lagrange multipliers $\alpha_k$. The conditions for optimality are
$$
\left\{
\begin{array}{lll}
\dfrac{\partial L}{\partial w} = 0 & \rightarrow & w = \sum_{k=1}^{N} \alpha_k \varphi(x_k) \\[6pt]
\dfrac{\partial L}{\partial b} = 0 & \rightarrow & \sum_{k=1}^{N} \alpha_k = 0 \\[6pt]
\dfrac{\partial L}{\partial e_k} = 0 & \rightarrow & \alpha_k = \gamma e_k, \quad k = 1, \ldots, N \\[6pt]
\dfrac{\partial L}{\partial \alpha_k} = 0 & \rightarrow & w^T \varphi(x_k) + b + e_k - y_k = 0, \quad k = 1, \ldots, N.
\end{array}
\right.
\tag{6.66}
$$

The solution is given by
$$
\begin{bmatrix}
0 & \vec{1}^T \\
\vec{1} & \Omega + \gamma^{-1} I
\end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix}
\tag{6.67}
$$
with $y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and by applying Mercer's condition
$$
\Omega_{kl} = \varphi(x_k)^T \varphi(x_l) = K(x_k, x_l), \quad k, l = 1, \ldots, N.
\tag{6.68}
$$

The resulting LS-SVM model for function estimation is
$$
y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b.
$$
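For concreteness, a minimal numpy sketch (again assuming the RBF width convention K(x, z) = exp(−‖x−z‖²/σ²); not the notes' own code) that solves (6.67) and evaluates the resulting model:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    # Same helper as in the classifier sketch above
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

def lssvm_regression_fit(X, y, gamma, sigma):
    # Solve the (N+1) x (N+1) linear system (6.67) for (b, alpha)
    N = len(y)
    Omega = rbf_kernel(X, X, sigma)                    # (6.68)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, y])
    return sol[0], sol[1:]                             # b, alpha

def lssvm_regression_predict(Xt, X, b, alpha, sigma):
    # y(x) = sum_k alpha_k K(x, x_k) + b
    return rbf_kernel(Xt, X, sigma) @ alpha + b
```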

A drawback of LS-SVMs is that sparseness is lost due to the use of a sum of squared errors cost function. This is also clear from the condition αk = γek. However, one can exploit the fact that the importance of a data point is reflected by its support value. In this way one can also obtain sparseness for LS-SVMs by pruning the support value spectrum: less meaningful data points are gradually removed from the training set (shifting the spectrum of sorted |αk| values). This leads to a sparse approximation as for standard SVMs. One may take the following simple pruning algorithm (a code sketch follows after the list).

LS-SVM pruning

1. Compute the LS-SVM for the N training data.

2. Remove a small fraction of the training set (e.g. 5%), based upon the sorted support value spectrum.

3. Re-estimate the LS-SVM on the reduced training set.

4. Go to 2, unless the user-defined performance index degrades.
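A minimal sketch of this pruning loop, reusing the lssvm_regression_fit sketch from above and leaving the validation-based stopping test of step 4 schematic:

```python
import numpy as np

def lssvm_prune(X, y, gamma, sigma, drop_frac=0.05, n_rounds=10):
    # Iteratively retrain (steps 1 and 3) and drop the points with the
    # smallest |alpha_k| (step 2); in practice the loop stops as soon as
    # a user-defined validation performance degrades (step 4).
    Xr, yr = X, y
    for _ in range(n_rounds):
        b, alpha = lssvm_regression_fit(Xr, yr, gamma, sigma)
        keep = np.argsort(np.abs(alpha))[int(drop_frac * len(yr)):]
        Xr, yr = Xr[keep], yr[keep]
    return Xr, yr
```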


In this algorithm γ, σ can be modified while pruning. While for
MLPs the computation of a Hessian or inverse Hessian is needed this
is not the case for LS-SVM where the pruning information follows
from the solution vector itself. Some examples are shown in Fig.6.11,
Fig.6.12 and Fig.6.13 for function estimation and classification respec-
tively.
[Figure 6.10: Imposing sparseness to LS-SVMs. From the sorted absolute αk values the smallest values are removed and the LS-SVM is re-estimated.]

An important aspect of (LS-)SVMs is also that they generalize well in high-dimensional input spaces. Suppose the inputs are x ∈ Rn and there are N training data. For MLPs the number of unknowns exceeds the number of hidden units times n, while for (LS-)SVMs the number of unknowns is proportional to the number of data points N, regardless of the input dimension. Hence, (LS-)SVMs scale better to high-dimensional input spaces, which is interesting e.g. towards the classification of microarray data in bioinformatics.

6.8 Tuning parameter selection


In the case of an RBF kernel and using LS-SVMs one finds the sup-
port values αk as the solution to a linear system. A good choice of
the tuning parameters (γ, σ) is very important at this point. One has
several possibilities. The simplest (but not necessarily the best) way is
optimization on a separate validation set. In this case the designer is
responsible for defining a meaningful training and validation set. The
generalization performance should be checked on a completely inde-
pendent test set. In a cross-validation approach (γ, σ) are optimized
on the sum of the errors of the sets that were left out in the several
runs. The advantage is that no additional validation set is needed and
one can check the generalization performance on an independent test
set (Fig.6.14). Other approaches are bootstrapping, Bayesian infer-
ence and application of generalization bounds from VC theory.
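A sketch of the validation-set procedure of Fig. 6.14 (the grid of candidate values and the mean squared validation error are assumptions; lssvm_regression_fit/_predict are the sketches from above):

```python
import numpy as np
from itertools import product

def select_tuning_parameters(Xtr, ytr, Xval, yval, gammas, sigmas):
    # Train an LS-SVM for each candidate pair (gamma, sigma) and keep
    # the pair with minimal validation error, as in Fig. 6.14; the
    # generalization performance should then be checked on a test set.
    best = (None, None, np.inf)
    for gamma, sigma in product(gammas, sigmas):
        b, alpha = lssvm_regression_fit(Xtr, ytr, gamma, sigma)
        yhat = lssvm_regression_predict(Xval, Xtr, b, alpha, sigma)
        err = np.mean((yhat - yval) ** 2)
        if err < best[2]:
            best = (gamma, sigma, err)
    return best   # (gamma, sigma, validation error)
```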
[Figure 6.11: LS-SVM sparse approximation of a noiseless sinc function using an RBF kernel: 500 SV → 250 SV → 50 SV.]
[Figure 6.12: LS-SVM sparse approximation of a noisy sinc function with RBF kernel: 500 SV → 250 SV → 50 SV.]
[Figure 6.13: LS-SVM sparse approximation with RBF kernel illustrated on a classification problem: 500 SV → 250 SV → 50 SV.]
[Figure 6.14: Illustration of a simple procedure for selection of (γ, σ) based upon a specific training and validation set: each candidate pair (γ1, σ1), (γ2, σ2), (γ3, σ3) is trained on the training set and yields a validation error; the pair with minimal validation error is selected and the generalization performance is checked on the test set.]
Chapter 7

Conclusions

Growing volumes of data pose interesting new challenges in a wide range of application areas towards black-box modelling, classification and data exploration tools. An important class of methods for this purpose are neural networks. In the last decade important progress has been made in supervised as well as unsupervised learning. Thanks to increasing computer power it has become feasible to apply such techniques to large scale problems. In this course we discussed neural network methods for nonlinear modelling and classification, aspects of decision theory, learning and generalization, unsupervised learning and self-organizing maps, regularization theory and support vector machines, and other issues related to these methods. While some of these techniques were in the past often applied as if they were miracle methods, they are currently quite well understood and demystified, such that they can be applied in a reliable way to many real-life datamining problems. It is to be expected that this will lead to new breakthroughs in the future. We hope that this introductory course may further stimulate the interested reader to consult the specialised literature and that it may serve as a useful guide for tackling important questions in industrial applications as well as academic research.

Bibliography

[1] Bishop C.M., Neural networks for pattern recognition, Oxford


University Press, 1995.

[2] Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.

[3] Cherkassky V., Mulier F., Learning from data: concepts, theory
and methods, John Wiley and Sons, 1998.

[4] Cristianini N., Shawe-Taylor J., An introduction to support vec-


tor machines, Cambridge University Press, 2000.

[5] Devroye L., Györfi L., Lugosi G., A Probabilistic Theory of Pat-
tern Recognition, NY: Springer, 1996.

[6] Duda R.O., Hart P.E., Stork D. G., Pattern Classification (2ed.),
Wiley, 2001.

[7] Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurasamy R.


(Ed.), Advances in Knowledge Discovery and Data Mining, MIT
Press, 1996.

[8] Fletcher R., Practical methods of optimization, Chichester and


New York: John Wiley and Sons, 1987.

[9] Hastie T., Tibshirani R., Friedman J., The elements of statistical
learning, Springer-Verlag, 2001.

[10] Hastie T., Tibshirani R., Wainwright M., Statistical Learning


with Sparsity: The Lasso and Generalizations, Chapman &
Hall/CRC, 2015.

[11] Haykin S., Neural Networks: a Comprehensive Foundation,


Macmillan College Publishing Company: Englewood Cliffs, 1994.


[12] Kohonen T., Self-Organizing Maps, Springer Series in Informa-


tion Sciences, Vol. 30, 1997.

[13] MacKay D.J.C., Information Theory, Inference and Learning


Algorithms, book in preparation, available at http://wol.ra.phy.cam.ac.uk/mackay/

[14] Ripley B.D., Pattern Recognition and Neural Networks, Cam-


bridge: Cambridge University Press, 1996.

[15] Ritter H., Martinetz T., Schulten K., Neural Computation and
Self-Organizing Maps: An Introduction, Addison-Wesley, Read-
ing, MA, 1992.

[16] Schölkopf B., Burges C., Smola A., Advances in Kernel Methods:
Support Vector Learning, MIT Press, Cambridge, MA, December
1998.

[17] Suykens J.A.K., Vandewalle J., De Moor B., Artificial Neu-


ral Networks for Modelling and Control of Non-Linear systems,
Kluwer Academic Publishers, Boston, 1996.

[18] Suykens J.A.K., Vandewalle J. (Eds.) Nonlinear Modeling:


advanced black-box techniques, Kluwer Academic Publishers,
Boston, 1998.

[19] Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B.,
Vandewalle J., Least Squares Support Vector Machines, World
Scientific, Singapore, 2002.

[20] Vapnik V., The Nature of Statistical Learning Theory, Springer-


Verlag, 1995.

[21] Vapnik V., Statistical learning theory, John Wiley, New-York,


1998.

[22] Weigend A.S., Gershenfeld N.A. (Eds.), Time Series Prediction:


Forecasting the Future and Understanding the Past, Addison-
Wesley, 1994.

[23] Barron A.R., “Universal approximation bounds for superposition


of a sigmoidal function,” IEEE Transactions on Information The-
ory, Vol.39, No.3, pp.930-945, 1993.

[24] Bassett D.E., Eisen M.B., Boguski M.S., “Gene expression in-
formatics - it’s all in your mine,” Nature Genetics, supplement,
Vol.21, pp.51-55, Jan 1999.

[25] Bengio Y., “Learning deep architectures for AI,” Foundations and
trends in Machine Learning, 2(1): 1-127, 2009.

[26] Brown P., Botstein D., “Exploring the new world of the genome
with DNA microarrays,” Nature Genetics, supplement, Vol.21,
pp.33-37, Jan 1999.

[27] Burges C., “A Tutorial on Support Vector Machines for Pattern


Recognition,” Knowledge Discovery and Data Mining, 2(2), 1998.

[28] Chen M., Mao S., Liu Y., “Big Data: A Survey,” Mobile Networks
and Applications, 19(2), 171-209, April 2014.

[29] Chen S., Billings S., Grant P., “Nonlinear system identification
using neural networks,” International Journal of Control, Vol.51,
No.6, pp.1191-1214, 1990.

[30] Chen S., Billings S., “Neural networks for nonlinear dynamic sys-
tem modelling and identification,” International Journal of Con-
trol, Vol.56, No.2, pp.319-346, 1992.

[31] Dietterich, T. G., “Machine Learning Research: Four Current


Directions,” AI Magazine, 18 (4), 97-136, 1997.

[32] Espinoza M., Suykens J.A.K., Belmans R., De Moor B., “Electric
Load Forecasting,” IEEE Control Systems Magazine, Vol. 27, No.
5, pp. 43-57, Oct. 2007.

[33] Evgeniou T., Pontil M., Poggio T., “Regularization networks and


support vector machines,” Advances in Computational Mathe-
matics, Vol.13, No.1, pp.1-50, 2000.

[34] Fayyad U., Haussler D., Stolorz P. “Mining Scientific data,” Com-
munications of the ACM, Vol.39, No.11, pp.51-57, 1996.

[35] Freund Y., Schapire R.E., “A short introduction to boosting,”


Journal of Japanese Society for Artificial Intelligence, 14(5):771-
780, 1999.

[36] Glymour C., Madigan D., Pregibon D., Smyth P., “Statistical
inference and data mining,” Communications of the ACM, Vol.39,
No.11, pp.35-41, 1996.
[37] Guyon I., Matic N., Vapnik V., “Discovering informative pat-
terns and data cleaning,” in U.M. Fayyad, G. Piatetsky-Shapiro,
P. Smyth, and R. Uthurusamy, Eds., Advances in Knowledge Dis-
covery and Data Mining, pp. 181-203, MIT Press, 1996.
[38] Heckerman D., “A tutorial on learning with Bayesian networks,”
Technical Report MSR-TR-95-06, Microsoft Research, March,
1995.
[39] Hornik K., Stinchcombe M., White H., “Multilayer feedforward
networks are universal approximators,” Neural Networks, Vol.2,
pp.359-366, 1989.
[40] Jain A., Mao J., Mohiuddin K., “Artificial neural networks: a
tutorial,” IEEE Computer, Vol.29, No.3, pp.31-44, 1996.
[41] Jones N., “Computer science: The learning machines,” Nature,
news feature, 8 Jan 2014.
[42] Kohonen T., “The self-organizing map,” Proc. IEEE, Vol.78,
No.9, pp.1464-1480, 1990.
[43] Kohonen T., Kaski S., Lagus K., Salojärvi J., Paatero V., Saarela
A., “Organization of a massive document collection,” IEEE
Transactions on Neural Networks (special issue on neural net-
works for data mining and knowledge discovery), Vol.11, No.3,
pp. 574-586, 2000.
[44] Lerouge E., Moreau Y., Verrelst H., Vandewalle J., Stoermann
C., Gosset P., Burge P., “Detection and management of fraud in
UMTS networks”, in Proc. of the Third International Conference
on The Practical Application of Knowledge Discovery and Data
Mining (PADD99), London, UK, Apr. 1999, pp. 127-148.
[45] MacKay D.J.C, “Bayesian interpolation,” Neural Computation,
4(3): 415-447, 1992.
[46] MacKay D.J.C, “A practical Bayesian framework for backpropa-
gation networks,” Neural Computation, 4(3): 448-472, 1992.

[47] Mjolsness E., DeCoste D., “Machine Learning for Science: State
of the Art and Future Prospects,” Science, Vol.293, pp. 2051-
2055, 2001.
[48] Møller M.F., “A scaled conjugate gradient algorithm for fast su-
pervised learning,” Neural Networks, Vol.6, pp.525-533, 1993.
[49] Morgan N., Bourlard H., “Continuous speech recognition: an in-
troduction to the hybrid HMM/connectionist approach,” IEEE
Signal Processing Magazine, pp.25-42, May 1995.
[50] Narendra K.S., Parthasarathy K., “Gradient methods for the
optimization of dynamical systems containing neural networks,”
IEEE Transactions on Neural Networks, Vol.2, No.2, pp.252-262,
1991.
[51] Poggio T., Girosi F., “Networks for approximation and learning,”
Proceedings of the IEEE, Vol.78, No.9, pp.1481-1497, 1990.
[52] Reed R., “Pruning algorithms - a survey,” IEEE Transactions on
Neural Networks, Vol.4, No.5, pp.740-747, 1993.
[53] Rumelhart D.E., Hinton G.E., Williams R.J., “Learning represen-
tations by back-propagating errors,” Nature, Vol.323, pp.533-536,
1986.
[54] Schölkopf B., Sung K.-K., Burges C., Girosi F., Niyogi P., Poggio
T., Vapnik V., “Comparing support vector machines with Gaus-
sian kernels to radial basis function classifiers,” IEEE Transac-
tions on Signal Processing, Vol.45, No.11, pp.2758-2765, 1997.
[55] Sjöberg J., Zhang Q., Ljung L., Benveniste A., Delyon B., Glo-
rennec P., Hjalmarsson H., Juditsky A., “Nonlinear black-box
modeling in system identification: a unified overview,” Automat-
ica, Vol.31, No.12, pp.1691-1724, 1995.
[56] Smola A., Schölkopf B., “A Tutorial on Support Vector Regres-
sion,” NeuroCOLT Technical Report NC-TR-98-030, Royal Hol-
loway College, University of London, UK, 1998.
[57] Smola A., Schölkopf B., Müller K.-R., “The connection between
regularization operators and support vector kernels,” Neural Net-
works, 11, 637-649, 1998.

[58] Suykens J.A.K., Vandewalle J., De Moor B., “NLq theory: check-
ing and imposing stability of recurrent neural networks for nonlin-
ear modelling,” IEEE Transactions on Signal Processing (special
issue on neural networks for signal processing), Vol.45, No.11, pp.
2682-2691, Nov. 1997.

[59] Suykens J.A.K., Vandewalle J. “Least squares support vector ma-


chine classifiers,” Neural Processing Letters, Vol.9, No.3, pp.293-
300, June 1999.

[60] Suykens J.A.K., “Least squares support vector machines for clas-
sification and nonlinear modelling,” Neural Network World (Spe-
cial Issue on PASE 2000), Vol.10, No.1-2, pp.29-48, 2000.

[61] Van Calster B., Timmerman D., Lu C., Suykens J.A.K., Valentin
L., Van Holsbeke C., Amant F., Vergote I., Van Huffel S., “Preop-
erative diagnosis of ovarian tumors using Bayesian kernel-based
methods”, Ultrasound in Obstetrics and Gynecology, vol. 29, no.
5, May 2007, pp. 496-504.

[62] van der Smagt P.P., “Minimisation methods for training feedfor-
ward neural networks,” Neural Networks, Vol.7, No.1, pp.1-11,
1994.

[63] Van Gestel T., Suykens J.A.K., Baesens B., Viaene S., Vanthienen
J., Dedene G., De Moor B., Vandewalle J., “Benchmarking Least
Squares Support Vector Machine Classifiers,” Machine Learning,
vol. 54, no. 1, Jan. 2004, pp. 5-32.

[64] Werbos P., “Backpropagation through time: what it does and


how to do it,” Proceedings of the IEEE, 78 (10), pp.1550-1560,
1990.

[65] Zadeh L.A., “Fuzzy logic, neural networks and soft computing,”
Communications of the ACM, Vol.37, No.3, pp.77-84, 1994.
