$y = f(v) = f(wx + b) = f(w_1 x_1 + w_2 x_2 + \dots + w_R x_R + b)$, with input $x \in \mathbb{R}^R$ and weight vector $w \in \mathbb{R}^{1 \times R}$
2012 Primož Potočnik, NEURAL NETWORKS (2) Neuron Model, Network Architectures and Learning, #50
2.2 Activation functions (1/2)
Activation function defines the output of a neuron
Types of activation functions
Threshold function Linear function Sigmoid function
Threshold function: $y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}$
Linear function: $y = \varphi(v) = v$
Sigmoid function: $y = \varphi(v) = \dfrac{1}{1 + \exp(-v)}$
(figure: plots of the three activation functions, output $y$ versus induced local field $v$)
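As a quick sketch of these definitions (in Python with NumPy rather than the course's MATLAB, so purely illustrative), the three activation functions can be written as:

```python
import numpy as np

def threshold(v):
    """Hard-limit (threshold) activation: 1 if v >= 0, else 0."""
    return np.where(np.asarray(v) >= 0, 1.0, 0.0)

def linear(v):
    """Linear activation: output equals the induced local field v."""
    return np.asarray(v, dtype=float)

def sigmoid(v):
    """Logistic sigmoid: smooth squashing of v into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(v, dtype=float)))

v = np.array([-2.0, 0.0, 2.0])
print(threshold(v))   # [0. 1. 1.]
print(linear(v))      # [-2.  0.  2.]
print(sigmoid(0.0))   # 0.5
```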
Activation functions (2/2)
McCulloch-Pitts Neuron (1943)
Vector input, threshold activation function
Extremely simplified model of real biological neurons
Missing features: non-binary outputs, non-linear summation, smooth thresholding,
stochasticity, temporal information processing
Nevertheless, computationally very powerful
Network of McCulloch-Pitts neurons is capable of universal computation
$y = \varphi(v) = \mathrm{sgn}(wx + b) = \begin{cases} 1 & \text{if } wx + b \ge 0 \\ 0 & \text{if } wx + b < 0 \end{cases}$, with input $x \in \mathbb{R}^R$
The output is binary, depending on whether
the input meets a specified threshold
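A McCulloch-Pitts neuron fits in a few lines of code; the sketch below (Python, illustrative) uses AND-gate weights that are our own choice, not from the slides:

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """McCulloch-Pitts neuron: hard threshold on the weighted sum wx + b."""
    return 1 if np.dot(w, x) + b >= 0 else 0

# Choosing w = [1, 1], b = -1.5 makes the neuron compute logical AND:
# it fires only when both binary inputs are 1.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(np.array(x, dtype=float), w, b))
```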
Matlab notation
Presentation of more complex neurons and networks
Input vector p is represented by the solid dark vertical bar [R x 1]
Weight vector is shown as single-row, R-column matrix W [1 x R]
p and W multiply into scalar Wp
Matlab Demos
nnd2n1 One input neuron
nnd2n2 Two input neuron
2.3 Network architectures
About network architectures
Two or more of the neurons can be combined in a layer
Neural network can contain one or more layers
Strong link between network architecture and learning algorithm
1. Single-layer feedforward networks
Input layer of source nodes projects onto an output layer of neurons
Single-layer refers to the output layer (the only computation layer)
2. Multi-layer feedforward networks
One or more hidden layers
Can extract higher-order statistics
3. Recurrent networks
Contains at least one feedback loop
Powerful temporal learning capabilities
Single-layer feedforward networks
Multi-layer feedforward networks (1/2)
Multi-layer feedforward networks (2/2)
Data flow strictly feedforward: input → output
No feedback → static network, easy learning
Recurrent networks (1/2)
Also called Dynamic networks
Output depends on
current input to the network (as in static networks)
and also on current or previous inputs, outputs, or states of the network
Simple recurrent network
Delay
Feedback loop
Recurrent networks (2/2)
Layered Recurrent Dynamic Network example
2.4 Learning algorithms
Important ability of neural networks
To learn from its environment
To improve its performance through learning
Learning process
1. Neural network is stimulated by an environment
2. Neural network undergoes changes in its free parameters as a result of this
stimulation
3. Neural network responds in a new way to the environment because of its
changed internal structure
Learning algorithm
Prescribed set of defined rules for the solution of a learning problem
1. Error correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning
2012 Primo Potonik NEURAL NETWORKS (2) Neuron Model, Network
Architectures and Learning
#62
Error-correction learning (1/2)
1. Neural network is driven by input x(t) and responds with output y(t)
2. Network output y(t) is compared with target output d(t)
Error signal = difference between target output and network output
$e(t) = d(t) - y(t)$
(block diagram: input $x(t)$ → network → output $y(t)$, compared with target $d(t)$ to produce error $e(t)$)
Error-correction learning (2/2)
Error signal drives a control mechanism that corrects the synaptic weights
Corrective adjustments designed to make network output y(t)
closer to target d(t)
Learning achieved by minimizing instantaneous error energy
Delta learning rule (Widrow-Hoff rule)
Adjustment to a synaptic weight of a neuron is proportional to the product of the error signal
and the input signal of the synapse
Comments
Error signal must be directly measurable
Key parameter: learning rate η
Closed-loop feedback system → stability determined by the learning rate
$\mathcal{E}(t) = \tfrac{1}{2}\, e^2(t)$
$\Delta w(t) = \eta\, e(t)\, x(t)$
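A minimal sketch of the delta rule in action (Python/NumPy, illustrative; the target linear mapping below is our own choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target linear mapping the neuron should learn (our choice)
true_w, true_b = np.array([2.0, -1.0]), 0.5

w, b = np.zeros(2), 0.0
eta = 0.1  # learning rate

for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=2)  # input x(t)
    d = true_w @ x + true_b             # target d(t)
    y = w @ x + b                       # linear neuron output y(t)
    e = d - y                           # error signal e(t) = d(t) - y(t)
    w += eta * e * x                    # delta rule: dw(t) = eta * e(t) * x(t)
    b += eta * e

print(w, b)  # close to [2, -1] and 0.5
```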
Memory-based learning
All (or most) past experiences are stored in a memory
of input-output pairs (inputs and target classes)
Two essential ingredients of memory-based learning
1. Define the local neighborhood of a new input $x_{new}$
2. Apply a learning rule to adapt the stored examples in the local neighborhood of $x_{new}$
Examples of memory-based learning
Nearest neighbor rule
Local neighborhood defined by the nearest training example (Euclidean distance)
K-nearest neighbor classifier
Local neighborhood defined by the k nearest training examples → robust against outliers
Radial basis function network
Selecting the centers of basis functions
Stored memory of input-output pairs: $\{(x_i, y_i)\}_{i=1}^{N}$
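The nearest neighbor rule can be sketched directly from this definition (Python, illustrative; the tiny stored memory below is invented for the example):

```python
import numpy as np

def nearest_neighbor_classify(x_new, X, y):
    """Nearest neighbor rule: x_new gets the class of the closest stored
    example, with the local neighborhood defined by Euclidean distance."""
    distances = np.linalg.norm(X - x_new, axis=1)
    return y[np.argmin(distances)]

# Tiny illustrative memory of stored (input, class) pairs
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

print(nearest_neighbor_classify(np.array([0.2, 0.1]), X, y))  # 0
print(nearest_neighbor_classify(np.array([0.8, 0.9]), X, y))  # 1
```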
Hebbian learning
The oldest and most famous learning rule (Hebb, 1949)
Formulated as associative learning in a neurobiological context
"When an axon of a cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic changes take place
in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
Strong physiological evidence for Hebbian learning in hippocampus,
important for long term memory and spatial navigation
Hebbian learning (Hebbian synapse)
Time dependent, highly local, and strongly interactive mechanism to increase
synaptic efficiency as a function of the correlation between the presynaptic and
postsynaptic activities.
1. If two neurons on either side of a synapse are activated simultaneously, then the
strength of that synapse is selectively increased
2. If two neurons on either side of a synapse are activated asynchronously, then that
synapse is selectively weakened or eliminated
Simplest form of Hebbian learning
$\Delta w(t) = \eta\, y(t)\, x(t)$
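A sketch of this simplest Hebbian rule (Python, illustrative values): repeatedly presenting a pattern strengthens the weights along it. Note that the pure rule grows the weights without bound; practical variants add normalization (e.g. Oja's rule).

```python
import numpy as np

def hebbian_update(w, x, eta=0.1):
    """Simplest Hebbian rule: dw = eta * y * x with y = w.x, so correlated
    pre- and postsynaptic activity strengthens the synapse."""
    y = w @ x
    return w + eta * y * x

w = np.array([0.1, 0.1])
x = np.array([1.0, 1.0])      # repeatedly presented input pattern
for _ in range(5):
    w = hebbian_update(w, x)
print(w)  # weights grow along the presented pattern
```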
Competitive learning
Competitive learning network architecture
1. Set of inputs, connected to a layer of outputs
2. Each output neuron receives excitation from all inputs
3. Output neurons of a neural network compete to
become active by exchanging lateral inhibitory connections
4. Only a single neuron is active at any time
Competitive learning rule
Neuron with the largest induced local field becomes a winning neuron
Winning neuron shifts its synaptic weights toward the input
Individual neurons specialize on ensembles of similar patterns
→ feature detectors for different classes of input patterns
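A winner-take-all sketch of competitive learning (Python, illustrative data and starting weights): with normalized weights, the largest induced local field is equivalent to the smallest Euclidean distance, which is what the code uses for the competition.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two clusters of input patterns (illustrative data)
data = np.vstack([rng.normal([0.0, 0.0], 0.1, (50, 2)),
                  rng.normal([1.0, 1.0], 0.1, (50, 2))])
rng.shuffle(data)

# One weight vector per output neuron (illustrative starting points)
W = np.array([[0.2, 0.1], [0.7, 0.9]])
eta = 0.1

for _ in range(5):                                         # epochs
    for x in data:
        winner = np.argmin(np.linalg.norm(W - x, axis=1))  # competition
        W[winner] += eta * (x - W[winner])                 # shift winner toward input

print(W)  # rows settle near the cluster centers [0, 0] and [1, 1]
```

Each neuron ends up as a feature detector for one cluster of input patterns, as described above.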
2.5 Learning paradigms
Learning algorithm
Prescribed set of defined rules for the solution of a learning problem
Learning paradigm
Manner in which a neural network relates to its environment
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1. Error correction learning
2. Memory-based learning
3. Hebbian learning
4. Competitive learning
Supervised learning
Learning with a teacher
Teacher has a knowledge of the environment
Knowledge is represented by a set of input-output examples
Learning algorithm
Error-correction learning
Memory-based learning
(diagram: the Environment feeds both the Teacher and the Learning system; the Teacher supplies the target response (= optimal action), which is compared with the learning system output to form the error signal)
Unsupervised learning
Unsupervised or self-organized learning
No external teacher to oversee the learning process
Only a set of input examples is available, no output examples
Unsupervised NNs usually perform some kind of data compression, such as
dimensionality reduction or clustering
Learning algorithms
Hebbian learning
Competitive learning
Reinforcement learning
No teacher, environment only offers primary reinforcement signal
System learns under delayed reinforcement
Temporal sequence of inputs which result in the generation of a reinforcement signal
Goal is to minimize the expectation of the cumulative cost of actions taken over
a sequence of steps
RL is realized through two neural networks:
Critic and Learning system
Critic network converts the primary reinforcement signal (obtained directly from the environment) into a higher-quality heuristic reinforcement signal, which solves the temporal credit assignment problem
(diagram: the Learning system sends actions to the Environment; the Environment returns primary reinforcement to the Critic, which passes heuristic reinforcement to the Learning system)
2.6 Learning tasks (1/7)
1. Pattern Association
Associative memory is brain-like distributed memory that learns by association
Two phases in the operation of associative memory
1. Storage
2. Recall
Autoassociation
Neural network stores a set of patterns by repeatedly presenting them to the network
Then, when presented with a distorted pattern, the neural network is able to recall the original
pattern
Unsupervised learning algorithms
Heteroassociation
Set of input patterns is paired with arbitrary set of output patterns
Supervised learning algorithms
2. Pattern Recognition
In pattern recognition, input signals are assigned to categories (classes)
Two phases of pattern recognition
1. Learning (supervised)
2. Classification
Statistical nature of pattern recognition
Patterns are represented in multidimensional
decision space
Decision space is divided by separate
regions for each class
Decision boundaries are determined by a
learning process
Support-Vector-Machine example
Learning tasks (2/7)
3. Function Approximation
Arbitrary nonlinear input-output mapping $y = f(x)$ can be approximated by a neural network, given a set of labeled examples $\{(x_i, y_i)\}_{i=1}^{N}$
The task is to approximate the mapping f(x) by a neural network F(x)
so that f(x) and F(x) are close enough:
$\|F(x) - f(x)\| < \varepsilon$ for all $x$ ($\varepsilon$ is a small positive number)
Neural network mapping F(x) can be realized by supervised learning
(error-correction learning algorithm)
Important function approximation tasks
System identification
Inverse system
Learning tasks (3/7)
Learning tasks (4/7)
System identification (diagram): the environment drives the unknown system and the neural network in parallel; the error signal is the difference between the unknown system response and the network output
Inverse system (diagram): the system output is fed to the neural network, whose output is compared with the original inputs from the environment to form the error signal
4. Control
Neural networks can be used to control a plant (a process)
Brain is the best example of a parallel distributed generalized controller
Operates thousands of actuators (muscles)
Can handle nonlinearity and noise
Can handle invariances
Can optimize over long-range planning horizon
Feedback control system (Model reference control)
NN controller has to supply inputs that will drive a plant according to a reference
Learning tasks (5/7)
Model predictive control
NN model provides multi-step ahead predictions for optimizer
5. Filtering
Filter: a device or algorithm used to extract information about a prescribed
quantity of interest from a noisy data set
Filters can be used for three basic information processing tasks:
1. Filtering
Extraction of information at discrete time n by using measured data up to and including
time n
Examples: Cocktail party problem, Blind source separation
2. Smoothing
Differs from filtering in:
a) Data need not be available at time n
b) Data measured later than n can be used to obtain this information
3. Prediction
Deriving information about the quantity in the future at time n+h, h>0, by using data
measured up to and including time n
Example: Forecasting of energy consumption, stock market prediction
Learning tasks (6/7)
6. Beamforming
Spatial form of filtering, used to distinguish between the spatial properties of a
target signal and background noise
Device is called a beamformer
Beamforming is used in the human auditory response and by echo-locating bats
→ the task is suitable for neural network application
Common beamforming tasks: radar and sonar systems
Task is to detect a target in the presence of receiver noise and interfering signals
Target signal originates from an unknown direction
No a priori information available on interfering signals
Neural beamformer, neuro-beamformer, attentional neurocomputers
Learning tasks (7/7)
Adaptation
Learning has spatio-temporal nature
Space and time are fundamental dimensions of learning (control, beamforming)
1. Stationary environment
Learning under the supervision of a teacher, weights then frozen
Neural network then relies on memory to exploit past experiences
2. Nonstationary environment
Statistical properties of environment change with time
Neural network should continuously adapt its weights in real-time
Adaptive system → continuous learning
3. Pseudostationary environment
Changes are slow over a short temporal window
Speech: stationary over intervals of 10-30 ms
Ocean radar: stationary over intervals of several seconds
2.7 Knowledge representation
What is knowledge?
Stored information or models used by a person or machine to interpret, predict,
and appropriately respond to the outside world (Fischler & Firschein, 1987)
Knowledge representation
Good solution depends on a good representation of knowledge
Knowledge of the world consists of:
1. Prior information: facts about what is and what has been known
2. Observations of the world: measurements obtained through sensors designed
to probe the environment
Observations can be:
1. Labeled: input signals are paired with desired responses
2. Unlabeled: input signals only
Knowledge representation in NN
Design of neural networks based directly on real-life data
Examples to train the neural network are taken from observations
Examples to train neural network can be
Positive examples ... input and correct target output
e.g. sonar data + echos from submarines
Negative examples ... input and false output
e.g. sonar data + echos from marine life
Knowledge representation in neural networks
Defined by the values of free parameters (synaptic weights and biases)
Knowledge is embedded in the design of a neural network
Interpretation problem: neural networks suffer from an inability to explain how a
result (decision / prediction / classification) was obtained
Serious limitation for safety-critical applications (medical diagnosis, air traffic)
Explanation capability can be achieved by integrating NNs with other artificial intelligence methods
Knowledge representation rules for NN
Rule 1
Similar inputs from similar classes should produce similar
representations inside the network, and should be classified to the
same category
Rule 2
Items to be categorized as separate classes should be given
widely different representations in the network
Rule 3
If a particular feature is important, then there should be a large
number of neurons involved in the representation of that item in the
network
Rule 4
Prior information and invariances should be built into the design of
a neural network, thereby simplifying the network design by not
having to learn them
Prior information and invariances (Rule 4)
Application of Rule 4 results in neural networks with
specialized structure
Biological visual and auditory networks are highly specialized
Specialized network has a smaller number of parameters
→ needs less training data
→ faster learning
→ faster network throughput
→ cheaper because of its smaller size
How to build prior information into neural network design
Currently no well-defined rules, but useful ad hoc procedures exist
We may use a combination of two techniques
1. Receptive fields: restricting the network architecture by using local connections
2. Weight sharing: several neurons share the same synaptic weights
How to build invariances into NN
Character recognition example
Transformations: the pattern recognition system should be invariant to them
Techniques
1. Invariance by neural network structure
2. Invariance by training
3. Invariant feature space
(figure: original character and its transformations: size, rotation, shift, incomplete image)
Invariant feature space
Neural net classifier with invariant feature extractor
Features
Characterize the essential information content of the input data
Should be invariant to transformations of the input
Benefits
1. Dimensionality reduction: the number of features is small compared to the original
input space
2. Relaxed design requirements for a neural network
3. Invariances for all objects can be assured (for known transformations)
Prior knowledge is required!
(diagram: Input → Invariant feature extractor → Neural network classifier → Class estimate)
Example 2A (1/4)
Invariant character recognition
Problem: distinguishing handwritten characters a and b
Classifier design
Image representation
Grid of pixels (typically 256x256) with gray level [0..1] (typically 8-bit coding)
Class estimate: A or B
(diagram: input image → Invariant feature extractor → Neural network classifier → class estimate)
Example 2A (2/4)
Problems with image representation
1. Invariance problem (various transformations)
2. High dimensionality problem
Image size 256×256 → 65536 inputs
Curse of dimensionality: increasing input dimensionality leads to sparse data,
which provides a very poor representation of the mapping
→ problems with correct classification and generalization
Possible solution
Combining inputs into features
Goal is to obtain just a few features instead of 65536 inputs
Ideas for feature extraction (for character recognition):
$F_1 = \dfrac{\text{character height}}{\text{character width}}$
Example 2A (3/4)
Feature extraction
Extracted feature: $F_1 = \dfrac{\text{character height}}{\text{character width}}$
Distribution of $F_1$ for various samples from classes A and B
(figure: overlapping distributions of $F_1$ for samples from class A and class B, with a decision threshold marking the "Class A" and "Class B" regions)
Overlapping distributions → need for additional features $F_1, F_2, F_3, \ldots$
Example 2A (4/4)
Classification in multi-feature space
Classification in the space of two features $(F_1, F_2)$
Neural network can be used for classification in the feature space $(F_1, F_2)$
→ 2 inputs instead of 65536 original inputs
→ Improved generalization and classification ability
(figure: samples from classes A and B in the $(F_1, F_2)$ plane, separated by a decision boundary)
Generalization and model complexity
What is the optimal decision boundary?
Best generalization is achieved by a model whose complexity is
neither too small nor too large
Occam's razor principle: we should prefer simpler models to more
complex models
Tradeoff: modeling simplicity vs. modeling capacity
(figure, three panels: a linear classifier is insufficient → false classifications; the optimal classifier?; an over-fitted classifier → correct classification but poor generalization)
2.8 Neural networks vs. stat. methods (1/3)
Considerable overlap between neural nets and statistics
Statistical inference means learning to generalize from noisy data
Feedforward nets are a subset of the class of nonlinear regression and discrimination models
Application of statistical theory to neural networks: Bishop (1995), Ripley (1996)
Most NN that can learn to generalize effectively from
noisy data are similar or identical to statistical methods
Single-layered feedforward nets are basically generalized linear models
Two-layer feedforward nets are closely related to projection pursuit regression
Probabilistic neural nets are identical to kernel discriminant analysis
Kohonen nets for adaptive vector quantization are similar to k-means cluster analysis
Kohonen self-organizing maps are discrete approximations to principal curves and surfaces
Hebbian learning is closely related to principal component analysis
Some neural network areas have no relation to statistics
Reinforcement learning
Stopped training (similar to shrinkage estimation, but the method is quite different)
Neural networks vs. statistical methods (2/3)
Many statistical methods can be used for flexible
nonlinear modeling
Polynomial regression, Fourier series regression
K-nearest neighbor regression and discriminant analysis
Kernel regression and discriminant analysis
Wavelet smoothing, Local polynomial smoothing
Smoothing splines, B-splines
Tree-based models (CART, AID, etc.)
Multivariate adaptive regression splines (MARS)
Projection pursuit regression, various Bayesian methods
Why use neural nets rather than statistical methods?
Multilayer perceptron (MLP) tends to be useful in similar situations as projection pursuit
regression, i.e.:
the number of inputs is fairly large,
many of the inputs are relevant, but
most of the predictive information lies in a low-dimensional subspace
Some advantages of MLPs over projection pursuit regression
computing predicted values from MLPs is simpler and faster
MLPs are better at learning moderately pathological functions than are many other
methods with stronger smoothness assumptions
Neural networks vs. statistical methods (3/3)
Neural Network Jargon → Statistical Jargon
Generalizing from noisy data → Statistical inference
Neuron, unit, node → A simple linear or nonlinear computing element that accepts one or more inputs and computes a function thereof
Neural networks → A class of flexible nonlinear regression and discriminant models, data reduction models, and nonlinear dynamical systems
Architecture → Model
Training, Learning, Adaptation → Estimation, Model fitting, Optimization
Classification → Discriminant analysis
Mapping, Function approximation → Regression
Competitive learning → Cluster analysis
Hebbian learning → Principal components
Training set → Sample, Construction sample
Input → Independent variables, Predictors, Regressors, Explanatory variables, Carriers
Output → Predicted values
Generalization → Interpolation, Extrapolation, Prediction
Prediction → Forecasting
MATLAB example
nn02_neuron_output
MATLAB example
nn02_custom_nn
MATLAB example
nnstart
2012 Primož Potočnik, NEURAL NETWORKS (3) Perceptrons and Linear Filters, #99
3. Perceptrons and Linear Filters
3.1 Perceptron neuron
3.2 Perceptron learning rule
3.3 Perceptron network
3.4 Adaline
3.5 LMS learning rule
3.6 Adaline network
3.7 ADALINE vs. Perceptron
3.8 Adaptive filtering
3.9 XOR problem
Introduction
Pioneering neural network contributions
McCulloch & Pitts (1943): the idea of neural networks as computing machines
Rosenblatt (1958): proposed the perceptron as the first supervised learning model
Widrow and Hoff (1961): least-mean-square learning as an important
generalization of perceptron learning
Perceptron
Layer of McCulloch-Pitts neurons with adjustable synaptic weights
Simplest form of a neural network for classification of linearly separable patterns
Perceptron convergence theorem for two linearly separable classes
Adaline
Similar to perceptron, trained with LMS learning
Used for linear adaptive filters
3.1 Perceptron neuron
Perceptron neuron (McCulloch-Pitts neuron):
hard-limit (threshold) activation function
Perceptron output: 0 or 1 → useful for classification
If y=0, pattern belongs to class A
If y=1, pattern belongs to class B
$y = \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases}$
(diagram: inputs $x_1 \ldots x_R$ → weighted sum $v$ → hard-limit activation → output $y$)
Linear discriminant function
Perceptron with two inputs
Separation between the two classes is a straight line, given by
$w_1 x_1 + w_2 x_2 + b = 0 \quad\Rightarrow\quad x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{b}{w_2}$
Geometric representation
Perceptron represents a linear discriminant function
$y = f(wx + b) = f(w_1 x_1 + w_2 x_2 + b)$
(figure: the separating line in the $(x_1, x_2)$ plane and the two-input perceptron diagram)
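A small sketch of this geometry (Python, with illustrative weight values of our own choosing): the boundary line follows directly from the weights, and points on opposite sides of it receive different classes.

```python
# Hypothetical two-input perceptron parameters (illustrative values)
w1, w2, b = 1.0, 2.0, -1.0

def perceptron(x1, x2):
    """Two-input perceptron with hard-limit activation."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

# Decision boundary w1*x1 + w2*x2 + b = 0, i.e. x2 = -(w1/w2)*x1 - b/w2
slope, intercept = -w1 / w2, -b / w2
print(slope, intercept)      # -0.5 0.5

# Points on opposite sides of the line fall into different classes
print(perceptron(0.0, 1.0))  # above the line -> 1
print(perceptron(0.0, 0.0))  # below the line -> 0
```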
Matlab Demos (Perceptron)
nnd2n2 Two input perceptron
nnd4db Decision boundaries
How to train a perceptron?
How to train weights and bias?
Perceptron learning rule
Least-mean-square learning rule or delta rule
Both are iterative learning procedures
1. A learning sample is presented to the network
2. For each network parameter, the new value is computed by adding a correction
Formulation of the learning problem
How do we compute Δw(n) and Δb(n) in order to classify the learning patterns
correctly?
$w_j(n+1) = w_j(n) + \Delta w_j(n)$
$b(n+1) = b(n) + \Delta b(n)$
(diagram: perceptron with inputs $x_1 \ldots x_R$, weighted sum $v$, output $y$)
3.2 Perceptron learning rule
A set of learning samples (inputs and target classes): $\{(x_i, d_i)\}_{i=1}^{N}, \; d_i \in \{0, 1\}$
Objective:
Reduce error e between target class d and neuron response y
(error-correction learning)
e = d - y
Learning procedure
1. Start with random weights for the connections
2. Present an input vector $x_i$ from the set of training samples
3. If the perceptron response is wrong ($y \ne d$, $e \ne 0$), modify all connections $w$
4. Go back to 2
Three conditions for a neuron
After the presentation of input x, the neuron can be in
three conditions:
CASE 1:
If neuron output is correct, weights w are not altered
CASE 2:
Neuron output is 0 instead of 1 (y=0, d=1, e=d-y=1)
Input x is added to weight vector w
This makes the weight vector point closer to the input vector, increasing the chance that
the input vector will be classified as 1 in the future.
CASE 3:
Neuron output is 1 instead of 0 (y=1, d=0, e=d-y=-1)
Input x is subtracted from weight vector w
This makes the weight vector point farther away from the input vector, increasing the
chance that the input vector will be classified as a 0 in the future.
Three conditions rewritten
Three conditions for a neuron rewritten
CASE 1: $e = 0 \Rightarrow \Delta w = 0$
CASE 2: $e = 1 \Rightarrow \Delta w = x$
CASE 3: $e = -1 \Rightarrow \Delta w = -x$
Three conditions in a single expression
$\Delta w = (d - y)\, x = e\, x$
Similar for the bias
$\Delta b = (d - y) \cdot 1 = e$
Perceptron learning rule
$w_j(n+1) = w_j(n) + e(n)\, x_j(n)$
$b(n+1) = b(n) + e(n)$
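The rule above can be sketched as a complete training loop (Python, illustrative; the logical-AND training set is our own choice of a linearly separable toy problem):

```python
import numpy as np

def train_perceptron(X, d, epochs=20):
    """Train with the perceptron rule: w <- w + e*x, b <- b + e, e = d - y."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1 if w @ x + b >= 0 else 0   # hard-limit output
            e = target - y                   # error: 0, +1 or -1
            w += e * x
            b += e
    return w, b

# Linearly separable toy problem: logical AND (illustrative choice)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
print([1 if w @ x + b >= 0 else 0 for x in X])  # [0, 0, 0, 1]
```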
Convergence theorem
For the perceptron learning rule there exists a
convergence theorem:
Theorem 1: If there exists a set of connection weights w which is able to perform the
transformation d=y(x), the perceptron learning rule will converge to some solution
in a finite number of steps for any initial choice of the weights.
Comments
Theorem is only valid for linearly separable classes
Outliers can cause long training times
If classes are linearly separable, the perceptron offers a powerful pattern recognition
tool
Perceptron learning rule summary
1. Start with random weights for the connections w
2. Select an input vector x from the set of training samples
3. If the perceptron response is wrong ($y \ne d$), modify all
connections according to the learning rule: $\Delta w = e\, x$, $\Delta b = e$
4. Go back to 2 (until all input vectors are correctly classified)
Matlab demo (Perceptron learning rule)
nnd4pr Two input perceptron
MATLAB example
nn03_perceptron
Classification of linearly separable data with a perceptron
Matlab demo (Presence of an outlier)
demop4 Slow learning with the presence of an outlier
Matlab demo (Linearly non-separable classes)
demop6 Perceptron attempts to classify linearly non-separable classes
Matlab demo (Classification application)
nnd3pc Perceptron classification fruit example
3.3 Perceptron network
Single layer of perceptron neurons
Classification in more than two linearly separable
classes
MATLAB example
nn03_perceptron_network
Classification of 4-class problem with a 2-neuron perceptron
3.4 Adaline
ADALINE = Adaptive Linear Element
Widrow and Hoff, 1961:
LMS learning (Least mean square) or Delta rule
Important generalization of perceptron learning rule
Main difference from the perceptron is the activation function:
Perceptron: threshold activation function
ADALINE: linear activation function
Both Perceptron and ADALINE can only solve linearly
separable problems
Linear neuron
Basic ADALINE element
Linear transfer function:

    y(v) = v,   so   y = wx + b

(Figure: linear neuron with inputs x_1 ... x_R, bias, induced local field v and linear output y)
Simple ADALINE
Simple ADALINE with two inputs
Like a perceptron, ADALINE has a decision boundary
defined by network inputs for which network output is zero
see Perceptron decision boundary
ADALINE can be used to
classify objects into categories
    y = f(wx + b) = w_1 x_1 + w_2 x_2 + b

The decision boundary is given by

    w_1 x_1 + w_2 x_2 + b = 0

(Figure: ADALINE with inputs x_1, x_2, induced local field v and output y)
3.5 LMS learning rule
LMS = Least Mean Square learning rule
A set of learning samples (inputs and target classes)
Objective: reduce error e between target class d and
neuron response y (error-correction learning)
    e = d − y
Goal is to minimize the average sum of squared errors
    {(x_i, d_i)},   i = 1 ... N

    mse = (1/N) Σ_{n=1..N} e²(n) = (1/N) Σ_{n=1..N} (d(n) − y(n))²
LMS algorithm (1/3)
The LMS algorithm is based on an approximate steepest descent procedure
Widrow & Hoff introduced the idea of estimating the mean square error

    mse = (1/N) Σ_{n=1..N} (d(n) − y(n))²

by the squared error at each iteration

    e²(n) = (d(n) − y(n))²

and changing the network weights proportionally to the negative derivative of the error, with some learning constant η:

    Δw_j(n) = −η ∂e²(n)/∂w_j
LMS algorithm (2/3)
Now we expand the expression for the weight change:

    Δw_j(n) = −η ∂e²(n)/∂w_j = −2η e(n) ∂e(n)/∂w_j = 2η e(n) ∂y(n)/∂w_j

Expanding the neuron activation

    y(n) = Wx(n) = w_1 x_1(n) + ... + w_j x_j(n) + ... + w_R x_R(n)

gives ∂y(n)/∂w_j = x_j(n), and we finally obtain the weight change at step n:

    Δw_j(n) = 2η e(n) x_j(n)
LMS algorithm (3/3)
Final form of the LMS learning rule:

    w_j(n+1) = w_j(n) + 2η e(n) x_j(n)
    b(n+1) = b(n) + 2η e(n)

Learning is regulated by the learning rate η
For stable learning, the learning rate must be less than the reciprocal of the largest
eigenvalue of the correlation matrix x xᵀ of the input vectors
Limitations:
A linear network can only learn linear input-output mappings
Proper selection of the learning rate is required
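The LMS rule can be sketched in NumPy (a minimal sketch; the linear target function and all parameter values are illustrative assumptions, not values from the slides):

```python
import numpy as np

def train_adaline(X, d, lr=0.01, epochs=200):
    """LMS (Widrow-Hoff) rule: w += 2*lr*e*x, b += 2*lr*e, linear output y = w.x + b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - (w @ x + b)   # error of the linear neuron
            w += 2 * lr * e * x        # LMS weight update
            b += 2 * lr * e            # LMS bias update
    return w, b

# Recover a known linear mapping d = 2*x1 - x2 + 0.5 from noise-free samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.5
w, b = train_adaline(X, d)
```

Because the mapping is linear and the learning rate is well below the stability limit, the weights converge close to the generating values.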
Matlab demo (LMS learning)
pp02 Gradient descent learning by the LMS learning rule
3.6 Adaline network
ADALINE network = MADALINE
(single layer of ADALINE neurons)
3.7 ADALINE vs. Perceptron
Architectures
Learning rules
LMS learning:

    w_j(n+1) = w_j(n) + 2η e(n) x_j(n)
    b(n+1) = b(n) + 2η e(n)

Perceptron learning:

    w_j(n+1) = w_j(n) + e(n) x_j(n)
    b(n+1) = b(n) + e(n)

(Figure: ADALINE with linear activation; perceptron with threshold activation)
ADALINE and Perceptron summary
Single layer networks can be built based on ADALINE or
Perceptron neurons
Both architectures can learn only linear input-output relationships
Perceptron with threshold activation function is suitable
for classification problems
ADALINE with linear output is more suitable for
regression & filtering
ADALINE is suitable for continuous learning
3.8 Adaptive filtering
ADALINE is one of the most widely used neural
networks in practical applications
Adaptive filtering is one of its major application areas
We introduce a new element:
Tapped delay line
Input signal enters from the left and passes through
N-1 delays
Output of the tapped delay line (TDL) is an N-dimensional
vector, composed from current and past inputs
Input
Adaptive filter
Adaptive filter = ADALINE combined with TDL
    a(k) = Wp + b = Σ_{i=1..N} w_i p(k − i + 1) + b
Simple adaptive filter example
Adaptive filter with three delayed inputs
    a(t) = w_1 p(t) + w_2 p(t − 1) + w_3 p(t − 2) + b
Adaptive filter for prediction
Adaptive filter can be used to predict the next value of a
time series p(t+1)
(Figure: during learning, the filter is trained to predict p(t+1) from p(t), p(t−1), p(t−2); in operation, the trained filter predicts the next, not yet available value)
Noise cancellation example
An adaptive filter can be used to cancel engine noise in
the pilot's voice in an airplane
The goal is to obtain a signal that contains
the pilot's voice, but not the engine noise.
A linear neural net is adaptively trained to
predict the combined pilot/engine signal m
from an engine signal n. Only the engine noise
n is available to the network, so it only
learns to predict the engine's contribution to
the pilot/engine signal m.
The network error e becomes equal to the
pilot's voice. The linear adaptive network
adaptively learns to cancel the engine noise.
Such adaptive noise canceling generally
does a better job than a classical filter,
because the noise here is subtracted from,
rather than filtered out of, the signal m.
Single-layer adaptive filter network
If more than one output neuron is required, a tapped
delay line can be connected to a layer of neurons
Matlab demos (ADALINE)
nnd10eeg ADALINE for noise filtering of EEG signals
nnd10nc Adaptive noise cancelation
MATLAB example
nn_03_adaline
ADALINE time series prediction with adaptive linear filter
3.9 XOR problem
Single layer perceptron cannot represent XOR function
One of Minsky and Papert's most discouraging results
Example: perceptron with two inputs
Only AND and OR functions can be represented by Perceptron
Discriminant function:

    x_2 = −(w_1 / w_2) x_1 − b / w_2

(Figure: perceptron with inputs x_1, x_2, induced local field v and output y; the discriminant is a single straight line in the (x_1, x_2) plane)
XOR solution
Extending single-layer perceptron to multi-layer
perceptron by introducing hidden units
The XOR problem can be solved, but we no longer have a
learning rule to train the network
Multilayer perceptrons can do everything, but how to train
them?
(Figure: two-layer perceptron with inputs x_1, x_2, hidden units and an output unit solving the XOR problem; the hidden units form two decision lines whose combination separates the XOR classes)
Homework
Create a two-layer perceptron to solve XOR problem
Create a custom network
Demonstrate solution
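As a starting point for the homework, one possible hand-picked weight assignment (an assumption, not the slide's exact numbers) demonstrates that two threshold layers solve XOR:

```python
import numpy as np

def step(v):
    return (v >= 0).astype(int)   # threshold activation

# Hand-picked weights (illustrative assumption):
# hidden unit 1 computes x1 OR x2, hidden unit 2 computes x1 AND x2,
# output fires for OR-but-not-AND, which is exactly XOR.
W1 = np.array([[1.0, 1.0],    # OR unit
               [1.0, 1.0]])   # AND unit
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -2.0])    # OR minus 2*AND
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)
    return int(step(np.array([W2 @ h + b2]))[0])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))
```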
2012 Primož Potočnik, NEURAL NETWORKS (4) Backpropagation
4. Backpropagation
4.1 Multilayer feedforward networks
4.2 Backpropagation algorithm
4.3 Working with backpropagation
4.4 Advanced algorithms
4.5 Performance of multilayer perceptrons
Introduction
Single-layer networks have severe restrictions
Only linearly separable tasks can be solved
Minsky and Papert (1969)
Showed the power of a two-layer feed-forward network
but did not find a way to train such a network
Werbos (1974)
Parker (1985), LeCun (1985), Rumelhart (1986)
Solved the problem of training multi-layer networks by back-propagating the
output errors through hidden layers of the network
Backpropagation learning rule
4.1 Multilayer feedforward networks
Important class of neural networks
Input layer (only distributing inputs, without processing)
One or more hidden layers
Output layer
Commonly referred to as multilayer perceptron
Properties of multilayer perceptrons
1. Neurons include nonlinear activation function
Without nonlinearity, the capacity of the network is reduced to that of a single
layer perceptron
Nonlinearity must be smooth (differentiable everywhere), not hard-limiting as in
the original perceptron
Often, logistic function is used:
2. One or more layers of hidden neurons
Enable learning of complex tasks by extracting features from the input patterns
3. Massive connectivity
Neurons in successive layers are fully interconnected
    y = 1 / (1 + exp(−v))
Matlab demo
nnd11nf Response of the feedforward network with
one hidden layer
About backpropagation
Multilayer perceptrons can be trained by
backpropagation learning rule
Based on error-correction learning rule
Generalization of the LMS learning rule (used to train ADALINE)
Backpropagation consists of two passes through the
network
1. Forward pass
Input is applied to the network and propagated to the output
Synaptic weights stay frozen
Based on the desired response, error signal is calculated
2. Backward pass
Error signal is propagated backwards from output to input
Synaptic weights are adjusted according to the error gradient
4.2 Backpropagation algorithm (1/9)
A set of learning samples (inputs and target outputs):

    {(x_n, d_n)},   n = 1 ... N

Error signal at the output layer, neuron j, learning iteration n:

    e_j(n) = d_j(n) − y_j(n)

Instantaneous error energy of the output layer with R neurons:

    E(n) = (1/2) Σ_{j=1..R} e_j²(n)

Average error energy over the whole learning set:

    E_av = (1/N) Σ_{n=1..N} E(n)
Backpropagation algorithm (2/9)
The average error energy E_av represents a cost function, a
measure of learning performance
E_av is a function of the free network parameters:
synaptic weights
bias levels
The learning objective is to minimize the average error energy E_av
by adjusting the free network parameters
We use an approximation: pattern-by-pattern learning
instead of epoch learning
Parameter adjustments are made for each pattern presented to the network
Minimizing the instantaneous error energy E(n) at each step instead of the average error energy E_av
Backpropagation algorithm (3/9)
Similar to the LMS algorithm, backpropagation applies a
correction to each synaptic weight proportional to the partial derivative
of the instantaneous error energy:

    Δw_ji(n) = −η ∂E(n)/∂w_ji(n)

Expressing this gradient by the chain rule:

    ∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

(instantaneous error energy E(n) = (1/2) Σ_j e_j²(n), output error e_j = d_j − y_j,
network output y_j, induced local field v_j, synaptic weight w_ji)
Backpropagation algorithm (4/9)
1. Gradient on the output error:

    ∂E(n)/∂e_j(n) = e_j(n)        (since E(n) = (1/2) Σ_j e_j²(n))

2. Gradient on the network output:

    ∂e_j(n)/∂y_j(n) = −1          (since e_j(n) = d_j(n) − y_j(n))

3. Gradient on the induced local field:

    ∂y_j(n)/∂v_j(n) = f′(v_j(n))

4. Gradient on the synaptic weight:

    ∂v_j(n)/∂w_ji(n) = y_i(n)     (since v_j(n) = Σ_{i=0..R} w_ji(n) y_i(n))
Backpropagation algorithm (5/9)
Putting the gradients together:

    ∂E(n)/∂w_ji(n) = −e_j(n) f′(v_j(n)) y_i(n)

    Δw_ji(n) = −η ∂E(n)/∂w_ji(n) = η e_j(n) f′(v_j(n)) y_i(n)

Correction of the synaptic weight is defined by the delta rule:

    Δw_ji(n) = η δ_j(n) y_i(n)

with learning rate η and local gradient δ_j(n) = e_j(n) f′(v_j(n))
Backpropagation algorithm (6/9)
CASE 1: neuron j is an output node
The output error e_j(n) is available, so computation of the local gradient is straightforward:

    δ_j(n) = e_j(n) f′(v_j(n))

CASE 2: neuron j is a hidden node
The hidden error is not available (credit assignment problem)
The local gradient is obtained by backpropagating errors through the network:

    δ_j(n) = −∂E(n)/∂y_j(n) · f′(v_j(n))

where ∂E(n)/∂y_j(n), the derivative of the output error energy E on the hidden layer output y_j, still has to be determined

For the logistic activation function

    f(v_j(n)) = 1 / (1 + exp(−a v_j(n)))

the derivative is

    f′(v_j(n)) = a exp(−a v_j(n)) / [1 + exp(−a v_j(n))]²
Backpropagation algorithm (7/9)
CASE 2: neuron j is a hidden node (continued)
Instantaneous error energy of the output layer with R neurons:

    E(n) = (1/2) Σ_{k=1..R} e_k²(n)

with

    e_k(n) = d_k(n) − y_k(n) = d_k(n) − f(v_k(n)),   v_k(n) = Σ_{j=0..M} w_kj(n) y_j(n)

Expressing the gradient of the output error energy E on the hidden layer output y_j:

    ∂E(n)/∂y_j(n) = Σ_k e_k ∂e_k(n)/∂y_j(n)
                  = Σ_k e_k · ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)
                  = −Σ_k e_k f′(v_k(n)) w_kj(n)
Backpropagation algorithm (8/9)
CASE 2: neuron j is a hidden node (continued)
Finally, combining the ansatz for the hidden layer local gradient

    δ_j(n) = −∂E(n)/∂y_j(n) · f′(v_j(n))

with the gradient of the output error energy on the hidden layer output

    ∂E(n)/∂y_j(n) = −Σ_k e_k f′(v_k(n)) w_kj(n) = −Σ_k δ_k(n) w_kj(n)

gives the final result for the hidden layer local gradient:

    δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)
Backpropagation algorithm (9/9)
Backpropagation summary:

    Δw_ji(n) = η δ_j(n) y_i(n)
    (weight correction = learning rate × local gradient × input of neuron j)

1. Local gradient of an output node:

    δ_k(n) = e_k(n) f′(v_k(n))

2. Local gradient of a hidden node:

    δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)
Two passes of computation
1. Forward pass
Input is applied to the network and propagated to the output:
inputs x_i(n) → hidden layer output y_j = f(Σ_i w_ji x_i) → output layer output y_k = f(Σ_j w_kj y_j) → output error e_k(n) = d_k(n) − y_k(n)
2. Backward pass
Recursive computation of local gradients:
output local gradients δ_k(n) = e_k(n) f′(v_k(n)) → hidden layer local gradients δ_j(n) = f′(v_j(n)) Σ_k δ_k(n) w_kj(n)
Synaptic weights are adjusted according to the local gradients:

    Δw_kj(n) = η δ_k(n) y_j(n),   Δw_ji(n) = η δ_j(n) x_i(n)
Summary of backpropagation algorithm
1. Initialization
Pick weights and biases from a uniform distribution with zero mean and a
variance that induces local fields between the linear and saturated parts of the
logistic function
2. Presentation of training samples
For each sample from the epoch, perform forward pass and backward pass
3. Forward pass
Propagate training sample from network input to the output
Calculate the error signal
4. Backward pass
Recursive computation of local gradients from output layer toward input layer
Adaptation of synaptic weights according to generalized delta rule
5. Iteration
Iterate steps 2-4 until stopping criterion is met
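Steps 1-5 can be sketched in NumPy for a single hidden layer with logistic activations (a minimal sketch; layer sizes, learning rate and the XOR task are illustrative assumptions):

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_mlp(X, D, hidden=4, lr=0.5, epochs=10000, seed=0):
    # 1. Initialization: small uniform weights with zero mean
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (hidden, X.shape[1]))
    b1 = np.zeros(hidden)
    W2 = rng.uniform(-0.5, 0.5, (1, hidden))
    b2 = np.zeros(1)
    # 2./5. Presentation of training samples, iterated over epochs
    for _ in range(epochs):
        for x, d in zip(X, D):
            # 3. Forward pass: propagate input, calculate the error signal
            y1 = logistic(W1 @ x + b1)
            y2 = logistic(W2 @ y1 + b2)
            # 4. Backward pass: local gradients (f'(v) = y*(1-y) for a = 1)
            delta2 = (d - y2) * y2 * (1 - y2)
            delta1 = y1 * (1 - y1) * (W2.T @ delta2)
            # generalized delta rule
            W2 += lr * np.outer(delta2, y1); b2 += lr * delta2
            W1 += lr * np.outer(delta1, x);  b1 += lr * delta1
    return W1, b1, W2, b2

# XOR, the classic test case for a multilayer perceptron
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
D = np.array([0.0, 1.0, 1.0, 0.0])
W1, b1, W2, b2 = train_mlp(X, D)
```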
Matlab demo
nnd11bc Backpropagation calculation
Matlab demo
nnd12sd1 Steepest descent
Using backpropagation learning for ADALINE
No hidden layers, one output neuron
Linear activation function
Backpropagation rule:

    Δw_i(n) = η δ(n) y_i(n),   δ(n) = e(n) f′(v(n)),   y_i = x_i

For the linear activation function f(v(n)) = v(n) the derivative is f′(v(n)) = 1, so

    Δw_i(n) = η e(n) x_i(n)

which is the original delta rule
Backpropagation is a generalization of the delta rule

(Figure: ADALINE with inputs x_1 ... x_R, induced local field v and linear output y)
4.3 Working with backpropagation
Efficient application of backpropagation requires some
fine-tuning
Various parameters, functions and methods should be
selected
Training mode (sequential / batch)
Activation function
Learning rate
Momentum
Stopping criterion
Heuristics for efficient backpropagation
Methods for improving generalization
Sequential and batch training
Learning results from many presentations of training
examples
Epoch = presentation of the entire training set
Batch training
Weight updating after the presentation of a complete epoch
Sequential training
Weight updating after the presentation of each training example
Stochastic nature of learning, faster convergence
Important practical reasons for sequential learning:
Algorithm is easy to implement
Provides effective solution to large and difficult problems
Sequential training is therefore the preferred training mode
Good practice is to present the training examples in random order
Activation function
The derivative of the activation function f′(v_j(n)) is required for the
computation of local gradients
The only requirement for an activation function: differentiability
Commonly used: the logistic function

    y_j = f(v_j(n)) = 1 / (1 + exp(−a v_j(n))),   a > 0

Derivative of the logistic function:

    f′(v_j(n)) = a exp(−a v_j(n)) / [1 + exp(−a v_j(n))]² = a y_j(n) [1 − y_j(n)]

The local gradient can thus be calculated
without explicit knowledge of the
activation function
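The identity f′(v) = a·y·(1 − y) can be checked numerically (an illustrative sketch; the value a = 2 is an arbitrary choice):

```python
import numpy as np

# Numerical check of f'(v) = a*y*(1-y) for the logistic function.
a = 2.0
f = lambda v: 1.0 / (1.0 + np.exp(-a * v))
v = np.linspace(-3, 3, 61)
y = f(v)
numeric = (f(v + 1e-6) - f(v - 1e-6)) / 2e-6   # central difference
analytic = a * y * (1 - y)                      # derivative via the output only
```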
Other activation functions
Using sin() activation functions
Equivalent to traditional Fourier analysis
Network with sin() activation functions can be trained by backpropagation
Example: approximating a periodic function by

    f(x) = a + Σ_{k=1..∞} c_k sin(kx)

(Figure: approximation with 8 sigmoid hidden neurons vs. 4 sin hidden neurons)
Learning rate
Learning procedure requires
Change in the weight space to be proportional to error gradient
True gradient descent requires infinitesimal steps
Learning in practice
The factor of proportionality is the learning rate η
Choose a learning rate as large as possible without leading to oscillations:

    Δw_ji(n) = η δ_j(n) y_i(n)

(Figure: learning trajectories for η = 0.010, 0.035, 0.040)
Stopping criteria
Generally, backpropagation cannot be shown to converge
No well defined criteria for stopping its operation
Possible stopping criteria
1. Gradient vector
Euclidean norm of the gradient vector becomes sufficiently small
2. Output error
Output error is small enough
Rate of change in the average squared error per epoch is sufficiently small
3. Generalization performance
Generalization performance has peaked or is adequate
4. Max number of iterations
We are out of time ...
Heuristics for efficient backpropagation (1/3)
1. Maximizing information content
General rule: every training example presented to the backpropagation algorithm should
be chosen on the basis that its information content is the largest possible for the task at
hand
Simple technique: randomize the order in which examples are presented from one epoch
to the next
2. Activation function
Faster learning with antisymmetric sigmoid activation functions
A popular choice is:

    f(v) = a tanh(bv),   a = 1.72,  b = 0.67

for which f(1) = 1, f(−1) = −1, the effective gain at the origin f′(0) ≈ 1, and the
second derivative is maximal at v = 1
Heuristics for efficient backpropagation (2/3)
3. Target values
Must be in the range of the activation function
Offset is recommended, otherwise learning is driven into saturation
Example: max(target) = 0.9 max(f)
4. Preprocessing inputs
a) Normalizing mean to zero
b) Decorrelating input variables (by using principal component analysis)
c) Scaling input variables (variances should be approx. equal)
Original a) Zero mean b) Decorrelated c) Equalized variance
Heuristics for efficient backpropagation (3/3)
5. Initialization
The choice of initial weights is important for a successful network design
Large initial values → saturation
Small initial values → slow learning due to operation only in the saddle point near the origin
A good choice lies between these extreme values:
the standard deviation of the induced local fields should lie between the linear and saturated
parts of the sigmoid function
tanh activation function example (a = 1.72, b = 0.67):
synaptic weights should be chosen from a uniform distribution with zero mean and
standard deviation

    σ_w = m^(−1/2)

where m is the number of synaptic weights of the neuron
6. Learning from hints
Prior information about the unknown mapping can be included in the learning
process:
Initialization
Possible invariance properties, symmetries, ...
Choice of activation functions
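The initialization heuristic can be sketched as follows (a minimal sketch; for a uniform distribution on [−a, a] the standard deviation is a/√3, so the half-width is chosen as a = √3·σ_w):

```python
import numpy as np

# Weight initialization: uniform distribution with zero mean and
# standard deviation sigma_w = m**(-1/2), m = number of synaptic
# weights of the neuron.
def init_weights(m, rng):
    sigma = m ** -0.5
    a = np.sqrt(3.0) * sigma          # half-width giving std = sigma
    return rng.uniform(-a, a, size=m)

rng = np.random.default_rng(0)
w = init_weights(100, rng)            # one neuron with 100 weights
```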
Generalization
Neural network is able to generalize:
Input-output mapping computed by the network is correct for test data
Test data were not used during training
Test data are from the same population as training data
Correct response even if input is slightly different than the training examples
(Figure: overfitting vs. good generalization)
Improving generalization
Methods to improve generalization
1. Keeping the network small
2. Early stopping
3. Regularization
Early stopping
Available data are divided into three sets:
1. Training set used to train the network
2. Validation set: used for early stopping, training stops
when the validation error starts to increase
3. Test set used for final estimation of
network performance and for comparison
of various models
Regularization
Improving generalization by regularization
Modifying the performance function

    mse = (1/N) Σ_{n=1..N} (d_j(n) − y_j(n))²

with the mean sum of squares of the network weights and biases

    msw = (1/M) Σ_{m=1..M} w_m²

thus obtaining the new performance function

    msreg = γ · mse + (1 − γ) · msw

Using this performance function, the network will have smaller weights and biases,
which forces the network response to be smoother and less likely to overfit
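The regularized performance function can be sketched directly from the formulas above (a minimal sketch; the value γ = 0.9 is an illustrative default):

```python
import numpy as np

def msereg(d, y, weights, gamma=0.9):
    """Regularized performance: msreg = gamma*mse + (1-gamma)*msw."""
    mse = np.mean((d - y) ** 2)                                   # error term
    msw = np.mean(np.concatenate([w.ravel() for w in weights]) ** 2)  # weight term
    return gamma * mse + (1 - gamma) * msw
```

With γ close to 1 the error term dominates; lowering γ penalizes large weights more strongly.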
Deficiencies of backpropagation
Some properties of backpropagation do not guarantee the
algorithm to be universally useful:
1. Long training process
Possibly due to non-optimum learning rate
(advanced algorithms address this problem)
2. Network paralysis
Combination of sigmoidal activation and very large weights can decrease
gradients almost to zero, so that training almost stops
3. Local minima
Error surface of a complex network can be very complex, with many hills and
valleys
Gradient methods can get trapped in local minima
Solutions: probabilistic learning methods (simulated annealing, ...)
4.4 Advanced algorithms
Basic backpropagation is slow
Adjusts the weights in the steepest descent direction (negative of the gradient) in which
the performance function is decreasing most rapidly
It turns out that, although the function decreases most rapidly along the negative of the
gradient, this does not necessarily produce the fastest convergence
1. Advanced algorithms based on heuristics
Developed from an analysis of the performance of the standard steepest descent
algorithm
Momentum technique
Variable learning rate backpropagation
Resilient backpropagation
2. Numerical optimization techniques
Application of standard numerical optimization techniques to network training
Quasi-Newton algorithms
Conjugate Gradient algorithms
Levenberg-Marquardt
Momentum
A simple method of increasing learning rate yet avoiding
the danger of instability
Modified delta rule by adding a momentum term:

    Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n−1),   0 ≤ α < 1

The momentum constant α accelerates backpropagation in steady downhill directions

(Figure: small learning rate; large learning rate (oscillations); learning with momentum)
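The momentum-modified update can be illustrated on a simple quadratic error surface (a minimal sketch; the surface and all parameter values are illustrative assumptions):

```python
import numpy as np

# Gradient descent with momentum: dw(n) = -lr*g(w) + alpha*dw(n-1).
def descend(grad, w0, lr=0.05, alpha=0.9, steps=200):
    w = np.array(w0, float)
    dw = np.zeros_like(w)
    for _ in range(steps):
        dw = -lr * grad(w) + alpha * dw   # momentum-modified delta rule
        w = w + dw
    return w

grad = lambda w: np.array([2 * w[0], 20 * w[1]])  # ill-conditioned quadratic bowl
w = descend(grad, [5.0, 5.0])
```

Momentum averages successive updates, so progress accelerates along the shallow axis while oscillations along the steep axis are damped.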
Variable learning rate η(t)
Another method of manipulating learning rate and
momentum to accelerate backpropagation
1. If the error decreases after a weight update:
the weight update is accepted
the learning rate is increased ............................................. η(t+1) = ρ η(t),  ρ > 1
if the momentum has been previously reset to 0, it is set back to its original value
2. If the error increases by less than ξ after a weight update:
the weight update is accepted
the learning rate is not changed ......................................... η(t+1) = η(t)
if the momentum has been previously reset to 0, it is set back to its original value
3. If the error increases by more than ξ after a weight update:
the weight update is discarded
the learning rate is decreased ............................................ η(t+1) = β η(t),  0 < β < 1
the momentum is reset to 0
Possible parameter values: ξ = 4 %,  β = 0.7,  ρ = 1.05
Resilient backpropagation
Slope of sigmoid functions approaches zero as the input
gets large
This causes a problem when using steepest descent to train a network:
the gradient can have a very small magnitude, so the changes in the weights are small,
even though the weights are far from their optimal values
Resilient backpropagation
Eliminates these harmful effects of the magnitudes of the partial derivatives
Only sign of the derivative is used to determine the direction of weight update,
size of the weight change is determined by a separate update value
Resilient backpropagation rules:
1. The update value for each weight and bias is increased by a factor δ_inc if the derivative of the
performance function with respect to that weight has the same sign for two successive
iterations
2. The update value is decreased by a factor δ_dec if the derivative with respect to that weight
changes sign from the previous iteration
3. If the derivative is zero, the update value remains the same
4. If weights are oscillating, the weight change is reduced
Numerical optimization (1/3)
Supervised learning as an optimization problem
Error surface of a multilayer perceptron, expressed by instantaneous error
energy E(n), is a highly nonlinear function of synaptic weight vector w(n)
    E(n) = E(w(n))

(Figure: error surface E(w_1, w_2) over the weight space)
Numerical optimization (2/3)
Expanding the error energy E(n) = E(w(n)) in a Taylor series around w(n):

    E(w(n) + Δw(n)) = E(w(n)) + gᵀ(n) Δw(n) + (1/2) Δwᵀ(n) H(n) Δw(n) + ...

with the local gradient

    g(n) = ∂E(w)/∂w,   evaluated at w = w(n)

and the Hessian matrix

    H(n) = ∂²E(w)/∂w²,   evaluated at w = w(n)
Numerical optimization (3/3)
Steepest descent method (backpropagation)
Weight adjustment proportional to the negative gradient:

    Δw(n) = −η g(n)          (gradient descent)

Simple implementation, but slow convergence
Significant improvement by using higher-order information
Adding a momentum term is a crude approximation to using second-order information
about the error surface
A quadratic approximation of the error surface is the essence of Newton's method:

    Δw(n) = −H⁻¹(n) g(n)     (Newton's method)

where H⁻¹ is the inverse of the Hessian matrix
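The difference between the two updates can be illustrated on a quadratic error surface, where a single Newton step reaches the minimum exactly (a minimal sketch with an illustrative Hessian):

```python
import numpy as np

# On a quadratic error surface E(w) = 1/2 w^T A w, Newton's step
# dw = -H^-1 g reaches the minimum in one step, while a gradient
# step dw = -lr*g only moves a fraction of the way.
A = np.array([[2.0, 0.0], [0.0, 20.0]])   # Hessian of E
w = np.array([5.0, 5.0])
g = A @ w                                  # gradient at w
w_newton = w - np.linalg.solve(A, g)       # Newton step (no explicit inverse)
w_gd = w - 0.05 * g                        # one gradient-descent step
```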
Quasi-Newton algorithms
Problems with the calculation of the Hessian matrix:
The inverse Hessian H⁻¹ is required, which is computationally expensive
The Hessian has to be nonsingular, which is not guaranteed
The Hessian of a neural network can be rank deficient
No convergence guarantee for non-quadratic error surface
Quasi-Newton method
Only requires calculation of the gradient vector g(n)
The method estimates the inverse Hessian directly without matrix inversion
Quasi-Newton variants:
Davidon-Fletcher-Powell algorithm
Broyden-Fletcher-Goldfarb-Shanno algorithm ... best form of Quasi-Newton algorithm!
Application for neural networks
The method is fast for small neural networks
Conjugate gradient algorithms
Conjugate gradient algorithms
Second order methods, avoid computational problems with the inverse Hessian
Search is performed along conjugate directions, which produces generally faster
convergence than steepest descent directions
1. In most of the conjugate gradient algorithms, the step size is adjusted at each iteration
2. A search is made along the conjugate gradient direction to determine the step size that
minimizes the performance function along that line
Many variants of conjugate gradient algorithms
Fletcher-Reeves Update
Polak-Ribière Update
Powell-Beale Restarts
Scaled Conjugate Gradient
Application for neural networks
Perhaps the only method suitable for large scale problems (hundreds or
thousands of adjustable parameters) well suited for multilayer perceptrons
(Figure: gradient descent vs. conjugate gradient search paths)
Levenberg-Marquardt algorithm
Levenberg-Marquardt algorithm (LM)
Like the quasi-Newton methods, LM algorithm was designed to approach
second-order training speed without having to compute the Hessian matrix
When the performance function has the form of a sum of squares (typical in
neural network training), then the Hessian matrix H can be approximated by
Jacobian matrix J
where Jacobian matrix contains first derivatives of the network errors with
respect to the weights
Jacobian can be computed through a standard backpropagation technique that is
much less complex than computing the Hessian matrix
Application for neural networks
Algorithm appears to be the fastest method for training moderate-sized
feedforward neural networks (up to several hundred weights)
    H ≈ Jᵀ J
Advanced algorithms summary
Practical hints (Matlab related)
The variable learning rate algorithm is usually much slower than the other
methods
The resilient backpropagation method is very well suited to pattern
recognition problems
Function approximation problems, networks with up to a few hundred
weights: Levenberg-Marquardt algorithm will have the fastest
convergence and very accurate training
Conjugate gradient algorithms perform well over a wide variety of
problems, particularly for networks with a large number of weights
(modest memory requirements)
Training algorithms in MATLAB
4.5 Performance of multilayer perceptrons
Approximation error is influenced by
Learning algorithm used ... (discussed in the last section)
This determines how well the error on the training set is minimized
Number and distribution of learning samples
This determines how well the training samples represent the actual function
Number of hidden units
This determines the expressive power of the network. For smooth functions
only a few hidden units are needed; for wildly fluctuating functions
more hidden units will be needed
Number of learning samples
Function approximation example y=f(x)
Learning set with 4 samples has small training error but gives very poor
generalization
Learning set with 20 samples has higher training error but generalizes well
Low training error is no guarantee for a good network performance!
4 learning samples 20 learning samples
Number of hidden units
Function approximation example y=f(x)
A large number of hidden units leads to a small training error but not necessarily
to a small test error
Adding hidden units always reduces the training error
However, adding hidden units first reduces the test error but then increases
it ... (peaking effect, early stopping can be applied)
5 hidden units 20 hidden units
Size effect summary
(Two plots: error rate vs. number of training samples, and error rate vs.
number of hidden units; each shows a training-set and a test-set curve,
with the optimal number of hidden neurons marked on the second plot.)
Matlab demo
nnd11fa Function approximation, variable number of
hidden units
Matlab demo
nnd11gn Generalization, variable number of hidden
units
5. Dynamic Networks
5.1 Historical dynamic networks
5.2 Focused time-delay neural network
5.3 Distributed time-delay neural network
5.4 Layer recurrent network
5.5 NARX network
5.6 Computational power of dynamic networks
5.7 Learning algorithms
5.8 System identification
5.9 Model reference adaptive control
Introduction
Time
An essential ingredient of the learning process
Important for many practical tasks: speech, vision, signal processing, control
Many applications require temporal processing
Time series prediction
Noise cancelation
Adaptive control
System identification
...
Linear systems: well-developed theories exist
Nonlinear systems: neural networks have the potential to solve such problems
Introduction
How can we build time into the operation of neural
networks?
Extending static neural networks into dynamic neural networks
networks become responsive to the temporal structure of input signals
Networks become dynamic by adding
TEMPORAL MEMORY and/or FEEDBACK
Feedback loop
Static / dynamic networks
Neural network categories
1. Static networks: structural pattern recognition
Feedforward networks
No feedback elements, no delays
Output is calculated directly from the input through feedforward connections
2. Dynamic networks: temporal pattern recognition
Output depends on
current input to the network
also on previous inputs
previous network output
previous network states
Dynamic networks can be divided into two categories
1. Networks that have only feedforward connections
2. Networks with feedback or recurrent connections
A need for short-term memory and feedback
Memory
Memory
Long-term memory
Acquired through supervised learning and stored in synaptic weights
Short-term memory
Temporal memory, useful for capturing the temporal dimension
Implemented as time delays at various parts of the network
Tapped delay line
The simplest form of short-term memory
Already mentioned in the context of linear adaptive filters
Most commonly used for dynamic networks
Tapped delay line (TDL) consists of N unit delay operators
Output of TDL is an N+1 dimensional vector, composed from current and past
inputs
TDL( x(n) ) = [ x(n), x(n−1), ..., x(n−N) ]
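The TDL mapping can be sketched in a few lines of Python (the helper name `tdl` is illustrative):

```python
import numpy as np

def tdl(x, n, N):
    """Tapped delay line: return [x(n), x(n-1), ..., x(n-N)] as an (N+1)-vector.

    x : 1-D signal (indexable sequence), n : current time index,
    N : number of unit delays. Assumes n >= N so all delayed samples exist.
    """
    return np.array([x[n - k] for k in range(N + 1)])

x = np.arange(10.0)          # simple ramp signal x(n) = n
v = tdl(x, n=5, N=3)         # current sample plus three delayed samples
```

The returned vector is exactly the (N+1)-dimensional TDL output described above.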
5.1 Historical dynamic networks
Hopfield (1982)
Jordan (1986)
Elman (1990)
Hopfield network
Hopfield network (Hopfield, 1982)
Network consists of N interconnected neurons which update their activation
values asynchronously and independently of other neurons
All neurons are both input and output neurons
Activation values are binary (-1, +1)
Multiple-loop feedback system
interesting to study stability of the system
Primary applications
Associative memory
Solving optimization problems
MATLAB example: demohop1.m
Jordan network
Jordan network (Jordan, 1986)
Network outputs are fed back as extra inputs (state units)
Each state unit is fed with one network output
The connections from output to state units are fixed (+1)
Learning takes place only in the
connections between input to hidden
units as well as hidden to output units
Standard backpropagation learning rule
can be applied to train the network
Elman network
Elman network (Elman, 1990)
Similar to Jordan network, with the following differences:
1. Hidden units are fed back (instead of output units)
2. Context units have no self-connections
5.2 Focused time-delay neural network
The most straightforward dynamic network
feedforward network + tapped delay line at input
Temporal dynamics only at the input layer of a static network
Nonlinear extension of linear adaptive filters
Backpropagation training can be used
The structure is suitable for time-series prediction
TDL & prediction horizon
Input delays = [0 6 12] → inputs {x(t), x(t−6), x(t−12)}
Prediction horizon = 1 → output x(t+1)
Online prediction application
MATLAB example (1/3)
Application of focused time-delay neural network for
prediction of the chaotic Mackey-Glass time series
Objective
Design a focused time-delay neural network as a recursive one-step-ahead predictor
Fixed network parameters
Number of hidden layers: 1
Hidden layer activation func.: Logistic
Output layer activation func.: Linear
Variable network parameters
Input delays = ?
Hidden layer neurons = ?
Mackey-Glass equation:
dy/dt = −b y(t) + c y(t−τ) / (1 + y(t−τ)^10),  with b = 0.1, c = 0.2, τ = 17
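The Mackey-Glass series can be generated by simple Euler integration of the delay differential equation. A Python sketch; the step size dt = 1 and the initial history value 1.2 are assumptions, not taken from the slides:

```python
import numpy as np

def mackey_glass(n_samples, b=0.1, c=0.2, tau=17, dt=1.0, y0=1.2):
    """Euler integration of dy/dt = -b*y(t) + c*y(t-tau) / (1 + y(t-tau)**10).

    The first `tau/dt` samples serve as the (constant) initial history.
    """
    delay = int(tau / dt)
    y = np.full(n_samples + delay, y0)
    for n in range(delay, n_samples + delay - 1):
        y_tau = y[n - delay]
        y[n + 1] = y[n] + dt * (-b * y[n] + c * y_tau / (1.0 + y_tau ** 10))
    return y[delay:]

series = mackey_glass(500)   # e.g. 500 training samples as in the example
```

The resulting bounded, chaotic series is what the focused time-delay network is trained to predict one step ahead.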
MATLAB example (2/3)
Samples
500 training samples
500 validation samples, recursive prediction
Results
MATLAB example (3/3)
5.3 Distributed time-delay neural network
Tapped delay lines distributed throughout the network
Distributed temporal dynamics → ability to handle non-stationary environments
Standard backpropagation training cannot be used any more
→ temporal backpropagation is needed
Possible applications:
phoneme recognition, recognition of various frequency contents in signals
Temporal backpropagation
Backpropagation algorithm
Suitable for static networks and focused time-delay neural networks
Temporal backpropagation
Supervised learning algorithm
Extension of backpropagation
Required for distributed time delay neural networks
Computationally demanding
Which form of backpropagation to use?
Based on the nature of the temporal processing task
1. STATIONARY ENVIRONMENT
Standard backpropagation + Focused time-delay neural networks
2. NON-STATIONARY ENVIRONMENT
Temporal backpropagation + Distributed time delay neural networks
Example (1/2)
Wan (1994): Time series prediction by using a
connectionist network with internal delay lines
Winner of the Santa Fe Institute Time-Series Competition, USA (1992)
Task: Nonlinear prediction of a nonstationary time series exhibiting chaotic
pulsations of an NH₃ laser
Example (2/2)
Prediction results
5.4 Layer recurrent network
Layer recurrent network = Recurrent multilayer perceptron
One or more hidden layers
Each computation layer has feedback link
Layer recurrent network structure
Feedback loop with single delay for hidden layer
Can be trained by backpropagation
Elman (1990)
Example (1/3)
Phoneme detection problem
Recognition of various frequency components
Layer recurrent network
1 hidden layer
8 neurons
5 delays
Example (2/3)
Network training
Successful recognition of two phonemes
Example (3/3)
Network testing
Unreliable generalization, works only on trained phonemes
5.5 NARX network
Networks discussed so far
Focused or distributed time delays
Feedback only localized to specific network layers
NARX = Nonlinear AutoRegressive Network with
EXogenous Inputs
Recurrent network with global feedback
Feedback over several layers of the network
Based on linear ARX model
Defining equation for NARX model
Output y is a nonlinear function of past outputs and past inputs
Nonlinear function f can be implemented by a neural network
NARX structure
NARX network with global feedback
Possible application areas
Nonlinear prediction and modelling
Adaptive equalization of communication channels
Speech processing
Automobile engine diagnostics
NARX training considerations
NARX output is an estimate of the output of some nonlinear dynamic system
Output is fed back to the input of the feedforward neural network → parallel architecture
True output is available during training → possible to create a series-parallel architecture
True output is used instead of feeding back the estimated output
Advantages of the series-parallel architecture for training:
1. Training input to the feedforward network is more accurate → improved training accuracy
2. Resulting network is purely feedforward → static backpropagation can be used
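The difference between the two architectures can be sketched as follows, with a toy linear ARX rule standing in for a trained network (the function names and the toy plant are illustrative assumptions):

```python
import numpy as np

def narx_series_parallel(f, u, y_true, q):
    """Training mode: feed back the TRUE past outputs (purely feedforward)."""
    y_hat = list(y_true[:q])
    for n in range(q, len(u)):
        y_hat.append(f(y_true[n - q:n], u[n - q:n]))   # true outputs as regressors
    return np.array(y_hat)

def narx_parallel(f, u, y_init, q, steps):
    """Application mode: feed back the ESTIMATED outputs (recursive prediction)."""
    y_hat = list(y_init)
    for n in range(q, q + steps):
        y_hat.append(f(y_hat[n - q:n], u[n - q:n]))    # own predictions as regressors
    return np.array(y_hat)

# Toy "identified model": a linear ARX rule standing in for a trained network
f = lambda y_past, u_past: 0.5 * y_past[-1] + 0.1 * u_past[-1]
u = np.ones(20)
y = np.zeros(20)
for n in range(1, 20):                                 # simulate the true plant
    y[n] = 0.5 * y[n - 1] + 0.1 * u[n - 1]
sp = narx_series_parallel(f, u, y, q=1)
pp = narx_parallel(f, u, y[:1], q=1, steps=19)
```

Because the toy model matches the plant exactly, both modes reproduce the plant output; with an imperfect model, errors accumulate in the parallel (recursive) mode but not in the series-parallel mode.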
Example (1/5)
Problem: Magnetic levitation
Objective
to control the position of a magnet suspended above an
electromagnet, where the magnet can only move in the
vertical direction
Equation of motion
y(t) = distance of the magnet above the electromagnet
i(t) = current flowing in the electromagnet
M = mass of the magnet
g = gravitational constant
β = viscous friction coefficient (determined by the material in which the magnet moves)
α = field strength constant, determined by the number of turns of wire on the
electromagnet and the strength of the magnet
Example (2/5)
Data
Sampling interval: 0.01 sec
Input: current i(t)
Output: magnet position y(t)
NARX network structure
3 hidden neurons
5 input delays
5 global feedback delays
Example (3/5)
Series-parallel training results for NARX network
Example (4/5)
Parallel recursive prediction (1000 steps)
Example (5/5)
Possible learning results: unstable learning, local minima
Case A: OK Case B: Unstable Case C: Local minimum
5.6 Computational power of dynamic networks
Fully and partially connected recurrent networks
Theorems
Theorem I (Siegelmann & Sontag, 1991)
All Turing machines may be simulated by fully connected recurrent networks built
on neurons with sigmoid activation functions.
(A Turing machine is a theoretical abstraction that is functionally as powerful
as any computer, see http://aturingmachine.com )
Theorem II (Siegelmann et al., 1997)
NARX networks with one layer of hidden neurons with BOSS* activation
functions and a linear output neuron can simulate fully connected recurrent
networks with BOSS* activation functions, except for a linear slowdown
Corollary to Theorem I and II
NARX networks with one hidden layer of neurons with BOSS* activation
functions and a linear output neuron are Turing equivalent.
* BOSS = bounded, one-sided saturated function
5.7 Learning algorithms
Two modes of training for recurrent networks
1. Epochwise training
For a given epoch, the recurrent network starts running from some initial state
until it reaches a new state, at which point the training is stopped and the
network is reset to initial state for the next epoch
METHOD: Backpropagation through time
2. Continuous training
Suitable if no reset states are available or online learning is required
Network learns while it is performing signal processing
The learning process never stops
METHOD: Real-time recurrent learning
Backpropagation through time
Extension of the standard backpropagation algorithm
Derived by unfolding the temporal operation of the network into a layered
feedforward network
The topology grows by one layer at every time step
EXAMPLE: unfolding the temporal operation of a 2-neuron recurrent network
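The unfolding idea can be sketched as a forward pass: running the recurrence for T steps produces the same activations as a T-layer feedforward network with shared weights. A Python sketch; the tanh activation and the weight shapes are illustrative assumptions:

```python
import numpy as np

def unfold_forward(W, x_seq, h0):
    """Forward pass of a 2-neuron recurrent net h(n+1) = tanh(W @ [h(n); x(n)]).

    Unfolded in time, each time step behaves like one layer of a deep
    feedforward network that shares the same weight matrix W.
    """
    h, states = h0, [h0]
    for x in x_seq:
        h = np.tanh(W @ np.concatenate([h, x]))
        states.append(h)
    return states        # T+1 "layer activations" of the unfolded network

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3)) * 0.5    # 2 neurons: 2 recurrent + 1 input weight each
x_seq = [np.array([1.0])] * 4            # 4 time steps -> 4 unfolded layers
states = unfold_forward(W, x_seq, np.zeros(2))
```

Backpropagation through time then applies ordinary backpropagation to this unfolded layered structure, summing the gradients that each "layer" contributes to the shared weights.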
Backpropagation through time example (1/2)
Nguyen (1989): The truck backer-upper
... example (2/2)
Training
Generalization
5.8 System Identification
System identification = experimental approach to
modeling the process with unknown parameters
STEP 1: Experimental planning
STEP 2: Selection of a model structure
STEP 3: Parameter estimation
STEP 4: Model validation
For an unknown nonlinear dynamical process, dynamic neural networks can be
used as the identification model
Two basic identification approaches:
1. System identification using state-space model
2. System identification using input-output model
System identification using state-space model
State
Plays a vital role in the mathematical formulation of a dynamical system
State of a dynamical system defined as a set of quantities that summarizes all
the information about the past behavior of the system that is needed to uniquely
describe its future behavior, except for the purely external effects arising from the
applied input.
Plant description by a state-space model
State:  x(n+1) = f( x(n), u(n) )
Output: y(n) = h( x(n) )
f, h : unknown nonlinear vector functions
Two dynamic neural networks can be used to approximate f and h
State-space solution to the identification problem
Both networks are trained by gradient descent, minimizing the error signals
e_I and e_II
Neural network (II)
Identification of plant output
Actual state x(n) is used as input rather
than the predicted state
Neural network (I)
Identification of plant state
State must be physically accessible!
System identification using input-output model
If the system state is not accessible
→ identification by an input-output model
Plant description by an input-output model
f is unknown nonlinear vector function
Input-output formulation is equivalent to NARX formulation
NARX neural network can be used to approximate f
q past inputs and outputs should be available
y(n+1) = f( y(n), ..., y(n−q+1), u(n), ..., u(n−q+1) )
Input-output solution to the identification problem
NARX neural network can be used
as a dynamic identification model
Series-parallel learning
system output is used as feedback,
not the predicted output
Parallel architecture for application
5.9 Model-reference adaptive control
Dynamic networks are important for feedback control
systems
MULTIPLE PROBLEMS:
Nonlinear coupling of plant state with control signals
Presence of unmeasured or random disturbances
Possibility of a nonunique plant inverse
Presence of unobservable plant states
MRAC = Model reference adaptive control
Well suited for the use of neural networks
Possible control methods:
Direct MRAC
Indirect MRAC
MRAC using direct control
Unknown plant dynamics → adaptive learning
Controller + plant = closed-loop feedback system
Controller and plant build an externally recurrent network
How to get plant gradients? → indirect control
Controller:      u(n) = f( x_c(n), y_p(n), r(n), w_c )
Reference model: d(n+1) = g( x_r(n), r(n) )
MRAC using indirect control
Two step procedure to train the controller
1. Identification of the plant (identification model)
2. Using plant model to obtain dynamic
derivatives to train the controller
Controller and plant model build an
externally recurrent network
Summary
Layer recurrent network
Focused time-delay neural network Distributed time-delay neural network
NARX network
6. Radial Basis Function Networks
6.1 RBFN structure
6.2 Exact interpolation
6.3 Radial basis functions
6.4 Radial basis function networks
6.5 RBFN training
6.6 RBFN for classification
6.7 Comparison with multilayer perceptron
6.8 Probabilistic networks
6.9 Generalized regression networks
Introduction
RBFN = Radial Basis Function Network
New class of neural networks
Multilayer perceptrons: the output is a nonlinear function of the scalar
product of the input vector and the weight vector
RBFN: the activation of a hidden unit is determined by the distance between
the input vector and a prototype vector
RBFN theory forms a link between
Function approximation
Regularization
Noisy interpolation
Density estimation
Optimal classification theory
6.1 RBFN structure
Feedforward network with two computation layers
1. Hidden layer implements a set of radial basis functions (e.g. Gaussian functions)
2. Output layer implements linear summation functions (as in MLP)
RBFN properties
Two-stage training procedures
1. Training of hidden layer weights
2. Training of output layer weights
Training/learning is very fast
RBFN provides excellent interpolation
6.2 Exact interpolation (1/3)
Exact interpolation task = mapping of every input vector exactly
onto the corresponding output vector in the multi-dimensional space
The goal is to find a function that will map input vectors x into target
vectors t
Radial basis function approach (Powell, 1987) introduces a set of N
basis functions, one for each data point x_p, in the form
φ( ||x − x_p|| )
Basis functions are nonlinear and depend on the distance between the
input x and the stored prototype x_p:
||x − x_p|| = √( (x_1 − x_1^p)² + ... + (x_M − x_M^p)² )
Exact interpolation (2/3)
Output is a linear combination of basis functions
Goal is to find the weights w_p such that the function goes through all
data points
We introduce the matrix formulation Φw = t
Provided that the inverse of Φ exists, the weights are obtained by any
standard matrix inversion technique: w = Φ⁻¹ t
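A minimal Python sketch of exact interpolation with Gaussian basis functions for 1-D inputs; the width σ and the toy data are assumptions for illustration:

```python
import numpy as np

def exact_interpolation(X, t, sigma):
    """Exact RBF interpolation: one Gaussian basis function per data point,
    weights from w = Phi^-1 t (assumes distinct points so Phi is non-singular)."""
    D = np.abs(X[:, None] - X[None, :])            # pairwise distances (1-D inputs)
    Phi = np.exp(-D**2 / (2 * sigma**2))
    w = np.linalg.solve(Phi, t)
    def f(x):                                      # the interpolating function
        phi = np.exp(-(x - X)**2 / (2 * sigma**2))
        return phi @ w
    return f

X = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([0.0, 0.8, 0.9, 0.1])
f = exact_interpolation(X, t, sigma=0.7)
```

By construction f(x_p) = t_p at every data point, which is exactly the interpolation condition Φw = t.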
Exact interpolation (3/3)
For a large class of functions φ, the matrix Φ is indeed non-singular
provided that the data points are distinct
The solution represents a continuous differentiable surface
that passes exactly through each data point
Both theoretical and empirical studies confirm (in the context of
exact interpolation) that many properties of the interpolating function
are relatively insensitive to the precise form of the basis functions
Various forms of basis functions can be used
6.3 Radial basis functions (1/2)
1. Gaussian
2. Multi-Quadratic
3. Generalized Multi-Quadratic
4. Inverse Multi-Quadratic
Radial basis functions (2/2)
5. Generalized Inverse Multi-Quadratic
6. Thin Plate Spline
7. Cubic
8. Linear
All the basis functions above are functions of the radial distance
r = ||x − x_p|| = √( (x_1 − x_1^p)² + ... + (x_M − x_M^p)² )
Properties of radial basis functions
Gaussian and Inverse Multi-Quadratic basis functions are localised
The localised property is not strictly necessary: all the other functions
(Multi-Quadratic, Cubic, Linear, ...) are not localised
Note that even the linear function φ(r) = r is still non-linear in the
components of x. In one dimension, this leads to a piecewise-linear
interpolating function which performs the simplest form of exact interpolation
For neural network mappings, there are good reasons for preferring
localised basis functions we will focus on Gaussian basis functions
Exact interpolation example (1/2)
Interpolation problem
We would like to find a function
which fits all data points
Solution approach
Superposition of Gaussian
radial basis functions
Exact interpolation example (2/2)
σ = 0.02
σ = 1
σ = 20
6.4 Radial basis function networks
Exact interpolation model using RB functions can already be
described as a radial basis function network
N training inputs directly determine hidden layer prototypes (centers
of hidden layer neurons)
Training inputs and outputs also directly determine output weights
Problems with exact interpolation
1. Exact interpolation of noisy data yields a highly oscillatory function
such interpolating functions are generally undesirable
2. Number of basis functions is equal to the number of data patterns
exact RBF networks are not computationally efficient
RBF neural network model
Introduced by Moody & Darken (1989) by several
modifications of exact interpolation procedure
Number M of basis functions (hidden units) need not equal the number N
of training data points. In general it is better to have M much less than N.
Centers of basis functions do not need to be defined as the training data
input vectors. They can instead be determined by a training algorithm.
Basis functions need not all have the same width parameter σ. These
can also be determined by a training algorithm.
We can introduce bias parameters into the linear sum of activations at the
output layer. These will compensate for the difference between the
average value over the data set of the basis function activations and the
corresponding average value of the targets.
Improved RBFN
Including the proposed changes and expanding to multidimensional outputs:
y_k(x) = Σ_{j=1..M} w_kj φ_j(x) + w_k0
This can be simplified by introducing an extra basis function φ_0 = 1:
y_k(x) = Σ_{j=0..M} w_kj φ_j(x)
For the case of a Gaussian RBF, with centers μ_j and widths σ_j:
φ_j(x) = exp( −||x − μ_j||² / (2σ_j²) )
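A sketch of the resulting forward pass, with the extra basis function φ_0 = 1 acting as the bias term; the shapes and names are illustrative assumptions:

```python
import numpy as np

def rbfn_forward(x, centers, sigma, W):
    """Improved RBFN output: y_k(x) = sum_j w_kj * phi_j(x), with phi_0(x) = 1
    acting as a bias. W has shape (K outputs, M+1), centers shape (M, d)."""
    d2 = np.sum((centers - x)**2, axis=1)                         # squared distances
    phi = np.concatenate([[1.0], np.exp(-d2 / (2 * sigma**2))])   # phi_0 = 1 first
    return W @ phi

centers = np.array([[0.0, 0.0], [1.0, 1.0]])
W = np.array([[0.5, 1.0, -1.0]])          # 1 output: bias weight + 2 hidden units
y = rbfn_forward(np.array([0.0, 0.0]), centers, sigma=1.0, W=W)
```

At the first center the nearest basis function saturates at 1 while the farther one is damped, so the output is dominated by the local unit plus the bias.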
RBFN in Matlab notation
RBF neuron: defined by a center and a width
RBF network: defined by centers, widths, biases, and output weights
Computational power of RBFN
Hartman et al. (1990)
Formal proof of universal approximation property for networks with Gaussian
basis functions in which the widths are treated as adjustable parameters
Girosi & Poggio (1990)
Showed that RBF networks possess the best approximation property, which
states: in the set of approximating functions there is one function which has
minimum approximating error for any given function to be approximated.
This property is not shared by multilayer perceptrons!
As with the corresponding proofs for MLPs, RBFN proofs rely on the
availability of an arbitrarily large number of hidden units (i.e. basis functions)
However, proofs provide a theoretical foundation on which practical
applications can be based with confidence
6.5 RBFN training
Key aspect of RBFN:
different roles of first and second computational layer
Training process can be divided into two stages
1. Hidden layer training
2. Output layer training
Hidden layer can be trained by unsupervised methods (random
selection, clustering, ...)
Output layer has linear activation → output weights are determined
analytically by solving a set of linear equations
Gradient descent learning is not needed for RBFN, therefore
training is very fast!
Hidden layer training
One major advantage of RBF networks is the possibility of choosing
suitable hidden unit (basis function) parameters without having to
perform a full non-linear optimization of the whole network
Methods for unsupervised selection of basis function centers
Fixed centres selected at random
Orthogonal least squares
K-means clustering
Problems with unsupervised methods
Selection of number of centers M
Selection of center widths
It is also possible to perform a full supervised non-linear optimization
of the network instead
Fixed centres selected at random
Simplest and quickest approach to setting RBFN parameters
Centers fixed at M points selected randomly from the N data points
Widths fixed to be equal at an appropriate size for the distribution of data points
Specifically, we can use RBFs centred at selected points {μ_j}, defined by
φ_j(x) = exp( −||x − μ_j||² / (2σ_j²) )
Widths σ_j are all related in the same way to the maximum or
average distance between the chosen centres μ_j
A common choice is σ = d_max / √(2M), which ensures that the individual RBFs
are neither too wide nor too narrow for the given training data
For large training sets, this approach gives reasonable results
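A Python sketch of this selection scheme; the width heuristic σ = d_max/√(2M) is one common choice (an assumption here), and other relations to the average distance are also used:

```python
import numpy as np

def random_centers(X, M, rng):
    """Pick M centers at random from the N data points and set one shared width
    sigma = d_max / sqrt(2M), where d_max is the maximum inter-center distance."""
    idx = rng.choice(len(X), size=M, replace=False)
    centers = X[idx]
    D = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    sigma = D.max() / np.sqrt(2 * M)
    return centers, sigma

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))          # toy data set, N = 100 points in 2-D
centers, sigma = random_centers(X, M=10, rng=rng)
```

With the centers and width fixed, only the linear output layer remains to be trained.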
Orthogonal least squares
A more principled approach to selecting a sub-set of data points as
the basis function centres is based on the technique of orthogonal
least squares
1. Sequential addition of new basis functions, each centred on one of the data
points
2. At each stage, we try out each potential Lth basis function by using the N−L
other data points to determine the network's output weights
3. The potential Lth basis function which gives the smallest output error is used,
and we move on to choose which L+1th basis function to add
To get good generalization we generally use cross-validation to
stop the process when an appropriate number of data points have
been selected as centers
K-means clustering
A potentially even better approach is to use clustering techniques to
find a set of centres which more accurately reflects the distribution of
the data points
K-Means Clustering Algorithm
Select the number of centres (K) in advance
Apply a simple re-estimation procedure to partition the data points {x_p}
into K disjoint subsets S_j containing N_j data points, to minimize the
sum-squared clustering function
J = Σ_j Σ_{p∈S_j} ||x_p − μ_j||²
where μ_j is the mean/centroid of the data points in set S_j, given by
μ_j = (1/N_j) Σ_{p∈S_j} x_p
Once the basis centres have been determined in this way, the
widths can then be set according to the variances of the points in the
corresponding cluster
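A plain K-means sketch in Python; the deterministic initialisation from the first K points is a simplification (practical implementations use random restarts), and the two-blob data set is fabricated for illustration:

```python
import numpy as np

def kmeans(X, K, n_iter=20):
    """Plain K-means: assign each point to its nearest centre, then move each
    centre to the mean (centroid) of its assigned points."""
    centers = X[:K].copy()                  # simple deterministic initialisation
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=-1)
        labels = d.argmin(axis=1)           # nearest-centre assignment
        for j in range(K):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs around (0, 0) and (3, 3)
A = np.array([[0.01 * i, 0.0] for i in range(20)])
B = np.array([[3.0 + 0.01 * i, 3.0] for i in range(20)])
centers, labels = kmeans(np.vstack([A, B]), K=2)
```

The converged centres can then serve directly as the RBF prototypes, with widths set from the per-cluster variances as described above.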
K-means clustering example
Output layer training
After training the hidden layer neurons (selection of centers and
widths), output layer training essentially means optimization of a
single layer linear network
As with MLPs, a sum-squared output error can be defined
At the minimum of E, the gradients with respect to the weights w_ki are zero
Computing the output weights
Equations for the weights are most conveniently written in matrix
form by defining matrices
which gives
and the formal solution for the weights is
here we have the standard pseudo inverse of
Network weights can be computed by fast linear matrix inversion
techniques
In practice, singular value decomposition (SVD) is often used to avoid possible
ill-conditioning of Φ, i.e. ΦᵀΦ being singular or near singular
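In Python terms the SVD-based solution corresponds to `numpy.linalg.pinv`. A sketch; the toy activation matrix and targets are fabricated for illustration:

```python
import numpy as np

def output_weights(Phi, T):
    """Least-squares output weights W = pinv(Phi) @ T.  numpy.linalg.pinv is
    computed via SVD, which stays stable when Phi'Phi is near singular."""
    return np.linalg.pinv(Phi) @ T

# N=4 patterns, bias column + 2 basis activations, K=1 output
Phi = np.array([[1.0, 0.9, 0.1],
                [1.0, 0.5, 0.5],
                [1.0, 0.1, 0.9],
                [1.0, 0.3, 0.8]])
T = Phi @ np.array([[0.3], [1.0], [-1.0]])   # targets generated by known weights
W = output_weights(Phi, T)
```

Since the targets lie exactly in the range of Φ, the recovered weights match the generating ones; with noisy targets the same call returns the least-squares fit.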
Supervised RBFN training
Supervised training of basis function parameters can give good
results, but the computational costs are usually enormous
Obvious approach is to perform gradient descent on a sum squared
output error function as in MLP backpropagation learning. Error
function would be
Supervised RBFN training would iteratively update the weights
(basis function parameters) using gradients
Supervised RBFN training
By using Gaussian basis functions, the derivatives of the error function
become very complex and therefore computationally inefficient
Additionally, we get all the problems of choosing the learning rates,
avoiding local minima ... that we had for training MLPs by
backpropagation
And there is a tendency for the basis function widths to grow large
leaving non-localised basis functions
Regularization theory for RBFN
Alternative approach to prevent overfitting in RBFN
Based on the theory of regularization, which is a method of
controlling the smoothness of mapping functions
We can have one basis function for each training data point as in the
case of exact interpolation, but add an extra term to the error
measure which penalizes mappings which are not smooth
Regularization term in error measure
In the regularization approach, the error measure is modified with an
additional regularization term composed from
a differential operator P, and
a regularization parameter λ
The regularization parameter λ determines the relative importance of
smoothness compared with error
Differential operator P can have many possible forms, but the
general idea is that mapping functions which have large curvature
should yield large regularization term and hence contribute a large
penalty in the total error function
RBFN training summary
Option 1) Exact interpolation model + Regularization
Option 2) Supervised RBFN training
Option 3) Two-stage hybrid training
3a) Hidden layer training
Fixed centres selected at random
Orthogonal least squares
K-means clustering
3b) Output layer training
Linear matrix operation
Where to start?
Two stage hybrid training with K-means clustering and linear
matrix operation for output layer
6.6 RBFN for classification
Key insight into RBFN can be obtained by using such networks for
classification problems
Suppose we have data set with three classes
MLP RBFN
Multilayer perceptron can separate classes by using hidden units to
form hyperplanes in the input space
Alternative approach is to model the separate class distributions by
localised radial basis functions
Implementing RBFN for classification
Define an output function y_k(x) for each class k with appropriate targets
RBFN is trained with input patterns x and corresponding target classes t
Underlying justification for using RBFN for classification is found in
Cover's theorem, which states:
A complex pattern classification problem cast in a high-dimensional
space non-linearly is more likely to be linearly separable than in a low-
dimensional space.
Once we have linearly separable patterns, the classification problem can be
solved by a linear layer
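A classic illustration of Cover's theorem (a toy example of my own, not from the slides) is XOR: the four patterns are not linearly separable in the input space, but after passing through two Gaussian basis functions centred at (0,0) and (1,1) they become separable by a single linear threshold:

```python
import math

def phi(x, c, s=1.0):
    """Gaussian basis function with centre c and width s."""
    d2 = (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    return math.exp(-d2 / (2 * s * s))

def classify_xor(x):
    """Hidden layer: two Gaussians at (0,0) and (1,1).
    Output layer: a single linear threshold on phi1 + phi2.
    Class-0 corners map to sum ~1.368, class-1 corners to ~1.213,
    so the threshold 1.29 separates them."""
    h = phi(x, (0, 0)) + phi(x, (1, 1))
    return 1 if h < 1.29 else 0
```

The non-linear hidden mapping does all the work; the output stage is purely linear, exactly as the slide describes.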
6.7 Comparison with multilayer perceptron
Similarities between RBF networks and MLPs
1. They are both non-linear feed-forward networks
2. They are both universal approximators for arbitrary nonlinear functional mappings
3. They can be used in similar application areas
There always exists an RBF network capable of accurately mimicking a specified
MLP, or vice versa.
MLP RBFN
RBFN / MLP differences
MLP
1. Can have any number of hidden layers
2. Computation nodes (processing units) in
different layers share a common neuronal
model, though not necessarily the same
activation function
3. Argument of each hidden unit activation
function is the inner product of the input
and the weights
4. Usually trained with a single global
supervised algorithm
5. Construct global approximations to non-
linear input-output mappings with
distributed hidden representations
6. Require a smaller number of parameters
RBFN
1. Single hidden layer
2. Hidden nodes (basis functions)
operate very differently, and have a
different purpose compared to the
output nodes
3. Argument of each hidden unit
activation function is the distance
between the input and the weights
(RBF centres)
4. Usually trained one layer at a time
with the first layer unsupervised
5. Use localised non-linearities
(Gaussians) at the hidden layer to
construct local approximations
6. Fast training
6.8 Probabilistic networks
Probabilistic neural networks (PNN) can be used for classification problems
First layer computes distances from the input vector to the training input
vectors (prototypes) and produces a vector whose elements indicate how
close the input is to a training input
Second layer sums these contributions for each class of inputs to produce as
its net output a vector of probabilities
Finally, a competitive output layer picks the maximum of these probabilities,
and produces 1 for that class and 0 for the other classes
PNN example 1
Three training patterns
Classifying new sample
PNN division of the input space
PNN example 2 (1/4)
PNN example 2 (2/4)
PNN example 2 (3/4)
PNN example 2 (4/4)
PNN considerations
Probabilistic neural networks are specialized to classification (less
general than RBFN or MLP)
PNN are sensitive to the selection of the spread parameter; spread
can be optimized by the leave-one-out cross-validation technique
1. Leave one training sample out, train PNN and test on the omitted sample
2. Repeat procedure for all samples and save results
3. Find optimal spread that yields minimal average classification error
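The three steps above can be sketched as follows (a minimal 1-D PNN in Python; the candidate spreads and function names are illustrative assumptions):

```python
import math

def pnn_classify(x, train, spread):
    """Minimal 1-D PNN: one Gaussian kernel per stored training sample,
    summed per class; the competitive layer picks the largest class sum."""
    sums = {}
    for s, label in train:
        k = math.exp(-(x - s) ** 2 / (2 * spread ** 2))
        sums[label] = sums.get(label, 0.0) + k
    return max(sums, key=sums.get)

def loo_error(train, spread):
    """Fraction of samples misclassified when each is left out in turn."""
    errors = sum(1 for i, (x, label) in enumerate(train)
                 if pnn_classify(x, train[:i] + train[i + 1:], spread) != label)
    return errors / len(train)

def best_spread(train, candidates):
    """Step 3: pick the spread with minimal leave-one-out error."""
    return min(candidates, key=lambda s: loo_error(train, s))
```

A spread that is far too large blurs the class boundaries and drives the leave-one-out error up, which is exactly what the search detects.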
Benefits
Little or no training required (except spread optimization)
Beside classifications, PNN also provides Bayesian posterior probabilities, a solid theoretical
foundation to support confidence estimates for the network's decisions
Robust against outliers: outliers have no real effect on decisions
Drawbacks
PNN performance depends strongly on a thoroughly representative training set
Entire training set must be stored: large memory requirements and poor execution speed
6.9 Generalized regression networks
GRNN can be well explained by reviewing the regression problem:
How to use measured values (independent variables) to predict the
value of a dependent variable?
(figures: a case where linear regression is OK, and a case where linear regression fails)
Simple linear regression
Simple linear regression is expressed with
Given the training data, the slope a and bias b are computed as
Compute sum of squares
Compute slope and bias
Resulting linear equation will minimize mean squared error of
predicted values y in the training set
$y = ax + b$
$SS_x = \sum_i (x_i - \bar{x})^2, \quad SS_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$
$a = SS_{xy} / SS_x, \quad b = \bar{y} - a\bar{x}$
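A direct transcription of these formulas (Python sketch; the function name is an assumption):

```python
def fit_line(xs, ys):
    """Least-squares slope a and bias b from the sums of squares:
    a = SS_xy / SS_x,  b = ybar - a * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - xbar) ** 2 for x in xs)
    ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    a = ss_xy / ss_x
    return a, ybar - a * xbar
```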
Multiple regression
Several independent variables x_1, x_2, x_3, ...
Matrix notation
Pack training data into matrices
Parameter can be expressed as
Final solution is usually obtained numerically by singular value
decomposition method (SVD)
$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4$
$y = [x_1, x_2, x_3, 1]\, a$
$Y = X a$
$a = (X^T X)^{-1} X^T Y$
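For illustration, the normal equations can be solved directly (Python sketch with plain Gaussian elimination; as the slide notes, SVD is the numerically preferred route for ill-conditioned X):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    a = [0.0] * n
    for k in range(n - 1, -1, -1):
        a[k] = (M[k][n] - sum(M[k][c] * a[c] for c in range(k + 1, n))) / M[k][k]
    return a

def multiple_regression(X, Y):
    """a = (X^T X)^(-1) X^T Y; each row of X ends with 1 for the bias term."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(n)]
    return solve(XtX, XtY)
```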
Best predictor for dependent variable y is defined by its conditional
expectation, given the independent variable x
Joint density function $f_{xy}(x,y)$ is not known but can be approximated by the Parzen estimator
By using the Parzen approximator with Gaussian kernels, we obtain
equation for GRNN predictor
General regression neural network
GRNN properties
GRNN closely resembles RBFN with a normalization term in the
denominator; it is sometimes called Normalized RBFN
GRNN also resembles PNN but is used for regression (function
approximation), not for classification
Width parameter spread must be selected, as in all RBF networks
First layer has Gaussian kernels located at each training case and
computes distances from the input vector to the training input vectors
(prototypes)
Second layer is a special linear layer with normalization operator
Normalization makes GRNN a very robust predictor
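The resulting predictor is a kernel-weighted average of the training targets (the Nadaraya-Watson form). A minimal 1-D Python sketch (function name is an assumption):

```python
import math

def grnn_predict(x, xs, ys, spread):
    """GRNN prediction: Gaussian-kernel-weighted average of training
    targets; the normalization term is the kernel sum in the denominator."""
    w = [math.exp(-(x - xi) ** 2 / (2 * spread ** 2)) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
```

Because the output is a normalized weighted average, predictions always stay within the range of the training targets, which is one reason the GRNN is a robust predictor.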
GRNN architecture
Standard Radial basis layer Normalization Linear layer
RBFN vs. GRNN example (1/3)
RBFN vs. GRNN example (2/3)
RBFN vs. GRNN example (3/3)
Summary
(figures: RBFN, PNN, and GRNN architectures)
2012 Primož Potočnik NEURAL NETWORKS (7) Self-Organizing Maps #293
7. Self-Organizing Maps
7.1 Self-organization
7.2 Self-organizing maps
7.3 SOM algorithm
7.4 Properties of the feature map
7.5 SOM discussion & examples
Introduction
1. We discussed so far a number of networks which were trained to
perform a mapping
INPUTS → OUTPUTS
which corresponds to supervised learning paradigm
2. However, problems exist where target outputs are not available
the only information is provided by a set of input patterns
INPUTS → ??
which corresponds to unsupervised learning paradigm
Examples of problems
Clustering
Input data are grouped into clusters; for any input, the neural net should return the
corresponding cluster label
Vector quantization
A continuous space has to be discretised; the neural net has to find an optimal
discretisation of the input space
Dimensionality reduction
Input data are grouped in a subspace with lower dimensionality than the original
data; the neural net has to learn an optimal mapping such that most of the
variance in the input data is preserved in the output data
Feature extraction
The system has to extract features from the input signal; this often means a
dimensionality reduction as described above
7.1 Self-organization
What is self-organization?
System structure appears without explicit pressure or involvement from outside
the system
Constraints on form (i.e. organization) of interest to us are internal to the system,
resulting from the interactions among the components
The organization can evolve in either time or space, maintain a stable form or
show transient phenomena
Self-organization properties
Typical features include (in rough order of generality)
Autonomy (absence of external control)
Dynamic operation (evolution in time)
Fluctuations (noise / searches through options)
Symmetry breaking (loss of freedom)
Global order (emergence from local interactions)
Dissipation (energy usage / far-from-equilibrium)
Instability (self-reinforcing choices / nonlinearity)
Multiple equilibria (many possible attractors)
Criticality (threshold effects / phase changes)
Redundancy (insensitivity to damage)
Self-maintenance (repair / reproduction metabolisms)
Adaptation (functionality / tracking of external variations)
Complexity (multiple concurrent values or objectives)
Hierarchies (multiple nested self-organized levels)
John Conway's Game of Life
Devised by John Conway (1970), popularized in Scientific American
Game of Life:
infinite two-dimensional grid of square cells,
each cell is in one of two possible states, alive or dead,
every cell interacts with its eight neighbours
RULES:
1. A live cell with fewer than 2 or more than 3 neighbours dies (loneliness / overcrowding)
2. A dead cell with exactly 3 neighbours turns alive (reproduction)
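The two rules fit in a few lines (a Python sketch over a sparse set of live cells; the function name is an assumption):

```python
from collections import Counter

def life_step(alive):
    """One Game of Life generation; alive is a set of (x, y) live cells."""
    # count, for every neighbouring coordinate, how many live cells touch it
    counts = Counter((x + dx, y + dy)
                     for (x, y) in alive
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # survive with 2 or 3 neighbours, be born with exactly 3
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in alive)}
```

A horizontal row of three cells (the "blinker") flips to a vertical row and back, a minimal example of the self-organized dynamics these rules produce.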
Glider Gun creating gliders
Self-organization in neural networks
Self-organizing networks are based on competitive learning
output neurons of the network compete to be activated and only one neuron can
become a winning neuron
Self-organizing maps (SOM)
learn to recognize groups of similar input vectors in such a way that neurons
physically near each other in the neuron layer respond to similar input vectors
Learning vector quantization (LVQ)
a method for training competitive layers in a supervised manner
learns to classify input vectors into target classes chosen by the user
Neurobiological motivation
Neurobiological studies indicate that different sensory inputs (tactile, visual,
auditory, etc.) are mapped onto different areas of the cerebral cortex in an
ordered fashion
This form of a map, known as a topographic map, has two important
properties:
1. At each stage of representation, or processing, each piece of incoming
information is kept in its proper context / neighbourhood
2. Neurons dealing with closely related pieces of information are kept close
together so that they can interact via short synaptic connections
Our interest is in building artificial topographic maps that learn through
self-organization in a neurobiologically inspired manner
We shall follow the principle of topographic map formation:
The spatial location of an output neuron in a topographic map corresponds
to a particular domain or feature drawn from the input space
7.2 Self-organizing maps (SOM)
Neurons are placed at the nodes of a lattice, usually 1D or 2D
Neurons are trained by self-organized competitive learning rule
Neurons become selectively tuned to various input patterns or
classes of input patterns
Locations of neurons become ordered in a way that a meaningful
topographic map of input patterns is created
The process of ordering is automatic (self-organized) without
guidance from outside
Self-organizing maps are inherently nonlinear: a nonlinear
generalization of principal component analysis (PCA)
Organization of a self-organizing map
Points x from the input space are mapped to points
I(x) in the output space (self-organizing map)
Each point I in the output space will map to a corresponding point
w(I) in the input space
Kohonen network
Kohonen (1982) : Self-organized formation of topologically correct
feature maps. Biological Cybernetics
Kohonen network or Self-Organizing Map (SOM) has a single
computational layer arranged in rows and columns
1D, 2D, 3D
Each neuron is fully connected to all source nodes in the input layer
SOM architecture
Calculating the distance
between inputs and
neurons
dist
Competitive layer
selection of a winning
neuron and its
neighborhood
dist, linkdist, mandist,
boxdist
Topologies:
1D, 2D, 3D
7.3 SOM algorithm
1. Initialization
Define SOM topology, then initialize weights with small random
values
2. Competition
For each input pattern, neurons compute their values of a distance
function which provides the basis for competition. A neuron with the
smallest distance to the input pattern is declared the winner.
3. Cooperation
Winning neuron determines the topological neighbourhood of
excited neurons, thereby providing the basis for cooperation among
neighbouring neurons
4. Adaptation
Excited neurons decrease their distance to the input pattern through
adjustment of synaptic weights response of the winning neuron
to the subsequent application of a similar input pattern is enhanced
Competition - Cooperation - Adaptation
We have m-dimensional input space
Synaptic weight vector of each neuron in the network has the same
dimension as input space
The best match of the input vector x with the synaptic weight vectors w_j can
be found by comparing the Euclidean distance between input vector x and
each neuron j
Neuron whose weight vector comes closest to the input vector (i.e. is most
similar to it) is declared the winning neuron
In this way the continuous input space can be mapped to the discrete output
space of neurons by a simple process of competition between the neurons
$x = [x_1, x_2, \ldots, x_m]$
$w_j = [w_{j1}, w_{j2}, \ldots, w_{jm}], \quad j = 1, \ldots, K$
$d_j(x) = \|x - w_j\|$
Competition - Cooperation - Adaptation
Winning neuron locates the center of a topological neighborhood of
cooperating neurons
Neurobiological studies confirm that there is lateral interaction
within a set of excited neurons
When one neuron fires, its closest neighbours tend to get excited more than
those further away
Topological neighbourhood decays with distance
We define a similar neurobiologically correct topological
neighbourhood for the neurons in SOM and assume two
requirements:
1. Topological neighborhood is symmetric around the winning neuron
2. Amplitude of the topological neighborhood decreases monotonically with
increasing lateral distance
(decaying to zero in the limit d → ∞, which is necessary for convergence)
Competition - Cooperation - Adaptation
A typical choice of a topological neighbourhood function that covers
both requirements is defined by the Gaussian function
Gaussian function is translation invariant
(independent of the location of the winning neuron)
$h_{j,i(x)} = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2}\right)$
where $\sigma$ is the effective width of the topological neighborhood
Competition - Cooperation - Adaptation
For cooperation to be effective, the topological neighborhood must
depend on the lateral distance between the winning neuron and its
neighbors in the output space, and NOT on a distance measure in the
original input space
Winning neuron
Neighbours
Distance:
dist
linkdist
mandist
boxdist
Competition - Cooperation - Adaptation
Another special feature of the SOM algorithm is that size of the
topological neighborhood shrinks with time
Shrinking requirement is fulfilled by decreasing the width of the
Gaussian neighborhood function with time. Popular choice is
exponential temporal decay
Consequently, topological neighborhood function assumes time-
varying form
Time increases → width decreases → neighborhood shrinks
$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau_1}\right), \quad n = 1, 2, \ldots$
$h_{j,i(x)}(n) = \exp\left(-\frac{d_{j,i}^2}{2\sigma^2(n)}\right), \quad n = 1, 2, \ldots$
Competition - Cooperation - Adaptation
Time increases → width decreases → neighborhood shrinks
2012 Primo Potonik NEURAL NETWORKS (7) Self-Organizing Maps #312
Competition - Cooperation - Adaptation
Clearly, SOM must involve some kind of adaptation or learning by
which the outputs become self-organised and the feature map
between inputs and outputs is formed
Meaning of the topographic neighbourhood is that not only the
winning neuron gets its weights updated, but its neighbours will
have their weights updated as well
Learning rule for adaptation
the rule is applied to all neurons inside the topological
neighbourhood of the winning neuron i
Adaptation moves the synaptic weights w_j of the chosen neurons
toward the input vector x
$w_j(n+1) = w_j(n) + \eta(n)\, h_{j,i(x)}(n)\, (x - w_j(n))$
Competition - Cooperation - Adaptation
Adaptation algorithm leads to a topological ordering of the feature
map neurons that are neighbours in the lattice will tend to have
similar weight vectors
Learning parameter $\eta(n)$ should be decreasing with time for proper
convergence, e.g. with exponential decay
$\eta(n) = \eta_0 \exp\left(-\frac{n}{\tau_2}\right), \quad n = 1, 2, \ldots$
Thus, the SOM algorithm requires the choice of several parameters: $\sigma_0, \tau_1, \eta_0, \tau_2$
Even if the selection of parameters is not optimal, it usually leads to the
formation of the feature map in a self-organized manner
Competition - Cooperation - Adaptation
Adaptation process can be decomposed in two phases
1. Self-organizing or ordering phase
Topological ordering of weight vectors
typically ca. 1000 iterations of the SOM algorithm
needs proper choice of neighbourhood function and learning rate
2. Convergence phase
Feature map fine-tuning
provides statistical quantification of the input space
typically the number of iterations is at least 500 times the number of neurons
Result of SOM algorithm
Starting from the initial state of complete disorder, SOM algorithm
gradually leads to an organized representation of activation
patterns drawn from the input space
However, it is possible to end up in a metastable state in which the feature map
has a topological defect
SOM algorithm essentials
Essential characteristics of the SOM algorithm:
Continuous input space of activation patterns that are generated
according to a certain probability distribution
Discrete output space in a form of a lattice of neurons
Shrinking neighborhood function h that is defined around a
winning neuron
Decreasing learning rate that is exponentially decreasing with time
SOM algorithm summary
1. Initialization
Choose random values for the initial weight vectors w_j
2. Sampling
Draw a sample training input vector x from the input space
3. Competition
Find the winning neuron with weight vector closest to input vector
4. Cooperation
Select neurons in the topological neighbourhood of the winning
neuron
5. Adaptation
Adjust synaptic weights of the selected neurons
6. Iteration
Continue with step 2 until the feature map stops changing
$w_j(n+1) = w_j(n) + \eta(n)\, h_{j,i(x)}(n)\, (x - w_j(n))$
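The six steps can be sketched for a 1-D lattice with scalar inputs (Python; all parameter defaults here are illustrative assumptions, not values prescribed by the lecture):

```python
import math, random

def train_som(data, n_neurons, n_iter=2000, eta0=0.1, sigma0=None, seed=0):
    """1-D SOM on scalar inputs: competition, cooperation, adaptation."""
    rng = random.Random(seed)
    w = [rng.uniform(min(data), max(data)) for _ in range(n_neurons)]  # 1. init
    sigma0 = sigma0 or n_neurons / 2.0
    tau1 = n_iter / math.log(sigma0)
    for n in range(n_iter):
        x = rng.choice(data)                                    # 2. sampling
        i = min(range(n_neurons), key=lambda j: abs(x - w[j]))  # 3. competition
        sigma = sigma0 * math.exp(-n / tau1)                    # shrinking width
        eta = eta0 * math.exp(-n / n_iter)                      # decaying rate
        for j in range(n_neurons):                              # 4.+5. cooperation
            h = math.exp(-(j - i) ** 2 / (2.0 * sigma * sigma)) #    and adaptation
            w[j] += eta * h * (x - w[j])
    return w
```

Note that the neighbourhood h is computed from the lattice distance |j - i|, not from a distance in the input space, and that both sigma and eta decay exponentially as on the previous slides.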
Visualizing the SOM algorithm (1/2)
Step 1
Suppose we have four data points (x)
in our continuous 2D input space, and
want to map this onto four points in a
discrete 1D output space (o). The
output nodes map to points in the input
space (o). Random initial weights start
the circles at random positions in the
centre of the input space.
Step 2
We randomly pick one of the data
points for training (). The closest
output point represents the winning
neuron ( ). That winning neuron is
moved towards the data point by a
certain amount, and the two
neighbouring neurons move by smaller
amounts ().
Visualizing the SOM algorithm (2/2)
Step 3
Next we randomly pick another data
point for training (). The closest
output point gives the new winning
neuron ( ). The winning neuron
moves towards the data point by a
certain amount, and the one
neighbouring neuron moves by a
smaller amount ().
Step 4
We carry on randomly picking data
points for training (). Each winning
neuron moves towards the data point
by a certain amount, and its
neighbouring neuron(s) move by
smaller amounts (). Eventually the
whole output grid unravels itself to
represent the input space.
Example: 1D Lattice driven by 2D distribution
2D input data distribution Initial condition of 1D lattice
End of ordering phase End of convergence phase
Parameters for 1D example
(a) Exponential decay of neighborhood width σ(n)
(b) Exponential decay of learning rate η(n)
(c) Initial neighborhood
function (spanning
over 100 neurons)
(d) Final neighborhood
function at the end of
the ordering phase
Example: 2D Lattice driven by 2D distribution
Initial condition of 2D lattice
End of ordering phase End of convergence phase
2D input data distribution
Matlab examples
nnd14fm1 1D feature map
nnd14fm2 2D feature map
7.4 Properties of the feature map
Property 1: Approximation of the input space
Property 2: Topological ordering
Property 3: Density matching
Property 4: Feature selection
Property 1: Approximation of the input space
The feature map, represented by the set of weight vectors {w_i} in the
output space, provides a good approximation to the input space
Goal of SOM can be formulated as to store a large set of input vectors {x} by
a smaller set of prototypes {w_i} that provide a good approximation to the
original input space.
Goodness of the approximation is given by the total squared distance which
we wish to minimize
If we work through gradient descent style mathematics we do end up with the
SOM weight update algorithm, which confirms that it is generating a good
approximation to the input space
Property 2: Topological ordering
The feature map computed by the SOM algorithm is topologically
ordered in the sense that the spatial location of a neuron in the output
lattice corresponds to a particular domain or feature of input patterns
The topological ordering property is a direct consequence of the weight
update equation:
Not only the winning neuron but also the neurons in the topological
neighbourhood are updated
Consequently the whole output space becomes appropriately ordered
Visualise the feature map as elastic net ...
Property 3: Density matching
The feature map reflects variations in the statistics of the input
distribution: Regions in the input space from which the sample training
vectors x are drawn with high probability of occurrence are mapped
onto larger domains of the output space, and therefore with better
resolution than regions of input space from which training vectors are
drawn with low probability.
We can relate the input vector probability distribution p(x) to the magnification
factor m(x) of the feature map. Generally, for two-dimensional feature maps
the relation cannot be expressed as a simple function, but in one dimension
we can show that $m(x) \propto p(x)^{2/3}$.
So the SOM algorithm doesn't match the input density exactly, because of
the power of 2/3 rather than 1.
As a general rule, the feature map tends to over-represent the regions with low
input density and to under-represent regions with high input density.
Property 4: Feature selection
Given data from an input space with a non-linear distribution, the self-
organizing map is able to select a set of best features for approximating
the underlying distribution
This property is a natural culmination of properties 1,2,3
Principal Component Analysis (PCA) is able to compute the input dimensions
which carry the most variance in the training data. It does this by computing
the eigenvector associated with the largest eigenvalue of the correlation
matrix.
PCA is fine if the data really does form a line or plane in input space; if
the data forms a curved line or surface, linear PCA fails, but a SOM will
overcome the approximation problem by virtue of its topological ordering
property.
The SOM provides a discrete approximation of finding so-called
principal curves or principal surfaces, and may therefore be viewed as a
non-linear generalization of PCA
SOM is a neural network built around 1D or 2D lattice of neurons for
capturing important features contained in input data
SOM provides a structural representation of input data by neurons
weight vectors as prototypes
SOM is neurobiologically inspired and incorporates self-organizing
mechanisms
Competition
Cooperation
Adaptation
SOM is simple to implement yet mathematically difficult to analyze
7.5 SOM discussion & examples
4 clusters, 1D SOM
4 clusters, 2D SOM
Uniform distribution, square
Uniform distribution of 1000 points in a square
(figures: trained weight vectors W(i,1) vs W(i,2) for a 1D SOM with 50 neurons and a 2D SOM with 10x10 neurons)
Uniform distribution, circle
Uniform distribution of 1000 points in a circle
(figures: trained weight vectors W(i,1) vs W(i,2) for a 1D SOM with 50 neurons and a 2D SOM with 10x10 neurons)
Gaussian distribution, square
Gaussian distribution of 1000 points in a square
(figures: trained weight vectors W(i,1) vs W(i,2) for a 1D SOM with 50 neurons and a 2D SOM with 10x10 neurons)
Complex distribution
Complex distribution in a square
(figures: trained weight vectors W(i,1) vs W(i,2) for a 1D SOM with 50 neurons and a 2D SOM with 10x10 neurons)
4 clusters, 2D SOM
4 classes with uniform distribution, 1000 points in each class
Net 8x8 neurons
(figure: trained weight vectors W(i,1) vs W(i,2))
2012 Primož Potočnik NEURAL NETWORKS (8) Practical considerations #337
8. Practical Considerations
8.1 Designing the training data
8.2 Preparing data
8.3 Selection of inputs
8.4 Data encoding
8.5 Principal component analysis
8.6 Invariances and prior knowledge
8.7 Generalization
8.8 General guidelines
Introduction
A neural network could, in principle, map raw input data into required
outputs; in practice, this will generally give poor results
For most applications, some data manipulations are recommended:
Preparing data
designing the training data
handling missing and extreme data
incorporating invariances and prior knowledge
Preparing inputs
pre-processing, rescaling, normalizing, standardizing, detrending
dimensionality reduction: principal component analysis
feature selection, feature extraction
Preparing outputs
encoding of classes, post-processing, rescaling, standardizing
8.1 Designing the training data
Good training data are required to train a NN
Neural nets are not good at extrapolation
Training data must be representative for the problem
considered
For pattern recognition
Every class must be represented
Within each class, statistical variation must be adequately represented
Potato chips factory example:
NN must be trained on 1) normal chips, 2) burned chips, 3) uncooked chips, ...
Large training set prevents overfitting
Overfitting = perfect fit to a small number of training data
Three-layer feedforward network example:
With 25 inputs and 10 hidden neurons, over 260 free parameters
Apply at least 500-1000 training samples (preferably more) for proper training
8.2 Preparing data
Some data transformations are usually necessary to achieve good
neural network results
Rescaling
Add/subtract a constant and then multiply/divide by a constant
Example: convert a temperature from Celsius to Fahrenheit
Standardizing
Subtracting a measure of location and dividing by a measure of scale
Example: subtracting a mean and dividing by standard deviation, thereby obtaining a
"standard normal" random variable with mean 0 and standard deviation 1
Normalizing
Dividing a vector by its norm
Example: make the Euclidean length of the vector equal to one.
In the NN literature, "normalizing" often refers to rescaling into [0,1] range
Which operations should be applied to data? It depends!
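The three transformations, sketched in Python (function names are assumptions; note that standardization statistics should come from the training set only):

```python
import math

def rescale(xs, lo=-1.0, hi=1.0):
    """Linear rescale into [lo, hi]; the slides suggest [-1,1] over [0,1]."""
    mn, mx = min(xs), max(xs)
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in xs]

def standardize(xs, mean=None, std=None):
    """Zero mean, unit standard deviation. For validation data, pass the
    mean/std computed on the TRAINING set instead of recomputing them."""
    if mean is None:
        mean = sum(xs) / len(xs)
    if std is None:
        std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def normalize(v):
    """Divide a vector by its Euclidean norm (unit length)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```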
Rescaling
Rescaling inputs
The often recommended rescaling of inputs to the interval [0,1] is a misconception;
there is in fact no such requirement.
Interval [0,1] is usually a bad choice; rescaling to the [-1,1] interval is better
Standardizing inputs is better than rescaling ...
Rescaling outputs
1. For bounded activation functions (range [0,1] or [-1,1] ) the target values must lie
within that range
The alternative is to use an activation function suited to the distribution of the
targets, for example linear activation function.
2. It is essential to rescale the multidimensional targets so that their variability
reflects their importance, or at least is not in inverse relation to their importance.
If the targets are of equal importance, they should typically be rescaled or
standardized to the same range or the same standard deviation.
Standardizing
Standardizing usually refers to transforming data
into zero mean with standard deviation one
Statistics (mean, std) are computed from training data, not from validation data
Validation data must be standardized using the statistics computed from
training data
Standardizing inputs
Often very beneficial for MLP and RBFN networks
RBFN inputs are combined via a distance function (Euclidean), therefore it is
important to standardize them to a similar range
For MLPs, standardizing enables utilization of the steep parts of transfer functions →
faster learning and avoidance of saturation
Standardizing outputs
Typically more a convenience for getting good initial weights than a necessity
Important for the equal relevance of targets
Note: use rescaling for bounded activation functions, not standardizing!
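A minimal sketch of the train-then-apply discipline described above (hypothetical helper names; statistics are fitted on training columns only, then applied to validation data):

```python
import math

def fit_standardizer(train_cols):
    """Compute (mean, std) per input variable from TRAINING data only."""
    stats = []
    for col in train_cols:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        stats.append((mean, std))
    return stats

def apply_standardizer(cols, stats):
    """Apply training-set statistics to any data set (train or validation)."""
    return [[(x - mean) / std for x in col]
            for col, (mean, std) in zip(cols, stats)]
```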
Time series transformations
Detrending
Removing linear trend from the time series
After neural network application, original trend is added to the results
Careful: it is too easy to create a trend where none belongs!
Removing seasonal components
Yearly, monthly, weekly, daily, hourly cycles can be removed before the
application of neural networks
Decomposition methods
Differencing
Working with differences between successive samples can sometimes bring
good results
Example: daily stock-market values convey one sort of information, the change
from one day to the next conveys entirely different information
Differencing can be applied at inputs and outputs; a powerful option is to apply
both raw and differenced inputs!
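Differencing and its inverse can be sketched as:

```python
def difference(xs):
    """First differences: d[t] = x[t+1] - x[t]."""
    return [b - a for a, b in zip(xs, xs[1:])]

def undifference(diffs, x0):
    """Invert differencing given the first original value."""
    xs = [x0]
    for d in diffs:
        xs.append(xs[-1] + d)
    return xs
```

`undifference` is needed after the neural network application, to map predicted changes back to the original scale.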
Why is detrending dangerous?
Example of time series preparation
1. Original data x
nonstationary mean
nonstationary variance
2. Log(x)
stationary variance
3. Differencing
stationary mean
stationary variance
Time series decomposition
Time series can often be decomposed into components
Trend (T)
Seasonal cycle (S)
Residual (E)
Decomposition can be additive or multiplicative
Additive: Y = T + S + E
Multiplicative: Y = T * S * E
Methods
X-12-ARIMA (U.S. Census Bureau, Statistical Research Division)
STL (Seasonal Trend Decomposition based on Loess)
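A crude additive decomposition in the spirit of Y = T + S + E can be sketched with a moving-average trend and per-position seasonal means (a toy illustration only, not an implementation of X-12-ARIMA or STL):

```python
def decompose_additive(xs, period):
    """Crude additive decomposition Y = T + S + E.
    Trend: centered moving average over one period.
    Seasonal: mean of detrended values at each position in the cycle."""
    half = period // 2
    n = len(xs)
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(xs[lo:hi]) / (hi - lo))  # window shrinks at the edges
    detrended = [x - t for x, t in zip(xs, trend)]
    seasonal_means = []
    for k in range(period):
        vals = detrended[k::period]
        seasonal_means.append(sum(vals) / len(vals))
    seasonal = [seasonal_means[i % period] for i in range(n)]
    resid = [x - t - s for x, t, s in zip(xs, trend, seasonal)]
    return trend, seasonal, resid
```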
Example of STL decomposition
Original data
Trend
Weekly cycle
Residual
Daily energy consumption
Missing data and outliers
Handling missing data is difficult
If not many data are missing, discard the incomplete samples
Substituting the missing data with mean values
Input vector with a missing single variable:
Find similar input vectors (without missing variable) based on a distance measure
Take the missing value as an average of the variable contained in the similar input
vectors
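The similarity-based substitution described above can be sketched as follows (hypothetical helper; Euclidean distance measured over the known variables only):

```python
import math

def impute_missing(sample, data, k=3):
    """Fill the single missing entry (None) in `sample` with the average of
    that variable over the k most similar complete input vectors."""
    j = sample.index(None)                      # index of the missing variable
    known = [i for i in range(len(sample)) if i != j]

    def dist(v):                                # distance over known variables
        return math.sqrt(sum((sample[i] - v[i]) ** 2 for i in known))

    nearest = sorted(data, key=dist)[:k]
    filled = list(sample)
    filled[j] = sum(v[j] for v in nearest) / k
    return filled
```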
Outliers can appear due to
Natural variation of the variable's distribution
Noise in data acquisition chain
Defects
Careful examination of the experiment is required to
confirm the validity of outliers
If outliers have some significance, keep them in the training data
Some abnormality is normal!
Do not reject a point unless it is really wild
8.3 Selection of inputs
Importance of inputs → which inputs should be selected for best
results (classification or prediction)?
Several aspects of importance:
Predictive importance
Increase in generalization error when an input is omitted from a network
Causal importance
How much the outputs change if inputs are changed (also called sensitivity)
Marginal importance
Considers inputs in isolation
Easy to compute without even training a neural net ...
(Pearson correlation, rank correlation, mutual information, ...)
Marginal importance is of little practical use other than for a preliminary
description of the data
How to measure importance of inputs
How to measure importance of inputs: very difficult!
Comparing weights in linear models can be misleading
Comparing standardized weights in linear models can be misleading
Comparing changes in the error function in linear models can be misleading
Statistical p-values can be misleading
Comparing weights in MLPs can be misleading
Sums of products of weights in MLPs can be misleading
Partial derivatives can be misleading
Average partial derivatives over the input space can be misleading
Average absolute partial derivative can be misleading
ftp://ftp.sas.com/pub/neural/importance.html
Methods of input selection
Practical approach:
Selection of inputs based on cross-validation
General framework
1. Select a subset of inputs
2. Train and validate the network based on the selected subset of inputs
3. Based on the validation result, decide upon further inclusion/rejection of inputs
4. Continue iterating until good results are obtained
Direct search methods
Exhaustive search
Forward selection
Backward elimination
Selection by genetic algorithms, ...
Pruning methods
Removing irrelevant inputs during the neural network construction
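The general framework above, combined with forward selection, can be sketched as follows (`evaluate` is a placeholder callback standing in for step 2: training and cross-validating a network on the given input subset, returning a validation error):

```python
def forward_selection(candidates, evaluate, max_inputs=None):
    """Greedy forward selection of inputs; lower evaluate() is better."""
    selected, best_err = [], float('inf')
    limit = max_inputs or len(candidates)
    while len(selected) < limit:
        # try adding each remaining candidate input in turn
        trials = [(evaluate(selected + [c]), c)
                  for c in candidates if c not in selected]
        err, best = min(trials)
        if err >= best_err:        # no further improvement -> stop
            break
        selected.append(best)
        best_err = err
    return selected
```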
8.4 Data encoding (1/3)
Numeric variables
No need for encoding → check the need for rescaling or standardizing
Ordinal variables
Discrete data with natural ordering (e.g. 'small', 'medium', 'big')
Ordinal variables can often be represented by a single variable
Small | 1 |
Medium | 2 |
Big | 3 |
Thermometer coding (using dummy variables)
Small | 0 0 1 |
Medium | 0 1 1 |
Big | 1 1 1 |
Improved thermometer coding → faster learning
Small | -1 -1 1 |
Medium | -1 1 1 |
Big | 1 1 1 |
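Both thermometer variants in the tables above can be sketched in one helper (levels numbered 0 = Small, 1 = Medium, 2 = Big; the function name is illustrative):

```python
def thermometer(level, n_levels, improved=True):
    """Encode an ordinal level 0..n_levels-1 with dummy variables.
    improved=True uses -1/+1 instead of 0/1 -> often faster learning."""
    off = -1 if improved else 0
    # rightmost position is always on; higher levels switch on more positions
    return [1 if n_levels - 1 - i <= level else off for i in range(n_levels)]
```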
Data encoding (2/3)
Categorical variables
Discrete data without ordering (e.g. 'apple', 'banana', 'orange' )
1-of-C coding
Red | 0 0 1 |
Green | 0 1 0 |
Blue | 1 0 0 |
1-of-(C-1) ... if the network has bias
Red | 0 0 |
Green | 0 1 |
Blue | 1 0 |
1-of-C coding with a softmax activation function
will produce valid posterior probability estimates
It is very important NOT to use a single variable for an unordered
categorical target
Circular discontinuity
How to encode variables that are fundamentally circular?
e.g. angle 0..360
Day of the week (Mon=1, ... Sun=7) → we have a discontinuity when passing
from 7 to 1, although Sunday and Monday are very close
Solutions
1. Discretizing and using any of the categorical coding (1-of-C)
2. Encoding with two dummy variables (sin,cos)
Data encoding (3/3)
(figure: sin/cos encoding of a circular variable on the unit circle)
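Solution 2 can be sketched as:

```python
import math

def encode_circular(value, period):
    """Encode a circular variable as (sin, cos) dummy variables so that
    neighbouring values stay close (e.g. Sunday=7 next to Monday=1)."""
    angle = 2.0 * math.pi * value / period
    return (math.sin(angle), math.cos(angle))
```

With this encoding the distance from Sunday to Monday equals the distance from Monday to Tuesday, removing the 7 → 1 discontinuity.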
8.5 Principal component analysis
In some situations, dimension of the input vector is large, but
the components of the vectors are highly correlated
It is useful in this situation to reduce the dimension of the
input vectors → feature extraction
An effective procedure for performing this operation is
principal component analysis (PCA)
PCA is a vector space transform used to reduce
multidimensional data to lower dimensions for analysis
PCA method generates a new set of variables, called principal
components
Calculation of principal components
Input matrix X is represented as a linear combination of
principal components
Projection vectors p_j are eigenvectors of the covariance matrix XX^T
Each principal component z_j is obtained as a product of the input
matrix with a projection vector: z_j = X p_j
Each principal component is a linear combination of original
variables
All the principal components are orthogonal to each other, so
there is no redundant information
PCA example
Original data: x_1, x_2
Principal components: z_1, z_2
Variability explained: z_1 → 95%, z_2 → 5%
Benefit
Dimensionality reduction by using only the first principal
component (z_1) instead of the original 2D data (x_1, x_2)
Properties of principal components
Principal components form an orthogonal basis for the data
1st principal component → the variance of this variable is the
maximum among all possible choices of the first axis
2nd principal component is perpendicular to the 1st
principal component. The variance of this variable is the
maximum among all possible choices of this second axis
Often, the first few principal components explain the majority
of the total variance → these few new variables can be taken
as low-dimensional input to the neural network instead of the
high-dimensional original data
How to use PCA for neural networks
1. Load original data X
2. Compute principal components z
3. Plot variance explained
4. Decide how much variance to keep ... 90%, 95%?
Keep only a few selected principal components, discard the
rest → data dimensionality is reduced
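For 2-D data the whole procedure fits in a few lines, because the eigendecomposition of a 2x2 covariance matrix has a closed form (a sketch for illustration; real applications would use a linear-algebra library):

```python
import math

def pca_2d(points):
    """PCA for 2-D data: eigenvalues/eigenvectors of the 2x2 covariance
    matrix in closed form, plus the fraction of variance explained."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    mid = (a + c) / 2.0
    off = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    eigvals = [mid + off, mid - off]           # largest first
    explained = [ev / (eigvals[0] + eigvals[1]) for ev in eigvals]
    eigvecs = [(b, ev - a) for ev in eigvals]  # valid when b != 0
    return eigvals, eigvecs, explained
```

Plotting `explained` corresponds to step 3; step 5 keeps only the components needed to reach the chosen variance threshold.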
Intrinsic dimensionality
Suppose we apply PCA on d-dimensional data and discover
that the first n eigenvalues have significantly larger values (n < d)
Consequently, the data can be represented with high accuracy by
the first n principal components → the effective dimensionality is only n
Generally, data in d dimensions have intrinsic dimensionality
equal to n if the data lie entirely within an n-dimensional subspace
Neural nets for dimensionality reduction
Multilayer feedforward neural networks can be used to perform
nonlinear dimensionality reduction
Auto-associative multilayer perceptron with extra hidden layers
can perform a general nonlinear dimensionality reduction
Number of neurons: 1024 → 300 → 50 → 300 → 1024
Nonlinear dimensionality reduction: 32 x 32 pixel input is compressed
to 50 dimensions and reconstructed back to 32 x 32 pixels
8.6 Invariances and prior knowledge
In many practical situations we have, in addition to the
data itself, also a priori knowledge
General information about the form of the mapping
Prior probabilities of class membership
Information about constraints
Knowledge about invariances
How to build invariances into neural networks?
1. Invariance by neural network structure
Shared weights, Higher-order neural networks
2. Invariance by training
Include a large number of translated inputs to train NN
3. Invariant feature space
Extract features that are invariant for the problem considered
Review of the Lecture NN-02 feature extraction ...
Handwritten character recognition problem
Recognize handwritten characters a and b
Image representation
Grid of pixels (typically 256x256) → 65536 inputs
Gray level [0..1] (typically 8-bit coding)
Extraction of the features:
Solving two problems
1. Invariance problem (translations)
2. Curse of dimensionality problem
F1 = character height / character width
F2 = closed area / character height
8.7 Generalization
Goal of network training is not to exactly fit the data but
to build a statistical model of the process that generates the data
Well trained network is able to generalize to make good
predictions on new inputs
Here it is assumed that the test data are drawn from the same
population used to generate the training data
Neural network designed to generalize well will produce
correct input-output mapping even if new inputs differ slightly
from the samples used to train the network
Overfitting problem → neural net learns the complete training
set but not the underlying function
Generalization in classification
The task of our network is to learn a classification decision boundary
If we know that training data contains noise, we don't necessarily
want the training data to be classified totally accurately, as that is
likely to reduce the generalization ability
(figures: good generalization vs. overfitting)
Generalization in function approximation
Function approximation based on noisy data samples
We can expect the neural network output to give a better
representation of the underlying function if its output curve does not
pass through all the data points
Again, allowing a larger error on the training data is likely to lead to
better generalization
(figures: good generalization vs. overfitting)
Overfitting, underfitting
Overfitting
Neural network perfectly learns the training data but gives poor results on test
data
Underfitting
Neural network is unable to properly learn the data due to insufficient number of
neurons or due to extreme regularization
Such network also generalizes poorly
(figures: underfitting vs. overfitting)
Improving generalization
How to prevent underfitting
1. Provide enough hidden units to represent the required mappings
2. Train the network for long enough so that the sum squared error cost
function is sufficiently minimised
How to prevent overfitting
3. Design the training data properly use large training set
4. Cross-validation check generalization ability on test data
5. Early stopping before NN had time to learn the training data too well
6. Restrict the number of adjustable parameters the network has
a) Reduce the number of hidden units, or
b) Force connections to share the same weight values
7. Add regularization term to the error function to encourage smoother
network mappings
8. Add noise to the training patterns to smear out the data points
Cross-validation
Cross-validation is used to estimate generalization error
based on resampling
Available data are randomly partitioned into a
Training set, and
Test set
Training set is further partitioned into
Estimation subset, used to train the model
Validation subset, used to validate the model
Training set is used to build and validate various candidate models and to
choose the best one
Generalization performance of the selected model is tested on
the test set which is different from the validation subset
Variants of cross-validation
If only a small set of data exists ...
Multifold cross-validation
Divide available N samples into K subsets
Model is trained on all subsets except one
Validation error is measured on the subset left out
Procedure is repeated K-times
Model performance is obtained by averaging K trials
Leave-one-out cross-validation
Extreme form of cross-validation
N-1 samples are used for training
Model is validated on the sample left out
Procedure is repeated N-times
Result is averaged over N-trials
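The multifold split can be sketched as follows (leave-one-out is the special case k = n):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, val_idx) index pairs for multifold cross-validation."""
    idx = list(range(n_samples))
    fold = n_samples // k
    for i in range(k):
        lo = i * fold
        hi = n_samples if i == k - 1 else lo + fold  # last fold takes the rest
        val = idx[lo:hi]
        train = idx[:lo] + idx[hi:]
        yield train, val
```

The model is trained on `train`, validated on `val`, and the k validation errors are averaged.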
Early stopping
Neural networks are often set up with more than enough parameters which
can cause over-fitting
For the iterative gradient descent based network training procedures
(backpropagation, conjugate gradients, ...), the training set error will
naturally decrease with increasing numbers of epochs of training
The error on the unseen validation and testing data sets, however, will start
off decreasing as the under-fitting is reduced, but then it will eventually
begin to increase again as over-fitting occurs
The natural solution to get the best
generalization, i.e. the lowest error
on the test set, is to use the
procedure of early stopping
Early stopping procedure
How to perform learning with early stopping?
1. Divide the training data into estimation and validation subsets
2. Use a large number of hidden units
3. Use very small random initial values
4. Use a slow learning rate
5. Compute the validation error rate periodically during training
6. Stop training when the validation error rate starts increasing
Since validation error is not a good estimate of the generalization
error, a third test set must be applied to estimate generalization
performance
Available data are divided as in cross-validation
Training set
Estimation subset
Validation subset
Test set
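The procedure can be sketched as a training-loop skeleton (`step` and `val_error` are placeholder callbacks for one training epoch and the validation-error computation; this variant also keeps the best weights seen so far):

```python
def train_with_early_stopping(step, val_error, max_epochs, patience=10):
    """Early-stopping skeleton.
    `step()` performs one training epoch and returns the current weights;
    `val_error(w)` returns the validation error for weights w."""
    best_w, best_err, since_best = None, float('inf'), 0
    for epoch in range(max_epochs):
        w = step()
        err = val_error(w)
        if err < best_err:                  # keep the best weights seen so far
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:      # validation error keeps rising
                break
    return best_w, best_err
```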
Practical considerations of early stopping
One potential problem: validation error may go up and down
numerous times during training → the safest approach is generally
to train to convergence, saving the weights at each epoch, and then
go back to the weights at the epoch with the lowest validation error
Early stopping resembles regularization with weight decay, which
indicates that it will work best if the training starts with very small
random initial weights
General practical problems
How to best split available training data into training and validation subsets?
What fraction of the patterns should be in the validation set?
Should the data be split randomly, or by some systematic algorithm?
Such issues are problem dependent ...
Default Matlab parameters (train, validation, test): 70%, 15%, 15%
Weight restriction and weight sharing
Perhaps the most obvious way to prevent over-fitting in neural
networks is to restrict the number of free parameters
The simplest solution is to restrict the number of hidden units, as this
will automatically reduce the number of weights. Optimal number for
a given problem can be determined by cross-validation.
Alternative solution is to have many weights in the network, but
constrain certain groups of them to be equal
a) If there are symmetries in the problem, we can enforce hard weight sharing by
building them into the network in advance
b) In other problems we can use soft weight sharing where sets of weights are
encouraged to have similar values by the learning algorithm
one way to implement soft weight sharing is to add an appropriate term to
the error function regularization
Regularization
Regularization technique encourages smoother network mappings
by adding a penalty term to the standard (sum-squared-error) cost
function
where the regularization parameter λ controls the trade-off between
reducing the error E_sse and increasing the smoothing
This modifies the gradient descent weight updates
The resulting neural network mapping is a compromise between
fitting the data and minimizing the regularizer
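The penalty term described above is commonly written as follows (notation reconstructed in standard form, not taken from the slide; Ω denotes the regularizer):

```latex
E = E_{\mathrm{sse}} + \lambda\,\Omega(\mathbf{w})
```

where λ is the regularization parameter and Ω(w) measures the roughness of the mapping.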
Regularization by weight decay
One of the simplest forms of regularizer is called weight decay and
consists of the sum of squares of the network weights
In conventional curve fitting this regularizer is known as ridge
regression. We can see why it is called weight decay when we
observe the extra term in the weight updates
In each epoch the weights decay in proportion to their size
Empirically, this leads to significant improvements in generalization.
Weight decay keeps the weights small and hence the mappings are
smooth
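The weight-decay regularizer and the resulting gradient-descent update can be written as follows (reconstructed in standard notation, with learning rate η):

```latex
\Omega(\mathbf{w}) = \tfrac{1}{2}\sum_i w_i^2,
\qquad
\Delta w_i = -\eta\,\frac{\partial E_{\mathrm{sse}}}{\partial w_i} - \eta\lambda\,w_i
```

so in each epoch the weights shrink by a factor (1 − ηλ), i.e. they decay in proportion to their size, before the error-gradient contribution is applied.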
Training with noise / Jittering
Adding noise (jitter) to the inputs during training was also found
empirically to improve network generalization
Noise will smear out each data point and make it difficult for the
network to fit the individual data points precisely, and consequently
reduce over-fitting
Jittering is accomplished by generating new inputs from the original
inputs plus small amounts of noise. Adding jitter to the targets will not
change the optimal weights; it will just slow down training.
Jittering is also closely related to regularization methods such as
weight decay and ridge regression
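Jittering can be sketched as follows (the noise level σ and number of copies are illustrative choices, not prescribed values):

```python
import random

def jitter(inputs, sigma=0.05, n_copies=5, seed=0):
    """Generate new training inputs by adding small Gaussian noise (jitter)
    to each original input vector."""
    rng = random.Random(seed)
    jittered = []
    for _ in range(n_copies):
        for x in inputs:
            jittered.append([v + rng.gauss(0.0, sigma) for v in x])
    return jittered
```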
Generalization summary
Preventing underfitting
1. Provide enough hidden units
2. Train the network for long enough
Preventing overfitting
3. Design the training data properly
4. Cross-validation
5. Early stopping
6. Restrict the number of adjustable parameters
7. Regularization
8. Jittering
8.8 General guidelines
General guidelines for designing successful neural
network solutions:
1. Understand and specify your problem
2. Acquire and analyze data, define inputs and outputs, remove outliers, apply
preprocessing methods (rescale, standardize, normalize), properly encode
outputs, ...
3. Acquire prior knowledge and apply it in terms of feature selection, feature
extraction, selection of neural network type, neural network complexity, etc.
4. Start with simple neural network architectures few layers, few neurons
5. Train the network and make sure it performs well on its training data.
If this doesn't work, increase the complexity of the network.
6. Test its generalization by checking its performance on new test data.
If this doesn't work, check your data, check partitioning of data into train/test
sets, check and modify network architecture, ...