
Part II

Neural Networks: a short introduction

Some biological remarks

` The human brain (whose processing speed is around 100 Hz) is able to make logical inferences, but computers (processing speed around 10⁹ Hz) easily outperform it

` Conversely, the human brain can manage complex tasks (throwing a ball into a basket by trial and error, controlling an unknown machine, recognizing a face in a crowd) that are almost impossible for computers

` The reason lies in the huge number of processing units (10¹⁰ neurons) in the brain and their massive interconnection (10⁴ synapses per neuron)

` In fact, most instinctive activities of our brain are similar to approximating complex functions and do not involve logical inference

Artificial Neural Networks (ANNs)

` Neural networks are parallel processing structures composed of many elementary units that reproduce non-linear relationships learned from examples

` Neural networks are freely inspired by biological concepts

` ANNs are a wide class of logical structures, usually built as computer code

` They perform complex tasks, such as approximation of functions, associative data retrieval, and classification

` Several ANNs can be considered as non-linear and non-parametric multivariate regression methods

ANNs: structure and glossary

` ANNs are composed of units (or neurons), organized in different ways and connected by weighted connections

` The weights contain the stored information

` Each unit performs a non-linear transformation of the sum of its weighted input signals and propagates it to the inputs of the connected units

` The weight values are adjusted by the learning procedure, performed


on a set of examples named Training Set (TS)

` Once trained, ANNs can produce the correct outputs for inputs not
previously included in the TS (generalization)

` ANNs can be supervised (learn input-output relationships) or unsupervised (discover features hidden in the TS)

A generic feed-forward neural network
[Figure: a generic feed-forward network with input units, weighted connections, and output units; the information goes from the input units to the output units]

Some historical remarks

` McCulloch and Pitts (1943) proposed a neuron model composed of binary threshold devices and stochastic algorithms
` Rosenblatt (1958) devised a class of binary linear machines named
Perceptrons
` Minsky and Papert (1969) criticized Perceptrons, demonstrating that their learning capabilities were limited to applications with linearly separable patterns
` Rumelhart, Hinton, and Williams (1980s) proposed the multi-layer perceptron (MLP) with the backpropagation learning algorithm, thereby introducing the modern feed-forward ANNs
` Two other important neural networks are:
– Kohonen's Self-Organizing Map (1981) for data projection and classification
– the Hopfield network (1982) for optimization purposes

A practical classification of the main types of ANNs

Feed-forward networks
– Distributed knowledge
• Multi-Layer Perceptron (MLP)
• Recurrent Neural Networks (RNN)
– Kernel-based
• Radial Basis Function network (RBF)

Vector quantization networks
– Kohonen's Self-Organizing Map (SOM)

Some details on feed-forward neural networks
` A feed-forward ANN is a
supervised network
organized in layers.
` It can have any number of:
• layers
• units per layer
• network inputs
• network outputs
` Hidden layers are the layers
interposed between the
input and output layer
` The information always flows from the input to the output layer

Multi-Layer Perceptron

` The most popular neural network

` Feed-forward network with multiple hidden layers

` It uses non-linear smooth transfer functions (logistic, hyperbolic


tangent)

` The weights and biases are adjusted in order to minimize the


approximation error (difference between target and output)

` The MLP is trained with gradient-descent algorithms, the first of which was the error backpropagation algorithm proposed by Werbos in 1974

MLP structure

[Figure: MLP structure - input units (placeholders), hidden layers (non-linear units), output layer (linear or non-linear units)]

A non-linear unit

y = f(wx + b)

[Figure: a single unit forms the weighted sum wx + b and applies the transfer function f]

x: input
y: output
w: weight
b: bias
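As a minimal sketch (not part of the original slides), the unit above can be evaluated directly in MATLAB; the weight, bias, and input values are arbitrary:

f = @(n) 2./(1 + exp(-2*n)) - 1;   % tan-sigmoid transfer function
w = 0.8;  b = -0.3;                % weight and bias (arbitrary values)
x = 1.5;                           % input signal
y = f(w*x + b)                     % unit output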

Transfer functions

[Figure: plots of the log-sigmoid, tan-sigmoid, and linear transfer functions]
Tan-sigmoid

f(n) = 2 / (1 + e^(−2n)) − 1

Log-sigmoid

f(n) = 1 / (1 + e^(−n))

Effect of the weight (tan-sigmoid): y = f(wx + b)

[Figure: output curves for bias = 0 and weight varying from −2 to 2]

Effect of the bias (tan-sigmoid): y = f(wx + b)

[Figure: output curves for weight = 1 and bias varying from 2 to −2]
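A short MATLAB sketch (added here for illustration, not taken from the slides) that reproduces this kind of plot, sweeping first the weight and then the bias of a tan-sigmoid unit:

f = @(n) 2./(1 + exp(-2*n)) - 1;          % tan-sigmoid
x = -5:0.01:5;
subplot(1,2,1), hold on, grid on           % left panel: bias = 0, weight varies
for w = -2:0.5:2
    plot(x, f(w*x + 0));
end
title('bias = 0, weight from -2 to 2')
subplot(1,2,2), hold on, grid on           % right panel: weight = 1, bias varies
for b = 2:-0.5:-2
    plot(x, f(1*x + b));
end
title('weight = 1, bias from 2 to -2')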

MLP as a general function approximator

` The transfer function of a MLP can be viewed as a non-linear combination of non-linear functions of the inputs

` In the particular case of only one hidden layer and linear outputs, it performs a linear combination of non-linear functions

` Cybenko's theorem states the conditions for the approximation of continuous functions

` A MLP with biases, one sigmoidal hidden layer, and a linear output layer is capable of approximating any function with a finite number of discontinuities to any desired precision

Cybenko’s
Cybenko s theorem (1989)

Any continuous function of n variables F(x1, x2, …, xn) can be represented in the form:

F(x1, x2, …, xn) = Σ_{j=1..m} g_j ( Σ_{i=1..n} h_ij(x_i) )

where g_j and h_ij are continuous functions and h_ij does not depend on the function F.

The theorem states that: A feed-forward neural network with one internal layer and an arbitrary continuous sigmoidal function can approximate continuous functions with arbitrary precision, provided that the number of hidden units m is sufficiently large.

MLP and its approximating function

MLP 3-4-1 with log-sigmoidal hidden layer and linear output layer

[Figure: MLP 3-4-1 with inputs x, hidden weights w, output weights u, and output y]

y = Σ_{j=1..4} u_j / ( 1 + exp( −( Σ_{i=1..3} w_ij x_i + b_j ) ) ) + b
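The expression above can be evaluated directly; the following sketch (purely illustrative, with random weights) computes the output of a 3-4-1 MLP with log-sigmoid hidden units and a linear output unit:

x  = rand(3,1);                   % the 3 network inputs
W  = randn(4,3);                  % hidden layer weights w_ij (4 units x 3 inputs)
bh = randn(4,1);                  % hidden layer biases b_j
u  = randn(1,4);                  % output layer weights u_j
b  = randn;                       % output bias
h  = 1./(1 + exp(-(W*x + bh)));   % log-sigmoid outputs of the hidden layer
y  = u*h + b                      % linear output unit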

Example of a random 3D surface (MLP 2-3-1)

These examples are obtained by giving random values to the weights of the hidden units of a MLP with 2 inputs and 1 output

Example of a random 3D surface (MLP 2-10-1)

Example of a random 3D surface (MLP 2-80-1)

The complexity of the surface increases with the number of hidden units
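A possible way to generate surfaces of this kind (the plotting grid and the number of hidden units K are arbitrary choices):

K = 10;                                        % hidden units, e.g. MLP 2-10-1
W1 = randn(K,2);  b1 = randn(K,1);             % random hidden layer weights and biases
W2 = randn(1,K);  b2 = randn;                  % random output layer weights and bias
[X1,X2] = meshgrid(-2:0.1:2);                  % grid of the two inputs
X = [X1(:)'; X2(:)'];                          % 2 x N input matrix
H = 1./(1 + exp(-(W1*X + b1*ones(1,size(X,2)))));  % log-sigmoid hidden outputs
Y = reshape(W2*H + b2, size(X1));              % network output over the grid
surf(X1, X2, Y)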

MLP design issues

` Network size and structure

` Selection of the training set

` Normalization of the input and output data

` Choice of the learning algorithm

` Generalization

` Overfitting

Network size and structure

` There is no formalized theory available for the design of a MLP

` Often one proceeds by trial and error
` How many hidden units?
A rule of thumb states that: N hidden = N input × N output
` How many hidden layers?
It seems there is no difference, provided that there are enough hidden
units
` Tan-sigmoid or log-sigmoid transfer functions?
It seems there is no difference
` The output layer is often composed by linear units
` Special features are sometimes obtained by 'pruning' the connections (not fully connected MLPs)

Selection of the training set

` The training set (TS) is composed of input-output pairs extracted from the function to approximate

` The TS must cover the whole domain of interest, since the MLP does not have extrapolation capabilities

` According to the usual statistical criteria, the number of samples must be much greater than the number of parameters (units) of the network

Curve fitting of y = 3x² + 2, from x = 3 to x = 6

MLP 1-8-1 with tansig hidden layer and linear output unit

[Figure: the original function and the ANN's output, with the range of the training set marked]

The error increases outside the range of the training set
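A possible way to reproduce this experiment; the fitnet/train calls below assume the MATLAB Neural Network Toolbox (older releases used newff for the same purpose):

x = 3:0.05:6;  t = 3*x.^2 + 2;          % training samples of y = 3x^2 + 2 on [3, 6]
net = fitnet(8);                        % MLP 1-8-1, tansig hidden layer, linear output
net = train(net, x, t);                 % gradient-based training
xt = 0:0.05:9;                          % test range wider than the training range
plot(xt, 3*xt.^2 + 2, 'b', xt, net(xt), 'r'), grid
legend('original function','ANN output')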


Normalization of the input and output data

` The log-sigmoid output function exists in the range [0 ÷ 1], that of the tan-sigmoid in the range [-1 ÷ 1]

` The output data must be accordingly normalized

` The learning algorithm achieves better performance if the normalization range is reduced by 10% on each side ([0.1 ÷ 0.9] and [-0.9 ÷ 0.9])

` With linear output units, the normalization of the output data is not
required

` Large input values (positive or negative) make the learning algorithm work with low derivatives of the transfer function and slow down the process. So it is advisable to normalize the input data as well
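A minimal sketch of this min-max normalization (the data matrix is illustrative; the Neural Network Toolbox function mapminmax serves the same purpose):

P = 100*rand(3,50);                           % example data: 3 variables x 50 samples
Pmin = min(P,[],2);  Pmax = max(P,[],2);      % per-variable minima and maxima
span = (Pmax - Pmin)*ones(1,size(P,2));
Pn = 0.1 + 0.8*(P - Pmin*ones(1,size(P,2)))./span;    % data normalized into [0.1 ÷ 0.9]
P0 = Pmin*ones(1,size(P,2)) + (Pn - 0.1)/0.8.*span;   % inverse transformation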

Learning algorithms: the backpropagation

` The backpropagation algorithm overcame the limits of the Perceptron, allowing the training of multi-layer networks with non-linear transfer functions

` The error computed on the output layer is (back-)propagated to all previous layers and the weights are accordingly modified with:

Δw_ij(t) = −η ∂E/∂w_ij

where the term η is the learning rate

` The procedure is repeated until the overall quadratic error (squared difference between targets and outputs) falls below a user-defined value (goal)

Backpropagation training procedure

` Weights and biases are randomly initialized before the training session. Random weight initialization is the most popular method
` Each learning cycle is composed of three steps:
1. present the input vector of the sample to the network and calculate the output vector (forward step)
2. propagate the error backward from the output layer (backward step)
3. change the weights of each connection to reduce the output error attributable to that connection (adjusting step)

` When these three steps have been performed for the entire TS,
one epoch has occurred
` The goal is to converge to a near-optimal solution based on the overall squared error
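A compact sketch of this cycle for a MLP with one tansig hidden layer and a linear output (toy data and arbitrary hyperparameters; this is not the code used for the examples in these slides):

x = 0:0.02:1;  t = x.^2 + 1;  N = numel(x);   % toy training set
K = 6;  eta = 0.1;  goal = 1e-4;  maxEpochs = 20000;
W1 = 0.5*randn(K,1);  b1 = 0.5*randn(K,1);    % hidden layer (tansig units)
W2 = 0.5*randn(1,K);  b2 = 0.5*randn;         % output layer (linear unit)
for epoch = 1:maxEpochs
    % 1. forward step
    A1 = 2./(1 + exp(-2*(W1*x + b1*ones(1,N)))) - 1;   % tansig hidden outputs
    y  = W2*A1 + b2;                                   % linear output
    e  = t - y;                                        % output error
    if mean(e.^2) < goal, break, end                   % stop when the goal is reached
    % 2. backward step: propagate the error through the layers
    d2 = -2*e/N;                      % derivative of the mean squared error w.r.t. y
    d1 = (W2'*d2).*(1 - A1.^2);       % error back-propagated through the tansig units
    % 3. adjusting step: gradient descent on weights and biases
    W2 = W2 - eta*(d2*A1');   b2 = b2 - eta*sum(d2);
    W1 = W1 - eta*(d1*x');    b1 = b1 - eta*sum(d1,2);
end
plot(x, t, 'b', x, y, 'r'), grid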

Backpropagation learning cycle

Learning algorithms: speed up the backpropagation

` The original backpropagation algorithm requires many epochs to reach


the global minimum

` To reduce the learning time several improvements have been devised.


Some of these are:

ƒ Backpropagation improvements (improve the gradient descent):

– Backpropagation with adaptive learning rate

– Backpropagation with momentum

ƒ Block methods (use a matricial approach):

– Levenberg-Marquardt method

– Levenberg-Marquardt method with Bayesian regularization

Backpropagation with adaptive learning rate

` The performance of the gradient descent algorithm is improved if the learning rate η can change during the learning process:

Δw_ij(t) = −η(t) ∂E/∂w_ij
` The adaptive learning rate attempts to use a step size as large as
possible while keeping learning stable

` In this way the learning rate adapts locally to the complexity of the error surface
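A schematic sketch of the adaptation mechanism on a one-dimensional error surface (the increase/decrease factors are illustrative; MATLAB's traingda follows a similar logic):

E = @(w) (w - 3).^2;  dE = @(w) 2*(w - 3);   % toy error surface and its gradient
w = -5;  eta = 0.1;  Eold = E(w);
for it = 1:200
    wNew = w - eta*dE(w);             % tentative gradient-descent step
    if E(wNew) > 1.04*Eold            % the error grew too much:
        eta = 0.7*eta;                %   reduce the learning rate and discard the step
    else
        if E(wNew) < Eold, eta = 1.05*eta; end   % error decreased: enlarge the step
        w = wNew;  Eold = E(w);       % accept the step
    end
end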

Backpropagation with momentum

` This method uses the weight change Δw of the previous cycle as a parameter for the computation of the new weight change:

Δw_ij(t + 1) = −η ∂E/∂w_ij + α Δw_ij(t)

` The purpose is to give a "momentum" to the search for the


minimum

` This helps prevent the gradient descent algorithm from getting stuck in local minima
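The same one-dimensional toy problem with a momentum term (η and α are arbitrary values):

dE = @(w) 2*(w - 3);                  % gradient of E(w) = (w - 3)^2
w = -5;  eta = 0.05;  alpha = 0.9;  dw = 0;
for it = 1:200
    dw = -eta*dE(w) + alpha*dw;       % the new step keeps a fraction of the previous one
    w  = w + dw;                      % weight update
end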

Levenberg-Marquardt algorithm and Bayesian regularization

` The Levenberg-Marquardt algorithm is designed to approach


second-order learning speed without computing the Hessian matrix

` When the performance function has the form of a sum of squares (as in the MLP learning algorithm), the Jacobian matrix (first-order derivatives of the errors with respect to the weights and biases) can be used to approximate the Hessian matrix

` The Bayesian regularization method, coupled with Levenberg-Marquardt, minimizes a linear combination of squared errors and weights and improves the generalization capabilities of the network
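A sketch of the core Levenberg-Marquardt step on a one-parameter least-squares problem (in a full implementation the damping parameter μ is adapted at every iteration; here it is kept fixed for brevity):

x = (0:0.1:1)';  y = exp(1.7*x);      % data generated by y = exp(a*x) with a = 1.7
a = 1.0;  mu = 0.01;                  % initial guess and damping parameter
for it = 1:20
    e = y - exp(a*x);                 % residual vector
    J = -x.*exp(a*x);                 % Jacobian of the residuals, de/da
    da = -(J'*J + mu) \ (J'*e);       % LM step: (J'J + mu*I) da = -J'e
    a = a + da;                       % parameter update
end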

Learning algorithms: Cover’s theorem and Linear Separation

` The Cover’s theorem (1965) states that: A complex pattern-classification


problem,, cast in a high-dimensional
p g space
p nonlinearly,
y, is more likelyy to be
linearly separable than in a low-dimensional space, provided that the space is
not densely populated

` This theorem is the basis of a very fast learning method, applicable to MLPs
with one hidden layer and linear output layer

` Known since the beginning of neural networks, this method was from time to
time “rediscovered” with different names: Random Activation Weight Neural Net
(RAWN) by Te Braake and Van Straten (1994), Extreme Learning Machine
(ELM) by Huang Guang-Bin (2005)

` These methods, as well as the Support Vector Machines (SVM), exploit the so-called 'kernel trick': transform the input space into a higher dimensional space spanned by the hidden units, where the TS data can (hopefully) be fitted by a linear combination

Cover’s theorem: an application example

[Figure: four panels - points in the 2D original space; projection into the 3D space; linear separation plane; points separated in the original space]

Training the MLP with the kernel trick
` The kernel trick (from Cover's theorem) transforms a non-linear relationship into a linear one, which can be fitted by a linear method

` A MLP with one non-linear hidden layer (logistic, tansig, gaussian) and a linear output layer can be quickly trained using the kernel trick

` Unlike the traditional MLP, the hidden layer weights and biases are
randomly assigned and the output layer weights are found by linear
regression

` In this way there is no iterative learning: the weights are found using
matrix inversion

` The K non-linear mapping functions (units of the hidden layer) which span the high-dimensional space are called 'kernels'

Training the MLP with the kernel trick …

` If K is sufficiently large, any non-linear relationship can be adequately fitted as a linear combination of non-linear functions

` However, this method requires many more hidden units than the classic gradient-descent methods

` Linear separation is effective for fast curve fitting but less accurate in
generalization than the traditional MLP

` Let’s see a simple example in MATLAB:

Kernel trick: a simple example (1)

% non-linear relationship from x to y

x=0:.001:1; % (1 x 1001)
y=x.^2+1; % (1 x 1001)

plot(x,y), grid

This relationship exists in the


bidimensional space and cannot be
directly approximated by linear
methods

Kernel trick: a simple example (2)

Let’s use the kernel trick with K = 2 (MLP 1-2-1). Probably the number
off kernels
k l is
i ttoo llow:

K = 2;
% random choice of the weights and biases of the first layer
W1 = randn(K,size(x,1));   % K x 1 hidden layer input weight matrix
bias = randn(size(W1));    % K x 1 biases
W1 = W1 + bias;            % K x 1 weights plus biases

% the inputs are mapped by the logistic function into the new space
H = 1./(1 + exp(-W1*x));   % K x 1001 hidden layer input/output matrix

Kernel trick: a simple example (3)
Finally, let’s compute the estimated outputs by linear regression and
plot
p ot tthe
e data
data:

W2 = y*pinv(H);   % 1 x K output layer weights
ye = W2*H;        % 1 x 1001 estimated output vector
plot(x,y), grid, hold on
plot(x,ye,'r');

Nothing happened! The nearly linear approximation (red line) is very different from the original relationship (blue line)

Kernel trick: a simple example (4)
Let’s use now a higher number of kernels. The approximation with K = 3 is
now qquite satisfactory.
y Setting
g K = 6 the fitting
g is very
yggood. The same
results can be obtained using a different transfer function in the hidden
layer (tansig or gaussian)

K=3 K=6

Kernel trick: a simple example (5)
The same example using Levenberg-Marquardt learning:

[Figure: fits obtained with MLP 1-2-1 and MLP 1-3-1]


The same fitting quality is obtained with fewer units, but learning
takes much longer
Learning equations using the Kernel trick
` MLP with a single hidden layer:

y_o = Σ_{j=1..N_hidden} w_j,o · f_j( Σ_{k=1..N_input} w_k,j x_k + b_j )
` Procedure
– Randomly assign the weights w_k,j and biases b_j of the hidden layer
– Present the input matrix X (NS x (NI+1)) and compute the output matrix of the hidden layer H (NS x NH)
– Compute the weights w_j,o of the output layer by:

W = H† Y

where W (NH x NO) is the weight matrix of the output layer, Y (NS x NO) is the TS output matrix and H† (NH x NS) is the Moore-Penrose pseudoinverse of H
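A sketch of the procedure for a generic two-input data set; to stay consistent with the earlier MATLAB examples, samples are stored as columns rather than rows, so the output weights are obtained as W = Y H† instead of H† Y:

NS = 500;  X = 4*rand(2,NS) - 2;                % 2 x NS input matrix
Y = X(1,:).^2 - X(2,:).^2 + 0.1*randn(1,NS);    % 1 x NS target vector (toy surface)
NH = 40;                                        % number of hidden kernels
W1 = randn(NH,2);  b1 = randn(NH,1);            % random hidden weights and biases
H  = 1./(1 + exp(-(W1*X + b1*ones(1,NS))));     % NH x NS hidden layer output matrix
W2 = Y*pinv(H);                                 % 1 x NH output weights, by pseudoinverse
Ye = W2*H;                                      % fitted outputs
rmse = sqrt(mean((Y - Ye).^2))                  % fitting error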
Comparison between Linear separation and Levenberg-Marquardt
methods

MATLAB Peaks function

[Figure: fits of the Peaks function obtained with the Levenberg-Marquardt and Linear separation methods]

MATLAB Peaks function

Fitting with MLP 2-60-1 trained with Levenberg-Marquardt method

180 weights, learning time 338 s, MAPE = 330 %
Fitting with MLP 2-120-1 trained with Linear separation method

360 weights, learning time 5 s, MAPE = 614 %
Generalization

` Cybenko’s
Cybenko s theorem ensures the approximation of the training samples
but does not tell anything about the generalization capabilities of the
trained network

` Generalization is the ability to learn the “true” input-output relationship,


hidden in the training samples

` Training samples may be affected by noise or some regressors may be


unavailable

` So, fitting the training data too accurately could result in poor generalization (overfitting)

Overfitting

` Overfitting is caused by the use of too many units compared with the "true" unknown model

` The network reproduces the noisy samples of the training data set
instead of the true relationship

` Practically, overfitting occurs when the number of training data is small compared to the network size (overparameterization): the MLP simply copies the training data

` Overfitting is avoided if the network size fits the actual relationship to be approximated

Overfitting

o samples
— actual
— MLP

The approximation tracks the samples of the training set
Correct fitting

o samples
— actual
— MLP

The approximation fits the true unknown curve
How to avoid overfitting

` Ideally the MLP should reproduce as much as possible the “true”


relationship

` Because it is difficult to know beforehand how large a network should be for a specific application, there are two methods to avoid overfitting:
– Early stopping (cross-validation)
The training samples are divided into a Training set (actual training) and a Validation set (check of the learning quality). When the approximation error decreases on the TS but increases on the VS, the training is stopped (see the sketch after this list)
– Regularization
Overfitting occurs because the network attempts to track down every
single data point of the TS. The learning cost function is then modified
by adding a penalty term (regularization function) for the complexity of
the model (second derivatives increase)
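A sketch of early stopping with the Neural Network Toolbox (the fitnet/train interface and the divideParam fields are assumptions about the toolbox in use; the validation check is performed automatically during training):

x = 0:0.005:1;  t = x.^2 + 1 + 0.05*randn(size(x));   % noisy training samples
net = fitnet(20);                          % deliberately oversized MLP 1-20-1
net.divideParam.trainRatio = 0.70;         % training set (weight updates)
net.divideParam.valRatio   = 0.15;         % validation set (early stopping check)
net.divideParam.testRatio  = 0.15;         % test set (final assessment only)
[net, tr] = train(net, x, t);              % training stops when the validation error rises
plot(x, t, '.', x, net(x), 'r'), grid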

Kernel-based neural networks

` Kernel-based networks are supervised networks with three layers:
– the units of the hidden layer (kernels) form a vector quantization of the input space
– the input layer evaluates some measure of the distance between the input vector and the kernels
– the output layer employs the outputs of the hidden layer to approximate the relationship as a linear combination of non-linear functions

` The information is localized: each hidden unit corresponds to a receptive zone in the input space

` Kernel-based networks allow knowledge removal and incremental learning

A kernel-based neural network

[Figure: a kernel-based network - input layer (distance measure), hidden layer (kernel units), linear output layer; the kernels correspond to receptive zones in the input space]
Radial basis function networks

` A RBF is composed of:
– an input layer that computes the Euclidean distance between the input vector and the weight vectors of the hidden layer
– a hidden layer composed of units with Gaussian transfer function (radial bases), whose weight vectors form a vector quantization of the input space
– a linear output layer

` It performs a linear combination of non-linear functions of the input values

` The learning process is divided in two parts:
– first, the weight vectors of the hidden layer (centroids of the radial bases) are found by clustering techniques or randomly assigned in the input space. Alternatively, an optimal method can be employed (orthogonal least squares algorithm)
– afterwards, the weights of the output layer are computed by linear regression (Linear separation method)
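A minimal sketch of this two-step learning on a one-dimensional toy problem (the centroids are placed on a grid for simplicity, where a clustering algorithm could be used instead, and the width σ of the bases is arbitrary):

x = 0:0.01:1;  y = sin(2*pi*x);  NS = numel(x);      % toy input-output data
NC = 8;  sigma = 0.15;                               % number of bases and their width
c = linspace(0, 1, NC)';                             % centroids of the radial bases
D = abs(c*ones(1,NS) - ones(NC,1)*x);                % NC x NS distances |x - c|
H = exp(-(D/sigma).^2);                              % Gaussian radial bases
W = y*pinv(H);                                       % output weights by linear regression
ye = W*H;                                            % network output
plot(x, y, 'b', x, ye, 'r'), grid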

RBF network

[Figure: RBF network - input units (distance measure), hidden layer of radial bases, linear output layer; input x, output y]

A radial basis unit

[Figure: a radial basis unit computes the distance D(x,w) between the input vector and its weight vector, then applies the transfer function f]

x: input vector
y: output value
w: weight vector

Gaussian

f(n) = e^(−n²)

Some remarks on RBF networks

` Fast learning (if clustering techniques and the linear separation method are employed)

` Localization of the knowledge in the hidden layer (if clustering is used to find the centers of the radial bases)

` Overfitting can be avoided (fast design of the network by trial and error)

` With the usual learning methods (clustering and linear separation), the quality of the approximation is lower with respect to the MLP

Comparison between RBF network and MLP trained with LS

MATLAB Peaks function

[Figure: fits of the Peaks function obtained with the RBF network and with the MLP trained by linear separation]

Fitting with RBF 2-120-1

360 weights, learning time 8 s, MAPE = 1210 %
Fitting with MLP 2-120-1
trained with Linear separation method

360 weights, learning time 5 s, MAPE = 614 %
Vector quantization methods

` Unsupervised/supervised networks that project data from a high-dimensional space onto a bidimensional map, preserving the original topology as much as possible

` They allow the visualization of a multi-dimensional data set for a better understanding

` Also used as a clustering method

` The main VQ networks are:
– Kohonen's Self-Organizing Map (SOM) (unsupervised)
– Curvilinear Component Analysis (CCA) (unsupervised)
– Learning Vector Quantization (LVQ) method (supervised)

Self-Organizing Map (SOM)

` Kohonen’s SOM is a rectangular grid of competitive units with the


following properties:
– each unit has a fixed position in the grid
– each unit is associated to a weight vector of N components, where N is the
di
dimension
i off ththe iinputt space

` The learning procedure is:
– the weights are randomly initialized
– each TS sample is presented as input
– the units compete and the unit closest to the input vector is the winning unit
– the weights of the winning unit and those of the adjacent units (neighborhood) are adjusted to get closer to the input vector
– the procedure is repeated until the map becomes stable
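A from-scratch sketch of this procedure for a small 6 x 6 map with two-dimensional inputs (the neighborhood radius and learning rate schedules are arbitrary):

X = rand(2,500);                               % training samples (2 x NS)
gx = 6;  gy = 6;  M = gx*gy;                   % 6 x 6 grid of units
[gi, gj] = ndgrid(1:gx, 1:gy);
pos = [gi(:) gj(:)]';                          % 2 x M grid positions of the units
W = rand(2,M);                                 % random initial weight vectors
for epoch = 1:20
    r  = max(3*(1 - epoch/20), 0.5);           % shrinking neighborhood radius
    lr = 0.5*(1 - epoch/20) + 0.01;            % decreasing learning rate
    for k = randperm(size(X,2))
        x = X(:,k);                            % present one TS sample
        [~, win] = min(sum((W - x*ones(1,M)).^2));     % winning unit (closest weights)
        d2 = sum((pos - pos(:,win)*ones(1,M)).^2);     % grid distance to the winner
        h  = exp(-d2/(2*r^2));                         % neighborhood function
        W  = W + lr*(x*ones(1,M) - W).*(ones(2,1)*h);  % move winner and neighbors
    end
end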

Structure of the SOM

[Figure: SOM grid showing the winning unit, its neighbor units, and the input vector]

Properties of the SOM

` After the learning session, the units are organized in a smooth


projection of the input data (codebook)

` The original topology is preserved: separate receptive zones (bubbles)


classify similar samples

` Amongst the bubbles there are smooth areas composed of units that do not win the competition for the training data (dead units) but could win for new similar data

` This smoothing property makes the method not suited for clear-cut
separation of the clusters but very useful for data visualization

An example: classification of daily load profiles

– 105 AEM daily load profiles (January – April 1995)

– 105 x 24 input data matrix

– A map with 16 x 12 units with 24-dimension weight vectors has been


chosen

– The units of the map are randomly initialized and after the training session become a smooth topology-preserving projection of the load profiles

– A k-means clustering algorithm can be applied to partition the


codebook’s units in five clusters

– The map shows that holidays and workdays are classified in well-separated clusters
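A possible way to reproduce this kind of experiment; selforgmap/train assume the Neural Network Toolbox, kmeans assumes the Statistics Toolbox, and the data matrix below is random stand-in data, not the actual AEM profiles:

profiles = rand(105, 24);                    % illustrative stand-in for the 105 x 24 load data
net = selforgmap([16 12]);                   % 16 x 12 map
net = train(net, profiles');                 % SOM training (samples as columns)
codebook = net.IW{1};                        % 192 x 24 codebook, one weight vector per unit
idx  = kmeans(codebook, 5);                  % partition the codebook units in five clusters
hits = sum(net(profiles'), 2);               % number of samples classified by each unit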

16 x 12 codebook before training (random initialization)

16 x 12 map codebook after training

Plot of a single unit of the codebook

Load profiles classified by a map unit (“hits”)

Partitioned codebook with the “hits” of the training data

[Figure: partitioned codebook, with separate holiday and weekday regions. Data hits are the white circles, with radius proportional to the number of samples classified by that unit]

Codebook profiles superimposed on the clustered map

[Figure: codebook profiles superimposed on the clustered map (holiday and weekday regions), with a group of weekday samples classified by the same unit highlighted]

