
CSC445: Neural Networks

CHAPTER 04

MULTILAYER PERCEPTRONS

Prof. Dr. Mostafa Gadal-Haqq M. Mostafa


Computer Science Department
Faculty of Computer & Information Sciences
AIN SHAMS UNIVERSITY

(Most of the figures in this presentation are copyrighted by Pearson Education, Inc.)


Multilayer Perceptron

 Introduction

 Limitation of Rosenblatt’s Perceptron

 Batch Learning and On-line Learning

 The Back-propagation Algorithm

 Heuristics for Making the BP Alg. Perform Better

 Computer Experiment



Introduction
 Limitation of Rosenblatt’s Perceptron
 AND operation:

x1  x2 | d
 0   0 | 0
 0   1 | 0
 1   0 | 0
 1   1 | 1

(Figure: a single perceptron with bias input +1 and weight w0, inputs x1, x2 with weights w1, w2, and output y; the decision boundary is linear.)

Requiring the induced local field to be negative for the first three rows and positive for the last row gives:

w_1 \cdot 0 + w_2 \cdot 0 + w_0 < 0 \;\Rightarrow\; w_0 < 0
w_1 \cdot 0 + w_2 \cdot 1 + w_0 < 0 \;\Rightarrow\; w_2 < -w_0
w_1 \cdot 1 + w_2 \cdot 0 + w_0 < 0 \;\Rightarrow\; w_1 < -w_0
w_1 \cdot 1 + w_2 \cdot 1 + w_0 > 0 \;\Rightarrow\; w_1 + w_2 > -w_0

With the logistic activation f(z) = \frac{1}{1 + e^{-z}}, one such choice is y = f(10 x_1 + 10 x_2 - 20).

It is easy to find a set of weights that satisfies the above inequalities.
Introduction
 Limitation of Rosenblatt’s Perceptron
 OR Operation:

x1  x2 | d
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 1

(Figure: a single perceptron with bias input +1 and weight w0, inputs x1, x2 with weights w1, w2, and output y; the decision boundary is linear.)

Requiring the induced local field to be negative for the first row and positive for the other three rows gives:

w_1 \cdot 0 + w_2 \cdot 0 + w_0 < 0 \;\Rightarrow\; w_0 < 0
w_1 \cdot 0 + w_2 \cdot 1 + w_0 > 0 \;\Rightarrow\; w_2 > -w_0
w_1 \cdot 1 + w_2 \cdot 0 + w_0 > 0 \;\Rightarrow\; w_1 > -w_0
w_1 \cdot 1 + w_2 \cdot 1 + w_0 > 0 \;\Rightarrow\; w_1 + w_2 > -w_0

With the logistic activation f(z) = \frac{1}{1 + e^{-z}}, one such choice is y = f(20 x_1 + 20 x_2 - 10).

It is easy to find a set of weights that satisfies the above inequalities.
Introduction
 Limitation of Rosenblatt’s Perceptron
 XOR Operation:

x1  x2 | d
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0

(Figure: a single perceptron with bias input +1 and weight w0, inputs x1, x2 with weights w1, w2, and output y; a nonlinear decision boundary would be required.)

The four rows now require:

w_1 \cdot 0 + w_2 \cdot 0 + w_0 < 0 \;\Rightarrow\; w_0 < 0
w_1 \cdot 0 + w_2 \cdot 1 + w_0 > 0 \;\Rightarrow\; w_2 > -w_0
w_1 \cdot 1 + w_2 \cdot 0 + w_0 > 0 \;\Rightarrow\; w_1 > -w_0
w_1 \cdot 1 + w_2 \cdot 1 + w_0 < 0 \;\Rightarrow\; w_1 + w_2 < -w_0

y = f(???)

Together with the first inequality, the second and third are incompatible with the fourth, so
there is no solution for the XOR problem. We need a more complex network!
The XOR Problem
 A two-layer Network to solve the XOR Problem

w_{11} = w_{12} = +1, \qquad w_{21} = w_{22} = +1

b_1 = -\frac{3}{2}, \qquad b_2 = -\frac{1}{2}

w_{31} = -2, \qquad w_{32} = +1

b_3 = -\frac{1}{2}

Figure 4.8 (a) Architectural graph of network for solving the XOR problem. (b)
Signal-flow graph of the network.
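To make the construction concrete, the following sketch implements the network of Fig. 4.8 with the weights and biases listed above. A unit-step (McCulloch–Pitts) threshold activation is assumed for all three neurons, and the variable names and use of NumPy are illustrative, not part of the original slides.

```python
import numpy as np

def step(v):
    """Threshold (McCulloch-Pitts) activation: 1 if v > 0, else 0."""
    return (np.asarray(v) > 0).astype(int)

# Weights and biases from Fig. 4.8: hidden neurons 1, 2 and output neuron 3.
W_hidden = np.array([[1.0, 1.0],   # neuron 1: w11, w12
                     [1.0, 1.0]])  # neuron 2: w21, w22
b_hidden = np.array([-1.5, -0.5])  # b1 = -3/2, b2 = -1/2
w_out = np.array([-2.0, 1.0])      # neuron 3: w31, w32
b_out = -0.5                       # b3 = -1/2

def xor_net(x1, x2):
    h = step(W_hidden @ np.array([x1, x2], dtype=float) + b_hidden)
    return int(step(np.dot(w_out, h) + b_out))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))   # prints 0, 1, 1, 0
```

Hidden neuron 1 fires only for the input (1, 1), while hidden neuron 2 fires for every input except (0, 0); the output neuron combines them, with the strong negative weight -2 vetoing the (1, 1) case, which reproduces the decision regions of Fig. 4.9.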
The XOR Problem
 A two-layer Network to solve the XOR Problem

Figure 4.9 (a) Decision boundary constructed by hidden neuron 1 of the network in
Fig. 4.8. (b) Decision boundary constructed by hidden neuron 2 of the network. (c)
Decision boundaries constructed by the complete network.



MLP: Some Preliminaries
 The multilayer perceptron (MLP) is
proposed to overcome the limitations of the
perceptron; that is, to build a network that
can solve nonlinear problems.

 The basic features of the multilayer perceptrons:


 Each neuron in the network includes a nonlinear activation
function that is differentiable.
 The network contains one or more layers that are hidden from
both the input and output nodes.
 The network exhibits a high degree of connectivity.
MLP: Some Preliminaries
 Architecture of a multilayer perceptron

Figure 4.1 Architectural graph of a multilayer perceptron with two hidden layers.
MLP: Some Preliminaries
 Weight Dimensions

If the network has n units in layer i and m units in layer i + 1, then the weight
matrix Wij connecting the two layers will be of dimension m x (n + 1); the extra
column holds the bias weights.
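As a quick illustration (a sketch; the sizes n = 3 and m = 4 are made up for the example), the shape of such a weight matrix and its use on a bias-augmented input vector look as follows:

```python
import numpy as np

# Illustrative sizes: layer i has n = 3 units, layer i+1 has m = 4 units.
n, m = 3, 4
W = np.random.randn(m, n + 1) * 0.1   # m x (n+1); the extra column is for the bias

y_i = np.random.rand(n)               # outputs of the n units in layer i
y_aug = np.concatenate(([1.0], y_i))  # prepend the fixed bias input +1
v = W @ y_aug                         # induced local fields of the m units in layer i+1
print(W.shape, v.shape)               # (4, 4) (4,)
```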



MLP: Some Preliminaries
 Number of neurons in the output layer

Pedestrian   Car   Motorcycle   Truck
    1         0        0          0
    0         1        0          0
    0         0        1          0
    0         0        0          1

(Figure: example images of a pedestrian, a car, a motorcycle, and a truck.)
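In other words, for a K-class problem the output layer is given K neurons and the desired response is the corresponding one-of-K (one-hot) vector shown above. A minimal sketch, assuming the class list from this slide:

```python
import numpy as np

classes = ["Pedestrian", "Car", "Motorcycle", "Truck"]

def one_hot(label, classes):
    """Desired response d: one output neuron per class, 1 for the true class."""
    d = np.zeros(len(classes))
    d[classes.index(label)] = 1.0
    return d

print(one_hot("Motorcycle", classes))   # [0. 0. 1. 0.]
```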



MLP: Some Preliminaries
 Training of the multilayer perceptron proceeds in
two phases:

 In the forward phase, the weights of the network are fixed and
the input signal is propagated through the network, layer by
layer, until it reaches the output.

 In the backward phase, the error signal, which is produced by


comparing the output of the network and the desired response,
is propagated through the network, again layer by layer, but in
the backward direction.



MLP: Some Preliminaries
 Function Signal:
 is the input signal that comes in
at the input end of the network,
propagates forward (neuron
by neuron) through the network,
and emerges at the output of the
network as an output signal.
 Error Signal:
 originates at the output neuron of
the network and propagates
backward (layer by layer) through the network.

 Each hidden or output neuron computes these two signals.

Figure 4.2 Illustration of the directions of two basic signal flows in a multilayer
perceptron: forward propagation of function signals and back propagation of error signals.



MLP: Some Preliminaries
 Function of the Hidden neurons
 The hidden neurons play a critical role in the operation of a
multilayer perceptron; they act as feature detectors.
 The nonlinearity transforms the input data into a feature
space in which the data may be separated easily.

 Credit Assignment Problem


 This is the problem of assigning credit or blame for overall
outcomes to the internal decisions made by the computational
units of the distributed learning system.
 The error-correction learning algorithm is easy to use for
training single-layer perceptrons, but it is not easy to use for
multilayer perceptrons;
 the backpropagation algorithm solves this problem.



The Back-propagation Algorithm
 An on-line learning algorithm.

v_j(n) = \sum_{i=0}^{m} w_{ji}(n) \, y_i(n)

y_j(n) = \varphi_j(v_j(n))

e_j(n) = d_j(n) - y_j(n)

Figure 4.3 Signal-flow graph highlighting the details of output neuron j.
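The three relations above can be read directly as a forward computation for output neuron j. A minimal sketch (the logistic activation and the helper name are assumptions for illustration):

```python
import numpy as np

def output_neuron_forward(w_j, y_prev, d_j, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """One output neuron j: induced local field, output, and error signal.

    w_j    : weights w_j0..w_jm (w_j0 is the bias weight),
    y_prev : inputs y_1..y_m from the previous layer,
    d_j    : desired response.
    """
    y_aug = np.concatenate(([1.0], y_prev))   # y_0 = +1 (bias input)
    v_j = np.dot(w_j, y_aug)                  # v_j(n) = sum_i w_ji(n) y_i(n)
    y_j = phi(v_j)                            # y_j(n) = phi_j(v_j(n))
    e_j = d_j - y_j                           # e_j(n) = d_j(n) - y_j(n)
    return v_j, y_j, e_j
```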



The Back-propagation Algorithm
 The weights are updated in a manner similar to the LMS and
the gradient descent method. That is, the instantaneous error
and the weight corrections are:
E_j(n) = \frac{1}{2} e_j^2(n) \qquad \text{and} \qquad \Delta w_{ji}(n) = -\eta \, \frac{\partial E_j(n)}{\partial w_{ji}(n)}

 Using the chain rule of calculus, we get:

\frac{\partial E_j(n)}{\partial w_{ji}(n)} = \frac{\partial E_j(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial y_j(n)} \, \frac{\partial y_j(n)}{\partial v_j(n)} \, \frac{\partial v_j(n)}{\partial w_{ji}(n)}

 We have:

\frac{\partial E_j(n)}{\partial e_j(n)} = e_j(n), \quad \frac{\partial e_j(n)}{\partial y_j(n)} = -1, \quad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j'(v_j(n)), \quad \text{and} \quad \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)



The Back-propagation Algorithm
 which yields:
\frac{\partial E_j(n)}{\partial w_{ji}(n)} = -\, e_j(n) \, \varphi_j'(v_j(n)) \, y_i(n)

 Then the weight correction is given by:

\Delta w_{ji}(n) = \eta \, \delta_j(n) \, y_i(n)

 where the local gradient \delta_j(n) is defined by:

\delta_j(n) = -\frac{\partial E_j(n)}{\partial v_j(n)} = -\frac{\partial E_j(n)}{\partial e_j(n)} \, \frac{\partial e_j(n)}{\partial y_j(n)} \, \frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n) \, \varphi_j'(v_j(n))
The Back-propagation Algorithm
 That is, the local gradient of neuron j is equal to the product
of the corresponding error signal of that neuron and the
derivative of the associated activation function. Then,
we have two distinct cases:
 Case 1: Neuron j is an output node:
 In this case, it is easy to use the credit assignment rule to compute
the error signal ej(n), because we have the desired signal visible to
the output neuron. That is, ej(n)=dj(n) - yj(n).
 Case 2: Neuron j is a hidden node:
 In this case, the desired signal is not visible to the hidden neuron.
Accordingly, the error signal for the hidden neuron would have to be
determined recursively and working backwards in terms of the
error signals of all the neurons to which that hidden neuron is
directly connected.



The Back-propagation Algorithm
 Case 2: Neuron j is a hidden node.

Figure 4.4 Signal-flow graph highlighting the details of output neuron k connected
to hidden neuron j.
The Back-propagation Algorithm
 We redefine the local gradient for a hidden neuron j as:
E (n) y j (n) E (n)
 j ( n)      j (v j (n))
y j (n) v j (n) y j (n)
 Where the total instantaneous error of the output neuron k:
1
E(n)   ek (n)
2
2 kC
 Differentiating w. r. t. yj (n) yields:
E (n) ek (n) ek (n) vk
  ek (n)   ek (n)
y j (n) k y j (n) k vk (n) y j (n)
 But ek(n)  dk (n)  yk (n)  dk (n)  k (vk (n))
 Hence ek
 k (vk (n))
vk (n)
The Back-propagation Algorithm
 Also, we have:
v_k(n) = \sum_{j=0}^{m} w_{kj}(n) \, y_j(n)

 Differentiating yields:

\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)

 Then we get:

\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n) \, \varphi_k'(v_k(n)) \, w_{kj}(n) = -\sum_k \delta_k(n) \, w_{kj}(n)

 Finally, the back-propagation formula for the local gradient of hidden neuron j
 (with neuron k an output neuron) is:

\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) \, w_{kj}(n)



The Back-propagation Algorithm

Figure 4.5 Signal-flow graph of a part of the adjoint system pertaining to back-
propagation of error signals.
The Back-propagation Algorithm
 We summarize the relations for the back-propagation algorithm:

 First: the correction wji(n) applied to the weight connecting


neuron i to neuron j is defined by the delta rule:

(weight correction \Delta w_{ji}(n)) = (learning-rate parameter \eta) \times (local gradient \delta_j(n)) \times (input signal of neuron j, \; y_i(n))

 Second: the local gradient \delta_j(n) depends on whether neuron j is an output node or a hidden node:

 Neuron j is an output node:

\delta_j(n) = e_j(n) \, \varphi_j'(v_j(n)), \qquad e_j(n) = d_j(n) - y_j(n)

 Neuron j is a hidden node (neuron k is an output or hidden node):

\delta_j(n) = \varphi_j'(v_j(n)) \sum_k \delta_k(n) \, w_{kj}(n)
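Putting the two local-gradient cases and the delta rule together, the following sketch trains a small one-hidden-layer MLP on the XOR problem in on-line (stochastic) mode. The logistic activations, the 2-2-1 architecture, the learning rate, and all variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(v):                  # logistic activation, a = 1
    return 1.0 / (1.0 + np.exp(-v))

def phi_prime_from_y(y):     # phi'(v) written in terms of y = phi(v): y * (1 - y)
    return y * (1.0 - y)

# XOR training set: inputs X and desired responses D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([0.0, 1.0, 1.0, 0.0])

W1 = rng.normal(0.0, 0.5, (2, 3))   # hidden-layer weights, bias in column 0
W2 = rng.normal(0.0, 0.5, (1, 3))   # output-layer weights, bias in column 0
eta = 0.5                           # learning-rate parameter

for epoch in range(5000):
    for n in rng.permutation(len(X)):          # shuffle the examples each epoch
        x = np.concatenate(([1.0], X[n]))      # forward pass
        y1 = phi(W1 @ x)
        y1_aug = np.concatenate(([1.0], y1))
        y2 = phi(W2 @ y1_aug)

        e = D[n] - y2                                            # error signal e(n)
        delta2 = e * phi_prime_from_y(y2)                        # output local gradient
        delta1 = phi_prime_from_y(y1) * (W2[:, 1:].T @ delta2)   # hidden local gradients

        W2 += eta * np.outer(delta2, y1_aug)   # delta rule: eta * delta * input signal
        W1 += eta * np.outer(delta1, x)

for x_in in X:
    h = phi(W1 @ np.concatenate(([1.0], x_in)))
    y = phi(W2 @ np.concatenate(([1.0], h)))
    print(x_in, "->", round(float(y[0]), 3))   # typically approaches 0, 1, 1, 0
```

Note that XOR with only two hidden neurons can occasionally settle in a poor local minimum, so a different random seed, or the momentum term discussed later, may be needed to reach the 0/1 pattern.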



The Activation Function
 Differentiability is the only requirement that an activation
function has to satisfy in the BP algorithm.
 This is required to compute the local gradient δ for each neuron.

 Sigmoidal functions are commonly used, since they satisfy


such a condition:

 Logistic Function

\varphi(v) = \frac{1}{1 + \exp(-av)}, \quad a > 0
\qquad\Rightarrow\qquad
\varphi'(v) = \frac{a \exp(-av)}{[1 + \exp(-av)]^2} = a \, \varphi(v) \, [1 - \varphi(v)]

 Hyperbolic Tangent Function

\varphi(v) = a \tanh(bv), \quad a, b > 0
\qquad\Rightarrow\qquad
\varphi'(v) = \frac{b}{a} \, [a - \varphi(v)] \, [a + \varphi(v)]
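A sketch of both activation functions and their derivatives, written in the factored forms above so that φ'(v) can be obtained from φ(v) itself. The defaults a = 1.7159 and b = 2/3 for the hyperbolic tangent are the values commonly recommended in the literature; the numerical check at the end is only for illustration.

```python
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_prime(v, a=1.0):
    y = logistic(v, a)
    return a * y * (1.0 - y)            # a * phi(v) * [1 - phi(v)]

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    return a * np.tanh(b * v)

def tanh_act_prime(v, a=1.7159, b=2.0 / 3.0):
    y = tanh_act(v, a, b)
    return (b / a) * (a - y) * (a + y)  # (b/a) * [a - phi(v)] * [a + phi(v)]

# Quick check against a central-difference numerical derivative at v = 0.3:
v, h = 0.3, 1e-6
print(logistic_prime(v), (logistic(v + h) - logistic(v - h)) / (2 * h))
print(tanh_act_prime(v), (tanh_act(v + h) - tanh_act(v - h)) / (2 * h))
```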



The Rate of Learning
 A simple method of increasing the rate of learning
and avoiding instability (for a large learning-rate parameter η) is
to modify the delta rule by including a momentum
term as:

\Delta w_{ji}(n) = \alpha \, \Delta w_{ji}(n-1) + \eta \, \delta_j(n) \, y_i(n)


 where α is usually a positive
number called the momentum
constant.
 To ensure convergence, the
momentum constant must be
restricted to 0 \le |\alpha| < 1.

Figure 4.6 Signal-flow graph illustrating the effect of momentum constant α, which
lies inside the feedback loop.
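A sketch of the generalized delta rule with momentum; the function and argument names are illustrative, and the update is written with an outer product so that a whole layer's weight matrix is corrected at once.

```python
import numpy as np

def delta_rule_with_momentum(W, dW_prev, delta, y_in, eta=0.1, alpha=0.9):
    """Generalized delta rule with momentum (a sketch; names are illustrative).

    W       : weight matrix of the layer,
    dW_prev : weight correction from the previous iteration, same shape as W,
    delta   : local gradients of the layer's neurons,
    y_in    : input signals to the layer (bias input included).
    """
    dW = alpha * dW_prev + eta * np.outer(delta, y_in)  # Δw(n) = αΔw(n-1) + ηδ(n)y(n)
    return W + dW, dW   # updated weights and Δw(n), to be reused at the next step
```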
Summary of the Back-propagation Algorithm

1. Initialization
2. Presentation of training examples
3. Forward computation
4. Backward computation
5. Iteration

Figure 4.7 Signal-flow graphical summary of back-propagation learning. Top part of
the graph: forward pass. Bottom part of the graph: backward pass.
Heuristics for making the BP Better
1. Stochastic vs. Batch update
 Stochastic (sequential) mode is computationally faster than the
batch mode.

2. Maximizing information content


 Use an example that results in large training error
 Use an example that is radically different from the others.

3. Activation function
 Use an odd function
 Hyperbolic not logistic function

\varphi(v) = a \tanh(bv)
Heuristics for making the BP Better
4. Target values
 It is very important to choose the values of the desired response to be within
 the range of the sigmoid activation function.
5. Normalizing the inputs
 Each input variable should be preprocessed so that its mean value, averaged over
 the entire training sample, is close to zero, or else it will be small compared
 to its standard deviation.

Figure 4.11 Illustrating the operation of mean removal, decorrelation, and covariance
equalization for a two-dimensional input space.
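A sketch of the first step of this preprocessing (mean removal plus scaling to unit standard deviation); decorrelation and covariance equalization, the other two operations of Fig. 4.11, would require an additional whitening transform and are omitted here.

```python
import numpy as np

def normalize_inputs(X_train, X_test=None, eps=1e-12):
    """Mean removal and variance scaling of the input variables (a sketch)."""
    mu = X_train.mean(axis=0)           # per-input mean over the training sample
    sigma = X_train.std(axis=0) + eps   # per-input standard deviation
    X_train_n = (X_train - mu) / sigma
    if X_test is None:
        return X_train_n
    # Test data must be transformed with the *training* statistics.
    return X_train_n, (X_test - mu) / sigma
```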



Heuristics for making the BP Better
6. Initialization
 A good choice will be of tremendous help.
 Initialize the weights so that the standard deviation of the
induced local field v of a neuron lies in the transition area
between the linear and saturated parts of its sigmoid function.
7. Learning from hints
 This is achieved by including prior information that we may have
about the mapping function, e.g., symmetry, invariances, etc.
8. Learning rate
 All neurons in the multilayer should learn at the same rate,
except that at the last layer, the learning rate should be
assigned a smaller value than in the front layers.



Batch Learning and On-line Learning
 Consider the training sample used to train the network in supervised
manner:
T = \{ x(n), d(n);\; n = 1, 2, \ldots, N \}

 If y_j(n) is the function signal produced at output neuron j, the error signal
 produced at the same neuron is:

e_j(n) = d_j(n) - y_j(n)

 The instantaneous error produced at output neuron j is:

E_j(n) = \frac{1}{2} e_j^2(n)

 The total instantaneous error of the whole network is:

E(n) = \sum_{j \in C} E_j(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)

 The total instantaneous error averaged over the training sample is:

E_{av}(N) = \frac{1}{N} \sum_{n=1}^{N} E(n) = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} e_j^2(n)
Batch Learning and On-line Learning
Batch Learning:
 Adjustment of the weights of the MLP is performed after the
presentation of all the N training examples T.
 this is called an epoch of training.
 Thus, weight adjustment is made on an epoch-by-epoch basis.
 After each epoch, the examples in the training samples T are randomly
shuffled.
 Advantages:
 Accurate estimation of the gradient vector (the derivatives of the cost
function Eav with respect to the weight vector w), which therefore guarantees the
convergence of the method of steepest descent to a local minimum.
 Parallelization of the learning process.
 Disadvantages: it is demanding in terms of storage requirements.
Batch Learning and On-line Learning
On-line Learning:
 Adjustment of the weights of the MLP is performed on an example-
by-example basis.
 The cost function to be minimized is therefore the total instantaneous
error E (n).
 An epoch of training is the presentation of all the N examples to the
network. Also, in each epoch the examples are randomly shuffled.
 Advantages:
Its stochastic learning nature, make it less likely to be trapped in
local minimum.
 it is much less demanding in terms of storage requirements.
 Disadvantages:
 The learning process cannot be parallelized.



Batch Learning and On-line Learning
 Batch learning vs. On-line Learning:

Batch learning                                      | On-line learning
----------------------------------------------------|----------------------------------------------------
The learning process is performed by ensemble       | The learning process is performed in a
averaging, which in a statistical context may be    | stochastic manner.
viewed as a form of statistical inference.          |
Convergence to a local minimum is guaranteed.       | Less likely to be trapped in a local minimum.
Can be parallelized.                                | Cannot be parallelized.
Requires large storage.                             | Requires much less storage.
Well suited for nonlinear regression problems.      | Well suited for pattern-classification problems.



Generalization
 A network is said to generalize well when
the network input-output mapping is
correct (or nearly so) for the test data.
 The learning process may be viewed as a "curve-fitting" problem.
 When the network is trained with too many
examples, it may become overfitted, or
overtrained, which leads to poor
generalization.
 Sufficient training-Sample Size
 Generalization is influenced by three factors:
 The size of the training sample
 The network architecture
 The physical complexity of the problem at hand
 In practice, good generalization is achieved if
the training sample size, N, satisfies:

N = O(W / \varepsilon)

 where W is the number of free parameters (synaptic weights and biases) in the
 network, and ε is the fraction of classification errors permitted on test data.

Figure 4.16 (a) Properly fitted nonlinear classification mapping with good generalization.
(b) Overfitted nonlinear mapping with poor generalization.
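As an illustrative calculation (the numbers are invented, not from the slides): a network with W = 1000 free parameters and a permitted classification-error fraction ε = 0.1 would call for a training sample on the order of

N \approx \frac{W}{\varepsilon} = \frac{1000}{0.1} = 10{,}000 \ \text{examples.}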



Cross-Validation Method
 Cross-Validation is a standard tool in statistics that
provides an appealing guiding principle:
 First: the available data set is randomly partitioned into a
training set and a test set.
 Second: the training set is further partitioned into two disjoint
subsets:
 An estimation subset, used to select the model (estimate the
parameters).
 A validation subset, used to test or validate the model
 The training set is used to assess various models and choose the
“best” one.
 However, this best model may be overfitting the validation data.
 Then, to guard against this possibility, the generalization
performance is measured on the test set, which is different from
the validation subset.
Cross-Validation Method
 Early-stopping Method
 (Holdout method)
 The training is stopped
periodically, i.e., after so many
epochs, and the network is
assessed using the validation
subset.
 When the validation phase is
complete, the estimation
(training) is resumed for another
period, and the process is
repeated.
 The best model (parameters) is
that at the minimum validation error.

Figure 4.17 Illustration of the early-stopping rule based on cross-validation.
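A sketch of the early-stopping procedure; `model`, `train_step`, and `validation_error` are assumed to be supplied by the caller, and the period and patience values are arbitrary illustrations.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              period=5, max_epochs=500, patience=4):
    """Early stopping via periodic validation (a sketch).

    Every `period` epochs the estimation phase is paused, the network is
    assessed on the validation subset, and the best parameters seen so far
    are retained. Training stops after `patience` validations in a row
    without improvement.
    """
    best_err, best_model, bad_checks = float("inf"), copy.deepcopy(model), 0
    for epoch in range(1, max_epochs + 1):
        train_step(model)                      # one epoch on the estimation subset
        if epoch % period == 0:
            err = validation_error(model)      # assess on the validation subset
            if err < best_err:
                best_err, best_model, bad_checks = err, copy.deepcopy(model), 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break                      # validation error stopped improving
    return best_model, best_err
```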



Cross-Validation Method
 Variant of Cross-Validation
 (Multifold Method)
 Divide the data set of N samples
into K subsets, where K>1.
 The network is validated in each
trial using a different subset, after
training the network on the other subsets.
 The performance of the model is
assessed by averaging the
squared error under validation
over all trials.

Figure 4.18 Illustration of the multifold method of cross-validation. For a given
trial, the subset of data shaded in red is used to validate the model trained on
the remaining data.
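A sketch of the multifold (K-fold) procedure; `train_fn` and `error_fn` are assumed to be supplied by the caller, and X and D are assumed to be NumPy arrays.

```python
import numpy as np

def multifold_validation(X, D, K, train_fn, error_fn, seed=0):
    """K-fold (multifold) cross-validation (a sketch).

    The N examples are split into K subsets; in each trial one subset is held
    out for validation and the model is trained on the other K - 1 subsets.
    The returned score is the validation error averaged over all K trials.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(K) if i != k])
        model = train_fn(X[train_idx], D[train_idx])
        errors.append(error_fn(model, X[val_idx], D[val_idx]))
    return float(np.mean(errors))
```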



Computer Experiment
 d= -4

Figure 4.12 Results of the computer experiment on the back-propagation algorithm
applied to the MLP with distance d = –4. MSE stands for mean-square error.
Computer Experiment
 d = -5

Figure 4.13 Results of the computer experiment on the back-propagation algorithm
applied to the MLP with distance d = –5.
Real Experiment
 Handwritten Digit Recognition*

*Courtesy of Yann LeCun.



Homework 4
• Problems: 4.1, 4.3

• Computer Experiment: 4.15

Next Time

Kernel Methods and RBF Networks

