
University of Khartoum
Department of Electronics & Electrical Engineering
Software & Control Engineering

EC5245: ARTIFICIAL NEURAL NETWORK & FUZZY LOGIC
By: Ustaza Hiba Hassan
Lecture 3

Initializing the weights


• The weights are best initialized to small
random values.
• Usually they are drawn from a uniform distribution on [–std, +std] around zero, or from a Gaussian distribution with mean zero and standard deviation std (see the sketch below).
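A minimal sketch in Python/NumPy (an illustration only; the layer sizes and the value of std are assumptions, not values from the lecture):

    import numpy as np

    n_inputs, n_hidden = 4, 3     # illustrative layer sizes (assumption)
    std = 0.1                     # small spread around zero (assumption)

    # Uniform initialization in [-std, +std]
    W_uniform = np.random.uniform(-std, std, size=(n_hidden, n_inputs))

    # Gaussian initialization with mean 0 and standard deviation std
    W_gaussian = np.random.normal(0.0, std, size=(n_hidden, n_inputs))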

Weight Space
• For any given network, there is a fixed number of
connections with associated weights.
• So, if there are n weights, then each configuration
of weights that defines an instance of the network
is a vector, W, of length n.
• W can be considered to be a point in an n-
dimensional weight space, where each axis is
associated with one of the connections in the
network.

Choosing Appropriate Learning Rate


• Choosing a good value for the learning rate η is constrained
by two opposing factors:
1. If η is too small, it will take too long to get anywhere near
the minimum of the error function.
2. If η is too large, the weight updates will over-shoot the error
minimum and the weights will oscillate, or even diverge.
• However, the right learning rate is problem- and network-dependent.
• Generally, several values (e.g. η = 0.1, 0.01, 1.0, 0.0001) are tried and the results are used as a guide (see the sketch below).
• The learning rate may also be changed during the learning process, to mimic the age-dependent learning rates found in human children.
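A toy one-dimensional sketch of the two opposing effects (assuming a simple quadratic error E(w) = w², not an actual network):

    def final_error(eta, steps=100):
        # Gradient descent on E(w) = w**2 starting from w = 1.0,
        # a stand-in for training a network with learning rate eta.
        w = 1.0
        for _ in range(steps):
            w -= eta * 2 * w          # dE/dw = 2w
        return w ** 2

    for eta in [0.0001, 0.01, 0.1, 1.0]:
        print(eta, final_error(eta))

With eta = 0.0001 the error barely decreases (too slow); with eta = 1.0 the weight oscillates between +1 and −1 and the error never improves.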

Learning and Generalization in Perceptron Networks
• The critical aspects of the network’s operation are:
1) Learning: The network must learn decision
boundaries from a set of training patterns to
enable it to classify them properly.
2) Generalization: After training, the network must be
able to classify new test patterns correctly.
• Usually we want the neural network to learn in a way
that produces good generalization.
• There is a trade-off between learning and
generalization.

Learning (cont.)
• The performance of the neural network is assessed using various metrics computed on the testing set, such as the mean square error, SNR, etc.
• Another method of estimating the error rate of the neural network is resampling.
• The idea is to repeat the training and testing processes multiple times.
• Two main resampling techniques are used:
• Cross-Validation & Bootstrapping.

Overfitting: How it occurs!


• The training data may contain two types of noise:
1. The target values may be unreliable (though
not very common).
2. There is sampling error. There will be
accidental regularities just because of the
particular training cases that were chosen.
• When the model is fitted, it cannot differentiate
between the real regularities and those caused by
sampling error. So it fits both!
• If the model is very flexible it can model the
sampling error really well.

Overfitting (Cont.)
• Overfitting can also occur if a “good” training set is not chosen.
• A “good” training set must consist of:
• Samples that represent the general population.
• Samples that contain members of each class.
• Samples in each class that cover a wide range of variations or noise effects.

A simple example of overfitting


• A simple model vs a complicated one?
• The complicated model fits the data better.
• But it is not economical.
[Figure: a simple fit and a complicated fit to the same data points; horizontal axis: input x, vertical axis: output y.]

Ways to reduce Overfitting


• Many different methods have been developed,
some of them are:
A. Weight-decay
B. Weight-sharing
C. Early stopping
D. Model averaging
E. Bayesian fitting of neural nets
F. Dropout
G. Generative pre-training

Assignment
• Write a report explaining a single overfitting-reduction technique:
• Explain overfitting, its causes and effects.
• Explain the chosen reduction technique in detail.
• Use Matlab, or another appropriate language, to illustrate your chosen technique by applying it to data sets of your choice.
• Students should work in pairs. A maximum of 5 groups is allowed to investigate a single technique; however, every pair is required to use different data sets.

• Submission date: 8 May 2018



Perceptron Networks
• Which of these problems may be solved by a perceptron network? Justify your answer.

Solving with Perceptron Networks


• A single-neuron perceptron can categorize 2 classes if they are linearly separable.
• An n-neuron perceptron can categorize up to 2^n classes if they are linearly separable.
• That is, to categorize 4 classes we need a 2-neuron perceptron network, and so on.

Example
• We have a classification problem with four classes
of input vector.
• The four classes (class 1, class 2, class 3, class 4) are defined by the prototype input vectors shown in the slide figure.
• Design a perceptron network to solve this problem.



Solution
• To solve a problem with four classes of input vector
we will need a perceptron with at least two neurons.
• Hence;

Solution (cont.)
• The light circles indicate class 1 vectors, the light
squares indicate class 2 vectors, the dark circles
indicate class 3 vectors, and the dark squares
indicate class 4 vectors.

Solution (cont.)
• We try to divide the input space into the four
categories.

• Thus our patterns are linearly separable.



Solution (cont.)
• The weight vectors should be orthogonal to the decision
boundaries and point toward the regions where the
neuron outputs are 1.
• We can choose the target classes to be:

Cont.
• Hence, if we select the following weight vectors:

• We can apply the rule & find the corresponding


biases as such;

Final Solution
• Hence, to solve our problem we need a 2 neuron
perceptron network with the following weights
and biases:
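The actual weight vectors, biases and class prototypes appear only in the slide figures; the sketch below uses hypothetical values, chosen only to illustrate the structure: 2 hard-limit neurons whose combined outputs form a 2-bit code, one code per class.

    import numpy as np

    def hardlim(x):
        return (x >= 0).astype(int)

    # Hypothetical weight matrix and bias vector (NOT the values from the slides):
    # neuron 1 separates classes {1, 2} from {3, 4}; neuron 2 separates {1, 3} from {2, 4}.
    W = np.array([[-1.0,  0.0],     # each row is the weight vector of one neuron
                  [ 0.0, -1.0]])
    b = np.array([0.5, 0.5])

    # Hypothetical prototype inputs, one per class, with 2-bit targets.
    patterns = {
        "class 1": (np.array([0.0, 0.0]), np.array([1, 1])),
        "class 2": (np.array([0.0, 1.0]), np.array([1, 0])),
        "class 3": (np.array([1.0, 0.0]), np.array([0, 1])),
        "class 4": (np.array([1.0, 1.0]), np.array([0, 0])),
    }

    for name, (p, t) in patterns.items():
        a = hardlim(W @ p + b)       # 2-bit output code of the perceptron
        print(name, a, "target:", t)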

GRADIENT DESCENT LEARNING

Gradient Descent Learning in NN


• The gradient is the rate of change of f(x) at a particular value of x.
• Hence, it is the derivative of f(x) with respect to x.
• This leads to Gradient Descent Learning, whose aim is to find the minimum error by computing the derivative of the error function with respect to each weight. It is sometimes called Gradient Descent Minimization.
• Thus the weight update is computed as:

  ΔW_ij = −η (∂E / ∂W_ij)

• where η is the learning rate.
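As a one-line sketch (assuming Python/NumPy and that the gradient matrix dE_dW, i.e. ∂E/∂W_ij for every weight, has already been computed):

    import numpy as np

    eta = 0.1                                  # learning rate (assumed value)
    W = np.random.uniform(-0.1, 0.1, (3, 4))   # current weights (illustrative shape)
    dE_dW = np.zeros_like(W)                   # placeholder for the computed gradient

    W += -eta * dE_dW                          # delta_W_ij = -eta * dE/dW_ij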

Finding the minimum of a function: gradient descent



gradient descent on an error



Cont.
• For a target (t) and an actual output (o), the error is given by the following mean square error cost function:

  E = ½ Σ_{d∈D} (t_d − o_d)²

• where D is the set of training examples.

• There are two modes of gradient-descent training, batch and incremental; they are described next.


Batch Training (updating the weights in batches):


• Batch Training: In batch mode the weights and
biases of the network are updated only after the
entire training set has been applied to the network.
The gradients calculated at each training example
are added together to determine the change in the
weights and biases.
• Batch Gradient Descent: In the batch steepest
descent training function the weights and biases are
updated in the direction of the negative gradient of
the performance function. There is only one training
function associated with a given network.
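A minimal sketch of one batch epoch, assuming (for concreteness) a single linear neuron and illustrative data; the point is that the gradients from all training examples are accumulated before a single update is made:

    import numpy as np

    def batch_epoch(W, b, X, T, eta):
        # Accumulate the gradient over the entire training set ...
        grad_W = np.zeros_like(W)
        grad_b = 0.0
        for x, t in zip(X, T):
            o = W @ x + b                # linear output
            grad_W += -(t - o) * x       # gradient of 0.5*(t - o)**2 w.r.t. W
            grad_b += -(t - o)
        # ... then make one update in the negative-gradient direction.
        return W - eta * grad_W, b - eta * grad_b

    # Illustrative data (assumption): learn o = 2*x1 - x2.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([0., -1., 2., 1.])
    W, b = np.zeros(2), 0.0
    for _ in range(200):
        W, b = batch_epoch(W, b, X, T, eta=0.1)
    print(W, b)    # approaches [2, -1] and 0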

Batch Gradient Descent with Momentum:
• This algorithm often provides faster convergence.
• Momentum allows a network to respond not only to
the local gradient, but also to recent trends in the
error surface.
• Acting like a low-pass filter, momentum allows the
network to ignore small features in the error surface.
• Without momentum a network may get stuck in a
shallow local minimum, such as shown in the next
figure.
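A minimal sketch of the momentum modification (assuming the batch gradient grad_W has already been computed as above; eta and alpha are illustrative values):

    import numpy as np

    eta, alpha = 0.1, 0.9              # learning and momentum rates (assumed values)
    W = np.zeros(2)
    velocity = np.zeros_like(W)        # running memory of recent updates

    def momentum_step(W, velocity, grad_W):
        # New step = alpha * previous step - eta * current gradient, so the
        # update follows recent trends in the error surface, not just the
        # local gradient, and small features are smoothed out.
        velocity = alpha * velocity - eta * grad_W
        return W + velocity, velocity

    grad_W = np.array([0.5, -0.2])     # placeholder gradient for illustration
    W, velocity = momentum_step(W, velocity, grad_W)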

Incremental Mode Gradient Descent


• When we use the gradient with respect to one training example at a time, gradient descent becomes the Widrow–Hoff delta rule, which is given by:

  ΔW_i = η (t − o) x_i

• Also called the Least Mean Square (LMS) method.
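A sketch of the incremental (per-example) update for a single linear neuron; the training pairs and eta are illustrative assumptions:

    import numpy as np

    eta = 0.1
    W = np.zeros(2)                          # single linear neuron, no bias for brevity

    # Illustrative training pairs (x, t); the weights are updated after every example.
    data = [(np.array([1., 0.]),  1.0),
            (np.array([0., 1.]), -1.0)]

    for epoch in range(100):
        for x, t in data:
            o = W @ x                        # actual (unthresholded) output
            W += eta * (t - o) * x           # delta / LMS rule: W_i += eta*(t - o)*x_i
    print(W)                                 # approaches [1, -1]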

LMS Learning Rule


Mean Square Error:
• Like the perceptron learning rule, the least mean square error (LMS) algorithm, the delta rule, is an example of supervised training, in which the learning rule is provided with a set of examples of desired network behavior:

  {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}

• We want to minimize the average of the sum of the squared errors between the target & the actual network output:

  mse = (1/Q) Σ_{k=1}^{Q} e(k)² = (1/Q) Σ_{k=1}^{Q} (t(k) − a(k))²

LMS Algorithm (Adaline rule / Widrow–Hoff rule)
• The LMS algorithm was presented by Widrow
and Hoff, hence, it is called Widrow-Hoff learning
algorithm.
• As seen before, it is based on an approximate steepest descent procedure.
• Widrow and Hoff decided that they could estimate
the mean square error by using the squared error
at each iteration.

Comparing Perceptron & Delta Rules


• Perceptron rule
• Thresholded output.
• Converges after a finite number of iterations to a
hypothesis that perfectly classifies the training data,
provided the training examples are linearly separable.
• Requires linearly separable data.
• Delta rule
• Unthresholded output.
• Converges toward the error minimum, possibly requiring
unbounded time, but converges regardless of whether the
training data are linearly separable or not.
• Works even on linearly non-separable data.

Adaptive Linear Neuron Network Architecture (ADALINE)
• The ADALINE network is a single layer neural
network with multiple nodes, where each node
accepts multiple inputs to generate one output.
• ADALINE networks are similar to the perceptron, but
their transfer function is linear rather than hard-
limiting. This allows their outputs to take on any
value, whereas the perceptron output is limited to
either 0 or 1.
• Both the ADALINE and the perceptron can only solve
linearly separable problems.

Cont.
• An adaptive linear system responds to changes in
its environment as it is operating.
• These networks are often used in error
cancellation, signal processing, and control
systems. For example, they are used by many
long distance phone lines for echo
cancellation.
• The pioneering work in this field was done by Widrow and Hoff, who gave the name ADALINE to adaptive linear elements.

The ADALINE Neural Network



Cont.
• A multiple-layer ADALINE network is called a MADALINE.
• The Widrow-Hoff rule can only train single-layer
linear networks. This is not much of a disadvantage;
single-layer linear networks are just as capable as
multilayer linear networks.
• For every multilayer linear network, there is an
equivalent single-layer linear network.

BACKPROPAGATION ALGORITHM

BackPropagation Algorithm
• The objective of this algorithm is to minimize the error between the target and the actual output by finding the weight updates Δw.
• The error is calculated at every iteration and is
back propagated through the layers of the ANN to
adapt the weights.
• The weights are adapted such that the error is
minimized.
• Once the error has reached an acceptable minimum value, the training is stopped.

Cont.
• The configuration for training a neural network using
the BP algorithm is shown in the figure below.

The Generalized Delta Rule (G.D.R.)


• In the BP algorithm, as in other learning algorithms, the goal is to find the next value of the adaptation weights (Δw); the rule for computing it is known as the G.D.R.
• Consider the following ANN model:

Cont.
• We need to obtain the following rule to adapt the weights between the output (k) and hidden (j) layers:

• Where the weights are adapted as follows:

• And t is the iteration number and δ_k is the error signal between the output and hidden layers, which is given by:
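The equations referred to above appear as images in the original slides and are not reproduced here; a standard form of the generalized delta rule for the output layer, assuming logistic output units as in the model described later, is:

  Δw_kj(t+1) = η · δ_k · o_j + α · Δw_kj(t)
  w_kj(t+1)  = w_kj(t) + Δw_kj(t+1)
  δ_k        = o_k (1 − o_k)(t_k − o_k)

where o_j is the output of hidden neuron j, o_k the actual output of output neuron k, and t_k its target.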

Cont.
• Adaptation between the input (i) and hidden (j) layers:

• The new weight is thus:

• and the error signal through layer j is:

• Where,

• And,
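As above, the slide's equations are images; a standard form for the input-to-hidden weights, under the same assumptions, is:

  Δw_ji(t+1) = η · δ_j · o_i + α · Δw_ji(t)
  w_ji(t+1)  = w_ji(t) + Δw_ji(t+1)
  δ_j        = o_j (1 − o_j) · Σ_k δ_k w_kj

where the quantities left implicit by "Where" and "And" above are net_j = Σ_i w_ji o_i + θ_j and o_j = f(net_j) = 1 / (1 + e^(−net_j)).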

Backpropagation Algorithm
• The following ANN model is used to derive the
backpropagation algorithm:

BP (cont.)
• The backpropagation has two steps,
• Forward propagation, and
• Backward propagation.
• Our ANN model has the following assumptions:
• A two-layer NN model, i.e. one with a single layer (set) of hidden neurons.
• Neurons in layer i are fully connected to layer j and
neurons in layer j are fully connected to layer k.
• Input layer neurons have linear activation functions
and hidden and output layer neurons have logistic
activation functions (sigmoids).

Note: Sigmoid Function


• Sigmoids have a variable c that controls their firing
angle.
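Assuming the standard parameterised logistic form f(x) = 1 / (1 + e^(−c·x)), a minimal sketch:

    import numpy as np

    def sigmoid(x, c=1.0):
        # Logistic sigmoid; larger c gives a steeper transition (the "firing angle").
        return 1.0 / (1.0 + np.exp(-c * x))

    print(sigmoid(0.0))                                # 0.5 for any c
    print(sigmoid(1.0, c=1.0), sigmoid(1.0, c=5.0))    # larger c saturates faster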

Cont.
• The firing angle used here is c=1.
• Bias weights are used with bias signals of 1 for hidden
(j) and output layer (k) neurons.
• In many ANN models, bias weights (θ) with bias signals
of 1 are used to speed up the convergence process.
• The learning parameter is given by the symbol η and is usually fixed at a value between 0 and 1; however, in many applications nowadays an adaptive η is used.
• Usually η is set large in the initial stage of learning and
reduced to a small value at the final stage of learning.
• A momentum term α is also used in the G.D.R. to avoid
local minima.

Steps of BP Algorithm
• Step 1: Obtain a set of training patterns.
• Step 2: Set up neural network model: No. of Input
neurons, Hidden neurons, and Output Neurons.
• Step 3: Set learning rate η and momentum rate α
• Step 4: Initialize all connection weights Wji, Wkj and bias weights θj, θk to random values.
• Step 5: Set minimum error, Emin
• Step 6: Start training by applying input patterns one
at a time and propagate through the layers then
calculate total error.

Cont.
• Step 7: Backpropagate error through output and
hidden layer and adapt weights.
• Step 8: Backpropagate error through hidden and
input layer and adapt weights.
• Step 9: Check if Error < Emin.
• If not, repeat Steps 6–9; if yes, stop training. (A minimal sketch of these steps is given below.)
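A compact sketch of Steps 1–9 in Python/NumPy, assuming two hidden neurons with bias weights and logistic activations, trained on the XOR patterns used in the next example. The values of η, α, E_min and the initialization range are illustrative assumptions; the worked example that follows uses an even smaller network so it can be done by hand, and convergence depends on the random initialization.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Steps 1-2: training patterns (XOR) and network sizes.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([[0.], [1.], [1.], [0.]])
    n_in, n_hid, n_out = 2, 2, 1

    # Step 3: learning rate and momentum rate (illustrative values).
    eta, alpha = 0.5, 0.9

    # Step 4: random connection and bias weights; Step 5: minimum error.
    rng = np.random.default_rng(0)
    W_ji = rng.uniform(-0.5, 0.5, (n_hid, n_in));  th_j = rng.uniform(-0.5, 0.5, n_hid)
    W_kj = rng.uniform(-0.5, 0.5, (n_out, n_hid)); th_k = rng.uniform(-0.5, 0.5, n_out)
    dW_kj = np.zeros_like(W_kj); dth_k = np.zeros_like(th_k)
    dW_ji = np.zeros_like(W_ji); dth_j = np.zeros_like(th_j)
    E_min = 0.01

    for epoch in range(20000):
        E = 0.0
        for x, t in zip(X, T):
            # Step 6: forward pass through hidden (j) and output (k) layers.
            o_j = sigmoid(W_ji @ x + th_j)
            o_k = sigmoid(W_kj @ o_j + th_k)
            E += 0.5 * np.sum((t - o_k) ** 2)

            # Steps 7-8: error signals for the output and hidden layers.
            delta_k = o_k * (1 - o_k) * (t - o_k)
            delta_j = o_j * (1 - o_j) * (W_kj.T @ delta_k)

            # Adapt weights with the generalized delta rule (momentum included).
            dW_kj = eta * np.outer(delta_k, o_j) + alpha * dW_kj
            dth_k = eta * delta_k + alpha * dth_k
            W_kj += dW_kj; th_k += dth_k

            dW_ji = eta * np.outer(delta_j, x) + alpha * dW_ji
            dth_j = eta * delta_j + alpha * dth_j
            W_ji += dW_ji; th_j += dth_j

        # Step 9: stop once the total error over all patterns is small enough.
        if E < E_min:
            break

    print(epoch, E)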

Solving an XOR Problem


• In this example we use the BP algorithm to solve a 2-bit
XOR problem.
• The training patterns of this ANN are the XOR examples given in the next table.
• For simplicity, the ANN model has only 4 neurons (2
inputs, 1 hidden and 1 output) and has no bias weights.
• The input neurons have linear functions and the hidden
and output neurons have sigmoid functions.
• The weights are initialized randomly.
• We train the ANN by providing the patterns #1 to #4
through an iteration process until the error is minimized.

Cont.
• The training patterns of this ANN are the XOR examples given in the following table (the 2-bit XOR truth table):

  Pattern   Inputs (x1, x2)   Target t
  #1        0, 0              0
  #2        0, 1              1
  #3        1, 0              1
  #4        1, 1              0

Cont.
• The ANN model and its initial weights,

• Training begins when the pattern#1 and its target are


provided to the ANN.
• 1st pattern: (0, 0), target: 0

Compute the error by comparing this value to the target,



Cont.
• This error is now backpropagated through the layers following the error-signal equations given as follows:
• Between the output (k) and hidden (j) layer:

• Thus,
• Between the hidden (j) and input (i) layer:

• = -0.0035

Cont.
• Now we have calculated the error signal between layers (k) and (j).

• If we had chosen the learning rate and momentum term as follows:
• η = 0.1 and α = 0.9
• and the previous change in weight is 0 and the hidden-neuron output o_j = 0.5,
• then the resulting change in weight is

  Δw_kj = -0.0064

Cont.
• This is the increment of the weight after the first
iteration for the weight between layers k and j.
• Now this change in weight is added to the actual
weight as follows

• and thus the weight between layers k and j has


been adapted.

Cont.
• Similarly for the weights between layers j and i, the
adaptation follows

• Now this change in weight is added to the actual


weight as follows:

• and this is the adapted weight between layers j and i


after pattern#1 is seen by the ANN in the first iteration.
• The whole calculation is then repeated for the next
pattern (pattern#2 = [0, 1]) with tk=1.
• After all the 4 patterns have been completed the whole
process is repeated for pattern#1 again.
