
University of Khartoum
Department of Electronics & Electrical Engineering
Software & Control Engineering

EC5245: ARTIFICIAL NEURAL NETWORK & FUZZY LOGIC
By: Ustaza Hiba Hassan
Lecture 3

Initializing the weights


• The weights are best initialized to small
random values.
• Usually they are drawn from a uniform distribution on [–std, +std] around zero, or from a Gaussian distribution with mean zero and standard deviation std (see the sketch below).
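A minimal sketch in Python/NumPy (an illustration only; the layer sizes and the value of std are assumptions, not values from the lecture):

    import numpy as np

    n_inputs, n_hidden = 4, 3     # illustrative layer sizes (assumption)
    std = 0.1                     # small spread around zero (assumption)

    # Uniform initialization in [-std, +std]
    W_uniform = np.random.uniform(-std, std, size=(n_hidden, n_inputs))

    # Gaussian initialization with mean 0 and standard deviation std
    W_gaussian = np.random.normal(0.0, std, size=(n_hidden, n_inputs))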

Weight Space
• For any given network, there is a fixed number of
connections with associated weights.
• So, if there are n weights, then each configuration
of weights that defines an instance of the network
is a vector, W, of length n.
• W can be considered to be a point in an n-
dimensional weight space, where each axis is
associated with one of the connections in the
network.

Choosing Appropriate Learning Rate


• Choosing a good value for the learning rate η is constrained
by two opposing factors:
1. If η is too small, it will take too long to get anywhere near
the minimum of the error function.
2. If η is too large, the weight updates will over-shoot the error
minimum and the weights will oscillate, or even diverge.
• However, the right learning rate is problem- and network-dependent.
• Generally, several values (e.g. η = 0.1, 0.01, 1.0, 0.0001) are tried and the results are used as a guide (see the sketch below).
• The learning rate may also be changed during the learning process, to mimic the age-dependent learning rates found in human children.
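A toy one-dimensional sketch of the two opposing effects (assuming a simple quadratic error E(w) = w², not an actual network):

    def final_error(eta, steps=100):
        # Gradient descent on E(w) = w**2 starting from w = 1.0,
        # a stand-in for training a network with learning rate eta.
        w = 1.0
        for _ in range(steps):
            w -= eta * 2 * w          # dE/dw = 2w
        return w ** 2

    for eta in [0.0001, 0.01, 0.1, 1.0]:
        print(eta, final_error(eta))

With eta = 0.0001 the error barely decreases (too slow); with eta = 1.0 the weight oscillates between +1 and −1 and the error never improves.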

Learning and Generalization in Perceptron Networks
• The critical aspects of the network’s operation are:
1) Learning: The network must learn decision
boundaries from a set of training patterns to
enable it to classify them properly.
2) Generalization: After training, the network must be
able to classify new test patterns correctly.
• Usually we want the neural network to learn in a way
that produces good generalization.
• There is a trade-off between learning and
generalization.

Learning (cont.)
• The performance of the neural network is assessed using various metrics computed on the testing set, such as the mean square error, SNR, etc.
• Another method of estimating the error rate of the neural network is resampling.
• The idea is to repeat the training and testing processes multiple times.
• Two main resampling techniques are used:
• Cross-Validation & Bootstrapping.

Overfitting: How it occurs!


• The training data may contain two types of noise:
1. The target values may be unreliable (though
not very common).
2. There is sampling error. There will be
accidental regularities just because of the
particular training cases that were chosen.
• When the model is fitted, it cannot differentiate
between the real regularities and those caused by
sampling error. So it fits both!
• If the model is very flexible it can model the
sampling error really well.

Overfitting (Cont.)
• Overfitting can also occur if a “good” training set is not chosen.
• A “good” training set must consist of:
• Samples that represent the general population.
• Samples that contain members of each class.
• Samples in each class that cover a wide range of variations or noise effects.

A simple example of overfitting


• A simple model vs a complicated one?
• The complicated model fits the data better.
• But it is not economical.
[Figure: a simple fit and a complicated fit to the same data points; horizontal axis: input x, vertical axis: output y.]

Ways to reduce Overfitting


• Many different methods have been developed,
some of them are:
A. Weight-decay
B. Weight-sharing
C. Early stopping
D. Model averaging
E. Bayesian fitting of neural nets
F. Dropout
G. Generative pre-training

Assignment
• Write a report explaining a single overfitting-reduction technique:
• Explain overfitting, its causes and effects.
• Explain the chosen reduction technique in detail.
• Use Matlab, or another appropriate language, to illustrate your chosen technique by applying it to data sets of your choice.
• Students should work in pairs. A maximum of 5 groups is allowed to investigate a single technique; however, every pair is required to use different data sets.

• Submission date: 8 May 2018



Perceptron Networks
• Which of these problems may be solved by a perceptron network? Justify your answer.

Solving with Perceptron Networks


• A single-neuron perceptron can categorize 2 classes if they are linearly separable.
• An n-neuron perceptron can categorize up to 2^n classes if they are linearly separable.
• That is, to categorize 4 classes we need a 2-neuron perceptron network, and so on.

Example
• We have a classification problem with four classes
of input vector.
• The four classes (class 1, class 2, class 3, class 4) are defined by the prototype input vectors shown in the slide figure.
• Design a perceptron network to solve this problem.



Solution
• To solve a problem with four classes of input vector
we will need a perceptron with at least two neurons.
• Hence;

Solution (cont.)
• The light circles indicate class 1 vectors, the light
squares indicate class 2 vectors, the dark circles
indicate class 3 vectors, and the dark squares
indicate class 4 vectors.

Solution (cont.)
• We try to divide the input space into the four
categories.

• Thus our patterns are linearly separable.



Solution (cont.)
• The weight vectors should be orthogonal to the decision
boundaries and point toward the regions where the
neuron outputs are 1.
• We can choose the target classes to be:

Cont.
• Hence, if we select the following weight vectors:

• We can apply the rule & find the corresponding


biases as such;

Final Solution
• Hence, to solve our problem we need a 2 neuron
perceptron network with the following weights
and biases:
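The actual weight vectors, biases and class prototypes appear only in the slide figures; the sketch below uses hypothetical values, chosen only to illustrate the structure: 2 hard-limit neurons whose combined outputs form a 2-bit code, one code per class.

    import numpy as np

    def hardlim(x):
        return (x >= 0).astype(int)

    # Hypothetical weight matrix and bias vector (NOT the values from the slides):
    # neuron 1 separates classes {1, 2} from {3, 4}; neuron 2 separates {1, 3} from {2, 4}.
    W = np.array([[-1.0,  0.0],     # each row is the weight vector of one neuron
                  [ 0.0, -1.0]])
    b = np.array([0.5, 0.5])

    # Hypothetical prototype inputs, one per class, with 2-bit targets.
    patterns = {
        "class 1": (np.array([0.0, 0.0]), np.array([1, 1])),
        "class 2": (np.array([0.0, 1.0]), np.array([1, 0])),
        "class 3": (np.array([1.0, 0.0]), np.array([0, 1])),
        "class 4": (np.array([1.0, 1.0]), np.array([0, 0])),
    }

    for name, (p, t) in patterns.items():
        a = hardlim(W @ p + b)       # 2-bit output code of the perceptron
        print(name, a, "target:", t)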

GRADIENT DESCENT LEARNING

Gradient Descent Learning in NN


• The gradient is the rate of change of f(x) at a particular value of x.
• Hence, it is the derivative of f(x) with respect to x.
• This leads to Gradient Descent Learning, whose aim is to find the minimum error by computing the derivative of the error function with respect to each weight. It is sometimes called Gradient Descent Minimization.
• Thus the weight update is computed as:

  ΔW_ij = −η (∂E / ∂W_ij)

• where η is the learning rate.
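As a one-line sketch (assuming Python/NumPy and that the gradient matrix dE_dW, i.e. ∂E/∂W_ij for every weight, has already been computed):

    import numpy as np

    eta = 0.1                                  # learning rate (assumed value)
    W = np.random.uniform(-0.1, 0.1, (3, 4))   # current weights (illustrative shape)
    dE_dW = np.zeros_like(W)                   # placeholder for the computed gradient

    W += -eta * dE_dW                          # delta_W_ij = -eta * dE/dW_ij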

Finding the minimum of a function: gradient descent



gradient descent on an error



Cont.
• For a target (t) and an actual output (o), the error is given by the following mean square error cost function:

  E = ½ Σ_{d∈D} (t_d − o_d)²

• where D is the set of training examples.

• There are two modes of gradient-descent training, batch and incremental; they are described next.


Batch Training (updating the weights in batches):


• Batch Training: In batch mode the weights and
biases of the network are updated only after the
entire training set has been applied to the network.
The gradients calculated at each training example
are added together to determine the change in the
weights and biases.
• Batch Gradient Descent: In the batch steepest
descent training function the weights and biases are
updated in the direction of the negative gradient of
the performance function. There is only one training
function associated with a given network.
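A minimal sketch of one batch epoch, assuming (for concreteness) a single linear neuron and illustrative data; the point is that the gradients from all training examples are accumulated before a single update is made:

    import numpy as np

    def batch_epoch(W, b, X, T, eta):
        # Accumulate the gradient over the entire training set ...
        grad_W = np.zeros_like(W)
        grad_b = 0.0
        for x, t in zip(X, T):
            o = W @ x + b                # linear output
            grad_W += -(t - o) * x       # gradient of 0.5*(t - o)**2 w.r.t. W
            grad_b += -(t - o)
        # ... then make one update in the negative-gradient direction.
        return W - eta * grad_W, b - eta * grad_b

    # Illustrative data (assumption): learn o = 2*x1 - x2.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([0., -1., 2., 1.])
    W, b = np.zeros(2), 0.0
    for _ in range(200):
        W, b = batch_epoch(W, b, X, T, eta=0.1)
    print(W, b)    # approaches [2, -1] and 0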

Batch Gradient Descent with Momentum:
• This algorithm often provides faster convergence.
• Momentum allows a network to respond not only to
the local gradient, but also to recent trends in the
error surface.
• Acting like a low-pass filter, momentum allows the
network to ignore small features in the error surface.
• Without momentum a network may get stuck in a
shallow local minimum, such as shown in the next
figure.
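A minimal sketch of the momentum modification (assuming the batch gradient grad_W has already been computed as above; eta and alpha are illustrative values):

    import numpy as np

    eta, alpha = 0.1, 0.9              # learning and momentum rates (assumed values)
    W = np.zeros(2)
    velocity = np.zeros_like(W)        # running memory of recent updates

    def momentum_step(W, velocity, grad_W):
        # New step = alpha * previous step - eta * current gradient, so the
        # update follows recent trends in the error surface, not just the
        # local gradient, and small features are smoothed out.
        velocity = alpha * velocity - eta * grad_W
        return W + velocity, velocity

    grad_W = np.array([0.5, -0.2])     # placeholder gradient for illustration
    W, velocity = momentum_step(W, velocity, grad_W)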

Incremental Mode Gradient Descent


• When we use the gradient with respect to one training example at a time, gradient descent becomes the Widrow–Hoff delta rule, which is given by:

  ΔW_i = η (t − o) x_i

• Also called the Least Mean Square (LMS) method.
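A sketch of the incremental (per-example) update for a single linear neuron; the training pairs and eta are illustrative assumptions:

    import numpy as np

    eta = 0.1
    W = np.zeros(2)                          # single linear neuron, no bias for brevity

    # Illustrative training pairs (x, t); the weights are updated after every example.
    data = [(np.array([1., 0.]),  1.0),
            (np.array([0., 1.]), -1.0)]

    for epoch in range(100):
        for x, t in data:
            o = W @ x                        # actual (unthresholded) output
            W += eta * (t - o) * x           # delta / LMS rule: W_i += eta*(t - o)*x_i
    print(W)                                 # approaches [1, -1]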

LMS Learning Rule


Mean Square Error:
• Like the perceptron learning rule, the least mean square error (LMS) algorithm, the delta rule, is an example of supervised training, in which the learning rule is provided with a set of examples of desired network behavior:

  {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}

• We want to minimize the average of the sum of the squared errors between the target & the actual network output:

  mse = (1/Q) Σ_{k=1}^{Q} e(k)² = (1/Q) Σ_{k=1}^{Q} (t(k) − a(k))²

LMS Algorithm (Adaline rule / Widrow–Hoff rule)
• The LMS algorithm was presented by Widrow
and Hoff, hence, it is called Widrow-Hoff learning
algorithm.
• As seen before, it is based on an approximate steepest descent procedure.
• Widrow and Hoff decided that they could estimate
the mean square error by using the squared error
at each iteration.

Comparing Perceptron & Delta Rules


• Perceptron rule
• Thresholded output.
• Converges after a finite number of iterations to a
hypothesis that perfectly classifies the training data,
provided the training examples are linearly separable.
• Requires linearly separable data.
• Delta rule
• Unthresholded output.
• Converges toward the error minimum, possibly requiring
unbounded time, but converges regardless of whether the
training data are linearly separable or not.
• Works even on linearly non-separable data.

Adaptive Linear Neuron Network Architecture (ADALINE)
• The ADALINE network is a single layer neural
network with multiple nodes, where each node
accepts multiple inputs to generate one output.
• ADALINE networks are similar to the perceptron, but
their transfer function is linear rather than hard-
limiting. This allows their outputs to take on any
value, whereas the perceptron output is limited to
either 0 or 1.
• Both the ADALINE and the perceptron can only solve
linearly separable problems.

Cont.
• An adaptive linear system responds to changes in
its environment as it is operating.
• These networks are often used in error
cancellation, signal processing, and control
systems. For example, they are used by many
long distance phone lines for echo
cancellation.
• The pioneering work in this field was done by Widrow and Hoff, who gave the name ADALINE to adaptive linear elements.

The ADALINE Neural Network



Cont.
• A multiple-layer ADALINE network is called a MADALINE.
• The Widrow-Hoff rule can only train single-layer
linear networks. This is not much of a disadvantage;
single-layer linear networks are just as capable as
multilayer linear networks.
• For every multilayer linear network, there is an
equivalent single-layer linear network.

BACKPROPAGATION ALGORITHM

BackPropagation Algorithm
• The objective of this algorithm is to minimize the error between the target and the actual output by finding the weight updates Δw.
• The error is calculated at every iteration and is
back propagated through the layers of the ANN to
adapt the weights.
• The weights are adapted such that the error is
minimized.
• Once the error has reached an acceptable minimum value, the training is stopped.

Cont.
• The configuration for training a neural network using
the BP algorithm is shown in the figure below.

The Generalized Delta Rule (G.D.R.)


• In the BP algorithm, as in other learning algorithms, the goal is to find the next value of the adaptation weights (Δw); the rule for computing it is known as the G.D.R.
• Consider the following ANN model:

Cont.
• We need to obtain the following rule to adapt the weights between the output (k) and hidden (j) layers:

• Where the weights are adapted as follows:

• And t is the iteration number and δ_k is the error signal between the output and hidden layers, which is given by:
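The equations referred to above appear as images in the original slides and are not reproduced here; a standard form of the generalized delta rule for the output layer, assuming logistic output units as in the model described later, is:

  Δw_kj(t+1) = η · δ_k · o_j + α · Δw_kj(t)
  w_kj(t+1)  = w_kj(t) + Δw_kj(t+1)
  δ_k        = o_k (1 − o_k)(t_k − o_k)

where o_j is the output of hidden neuron j, o_k the actual output of output neuron k, and t_k its target.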

Cont.
• Adaptation between the input (i) and hidden (j) layers:

• The new weight is thus:

• and the error signal through layer j is:

• Where,

• And,
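As above, the slide's equations are images; a standard form for the input-to-hidden weights, under the same assumptions, is:

  Δw_ji(t+1) = η · δ_j · o_i + α · Δw_ji(t)
  w_ji(t+1)  = w_ji(t) + Δw_ji(t+1)
  δ_j        = o_j (1 − o_j) · Σ_k δ_k w_kj

where the quantities left implicit by "Where" and "And" above are net_j = Σ_i w_ji o_i + θ_j and o_j = f(net_j) = 1 / (1 + e^(−net_j)).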

Backpropagation Algorithm
• The following ANN model is used to derive the
backpropagation algorithm:

BP (cont.)
• The backpropagation has two steps,
• Forward propagation, and
• Backward propagation.
• Our ANN model has the following assumptions:
• A two-layer NN model, i.e. one with a single layer (set) of hidden neurons.
• Neurons in layer i are fully connected to layer j and
neurons in layer j are fully connected to layer k.
• Input layer neurons have linear activation functions
and hidden and output layer neurons have logistic
activation functions (sigmoids).

Note: Sigmoid Function


• Sigmoids have a variable c that controls their firing
angle.
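Assuming the standard parameterised logistic form f(x) = 1 / (1 + e^(−c·x)), a minimal sketch:

    import numpy as np

    def sigmoid(x, c=1.0):
        # Logistic sigmoid; larger c gives a steeper transition (the "firing angle").
        return 1.0 / (1.0 + np.exp(-c * x))

    print(sigmoid(0.0))                                # 0.5 for any c
    print(sigmoid(1.0, c=1.0), sigmoid(1.0, c=5.0))    # larger c saturates faster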

Cont.
• The firing angle used here is c=1.
• Bias weights are used with bias signals of 1 for hidden
(j) and output layer (k) neurons.
• In many ANN models, bias weights (θ) with bias signals
of 1 are used to speed up the convergence process.
• The learning parameter is given by the symbol η and is usually fixed at a value between 0 and 1; however, in many applications nowadays an adaptive η is used.
• Usually η is set large in the initial stage of learning and
reduced to a small value at the final stage of learning.
• A momentum term α is also used in the G.D.R. to avoid
local minima.

Steps of BP Algorithm
• Step 1: Obtain a set of training patterns.
• Step 2: Set up neural network model: No. of Input
neurons, Hidden neurons, and Output Neurons.
• Step 3: Set learning rate η and momentum rate α
• Step 4: Initialize all connection weights Wji, Wkj and bias weights θj, θk to random values.
• Step 5: Set minimum error, Emin
• Step 6: Start training by applying input patterns one
at a time and propagate through the layers then
calculate total error.

Cont.
• Step 7: Backpropagate error through output and
hidden layer and adapt weights.
• Step 8: Backpropagate error through hidden and
input layer and adapt weights.
• Step 9: Check if Error < Emin.
• If not, repeat Steps 6–9; if yes, stop training. (A minimal sketch of these steps is given below.)
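A compact sketch of Steps 1–9 in Python/NumPy, assuming two hidden neurons with bias weights and logistic activations, trained on the XOR patterns used in the next example. The values of η, α, E_min and the initialization range are illustrative assumptions; the worked example that follows uses an even smaller network so it can be done by hand, and convergence depends on the random initialization.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Steps 1-2: training patterns (XOR) and network sizes.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    T = np.array([[0.], [1.], [1.], [0.]])
    n_in, n_hid, n_out = 2, 2, 1

    # Step 3: learning rate and momentum rate (illustrative values).
    eta, alpha = 0.5, 0.9

    # Step 4: random connection and bias weights; Step 5: minimum error.
    rng = np.random.default_rng(0)
    W_ji = rng.uniform(-0.5, 0.5, (n_hid, n_in));  th_j = rng.uniform(-0.5, 0.5, n_hid)
    W_kj = rng.uniform(-0.5, 0.5, (n_out, n_hid)); th_k = rng.uniform(-0.5, 0.5, n_out)
    dW_kj = np.zeros_like(W_kj); dth_k = np.zeros_like(th_k)
    dW_ji = np.zeros_like(W_ji); dth_j = np.zeros_like(th_j)
    E_min = 0.01

    for epoch in range(20000):
        E = 0.0
        for x, t in zip(X, T):
            # Step 6: forward pass through hidden (j) and output (k) layers.
            o_j = sigmoid(W_ji @ x + th_j)
            o_k = sigmoid(W_kj @ o_j + th_k)
            E += 0.5 * np.sum((t - o_k) ** 2)

            # Steps 7-8: error signals for the output and hidden layers.
            delta_k = o_k * (1 - o_k) * (t - o_k)
            delta_j = o_j * (1 - o_j) * (W_kj.T @ delta_k)

            # Adapt weights with the generalized delta rule (momentum included).
            dW_kj = eta * np.outer(delta_k, o_j) + alpha * dW_kj
            dth_k = eta * delta_k + alpha * dth_k
            W_kj += dW_kj; th_k += dth_k

            dW_ji = eta * np.outer(delta_j, x) + alpha * dW_ji
            dth_j = eta * delta_j + alpha * dth_j
            W_ji += dW_ji; th_j += dth_j

        # Step 9: stop once the total error over all patterns is small enough.
        if E < E_min:
            break

    print(epoch, E)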

Solving an XOR Problem


• In this example we use the BP algorithm to solve a 2-bit
XOR problem.
• The training patterns of this ANN are the XOR examples given in the next table.
• For simplicity, the ANN model has only 4 neurons (2
inputs, 1 hidden and 1 output) and has no bias weights.
• The input neurons have linear functions and the hidden
and output neurons have sigmoid functions.
• The weights are initialized randomly.
• We train the ANN by providing the patterns #1 to #4
through an iteration process until the error is minimized.

Cont.
• The training patterns of this ANN are the XOR examples given in the following table (the 2-bit XOR truth table):

  Pattern   Inputs (x1, x2)   Target t
  #1        0, 0              0
  #2        0, 1              1
  #3        1, 0              1
  #4        1, 1              0

Cont.
• The ANN model and its initial weights,

• Training begins when the pattern#1 and its target are


provided to the ANN.
• 1st pattern: (0, 0), target: 0

Compute the error by comparing this value to the target,



Cont.
• This error is now backpropagated through the layers following the error-signal equations given as follows:
• Between the output (k) and hidden (j) layer:

• Thus,
• Between the hidden (j) and input (i) layer:

• = -0.0035

Cont.
• Now we have calculated the error signal between layers (k) and (j).

• If we had chosen the learning rate and momentum term as follows:
• η = 0.1 and α = 0.9
• and the previous change in weight is 0 and the hidden-neuron output o_j = 0.5,
• then the resulting change in weight is

  Δw_kj = -0.0064

Cont.
• This is the increment of the weight after the first
iteration for the weight between layers k and j.
• Now this change in weight is added to the actual
weight as follows

• and thus the weight between layers k and j has


been adapted.

Cont.
• Similarly for the weights between layers j and i, the
adaptation follows

• Now this change in weight is added to the actual


weight as follows:

• and this is the adapted weight between layers j and i


after pattern#1 is seen by the ANN in the first iteration.
• The whole calculation is then repeated for the next
pattern (pattern#2 = [0, 1]) with tk=1.
• After all the 4 patterns have been completed the whole
process is repeated for pattern#1 again.
