
Unit II

Neural Network for Control


Error Calculation Methods
Question: Define Loss / Error function and explain its types.
Error calculation occurs at two levels:
1) Local error function: the difference between the target output and the actual output.
2) Global error function: all the local errors are aggregated together to form the global error.

The global error is a measure of how well a neural network performs over the entire training set.

Following are some important global error calculation methods:
a) Sum of squares error (SSE)
b) Mean square error (MSE)
c) Root mean square error (RMS)
Error Calculation Methods
a) Sum of squares error (SSE)
It is the sum of squared differences between the target output and the actual output, i.e.

$$E = \sum_{i=1}^{n} (y_{t_i} - y_i)^2$$
Error Calculation Methods
Q.1 Find the SSE for the following data:
𝒚 𝒚𝒕
43.6 41
44.4 45
45.2 49
46 47
46.8 44

Solution:

  y      yt     yt − y   (yt − y)²
 43.6    41      -2.6      6.76
 44.4    45       0.6      0.36
 45.2    49       3.8     14.44
 46      47       1        1
 46.8    44      -2.8      7.84

$$E = \sum_{i=1}^{n} (y_{t_i} - y_i)^2 = 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4$$
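A quick check of this SSE computation (a minimal NumPy sketch using the data from the example above):

```python
import numpy as np

y   = np.array([43.6, 44.4, 45.2, 46.0, 46.8])  # actual outputs
y_t = np.array([41.0, 45.0, 49.0, 47.0, 44.0])  # target outputs

sse = np.sum((y_t - y) ** 2)  # sum of squared errors
print(sse)  # ~30.4
```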
Error Calculation Methods
b) Mean square error (MSE)

$$E = \frac{1}{n}\sum_{i=1}^{n} (y_{t_i} - y_i)^2$$

Q.1 Find the MSE for the following data:


𝒚 𝒚𝒕
43.6 41
44.4 45
45.2 49
46 47
46.8 44
Error Calculation Methods
Solution:

  y      yt     yt − y   (yt − y)²
 43.6    41      -2.6      6.76
 44.4    45       0.6      0.36
 45.2    49       3.8     14.44
 46      47       1        1
 46.8    44      -2.8      7.84

$$E = \frac{1}{n}\sum_{i=1}^{n} (y_{t_i} - y_i)^2 = \frac{30.4}{5} = 6.08$$

Q.2 Find the MSE for the following data:

 Actual   Predicted
   10         9
   13        14
   14        12
    9        10
   17        15

Answer: MSE = 2.2
Error Calculation Methods
c) Root mean square error (RMS)

$$E = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_{t_i} - y_i)^2}$$

Q.1 Find the RMS error for the following data:


𝒚 𝒚𝒕
43.6 41
44.4 45
45.2 49
46 47
46.8 44
Error Calculation Methods
Solution:

  y      yt     yt − y   (yt − y)²
 43.6    41      -2.6      6.76
 44.4    45       0.6      0.36
 45.2    49       3.8     14.44
 46      47       1        1
 46.8    44      -2.8      7.84

$$E = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_{t_i} - y_i)^2} = \sqrt{6.08} \approx 2.47$$

Q.2 Find the RMS error for the following data:

 Actual   Predicted
   10         9
   13        14
   14        12
    9        10
   17        15

Answer: RMS = 1.48
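A quick NumPy check of the MSE and RMS values quoted in Q.2 above:

```python
import numpy as np

actual    = np.array([10, 13, 14, 9, 17])
predicted = np.array([9, 14, 12, 10, 15])

mse = np.mean((actual - predicted) ** 2)  # mean of squared errors
rms = np.sqrt(mse)                        # root mean square error
print(mse, rms)  # 2.2  ~1.48
```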
Weights Initialization
Question: Find the range of the initial weights of the first layer using three weight initialization techniques.

Hint: Use the Uniform Distribution, Xavier (Normal & Uniform) and He initialization techniques, and find the range of the initial weights, as sketched below.
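A sketch of how such ranges could be computed. The original slides for these techniques are not reproduced here, so the formulas below are the commonly quoted ones (an assumption, not taken from the slides): plain uniform initialization in ±1/√fan_in, Xavier uniform limit √(6/(fan_in+fan_out)), Xavier normal standard deviation √(2/(fan_in+fan_out)), and He normal standard deviation √(2/fan_in). The layer sizes in the example call are hypothetical.

```python
import numpy as np

def init_ranges(fan_in, fan_out):
    """Return the ranges / standard deviations used by common weight initializers."""
    return {
        # Plain uniform initialization: U(-1/sqrt(fan_in), +1/sqrt(fan_in))
        "uniform": (-1.0 / np.sqrt(fan_in), 1.0 / np.sqrt(fan_in)),
        # Xavier/Glorot uniform: U(-limit, +limit), limit = sqrt(6 / (fan_in + fan_out))
        "xavier_uniform": (-np.sqrt(6.0 / (fan_in + fan_out)), np.sqrt(6.0 / (fan_in + fan_out))),
        # Xavier/Glorot normal: N(0, std^2), std = sqrt(2 / (fan_in + fan_out))
        "xavier_normal_std": np.sqrt(2.0 / (fan_in + fan_out)),
        # He (Kaiming) normal: N(0, std^2), std = sqrt(2 / fan_in)
        "he_normal_std": np.sqrt(2.0 / fan_in),
    }

# Example: a first layer with 4 inputs feeding 3 neurons (hypothetical sizes)
print(init_ranges(fan_in=4, fan_out=3))
```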
Associative memory
Question: What is Associative Memory? Explain its types.
 It is also called Content Addressable Memory (CAM).
 This type of memory is accessed simultaneously and in parallel on the basis of data content rather than by a specific address or location.
 How does the memory write a word?
1. When a word is written into associative memory, no address is given.
2. The memory is capable of finding an empty location to store the word.
 How does the memory read a word?
1. When a word is to be read from associative memory, the content of the word, or a part of the word, is specified.
2. The memory locates all words that match the specified content and marks them for reading.
Associative memory
 It takes relatively less time to find an item based on content
rather than by an address.

Block Diagram of Associative Memory


Associative memory
Argument Register (A):
It contains the word to be searched

Key Register (K):


This specifies which part of the argument word needs to be compared with the words in memory.

If all bits in the key register are 1, the entire word is compared; otherwise only the bit positions containing 1 are compared.

Associative memory array:


It contains the words which are to be compared with the argument
word.
Associative memory
Match Register (M):
After the matching process, the bits corresponding to the matching words in the match register are set to 1.

Example: Associative Memory page table

A: 101 111100
K: 111 000000
Word 1: 100 111100 no match (0)
Word 2: 101 000001 match (1)
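A small Python sketch of this key-masked matching (the bit patterns are taken from the example above; the function name is just illustrative):

```python
def cam_match(argument, key, words):
    """Return the match register: 1 where a stored word matches the argument
    in every bit position selected (set to 1) by the key register."""
    match_register = []
    for word in words:
        match = all(a == w for a, k, w in zip(argument, key, word) if k == "1")
        match_register.append(1 if match else 0)
    return match_register

# Only the first three bits are compared, since K = 111 000000
print(cam_match("101111100", "111000000", ["100111100", "101000001"]))  # [0, 1]
```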
Associative memory
Following are the two types of associative memories we can
observe:
1. Auto Associative Memory
2. Hetero Associative memory

Auto Associative Memory:

Auto associative memory is capable of retrieving a piece of data from only partial information of that piece of data.

Neural networks that use auto associative memory are called auto associative memory networks.
Auto Associative memory
Architecture
As shown in the following figure, the architecture of an auto associative memory network has ‘n’ input training vectors and a similar number ‘n’ of output target vectors.
Auto Associative memory
Training Algorithm
For training, this network uses the Hebb or Delta learning rule.
Algorithm:
Step 1: Set all initial weights to zero.
Step 2: For each input vector, adjust the weights by
(1) the Hebb rule: $w_{ij}(new) = w_{ij}(old) + x_i y_j$
OR
(2) $W = \sum_i S_i^T S_i$ (with $x_i = S_i^T$, $y_i = S_i$)
Step 3: Calculate the net input $z = x_1 w_1 + x_2 w_2 + \dots + x_k w_k$
Step 4: Apply the activation function
$$y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}$$
Auto Associative memory
Example:
Train the auto associative network for the input vector [−1 1 1 1] and test the network with the same input vector.
Also test the auto associative network with one missing, one mistaken, two missing and two mistaken entries in the test vector.
Solution:

$$W = S^T S = \begin{bmatrix} -1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 & -1 & -1 \\ -1 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \end{bmatrix}$$

Testing the network with the same input vector:

$x = \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix}$
Auto Associative memory

$$z = xW = \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & -1 & -1 \\ -1 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \\ -1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} -4 & 4 & 4 & 4 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix}$  (correct response)

Testing the network with one missing entry:

$x = \begin{bmatrix} 0 & 1 & 1 & 1 \end{bmatrix}$

$$z = xW = \begin{bmatrix} 0 & 1 & 1 & 1 \end{bmatrix} W = \begin{bmatrix} -3 & 3 & 3 & 3 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix}$  (correct response)
Auto Associative memory
Testing the network with two missing entries:

$x = \begin{bmatrix} 0 & 0 & 1 & 1 \end{bmatrix}$

$$z = xW = \begin{bmatrix} 0 & 0 & 1 & 1 \end{bmatrix} W = \begin{bmatrix} -2 & 2 & 2 & 2 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} -1 & 1 & 1 & 1 \end{bmatrix}$  (correct response)

Testing the network with two mistaken entries (e.g. the first two components flipped):

$x = \begin{bmatrix} 1 & -1 & 1 & 1 \end{bmatrix}$

$$z = xW = \begin{bmatrix} 1 & -1 & 1 & 1 \end{bmatrix} W = \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} 0 & 0 & 0 & 0 \end{bmatrix}$  (not the correct response)


Auto Associative memory
Question: Train the auto associative network for the input vectors [−1 −1 1 1] and [1 −1 1 −1], and also test the network with two missing values.
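A minimal NumPy sketch of the worked example above (Hebb outer-product training followed by recall with the sign-style activation); the function names are illustrative, not from the slides:

```python
import numpy as np

def train_auto_associative(patterns):
    """Hebb / outer-product rule: W = sum_i S_i^T S_i (patterns given as row vectors)."""
    return sum(np.outer(s, s) for s in np.atleast_2d(patterns))

def recall(x, W):
    """Net input z = xW followed by the 1 / 0 / -1 activation."""
    return np.sign(x @ W).astype(int)

W = train_auto_associative([[-1, 1, 1, 1]])

print(recall(np.array([-1, 1, 1, 1]), W))   # [-1  1  1  1]  same vector
print(recall(np.array([ 0, 1, 1, 1]), W))   # [-1  1  1  1]  one missing entry
print(recall(np.array([ 0, 0, 1, 1]), W))   # [-1  1  1  1]  two missing entries
print(recall(np.array([ 1, -1, 1, 1]), W))  # [ 0  0  0  0]  two mistaken entries
```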
Hetero Associative memory
Hetero Associative memory is capable of retrieving a piece of data
of one category upon presentation of data from another category.

In this case
1) Training input vector X and target output vector t are different.
2) It uses Hebb’s / Delta Rule to find the weight matrix
Hetero Associative memory
Algorithm:
Step 1: Set all initial weights to zero.
Step 2: For each input vector, adjust the weights by
(1) the Hebb rule: $w_{ij}(new) = w_{ij}(old) + x_i y_j$
OR
(2) $W = \sum_i S_i^T t_i$
Step 3: Calculate the net input $z = x_1 w_1 + x_2 w_2 + \dots + x_k w_k$
Step 4: Apply the activation function
$$y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z = 0 \\ -1 & \text{if } z < 0 \end{cases}$$
Hetero Associative memory
Example:
Train a hetero associative network to store the given bipolar input vectors $S = (S_1, S_2, S_3, S_4)$ mapped to the output vectors $t = (t_1, t_2)$. Also test the performance of the network with missing and mistaken data.
𝑺𝟏 𝑺𝟐 𝑺𝟑 𝑺𝟒 𝒕𝟏 𝒕𝟐
1 -1 -1 -1 -1 1
1 1 -1 -1 -1 1
-1 -1 -1 1 1 -1
-1 -1 1 1 1 -1
Hetero Associative memory
Solution:
(1) For the pair $S = (1, -1, -1, -1)$ and $t = (-1, 1)$:

$$w_1 = S^T t = \begin{bmatrix} 1 \\ -1 \\ -1 \\ -1 \end{bmatrix} \begin{bmatrix} -1 & 1 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}$$

(2) For the pair $S = (1, 1, -1, -1)$ and $t = (-1, 1)$:

$$w_2 = S^T t = \begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix} \begin{bmatrix} -1 & 1 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}$$
Hetero Associative memory

(3) For the pair $S = (-1, -1, -1, 1)$ and $t = (1, -1)$:

$$w_3 = S^T t = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & -1 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ -1 & 1 \\ 1 & -1 \end{bmatrix}$$

(4) For the pair $S = (-1, -1, 1, 1)$ and $t = (1, -1)$:

$$w_4 = S^T t = \begin{bmatrix} -1 \\ -1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & -1 \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}$$
Hetero Associative memory
Final weight matrix:

$$W = w_1 + w_2 + w_3 + w_4 = \begin{bmatrix} -1 & 1 \\ 1 & -1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix} + \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix} + \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ -1 & 1 \\ 1 & -1 \end{bmatrix} + \begin{bmatrix} -1 & 1 \\ -1 & 1 \\ 1 & -1 \\ 1 & -1 \end{bmatrix} = \begin{bmatrix} -4 & 4 \\ -2 & 2 \\ 2 & -2 \\ 4 & -4 \end{bmatrix}$$

Case I: With missing data entries,
$x = \begin{bmatrix} 0 & 1 & 0 & -1 \end{bmatrix}$
Hetero Associative memory
Net input:

$$z = xW = \begin{bmatrix} 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} -4 & 4 \\ -2 & 2 \\ 2 & -2 \\ 4 & -4 \end{bmatrix} = \begin{bmatrix} -6 & 6 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} -1 & 1 \end{bmatrix}$  (correct response)
Hetero Associative memory
Case II: With mistaken data,
$x = \begin{bmatrix} -1 & 1 & 1 & -1 \end{bmatrix}$

Net input:

$$z = xW = \begin{bmatrix} -1 & 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} -4 & 4 \\ -2 & 2 \\ 2 & -2 \\ 4 & -4 \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix}$$

$\Rightarrow y = \begin{bmatrix} 0 & 0 \end{bmatrix}$  (not the correct response)
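A similar NumPy sketch for the hetero associative example (data taken from the table above; the helper names are illustrative):

```python
import numpy as np

S = np.array([[ 1, -1, -1, -1],
              [ 1,  1, -1, -1],
              [-1, -1, -1,  1],
              [-1, -1,  1,  1]])
T = np.array([[-1,  1],
              [-1,  1],
              [ 1, -1],
              [ 1, -1]])

W = sum(np.outer(s, t) for s, t in zip(S, T))  # W = sum_i S_i^T t_i
print(W)  # [[-4  4] [-2  2] [ 2 -2] [ 4 -4]]

print(np.sign(np.array([ 0, 1, 0, -1]) @ W))   # [-1  1]  missing entries, correct
print(np.sign(np.array([-1, 1, 1, -1]) @ W))   # [ 0  0]  mistaken entries, not correct
```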


Concept of Optimization
Optimization helps us to minimize the error function (objective function), i.e. to drive the actual output towards the target output.
Concept of Optimization
There are two major categories of optimization algorithms:
1. First Order Optimization Algorithms (FOOA)
Concept of Optimization
2. Second Order Optimization Algorithms (SOOA)
Gradient Descent optimization Algorithm
Question: Explain Gradient Descent optimization algorithm.
Explain its types.
Question: Explain any four optimizers in neural networks and their advantages and disadvantages.

 Gradient descent is an optimization algorithm, based on a convex function, that iteratively minimizes a given function towards its local minimum.

 Gradient descent iteratively reduces a loss function by moving in the direction opposite to that of steepest ascent. It depends on the derivatives of the loss function to find the minima.
Gradient Descent optimization Algorithm
Algorithm:
Step 1: Find the slope of the objective function with respect to each parameter.
Step 2: Pick a random initial value for the parameters.
Step 3: Evaluate the gradient, i.e. $\frac{\partial E}{\partial W}$, at the initial value.
Step 4: Calculate the step size for each parameter as
step size = gradient × learning rate $= \alpha \cdot \frac{\partial E}{\partial W}$
Step 5: Calculate the new parameters as new params = old params − step size,
i.e. $w(new) = w(old) - \alpha \cdot \frac{\partial E}{\partial W}$
Step 6: Repeat steps 3 to 5 until the gradient is almost 0.
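A minimal sketch of these steps on a single-parameter objective; the quadratic E(w) = (w − 3)² and all names here are illustrative choices, not from the slides:

```python
def gradient_descent(grad, w_init, lr=0.1, tol=1e-6, max_iters=1000):
    """Iterate w(new) = w(old) - lr * dE/dw until the gradient is almost 0."""
    w = w_init
    for _ in range(max_iters):
        g = grad(w)        # Step 3: evaluate the gradient at the current value
        step = lr * g      # Step 4: step size = learning rate * gradient
        w = w - step       # Step 5: new parameter = old parameter - step size
        if abs(g) < tol:   # Step 6: stop when the gradient is almost 0
            break
    return w

# E(w) = (w - 3)^2, so dE/dw = 2(w - 3); the minimum is at w = 3
print(gradient_descent(lambda w: 2 * (w - 3), w_init=0.0))  # ~3.0
```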
First Order Optimization Algorithms
Types of Gradient Descent optimization algorithms:

1. Batch Gradient Algorithm (GD)

2. Stochastic Gradient Algorithm (SGD)

3. Mini batch Gradient Algorithm (MBGD)


Concept of Optimization
1. Batch Gradient Algorithm (GD)

$$w_i(new) = w_i(old) - \alpha \frac{\partial E}{\partial w_i}$$
Concept of Optimization
1. Batch Gradient Algorithm (GD)
Advantages
• Easy to understand
• Easy to implement
Disadvantages
• Because this method calculates the gradient for the entire
data set in one update, the calculation is very slow.
• It requires large memory and it is computationally
expensive.
Concept of Optimization
1. Batch Gradient Algorithm (BGD)

 The term “batch” denotes the total number of samples of the whole dataset.

 The whole dataset is used to reach the minima in a less noisy and less random manner.

 If the dataset is very big, it becomes computationally very expensive to perform.
Concept of Optimization
2. Stochastic Gradient Algorithm (SGD)

$$w_i(new) = w_i(old) - \alpha \frac{\partial E}{\partial w_i}$$
(batch size = 1: the gradient is computed from a single sample per update)
Concept of Optimization
2. Stochastic Gradient Algorithm (SGD)

 In SGD, the gradient of the cost function of a single sample is taken for each iteration, instead of the sum of the gradients of the cost function over the whole dataset.

 Since only one sample from the dataset is chosen at random for each iteration, the path taken by the algorithm to reach the minima is usually noisier than that of the typical (batch) gradient descent algorithm.
Concept of Optimization
2. Stochastic Gradient Algorithm (SGD)
Advantages
• Frequent updates of model parameters.
• Requires less Memory.
• Allows the use of large data sets as it has to update only one
example at a time.
Disadvantages
• The frequent updates can also result in noisy gradients, which may cause the error to increase instead of decrease.
• High Variance.
• Frequent updates are computationally expensive.
Concept of Optimization
3. Mini batch Gradient Algorithm (MBGD)

$$w_i(new) = w_i(old) - \alpha \frac{\partial E}{\partial w_i}$$
(the gradient is computed over a small mini-batch, e.g. a batch size of 2)
Concept of Optimization
3. Mini batch Gradient Algorithm (MBGD)
 In this variant of gradient descent instead of taking all the
training data, only a subset of the dataset is used for calculating
the loss function. Since we are using a batch of data instead of
taking the whole dataset, fewer iterations are needed.
 Mini-batch gradient descent algorithm is faster than both
stochastic gradient descent and batch gradient descent
algorithms.
 This algorithm is more efficient and robust than the earlier
variants of gradient descent.
 Moreover, the cost function in mini-batch gradient descent is
noisier than the batch gradient descent algorithm but smoother
than that of the stochastic gradient descent algorithm.
Concept of Optimization
3. Mini batch Gradient Algorithm (MBGD)
Advantages
1. It leads to more stable convergence.
2. More efficient gradient calculations.
3. Requires less amount of memory.

Disadvantages
1. Mini-batch gradient descent does not guarantee good
convergence
2. If the learning rate is too small, the convergence rate will be slow. If it is too large, the loss function will oscillate around, or even diverge from, the minimum value.
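A compact sketch contrasting the three variants: the batch size alone switches between BGD (whole dataset), SGD (one sample) and MBGD (a small subset). The tiny least-squares objective and all names are illustrative assumptions, not from the slides:

```python
import numpy as np

def gd_variant(X, y, batch_size, lr=0.05, epochs=500):
    """Minimize E = mean((Xw - y)^2) using mini-batches of the given size.
    batch_size = len(X) -> batch GD, 1 -> SGD, in between -> mini-batch GD."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                # shuffle the samples each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # dE/dw on this batch only
            w -= lr * grad                             # w(new) = w(old) - alpha * dE/dw
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])                # data generated from true weights [2, -1]
print(gd_variant(X, y, batch_size=len(X)))   # batch gradient descent
print(gd_variant(X, y, batch_size=1))        # stochastic gradient descent
print(gd_variant(X, y, batch_size=2))        # mini-batch gradient descent
```

All three runs approach the true weights [2, −1]; SGD and MBGD follow noisier paths but need less computation per update.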
Comparison: Cost Function & Time
Question: Explain the difference between Batch Gradient Descent,
Stochastic Gradient Descent & Mini Batch Gradient Descent.
Concept of Optimization
4. AdaGrad (Adaptive Gradient Descent)
 The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms. Here the learning rate is adapted for each parameter during training: the more a parameter has been changed (the larger its accumulated gradients), the smaller its learning rate becomes.

 This modification is highly beneficial because real-world datasets contain sparse as well as dense features, so it is unfair to use the same learning rate for all the features.

 The AdaGrad algorithm uses the following formula to update the weights, where η is the learning rate whose effective value changes with the accumulated squared gradients:

$$w_i(new) = w_i(old) - \frac{\eta}{\sqrt{G_i + \epsilon}} \cdot \frac{\partial E}{\partial w_i}, \qquad G_i = \sum_t \left(\frac{\partial E}{\partial w_i}\right)_t^2$$
Concept of Optimization
4. AdaGrad (Adaptive Gradient Descent)
Advantages
1. Learning Rate changes adaptively with iterations.
2. It is able to train sparse data as well.

Disadvantages
1. If the neural network is deep, the learning rate becomes a very small number, which causes the dead neuron problem.
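A minimal sketch of the per-parameter AdaGrad update described above, again on an illustrative quadratic objective (not taken from the slides):

```python
import numpy as np

def adagrad(grad, w_init, lr=0.5, eps=1e-8, iters=200):
    """Per-parameter update: w -= lr / sqrt(G + eps) * g, where G accumulates squared gradients."""
    w = np.array(w_init, dtype=float)
    G = np.zeros_like(w)                    # accumulated squared gradients per parameter
    for _ in range(iters):
        g = grad(w)
        G += g ** 2                         # larger past gradients -> smaller effective step
        w -= lr / np.sqrt(G + eps) * g
    return w

# E(w) = (w0 - 1)^2 + 10 * (w1 + 2)^2, minimum at (1, -2)
grad = lambda w: np.array([2 * (w[0] - 1), 20 * (w[1] + 2)])
print(adagrad(grad, [0.0, 0.0]))            # approaches [1, -2]
```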
Hopfield Network
Question: What is a Hopfield Network? Explain its types.

 John J. Hopfield proposed a network in 1982 which consists of a set of neurons, with the output of each neuron fed back to all the other neurons.

 Hopfield networks have been used in many applications of associative memory and in many optimization problems (e.g. the Travelling Salesman Problem).

 Hopfield networks are basically of 2 types:
1. Discrete Hopfield Network
2. Continuous Hopfield Network

 A Hopfield network is an auto associative, fully interconnected, single-layer feedback network.
Hopfield Network
 It is a symmetrically weighted network ($w_{ij} = w_{ji}$).

 It has only one layer of neurons, corresponding to the size of the input and output, which must be the same.

 It is useful when the given data is noisy or partial, since the network can store patterns and, due to the interconnections, retrieve those patterns from partial input.
Discrete Hopfield Network
 When a Hopfield network operates in a discrete-time fashion, it is called a “Discrete Hopfield Network”.

 This network accepts two kinds of input values:
(a) Binary: 0, 1  (b) Bipolar: 1, −1

 This network has symmetrical weights with no self connection, i.e. $w_{ij} = w_{ji}$ and $w_{ii} = 0$.

 The architecture of the discrete Hopfield network model consists of two outputs, one inverting and the other non-inverting.

 The output of each processing element is fed back to the inputs of the other processing elements, but not to itself.
Discrete Hopfield Network
Architecture: a fully connected single-layer network with symmetric weights and no self-connections, i.e. $w_{11} = w_{22} = \dots = w_{nn} = 0$ and $w_{ij} = w_{ji}$ (e.g. $w_{12} = w_{21}$, $w_{1n} = w_{n1}$).
Training Algorithm- Discrete Hopfield Network

Step 1: Initialize the weights ($w_{ij}$) to store the patterns (using the training algorithm).
For binary patterns:
$W = [w_{ij}] = \sum_i (2S_i - 1)^T (2S_i - 1)$, with $w_{ij} = 0$ for $i = j$
For bipolar patterns:
$W = [w_{ij}] = \sum_i S_i^T S_i$, with $w_{ij} = 0$ for $i = j$
Testing Algorithm:
Step 2: Consider the binary representation of the given testing vector, i.e. $x = (x_1, x_2, x_3, \dots, x_n)$.
Step 3: Initially take $y = (y_1, y_2, y_3, \dots, y_n) = x$.
Step 4: To update $y_i$, consider the net input
$$z_i = x_i + \sum_j y_j w_{ji} \quad \text{(i.e. } x_i + y_{old}\cdot(i\text{-th column of } W)\text{)}$$
Training Algorithm- Discrete Hopfield Network
Now apply the activation function
$$y_i(new) = \begin{cases} 1 & \text{if } z_i > 0 \\ y_i & \text{if } z_i = 0 \\ 0 & \text{if } z_i < 0 \end{cases}$$
Therefore, the updated vector is $y = (y_1, y_2, y_3, \dots, y_i(new), \dots, y_n)$.

Repeat the iteration over the units $i$ until $y = x$ (the output converges to the stored pattern).


Training Algorithm- Discrete Hopfield Network
Example: Construct an auto associative Discrete Hopfield Network with the input vector [1 1 1 −1]. Test the Discrete Hopfield Network with missing entries in the first and second components of the stored vector.
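Since the worked solution is not reproduced here, the following NumPy sketch follows the training and testing steps above for the stored bipolar vector [1 1 1 −1], with the first and second components of the binary test vector set to 0 (missing). The helper names are illustrative:

```python
import numpy as np

def hopfield_train(bipolar_patterns):
    """W = sum_i S_i^T S_i with the diagonal zeroed (no self-connections)."""
    W = sum(np.outer(s, s) for s in np.atleast_2d(bipolar_patterns))
    np.fill_diagonal(W, 0)
    return W

def hopfield_recall(x, W, sweeps=5):
    """Asynchronous update: z_i = x_i + y . (i-th column of W);
    y_i -> 1 if z_i > 0, unchanged if z_i = 0, else 0."""
    y = x.copy()
    for _ in range(sweeps):
        for i in range(len(y)):
            z = x[i] + y @ W[:, i]
            y[i] = 1 if z > 0 else (y[i] if z == 0 else 0)
    return y

W = hopfield_train([[1, 1, 1, -1]])  # stored bipolar vector
x = np.array([0, 0, 1, 0])           # binary form [1 1 1 0] with the first two entries missing
print(hopfield_recall(x, W))         # [1 1 1 0] -> the stored pattern is recovered
```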
Continuous Hopfield Network
 The discrete Hopfield network can be modified into a continuous model.

 The nodes of this network have a continuous output (between 0 and 1) rather than a two-state output.

 The continuous Hopfield network can be realized as an electronic circuit, which uses non-linear amplifiers and resistors.
