
UNIT-3

Feedforward Neural Networks


A feed-forward neural network is an artificial neural network in which the connections
between nodes do not form a cycle. Its counterpart is the recurrent neural network, in which
certain pathways form cycles. The feed-forward model is the simplest form of neural network
because information is processed in only one direction: while the data may pass through
multiple hidden nodes, it always moves forward and never backwards. It can be used in
pattern recognition. This type of organization is also described as bottom-up or top-down.

Learning Parameters of Feedforward Neural Networks:


Mathematically, a feed-forward neural network defines a mapping y = f(x; θ) and learns the
values of the parameters θ that give the best function approximation.

Note: In a feed-forward neural network, there is also a bias unit in every layer except the
output layer.

Let us now use this knowledge to find the number of parameters.

Scenario 1: A feed-forward neural network with just one hidden layer. Number of units in the
input, hidden and output layers are respectively 3, 4 and 2.
Assumptions:

i = number of neurons in input layer

h = number of neurons in hidden layer

o = number of neurons in output layer

From the diagram, we have i = 3, h = 4 and o = 2. Note that the red colored neuron is the bias
for that layer. Each bias of a layer is connected to all the neurons in the next layer, except the
bias of the next layer.

Mathematically:
Number of connections between the first and second layer: 3 × 4 = 12, which is nothing but the
product of i and h.

Number of connections between the second and third layer: 4 × 2 = 8, which is nothing but the
product of h and o.

There are connections between layers via bias as well. Number of connections between the
bias of the first layer and the neurons of the second layer (except bias of the second layer): 1 ×
4, which is nothing but h.

Number of connections between the bias of the second layer and the neurons of the third
layer: 1 × 2, which is nothing but o.

Summing up all:

3 × 4 + 4 × 2 + 1 × 4 + 1 × 2

= 12 + 8 + 4 + 2

= 26

Thus, this feed-forward neural network has 26 connections in all and therefore 26 trainable
parameters.

Let us try to generalize using this equation and find a formula.


3 × 4 + 4 × 2 + 1 × 4 + 1 × 2

= 3 × 4 + 4 × 2 + 4 + 2

= i × h + h × o + h + o

Thus, the total number of parameters in a feed-forward neural network with one hidden layer
is given by:

(i × h + h × o) + h + o

Since this is a small network, it was also possible to count the connections in the diagram to
find the total. But what if there are more layers? Let us work through one more scenario and
see whether this formula still works or needs an extension.

Scenario 2: A feed-forward neural network with three hidden layers. Number of units in the
input, first hidden, second hidden, third hidden and output layers are respectively 3, 5, 6, 4 and
2.

Assumptions:

i = number of neurons in input layer

h1 = number of neurons in first hidden layer

h2 = number of neurons in second hidden layer

h3 = number of neurons in third hidden layer

o = number of neurons in output layer

Number of connections between the first and second layer: 3 × 5 = 15, which is nothing but the
product of i and h1.

Number of connections between the second and third layer: 5 × 6 = 30, which is nothing but the
product of h1 and h2.

Number of connections between the third and fourth layer: 6 × 4 = 24, which is nothing but the
product of h2 and h3.

Number of connections between the fourth and fifth layer: 4 × 2= 8, which is nothing but the
product of h3 and o.

Number of connections between the bias of the first layer and the neurons of the second layer
(except bias of the second layer): 1 × 5 = 5, which is nothing but h1.
Number of connections between the bias of the second layer and the neurons of the third
layer: 1 × 6 = 6, which is nothing but h2.

Number of connections between the bias of the third layer and the neurons of the fourth layer:
1 × 4 = 4, which is nothing but h3.

Number of connections between the bias of the fourth layer and the neurons of the fifth layer:
1 × 2 = 2, which is nothing but o.

Summing up all:

3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 1 × 5 + 1 × 6 + 1 × 4 + 1 × 2

= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2

= 94

Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.

Let us try to generalize using this equation and find a formula.

3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 1 × 5 + 1 × 6 + 1 × 4 + 1 × 2

= 3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 5 + 6 + 4 + 2

= i × h1 + h1 × h2 + h2 × h3 + h3 × o + h1 + h2 + h3 + o

Thus, the total number of parameters in a feed-forward neural network with three hidden
layers is given by:

(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3 + o

Thus, the formula to find the total number of trainable parameters in a feed-forward neural
network with n hidden layers is given by:

(i × h1 + h1 × h2 + … + h(n−1) × hn + hn × o) + (h1 + h2 + … + hn + o)
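
As a quick check, here is a minimal sketch in plain Python (the function name is illustrative) that applies this formula to any list of layer sizes and reproduces the totals from both scenarios:

def count_parameters(layer_sizes):
    # layer_sizes = [i, h1, ..., hn, o]
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # every layer except the input receives a bias
    return weights + biases

print(count_parameters([3, 4, 2]))        # Scenario 1 -> 26
print(count_parameters([3, 5, 6, 4, 2]))  # Scenario 2 -> 94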

Backpropagation and Gradient Descent (GD):


Backpropagation is one of the important concepts of a neural network. Our task is to classify
the data as well as possible, and for this we have to update the weights and biases. In the
linear regression model, we use gradient descent to optimize the parameters. Similarly, here
we also apply gradient descent, using backpropagation. Backpropagation algorithms are a set
of methods used to efficiently train artificial neural networks following a gradient descent
approach that exploits the chain rule. The main features of backpropagation are that it is an
iterative, recursive and efficient method for calculating the weight updates that improve the
network until it is able to perform the task for which it is being trained.

Now, how is the error function used in backpropagation, and how does backpropagation work? Consider the following example network with two inputs, two hidden neurons (H1, H2) and two output neurons (y1, y2):

Input values
x1 = 0.05
x2 = 0.10

Initial weights
w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55

Bias values
b1 = 0.35    b2 = 0.60

Target values
T1 = 0.01
T2 = 0.99
Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1, we first multiply the input values by the weights and add the bias:

H1 = x1 × w1 + x2 × w2 + b1
H1 = 0.05 × 0.15 + 0.10 × 0.20 + 0.35
H1 = 0.3775

To calculate the final result of H1, we apply the sigmoid function:

out H1 = 1 / (1 + e^(−0.3775)) = 0.593269992

We calculate the value of H2 in the same way as H1:

H2 = x1 × w3 + x2 × w4 + b1
H2 = 0.05 × 0.25 + 0.10 × 0.30 + 0.35
H2 = 0.3925

To calculate the final result of H2, we apply the sigmoid function:

out H2 = 1 / (1 + e^(−0.3925)) = 0.596884378

Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2. To
find the value of y1, we multiply the outputs of H1 and H2 by the weights and add the bias:

y1 = out H1 × w5 + out H2 × w6 + b2
y1 = 0.593269992 × 0.40 + 0.596884378 × 0.45 + 0.60
y1 = 1.10590597

To calculate the final result of y1, we apply the sigmoid function:

out y1 = 1 / (1 + e^(−1.10590597)) = 0.75136507

We calculate the value of y2 in the same way as y1:

y2 = out H1 × w7 + out H2 × w8 + b2
y2 = 0.593269992 × 0.50 + 0.596884378 × 0.55 + 0.60
y2 = 1.2249214

To calculate the final result of y2, we apply the sigmoid function:

out y2 = 1 / (1 + e^(−1.2249214)) = 0.772928465

Our target values are 0.01 and 0.99, so out y1 and out y2 do not match the targets T1 and T2.

Now, we find the total error, which is simply the sum of the squared differences between the
outputs and the target outputs:

E total = Σ ½ (target − output)²
E o1 = ½ (T1 − out y1)² = ½ (0.01 − 0.75136507)² = 0.274811083
E o2 = ½ (T2 − out y2)² = ½ (0.99 − 0.772928465)² = 0.023560026

So, the total error is E total = E o1 + E o2 = 0.298371109.
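
The whole forward pass can be verified with a few lines of Python (a minimal sketch using only the standard library; variable names mirror the notation above):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
T1, T2 = 0.01, 0.99

out_H1 = sigmoid(x1 * w1 + x2 * w2 + b1)          # 0.593269992
out_H2 = sigmoid(x1 * w3 + x2 * w4 + b1)          # 0.596884378
out_y1 = sigmoid(out_H1 * w5 + out_H2 * w6 + b2)  # 0.751365070
out_y2 = sigmoid(out_H1 * w7 + out_H2 * w8 + b2)  # 0.772928465

E_total = 0.5 * (T1 - out_y1) ** 2 + 0.5 * (T2 - out_y2) ** 2
print(E_total)  # 0.298371109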

Now, we will backpropagate this error to update the weights using a backward pass.

To update a weight, we calculate the error corresponding to that weight from the total error:
the error on weight w is obtained by differentiating the total error with respect to w.

We perform the backward pass from the output layer, so first consider the weight w5. E total
does not contain w5 directly, so we cannot differentiate it with respect to w5 in one step;
instead, we split the derivative into terms we can compute using the chain rule:

∂E total / ∂w5 = (∂E total / ∂out y1) × (∂out y1 / ∂y1) × (∂y1 / ∂w5)

Now, we calculate each term one by one:

∂E total / ∂out y1 = out y1 − T1 = 0.75136507 − 0.01 = 0.74136507
∂out y1 / ∂y1 = out y1 × (1 − out y1) = 0.186815602 (the derivative of the sigmoid)
∂y1 / ∂w5 = out H1 = 0.593269992

Putting these values together gives the final result:

∂E total / ∂w5 = 0.74136507 × 0.186815602 × 0.593269992 = 0.082167041

Now, we calculate the updated weight w5new using the gradient descent update with learning
rate η = 0.5:

w5new = w5 − η × ∂E total / ∂w5 = 0.40 − 0.5 × 0.082167041 = 0.35891648

In the same way, we calculate w6new, w7new, and w8new, which gives the following values:

w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
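
Continuing the forward-pass sketch above, the backward pass for the output-layer weights can be written as follows. The learning rate of 0.5 is an assumption: it is not stated explicitly in the text, but it reproduces the updated weights listed above.

# Chain rule: dE/dw = (out - target) * out * (1 - out) * input_to_that_weight
lr = 0.5  # assumed learning rate; reproduces the listed values
d_y1 = (out_y1 - T1) * out_y1 * (1 - out_y1)  # dEtotal/dnet_y1
d_y2 = (out_y2 - T2) * out_y2 * (1 - out_y2)  # dEtotal/dnet_y2

w5_new = w5 - lr * d_y1 * out_H1  # 0.358916480
w6_new = w6 - lr * d_y1 * out_H2  # 0.408666186
w7_new = w7 - lr * d_y2 * out_H1  # 0.511301270
w8_new = w8 - lr * d_y2 * out_H2  # 0.561370121
print(w5_new, w6_new, w7_new, w8_new)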

Root Mean Squared Propagation (RMSProp)


Root Mean Squared Propagation, or RMSProp for short, is an extension to the gradient descent
optimization algorithm. It is an unpublished extension, first described in Geoffrey Hinton's
lecture notes. RMSProp is designed to accelerate the optimization process, e.g. to decrease the
number of function evaluations required to reach the optimum, or to improve the capability of
the optimization algorithm, e.g. to reach a better final result. It is related to another extension
to gradient descent called Adaptive Gradient, or AdaGrad. RMSProp extends AdaGrad to avoid
the effect of a monotonically decreasing learning rate: instead of accumulating all past squared
gradients, it maintains a decaying average of squared gradients. The calculation of the mean
squared partial derivative for one parameter is as follows:

s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))

where s(t+1) is the decaying moving average of the squared partial derivative for one
parameter at the current iteration of the algorithm, s(t) is the same average at the previous
iteration, f'(x(t))^2 is the squared partial derivative for the current parameter, and rho is a
hyperparameter, typically set to 0.9 as with momentum.

We use a decaying average of the partial derivatives, and taking the square root of this average
gives the technique its name: the square root of the mean of the squared partial derivatives, or
root mean square (RMS). The custom step size for a parameter may then be written as:

cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1)))

Once we have the custom step size for the parameter, we can update the parameter using the
custom step size and the partial derivative f'(x(t)).

x(t+1) = x(t) - cust_step_size(t+1) * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is
created and can be evaluated.
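
Putting the three update rules together, here is a minimal NumPy sketch of RMSProp applied to a toy objective f(x) = sum(x^2), whose gradient is 2x (the objective and names are illustrative):

import numpy as np

def rmsprop(grad_fn, x0, step_size=0.01, rho=0.9, eps=1e-8, n_iter=200):
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)  # decaying average of squared partial derivatives
    for _ in range(n_iter):
        g = grad_fn(x)                      # f'(x(t))
        s = rho * s + (1.0 - rho) * g ** 2  # s(t+1)
        x = x - (step_size / (eps + np.sqrt(s))) * g
    return x

print(rmsprop(lambda x: 2.0 * x, x0=[1.0, -0.5]))  # moves toward [0, 0]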

RMSProp is a very effective extension of gradient descent and is one of the preferred
approaches generally used to fit deep learning neural networks.

Adam:
The Adam optimization algorithm is an extension to stochastic gradient descent. It is used to
update network weights iteratively based on training data. Adam was presented by Diederik
Kingma from OpenAI and Jimmy Ba. The name Adam is derived from adaptive moment
estimation.

There are several benefits of using Adam on non-convex optimization problems:

Straightforward to implement.
Computationally efficient.
Low memory requirements.
Invariant to diagonal rescaling of the gradients.
Well suited for problems that are large in terms of data and/or parameters.
Appropriate for non-stationary objectives.
Appropriate for problems with very noisy and/or sparse gradients.
Hyper-parameters have intuitive interpretations and typically require little tuning.

Adam can be seen as combining the advantages of two other extensions of stochastic gradient
descent. Specifically:

Adaptive Gradient Algorithm (AdaGrad), which maintains a per-parameter learning rate that
improves performance on problems with sparse gradients (e.g. natural language and computer
vision problems).

Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates
that are adapted based on the average of recent magnitudes of the gradients for the weight
(i.e. how quickly it is changing). This means the algorithm does well on online and non-
stationary problems (e.g. noisy problems).
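
A minimal NumPy sketch of the Adam update on the same style of toy objective, using the commonly cited defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8):

import numpy as np

def adam(grad_fn, x0, step_size=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=1000):
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first moment: decaying mean of gradients
    v = np.zeros_like(x)  # second moment: decaying mean of squared gradients
    for t in range(1, n_iter + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        x = x - step_size * m_hat / (np.sqrt(v_hat) + eps)
    return x

print(adam(lambda x: 2.0 * x, x0=[1.0, -0.5]))  # moves toward [0, 0]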

Weight initialization in neural networks:


Why Weight Initialization?

Its main objective is to prevent layer activation outputs from exploding or vanishing during
forward propagation. If either problem occurs, the loss gradients will be either too large or too
small, and the network will take longer to converge, if it is able to converge at all.

If we initialize the weights correctly, then our objective, i.e. optimization of the loss function, is
achieved in the least time; otherwise, converging to a minimum using gradient descent can be
impossible.

Different Weight Initialization Techniques

One of the important things to keep in mind while building a neural network is to correctly
initialize the weight matrices for the connections between layers.

Let us look at two initialization scenarios that can cause issues while training the model:

Zero Initialization (initializing all weights to 0)

If we initialize all the weights with 0, then the derivative with respect to the loss function is the
same for every weight in W[l], so all the weights take the same value in subsequent iterations.
This makes the hidden layers symmetric, and the process continues for all n iterations. Thus,
initializing the weights with zero makes the network no better than a linear model. Note that
setting the biases to 0 does not create any problem, because non-zero weights take care of
breaking the symmetry; even if a bias is 0, the values in every neuron will still be different.

Random Initialization (initializing weights randomly)


– This technique addresses the problem of zero initialization, since it prevents neurons from
learning the same features of their inputs; our goal is to make each neuron learn a different
function of its input, and this technique gives much better accuracy than zero initialization.

– In general, it is used to break the symmetry. It is better to assign random non-zero values to
the weights.

– Remember, neural networks are very sensitive and prone to overfitting, as they quickly
memorize the training data.
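
A small NumPy sketch contrasting the two scenarios (the layer sizes and the tanh activation are illustrative). With zero weights, every hidden neuron computes the same output, so the symmetry is never broken:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                    # 4 samples, 3 input features

W_zero = np.zeros((3, 5))                      # zero initialization
W_rand = rng.normal(scale=0.01, size=(3, 5))   # small random initialization

h_zero = np.tanh(x @ W_zero)  # every hidden unit outputs the same value
h_rand = np.tanh(x @ W_rand)  # hidden units differ
print(np.allclose(h_zero, h_zero[:, :1]))  # True  -> symmetric, redundant units
print(np.allclose(h_rand, h_rand[:, :1]))  # False -> symmetry is broken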

“What happens if the randomly initialized weights are very high or very low?”
(a) Vanishing gradients:

For any activation function, abs(dW) will get smaller and smaller as we go backward through
the layers during backpropagation, especially in the case of deep neural networks. So, in this
case, the earlier layers' weights are adjusted slowly.

Because of this, the weight updates are minor, which results in slower convergence and makes
the optimization of the loss function slow. In the worst case, this may completely stop the
neural network from training further.

More specifically, in the case of the sigmoid and tanh activation functions, if the weights are
very large, then the gradient will be vanishingly small, effectively preventing the weights from
changing their value. This is because abs(dW) will increase very slightly, or possibly get smaller
and smaller, with every iteration.

Here the ReLU activation function comes to use: vanishing gradients are generally not a
problem with ReLU, since its gradient is 0 for negative (and zero) inputs and 1 for positive
inputs.

(b) Exploding gradients:

This is the exact opposite of the vanishing gradients case discussed above.

Consider weights that are non-negative and large. When these weights are multiplied across
the different layers, they cause a very large change in the value of the overall gradient (cost).
This means that the updates to W, given by W = W − α × dW, come in huge steps, and the
descent overshoots.
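
A rough numeric sketch of both effects: the sigmoid derivative is at most 0.25, so repeatedly multiplying by (weight × slope) across 20 layers either shrinks the backpropagated signal toward zero or blows it up, depending on the weight scale (the factors 0.5 and 10.0 are illustrative):

grad_small, grad_large = 1.0, 1.0
for _ in range(20):                 # 20 layers
    grad_small *= 0.5 * 0.25        # small weight * max sigmoid slope
    grad_large *= 10.0 * 0.25       # large weight * max sigmoid slope

print(grad_small)  # ~8.7e-19: vanishing gradient
print(grad_large)  # ~9.1e+07: exploding gradient
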
Eigenvalues and Eigenvectors:
Role of eigenvalues and eigenvectors in deep learning

Picking the features that best represent the data and eliminating the less useful ones is an
example of dimensionality reduction. We can use eigenvalues and eigenvectors to identify the
dimensions which are most useful and prioritize our computational resources toward them.

What is an Eigenvalue?

Mathematically, an eigenvalue is the number by which the eigenvector is multiplied to produce
the same result as multiplying the matrix by the vector, as shown in Equation (1).

Ax = λx ……………(1)

where A is the square matrix, λ is the eigenvalue and x is the eigenvector.

Rearranging Equation (1) gives Equation (2); a non-zero solution x exists only when the
determinant in Equation (3) is zero, which is how the eigenvalues are found. Details of how to
calculate the determinant of a matrix can be found in a linear algebra textbook.

(A − λI)x = 0 ……………(2)

det(A − λI) = 0 ……………(3)
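
A short NumPy check of Equation (1) on an illustrative 2 × 2 matrix: for every eigenpair returned by np.linalg.eig, Ax equals λx:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvectors are the columns

for lam, x in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ x, lam * x))  # True: Ax = λx holds for each pair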

