Note: There is also a bias unit in a feed-forward neural network in every layer except the output layer.
Scenario 1: A feed-forward neural network with just one hidden layer. The numbers of units in the input, hidden and output layers are respectively 3, 4 and 2.
Assumptions: let i, h and o denote the numbers of units in the input, hidden and output layers respectively.
Mathematically:
Number of connections between the first and second layer: 3 × 4 = 12, which is nothing but the
product of i and h.
Number of connections between the second and third layer: 4 × 2 = 8, which is nothing but the
product of h and o.
There are connections between layers via the bias units as well. Number of connections between the bias of the first layer and the neurons of the second layer (excluding the bias of the second layer): 1 × 4 = 4, which is nothing but h.
Number of connections between the bias of the second layer and the neurons of the third layer: 1 × 2 = 2, which is nothing but o.
Summing up all:
3 × 4 + 4 × 2 + 1 × 4 + 1 × 2
= 12 + 8 + 4 + 2
= 26
Thus, this feed-forward neural network has 26 connections in all and will therefore have 26 trainable parameters.
In symbols:
3 × 4 + 4 × 2 + 4 + 2
= i × h + h × o + h + o
Thus, the total number of parameters in a feed-forward neural network with one hidden layer is given by:
(i × h + h × o) + h + o
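The one-hidden-layer formula can be checked with a short sketch (plain Python; the function name is ours):

```python
def ffnn_params(i, h, o):
    # weights between layers, plus one bias weight per hidden and output unit
    return i * h + h * o + h + o

print(ffnn_params(3, 4, 2))  # 26
```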
Since this network is small, it was also possible to count the connections in the diagram to find the total. But what if there are more layers? Let us work through one more scenario and see whether this formula still works or needs an extension.
Scenario 2: A feed-forward neural network with three hidden layers. The numbers of units in the input, first hidden, second hidden, third hidden and output layers are respectively 3, 5, 6, 4 and 2.
Assumptions: let i, h1, h2, h3 and o denote the numbers of units in the input, first hidden, second hidden, third hidden and output layers respectively.
Number of connections between the first and second layer: 3 × 5 = 15, which is nothing but the
product of i and h1.
Number of connections between the second and third layer: 5 × 6 = 30, which is nothing but the
product of h1 and h2.
Number of connections between the third and fourth layer: 6 × 4 = 24, which is nothing but the
product of h2 and h3.
Number of connections between the fourth and fifth layer: 4 × 2 = 8, which is nothing but the product of h3 and o.
Number of connections between the bias of the first layer and the neurons of the second layer
(except bias of the second layer): 1 × 5 = 5, which is nothing but h1.
Number of connections between the bias of the second layer and the neurons of the third
layer: 1 × 6 = 6, which is nothing but h2.
Number of connections between the bias of the third layer and the neurons of the fourth layer:
1 × 4 = 4, which is nothing but h3.
Number of connections between the bias of the fourth layer and the neurons of the fifth layer:
1 × 2 = 2, which is nothing but o.
Summing up all:
3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.
3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 1 × 5 + 1 × 6 + 1 × 4 + 1 × 2
= 3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 5 + 6 + 4 + 2
= i × h1 + h1 × h2 + h2 × h3 + h3 × o + h1 + h2 + h3 + o
Thus, the total number of parameters in a feed-forward neural network with three hidden layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3 + o
Thus, the formula to find the total number of trainable parameters in a feed-forward neural network with n hidden layers is given by:
(i × h1 + h1 × h2 + … + h(n−1) × hn + hn × o) + h1 + h2 + … + hn + o
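The general formula can be sketched as a small Python function (names are ours) that takes the layer sizes from input to output:

```python
def ffnn_params(layers):
    # layers: [i, h1, ..., hn, o]
    weights = sum(a * b for a, b in zip(layers, layers[1:]))  # connections between adjacent layers
    biases = sum(layers[1:])  # one bias weight per non-input unit
    return weights + biases

print(ffnn_params([3, 4, 2]))        # 26, Scenario 1
print(ffnn_params([3, 5, 6, 4, 2]))  # 94, Scenario 2
```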
Now, how is the error function used in backpropagation, and how does backpropagation work?
Input values
x1 = 0.05
x2 = 0.10
Initial weights
w1 = 0.15, w5 = 0.40
w2 = 0.20, w6 = 0.45
w3 = 0.25, w7 = 0.50
w4 = 0.30, w8 = 0.55
Bias values
b1 = 0.35, b2 = 0.60
Target values
T1 = 0.01
T2 = 0.99
Now, we first calculate the values of H1 and H2 by a forward pass.
Forward Pass
To find the net input of H1, we multiply the input values by the weights and add the bias:
net_H1 = x1 × w1 + x2 × w2 + b1
net_H1 = 0.05 × 0.15 + 0.10 × 0.20 + 0.35
net_H1 = 0.3775
Passing this through the sigmoid activation function gives the output of H1:
out_H1 = 1 / (1 + e^(−0.3775)) = 0.593269992
In the same way, net_H2 = x1 × w3 + x2 × w4 + b1 = 0.3925, and out_H2 = 0.596884378.
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the outputs of H1 and H2 by the weights and add the bias:
net_y1 = out_H1 × w5 + out_H2 × w6 + b2
net_y1 = 0.593269992 × 0.40 + 0.596884378 × 0.45 + 0.60
net_y1 = 1.10590597
Applying the sigmoid activation gives out_y1 = 0.75136507. Similarly, net_y2 = out_H1 × w7 + out_H2 × w8 + b2 = 1.22492140, and out_y2 = 0.772928465.
Our target values are 0.01 and 0.99, but our y1 and y2 values do not match the target values T1 and T2.
Now, we will find the total error, which is simply the squared difference between the outputs and the target outputs. The total error is calculated as:
E_total = ½ (T1 − out_y1)² + ½ (T2 − out_y2)²
So, the total error is:
E_total = ½ (0.01 − 0.75136507)² + ½ (0.99 − 0.772928465)² = 0.274811083 + 0.023560026 = 0.298371109
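The forward pass and total error can be reproduced with a short Python sketch (variable names are ours; the sigmoid activation is assumed, as it matches the intermediate values in this example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, b1, b2 = 0.05, 0.10, 0.35, 0.60
out_h1 = sigmoid(x1 * 0.15 + x2 * 0.20 + b1)          # 0.593269992
out_h2 = sigmoid(x1 * 0.25 + x2 * 0.30 + b1)          # 0.596884378
out_y1 = sigmoid(out_h1 * 0.40 + out_h2 * 0.45 + b2)  # 0.75136507
out_y2 = sigmoid(out_h1 * 0.50 + out_h2 * 0.55 + b2)  # 0.772928465
E_total = 0.5 * (0.01 - out_y1) ** 2 + 0.5 * (0.99 - out_y2) ** 2
print(E_total)  # approximately 0.298371109
```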
Now, we will backpropagate this error to update the weights using a backward pass.
To update a weight, we calculate the error corresponding to that weight with the help of the total error. The error on weight w is found by differentiating the total error with respect to w.
E_total does not contain w5 directly, so we cannot partially differentiate it with respect to w5 as it stands. We therefore split the derivative into multiple terms using the chain rule, so that we can easily differentiate with respect to w5:
∂E_total/∂w5 = ∂E_total/∂out_y1 × ∂out_y1/∂net_y1 × ∂net_y1/∂w5
Now, we calculate each term one by one:
∂E_total/∂out_y1 = −(T1 − out_y1) = 0.75136507 − 0.01 = 0.74136507
∂out_y1/∂net_y1 = out_y1 × (1 − out_y1) = 0.75136507 × 0.24863493 = 0.186815602
∂net_y1/∂w5 = out_H1 = 0.593269992
Putting these values together gives the final result:
∂E_total/∂w5 = 0.74136507 × 0.186815602 × 0.593269992 = 0.082167041
Now, we will calculate the updated weight w5new with the help of the following formula, where the learning rate η is 0.5:
w5new = w5 − η × ∂E_total/∂w5 = 0.40 − 0.5 × 0.082167041 = 0.35891648
In the same way, we calculate w6new, w7new, and w8new, which gives us the following values:
w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
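The gradient and update for w5 can be verified with a short Python sketch (sigmoid activation assumed; the learning rate of 0.5 is inferred from the updated values above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# forward pass, as in the worked example
out_h1 = sigmoid(0.05 * 0.15 + 0.10 * 0.20 + 0.35)
out_h2 = sigmoid(0.05 * 0.25 + 0.10 * 0.30 + 0.35)
out_y1 = sigmoid(out_h1 * 0.40 + out_h2 * 0.45 + 0.60)

# chain rule: dE/dw5 = dE/dout_y1 * dout_y1/dnet_y1 * dnet_y1/dw5
dE_dout = out_y1 - 0.01              # -(T1 - out_y1)
dout_dnet = out_y1 * (1.0 - out_y1)  # sigmoid derivative
dnet_dw5 = out_h1
grad_w5 = dE_dout * dout_dnet * dnet_dw5

w5_new = 0.40 - 0.5 * grad_w5  # learning rate 0.5 (inferred)
print(round(w5_new, 8))  # 0.35891648
```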
RMSProp:
RMSProp maintains a decaying moving average of the squared partial derivative for each parameter:
s(t+1) = s(t) × rho + f'(x(t))² × (1 − rho)
Where s(t+1) is the decaying moving average of the squared partial derivative for one parameter for the current iteration of the algorithm, s(t) is the decaying moving average of the squared partial derivative for the previous iteration, f'(x(t))² is the squared partial derivative for the current parameter, and rho is a hyperparameter, typically with a value of 0.9, like momentum.
Given that we use a decaying average of the partial derivatives and calculate the square root of this average, the technique gets its name from the square root of the mean squared partial derivatives, i.e., root mean square (RMS). For example, the custom step size for a parameter may be written as:
cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t+1)))
Once we have the custom step size for the parameter, we can update the parameter using the custom step size and the partial derivative f'(x(t)):
x(t+1) = x(t) − cust_step_size(t+1) × f'(x(t))
This process is then repeated for each input variable until a new point in the search space is
created and can be evaluated.
RMSProp is a very effective extension of gradient descent and is one of the preferred
approaches generally used to fit deep learning neural networks.
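A single-parameter RMSProp update along these lines can be sketched as follows (function names and the toy objective f(x) = x² are ours):

```python
import math

def rmsprop_step(x, grad, s, step_size=0.01, rho=0.9, eps=1e-8):
    # decaying moving average of the squared partial derivative
    s = rho * s + (1.0 - rho) * grad ** 2
    # custom per-parameter step size from the root mean square
    x = x - (step_size / (eps + math.sqrt(s))) * grad
    return x, s

# minimise f(x) = x^2 (gradient 2x) starting from x = 1.0
x, s = 1.0, 0.0
for _ in range(300):
    x, s = rmsprop_step(x, 2.0 * x, s)
print(abs(x) < 0.1)  # True: x has moved close to the minimum at 0
```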
Adam:
The Adam optimization algorithm is an extension to stochastic gradient descent. It is used to update network weights iteratively based on training data. Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba. The name Adam is derived from adaptive moment estimation.
Straightforward to implement.
Computationally efficient.
Little memory requirements.
Invariant to diagonal rescale of the gradients.
Well suited for problems that are large in terms of data and/or parameters.
Appropriate for non-stationary objectives.
Appropriate for problems with very noisy and/or sparse gradients.
Hyper-parameters have intuitive interpretation and typically require little tuning.
Adam can be seen as combining the advantages of two other extensions of stochastic gradient descent. Specifically:
Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that
improves performance on problems with sparse gradients (e.g. natural language and computer
vision problems).
Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates
that are adapted based on the average of recent magnitudes of the gradients for the weight
(e.g. how quickly it is changing). This means the algorithm does well on online and non-
stationary problems (e.g. noisy).
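An Adam update for a single parameter can be sketched as follows (a minimal sketch; the beta values are the commonly cited defaults, while the step size alpha is enlarged for this toy problem, and all names are ours):

```python
import math

def adam_step(x, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (mean), as in momentum
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment, as in RMSProp
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for zero-initialised moments
    v_hat = v / (1.0 - beta2 ** t)
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# minimise f(x) = x^2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
print(abs(x) < 0.1)  # True
```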
Weight Initialization:
The main objective of weight initialization is to prevent layer activation outputs from exploding or vanishing during forward propagation. If either problem occurs, the loss gradients will be either too large or too small, and the network will take more time to converge, if it is able to do so at all.
If we initialize the weights correctly, then our objective, i.e., optimization of the loss function, will be achieved in the least time; otherwise, converging to a minimum using gradient descent may become impossible.
One of the important things to keep in mind while building a neural network is to correctly initialize the weight matrices for the different connections between layers.
Let us see the following two initialization scenarios which can cause issues while training the model:
If we initialize all the weights with 0, the derivative with respect to the loss function is the same for every weight in W[l], so all weights take the same value in subsequent iterations. This makes the hidden units symmetric, and the symmetry persists through all n iterations. Thus, initializing the weights with zero makes the network no better than a linear model.
It is important to note that setting the biases to 0 does not create any problem: the non-zero weights take care of breaking the symmetry, so even if the biases are 0, the values in every neuron will still be different.
– In general, random initialization is used to break the symmetry; it is better to assign random non-zero values to the weights.
– Remember, neural networks are very sensitive and prone to overfitting, as they quickly memorize the training data.
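A minimal random-initialization sketch (plain Python; the scale factor 0.01 is an illustrative choice):

```python
import random

def init_layer(n_in, n_out, scale=0.01):
    # small random weights break the symmetry between units;
    # biases can safely start at zero
    W = [[random.gauss(0.0, 1.0) * scale for _ in range(n_in)]
         for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

W, b = init_layer(3, 4)
print(len(W), len(W[0]), b)  # 4 3 [0.0, 0.0, 0.0, 0.0]
```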
“What happens if the weights initialized randomly can be very high or very
low?”
(a) Vanishing gradients:
With saturating activation functions such as sigmoid and tanh, abs(dW) gets smaller and smaller as we go backward through the layers during backpropagation, especially in deep neural networks. As a result, the earlier layers' weights are adjusted slowly.
Due to this, the weight updates are minor, which results in slower convergence and makes the optimization of our loss function slow. In the worst case, this may completely stop the neural network from training further.
More specifically, in the case of the sigmoid and tanh activation functions, if the weights are very large the activations saturate, so the gradient becomes vanishingly small, effectively preventing the weights from changing their value. This is because abs(dW) increases only very slightly, or possibly gets smaller and smaller, after every iteration.
So here comes the use of the ReLU activation function, with which vanishing gradients are generally not a problem: the gradient is 0 for negative (and zero) inputs and 1 for positive inputs.
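The shrinking effect can be seen numerically: the sigmoid derivative never exceeds 0.25, so the product of derivatives across many layers collapses toward zero (a toy illustration):

```python
import math

def dsigmoid(z):
    # derivative of the sigmoid: s * (1 - s)
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

grad = 1.0
for _ in range(10):
    grad *= dsigmoid(0.0)  # 0.25, the best case per layer
print(grad)  # 0.25 ** 10, about 9.5e-07
```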
(b) Exploding gradients:
This is the exact opposite of the vanishing-gradients case discussed above.
Consider weights that are non-negative and large, with small activations A. When these weights are multiplied along the different layers, they cause a very large change in the value of the overall gradient (cost). This means the changes in W, given by the equation W = W − α × dW, come in huge steps, and the downward movement increases.
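The opposite effect can be illustrated the same way: once the per-layer factor exceeds 1, the product grows exponentially (the weight value 5.0 is an illustrative choice):

```python
grad = 1.0
w = 5.0  # a deliberately large weight
for _ in range(10):
    grad *= w * 0.25  # weight times the maximum sigmoid slope per layer
print(grad)  # 1.25 ** 10, about 9.3
```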
Eigenvalues and Eigenvectors:
Role of eigenvalues and eigenvectors in deep learning
Picking the features that represent the data and eliminating less useful features is an example of dimensionality reduction. We can use eigenvalues and eigenvectors to identify the dimensions that are most useful and prioritize our computational resources toward them.
What is an Eigenvalue?
Mathematically, an eigenvalue is the number by which an eigenvector is multiplied to produce the same result as multiplying the matrix by that vector, as shown in Equation 1.
Ax = λx……………(1)
Where A is a square matrix, λ is the eigenvalue and x is the eigenvector.
Details of how to calculate the determinant of a matrix can be found in a linear algebra
textbook.
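Equation (1) can be checked directly on a small example (the matrix and eigenpair below are our illustrative choices):

```python
# For A = [[2, 1], [1, 2]], x = [1, 1] is an eigenvector with eigenvalue 3
A = [[2, 1], [1, 2]]
x = [1, 1]
lam = 3

Ax = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
lam_x = [lam * xi for xi in x]
print(Ax, lam_x)  # [3, 3] [3, 3], so A x equals lambda x
```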