Module 1

Introduction to Deep Learning
AI, ML and DL
Deep Learning (DL)
• Deep learning is a way of classifying, clustering, and predicting things by using a neural network that has been trained on vast amounts of data.
Deep Learning (DL)
• DL has its roots in neural networks (NNs).
• NNs are a set of algorithms designed for pattern recognition.
• NNs are loosely modeled on the human brain and its biological neurons.
• A human brain has roughly 86 billion neurons, each connected to many other neurons.
• The fundamental unit of a NN is a node, based on the biological neuron of a human brain.
Deep Learning (DL)
Deep NN
• These are NNs with more than two layers.
• 'Deep' refers to the number of hidden layers.
Inside a Deep Neural Network
Some DL Architectures
Designing a NN
• Movement of information in a NN happens in two stages: (feed)forward propagation and backpropagation.
Designing a NN
Designing a DNN
DNNs are NNs that are designed to mimic human intelligence.
Points to consider while designing:
1. Which layers to use?
2. How many neurons to use in each layer?
3. How to arrange the layers?
4. Which activation function to use?
5. Others
Applications of DL
A single-layer perceptron is the basic unit of a neural network. A perceptron consists of input values, weights and a bias, a weighted sum, and an activation function.
• A perceptron takes in some numerical inputs along with weights and a bias.
• It multiplies each input by its respective weight.
• These products are added together (this is known as the weighted sum), and the bias is added to the result.
• The activation function takes the weighted sum plus the bias as its input and returns the final output.
Assume we have a single neuron and three inputs x1, x2, x3 multiplied by the weights w1, w2, w3 respectively.
Given the numerical values of the inputs and the weights, a function inside the neuron produces an output.
What if we wanted the outputs to fall into a certain range, say 0 to 1?
An activation function is a function that converts its input (in this case, the weighted sum) into an output based on a set of rules.
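The perceptron described above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the input, weight, and bias values are made up, and sigmoid is assumed as the activation so that the output falls between 0 and 1.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)

# Illustrative values only (assumptions, not taken from the slides)
x = [1.0, 2.0, 3.0]      # x1, x2, x3
w = [0.5, -0.25, 0.1]    # w1, w2, w3
b = 0.2

output = perceptron(x, w, b)
print(output)  # a value strictly between 0 and 1
```

Swapping the `activation` argument for another function changes the output range without touching the weighted-sum step.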
Build a network with 2 input neurons, 3 hidden neurons, 2 output neurons, and 4 observations in the training set.
Use the same number of layers and neurons, but reduce the number of observations in the dataset to 1 instance:
MLP : Multi Layer Perceptron
What is an Activation Function?
• Activation functions decide whether a neuron should be activated or not, i.e. whether the information the neuron is receiving is relevant for the given prediction or should be ignored.
• The input to the activation function is the weighted sum of the neuron's inputs plus the bias.
• The activation function is the non-linear transformation that we apply to the input signals of the hidden neurons.
• This transformed output is then sent to the next layer of neurons as input.
• A neural network without an activation function is essentially just a linear regression model.
• The non-linear transformation of the input makes the network capable of learning and performing more complex tasks.
• It is applied to the hidden neurons.
Need for Activation Function
The purpose of activation functions is to introduce non-linearities into the network.

Types of Activation Functions in Neural Networks
Activation functions can be divided into three types:
1. Binary Step Functions
2. Linear Activation Function
3. Non-linear Activation Functions
1. Binary Step Function
• A binary step function is a threshold-based activation function.
• It uses a threshold to decide whether a neuron should be activated or not.
• If the input to the activation function (Y) is above (or below) a certain threshold, the neuron is activated and sends exactly the same signal to the next layer.
• Otherwise, the neuron is not activated, i.e. the signal is not passed to the next layer.
Activation function: f(Y) = "activated" if Y > threshold, else "not activated"
Alternatively, f(Y) = 1 if Y > threshold, 0 otherwise
Disadvantages of Binary Step Functions:
1. They don't provide multi-value outputs, so they are not suitable for multi-class classification.
2. The gradient of the step function is zero, so backpropagation receives no signal with which to update the weights.
2. Linear Activation Function
• Also known as the identity function.
• In a linear activation function, the dependent variable has a direct, proportional relationship with the independent variable.
• The output is proportional to the input.
Equation: f(x) = x
Range: (-infinity, infinity)
• It doesn't help with the complexity or the various parameters of the data usually fed to neural networks.
• The output of the function is not confined to any range.
Disadvantages of Linear Activation Function
• The gradient of the function is a constant and doesn't involve the input (x).
• Hence it is difficult during backpropagation to identify the neurons whose weights have to be adjusted.
• The neuron passes the signal as it is to the next layer.
• With linear activations throughout, the last layer is just a linear function of the first layer, no matter how many layers are stacked.
• The linear activation function is generally used by the neurons in the input layer of a NN.
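The claim that the last layer ends up as a linear function of the first can be checked numerically. The sketch below uses made-up weights (hypothetical values, chosen only for illustration) and shows that two stacked layers with the identity activation give the same output as one layer with combined weights.

```python
def linear_layer(x, weights, biases):
    """One dense layer with the identity activation f(x) = x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

# Illustrative weights: 2 inputs -> 2 hidden units -> 1 output
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, -3.0]], [0.5]

def two_layer(x):
    """Hidden layer followed by output layer, both linear."""
    return linear_layer(linear_layer(x, W1, b1), W2, b2)

# Equivalent single layer: W = W2 . W1 and b = W2 . b1 + b2
W = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]

def one_layer(x):
    """A single linear layer using the combined weights."""
    return linear_layer(x, W, b)

x = [3.0, -4.0]
print(two_layer(x), one_layer(x))  # the two outputs agree up to rounding
```

Adding more linear layers only changes the combined `W` and `b`; the network never becomes more expressive than a single linear layer.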
Non-linear Activation Function
Nonlinear activation functions are the most widely used activation functions.
They make it easier for the model to generalize or adapt to a variety of data and to differentiate between outputs.
Nonlinear activation functions are mainly divided on the basis of their range or curves.
Advantages of Non-Linear Activation Functions
• The gradient of the function involves the input x.
• Hence it is easier to determine which weights of the input neurons have to be adjusted during backpropagation to give a better prediction.
1. Sigmoid or Logistic Activation Function
Input: a real number
Output: a number between 0 and 1
The main reason we use the sigmoid function is that its output lies in the range (0, 1). It is therefore especially used for models that have to predict a probability as the output. Since a probability lies only between 0 and 1, sigmoid is a natural choice.
• Adds non-linearity
• The smaller the input (more negative), the closer the output is to 0
• The greater the input (more positive), the closer the output is to 1
Disadvantages of Sigmoid Activation Function
• The gradient of the function has a significant value only for inputs roughly between -3 and 3.
• For inputs outside this range, the gradient is small and approaches zero.
• The network then stops learning and suffers from the vanishing gradient problem.
2. Tanh or hyperbolic tangent Activation Function
• The output range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).
• Tanh is zero-centered.
• Negative inputs are mapped strongly negative.
• Positive inputs are mapped strongly positive.
• Zero inputs are mapped near zero.
• Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
Fig: tanh v/s Logistic Sigmoid
Disadvantages of Tanh Activation Function
• The gradient is steep near zero but approaches zero for inputs of large magnitude.
• The network then stops learning and suffers from the vanishing gradient problem.
• But tanh is zero-centered, so its outputs, and hence the gradient updates, can be both positive and negative.
• Hence the tanh non-linearity is generally preferred over sigmoid.
Comparison of Sigmoid and Tanh Activation Functions
• For integers between -6 and +6
• Tanh outputs are centered around zero, meaning the mean of the data passed on to the next layer is close to zero.
• Training of the neural network converges faster if the inputs to the neurons in each layer have a mean of zero and a variance of 1, and are decorrelated.
• Since the input to each layer comes from the previous layer, it is important that the outputs of the previous layers (the inputs to the next layers) are centered around zero.
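The comparison for integers between -6 and +6 can be reproduced with the standard definitions of the two functions and their derivatives. This is a small sketch for illustration; the table layout and variable names are assumptions, not taken from the slides.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, at most 1."""
    return 1.0 - math.tanh(x) ** 2

xs = list(range(-6, 7))
sig_out = [sigmoid(x) for x in xs]
tanh_out = [math.tanh(x) for x in xs]

print(" x  sigmoid    tanh  d_sigmoid  d_tanh")
for x in xs:
    print(f"{x:2d}   {sigmoid(x):.4f}  {math.tanh(x):+.4f}     {sigmoid_grad(x):.4f}  {tanh_grad(x):.4f}")

# Over a symmetric input range, tanh outputs average to zero, sigmoid outputs to 0.5
mean_sigmoid = sum(sig_out) / len(xs)
mean_tanh = sum(tanh_out) / len(xs)
print("mean sigmoid output:", mean_sigmoid)
print("mean tanh output:", mean_tanh)
```

The table also makes the saturation visible: both derivatives are close to zero at the ends of the range, while near zero the tanh derivative is four times the sigmoid derivative.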
3. ReLU (Rectified Linear Unit) Activation Function
• ReLU is the most widely used activation function; it is used in almost all convolutional neural networks and deep learning models.
• ReLU is half rectified (from the bottom): R(z) is zero when z is less than zero, and R(z) is equal to z when z is greater than or equal to zero.
• Range: [0, infinity)
• Any negative input given to the ReLU activation function is turned into zero immediately, so negative values are not mapped appropriately.
Disadvantages of ReLU:
• For negative inputs, the gradient is zero.
• Hence during backpropagation, the weights and biases of some neurons are not updated.
• This creates dead neurons, which never get activated.
• This is known as the "dying ReLU" problem.
4. Leaky ReLU/Parametric ReLU
• It is an attempt to solve the dying ReLU problem.
Fig: ReLU v/s Leaky ReLU
• The function has a small non-zero slope α for negative inputs.
• The leak helps to increase the range of the ReLU function.
• Usually, the value of α is 0.01 (Leaky ReLU).
• When α is not fixed at 0.01, for example when it is learned during training or sampled at random, it is called Parametric or Randomized ReLU respectively.
f(x) = max(αx, x)
• Therefore the range of the Leaky ReLU is (-infinity, infinity).
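A short sketch of ReLU and Leaky ReLU with their gradients makes the difference on negative inputs concrete. The function names and the default slope `alpha=0.01` are choices made for this illustration.

```python
def relu(x):
    """R(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """f(x) = max(alpha * x, x); valid as written for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, the ReLU gradient is exactly zero (no weight update), while the Leaky ReLU gradient is the small constant `alpha`.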
Advantages and Disadvantages of Leaky ReLU:
• For negative inputs, the gradient is a non-zero value.
• Hence during backpropagation, the weights and biases of all neurons are updated, and there are no dead neurons.
• The predictions made for negative inputs are not consistent.
• Since the gradient is a very small value for negative inputs, learning the model parameters is time-consuming.
• Sigmoid functions and their combinations generally work better in the case of classifiers.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
• The ReLU function is a general-purpose activation function and is used in most cases these days. ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations and activates only a few neurons.
• If we encounter dead neurons in our network, the leaky ReLU function is a good choice.
• Keep in mind that the ReLU function should only be used in the hidden layers. At present, ReLU works well in most cases as a general-purpose choice.
• Variants of ReLU
  • Leaky ReLU
  • Parametric ReLU
  • Exponential Linear Unit
SoftMax Activation Function
Activation Function
Activation Functions
Gradients and Activation Functions
• When constructing Artificial Neural Network (ANN) models, one of the key considerations is to select activation functions for the hidden and output layers that are differentiable, and whose derivatives are not zero everywhere.
• The gradient/derivative of the activation function is required during backpropagation:
  • to update the weights of the neurons
  • to determine how much and in what direction (+/-) the weights have to be adjusted
Complete this table:
# | Activation Function | Properties | Pros | Cons
1 | Sigmoid | | |
2 | Softmax | | |
3 | ReLU | | |
4 | Leaky ReLU | | |
6 | Tanh | | |
Tip 1:
Generally, for multi-class problems we use softmax activation instead of sigmoid with the cross-entropy loss, because softmax distributes the probability across the output nodes so that the outputs sum to 1.
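The difference between the two activations at the output layer can be seen on a small vector of scores. The sketch below uses made-up logits; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid applied to a single score."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, 1.0, 0.1]                  # illustrative scores for three classes
probs = softmax(logits)                   # one distribution over the classes
independent = [sigmoid(z) for z in logits]  # each node scored on its own

print("softmax:", probs, "sum =", sum(probs))
print("sigmoid:", independent, "sum =", sum(independent))
```

The softmax outputs form a single probability distribution over the classes, whereas the per-node sigmoid outputs are independent and do not sum to 1.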
Which to use, when, and where?

LOSS FUNCTIONS
From Word Doc

Loss/Cost/Objective/Error Functions
# | Loss Function | Type of Loss Function | Properties | Pros | Cons
1 | MSE / Quadratic Loss / L2 Loss | Regression | | |
2 | Mean Absolute Error / L1 Loss | | | |
3 | Mean Bias Error | | | |
4 | Hinge Loss / Multi-class SVM Loss | | | |
5 | Cross Entropy Loss / Negative Log Likelihood | Classification | | |
6 | Huber | | | |
Which to use, when, and with what?

Cross Entropy Loss
P: actual probability
Q: predicted probability
Entropy: $H(P) = -\sum_x P(x) \log P(x)$
Cross entropy: $H(P, Q) = -\sum_x P(x) \log Q(x)$
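As a rough illustration of these quantities, the sketch below computes the entropy of an actual distribution P and the cross entropy between P and a predicted distribution Q, using the standard definitions with the natural logarithm. The example distributions and the small `eps` guard against log(0) are assumptions made for this example.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum P(x) log Q(x); eps guards against log(0)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0, 0.0]          # actual class is the first one (one-hot)
good_q = [0.9, 0.05, 0.05]   # confident, correct prediction
bad_q = [0.1, 0.6, 0.3]      # prediction that favours the wrong class

loss_good = cross_entropy(p, good_q)
loss_bad = cross_entropy(p, bad_q)
print("entropy of P:", entropy(p))
print("loss for good prediction:", loss_good)
print("loss for bad prediction:", loss_bad)
```

The loss is small when the predicted probability of the true class is high and grows quickly as that probability drops.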
Loss Functions
BACK - PROPAGATION
07/13/2023 69
C = Loss = Mean Squared Error()
Optimization
Given a function f(x), an optimization algorithm helps in either minimizing or maximizing the value of f(x).
In deep learning, optimization algorithms are used to train the neural network by optimizing the cost function J. The cost function is defined as:
$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(y'^{(i)}, y^{(i)})$
where m is the number of training examples.
• The value of the cost function J is the mean of the loss L between the predicted value y' and the actual value y.
• The value y' is obtained during the forward propagation step and makes use of the weights W and biases b of the network.
• With the help of optimization algorithms, we minimize the value of the cost function J by updating the values of the trainable parameters W and b.
Gradient Descent
Batch Gradient Descent
• Batch Gradient Descent involves calculations over the full training set at each step, as a result of which it is very slow on very large training sets.
• Thus, Batch GD becomes very computationally expensive.
• In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single step. We do the following steps in one epoch of SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient calculated in step 3 to update the weights
5. Repeat steps 1-4 for all the examples in the training dataset
• Drawbacks:
• SGD takes more iterations than batch GD to reach the minimum, and its updates are noisier.
• As SGD computes the derivative for only one point at a time, the time taken to complete one epoch can be large compared to batch gradient descent.
Mini Batch Stochastic Gradient Descent
• MB-SGD is an extension of the SGD algorithm.
• Instead of just one point, it is common to sample a small number of data points at each step; this is called "mini-batch" gradient descent. Mini-batch tries to strike a balance between the stability of batch gradient descent and the speed of SGD.
• It overcomes the time-consuming nature of SGD by taking a batch (subset) of points from the dataset to compute the derivative.
• After creating mini-batches of fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient calculated in step 3 to update the weights
5. Repeat steps 1-4 for all the mini-batches we created
• Drawback: the weight updates are noisier than in batch gradient descent, because the derivative of a mini-batch does not always point towards the minimum.
Types - Gradient Descent
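Batch GD: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
SGD: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$
Mini-batch: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$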
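The three update rules differ only in how many examples feed each gradient computation. The sketch below, a toy example under stated assumptions rather than the slides' own code, fits a line y = w·x + b with a mean squared error cost; the data, learning rate `eta`, and epoch count are made up, and a single `batch_size` argument selects the variant.

```python
import random

def gradient(theta, batch):
    """Gradient of the mean squared error of y ~ w*x + b over a batch."""
    w, b = theta
    n = len(batch)
    dw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2 * (w * x + b - y) for x, y in batch) / n
    return dw, db

def train(data, batch_size, eta=0.05, epochs=200, seed=0):
    """batch_size = len(data): batch GD; 1: SGD; anything else: mini-batch GD."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            dw, db = gradient(theta, data[i:i + batch_size])
            # theta = theta - eta * gradient
            theta = (theta[0] - eta * dw, theta[1] - eta * db)
    return theta

# Toy, noise-free data generated from y = 2x + 1 (illustrative only)
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]

batch_theta = train(data, batch_size=len(data))   # one update per epoch
sgd_theta = train(data, batch_size=1)             # one update per example
mini_theta = train(data, batch_size=4)            # one update per mini-batch
print(batch_theta, sgd_theta, mini_theta)
```

All three runs approach w = 2 and b = 1 here; on real, noisy data the variants trade update cost against update noise as described above.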

Batch Vs Stochastic Vs Mini Batch
Optimization
Gradient descent is an optimization algorithm often used for finding the weights of a neural network.
SGD is one of many optimization methods. It is a first-order optimizer, meaning that it is based on analysis of the gradient of the objective.
In gradient descent, one tries to reach the minimum of the loss function with respect to the parameters, using the derivatives calculated by backpropagation.
The easiest way is to adjust each parameter by subtracting its corresponding derivative multiplied by a learning rate, which regulates how large a step to take along the negative gradient direction.
The three main flavors of gradient descent are batch, stochastic, and mini-batch.
Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks.
It is not a learning method, but rather a computational technique that is often used in learning methods.
It is a straightforward application of the chain rule of derivatives, which makes it possible to compute all required partial derivatives in linear time.
Neural networks are typically trained with SGD, using backpropagation as the gradient-computing technique.
Back Propagation
The goal of backpropagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to outputs.
The Forward Pass
Total Error
Backward Pass
Consider one weight of the network. We want to know how much a change in that weight affects the total error (the gradient of the total error with respect to that weight).
Applying Chain Rule
Next, how much does the output of the neuron change with respect to its total net input?
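The chain-rule steps can be worked through for a single neuron. The original slide's symbols and values were lost in extraction, so everything below is hypothetical: one input, one weight, a sigmoid activation, and a squared-error loss E = 0.5 (target - out)^2. The finite-difference check is only a sanity test of the analytic gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, b, target):
    """Forward pass: net input, neuron output, and squared error."""
    net = w * x + b
    out = sigmoid(net)
    error = 0.5 * (target - out) ** 2
    return net, out, error

# Hypothetical values for a single neuron with one input
w, x, b, target = 0.4, 0.6, 0.3, 0.01

net, out, error = forward(w, x, b, target)

# Chain rule: dE/dw = dE/d(out) * d(out)/d(net) * d(net)/dw
dE_dout = out - target           # how the error changes with the output
dout_dnet = out * (1.0 - out)    # how the output changes with the net input
dnet_dw = x                      # how the net input changes with the weight
dE_dw = dE_dout * dout_dnet * dnet_dw

# Sanity check with a central finite difference
h = 1e-6
numeric = (forward(w + h, x, b, target)[2] - forward(w - h, x, b, target)[2]) / (2 * h)

# One gradient descent step on the weight (learning rate chosen arbitrarily)
eta = 0.5
w_new = w - eta * dE_dw
new_error = forward(w_new, x, b, target)[2]
print(dE_dw, numeric, error, new_error)
```

The analytic and numeric gradients agree closely, and the error after the update is lower than before.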
What is a gradient?
• A gradient is a measure of how much the output variable changes for a small change in the input.
• This gradient is then used to update (learn) the model parameters: weights and biases.
• The parameter update rule is $W_x^{new} = W_x^{old} - \eta \, \frac{\partial Loss}{\partial W_x}$, where $\eta$ is the learning rate.
• If the derivative term in the above equation is too small, there will be a very small change in $W_x$.
• Hence the new and old weights are almost the same, and no learning takes place.
• The weights of the initial layers would continue to remain unchanged (or only change by a negligible amount), no matter how many epochs you run with the backpropagation algorithm.
Problem of Vanishing Gradient
• As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.
• Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1.
• Therefore, a large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative becomes small.
• When the inputs of the sigmoid function become very large or very small (when |x| becomes bigger), the derivative becomes close to zero.
Vanishing Gradient Problem
• In networks with few layers and sigmoid activation functions, vanishing gradients are usually not a serious problem.
• When more layers are used, the gradient can become too small for training to work effectively.
• Gradients of neural networks are found using backpropagation.
• Backpropagation finds the derivatives of the network by moving layer by layer from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down the network (from the final layer to the initial one) to compute the derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to the initial layers.
• A small gradient means that the weights and biases of the initial layers will not be updated effectively with each training session.
• Since these initial layers are often crucial to recognizing the core elements of the input data, this can lead to overall inaccuracy of the whole network.
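The exponential decay is easy to see numerically. The sketch below is a deliberate simplification: it ignores the weights and multiplies only the sigmoid derivatives, each evaluated at the same pre-activation value `z` (an assumption made for illustration). Since the sigmoid derivative is at most 0.25, this is the best case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid; its maximum value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_factor(n_layers, z=0.0):
    """Product of n sigmoid derivatives, each evaluated at pre-activation z."""
    return sigmoid_grad(z) ** n_layers

for n in (1, 5, 10, 20):
    print(n, gradient_factor(n))
```

Even at the most favourable point z = 0, twenty sigmoid layers scale the gradient by a factor below 1e-11 before the weights are taken into account.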
Ways to detect whether your deep network is suffering from the vanishing gradient problem:
• The model improves very slowly during the training phase, and training may stop very early, meaning that any further training does not improve the model.
• The weights closer to the output layer of the model change more, whereas the layers closer to the input layer do not change much (if at all).
• Model weights shrink exponentially and become very small when training the model.
• The model weights become 0 in the training phase.

Vanishing Gradient Problem
Few Solutions:
1. Use other activation functions, such as ReLU, whose derivative does not shrink for positive inputs.
2. Residual networks (ResNet)
• Use bypass/skip connections to carry information past a few layers.
• Using these connections, information can be transferred from layer n to layer n+t.
• To do this, the output (activation) of layer n is connected to the output of layer n+t.
• This lets the gradient pass between the layers without any reduction in size.
• A residual connection directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x).
• This residual connection doesn't go through activation functions that "squash" the derivatives, resulting in a higher overall derivative of the block.
3. Batch Normalization:
• Vanishing gradients usually happen while using the sigmoid or tanh activation functions in the hidden layer units.
• When inputs become very small or very large, the sigmoid function saturates at 0 and 1, and the tanh function saturates at -1 and 1.
• In both these cases, their derivatives are extremely close to 0.
• These regions of the function are called "saturating regions" or "bad regions".
• Thus, if your input lies in any of the saturating regions, it has almost no gradient to propagate back through the network.
• Batch normalization can be visualized as an additional layer in the network that normalizes the data (using a mean and standard deviation) before feeding it into the hidden unit activation function.
• Batch normalization normalizes the input and keeps |x| largely within the "good range", away from the outer edges of the sigmoid function.
• If the input is in the good range, the activation does not saturate, and thus the derivative also stays in the good range, i.e. the derivative value isn't too small.
• Thus, batch normalization helps prevent the gradients from becoming too small and makes sure that the gradient signal is heard.
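A minimal sketch of the normalization step, assuming the training-time form of batch normalization for a single feature: subtract the batch mean, divide by the batch standard deviation, then apply a scale `gamma` and shift `beta` (learnable in practice, fixed here). The pre-activation values are made up, and the running statistics used at inference time are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of pre-activations to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Illustrative pre-activations that sit deep in the saturating region of sigmoid
pre_activations = [8.0, 10.0, 12.0, 14.0]
normalized = batch_norm(pre_activations)

raw_grads = [sigmoid_grad(z) for z in pre_activations]
norm_grads = [sigmoid_grad(z) for z in normalized]
print("normalized inputs:", normalized)
print("sigmoid gradients before:", raw_grads)
print("sigmoid gradients after: ", norm_grads)
```

After normalization the inputs lie within a couple of units of zero, where the sigmoid derivative is far from zero.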
Exploding Gradient Problem
Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.
This results in the model being unstable and unable to learn from the training data.
Ways to detect whether your deep network is suffering from the exploding gradient problem:
• Model weights grow exponentially and become very large when training the model.
• The model weights become NaN in the training phase.
Approaches to address both vanishing and exploding gradient problems
1. Reducing the number of layers
This solution can be used in both scenarios (exploding and vanishing gradients). However, by reducing the number of layers in our network, we give up some of our model's complexity, since having more layers makes the network more capable of representing complex mappings.
2. Gradient Clipping (Exploding Gradients)
Checking for and limiting the size of the gradients while our model trains is another solution.
3. Weight Initialization
A more careful choice of the random initialization for your network tends to be a partial solution, since it does not solve the problem completely.
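Gradient clipping can be sketched in two common forms: rescaling the whole gradient vector when its norm exceeds a limit, and clamping each component to a fixed interval. The function names and thresholds below are illustrative, not taken from the slides or from any particular library.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def clip_by_value(grads, limit):
    """Clamp every gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]

large = [3.0, 4.0]                     # L2 norm is 5
clipped = clip_by_norm(large, 1.0)     # same direction, norm reduced to 1
print(clipped)
print(clip_by_value([-5.0, 0.2, 7.0], 1.0))
```

Clipping by norm keeps the direction of the update and only shortens it, whereas clipping by value can change the direction.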
Training a NN in Keras
Data Set: Pima Indians Diabetes Data Set
It describes patient medical record data for Pima Indians and whether they had an onset of diabetes within five years.
It is a binary classification problem (onset of diabetes as 1 or not as 0).
The input variables that describe each patient are numerical and have varying scales.
The dataset has eight input attributes and one class label:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (mu U/ml)
6. Body mass index
7. Diabetes pedigree function
8. Age (years)
9. Class: onset of diabetes within five years
Sample records:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
Neural Network Structure
from google.colab import files
uploaded = files.upload()  # upload pima-indians-diabetes.csv when running in Colab
# first neural network with keras tutorial
from keras.models import Sequential
from keras.layers import Dense
import pandas as pd
# assumption: the CSV has no header row (as in the sample records), hence header=None
df = pd.read_csv("/content/pima-indians-diabetes.csv", header=None)
# split into input (X) and output (y) variables
X = df.iloc[:, 0:8]
y = df.iloc[:, 8]
# define the keras model
model = Sequential()
#input_layer = Dense(12, input_dim = 8, activation = 'relu')
#model.add(input_layer)
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
# output layer: a single sigmoid unit for the binary class, as binary_crossentropy expects
model.add(Dense(1, activation='sigmoid'))
# compile the keras model and specify the training parameters of the architecture
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=16)
#Output
# evaluate the keras model (here on the training data itself)
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
#Output
model.get_config()
Calculating the No. of Trainable Parameters
Example 1: With one hidden layer
No. of input units i = 3, hidden units h = 4 and output units o = 2.
Hence, the no. of trainable parameters, summing all weights and biases:
3×4 + 4×2 + 1×4 + 1×2
= 3×4 + 4×2 + 4 + 2
= i × h + h × o + h + o
= 26
Thus, the total number of parameters in a feed-forward neural network with one hidden layer is given by:
(i × h + h × o) + h + o
Example 2:
A feed-forward neural network has three hidden layers. The numbers of units in the input, first hidden, second hidden, third hidden and output layers are 3, 5, 6, 4 and 2 respectively. Calculate the no. of trainable parameters.
Ans:
• Number of connections between the first and second layer: 3 × 5 = 15, which is the product of i and h1.
• Number of connections between the second and third layer: 5 × 6 = 30, which is the product of h1 and h2.
• Number of connections between the third and fourth layer: 6 × 4 = 24, which is the product of h2 and h3.
• Number of connections between the fourth and fifth layer: 4 × 2 = 8, which is the product of h3 and o.
• Number of connections between the bias of the first layer and the neurons of the second layer (except the bias of the second layer): 1 × 5 = 5, which is h1.
• Number of connections between the bias of the second layer and the neurons of the third layer: 1 × 6 = 6, which is h2.
• Number of connections between the bias of the third layer and the neurons of the fourth layer: 1 × 4 = 4, which is h3.
• Number of connections between the bias of the fourth layer and the neurons of the fifth layer: 1 × 2 = 2, which is o.
• Summing up all:
3×5 + 5×6 + 6×4 + 4×2 + 1×5 + 1×6 + 1×4 + 1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable parameters.
• To generalize this into a formula:
3×5 + 5×6 + 6×4 + 4×2 + 1×5 + 1×6 + 1×4 + 1×2
= 3×5 + 5×6 + 6×4 + 4×2 + 5 + 6 + 4 + 2
= i × h1 + h1 × h2 + h2 × h3 + h3 × o + h1 + h2 + h3 + o
• Thus, the total number of parameters in a feed-forward neural network with three hidden layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3 + o
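The same counting rule extends to any number of fully connected layers: multiply each pair of consecutive layer sizes for the weights, and add one bias per non-input unit. A small helper (the function name is chosen for this sketch) reproduces both worked examples.

```python
def count_parameters(layer_sizes):
    """Trainable parameters of a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    # one weight per connection between consecutive layers
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # one bias per unit in every layer except the input layer
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_parameters([3, 4, 2]))        # Example 1: i=3, h=4, o=2
print(count_parameters([3, 5, 6, 4, 2]))  # Example 2: three hidden layers
```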
Calculate the number of trainable parameters for this model :
Bias is initialised to Zero

Hyperparameters
• Hyperparameters are the variables which determine the network structure (e.g. number of hidden units) and the variables which determine how the network is trained (e.g. learning rate).
• Hyperparameters are set before training (before optimizing the weights and biases).
Hyper parameters
1. No. of hidden layers and units
2. DropOut
• Deep learning neural networks are likely to quickly overfit a training dataset with few examples.
• A larger/deeper NN is also more likely to overfit and hence to generalize poorly.
• Dropout is a regularization method used to prevent model overfitting.
• It simulates a large number of different network architectures from a single model by randomly dropping out a few neurons from each layer during each training iteration.
• It is a computationally cheap and remarkably effective regularization method to reduce overfitting and improve generalization error in deep neural networks of all kinds.
• It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer.
• Dropout may be implemented on any or all hidden layers in the network, as well as on the visible or input layer. It is not used on the output layer.
• The term "dropout" refers to dropping out units (hidden and visible) in a neural network.
• Dropout is not used after training, when making a prediction with the fitted network.
• The dropout hyperparameter specifies the probability at which outputs of the layer are dropped out (inversely, the probability at which inputs to the layer are retained).
• A dropout value of 20%-50% of neurons is generally used.
• A common value is a probability of 0.5 for retaining the output of each node in a hidden layer (dropout is 0.5) and a value close to 1.0, such as 0.8, for retaining inputs from the visible layer (dropout is 0.2).
• The weights of the network will be larger than normal because of dropout.
• Hence, before the network is finalized, the weights are scaled down using the chosen dropout rate.
• The network can then be used as normal to make predictions.
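A minimal sketch of dropout on a list of activations follows. Note that it uses the "inverted dropout" variant, which is a swap from the scheme described above: instead of scaling the weights down after training, the retained activations are scaled up by 1 / (1 - rate) during training, so nothing needs to change at prediction time. The function name, the explicit `rng` argument, and the example values are assumptions made for illustration.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: drop each unit with probability `rate` during training.

    Retained activations are divided by the retention probability so that the
    expected value of each activation is unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)          # dropout is not applied when predicting
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                    # fixed seed so the run is repeatable
train_out = dropout([1.0] * 1000, rate=0.5, training=True, rng=rng)
test_out = dropout([1.0, 2.0, 3.0], rate=0.5, training=False)

print("fraction kept:", sum(1 for v in train_out if v != 0.0) / len(train_out))
print("mean activation during training:", sum(train_out) / len(train_out))
print("prediction-time output:", test_out)
```

With a rate of 0.5, roughly half of the units are zeroed in each training pass and the rest are doubled, which keeps the average activation close to its original value.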

3. Weight Initialization
• Different weight initialization schemes are used according to the activation function used on each layer.
• For a NN with L layers, there are L-1 hidden layers, plus 1 input layer and 1 output layer.
• The parameters (weights and biases) for layer l are represented as $W^{[l]}$ and $b^{[l]}$.
• These methods serve as good starting points for initialization and mitigate the chances of exploding or vanishing gradients.
• They set the weights neither too much bigger than 1, nor too much less than 1.
• So, the gradients do not vanish or explode too quickly. They help avoid slow convergence.
Source: Neural networks and deep learning, Andrew Ng (Coursera.org).
• REFER THE PDF FOR THE OTHER HYPERPARAMETERS
Epochs
One Epoch is when an ENTIRE dataset is passed forward and
backward through the neural network only ONCE.
Batch Size
Total number of training examples present in a single batch.
