DL Unit 2

Unit 2: Artificial Neural Networks
Prof. Sachin Sambhaji Patil
D. Y. Patil University, Ambi, Pune
The Perceptron
• The Perceptron was introduced by Frank Rosenblatt in 1957.
• He proposed a Perceptron learning rule based on the original McCulloch-Pitts (MCP) neuron.
• A Perceptron is an algorithm for supervised learning of binary classifiers.
• The algorithm enables a neuron to learn by processing the elements of the training set one at a time.

Basic Components of Perceptron
• Input Layer: The input layer consists of one or more input neurons, which
receive input signals from the external world or from other layers of the neural
network.
• Weights: Each input neuron is associated with a weight, which represents the
strength of the connection between the input neuron and the output neuron.
• Bias: A bias term is added to the weighted sum of the inputs. It lets the perceptron shift its decision boundary away from the origin, which gives it additional flexibility.
• Activation Function: The activation function determines the output of the perceptron based on the weighted sum of the inputs and the bias term. The classic perceptron uses the step function; related single-neuron models use the sigmoid or ReLU function.
• Output: The output of the perceptron is a single binary value, either 0 or 1,
which indicates the class or category to which the input data belongs.
• Training Algorithm: The perceptron is typically trained with a supervised learning algorithm, the perceptron learning rule (multilayer networks built from such units are trained with backpropagation instead). During training, the weights and bias of the perceptron are adjusted to reduce the error between the predicted output and the true output for a given set of training examples.
• Overall, the perceptron is a simple yet powerful algorithm that can be used to
perform binary classification tasks and has paved the way for more complex
neural networks used in deep learning today.
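
The training idea described above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `train_perceptron`, the learning rate of 1.0, the epoch count, and the AND-gate data are all assumptions chosen for the example.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron learning rule: w += lr * (target - prediction) * x."""
    w = np.zeros(X.shape[1])   # one weight per input (assumed zero start)
    b = 0.0                    # bias
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
            update = lr * (target - pred)              # zero when the prediction is right
            w += update * xi
            b += update
    return w, b

# AND gate: a linearly separable problem (illustrative data)
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
w_and, b_and = train_perceptron(X_and, y_and)
preds_and = [1 if np.dot(w_and, xi) + b_and > 0 else 0 for xi in X_and]
print(preds_and)
```

Because the AND data is linearly separable, the rule stops changing the weights once every example is classified correctly.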
Biological Neuron
• A human brain has billions of neurons.
• Neurons are interconnected nerve cells in the human brain that are
involved in processing and transmitting chemical and electrical
signals.
• Dendrites are branches that receive information from other neurons.

• The cell body (soma), which contains the nucleus, processes the information received from the dendrites.
• The axon is a cable that neurons use to send information.
• A synapse is the connection between an axon and the dendrites of another neuron.

What is Artificial Neuron
• An artificial neuron is a mathematical function based on a model of
biological neurons, where each neuron takes inputs, weights them
separately, sums them up and passes this sum through a nonlinear
function to produce output.

Comparison of the biological neuron with the artificial neuron:

Biological Neuron       Artificial Neuron
Cell Nucleus (Soma)     Node
Dendrites               Input
Synapse                 Weights or interconnections
Axon                    Output
Artificial Neuron
• A neuron is a mathematical function modeled on the working of biological
neurons
• It is an elementary unit in an artificial neural network
• One or more inputs are separately weighted
• Inputs are summed and passed through a nonlinear function to produce output
• Every neuron holds an internal state called activation signal
• Each connection link carries information about the input signal
• Every neuron is connected to another neuron via connection link

Types of Perceptron:
• Single layer: Single layer perceptron can learn only linearly separable
patterns.
• Multilayer: Multilayer perceptrons have two or more layers and therefore greater processing power.
• The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision boundary.

• Supervised learning is a type of machine learning used to learn models from labeled training data. It enables output prediction for future or unseen data.

How Does Perceptron Work?

• The perceptron is considered a single-layer neural network with four main parts: input values, weights and bias, the weighted sum, and the activation function.
• The perceptron model begins by multiplying each input value by its weight, then adds these products to create the weighted sum.
• This weighted sum is then passed to the activation function ‘f’ to obtain the desired output.
• In the perceptron this activation function is the step function, represented by ‘f’.

• The step function, or activation function, ensures that the output is mapped to the required range, such as (0, 1) or (-1, 1).
• Note that the weight of an input indicates the strength of that input's connection to the node. Similarly, the bias gives the ability to shift the activation function curve up or down.

• Step 1: Multiply all input values by their corresponding weights and add the products to calculate the weighted sum:
• ∑wi*xi = x1*w1 + x2*w2 + x3*w3 + … + xn*wn
• Add a term called the bias ‘b’ to this weighted sum to improve the model’s performance.
• Step 2: Apply an activation function to the weighted sum plus bias, giving an output either in binary form or as a continuous value:
• Y = f(∑wi*xi + b)
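
The two steps can be sketched as follows. The input values, weights, and bias are hypothetical numbers picked for illustration, and the threshold of zero in `step` is an assumption.

```python
import numpy as np

def step(z):
    """Step activation: 1 if z is above zero, otherwise 0."""
    return 1 if z > 0 else 0

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x)        # Step 1: sum of wi * xi
    return step(weighted_sum + b)      # Step 2: Y = f(sum + b)

x = np.array([1.0, 0.0, 1.0])          # hypothetical inputs
w = np.array([0.5, -0.6, 0.2])         # hypothetical weights
b = -0.4                               # hypothetical bias
y_out = perceptron_output(x, w, b)     # 0.5 + 0.2 - 0.4 is positive, so the neuron fires
print(y_out)
```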
Types of Perceptron models
• Single-Layer Perceptron model: One of the simplest types of ANN (Artificial Neural Network), it consists of a feed-forward network with a threshold transfer function. Its main objective is to classify linearly separable objects with binary outcomes. A single-layer perceptron can learn only linearly separable patterns.
• Multi-Layered Perceptron model: It is similar to a single-layer perceptron model but has one or more hidden layers.
• Forward Stage: Activations start at the input layer and propagate forward, ending at the output layer.
• Backward Stage: Weight and bias values are modified as the model requires. The error between the actual output and the desired output is propagated backward, starting from the output layer. A multilayer perceptron model has greater processing power and can process both linear and non-linear patterns. It can also implement logic gates such as AND, OR, XOR, XNOR, and NOR.
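
To see why a hidden layer matters, here is a small sketch of a two-layer network that computes XOR, which a single-layer perceptron cannot do. The weights are set by hand for illustration (they are an assumption, not learned values): one hidden unit behaves like OR, the other like AND.

```python
import numpy as np

def step(z):
    """Element-wise step activation."""
    return (z > 0).astype(int)

# Hand-picked (not learned) parameters, assumed for this example
W_hidden = np.array([[1.0, 1.0],    # hidden unit 1 behaves like OR
                     [1.0, 1.0]])   # hidden unit 2 behaves like AND
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])       # output fires for "OR but not AND"
b_out = -0.5

def xor_mlp(x):
    h = step(W_hidden @ x + b_hidden)      # forward stage: hidden layer
    return int(w_out @ h + b_out > 0)      # forward stage: output layer

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_mlp(np.array(p, dtype=float)) for p in inputs]
print(xor_outputs)
```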
Perceptron models
• Advantages:
• A multi-layered perceptron model can solve complex non-linear problems.
• It works well with both small and large input data.
• Helps us to obtain quick predictions after the training.
• Helps us obtain the same accuracy ratio with big and small data.

• Disadvantages:
• In a multi-layered perceptron model, computations are time-consuming and complex.
• It is hard to tell how much each independent variable affects the dependent variable.
• How well the model functions depends on the quality of training.

Characteristics of the Perceptron Model
• It is a machine learning algorithm for supervised learning of binary classifiers.
• In the Perceptron, the weight coefficients are learned automatically.
• Initially, weights are multiplied with the input features, and then a decision is made whether the neuron fires or not.
• The activation function applies a step rule to check whether the weighted sum is greater than zero.
• A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and -1.
• If the sum of all weighted input values is more than the threshold value, the neuron produces an output signal; otherwise, no output is shown.
Limitation of Perceptron Model
• The output of a perceptron can only be a binary number (0 or 1) due to the hard-edge transfer function.
• It can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, a single perceptron cannot classify them all correctly.

Perceptron Function
• The Perceptron is a function that maps its input “x”, multiplied by the learned weight coefficients, to an output value “f(x)”:

$f(x) = 1$ if $w \cdot x + b > 0$, otherwise $f(x) = 0$, where $w \cdot x = \sum_{i=1}^{m} w_i x_i$

In the equation given above:

“w” = vector of real-valued weights
“b” = bias (an element that adjusts the boundary away from origin without
any dependence on the input value)
“x” = vector of input x values
“m” = number of inputs to the Perceptron

The output can be represented as “1” or “0.” It can also be represented as “1” or “-1”
depending on which activation function is used.
https://www.simplilearn.com/tutorials/deep-learning-tutorial/perceptron

The Architecture of the Multilayer Feed-Forward Neural Network:

• This Artificial Neural Network has multiple hidden layers, which makes it a multilayer network. It is feed-forward because signals flow in one direction only, from the input layer through the hidden layers to the output layer. The network has the following layers:
1. Input Layer
2. Hidden Layer
3. Output Layer

• Input Layer: It is the starting layer of the network; each of its signals has an associated weight.
• Hidden Layer: This layer lies after the input layer and contains multiple neurons that perform the computations and pass the result on to the next layer.
• Output Layer: It contains the output units or neurons and receives processed data from the hidden layer. If there are several hidden layers, each one passes its weighted output to the next hidden layer for further processing until the output layer produces the desired result.
• In this description the input and hidden layers use sigmoid and linear activation functions, whereas the output layer uses a step activation function at its nodes, because a two-level output helps in predicting results as required.
• All units, also known as neurons, have weights. The calculation at a hidden unit is the sum of the products of all weights and their signals, followed by the sigmoid function of that sum.
• Adding hidden layers can increase the accuracy of the output.
What is a neural network
• A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain.
• It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons in a layered structure that resembles the human brain.

Neural network
A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates.
In this sense, neural networks refer to systems of neurons, either organic or artificial in nature.

Back propagation Forward propagation
• Backward propagation is the process of moving from right (output layer) to left (input layer).
• Forward propagation is the way data moves from left (input layer) to right (output layer) in the neural network.
• A neural network can be understood as a collection of connected input/output nodes.
• The accuracy of the network is expressed through a loss function or error rate.
• Backpropagation calculates the slope (gradient) of the loss function with respect to the weights in the neural network.
To train a neural network, there are 2 passes (phases):
1. Forward
2. Backward
The process of propagating the inputs from the input layer to the output layer is called forward propagation.
Once the network error is calculated, the forward propagation phase has ended, and the backward pass starts.
Forward and backward passes in Neural Networks
• The forward and backward phases are repeated for a number of epochs. In each epoch, the following occurs:
• The inputs are propagated from the input to the output layer.
• The network error is calculated.
• The error is propagated from the output layer to the input layer.

• In the forward pass, we start by propagating the data inputs to the input layer,
go through the hidden layer(s), measure the network’s predictions from the
output layer, and finally calculate the network error based on the predictions
the network made.
• This network error measures how far the network is from making the correct
prediction. For example, if the correct output is 4 and the network’s prediction
is 1.3, then the absolute error of the network is 4-1.3=2.7.

How backpropagation algorithm works
• How the algorithm works is best explained based on a simple network, like the
one given in the next figure. It only has an input layer with 2 inputs (X1 and X2),
and an output layer with 1 output. There are no hidden layers.
• The weights of the inputs are W1 and W2, respectively. The bias is treated as a
new input neuron to the output neuron which has a fixed value +1 and a
weight b. Both the weights and biases could be referred to as parameters.

The output layer uses the sigmoid activation function, defined by the following equation:
$f(s) = \frac{1}{1 + e^{-s}}$
where s is the sum of products (SOP) between each input and its corresponding weight, plus the bias:
S = X1*W1 + X2*W2 + b
Forward pass
The input of the activation function is the SOP between each input and its weight, added to the bias:
S = X1*W1 + X2*W2 + b
S = 0.1*0.5 + 0.3*0.2 + 1.83
S = 1.94
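
The forward pass above can be reproduced in code. The assignment of the numbers to X1, W1, X2, W2, and b is read off the arithmetic in the slide and is therefore an assumption; the sigmoid is the standard logistic function.

```python
import math

def sigmoid(s):
    """Standard logistic sigmoid: 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

# Values read off the worked example (assumed mapping to the symbols)
X1, X2 = 0.1, 0.3
W1, W2 = 0.5, 0.2
b = 1.83

S = X1 * W1 + X2 * W2 + b      # sum of products plus bias
predicted = sigmoid(S)         # output of the neuron, roughly 0.874
print(round(S, 2), round(predicted, 3))
```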
Compare Single and Multi layer Feed-Forward Neural Network

Activation Functions: ReLU, Linear, Sigmoid, Softmax, Tanh
Activation functions are generally of two types:
1. Linear or Identity Activation Function
2. Non-Linear Activation Function

Non-linear Activation Functions
• Generally, neural networks use non-linear activation functions, which help the network learn complex data, approximate almost any function that maps inputs to outputs, and provide accurate predictions.
• They allow back-propagation because they have a derivative that depends on the inputs.

• Non-linear Activation Functions:
• All the activation functions listed above, other than the linear (identity) function, are non-linear. They are discussed in more detail below.
• 1. Sigmoid Activation Function:
• The sigmoid activation function is very simple: it takes a real value as input and gives an output that is always between 0 and 1, which can be read as a probability. Its curve has an ‘S’ shape.

• 2. Tanh or Hyperbolic tangent:
• Tanh helps to solve the sigmoid function's problem of not being zero-centered. Tanh squashes a real-valued number to the range (-1, 1). It is non-linear too.

It solves this drawback of the sigmoid, but it still cannot remove the vanishing gradient problem completely.
Comparing the tanh activation function with the sigmoid makes the difference clear: tanh is centered at zero, while the sigmoid is centered at 0.5.
import numpy as np

# tanh activation function
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# Derivative of the tanh activation function: 1 - tanh(z)^2
def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)

• 3. ReLU (Rectified Linear Unit):
• This is the most popular activation function used in the hidden layers of a neural network.
• The formula is deceptively simple: max(0, z). Despite its name and appearance, it is not linear, and it provides the same benefits as the sigmoid but with better performance.

• Its main advantage is that it mitigates the vanishing gradient problem and is less computationally expensive than tanh and sigmoid.
• But it also has a drawback. Sometimes neurons can be fragile during training and can die. That leads to dead neurons.
• In other words, for activations in the region (x<0) of ReLU, the gradient is 0, so the weights do not get adjusted during descent.
• That means the neurons that go into that state stop responding to variations in error or input (simply because the gradient is 0, nothing changes). So the activation function should be chosen carefully, according to the requirements of the problem.
• 4. Leaky ReLU
• It prevents the dying ReLU problem. This variation of ReLU has a small positive slope in the negative area, so it does enable back-propagation, even for negative input values.
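
A short sketch of ReLU and Leaky ReLU with their gradients shows the difference in the negative region. The slope `alpha=0.01` is a commonly used value assumed here, not one given in the slides, and the gradient at exactly zero is set by convention.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # Gradient is 0 for z <= 0, which is what causes "dead" neurons
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is the small positive slope for negative inputs (assumed value)
    return np.where(z > 0, z, alpha * z)

def leaky_relu_prime(z, alpha=0.01):
    # Gradient never becomes exactly 0, so weights can still be updated
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_prime(z))
print(leaky_relu(z), leaky_relu_prime(z))
```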

• 5. Softmax
• Generally, we use this function at the last layer of a neural network, where it calculates a probability distribution over ‘n’ different classes. Its main advantage is that it can handle multiple classes.
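
A minimal softmax sketch follows. Subtracting the maximum before exponentiating is a standard numerical-stability trick added here as an assumption; the example scores are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # stability trick; does not change the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)          # normalize so the outputs sum to 1

scores = np.array([2.0, 1.0, 0.1])      # hypothetical last-layer outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The largest score receives the largest probability, and the outputs always sum to 1.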

Losses in neural network
• When you train Deep learning models, you feed data to the network, generate
predictions, compare them with the actual values (the targets) and then
compute what is known as a loss.
• This loss essentially tells you something about the performance of the network:
the higher it is, the worse your network performs overall.

• Loss functions are mainly classified into two categories: classification loss and regression loss.
• Classification loss is used when the aim is to predict the output from a set of categorical values.
• For example, if we have a dataset of images of handwritten digits and the digit (0 to 9) is to be predicted, classification loss is used.

• If instead the problem is regression, that is, predicting continuous values such as weather conditions or house prices on the basis of some features, regression loss is used.

1. Mean Absolute Error (L1 Loss)
2. Mean Squared Error (L2 Loss)
3. Huber Loss
4. Cross-Entropy(a.k.a Log loss)
5. Relative Entropy(a.k.a Kullback–Leibler divergence)
6. Squared Hinge

• Mean Absolute Error (MAE)
• Mean absolute error (MAE) also called L1 Loss is a loss function used
for regression problems. It represents the difference between the
original and predicted values extracted by averaging the absolute
difference over the data set.
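
As a small illustration, MAE can be computed as the mean of the absolute differences, MAE = (1/n) ∑ |yi - ŷi|. The function name and the sample values below are assumptions for the example.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Hypothetical targets and predictions: absolute errors are 1, 0 and 2
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(mae)
```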

• Use mean absolute error when you are doing regression and don’t want outliers to play a big role. It can also be useful if you know that your distribution is multimodal, and it’s desirable to have predictions at one of the modes, rather than at the mean of them.

• Example: When doing image reconstruction, MAE encourages less blurry images compared to MSE. This is used, for example, in the paper Image-to-Image Translation with Conditional Adversarial Networks.

Mean Squared Error (MSE)
• Mean Squared Error (MSE), also called L2 Loss, is also a loss function used for regression. It represents the difference between the original and predicted values, obtained by averaging the squared differences over the data set.

• MSE is sensitive to outliers. Given several examples with the same input feature values, the optimal prediction is their mean target value.
• This should be compared with Mean Absolute Error, where the optimal prediction is the median.
• MSE is thus good to use if you believe that your target data, conditioned on the input, is normally distributed around a mean value, and when it is important to penalize outliers heavily.
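
The mean-versus-median claim can be checked numerically. The sketch below tries many constant predictions on a small hypothetical data set with one outlier and picks the one with the lowest loss under each criterion; the data and the search grid are assumptions for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

targets = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical data, 100 is an outlier
candidates = np.linspace(0.0, 100.0, 1001)        # constant predictions to try (step 0.1)

mse_curve = [mean_squared_error(targets, c) for c in candidates]
mae_curve = [np.mean(np.abs(targets - c)) for c in candidates]

best_mse = candidates[int(np.argmin(mse_curve))]  # lands near the mean (22)
best_mae = candidates[int(np.argmin(mae_curve))]  # lands near the median (3)
print(best_mse, best_mae)
```

The outlier drags the MSE-optimal prediction far from most of the data, while the MAE-optimal prediction stays at the median.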

• When to use it?
• Use MSE when doing regression if you believe that your target, conditioned on the input, is normally distributed, and you want large errors to be significantly (quadratically) more penalized than small ones.

• Example: You want to predict future house prices.
• The price is a continuous value, so this is a regression problem, and MSE can be used as the loss function.

Huber Loss
• Huber Loss is typically used in regression problems. It’s less sensitive to
outliers than the MSE as it treats error as square only inside an interval.
• Consider an example where we have a dataset of 100 values we would like our
model to be trained to predict. Out of all that data, 25% of the expected
values are 5 while the other 75% are 10.

• The Huber Loss offers the best of both worlds by balancing the MSE and MAE together. We can define it using the following piecewise function:
$L_\delta(y, f(x)) = \frac{1}{2}(y - f(x))^2$ if $|y - f(x)| \le \delta$, otherwise $\delta\,|y - f(x)| - \frac{1}{2}\delta^2$
Here, δ (delta) is a hyperparameter that defines the range for MAE and MSE.
In simple terms, this says: for errors smaller than delta, use the MSE; for errors greater than delta, use the MAE.
This way Huber loss provides the best of both MAE and MSE.
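
A minimal sketch of the Huber loss, assuming the standard piecewise definition (quadratic inside delta, linear outside). The function name, `delta=1.0`, and the sample values are assumptions for the example.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = np.asarray(y_true) - np.asarray(y_pred)
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                      # MSE-like branch for small errors
    linear = delta * abs_error - 0.5 * delta ** 2     # MAE-like branch for large errors
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

small = huber_loss([1.0], [1.5])   # |error| = 0.5, quadratic branch: 0.5 * 0.25
large = huber_loss([1.0], [4.0])   # |error| = 3.0, linear branch: 3.0 - 0.5
print(small, large)
```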
Cross-Entropy Loss
• The concept of cross-entropy traces back to the field of Information Theory, where Shannon introduced the concept of entropy in 1948.
• Entropy is a measure of disorder, or unpredictability, in a system.
• For a random variable X with probability distribution p(x), entropy is defined as follows:
$H(X) = -\sum_{x} p(x) \log p(x)$

• Cross-Entropy loss is also called logarithmic loss, log loss, or logistic loss.
• Each predicted class probability is compared to the actual class desired output, 0 or 1:
$H(p, q) = -\sum_{x} p(x) \log q(x)$
• Here x ranges over the possible outcomes (classes), p(x) is the probability distribution of the “true” label from the training samples, and q(x) is the estimate of the ML algorithm.
https://www.theaidream.com/post/loss-functions-in-neural-networks
• Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.
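
A small sketch of cross-entropy between a one-hot true label and a predicted distribution. The clipping constant `eps` is an assumption added to avoid log(0), and the example probabilities are hypothetical.

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum over classes of p(x) * log q(x)."""
    q = np.clip(np.asarray(q_pred, dtype=float), eps, 1.0)   # avoid log(0)
    return -np.sum(np.asarray(p_true) * np.log(q))

p = np.array([0.0, 1.0, 0.0])                        # true class is the second one
good = cross_entropy(p, [0.05, 0.90, 0.05])          # confident and correct
bad = cross_entropy(p, [0.70, 0.20, 0.10])           # confident and wrong
print(good, bad)
```

The loss is small when the model puts high probability on the true class and grows as that probability shrinks.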

Basic concepts of artificial neurons
• The artificial neuron is the building component of the ANN, designed to simulate the function of the biological neuron. The arriving signals, called inputs, are multiplied by the (adjustable) connection weights, summed (combined), and then passed through a transfer function to produce the output for that neuron.


• Artificial neurons (also called Perceptrons, Units or Nodes) are the simplest
elements or building blocks in a neural network. They are inspired by
biological neurons that are found in the human brain.

• A biological neuron receives its input signals from other neurons through dendrites (small fibers). Likewise, a perceptron receives its data from other perceptrons through input neurons that take numbers.
• The connection points between dendrites and biological neurons are called synapses. Likewise, the connections between inputs and perceptrons are called weights. They measure the importance level of each input.
• In a biological neuron, the nucleus produces an output signal based on the signals provided by the dendrites. Likewise, the node of a perceptron (its counterpart of the nucleus) performs some calculations based on the input values and produces an output.
• In a biological neuron, the output signal is carried away by the axon. Likewise, the counterpart of the axon in a perceptron is the output value, which becomes the input for the next perceptrons.
Optimizers
• An optimizer is an algorithm or function that adapts the neural network's
attributes, like learning rate and weights. Hence, it assists in improving the
accuracy and reduces the total loss.
• Hyperparameters: Learning Rate,
• Regularization,
• Momentum,
• Gradient-Based Learning,

Hyperparameters:
1. Learning Rate
The learning rate controls how much the model weights are updated at each step.
The amount that the weights are updated during training is referred to as the step size or the “learning rate.” Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
• Try a few different values and see which one gives you the best loss without sacrificing speed of training. We might start with a large value like 0.1, then try exponentially lower values: 0.01, 0.001, etc.

Hyperparameters:
• Epoch: It denotes the number of times the algorithm operates on the entire training
dataset.
• Batch: It is the number of samples to be considered for updating the model
parameters.
• Cost Function/Loss Function: A cost function helps you calculate the cost,
representing the difference between the actual value and the predicted value.
• Learning rate: It offers a degree that denotes how much the model weights should
be updated.
• Weights/ Bias: They are learnable parameters that control the signal between two
neurons in a deep learning model.
Hyperparameters:
• Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy and generalization of a Deep Learning model when facing completely new data from the problem domain.
• Popular regularization techniques include L1, L2, and dropout.

Hyperparameters:
• The momentum method is a technique that can accelerate gradient descent by taking account of previous gradients in the update rule at each iteration.
• Momentum is a widely used strategy for accelerating the convergence of gradient-based optimization techniques. It was designed to speed up learning in directions of low curvature without becoming unstable in directions of high curvature.
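
One common way to write the momentum update keeps a running "velocity" that mixes the previous update with the current gradient. The sketch below is an illustration under assumptions: the function name, the coefficient `beta=0.9`, the learning rate, and the quadratic cost J(w) = (w - 3)^2 are all chosen for the example.

```python
def momentum_descent(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum on a single parameter."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Keep a fraction beta of the previous update, then add the new gradient step
        velocity = beta * velocity - lr * grad(w)
        w = w + velocity
    return w

# Hypothetical cost J(w) = (w - 3)^2 with gradient 2 * (w - 3); minimum at w = 3
w_final = momentum_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)
```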

Hyperparameters:
• In deep learning, a variant called stochastic gradient descent (SGD) is
often used. It updates the parameters based on a randomly selected
subset of training samples in each iteration, rather than the entire
dataset. This helps in speeding up the training process and making it
feasible for large-scale problems.

Gradient-Based Optimizers in Deep Learning

Role of Learning Rate
• The learning rate represents the size of the steps our optimization algorithm takes toward a minimum of the loss function. To ensure that the gradient descent algorithm reaches a minimum, we must set the learning rate to an appropriate value, which is neither too low nor too high.
• Taking very large steps, i.e., using a large learning rate, may skip over the minimum, and the model may never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., using a small learning rate, will take a very long time to converge.

Role of Gradient
• In general, a gradient represents the slope of a function. The gradients used here are partial derivatives: they describe how the loss function changes in response to a small change in the parameters.
• This change in the loss function tells us which step to take next to reduce the loss.

https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-gradient-based-optimizers/


Implementing Gradient Descent
• Implementing gradient descent involves updating the parameters iteratively.
• The update formula for parameter w is given by
• w = w - α * (dJ/dw),
• where α is the learning rate and
• (dJ/dw) is the derivative of the cost function J with respect to w.
• Gradient descent is an iterative optimization algorithm used to find the values of model parameters that result in the smallest possible cost.
• It aims to minimize the cost function by adjusting the parameters in a systematic way.
• The algorithm makes small updates to the parameters based on the calculated gradient of the cost function.

The Process of Gradient Descent
• To apply gradient descent, we start with initial guesses for the parameters.
• The algorithm then iteratively updates the parameters by taking steps proportional to the negative gradient of the cost function.
• By repeating this process, the algorithm gradually converges towards the optimal parameter values that minimize the cost.
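
The process can be sketched for a single parameter using the update w = w - α * (dJ/dw). The cost J(w) = (w - 3)^2, the starting guess, the learning rate, and the iteration count are assumptions chosen for illustration.

```python
def gradient_descent(dJ_dw, w_init, alpha=0.1, iterations=100):
    """Repeat w = w - alpha * dJ/dw starting from an initial guess."""
    w = w_init
    for _ in range(iterations):
        w = w - alpha * dJ_dw(w)   # step against the gradient
    return w

# Hypothetical cost J(w) = (w - 3)^2, so dJ/dw = 2 * (w - 3) and the minimum is at w = 3
w_min = gradient_descent(lambda w: 2 * (w - 3), w_init=0.0)
print(w_min)
```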

Visualizing Gradient Descent

Simultaneous updates of both parameters (weight and bias) are crucial for correct gradient descent implementation.

Types of Gradient Descent

1. Batch Gradient Descent
• Batch gradient descent, also known as vanilla gradient descent, computes the gradient using the entire training dataset at each iteration.
• It calculates the average of the gradients for all training examples before updating the model’s parameters.
https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe#:~:text=III.-,
Implementing%20Gradient%20Descent,function%20with%20respect%20to%20w.

• Batch gradient descent ensures stability during training but can be computationally expensive when working with large datasets.
• Additionally, it may lead to slower convergence for noisy or redundant data.

2. Stochastic Gradient Descent
• Stochastic gradient descent (SGD) takes a different approach by updating the parameters for each training example individually.
• It computes the gradient using only one randomly selected training example, making each update much cheaper than in batch gradient descent.
• SGD has the advantage of adapting quickly to changing patterns in the data.
• However, it can exhibit more oscillations and may take longer to converge due to the noise introduced by individual samples.

3. Mini-Batch Gradient Descent
• Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent.
• It computes the gradient using a small subset, or mini-batch, of training examples.
• This approach combines the advantages of both previous methods.
• By using mini-batches, the algorithm achieves a balance between stability and computational efficiency.
• It reduces the noise introduced by individual samples and provides a more accurate estimate of the true gradient than a single sample does.
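
The three variants differ only in how many examples feed each update, so one routine with a `batch_size` argument can illustrate all of them. This is a sketch under assumptions: the function name, the noise-free line y = 2x + 1, the learning rate, and the epoch count are chosen for the example. Both gradients are computed before either parameter changes, so the weight and bias are updated simultaneously.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, alpha=0.3, epochs=500, seed=0):
    """Fit y = w * x + b with MSE; batch_size picks batch, mini-batch or SGD."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                     # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (w * X[idx] + b) - y[idx]
            grad_w = 2 * np.mean(error * X[idx])       # dJ/dw on this batch
            grad_b = 2 * np.mean(error)                # dJ/db on this batch
            w, b = w - alpha * grad_w, b - alpha * grad_b   # simultaneous update
    return w, b

X = np.linspace(0.0, 1.0, 20)
y = 2.0 * X + 1.0                                      # hypothetical noise-free data

w_batch, b_batch = minibatch_gd(X, y, batch_size=len(X))   # batch gradient descent
w_mini, b_mini = minibatch_gd(X, y, batch_size=4)          # mini-batch
w_sgd, b_sgd = minibatch_gd(X, y, batch_size=1)            # stochastic
print(w_batch, b_batch, w_mini, b_mini, w_sgd, b_sgd)
```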
The Importance of Learning Rate in Gradient Descent
• Gradient descent is a fundamental optimization algorithm used in machine learning to minimize a cost function and thereby find the optimal values for the model parameters.
• The learning rate, denoted as alpha (α), plays a crucial role in determining
how quickly the algorithm converges to the minimum of the cost function.
• It essentially controls the step size taken in each iteration of the gradient
descent process.

• To better understand the impact of the learning rate, let’s consider two
scenarios:
• 1. a learning rate that is too small and
• 2. a learning rate that is too large.
https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-
773ba7cd3dfe#:~:text=III.,Implementing%20Gradient%20Descent,function%20with%20respect%20to%20w.

1. A learning rate that is too small
• Learning Rate Too Small: When the learning rate is set to a very small value, the algorithm takes tiny steps towards the minimum of the cost function.
• These small steps can cause the convergence process to be extremely slow.
• Imagine taking small, hesitant steps towards a destination: it would take a significant amount of time to reach your goal.
• Similarly, with a small learning rate, gradient descent takes many iterations to approach the minimum, resulting in slower convergence.
2. A learning rate that is too large
• Learning Rate Too Large: Conversely, if the learning rate is set to a very large value, gradient descent can overshoot the minimum and fail to converge.
• With a large learning rate, the algorithm takes big steps towards the minimum, but it may continuously overshoot, causing the cost function to increase rather than decrease.
• This can lead to divergence, where the algorithm fails to find the optimal solution and keeps moving away from the minimum.
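
Both scenarios can be observed on a toy problem. The sketch below runs 20 steps of gradient descent on the assumed cost J(w) = (w - 3)^2 with three hypothetical learning rates.

```python
def run_gd(alpha, steps=20, w=0.0):
    """Gradient descent on J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        w = w - alpha * 2 * (w - 3)
    return w

w_small = run_gd(alpha=0.001)   # too small: barely moves away from the start
w_good = run_gd(alpha=0.1)      # reasonable: ends close to the minimum at 3
w_large = run_gd(alpha=1.1)     # too large: each step overshoots further, so it diverges
print(w_small, w_good, w_large)
```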
Gradient descent — The learning rate

Finding the Right Learning Rate:
• Selecting an appropriate learning rate is crucial to ensure efficient convergence of gradient descent.
• Ideally, you want to find a learning rate that allows the algorithm to converge quickly without overshooting or getting stuck in local minima.

Here are some steps to guide you in choosing an appropriate learning rate:
• 1. Experimentation: It’s often a trial-and-error process to find the optimal learning rate.
• Start with a reasonable initial value and observe the behavior of the algorithm.
• If it converges too slowly, increase the learning rate;
• if it diverges or overshoots, decrease the learning rate. Iterate this process until you find the right balance.
• 2. Learning Rate Schedules:
• Instead of using a fixed learning rate throughout the entire training process,
you can employ learning rate schedules.
• These schedules gradually decrease the learning rate over time, allowing for
faster convergence in the beginning and finer adjustments towards the end.

• 3. Adaptive Learning Rates: Advanced optimization algorithms, such as
AdaGrad, RMSprop, or Adam, automatically adapt the learning rate during
training based on the gradients observed in previous iterations.
• These adaptive methods maintain a separate effective learning rate for each
parameter and mitigate some of the challenges associated with manually
tuning the learning rate.
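
As an illustration of item 2, here is a minimal sketch of one possible learning rate schedule. It assumes an inverse-time decay rule, rate = initial_rate / (1 + decay * step), applied to the toy cost f(w) = w**2; the function names and all constants are hypothetical and chosen only for the example.

```python
def inverse_time_decay(initial_rate, decay, step):
    """Learning rate that shrinks as the step counter grows."""
    return initial_rate / (1.0 + decay * step)


def descend_with_schedule(initial_rate=0.4, decay=0.1, w0=5.0, steps=30):
    """Minimise f(w) = w**2 with a learning rate that decays every step."""
    w = w0
    rates = []
    for step in range(steps):
        rate = inverse_time_decay(initial_rate, decay, step)
        w = w - rate * 2.0 * w      # gradient of w**2 is 2 * w
        rates.append(rate)
    return w, rates


final_w, rates = descend_with_schedule()
print(f"first rate = {rates[0]:.3f}, last rate = {rates[-1]:.3f}")
print(f"final w = {final_w:.2e}")
```

The early steps use a large rate and cover most of the distance to the minimum; the later steps use a smaller rate and make finer adjustments.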

Gradient descent
• Gradient descent is a powerful optimization algorithm used in various
machine learning applications.
• By iteratively updating model parameters based on the gradient of the cost
function, it helps find the values that minimize the cost.
• Understanding how it works, and how the learning rate shapes its behavior,
is the basis for effective model training and optimization.
Back propagation Algorithm

• The back propagation algorithm is the heart of neural network training.
• The signal needs to flow properly both in the forward direction when making
predictions as well as in the backward direction while calculating gradients.
• After propagating the input features forward to the output layer through the
various hidden layers, which may use the same or different activation
functions, we obtain a prediction. For classification tasks this is generally
the predicted probability that a sample belongs to the positive class.

• Now, the back propagation algorithm propagates backward from the
output layer to the input layer, calculating the error gradients on the way.
• Once the gradients of the cost function with respect to each
parameter (weights and biases) in the neural network have been computed, the
algorithm takes a gradient descent step towards the minimum to update
the value of each parameter in the network using these gradients.
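
The forward pass, backward pass, and update step can be sketched on the smallest possible network. This example is an assumption-laden illustration, not the method from the slides: it uses one input, one hidden sigmoid neuron, one output sigmoid neuron, a cross-entropy cost, and a single training sample. All names (`forward`, `backward`, `train_step`) and all numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: input -> hidden neuron -> output probability."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)        # hidden activation
    y_hat = sigmoid(w2 * h + b2)    # predicted probability of the positive class
    return h, y_hat


def loss(params, x, y):
    """Binary cross-entropy for a single sample."""
    _, y_hat = forward(params, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def backward(params, x, y):
    """Backward pass: chain rule from the output layer back to the input layer."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_z2 = y_hat - y                # gradient at the output pre-activation
    d_w2 = d_z2 * h
    d_b2 = d_z2
    d_h = d_z2 * w2                 # error passed back to the hidden neuron
    d_z1 = d_h * h * (1 - h)        # multiplied by the sigmoid derivative
    d_w1 = d_z1 * x
    d_b1 = d_z1
    return [d_w1, d_b1, d_w2, d_b2]


def train_step(params, x, y, learning_rate=0.5):
    """One gradient descent update of every parameter."""
    grads = backward(params, x, y)
    return [p - learning_rate * g for p, g in zip(params, grads)]


params = [0.5, 0.0, -0.3, 0.1]      # w1, b1, w2, b2 (arbitrary starting values)
x, y = 1.5, 1.0                     # one training sample with a positive label
loss_before = loss(params, x, y)
for _ in range(50):
    params = train_step(params, x, y)
loss_after = loss(params, x, y)
print(f"loss before = {loss_before:.4f}, loss after = {loss_after:.4f}")
```

The gradient at the output simplifies to `y_hat - y` only because a sigmoid output is paired with the cross-entropy cost; other combinations need the full chain rule term.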

What is the Vanishing Gradient Problem?
• As the back propagation algorithm advances downwards (or backward) from
the output layer towards the input layer, the gradients often get smaller
and smaller and approach zero, which eventually leaves the weights of the
initial or lower layers nearly unchanged.
• As a result, gradient descent may never converge to the optimum. This is
known as the vanishing gradients problem.

What is the Exploding Gradient Problem?
• Conversely, in some cases the gradients keep getting larger
and larger as the backpropagation algorithm progresses.
• This, in turn, causes very large weight updates and makes
gradient descent diverge. This is known as the exploding
gradients problem.
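
Both problems come from the same mechanism: the gradient reaching an early layer is a product of one factor per layer it passed through. The sketch below is a simplified illustration that assumes every layer contributes the same factor, weight times activation slope. The function name and the weights are hypothetical; the slope 0.25 is the maximum derivative of the sigmoid function.

```python
def backpropagated_gradient(weight, activation_slope, num_layers):
    """Multiply the per-layer factor (weight * slope) across all layers."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= weight * activation_slope
    return grad


# Per-layer factor 0.25 (below 1): the gradient shrinks towards zero.
vanishing = backpropagated_gradient(weight=1.0, activation_slope=0.25, num_layers=20)

# Per-layer factor 1.5 (above 1): the gradient grows without bound.
exploding = backpropagated_gradient(weight=6.0, activation_slope=0.25, num_layers=20)

print(f"vanishing: {vanishing:.2e}")   # roughly 9.1e-13
print(f"exploding: {exploding:.2e}")   # roughly 3.3e+03
```

A per-layer factor below 1 shrinks the gradient exponentially with depth, while a factor above 1 grows it exponentially.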

Compare the vanishing and exploding gradient problems.

Thank You
