• A perceptron takes some numerical inputs together with a set of weights and a
bias.
• It multiplies each input by its respective weight.
• These products are then added together (this is known as the weighted sum), and
the bias is added to the result.
• The activation function takes the weighted sum plus the bias as its input and returns
the final output.
An activation function is a function that converts the input given (the input, in this case,
would be the weighted sum) into a certain output based on a set of rules.
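The perceptron steps above can be sketched in a few lines of NumPy. This is only an illustration: the step activation, the function names, and the weight and bias values (chosen so the perceptron behaves like a logical AND) are assumptions, not something taken from the slides.

```python
import numpy as np

def step(z):
    """Step activation (an assumed choice): 1 if z >= 0, otherwise 0."""
    return 1.0 if z >= 0 else 0.0

def perceptron(inputs, weights, bias, activation=step):
    # Multiply each input by its weight and add the products (weighted sum).
    weighted_sum = np.dot(inputs, weights)
    # Add the bias, then pass the result through the activation function.
    return activation(weighted_sum + bias)

# Hypothetical parameters that make the perceptron act like a logical AND.
and_weights = np.array([1.0, 1.0])
and_bias = -1.5

outputs = [perceptron(np.array(x), and_weights, and_bias)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

Only the input (1, 1) pushes the weighted sum plus bias above zero, so only that input produces an output of 1.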
Build a network with 2 input neurons, 3 hidden neurons, 2 output neurons, and 4 observations in training
set.
Use same number of layers and neurons but reduce the number of observations in dataset to 1 instance:
MLP : Multi Layer Perceptron
What is an Activation Function?
Equation : f(x) = x
Range : (-infinity to infinity)
• The output of the functions will not be confined between any range.
Disadvantages of Linear Activation Function
The Nonlinear Activation Functions are the most used activation functions.
The Nonlinear Activation Functions are mainly divided on the basis of their range or
curves
Advantages of Non-Linear Activation Functions
The main reason why we use the sigmoid function is that its output lies between 0 and 1. Therefore,
it is especially used for models where we have to predict a probability as an output. Since the
probability of anything lies only in the range 0 to 1, sigmoid is a natural choice.
• The output range of the tanh function is from (-1 to 1). tanh is also sigmoidal (s - shaped).
• ReLU is the most used activation function, since it is used in almost all
convolutional neural networks and deep learning models.
• The ReLU is half rectified (from bottom). R(z) is zero when z is less than zero and
R(z) is equal to z when z is above or equal to zero.
• Range: [0, infinity)
• Any negative input given to the ReLU activation function is mapped to zero
immediately, so negative values are not represented appropriately in the
output.
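The ranges quoted above for sigmoid, tanh and ReLU can be checked numerically. The sketch below is a minimal illustration; the helper names and the sample inputs are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    # Output lies strictly between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # R(z) = 0 for z < 0 and R(z) = z for z >= 0.
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # hypothetical sample inputs

sig_out = sigmoid(z)
tanh_out = np.tanh(z)   # output lies strictly between -1 and 1
relu_out = relu(z)

print("sigmoid:", sig_out)
print("tanh   :", tanh_out)
print("relu   :", relu_out)
```

Note how ReLU maps both negative inputs to exactly zero, which is the behaviour described in the last bullet.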
Disadvantages of ReLU :
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient
problem.
• The ReLU function is a general-purpose activation function and is used in most cases
these days. ReLU is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations and activates only a few neurons at a time.
• If we encounter dead neurons in our network, the leaky ReLU function
is a good choice.
• Always keep in mind that the ReLU function should only be used in the hidden
layers. At present, ReLU works most of the time as a general approximator.
• Variants of ReLU
• Leaky ReLU
• Parametric ReLU
• Exponential Linear Unit
SoftMax Activation Function
Activation Function
Activation Functions
Gradients and Activation Functions
• When constructing Artificial Neural Network (ANN) models, one of the key
considerations is to select activation functions for the hidden and output
layers that are differentiable, i.e., their derivatives should exist and should not
be zero everywhere, so that gradients can be propagated.
1 Sigmoid
2 Softmax
3 ReLu
4 Leaky ReLu
6 TanH
Tip 1:
Generally, we use softmax activation instead of sigmoid with the cross-entropy loss because softmax
activation distributes the probability throughout each output node.
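As a rough sketch of how softmax pairs with the cross-entropy loss, the example below turns raw scores into a probability distribution and scores it against a one-hot target. The function names, the scores and the small `eps` constant are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum is a common numerical-stability trick;
    # it does not change the result.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(p_true, q_pred, eps=1e-12):
    # p_true: actual probabilities, q_pred: predicted probabilities.
    # eps (an assumed small constant) avoids log(0).
    return -np.sum(p_true * np.log(q_pred + eps))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
q = softmax(logits)                  # probabilities spread over all output nodes
p = np.array([1.0, 0.0, 0.0])        # one-hot target: class 0 is correct

loss = cross_entropy(p, q)
print("probabilities:", q, "sum:", q.sum())
print("cross-entropy loss:", loss)
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct class.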
4 Hinge Loss/Multi-class SVM Loss
5 Cross-Entropy Classification Loss/Negative Log-Likelihood
6 Huber
P : Actual Probability
Q : Predicted Probability
Entropy :
Loss Functions
BACK - PROPAGATION
07/13/2023 69
C = Loss = Mean Squared Error()
Optimization
Given a function f(x), an optimization algorithm helps in either minimizing or maximizing
the value of f(x).
In deep learning, optimization algorithms are used to train the neural network by optimizing
the cost function J. The cost function is defined as:
J(W, b) = (1/m) Σ L(y’(i), y(i)), summed over i = 1, …, m, where m is the number of training examples.
• The value of cost function J is the mean of the loss L between the predicted value y’ and
actual value y.
• The value y’ is obtained during the forward propagation step and makes use of the Weights
W and biases b of the network.
• With the help of optimization algorithms, we minimize the value of Cost Function J by
updating the values of the trainable parameters W and b.
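The description of J above can be written as a short function. This is a sketch under stated assumptions: the network is reduced to a single linear layer, the per-example loss L is taken to be the squared error, and all names and numbers are made up for the example.

```python
import numpy as np

def squared_error(y_pred, y_true):
    # Per-example loss L (squared error is an assumed choice).
    return (y_pred - y_true) ** 2

def cost(W, b, X, y, loss=squared_error):
    # Forward propagation for a single linear layer (assumed model): y' = XW + b.
    y_pred = X @ W + b
    # J is the mean of the per-example losses.
    return np.mean(loss(y_pred, y))

# Hypothetical data: 2 observations with 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([0.0, 1.0])
W = np.array([0.5, -0.5])
b = 0.1

J = cost(W, b, X, y)
print("J =", J)
```

An optimizer would then adjust `W` and `b` to make this value smaller.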
Gradient Descent
Batch Gradient Descent
• Batch Gradient Descent involves calculations
over the full training set at each step as a
result of which it is very slow on very large
training data.
• Thus, it becomes very computationally
expensive to do Batch GD.
• In Stochastic Gradient Descent (SGD), we consider just one example at a
time to take a single step. We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
• Drawback:
• SGD takes a larger number of iterations than GD to reach the minimum,
and its updates are noisier than those of Gradient Descent.
• As SGD computes derivatives of only 1 point at a time, the time taken to
complete one epoch is large compared to the Gradient Descent algorithm.
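One SGD epoch as listed above might look like the sketch below. It assumes a linear model with squared-error loss instead of a neural network, noise-free toy data, and a random visiting order for the examples; none of these choices come from the slides.

```python
import numpy as np

def sgd_epoch(theta, X, y, lr, rng):
    # Visit every example once; shuffling the order is an assumption.
    for i in rng.permutation(len(y)):
        x_i, y_i = X[i], y[i]                # 1. take an example
        pred = x_i @ theta                   # 2. feed it to the (linear) model
        grad = 2.0 * (pred - y_i) * x_i      # 3. gradient of the squared error
        theta = theta - lr * grad            # 4. update the weights
    return theta                             # 5. all examples have been used

# Hypothetical noise-free data generated from y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])    # second column plays the role of the bias
y = 2.0 * x + 1.0

theta = np.zeros(2)
for _ in range(200):                         # 200 epochs
    theta = sgd_epoch(theta, X, y, lr=0.1, rng=rng)
print("learned slope and intercept:", theta)
```

Each epoch performs as many weight updates as there are training examples, which is where both the noise and the per-epoch cost come from.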
Mini Batch Stochastic Gradient Descent
• MB-SGD is an extension of the SGD algorithm.
• It is also common to sample a small number of data points instead of just one point
at each step; this is called “mini-batch” gradient descent. Mini-batch tries to
strike a balance between the goodness of gradient descent and the speed of SGD.
• It overcomes the time-consuming complexity of SGD by taking a batch of points /
subset of points from the dataset to compute the derivative.
• After creating the mini-batches of fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for the mini-batches we created
• The drawback is that the weight updates are noisier than in batch gradient descent,
because the derivative computed on a mini-batch does not always point towards the minimum.
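The mini-batch steps can be sketched in the same spirit. As before, a linear model with squared-error loss stands in for the neural network, and the data, batch size and learning rate are hypothetical values picked for the example.

```python
import numpy as np

def minibatch_epoch(theta, X, y, lr, batch_size, rng):
    order = rng.permutation(len(y))          # shuffling is an assumption
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        X_b, y_b = X[idx], y[idx]            # 1. pick a mini-batch
        pred = X_b @ theta                   # 2. feed it to the (linear) model
        # 3. mean gradient of the squared error over the mini-batch
        grad = 2.0 / len(idx) * X_b.T @ (pred - y_b)
        theta = theta - lr * grad            # 4. update the weights
    return theta                             # 5. all mini-batches have been used

# Hypothetical noise-free data generated from y = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0

theta = np.zeros(2)
for _ in range(300):
    theta = minibatch_epoch(theta, X, y, lr=0.2, batch_size=5, rng=rng)
print("learned slope and intercept:", theta)
```

With 20 examples and a batch size of 5, each epoch makes 4 updates instead of the 20 that single-example SGD would make.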
Types - Gradient Descent
Batch GD : θ=θ−η⋅∇θJ(θ)
SGD : θ=θ−η⋅∇θJ(θ;x(i);y(i))
In gradient descent, one tries to reach the minimum of the loss function with
respect to the parameters, using the derivatives calculated in back-propagation.
The easiest way is to adjust each parameter by subtracting its corresponding
derivative multiplied by a learning rate, which regulates how far you move
along the gradient direction.
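The batch update θ = θ − η⋅∇θJ(θ) can be illustrated with a small example in which the gradient is computed over the full dataset at every step. The linear model, mean-squared-error cost, data and learning rate below are assumptions chosen to keep the sketch short.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.5, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the mean squared error over the FULL training set.
        grad = 2.0 / len(y) * X.T @ (X @ theta - y)
        # Parameter update: subtract the gradient scaled by the learning rate.
        theta = theta - lr * grad
    return theta

# Hypothetical noise-free data generated from y = 2x + 1.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0

theta = batch_gradient_descent(X, y)
print("learned slope and intercept:", theta)
```

Too large a learning rate would make the updates overshoot and diverge, while a very small one would need many more steps.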
The three main flavors of gradient descent are batch, stochastic, and mini-batch.
Backpropagation is not a learning method, but rather a computational trick which is often
used in learning methods.
It is a simple application of the chain rule of derivatives, which
gives you the ability to compute all required partial derivatives in linear time.
Networks are typically trained with SGD, using backprop as the gradient-computing technique.
Back Propagation
The goal of back Propagation is to optimize the weights so that the neural network can learn how to correctly map arbitrary inputs to
outputs.
Total Error
Backward Pass
Consider one weight of the network. We want to know how much a change in that weight affects
the total error (the gradient of the total error with respect to that weight).
Next, how much does the output of a neuron change with respect to its total net input?
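The two questions above are the factors of a chain-rule product. The sketch below works this out for a single sigmoid neuron and checks the result against a finite-difference estimate. The squared-error definition, the variable names and all numeric values are assumptions for illustration, not the values from the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x, target):
    net = w * x + b                       # total net input of the neuron
    out = sigmoid(net)                    # output of the neuron
    error = 0.5 * (target - out) ** 2     # assumed squared-error definition
    return net, out, error

# Hypothetical values for one weight, bias, input and target.
w, b, x, target = 0.4, 0.6, 0.5, 0.01
net, out, error = forward(w, b, x, target)

dE_dout = out - target            # how the error changes with the output
dout_dnet = out * (1.0 - out)     # how the output changes with the net input
dnet_dw = x                       # how the net input changes with the weight
dE_dw = dE_dout * dout_dnet * dnet_dw   # chain rule

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = (forward(w + eps, b, x, target)[2]
           - forward(w - eps, b, x, target)[2]) / (2 * eps)
print("chain rule:", dE_dw, "numeric:", numeric)
```

The two numbers agree to many decimal places, which is a quick way to sanity-check a hand-derived gradient.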
What is a gradient ?
Vanishing Gradient Problem
• As more layers using certain activation functions are added to neural networks,
the gradients of the loss function approach zero, making the network hard to
train.
• Certain activation functions, like the sigmoid function, squish a large input
space into a small output space between 0 and 1.
• Therefore, a large change in the input of the sigmoid function will cause only a small
change in the output. Hence, the derivative becomes small.
• When the inputs of the sigmoid function become larger or smaller (when |x|
becomes bigger), the derivative becomes close to zero.
• In networks with few layers and sigmoid activation function, there is
no problem of vanishing gradient
• When more layers are used, it can cause the gradient to be too small
for training to work effectively.
• Gradients of neural networks are found using backpropagation.
• Backpropagation finds the derivatives of the network by moving layer
by layer from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down
the network (from the final layer to the initial) to compute the
derivatives of the initial layers.
• However, when n hidden layers use an activation like the sigmoid
function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to
the initial layers.
• A small gradient means that the weights and biases of the initial layers
will not be updated effectively with each training session
• Since these initial layers are often crucial to recognizing the core
elements of the input data, it can lead to overall inaccuracy of the
whole network.
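The exponential shrinkage described above can be seen with simple arithmetic. The sigmoid derivative is at most 0.25 (at input 0), so a product of n such factors is at most 0.25 to the power n. The sketch below deliberately ignores the weight terms that also appear in the chain rule; that simplification and the chosen depths are assumptions.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Largest possible sigmoid derivative, reached at z = 0.
best_case = sigmoid_derivative(0.0)

# Upper bound on the product of n sigmoid derivatives (weights ignored).
bound_by_depth = {n: best_case ** n for n in (1, 5, 10, 20)}
for n, bound in bound_by_depth.items():
    print(f"{n:2d} layers: gradient factor <= {bound:.3e}")
```

Even in this best case the factor drops below one in a million after 10 layers.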
Ways to detect whether your deep network is suffering from the
vanishing gradient problem: -
The model will improve very slowly during the training phase and it is also
possible that training stops very early, meaning that any further training
does not improve the model.
The weights closer to the output layer of the model would witness more of
a change whereas the layers that occur closer to the input layer would not
change much (if at all).
Model weights shrink exponentially and become very small when training
the model.
• Vanishing gradients usually happen while using the Sigmoid or Tanh activation
functions in the hidden layer units.
• Looking at the function plot below, we can see that when inputs become very
small or very large, the sigmoid function saturates at 0 and 1 and the tanh
function saturates at -1 and 1.
• In both these cases, their derivatives are extremely close to 0.
• We call these ranges/regions of the function “saturating regions” or “bad regions”.
• Thus, if your input lies in any of the saturating regions, then it has almost no
gradient to propagate back through the network.
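To make the saturating regions concrete, the sketch below evaluates the derivatives of sigmoid and tanh at a very small, a central and a very large input. The helper names and the sample inputs are assumptions for this example.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

z = np.array([-10.0, 0.0, 10.0])   # hypothetical inputs: saturated, central, saturated

sig_g = sigmoid_grad(z)
tanh_g = tanh_grad(z)
print("sigmoid derivative:", sig_g)
print("tanh derivative   :", tanh_g)
```

At the central input the derivatives are at their largest (0.25 for sigmoid, 1 for tanh), while at both extremes they are practically zero.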
• Batch normalization can be simply visualized as an additional layer in the
network that normalizes the data (using a mean and standard deviation)
before feeding it into the hidden unit activation function.
• Batch normalization normalizes the input so that |x| tends to lie within
the “good range” (marked as the green region) and doesn’t reach the
outer edges of the sigmoid function.
• If the input is in the good range, then the activation does not saturate,
and thus the derivative also stays in the good range, i.e., the derivative
value isn’t too small.
• Thus, batch normalization prevents the gradients from becoming too
small and makes sure that the gradient signal is heard.
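The effect described above can be sketched by normalizing a batch of pre-activations and comparing the sigmoid derivative before and after. This is a simplified illustration: it omits the learnable scale and shift parameters that batch normalization layers normally include, and the batch statistics used here are made-up values.

```python
import numpy as np

def batch_normalize(z, eps=1e-5):
    # Normalize each feature using the batch mean and standard deviation.
    # eps is an assumed small constant that avoids division by zero.
    mean = z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    return (z - mean) / std

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Hypothetical pre-activations that sit far out in the saturating region.
rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=8.0, scale=3.0, size=(64, 4))

raw_grad = sigmoid_grad(pre_activations).mean()
normalized = batch_normalize(pre_activations)
norm_grad = sigmoid_grad(normalized).mean()

print("mean sigmoid derivative without normalization:", raw_grad)
print("mean sigmoid derivative with normalization   :", norm_grad)
```

After normalization the inputs are centred around zero, where the sigmoid derivative is largest.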
Exploding Gradient Problem
Exploding gradients are a problem where large error gradients accumulate and result in very
large updates to neural network model weights during training.
This results in the model being unstable and unable to learn from the training data.
Ways to detect whether your deep network is suffering from the
exploding gradient problem: -
Model weights grow exponentially and become very large when training the
model.
The model weights become NaN in the training phase.
Approaches to address both vanishing and exploding gradient
problems
1. Reducing the Number of Layers
This solution can be used in both scenarios (exploding and vanishing
gradients). However, by reducing the number of layers in our network, we give up
some of our model's complexity, since having more layers makes the network
more capable of representing complex mappings.
3. Weight Initialization
A more careful choice of the random initialization for your network
tends to be only a partial solution, since it does not solve the problem completely.
Training a NN in Keras
Data Set : Pima Indians Diabetes Data Set
It describes patient medical record data for Pima Indians and whether
they had an onset of diabetes within five years.
It is a binary classification problem (onset of diabetes as 1 or not as 0).
The input variables that describe each patient are numerical and have
varying scales.
The eight input attributes and the class label are listed below:
1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skin fold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index.
7. Diabetes pedigree function.
8. Age (years).
9. Class, onset of diabetes within five years.
Sample records:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
Neural Network Structure
from google.colab import files
uploaded = files.upload()
# first neural network with keras tutorial
import keras
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense
import pandas as pd
df = pd.read_csv("/content/pima-indians-diabetes.csv")
# split into input (X) and output (y) variables
X = df.iloc[:,0:8]
y = df.iloc[:,8]
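# define the keras model
model = Sequential()
#input_layer = Dense(12, input_dim = 8, activation = 'relu')
#model.add(input_layer)
# first hidden layer: 12 neurons, expects the 8 input features
model.add(Dense(12, input_dim=8, activation='relu'))
# second hidden layer: 8 neurons
model.add(Dense(8, activation='relu'))
# output layer: 1 neuron with sigmoid, as required by binary_crossentropy
model.add(Dense(1, activation='sigmoid'))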
# compile the keras model and specify the training parameters of the architecture
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X, y, epochs=150, batch_size=16)
#Output
# evaluate the keras model
_, accuracy = model.evaluate(X, y)
print('Accuracy: %.2f' % (accuracy*100))
#Output
model.get_config()
Calculating the No. of Trainable Parameters
3×4+4×2+1×4+1×2
=3×4+4×2+4+2
= i × h + h × o + h + o
Example 2:
A feed-forward neural network with three hidden layers. Number of units in the
input, first hidden, second hidden, third hidden and output layers are respectively
3, 5, 6, 4 and 2. Calculate the no. of trainable parameters.
Ans :
• Number of connections between the first and second layer: 3 × 5 = 15, which is
nothing but the product of i and h1.
• Number of connections between the second and third layer: 5 × 6 = 30, which is
nothing but the product of h1 and h2.
• Number of connections between the third and fourth layer: 6 × 4 = 24, which is
nothing but the product of h2 and h3.
• Number of connections between the fourth and fifth layer: 4 × 2 = 8, which is
nothing but the product of h3 and o.
• Number of connections between the bias of the first layer and the neurons of
the second layer (except bias of the second layer): 1 × 5 = 5, which is nothing
but h1.
• Number of connections between the bias of the second layer and the neurons
of the third layer: 1 × 6 = 6, which is nothing but h2.
• Number of connections between the bias of the third layer and the neurons of
the fourth layer: 1 × 4 = 4, which is nothing but h3.
• Number of connections between the bias of the fourth layer and the neurons of
the fifth layer: 1 × 2 = 2, which is nothing but o.
• Summing up all:
3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.
• Thus, the total number of parameters in a feed-forward neural network with three
hidden layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3 + o
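The counting rule above generalizes to any number of layers: multiply the sizes of each pair of consecutive layers and add one bias per non-input neuron. The helper below is a small sketch of that rule; the function name is hypothetical.

```python
def count_trainable_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, from input to output.
    """
    # One weight for every connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    # One bias for every neuron outside the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases

small_net = count_trainable_parameters([3, 4, 2])        # i=3, h=4, o=2
deep_net = count_trainable_parameters([3, 5, 6, 4, 2])   # the three-hidden-layer example
print(small_net, deep_net)
```

For the three-hidden-layer example this reproduces the 94 parameters computed by hand.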
Calculate the number of trainable parameters for this model :
• Deep learning neural networks are likely to quickly overfit a training dataset
with few examples.
• A larger/deeper NN is also more likely to overfit and hence to generalize poorly.
• Dropout is a regularization method used to prevent model overfitting.
• It simulates a large number of different network architectures from a
single model by randomly dropping out few neurons from each layer during
each training iteration.
• It is a very computationally cheap and remarkably effective regularization
method to reduce overfitting and improve generalization error in deep
neural networks of all kinds.
• It can be used with most types of layers, such as dense fully
connected layers, convolutional layers, and recurrent layers such as
the long short-term memory network layer.
• Dropout may be implemented on any or all hidden layers in the
network as well as the visible or input layer. It is not used on the
output layer.
• The term “dropout” refers to dropping out units (hidden and visible)
in a neural network.
• Dropout is not used after training when making a prediction with
the fit network.
• The dropout hyperparameter specifies the probability with which outputs
of the layer are dropped out (or, inversely, the probability with which they
are retained).
• The weights of the network will be larger than normal because of
dropout.
• Hence, the weights are scaled down using the chosen dropout rate before the
network is used for prediction.
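A minimal sketch of the dropout behaviour described above: units are dropped at random during training, and at prediction time nothing is dropped but the signal is scaled down. Scaling the activations by the retain probability, as done here, has the same effect as scaling the outgoing weights. The function names and values are assumptions, and many libraries implement an "inverted" variant that rescales during training instead.

```python
import numpy as np

def dropout_train(activations, drop_prob, rng):
    # Keep each unit with probability (1 - drop_prob); dropped units output 0.
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask

def dropout_predict(activations, drop_prob):
    # No units are dropped at prediction time; the output is scaled by the
    # retain probability so its expected size matches training.
    return activations * (1.0 - drop_prob)

rng = np.random.default_rng(0)
activations = np.ones(10_000)   # hypothetical layer outputs
drop_prob = 0.3                 # hypothetical dropout rate

train_out = dropout_train(activations, drop_prob, rng)
predict_out = dropout_predict(activations, drop_prob)

print("fraction of units kept during training:", train_out.mean())
print("scaled output at prediction time      :", predict_out[0])
```

The average training-time output and the scaled prediction-time output are both close to 0.7, which is the point of the rescaling.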
Batch Size
Total number of training examples present in a single batch.