
Introduction to Deep Learning

Deep Learning
DSE 5251, M. Tech Data Science
Dr. Abhilash K Pai
Department of Data Science and Computer Applications
MIT Manipal

Introduction to AI
• Evolution of AI
• Goal: To create machines that think and perceive the world like humans.
• Initially solved problems that can be described by a list of formal mathematical rules.
• Ex- Playing Chess

• Challenge in solving tasks that are easy for people to perform but difficult for people
to describe formally (problems that we solve intuitively)
• Ex- Identifying words, Recognizing people in images

• How to get the informal knowledge into a computer?


• Knowledge base approach
• Hard code knowledge in formal languages
• Computers can reason about statements automatically using logical inference rules
• Ex- Cyc project
• However, none of these projects has led to a major success, and they fail to model intricate aspects of the
world.

Machine Learning
• AI systems need the ability to acquire their own knowledge, by extracting
patterns from raw data. This capability is known as machine learning.

• Machine learning algorithms depend heavily on the representation (collection of features) of the data they are given.

• For many tasks, it is difficult to know what features should be extracted.

Representations matter

Deep Learning
• Solution is to use machine learning to discover not only the mapping
from representation to output but also the representation itself.
• This approach is known as representation learning.

• Learned representations often produce better performance compared to hand-designed representations.

• They also allow AI systems to rapidly adapt to new tasks, with minimal
human intervention.

Deep Learning
• While designing algorithms for learning features the goal is usually to separate
the factors of variation.
• Unobserved objects or unobserved forces in the physical world that affect observable
quantities.
• They may also exist as constructs in the human mind that provide useful simplifying
explanations or inferred causes of the observed data.
• Ex: In voice data – speaker’s accent, gender, age

• However, many factors of variation influence every single piece of data we observe.
• Also, it can be very difficult to extract such high-level, abstract features from raw data.

• Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations.

Deep Learning, Machine Learning and AI

Deep Learning and Machine Learning

(Source: softwaretestinghelp.com)

Learning Multiple Components

Neural Network Examples

Scale drives deep learning progress

DSE 5251 Deep Learning : Reference Materials

1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016.

2. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2018.

3. Course Notes - Neural Networks and Deep Learning, Andrew Ng.

4. Course Notes - Deep Learning, IIT Ropar, Prof. Sudharshan Iyengar.

5. Aurelien Geron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O'Reilly Publications.
Neural Networks: From ground up
Credits:
Most of the content in these slides is adapted from:
CS7015 Deep Learning, Dept. of CSE, IIT Madras
by Dr. Mitesh Khapra

Biological Neuron

Input: a funny Instagram reel

McCulloch-Pitts (MP) Neuron
Implementing Boolean functions using MP Neuron
Perceptron

MP Neuron vs Perceptron
Boolean function using Perceptron : Example

Perceptron Learning Algorithm

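
The algorithm itself appears on the slide as a figure; the sketch below implements the standard perceptron update rule (add the input to the weights when a positive example is misclassified, subtract it when a negative example is), shown here learning the OR function. The variable names and the choice of OR as the example are mine.

import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron learning rule with an appended bias input of 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a constant 1 for the bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            pred = int(np.dot(w, x_i) >= 0)
            if y_i == 1 and pred == 0:
                w += x_i                          # misclassified positive: add the input
            elif y_i == 0 and pred == 1:
                w -= x_i                          # misclassified negative: subtract the input
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])                        # OR function (linearly separable)
w = train_perceptron(X, y)
print("weights:", w)
print("predictions:", [int(np.dot(w, np.append(x, 1)) >= 0) for x in X])
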
The XOR Conundrum

What do we do about functions which are not linearly separable?

Non-linear!
Solving XOR using Multi-Layer Perceptrons (MLP)

Example of MLP

Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.
Going beyond Binary Inputs and Outputs
Need for activation functions

• The thresholding logic used by a perceptron is very harsh!

• E.g., when -w0 = 0.5, even though the output values 0.49 and 0.51 are very close to each other, the perceptron assigns them to different classes.

• This behavior is not a characteristic of the specific problem, weight, or threshold that we chose; it is a characteristic of the perceptron function itself, which behaves like a step function.
Sigmoid Neuron
Other popular activation functions
Representation power of MLP

• We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure).

• We observe that such an arbitrary function can be approximated by several "tower" functions.

• The more such "tower" functions we use, the better the approximation.

• To be more precise, we can approximate any arbitrary function by a sum of such "tower" functions.
• To figure out how to construct such towers, let's consider the example of a sigmoid function.

• If we set w to a very high value, we recover the step function.

• Similarly, adjusting b will shift this curve.

• Now let us see what we get by taking two such sigmoid functions (with different b) and subtracting one from the other.

• Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another?

• So far, we have considered the case where there is only one input. What if we have more than one input?
• This is an open tower. However, we need a closed tower!

• The top portion is a closed tower. But how do we extract that?
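
A minimal numeric sketch (mine, not from the slides) of the 1-D tower construction: a sigmoid with a large w approximates a step, and subtracting two such steps with different b gives a "tower"; summing several towers then approximates an arbitrary function. The values of w and b below are arbitrary illustrations.

import numpy as np

def sigmoid(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(0, 1, 11)

# A large w makes the sigmoid behave like a step; b controls where the step occurs.
step_at_03 = sigmoid(x, w=100, b=-30)    # approximately 1 for x > 0.3
step_at_06 = sigmoid(x, w=100, b=-60)    # approximately 1 for x > 0.6

tower = step_at_03 - step_at_06          # approximately 1 on (0.3, 0.6), 0 elsewhere
print(np.round(tower, 2))
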
A Typical Machine Learning Set-up

Learning Parameters
Gradient Descent
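
The learning-parameters slides are figures, so as a standalone sketch (my own toy data of two points) here is plain gradient descent on a single sigmoid neuron with squared-error loss, following the usual update w := w - eta * dL/dw.

import numpy as np

# Toy data for a single sigmoid neuron y = sigmoid(w*x + b), squared-error loss.
X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def grads(w, b):
    """Accumulate dL/dw and dL/db over the data for L = sum (f(x) - y)^2."""
    dw = db = 0.0
    for x, y in zip(X, Y):
        fx = sigmoid(w * x + b)
        delta = 2 * (fx - y) * fx * (1 - fx)   # chain rule through the sigmoid
        dw += delta * x
        db += delta
    return dw, db

w, b, eta = -2.0, -2.0, 1.0
for epoch in range(1000):
    dw, db = grads(w, b)
    w -= eta * dw                              # gradient descent update
    b -= eta * db

print("w = %.3f, b = %.3f" % (w, b))
print("predictions:", np.round(sigmoid(w * X + b), 2))
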
Feed Forward Neural Networks: Introduction

We need to answer two questions:

• How to choose the loss function?

• How to compute the gradients of the loss with respect to the parameters?

The choice of the loss function depends on the problem at hand.
Feed Forward Neural Networks: Loss and Activation Functions
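
As a small illustration (not from the slides) of how the choice pairs up in practice: squared error with a linear output unit for regression, and cross-entropy with a softmax output for classification. The numbers below are made up.

import numpy as np

# Regression: linear output unit + squared error loss.
y_true, y_pred = 3.2, 2.5
squared_error = (y_true - y_pred) ** 2

# Classification: softmax output + cross-entropy loss.
def softmax(z):
    z = z - np.max(z)                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])          # pre-activation scores for 3 classes
probs = softmax(logits)
true_class = 0
cross_entropy = -np.log(probs[true_class])  # -log of the probability of the true class

print("squared error:", squared_error)
print("softmax probs:", np.round(probs, 3), "cross-entropy:", round(float(cross_entropy), 3))
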
Backpropagation in ANN: Recap

Feed Forward Neural Networks: Backpropagation
$a_{i+1} = W_{i+1} h_i + b_{i+1}$


Derivatives of activation functions: g’(z)
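
A short sketch (standard formulas, not copied from the slide figures) of g'(z) for the common activations, each expressed in terms of quantities already available from the forward pass.

import numpy as np

def sigmoid(z):            return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):          a = sigmoid(z); return a * (1 - a)   # g'(z) = g(z) * (1 - g(z))

def tanh(z):               return np.tanh(z)
def d_tanh(z):             return 1 - np.tanh(z) ** 2           # g'(z) = 1 - g(z)^2

def relu(z):               return np.maximum(0, z)
def d_relu(z):             return (z > 0).astype(float)         # g'(z) = 1 for z > 0, else 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print("sigmoid'(z):", np.round(d_sigmoid(z), 3))
print("tanh'(z)   :", np.round(d_tanh(z), 3))
print("relu'(z)   :", d_relu(z))
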
Gradient descent and its variants

• Local Minima

• Saddle Points

• Contour Plot

• Plateaus and Flat Regions
Momentum-based Gradient Descent
Nesterov Accelerated Gradient Descent

Intuition: Look before you leap!

• Recall that in momentum-based gradient descent the update is $u_t = \gamma u_{t-1} + \eta \nabla w_t$ and $w_{t+1} = w_t - u_t$.

• So the movement is at least by $\gamma u_{t-1}$ and then a bit more by $\eta \nabla w_t$; NAG therefore evaluates the gradient at the look-ahead point $w_t - \gamma u_{t-1}$ before moving.
Vanilla (Batch) Gradient Descent

What is the issue here? The gradient is computed over the entire training set before a single parameter update is made, which is slow for large datasets.
Stochastic Gradient Descent

Mini-Batch Stochastic Gradient Descent
Choosing a learning rate

• If the learning rate is too small, it takes a long time to converge. If the learning rate is too large, the updates can overshoot and the loss may explode.

Some techniques for choosing the learning rate:

• Linear search
• Annealing-based methods:

  • Step Decay:
    • Halve the learning rate after every 5 epochs, or
    • Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch.

  • Exponential Decay: $\eta_t = \eta_0 e^{-kt}$, where $\eta_0$ and $k$ are hyperparameters and $t$ is the step number.

  • 1/t Decay: $\eta_t = \frac{\eta_0}{1 + kt}$, where $\eta_0$ and $k$ are hyperparameters and $t$ is the step number.
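
A small sketch of the decay schedules written as code, using the forms above; eta0 and k are hypothetical values chosen only for illustration.

import math

eta0, k = 0.1, 0.05

def step_decay(eta, val_err, prev_val_err):
    """Halve the learning rate if the validation error did not improve this epoch."""
    return eta / 2.0 if val_err > prev_val_err else eta

def exponential_decay(t):
    return eta0 * math.exp(-k * t)         # eta_t = eta_0 * e^(-k t)

def one_by_t_decay(t):
    return eta0 / (1.0 + k * t)            # eta_t = eta_0 / (1 + k t)

for t in (0, 10, 50, 100):
    print(t, round(exponential_decay(t), 4), round(one_by_t_decay(t), 4))
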
GD with adaptive learning rate

• Motivation: Can we have a different learning rate for each parameter, which takes care of the frequency of features?

• Intuition: Decay the learning rate for parameters in proportion to their update history.
  • For sparse features, the accumulated update history is small.
  • For dense features, the accumulated update history is large.

• Make the learning rate inversely proportional to the update history, i.e., if the feature has been updated fewer times, give it a larger learning rate, and vice versa.
Adagrad

• Update rule for Adagrad:

  $v_t = v_{t-1} + (\nabla w_t)^2$
  $w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla w_t$

• If the feature has been updated fewer times, give it a larger learning rate, and vice versa.
RMSProp

• Intuition: Adagrad decays the learning rate very aggressively (as the denominator keeps growing).

• Update rule for RMSProp: replace the accumulated sum of squared gradients with an exponentially weighted moving average (weighted decay):

  $v_t = \beta v_{t-1} + (1 - \beta)(\nabla w_t)^2$
  $w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla w_t$
Adam
• Adding momentum to RMSProp
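
A compact sketch of the three adaptive update rules on a single parameter (standard textbook forms; the toy objective, epsilon and the beta values are the usual illustrative defaults, not values from the slides).

import math

grad = lambda w: 2.0 * (w - 3.0)          # toy objective f(w) = (w - 3)^2
eta, eps, steps = 0.5, 1e-8, 200

# Adagrad: accumulate all squared gradients.
w, v = 0.0, 0.0
for _ in range(steps):
    g = grad(w)
    v += g * g
    w -= eta / math.sqrt(v + eps) * g
print("adagrad:", round(w, 4))

# RMSProp: exponentially weighted moving average of squared gradients.
w, v, beta = 0.0, 0.0, 0.9
for _ in range(steps):
    g = grad(w)
    v = beta * v + (1 - beta) * g * g
    w -= eta / math.sqrt(v + eps) * g
print("rmsprop:", round(w, 4))

# Adam: momentum on the gradient (m) plus RMSProp-style scaling (v), with bias correction.
w, m, v, b1, b2 = 0.0, 0.0, 0.0, 0.9, 0.999
for t in range(1, steps + 1):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w -= eta / (math.sqrt(v_hat) + eps) * m_hat
print("adam   :", round(w, 4))
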
DSE 5251 DEEP LEARNING

Dr. Abhilash K Pai


Assistant Professor,
Dept. of Data Science and Computer Applications
MIT Manipal
The Convolution Operation - 1D
▪ Convolution is a linear operation on two functions of a real-valued argument, where one function is slid over the other and the element-wise products are summed.

▪ Example: Consider a discrete signal x_t which represents the position of a spaceship at time t, recorded by a laser sensor.

▪ Now, suppose that this sensor is noisy.

▪ To obtain a less noisy measurement, we would like to average several measurements.

▪ Considering that the most recent measurements are more important, we take a weighted average over x_t. The new estimate at time t is computed as the convolution of the input x with the filter (mask/kernel) w:

  $s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.

  For example: $s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$

▪ We just slide the filter over the input and compute the value of s_t based on a window around x_t:

  w (w_{-6} ... w_0):  0.01  0.01  0.02  0.02  0.04  0.4  0.5
  x:                   1.0   1.10  1.20  1.40  1.70  1.80  1.90  2.10  2.20
  s:                                                       1.80  1.96  2.11

▪ Use cases of 1-D convolution: audio signal processing, stock market analysis, time series analysis, etc.

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras
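
The sliding-window computation above can be reproduced with a few lines of NumPy (a sketch, not from the slides); np.correlate slides the window without flipping it, which matches the way the weights are written on the slide.

import numpy as np

# Filter written as [w_-6, ..., w_-1, w_0], input as on the slide.
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])
x = np.array([1.0, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])

# s_t = sum over the window ending at x_t of x_{t-a} * w_{-a}
s = np.correlate(x, w, mode='valid')
print(np.round(s, 2))   # ~[1.81 1.97 2.11]; the slide shows 1.80, 1.96, 2.11 after truncation
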


The Convolution Operation - 2D
▪ Images are good examples of 2-D inputs.

▪ A 2-D convolution of an image I using a filter K of size m x n is defined as:

  $S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$

▪ However, the following (cross-correlation) is used in practice:

  $S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$
The Convolution Operation - 2D
▪ Now, if we consider the center pixel as the pixel of interest, the 2-D convolution equation is as follows:

  $S_{ij} = (I * K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\, j-b}\, K_{\lfloor m/2 \rfloor + a,\, \lfloor n/2 \rfloor + b}$

▪ Example 5 x 5 input (the center element is the pixel of interest):

  0 1 0 0 1
  0 0 1 1 0
  1 0 0 0 1
  0 1 0 0 1
  0 0 1 0 1

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

The Convolution Operation - 2D

(Figure: animation of a filter sliding over an input image. Source: https://developers.google.com/)


The Convolution Operation - 2D : Example filters

▪ Smoothening filter
▪ Sharpening filter
▪ Filter for edge detection

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


The Convolution Operation - 2D : Various filters (edge detection)

Prewitt:
  Sx = [ -1 0 1 ; -1 0 1 ; -1 0 1 ]      Sy = [ 1 1 1 ; 0 0 0 ; -1 -1 -1 ]

Sobel:
  Sx = [ -1 0 1 ; -2 0 2 ; -1 0 1 ]      Sy = [ 1 2 1 ; 0 0 0 ; -1 -2 -1 ]

Laplacian:
  [ 0 1 0 ; 1 -4 1 ; 0 1 0 ]

Roberts:
  Sx = [ 0 1 ; -1 0 ]                    Sy = [ 1 0 ; 0 -1 ]

(Figure: input image shown after applying the horizontal and vertical edge detection filters.)
The Convolution Operation - 2D

Filter 1:
   1 -1 -1
  -1  1 -1
  -1 -1  1

Input image (6 x 6), stride = 1:
  1 0 0 0 0 1
  0 1 0 0 1 0
  0 0 1 1 0 0
  1 0 0 0 1 0
  0 1 0 0 1 0
  0 0 1 0 1 0

Taking the dot product of the filter with the first 3 x 3 window gives 3; sliding one step to the right gives -1.

Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
The Convolution Operation - 2D

With the same Filter 1 and 6 x 6 input but stride = 2, the first two outputs along the top row are 3 and -3.

Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
The Convolution Operation - 2D

Convolving the 6 x 6 input image with Filter 1 at stride = 1 gives a 4 x 4 feature map:

   3 -1 -3 -1
  -3  1  0 -3
  -3 -3  0  1
   3 -2 -2 -1


The Convolution Operation - 2D

Repeat for each filter! Convolving the same 6 x 6 input with Filter 2 (stride = 1):

Filter 2:
  -1  1 -1
  -1  1 -1
  -1  1 -1

gives a second 4 x 4 feature map:

  -1 -1 -1 -1
  -1 -1 -2  1
  -1 -1 -2  1
  -1  0 -4  3

The two 4 x 4 feature maps together form a 4 x 4 x 2 output.
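
A direct implementation of the stride-1, no-padding cross-correlation used above (a sketch of my own): it reproduces the two 4 x 4 feature maps, e.g. the value 3 in the top-left corner for Filter 1.

import numpy as np

def conv2d(image, kernel, stride=1):
    """Strided 'valid' cross-correlation, as used in practice on the slides."""
    m, n = kernel.shape
    H = (image.shape[0] - m) // stride + 1
    W = (image.shape[1] - n) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            window = image[i*stride:i*stride+m, j*stride:j*stride+n]
            out[i, j] = np.sum(window * kernel)     # element-wise product, then sum
    return out

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])
filter2 = np.array([[-1, 1,-1],
                    [-1, 1,-1],
                    [-1, 1,-1]])

print(conv2d(image, filter1))    # 4 x 4 feature map; top-left value is 3
print(conv2d(image, filter2))    # second 4 x 4 feature map
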


The Convolution Operation - RGB Images

▪ Apply the filter to the R, G, and B channels of the image and combine the resultant feature maps to obtain a 2-D feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science

The Convolution Operation - RGB Images, multiple filters

▪ Convolving the input with K filters (Filter 1, Filter 2, ..., Filter K) produces K feature maps.

▪ Depth of the output feature map = No. of feature maps = No. of filters.


The Convolution Operation : Terminologies

1. Depth of an input image = No. of channels in the input image = Depth of a filter

2. Assuming square filters, the spatial extent (F) of a filter is the size of the filter.

The Convolution Operation : Zero Padding

▪ A 3x3 convolution over a 4x4 input gives only a 2x2 output.

▪ Pad zeros around the input and then convolve to obtain a feature map with dimension = input image dimension.


The Convolution Operation : Zero Padding

▪ With zero padding, a 5x5 input image gives a 5x5 feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science


Relation between input size, feature map size, and filter size

For an input image of size W1 x H1 x D1, convolved with K filters of spatial extent F, stride S, and zero padding P, the output feature map has size W2 x H2 x D2, where:

  $W_2 = \frac{W_1 - F + 2P}{S} + 1$

  $H_2 = \frac{H_1 - F + 2P}{S} + 1$

  $D_2 = K$
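
The relation can be wrapped in a small helper (a sketch; the example numbers below come from the stride-1 example earlier, where a 6 x 6 input and a 3 x 3 filter give a 4 x 4 map).

def conv_output_size(W1, H1, F, S=1, P=0, K=1):
    """W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

print(conv_output_size(6, 6, F=3, S=1, P=0, K=2))   # (4, 4, 2), as in the earlier example
print(conv_output_size(5, 5, F=3, S=1, P=1, K=1))   # (5, 5, 1): zero padding preserves size
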


Convolutional Neural Network (CNN) : At a glance

Convolution -> Pooling -> ... (can repeat many times) -> Flattened -> Fully connected feedforward network -> cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Pooling

Given the two 4 x 4 feature maps produced by Filter 1 and Filter 2, pooling summarises each local region of a feature map:

• Max Pooling
• Average Pooling

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.


Pooling

Max. Pooling and Average Pooling are applied over windows of the feature map; like a convolution filter, the pooling window is slid with a chosen stride.
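
A sketch of 2x2 max and average pooling with stride 2 (my own helper, not from the slides), applied to the 4 x 4 feature map obtained from Filter 1 earlier.

import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride and take the max or the mean."""
    H = (fmap.shape[0] - size) // stride + 1
    W = (fmap.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = reduce_fn(fmap[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

fmap = np.array([[ 3,-1,-3,-1],
                 [-3, 1, 0,-3],
                 [-3,-3, 0, 1],
                 [ 3,-2,-2,-1]])

print(pool2d(fmap, mode="max"))   # [[3, 0], [3, 1]]
print(pool2d(fmap, mode="avg"))   # [[0, -1.75], [-1.25, -0.5]]
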


Why Pooling ?

▪ Subsampling pixels will not change the object (a subsampled bird is still a bird).

▪ We can subsample the pixels to make the image smaller.

▪ Therefore, fewer parameters are needed to characterize the image.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Important properties of CNN

▪ Sparse Connectivity

▪ Shared weights

▪ Equivariant representation



Properties of CNN

Each value of the 4 x 4 feature map is connected to only 9 inputs of the 6 x 6 image (the 3 x 3 window covered by Filter 1), not to all of them: this is Sparse Connectivity, and it means far fewer parameters than a fully connected layer.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Properties of CNN

Is sparse connectivity good?

Ian Goodfellow et al., 2016


Properties of CNN

The same filter values (weights) are reused at every position of the 6 x 6 image: Shared Weights, giving even fewer parameters.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Equivariance to translation

▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)) or if the output changes in the same way as the
input.

▪ This is achieved by the concept of weight sharing.

▪ As the same weights are shared across the image, if an object occurs anywhere in the image it will be detected irrespective of its position in the image.

Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science



CNN vs Fully Connected NN

▪ A CNN compresses the fully connected NN in two ways:

▪ Reducing the number of connections

▪ Shared weights

▪ Max pooling further reduces the parameters to characterize an image.



Convolutional Neural Network (CNN) : Non-linearity with activation

(Convolution + ReLU) -> Pooling -> ... -> Flattened -> Fully connected feedforward network -> cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
LeNet-5 Architecture for handwritten text recognition

Layer hyperparameters and parameter counts:

• Conv1: S=1, F=5, K=6, P=2        #Param = ((5*5*1)+1) * 6 = 156
• Pool1: S=2, F=2, K=6, P=0        #Param = 0
• Conv2: S=1, F=5, K=16, P=0       #Param = ((5*5*6)+1) * 16 = 2416
• Pool2: S=2, F=2, K=16, P=0       #Param = 0
• Conv3/FC (120 units):            #Param = ((5*5*16)*120) + 120 = 48120
• FC (84 units):                   #Param = (84*120) + 84 = 10164
• Output (10 units):               #Param = (84*10) + 10 = 850

Activations: tanh in the earlier layers, sigmoid at the output.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
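
A minimal Keras sketch of this architecture (assuming 28x28x1 inputs with padding="same" on the first convolution, which is equivalent to the 32x32 padded input of the original paper); model.summary() reproduces the per-layer parameter counts listed above (156, 2416, 48120, 10164, 850).

import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of LeNet-5; layer settings follow the S/F/K/P values on the slide.
model = tf.keras.Sequential([
    layers.Conv2D(6, kernel_size=5, padding="same", activation="tanh",
                  input_shape=(28, 28, 1)),                # 156 parameters
    layers.AveragePooling2D(pool_size=2, strides=2),       # 0 parameters
    layers.Conv2D(16, kernel_size=5, activation="tanh"),   # 2416 parameters
    layers.AveragePooling2D(pool_size=2, strides=2),       # 0 parameters
    layers.Conv2D(120, kernel_size=5, activation="tanh"),  # 48120 parameters
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),                   # 10164 parameters
    layers.Dense(10, activation="sigmoid"),                # 850 parameters
])
model.summary()
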


LeNet-5 Architecture for handwritten number recognition

Source: http://yann.lecun.com/



ImageNet Dataset

More than 14 million images, 22,000 image categories.

Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
ImageNet Large Scale Visual Recognition Challenge

• 1000 ImageNet categories

(Figure: winning entries by year, including ZFNet.)


AlexNet (2012)

▪ Used the ReLU activation function instead of sigmoid and tanh.

▪ Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.

▪ Implemented dropout layers.


AlexNet Architecture

Parameter counts per convolution layer (pooling layers have 0 parameters):

• Conv1: ((11*11*3)+1) * 96  = 34944
• Conv2: ((5*5*96)+1) * 256  = 614656
• Conv3: ((3*3*256)+1) * 384 = 885120
• Conv4: ((3*3*384)+1) * 384 = 1327488
• Conv5: ((3*3*384)+1) * 256 = 884992

Total #Param (including the fully connected layers): ~62M

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
ZFNet Architecture (2013)

• Used filters of size 7x7 instead of 11x11 in AlexNet.

• Used Deconvnet to visualize the intermediate results.

Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
ZFNet

Visualizing and Understanding Deep Neural Networks by Matt Zeiler (YouTube talk)


VGGNet Architecture (2014)

• Used filters of size 3x3 in all the convolution layers.

• 3 conv layers back-to-back have an effective receptive field of 7x7 (two stacked 3x3 convolutions cover 5x5, three cover 7x7).

• Also called VGG-16 as it has 16 weight layers.

• This work reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work.

Image Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR14).


GoogLeNet Architecture (2014)

• Most of the architectures discussed till now apply either of the following after each convolution operation:
  • Max Pooling
  • 3x3 convolution
  • 5x5 convolution

• Idea: Why can't we apply them all together at the same time and concatenate the feature maps?

• Problem: This would result in a large number of computations.

• Specifically, each element of the output requires O(F x F x D) computations.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


GoogLeNet Architecture (2014)

• Solution: Apply 1x1 convolutions.

• A 1x1 convolution aggregates along the depth.

• So, if we apply D1 1x1 convolutions (D1 < D), we will get an output of size W x H x D1.

• The total number of computations per output element then reduces to O(F x F x D1).

• We could then apply the subsequent 3x3 and 5x5 filters on this reduced output.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
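
A back-of-the-envelope sketch of the saving (the concrete sizes are hypothetical, chosen only to make the O(F*F*D) vs O(F*F*D1) comparison tangible): multiplications needed for a 5x5 convolution on a 28x28x192 input, with and without first reducing the depth to 16 using 1x1 convolutions.

# Hypothetical inception-style example: 28x28x192 input, 32 output maps of a 5x5 conv.
W, H, D = 28, 28, 192      # input width, height, depth
F, K = 5, 32               # 5x5 filters, 32 of them
D1 = 16                    # depth after the 1x1 "bottleneck"

direct = (W * H * K) * (F * F * D)                 # each output element costs O(F*F*D)
bottleneck = (W * H * D1) * (1 * 1 * D) \
           + (W * H * K) * (F * F * D1)            # 1x1 reduction + cheaper 5x5 conv

print(f"direct 5x5      : {direct:,} multiplications")
print(f"1x1 then 5x5    : {bottleneck:,} multiplications")
print(f"reduction factor: {direct / bottleneck:.1f}x")
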


GoogLeNet Architecture (2014)

• Also, we might want to use different dimensionality reductions (different numbers of 1x1 convolutions) before the 3x3 and 5x5 filters.

• We can also add a max-pooling layer followed by a 1x1 convolution.

• After this, we concatenate all these layers. This is called the Inception module.

• GoogLeNet contains many such Inception modules.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


GoogLeNet Architecture (2014)

• Uses global average pooling instead of flattening before the classifier.

• 12 times fewer parameters and 2 times more computations than AlexNet.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


ResNet Architecture (2015)

Effect of increasing the number of layers of a shallow CNN, experimented on the CIFAR dataset: the deeper plain network (shallow CNN + additional layers) performs worse than the shallow CNN itself.

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).


ResNet Architecture (2015)

ResNet-34

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
ResNet Architecture (2015)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Sequence Models

DSE 3151 DEEP LEARNING

Dr. Abhilash K Pai


Assistant Professor,
Dept. of Data Science and Computer Applications
MIT Manipal
Examples of Sequence Data

▪ Speech Recognition (e.g., audio -> "Mary had a little lamb")

▪ Music Generation

▪ Sentiment Classification (e.g., "It's an average movie")

▪ DNA Sequence Analysis (e.g., AGCCCCTGTGAGGAACTAG)

▪ Machine Translation (e.g., ARE YOU FEELING SLEEPY -> क्या आपको नींद आ रही है)

▪ Video Activity Recognition (e.g., WAVING)

▪ Named Entity Recognition (e.g., identifying "Alice" and "Bob" in "Alice wants to discuss Deep Learning with Bob")


Issues with using ANN/CNN on sequential data

• In feedforward and convolutional neural networks the size of the input was always fixed.

• Further, each input to the network was independent of the previous or future inputs.

• In many applications with sequence data, the input is not of a fixed size.

• Further successive inputs may not be independent of each other.



Modelling Sequence Learning Problems: Introduction

Example tasks: Auto-complete, Part-of-Speech tagging, Movie review (sentiment), Action recognition (some outputs may be "don't care").

• The model needs to look at a sequence of inputs and produce an output (or outputs).

• For this purpose, let's consider each input to correspond to one time step.

• Next, build a network for each time step/input, where each network performs the same task (e.g., auto-complete: input = character, output = character).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


How to Model Sequence Learning Problems?

1. Model the dependence between inputs.
   • E.g.: The next word after an 'adjective' is most probably a 'noun'.

2. Account for a variable number of inputs.
   • A sentence can have an arbitrary number of words.
   • A video can have an arbitrary number of frames.

3. Make sure that the function executed at each time step is the same.
   • Because at each time step we are doing the same task.


Modelling Sequence Learning Problems using Recurrent Neural Networks (RNN): Introduction

• Consider the network at each time step to be a fully connected network that computes the output of that time step from its input alone.

• Since we want the same function to be executed at each time step, we should share the same network (i.e., the same parameters at each time step).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN): Introduction

• If the input sequence is of length 'n', we would create 'n' networks, one for each input, as seen previously.

• But how do we model the dependencies between the inputs?

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN)

Solution: Add a recurrent connection in the network.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN)

• So, the RNN equation:

  $s_i = \sigma(x_i U + s_{i-1} W + b)$
  $y_i = \mathcal{O}(s_i V + c)$

  where U, W, V, b, c are the parameters of the network, shared across time steps.

• The dimensions of each term are as follows:

  x_i : [1 x no. of input neurons]
  s_i : [1 x no. of neurons in the hidden state]
  W   : [no. of neurons in the hidden state x no. of neurons in the hidden state]
  U   : [no. of input neurons x no. of neurons in the hidden state]
  V   : [no. of neurons in the hidden state x no. of neurons in the output state]
  b   : [1 x no. of neurons in the hidden state]
  c   : [1 x no. of neurons in the output state]

• At time step i = 0 there are no previous inputs, so the initial state is typically assumed to be all zeros.

• Since the output s_i at time step i is a function of all the inputs from previous time steps, we could say it has a form of memory.

• A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
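
A sketch of the forward pass implied by these equations and dimensions (row-vector convention as in the list above; the sizes and random parameters are my own toy choices): the same U, W, V, b, c are reused at every time step, and the state s is initialised to zeros.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5      # sizes and sequence length (arbitrary)

# Shared parameters, one set for all time steps.
U = rng.normal(size=(n_in, n_hidden))      # input  -> hidden
W = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (recurrence)
V = rng.normal(size=(n_hidden, n_out))     # hidden -> output
b = np.zeros((1, n_hidden))
c = np.zeros((1, n_out))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x_seq = rng.normal(size=(T, 1, n_in))      # a toy input sequence
s = np.zeros((1, n_hidden))                # s_0: no previous input, so all zeros

for t in range(T):
    s = np.tanh(x_seq[t] @ U + s @ W + b)  # s_t = sigma(x_t U + s_{t-1} W + b)
    y = softmax(s @ V + c)                 # y_t = O(s_t V + c)
    print("t =", t, "y_t =", np.round(y, 2))
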


Recurrent Neural Networks (RNN)

• Compact representation of an RNN: the recurrent connection is drawn as a loop from the hidden state back to itself.

• Unrolling the network through time = representing the network against the time axis (the same representation as seen previously).

• At each time step t (also called a frame), the RNN receives the input x_t as well as the output from the previous step, y_{t-1}.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Input and Output Sequences

• Seq-to-Seq
• Vector-to-Seq
• Seq-to-Vector


Recurrent Neural Networks (RNN) : Example

• Problem: Given the temperatures of yesterday and today, predict tomorrow's temperature.

• Unroll the feedback loop by making a copy of the network for each input value.

• Problem: Given the temperatures of 3 days (today, yesterday, and the day before yesterday), predict tomorrow's temperature.

• So, the number of unrolled networks = the number of inputs.

Source: https://www.youtube.com/c/joshstarmer
UNIT-4
Deep Learning: Basics of Deep Learning, Machine Learning Vs Deep
Learning, Fundamental Deep Learning Algorithm-Convolution Neural
Network (CNN).

Q) Describe the Motivation for Deep Learning.


Simple machine learning algorithms work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects. Deep learning was designed to overcome these and other obstacles.

Q) Define Deep Learning(DL).


Deep learning is an aspect of artificial intelligence (AI) that aims to simulate the activity of the human brain, specifically pattern recognition, by passing input through various layers of a neural network.
Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs.

Q)Give brief historical background of Deep Learning.


All the algorithms that are used in deep learning are largely inspired by the way neurons and neural networks function and process data in the brain.
This image is one of the very first pictures of a neuron. It was drawn by Santiago Ramon y Cajal back in 1899, based on what he saw after placing a pigeon's brain under the microscope. He is now known as the father of modern neuroscience.

Fig. Human Neural Functioning

It is possible to mimic certain parts of neurons, such as dendrites, cell bodies and axons, using simplified mathematical models of what limited knowledge we have of their inner workings: signals can be received from the dendrites and sent down the axon once enough signals have been received. This outgoing signal can then be used as another input for other neurons, repeating the process. Some signals are more important than others and can trigger some neurons to fire more easily. Connections can become stronger or weaker, and new connections can appear while others cease to exist.

Fig. Biological Neuron


An artificial neuron behaves in the same way as a biological neuron. It consists of a soma (cell body, for processing information), dendrites (inputs), and an axon terminal to pass on the output of this neuron to other neurons. The end of the axon can branch off to connect to many other neurons.

Q) Differentiate a Biological neuron and and Artificial neuron.


Biological neuron          Artificial neuron
dendrites                  inputs
synapses                   weights or interconnections
axon                       output
cell body (soma)           summation and threshold

Q) Define Artificial Neuron(AN). Explain the computation/processing of


AN with an example.

Neural Networks are networks used in Deep Learning that work similar to
the human nervous system.
An artificial neuron is a mathematical function conceived as a model
of biological neurons, a neural network. Artificial neurons are elementary
units in an artificial neural network.
The artificial neuron receives one or more inputs and sums them to produce
an output by applying some activation function.
Fig. Artificial Neuron

For the above general model of an artificial neural network, the net input can be calculated as follows:

  y_in = (x1*w1 + x2*w2 + x3*w3 + ... + xm*wm) + bias
  i.e., y_in = Σ (xi*wi) + bias, summed over i = 1..m

where Xi is the set of features and Wi is the set of weights.
Bias is the information which can impact the output without being dependent on any feature.
The output can be calculated by applying the activation function over the net input:

  Y = F(y_in)

Each artificial neuron has an internal state, which is called an activation signal. Output signals, which are produced after combining the input signals and the activation rule, may be sent to other units.
Q) Construct a single layer neural network for implementing OR, AND,
NOT gates.

Let us take the activation function to be the binary step function: f(x) = 1 if x > 0, and 0 otherwise.

The AND function can be implemented with weights w1 = w2 = 1 and bias = -1.5.

The output of this neuron is: a = f( -1.5*1 + x1*1 + x2*1 )

Calculation for the summation:
  x1 = 0, x2 = 0  =>  f(-1.5 + 0 + 0) = f(-1.5) = 0
  x1 = 0, x2 = 1  =>  f(-1.5 + 0 + 1) = f(-0.5) = 0
  x1 = 1, x2 = 0  =>  f(-1.5 + 1 + 0) = f(-0.5) = 0
  x1 = 1, x2 = 1  =>  f(-1.5 + 1 + 1) = f(+0.5) = 1

The truth table for this implementation is that of AND.

The OR function can be implemented with weights w1 = w2 = 1 and bias = -0.5.

The output of this neuron is: a = f( -0.5 + x1 + x2 )

The truth table for this implementation is that of OR.

The NOT function can be implemented with weight w1 = -2 and bias = 1.

The output of this neuron is: a = f( 1 - 2*x1 )

The truth table for this implementation is that of NOT.
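
Since the truth tables above are given as figures, here is a short script (following the exact expressions in these notes) that prints them.

# Verify the single-neuron AND, OR and NOT implementations from the notes.
def f(x):
    """Binary step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

AND = lambda x1, x2: f(-1.5 + x1 * 1 + x2 * 1)
OR  = lambda x1, x2: f(-0.5 + x1 + x2)
NOT = lambda x1:     f(1 - 2 * x1)

print("x1 x2 | AND OR")
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, " |", AND(x1, x2), " ", OR(x1, x2))

print("x1 | NOT")
for x1 in (0, 1):
    print(x1, " |", NOT(x1))
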

Q) Explain the need for multi-layered neural network with an example.

1. XOR:
XOR(A,B) = (A+B)*(AB)'
This sort of relationship cannot be modeled using a single neuron. Thus we will use a multi-layer network.
The idea behind using multiple layers is that complex relations can be broken into simpler functions and combined.
2. XNOR:
Let's break down the XNOR function.

X1 XNOR X2 = NOT ( X1 XOR X2 )
           = NOT [ (A+B).(A'+B') ]
           = (A+B)' + (A'+B')'
           = (A'.B') + (A.B)

A neuron to model A'.B':
The output of this neuron is: a = f( 0.5 - x1 - x2 )
The truth table for this function is that of A'.B'.

The different outputs represent different units:
a1: implements A'.B'
a2: implements A.B
a3: implements OR, which works on a1 and a2, thus effectively (A'.B' + A.B)
The functionality can be verified using the truth table.
Q) Define activation function. Explain different types of activation functions.

• Activation Functions are an extremely important feature of Artificial Neural Networks. They basically decide whether a neuron should be activated or not, and they limit the output signal to a finite value.

• The activation function performs a non-linear transformation of the input, making the network capable of learning more complex relations between input and output, i.e., more complex patterns.

• Without an activation function, the neural network is just a linear regression model, as it performs only a summation of the products of inputs and weights.

E.g., in the figure below, image 2 requires a complex (curved) relation, unlike the simple linear relation in image 1.

Fig. Illustrating the need of an Activation Function for a complex problem.

An activation function must be efficient and should reduce the computation time, because a neural network is sometimes trained on millions of data points.

Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions

1. Binary Step Function


A binary step function is a threshold-based activation function. If the input
value is above or below a certain threshold, the neuron is activated and
sends exactly the same signal to the next layer.We decide some threshold
value to decide output that neuron should be activated or deactivated.It is
very simple and useful to classify binary problems or classifier.
Eg.f(x) = 1 if x > 0 else 0 if x <= 0

2. Linear or Identity Activation Function


As you can see the function is a line or linear. Therefore, the output of the
functions will not be confined between any range.

Fig: Linear Activation Function


Equation: f(x) = x
Range : (-infinity to infinity)
It doesn‟t help with the complexity or various parameters of usual data that
is fed to the neural networks
3. Non-linear Activation Function
The non-linear activation functions are the most used activation functions. Nonlinearity helps make the graph look something like this.

Fig: Non-linear Activation Function

The main terminologies needed to understand nonlinear functions are:
Derivative or Differential: change in the y-axis w.r.t. change in the x-axis. It is also known as the slope.
Monotonic function: a function which is either entirely non-increasing or non-decreasing.

The non-linear activation functions are mainly divided on the basis of their range or curves.

Advantages of non-linear functions over the linear function:
• Differentiation is possible for all the non-linear functions.
• Stacking of networks is possible, which helps us in creating deep neural nets.
• It makes it easy for the model to generalize.

3.1 Sigmoid(Logistic AF)(σ):


The main reason why we use the sigmoid function is that its output lies between 0 and 1.
It is especially used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

Fig: Sigmoid Function (S-shaped Curve)


The function is differentiable and monotonic, but its derivative is not monotonic.
The logistic sigmoid function can cause a neural network to get stuck during training.
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the output of each
neuron.
Disadvantages:
 Vanishing gradient—for very high or very low values of X, there is
almost no change to the prediction, causing a vanishing gradient
problem. This can result in the network refusing to learn further, or
being too slow to reach an accurate prediction.
 Outputs not zero centered.
 Computationally expensive
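A minimal NumPy sketch (illustrative) of the sigmoid and its derivative; the nearly flat derivative for large |x| is what causes the vanishing gradient mentioned above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)          # peaks at 0.25 when x = 0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, round(float(sigmoid(x)), 4), round(float(sigmoid_derivative(x)), 6))
# at x = +/-10 the derivative is about 0.000045, i.e. the gradient has nearly vanished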

3.2 TanH(Hyperbolic Tangent AF):

TanH is like the logistic sigmoid, but works in a better way. The range of the TanH function is from -1 to +1.

TanH is often preferred over the sigmoid neuron because it is zero centered. The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero on the tanh graph.

tanh(x) = 2 * sigmoid(2x) - 1

Fig. Sigmoid Vs Tanh

The function is differentiable and monotonic, but its derivative is not monotonic.
Advantages
 Zero centered—making it easier to model inputs that have strongly
negative, neutral, and strongly positive values.
Disadvantages
 Like the Sigmoid function, it also suffers from the vanishing gradient problem
 Hard to train on small datasets
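The relation tanh(x) = 2 * sigmoid(2x) - 1 quoted above can be checked numerically; a short sketch (illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    lhs = np.tanh(x)
    rhs = 2 * sigmoid(2 * x) - 1       # the identity quoted above
    print(x, round(float(lhs), 6), round(float(rhs), 6))   # the two values match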

3.3 ReLU(Rectified Linear Unit):

The ReLU is the most used activation function. It is used in almost all convolutional neural networks, in the hidden layers only.
The ReLU is half rectified (from the bottom): f(z) = 0 if z < 0, and f(z) = z otherwise, i.e.
R(z) = max(0, z)
The range is 0 to infinity.

Advantages
 Avoids vanishing gradient problem.
 Computationally efficient—allows the network to converge very
quickly
 Non-linear—although it looks like a linear function, ReLU has a
derivative function and allows for backpropagation

Disadvantages
 Can only be used within the hidden layers
 Hard to train on small datasets; it needs a lot of data to learn the non-linear behavior.
 The Dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.

Both the function and its derivative are monotonic.

All negative values are converted to zero immediately; this happens so abruptly that the function can neither map nor fit negative inputs properly, which creates a problem.
Leaky ReLU Activation Function

The Leaky ReLU activation function is needed to solve the 'Dying ReLU' problem.
In Leaky ReLU we do not set all negative inputs to zero but map them to a small value near zero, which solves the major issue of the ReLU activation function.

R(z) = max(0.1*z, z)

Advantages
 Prevents dying ReLU problem—this variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation,
even for negative input values
 Otherwise like ReLU
Disadvantages
 Results not consistent—leaky ReLU does not provide consistent
predictions for negative input values.
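A minimal NumPy sketch (illustrative) comparing ReLU with Leaky ReLU, using the 0.1 slope from R(z) = max(0.1*z, z):

import numpy as np

def relu(z):
    return np.maximum(0, z)                 # zero for negative inputs

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)    # small positive slope for negative inputs

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))         # [0.  0.  0.  0.5 3. ]
print(leaky_relu(z))   # [-0.3  -0.05  0.  0.5  3. ]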

3.4 Softmax:

 Sigmoid is able to handle only two cases (class labels).
 Softmax can handle multiple cases. The softmax function squeezes the output for each class to between 0 and 1, with their sum equal to 1.
 It is ideally used in the final output layer of the classifier, where we are actually trying to obtain the probabilities.
 Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes instead of only a binary-class solution.

σ(z)_i = e^(z_i) / Σ_{j=1..K} e^(z_j), where:
σ = the softmax function
z = the input vector, with z_i its i-th element
e^(z_i) = standard exponential function applied to the i-th input element
K = number of classes in the multi-class classifier
e^(z_j) = standard exponential function applied to each element z_j, summed in the denominator
Advantages
Able to handle multiple classes, whereas other activation functions handle only one class—it normalizes the output for each class to between 0 and 1, dividing each exponential by their sum so that the probabilities add up to 1, giving the probability of the input value being in a specific class.
Useful for output neurons—typically Softmax is used only for the output
layer, for neural networks that need to classify inputs into multiple
categories.
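A minimal NumPy sketch (illustrative) of the softmax function; subtracting the maximum is an assumed numerical-stability trick that does not change the result:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(np.round(p, 3), float(p.sum()))   # approximately [0.659 0.242 0.099], summing to 1.0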

Q) Explain about Deep feedforward networks or feedforward neural


networks or multilayer perceptron (MLP).
A deep neural network is a neural network with at least two hidden layers. Deep neural networks use sophisticated mathematical modeling to process data in different ways. Where many traditional machine learning algorithms apply a single (often linear) transformation to the data, deep learning algorithms stack many layers of processing in a hierarchy.

Fig. Deep Feedforward Network


Deep learning creates many layers of neurons, attempting to learn structured representations, layer by layer.

The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y.

A feedforward network defines a mapping y = f (x; θ) and learns the value of


the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations
used to define f, and finally to the output y. There are no feedback
connections in which outputs of the model are fed back into itself.

When feedforward neural networks are extended to include feedback


connections, they are called recurrent neural networks.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network.

Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is
associated with a directed acyclic graph describing how the functions are
composed together.

For example, we might have three functions f(1), f(2), and f(3) connected in a chain, to form f(x) = f(3)(f(2)(f(1)(x))). This chain structure is the most commonly used structure of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer (a hidden layer used to learn intermediate representations), and so on. The final layer of a feedforward network is called the output layer and provides the output of the network. The overall length of the chain gives the depth of the model, and the number of neurons in a layer gives the width of that layer. It is from this terminology that the name “deep learning” arises.
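A minimal NumPy sketch (illustrative; the layer sizes and random weights are assumed) of the chain f(x) = f(3)(f(2)(f(1)(x))):

import numpy as np

def layer(x, W, b, activation=np.tanh):
    # one layer: affine transformation followed by a non-linearity
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input vector with 3 features (assumed)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # first layer f(1)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # second (hidden) layer f(2)
W3, b3 = rng.normal(size=(1, 2)), np.zeros(1)     # output layer f(3)

y = layer(layer(layer(x, W1, b1), W2, b2), W3, b3, activation=lambda z: z)
print(y.shape)   # (1,) -- the chain has depth 3, with layer widths 4, 2 and 1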

Q) Differentiate ML & DL.


1. Data dependencies for Performance:
When the data is small, deep learning algorithms don't perform that well. This is because deep learning algorithms need a large amount of data to understand it perfectly. On the other hand, traditional machine learning algorithms, with their handcrafted rules, prevail in this scenario.
2. Hardware dependencies
Deep learning algorithms heavily depend on high-end machines, contrary to
traditional machine learning algorithms, which can work on low-end
machines. Deep learning algorithms inherently do a large amount of matrix
multiplication operations. These operations can be efficiently optimized
using a GPU.

3. Feature engineering:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Feature engineering turns your inputs into things the algorithm can understand.

In machine learning, most of the applied features need to be identified by an expert and then hand-coded as per the domain and data type. Features can be pixel values, shape, texture, position and orientation. The performance of most machine learning algorithms depends on how accurately the features are identified and extracted.

Deep learning algorithms try to learn high-level features from data.

Deep learning reduces the task of developing a new feature extractor for every problem. For example, a convolutional NN will try to learn low-level features such as edges and lines in its early layers, then parts of faces, and then a high-level representation of a face.

4. Problem Solving approach


When solving a problem using a traditional machine learning algorithm, it is generally recommended to break the problem down into different parts, solve them individually and combine the results. Deep learning, in contrast, advocates solving the problem end-to-end.
Eg. Suppose you have a multiple-object detection task: identify what each object is and where it is present in the image.
In a typical ML approach, you would divide the problem into two steps, object detection and object recognition.
On the contrary, in the deep learning approach, you would do the process end-to-end.

5. Execution time
Usually, a deep learning algorithm takes a long time to train. This is
because there are so many parameters in a deep learning algorithm that
training them takes longer than usual. Whereas machine learning
comparatively takes much less time to train, ranging from a few seconds to
a few hours.
This in turn is completely reversed at test time. At test time, a deep learning algorithm takes much less time to run, whereas if you compare it with k-nearest neighbors (an ML algorithm), the test time increases with the size of the data. This is not applicable to all machine learning algorithms, though, as some of them have small testing times too.

6. Interpretability:
Suppose we use deep learning to give automated scoring to essays. The performance it gives in scoring is quite excellent and is near human performance. But there is an issue: it does not reveal why it has given that score. Mathematically you can find out which nodes of a deep neural network were activated, but we don't know what these neurons were supposed to model and what these layers of neurons were doing collectively. So we fail to interpret the results.
On the other hand, machine learning algorithms like decision trees give us crisp rules as to why they chose what they chose, so it is particularly easy to interpret the reasoning behind them. Therefore, algorithms like decision trees and linear/logistic regression are primarily used in industry for interpretability.

Characteristic-wise comparison (ML vs DL):
Data dependencies for performance — ML: requires less data for identifying rules; DL: requires a large amount of data for better performance.
Hardware dependencies — ML: works on low-end machines; DL: heavily depends on high-end machines.
Feature engineering — ML: features need to be identified by an expert and then hand-coded as per the domain and data type; DL: tries to learn high-level features from data, reducing the task of developing a new feature extractor for every problem.
Problem solving approach — ML: breaks the problem into parts, finds and combines the solutions; DL: solves the problem end-to-end.
Execution time — ML: takes much less time for training, but may take more time for testing depending on the algorithm (e.g. KNN); DL: takes more time for training and less time for testing.
Interpretability — ML: easy to interpret the results; DL: fails to interpret the results.
Q) Explain various applications of Deep Learning.

There are various interesting applications of Deep Learning that have turned things that were impossible a decade ago into reality. Some of them are:
1. Color restoration, where a given image in greyscale is automatically
turned into a colored one.
2. Recognizing hand written message.
3. Adding sound to a silent video that matches with the scene taking
place.
4. Self-driving cars
5. Computer Vision: for applications like vehicle number plate
identification and facial recognition.
6. Information Retrieval: for applications like search engines, both text
search, and image search.
7. Marketing: for applications like automated email marketing, target
identification
8. Medical Diagnosis: for applications like cancer identification, anomaly
detection
9. Natural Language Processing: for applications like sentiment analysis,
photo tagging
10. Online Advertising, etc

Q) Briefly explain about loss function in neural networks.

A neural network uses optimization strategies to minimize the error of the algorithm. The way we actually compute this error is by using a loss function, which quantifies how well or how badly the model is performing. Loss functions are divided into two broad categories, regression loss and classification loss.

1. Regression Loss Function


Regression Loss is used when we are predicting continuous values like the
price of a house or sales of a company.
Eg. Mean Squared Error
Mean Squared Error is the mean of the squared differences between the actual and predicted values: MSE = (1/n) * Σ (y_i − ŷ_i)^2. If the difference is large, the model penalizes it heavily, as we are computing the squared difference.
2. Binary Classification Loss Function
Suppose we are dealing with a Yes/No situation like “a person has diabetes
or not”, in this kind of scenario Binary Classification Loss Function is used.
Eg. Binary Cross Entropy Loss
It gives the probability value between 0 and 1 for a classification task.
Cross-Entropy calculates the average difference between the predicted and
actual probabilities.
3. Multi-Class Classification Loss Function
If we take a dataset like Iris, where we need to predict the three class labels Setosa, Versicolor and Virginica, and in general whenever the target variable has more than two classes, the Multi-Class Classification Loss function is used.
Eg. Categorical Cross Entropy Loss:
These are similar to binary classification cross-entropy, used for multi-class
classification problems.
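A minimal NumPy sketch (illustrative; the sample values are made up) of the three loss functions mentioned above:

import numpy as np

def mse(y_true, y_pred):
    # mean of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded, y_pred holds one row of class probabilities per sample
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))                                # 0.25
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))               # ~0.164
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))  # ~0.223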

Q) Explain briefly about gradient descent algorithm.


A deep learning neural network learns to map a set of inputs to a set of outputs from training data. We cannot calculate the perfect weights for a neural network analytically; they must be found through an iterative optimization procedure.
Gradient descent is an iterative optimization algorithm for finding the
minimum of a function.
To find the minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient of the function at the
current point.
The “gradient” in gradient descent refers to an error gradient. The
model with a given set of weights is used to make predictions and the error
for those predictions is calculated.
Eg.

Fig. Gradient Descent


The gradient is given by the slope of the tangent at w = 0.2, and then the
magnitude of the step is controlled by a parameter called the learning rate.
The larger the learning rate, the bigger the step we take, and the smaller the
learning rate, the smaller the step we take. Then we take the step and we
move to w1.
Now when choosing the learning rate, we have to be very careful as a large
learning rate can lead to big steps and eventually missing the minimum.
On the other hand, a small learning rate can result in very small steps and
therefore causing the algorithm to take a long time to find the minimum
point.
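The update rule is w_new = w_old – learning_rate * dE/dw. A minimal Python sketch (illustrative) on an assumed toy loss E(w) = (w – 3)^2, starting from w = 0.2 as in the figure:

learning_rate = 0.1
w = 0.2                                # starting point

for step in range(50):
    gradient = 2 * (w - 3)             # dE/dw for E(w) = (w - 3)**2
    w = w - learning_rate * gradient   # step in the direction of the negative gradient
print(round(w, 4))                     # approaches the minimum at w = 3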

Q) Explain about Back propagation algorithm.

Back-propagation is the essence of neural net training. It is the method of


fine-tuning the weights of a neural net based on the error rate obtained in
the previous epoch (i.e., iteration). Proper tuning of the weights allows you to
reduce error rates and to make the model reliable by increasing its
generalization.

The algorithm effectively trains a neural network by applying the chain rule. In simple terms, after each forward pass through the network, back propagation performs a backward pass while adjusting the model's parameters (weights and biases).

Algorithm:
1. Initialize the weights and biases.
2. Iteratively repeat the following steps until a defined number of iterations is completed or the error falls below a threshold value:
i. Calculate network output using forward propagation.
ii. Calculate error between actual and predicted values.
iii. Propagate the error back into the network and update weights
and biases using the equations:

Fig. illustrating BP
Example:
Forward Propagation (with x1 = 0.1, w1 = 0.15, b1 = 0.4, w2 = 0.45, b2 = 0.65, as implied by the update steps below):

Therefore,
z1 = 0.415 a1 = 0.6023 z2 = 0.9210 a2 = 0.7153

Let us consider:
epochs = 1000, threshold = 0.001, learning rate = 0.4, T = 0.25

E = 1/2 (T - a2)^2 = 0.1083

Eqn # 1: z1 = x1·w1 + b1
Eqn # 2: a1 = σ(z1) = 1/(1 + e^(-z1))
Eqn # 3: z2 = a1·w2 + b2
Eqn # 4: a2 = σ(z2) = 1/(1 + e^(-z2))
Eqn # 5: E = 1/2 (T - a2)^2
Updating w2:

= 0.45 - 0.4(-(0.25-0.7153))*(0.7153(1-0.7153))*(0.6023)
= 0.45 - 0.4*0.05706
= 0.427
Updating b2:
= 0.65 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*1
= 0.65 - 0.4*0.0948
= 0.612
Updating w1:

= 0.15 - 0.4*(-(0.25-0.7153))*(0.7153(1 -0.7153))*0.45*0.6023(1-


0.6023)*0.1
= 0.15 - 0.4*0.001021
= 0.1496
Updating b1:

= 0.40 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*0.45*0.6023(1-
0.6023)*1
= 0.40-0.4*0.01021
= 0.3959

Therefore we continue the next iteration (feedforward) with the updated values of w1, b1, w2 and b2:
w1 = 0.1496, b1 = 0.3959, w2 = 0.427, b2 = 0.612, with the same input x1 = 0.1.
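The arithmetic above can be reproduced with a short Python script; a minimal sketch (illustrative) of one forward pass and one round of weight updates using the same values:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, T, lr = 0.1, 0.25, 0.4
w1, b1, w2, b2 = 0.15, 0.4, 0.45, 0.65

# forward pass
z1 = x1 * w1 + b1; a1 = sigmoid(z1)    # 0.415, 0.6023
z2 = a1 * w2 + b2; a2 = sigmoid(z2)    # 0.9210, 0.7153
E = 0.5 * (T - a2) ** 2                # 0.1083

# backward pass (chain rule)
delta2 = -(T - a2) * a2 * (1 - a2)     # dE/dz2
w2_new = w2 - lr * delta2 * a1         # 0.427
b2_new = b2 - lr * delta2              # 0.612
delta1 = delta2 * w2 * a1 * (1 - a1)   # dE/dz1 (uses the old w2, as in the worked example)
w1_new = w1 - lr * delta1 * x1         # 0.1496
b1_new = b1 - lr * delta1              # 0.3959

print(round(w1_new, 4), round(b1_new, 4), round(w2_new, 3), round(b2_new, 3))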

Q) What is Vanishing Gradient problem?


As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.

Eg. In the problem below, the derivatives with respect to the weights are very small.
So when we do back propagation, we keep multiplying factors that are less than 1 by each other, and the gradients become smaller and smaller as we move backward through the network.
This means the neurons in the earlier layers learn very slowly. The result is a training process that takes too long, and the prediction accuracy is compromised.
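A tiny Python sketch (illustrative; the per-layer pre-activation of 0.5 and weight of 0.5 are assumed) showing how the product of per-layer factors shrinks towards zero:

import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1 - s)                       # never larger than 0.25

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_derivative(0.5) * 0.5    # each backward step multiplies by a factor < 1
    print(layer, grad)
# after 10 layers the gradient is about 5e-10, so the earliest layers barely learn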

Q) Explain in detail about the CNN model.

MLPs use one perceptron for each input (e.g. each pixel in an image, multiplied by 3 in the RGB case). The number of weights rapidly becomes unmanageable for large images: for a 224 x 224 pixel image with 3 color channels there are around 150,528 weights that must be trained! As a result, difficulties arise whilst training and overfitting can occur.

A Convolutional neural network (CNN) is a neural network that has one or


more convolutional layers and is used mainly for image processing,
classification, segmentation.

Fig. CNN Architecture


Input layer:
The input to a CNN is mostly an image (n x m x 1 for a grayscale image, n x m x 3 for a colored image).

Fig. RGB image as input


Convolution layer:
Here, we basically define filters and compute the convolution between the defined filters and each of the 3 channel images.

Fig. convolution operation

In the same way we apply the filter to the remaining images (the above is for the red channel; then we do the same for green and blue). We can apply more than one filter; the more filters we use, the better we can preserve the spatial dimensions.

We use convolution instead of feeding the flattened image directly into a fully connected layer, because otherwise we end up with a massive number of parameters that need to be optimized, which is computationally expensive.
Eg. A fully connected neuron over a 5x5x1 image needs 25 weights without convolution; a 2x2 convolution filter needs only 4 shared weights and produces an (n-f+1) x (n-f+1) = 4x4 = 16-value feature map.

By using convolution we can also prevent overfitting of the model.

It is worthwhile to use a ReLU activation function in the convolution layer, which passes only positive values and sets negative values to zero.
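A minimal NumPy sketch (illustrative; the 5x5 image and the 2x2 filter values are assumed) of a 'valid' convolution followed by ReLU, producing the (n-f+1) x (n-f+1) feature map discussed above:

import numpy as np

def conv2d_valid(image, kernel):
    # 'valid' convolution (no padding, stride 1): output is (n - f + 1) x (n - f + 1)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)          # toy 5x5x1 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])              # assumed 2x2 filter (4 shared weights)
feature_map = np.maximum(0, conv2d_valid(image, kernel))  # ReLU keeps only positive values
print(feature_map.shape)                                  # (4, 4), i.e. 16 output values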

Pooling layer:
The pooling layer's objective is to reduce the spatial dimensions of the data propagating through the network.
1. Max Pooling is the most common: for each section of the image we scan, we keep the highest value.

Fig. Max Pooling with stride = 2


Max pooling provides spatial invariance, which enables the neural network to recognize objects in an image even if the object does not exactly resemble the original object.

2. Average Pooling: here, we take the average of the area we scan.

Fig. Average Pooling with stride = 2
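A minimal NumPy sketch (illustrative; the 4x4 input values are made up) of max and average pooling with a 2x2 window and stride 2:

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [2., 1., 9., 8.],
              [0., 3., 4., 7.]])
print(pool2d(x, mode="max"))   # [[6. 5.] [3. 9.]]
print(pool2d(x, mode="avg"))   # [[3.5 2. ] [1.5 7. ]]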

Fully Connected Layer:


Here, we flatten the output of the last convolutional (or pooling) layer and connect every node of the current layer with every node of the next layer.

This layer basically takes the output of the preceding layer, whether it is a convolutional layer, a ReLU layer or a pooling layer, and outputs an n-dimensional vector, where n is the number of classes pertaining to the problem.

Fig. Fully Connected Layer

Q) Differentiate Shallow NN and Deep NN.

Shallow Neural Network vs Deep Neural Network:
A shallow neural network consists of only one hidden layer, whereas a deep neural network consists of more than one hidden layer.
A shallow neural network takes input as vectors only, whereas a deep neural network can take raw data like images and text as input.
