
Introduction to Deep Learning

Deep Learning
DSE 5251, M. Tech Data Science
Dr. Abhilash K Pai
Department of Data Science and Computer Applications
MIT Manipal

Introduction to AI
• Evolution of AI
• Goal: To create machines that think and perceive the world like humans.
• Initially solved problems that can be described by a list of formal mathematical rules.
• Ex- Playing Chess

• Challenge in solving tasks that are easy for people to perform but difficult for people
to describe formally (problems that we solve intuitively)
• Ex- Identifying words, Recognizing people in images

• How to get the informal knowledge into a computer?


• Knowledge base approach
• Hard code knowledge in formal languages
• Computers can reason about statements automatically using logical inference rules
• Ex- Cyc project
• However, none of these projects has led to a major success, and they fail to model intricate aspects of the
world.

Machine Learning
• AI systems need the ability to acquire their own knowledge, by extracting
patterns from raw data. This capability is known as machine learning.

• Machine learning algorithms depend heavily on the representation (collection of features) of the data they are given.

• For many tasks, it is difficult to know what features should be extracted.

Representations matter

Deep Learning
• Solution is to use machine learning to discover not only the mapping
from representation to output but also the representation itself.
• This approach is known as representation learning.

• Learned representations often produce better performance compared to hand-designed representations.

• They also allow AI systems to rapidly adapt to new tasks, with minimal
human intervention.

Deep Learning
• While designing algorithms for learning features the goal is usually to separate
the factors of variation.
• Unobserved objects or unobserved forces in the physical world that affect observable
quantities.
• They may also exist as constructs in the human mind that provide useful simplifying
explanations or inferred causes of the observed data.
• Ex: In voice data – speaker’s accent, gender, age

• However, many factors of variation influence every single piece of data we observe.
• Also, it can be very difficult to extract such high-level, abstract features from raw data.

• Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations.

Deep Learning, Machine Learning and AI

Deep Learning and Machine Learning

(Source: softwaretestinghelp.com)

Learning Multiple Components

Neural Network Examples

Scale drives deep learning progress

DSE 5251 Deep Learning : Reference Materials

1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016.

2. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2018.

3. Course Notes - Neural Networks and Deep Learning, Andrew Ng.

4. Course Notes - Deep Learning, IIT Ropar, Prof. Sudharshan Iyengar.

5. Aurelien Geron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O'Reilly Publications.
Neural Networks: From ground up
Credits:
Most of the content in these slides is adapted from:
CS7015 Deep Learning, Dept. of CSE, IIT Madras
by Dr. Mitesh Khapra

Biological Neuron

Input: a funny Instagram reel

McCulloch-Pitts (MP) Neuron
Implementing Boolean functions using MP Neuron
Perceptron

MP Neuron vs Perceptron
Boolean function using Perceptron : Example

Perceptron Learning Algorithm

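
The algorithm itself appears on the slide as a figure; the sketch below implements the standard perceptron update rule (add the input to the weights when a positive example is misclassified, subtract it when a negative example is), shown here learning the OR function. The variable names and the choice of OR as the example are mine.

import numpy as np

def train_perceptron(X, y, epochs=10):
    """Standard perceptron learning rule with an appended bias input of 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a constant 1 for the bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(Xb, y):
            pred = int(np.dot(w, x_i) >= 0)
            if y_i == 1 and pred == 0:
                w += x_i                          # misclassified positive: add the input
            elif y_i == 0 and pred == 1:
                w -= x_i                          # misclassified negative: subtract the input
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])                        # OR function (linearly separable)
w = train_perceptron(X, y)
print("weights:", w)
print("predictions:", [int(np.dot(w, np.append(x, 1)) >= 0) for x in X])
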
The XOR Conundrum

What do we do about functions which are not linearly separable?

Non-linear!
Solving XOR using Multi-Layer Perceptrons (MLP)

Example of MLP

Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.
Going beyond Binary Inputs and Outputs
Need for activation functions

• The thresholding logic used by a perceptron is very harsh!

• E.g., when -w0 = 0.5, even though the output values 0.49 and 0.51 are very close to each other, the perceptron assigns them to different classes.

• This behavior is not a characteristic of the specific problem, weight, or threshold that we chose; it is a characteristic of the perceptron function itself, which behaves like a step function.
Sigmoid Neuron
Other popular activation functions
Representation power of MLP

• We are interested in knowing whether a network of neurons can be used to represent an arbitrary function (like the one shown in the figure).

• We observe that such an arbitrary function can be approximated by several "tower" functions.

• The more such "tower" functions we use, the better the approximation.

• To be more precise, we can approximate any arbitrary function by a sum of such "tower" functions.
• To figure out how to construct such towers, let's consider the example of a sigmoid function.

• If we set w to a very high value, we recover the step function.

• Similarly, adjusting b will shift this curve.

• Now let us see what we get by taking two such sigmoid functions (with different b) and subtracting one from the other.

• Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another?

• So far, we have considered the case where there is only one input. What if we have more than one input?
• This is an open tower. However, we need a closed tower!

• The top portion is a closed tower. But how do we extract that?
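
A minimal numeric sketch (mine, not from the slides) of the 1-D tower construction: a sigmoid with a large w approximates a step, and subtracting two such steps with different b gives a "tower"; summing several towers then approximates an arbitrary function. The values of w and b below are arbitrary illustrations.

import numpy as np

def sigmoid(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(0, 1, 11)

# A large w makes the sigmoid behave like a step; b controls where the step occurs.
step_at_03 = sigmoid(x, w=100, b=-30)    # approximately 1 for x > 0.3
step_at_06 = sigmoid(x, w=100, b=-60)    # approximately 1 for x > 0.6

tower = step_at_03 - step_at_06          # approximately 1 on (0.3, 0.6), 0 elsewhere
print(np.round(tower, 2))
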
A Typical Machine Learning Set-up

Learning Parameters
Gradient Descent
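
The learning-parameters slides are figures, so as a standalone sketch (my own toy data of two points) here is plain gradient descent on a single sigmoid neuron with squared-error loss, following the usual update w := w - eta * dL/dw.

import numpy as np

# Toy data for a single sigmoid neuron y = sigmoid(w*x + b), squared-error loss.
X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def grads(w, b):
    """Accumulate dL/dw and dL/db over the data for L = sum (f(x) - y)^2."""
    dw = db = 0.0
    for x, y in zip(X, Y):
        fx = sigmoid(w * x + b)
        delta = 2 * (fx - y) * fx * (1 - fx)   # chain rule through the sigmoid
        dw += delta * x
        db += delta
    return dw, db

w, b, eta = -2.0, -2.0, 1.0
for epoch in range(1000):
    dw, db = grads(w, b)
    w -= eta * dw                              # gradient descent update
    b -= eta * db

print("w = %.3f, b = %.3f" % (w, b))
print("predictions:", np.round(sigmoid(w * X + b), 2))
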
Feed Forward Neural Networks: Introduction

We need to answer two questions:

• How to choose the loss function?

• How to compute the gradients of the loss with respect to the parameters?

The choice of the loss function depends on the problem at hand.
Feed Forward Neural Networks: Loss and Activation Functions
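
As a small illustration (not from the slides) of how the choice pairs up in practice: squared error with a linear output unit for regression, and cross-entropy with a softmax output for classification. The numbers below are made up.

import numpy as np

# Regression: linear output unit + squared error loss.
y_true, y_pred = 3.2, 2.5
squared_error = (y_true - y_pred) ** 2

# Classification: softmax output + cross-entropy loss.
def softmax(z):
    z = z - np.max(z)                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])          # pre-activation scores for 3 classes
probs = softmax(logits)
true_class = 0
cross_entropy = -np.log(probs[true_class])  # -log of the probability of the true class

print("squared error:", squared_error)
print("softmax probs:", np.round(probs, 3), "cross-entropy:", round(float(cross_entropy), 3))
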
Backpropagation in ANN: Recap

Feed Forward Neural Networks: Backpropagation
$a_{i+1} = W_{i+1} h_i + b_{i+1}$


Derivatives of activation functions: g’(z)
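
A short sketch (standard formulas, not copied from the slide figures) of g'(z) for the common activations, each expressed in terms of quantities already available from the forward pass.

import numpy as np

def sigmoid(z):            return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):          a = sigmoid(z); return a * (1 - a)   # g'(z) = g(z) * (1 - g(z))

def tanh(z):               return np.tanh(z)
def d_tanh(z):             return 1 - np.tanh(z) ** 2           # g'(z) = 1 - g(z)^2

def relu(z):               return np.maximum(0, z)
def d_relu(z):             return (z > 0).astype(float)         # g'(z) = 1 for z > 0, else 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print("sigmoid'(z):", np.round(d_sigmoid(z), 3))
print("tanh'(z)   :", np.round(d_tanh(z), 3))
print("relu'(z)   :", d_relu(z))
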
Gradient descent and its variants

• Local Minima

• Saddle Points

• Contour Plot

• Plateaus and Flat Regions
Momentum-based Gradient Descent
Nesterov Accelerated Gradient Descent

Intuition: Look before you leap!

• Recall that in momentum-based gradient descent the update is $u_t = \gamma u_{t-1} + \eta \nabla w_t$ and $w_{t+1} = w_t - u_t$.

• So the movement is at least by $\gamma u_{t-1}$ and then a bit more by $\eta \nabla w_t$; NAG therefore evaluates the gradient at the look-ahead point $w_t - \gamma u_{t-1}$ before moving.
Vanilla (Batch) Gradient Descent

What is the issue here? The gradient is computed over the entire training set before a single parameter update is made, which is slow for large datasets.
Stochastic Gradient Descent

Mini-Batch Stochastic Gradient Descent
Choosing a learning rate

• If the learning rate is too small, it takes a long time to converge. If the learning rate is too large, the updates can overshoot and the loss may explode.

Some techniques for choosing the learning rate:

• Linear search
• Annealing-based methods:

  • Step Decay:
    • Halve the learning rate after every 5 epochs, or
    • Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch.

  • Exponential Decay: $\eta_t = \eta_0 e^{-kt}$, where $\eta_0$ and $k$ are hyperparameters and $t$ is the step number.

  • 1/t Decay: $\eta_t = \frac{\eta_0}{1 + kt}$, where $\eta_0$ and $k$ are hyperparameters and $t$ is the step number.
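
A small sketch of the decay schedules written as code, using the forms above; eta0 and k are hypothetical values chosen only for illustration.

import math

eta0, k = 0.1, 0.05

def step_decay(eta, val_err, prev_val_err):
    """Halve the learning rate if the validation error did not improve this epoch."""
    return eta / 2.0 if val_err > prev_val_err else eta

def exponential_decay(t):
    return eta0 * math.exp(-k * t)         # eta_t = eta_0 * e^(-k t)

def one_by_t_decay(t):
    return eta0 / (1.0 + k * t)            # eta_t = eta_0 / (1 + k t)

for t in (0, 10, 50, 100):
    print(t, round(exponential_decay(t), 4), round(one_by_t_decay(t), 4))
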
GD with adaptive learning rate

• Motivation: Can we have a different learning rate for each parameter, which takes care of the frequency of features?

• Intuition: Decay the learning rate for parameters in proportion to their update history.
  • For sparse features, the accumulated update history is small.
  • For dense features, the accumulated update history is large.

• Make the learning rate inversely proportional to the update history, i.e., if the feature has been updated fewer times, give it a larger learning rate, and vice versa.
Adagrad

• Update rule for Adagrad:

  $v_t = v_{t-1} + (\nabla w_t)^2$
  $w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla w_t$

• If the feature has been updated fewer times, give it a larger learning rate, and vice versa.
RMSProp

• Intuition: Adagrad decays the learning rate very aggressively (as the denominator keeps growing).

• Update rule for RMSProp: replace the accumulated sum of squared gradients with an exponentially weighted moving average (weighted decay):

  $v_t = \beta v_{t-1} + (1 - \beta)(\nabla w_t)^2$
  $w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \nabla w_t$
Adam
• Adding momentum to RMSProp
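
A compact sketch of the three adaptive update rules on a single parameter (standard textbook forms; the toy objective, epsilon and the beta values are the usual illustrative defaults, not values from the slides).

import math

grad = lambda w: 2.0 * (w - 3.0)          # toy objective f(w) = (w - 3)^2
eta, eps, steps = 0.5, 1e-8, 200

# Adagrad: accumulate all squared gradients.
w, v = 0.0, 0.0
for _ in range(steps):
    g = grad(w)
    v += g * g
    w -= eta / math.sqrt(v + eps) * g
print("adagrad:", round(w, 4))

# RMSProp: exponentially weighted moving average of squared gradients.
w, v, beta = 0.0, 0.0, 0.9
for _ in range(steps):
    g = grad(w)
    v = beta * v + (1 - beta) * g * g
    w -= eta / math.sqrt(v + eps) * g
print("rmsprop:", round(w, 4))

# Adam: momentum on the gradient (m) plus RMSProp-style scaling (v), with bias correction.
w, m, v, b1, b2 = 0.0, 0.0, 0.0, 0.9, 0.999
for t in range(1, steps + 1):
    g = grad(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w -= eta / (math.sqrt(v_hat) + eps) * m_hat
print("adam   :", round(w, 4))
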
DSE 5251 DEEP LEARNING

Dr. Abhilash K Pai


Assistant Professor,
Dept. of Data Science and Computer Applications
MIT Manipal
The Convolution Operation - 1D
▪ Convolution is a linear operation on two functions of a real-valued argument, where one function is slid over the other and the element-wise products are summed.

▪ Example: Consider a discrete signal x_t which represents the position of a spaceship at time t, recorded by a laser sensor.

▪ Now, suppose that this sensor is noisy.

▪ To obtain a less noisy measurement, we would like to average several measurements.

▪ Considering that the most recent measurements are more important, we take a weighted average over x_t. The new estimate at time t is computed as the convolution of the input x with the filter (mask/kernel) w:

  $s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.

  For example: $s_t = \sum_{a=0}^{6} x_{t-a}\, w_{-a}$

▪ We just slide the filter over the input and compute the value of s_t based on a window around x_t:

  w (w_{-6} ... w_0):  0.01  0.01  0.02  0.02  0.04  0.4  0.5
  x:                   1.0   1.10  1.20  1.40  1.70  1.80  1.90  2.10  2.20
  s:                                                       1.80  1.96  2.11

▪ Use cases of 1-D convolution: audio signal processing, stock market analysis, time series analysis, etc.

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras
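
The sliding-window computation above can be reproduced with a few lines of NumPy (a sketch, not from the slides); np.correlate slides the window without flipping it, which matches the way the weights are written on the slide.

import numpy as np

# Filter written as [w_-6, ..., w_-1, w_0], input as on the slide.
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])
x = np.array([1.0, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])

# s_t = sum over the window ending at x_t of x_{t-a} * w_{-a}
s = np.correlate(x, w, mode='valid')
print(np.round(s, 2))   # ~[1.81 1.97 2.11]; the slide shows 1.80, 1.96, 2.11 after truncation
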


The Convolution Operation - 2D
▪ Images are good examples of 2-D inputs.

▪ A 2-D convolution of an image I using a filter K of size m x n is defined as:

  $S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\, j-b}\, K_{a,b}$

▪ However, the following (cross-correlation) is used in practice:

  $S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\, j+b}\, K_{a,b}$
The Convolution Operation - 2D
▪ Now, if we consider the center pixel as the pixel of interest, the 2-D convolution equation is as follows:

  $S_{ij} = (I * K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\, j-b}\, K_{\lfloor m/2 \rfloor + a,\, \lfloor n/2 \rfloor + b}$

▪ Example 5 x 5 input (the center element is the pixel of interest):

  0 1 0 0 1
  0 0 1 1 0
  1 0 0 0 1
  0 1 0 0 1
  0 0 1 0 1

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

The Convolution Operation - 2D

(Figure: animation of a filter sliding over an input image. Source: https://developers.google.com/)


The Convolution Operation - 2D : Example filters

▪ Smoothening filter
▪ Sharpening filter
▪ Filter for edge detection

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


The Convolution Operation - 2D : Various filters (edge detection)

Prewitt:
  Sx = [ -1 0 1 ; -1 0 1 ; -1 0 1 ]      Sy = [ 1 1 1 ; 0 0 0 ; -1 -1 -1 ]

Sobel:
  Sx = [ -1 0 1 ; -2 0 2 ; -1 0 1 ]      Sy = [ 1 2 1 ; 0 0 0 ; -1 -2 -1 ]

Laplacian:
  [ 0 1 0 ; 1 -4 1 ; 0 1 0 ]

Roberts:
  Sx = [ 0 1 ; -1 0 ]                    Sy = [ 1 0 ; 0 -1 ]

(Figure: input image shown after applying the horizontal and vertical edge detection filters.)
The Convolution Operation - 2D

Filter 1:
   1 -1 -1
  -1  1 -1
  -1 -1  1

Input image (6 x 6), stride = 1:
  1 0 0 0 0 1
  0 1 0 0 1 0
  0 0 1 1 0 0
  1 0 0 0 1 0
  0 1 0 0 1 0
  0 0 1 0 1 0

Taking the dot product of the filter with the first 3 x 3 window gives 3; sliding one step to the right gives -1.

Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
The Convolution Operation - 2D

With the same Filter 1 and 6 x 6 input but stride = 2, the first two outputs along the top row are 3 and -3.

Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
The Convolution Operation - 2D

Convolving the 6 x 6 input image with Filter 1 at stride = 1 gives a 4 x 4 feature map:

   3 -1 -3 -1
  -3  1  0 -3
  -3 -3  0  1
   3 -2 -2 -1


The Convolution Operation - 2D

Repeat for each filter! Convolving the same 6 x 6 input with Filter 2 (stride = 1):

Filter 2:
  -1  1 -1
  -1  1 -1
  -1  1 -1

gives a second 4 x 4 feature map:

  -1 -1 -1 -1
  -1 -1 -2  1
  -1 -1 -2  1
  -1  0 -4  3

The two 4 x 4 feature maps together form a 4 x 4 x 2 output.
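
A direct implementation of the stride-1, no-padding cross-correlation used above (a sketch of my own): it reproduces the two 4 x 4 feature maps, e.g. the value 3 in the top-left corner for Filter 1.

import numpy as np

def conv2d(image, kernel, stride=1):
    """Strided 'valid' cross-correlation, as used in practice on the slides."""
    m, n = kernel.shape
    H = (image.shape[0] - m) // stride + 1
    W = (image.shape[1] - n) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            window = image[i*stride:i*stride+m, j*stride:j*stride+n]
            out[i, j] = np.sum(window * kernel)     # element-wise product, then sum
    return out

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])
filter2 = np.array([[-1, 1,-1],
                    [-1, 1,-1],
                    [-1, 1,-1]])

print(conv2d(image, filter1))    # 4 x 4 feature map; top-left value is 3
print(conv2d(image, filter2))    # second 4 x 4 feature map
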


The Convolution Operation - RGB Images

▪ Apply the filter to the R, G, and B channels of the image and combine the resultant feature maps to obtain a 2-D feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science

The Convolution Operation - RGB Images, multiple filters

▪ Convolving the input with K filters (Filter 1, Filter 2, ..., Filter K) produces K feature maps.

▪ Depth of the output feature map = No. of feature maps = No. of filters.


The Convolution Operation : Terminologies

1. Depth of an input image = No. of channels in the input image = Depth of a filter

2. Assuming square filters, the spatial extent (F) of a filter is the size of the filter.

The Convolution Operation : Zero Padding

▪ A 3x3 convolution over a 4x4 input gives only a 2x2 output.

▪ Pad zeros around the input and then convolve to obtain a feature map with dimension = input image dimension.


The Convolution Operation : Zero Padding

▪ With zero padding, a 5x5 input image gives a 5x5 feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science


Relation between input size, feature map size, and filter size

For an input image of size W1 x H1 x D1, convolved with K filters of spatial extent F, stride S, and zero padding P, the output feature map has size W2 x H2 x D2, where:

  $W_2 = \frac{W_1 - F + 2P}{S} + 1$

  $H_2 = \frac{H_1 - F + 2P}{S} + 1$

  $D_2 = K$
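
The relation can be wrapped in a small helper (a sketch; the example numbers below come from the stride-1 example earlier, where a 6 x 6 input and a 3 x 3 filter give a 4 x 4 map).

def conv_output_size(W1, H1, F, S=1, P=0, K=1):
    """W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

print(conv_output_size(6, 6, F=3, S=1, P=0, K=2))   # (4, 4, 2), as in the earlier example
print(conv_output_size(5, 5, F=3, S=1, P=1, K=1))   # (5, 5, 1): zero padding preserves size
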


Convolutional Neural Network (CNN) : At a glance

Convolution -> Pooling -> ... (can repeat many times) -> Flattened -> Fully connected feedforward network -> cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Pooling

Given the two 4 x 4 feature maps produced by Filter 1 and Filter 2, pooling summarises each local region of a feature map:

• Max Pooling
• Average Pooling

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.


Pooling

Max. Pooling and Average Pooling are applied over windows of the feature map; like a convolution filter, the pooling window is slid with a chosen stride.
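
A sketch of 2x2 max and average pooling with stride 2 (my own helper, not from the slides), applied to the 4 x 4 feature map obtained from Filter 1 earlier.

import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride and take the max or the mean."""
    H = (fmap.shape[0] - size) // stride + 1
    W = (fmap.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = reduce_fn(fmap[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

fmap = np.array([[ 3,-1,-3,-1],
                 [-3, 1, 0,-3],
                 [-3,-3, 0, 1],
                 [ 3,-2,-2,-1]])

print(pool2d(fmap, mode="max"))   # [[3, 0], [3, 1]]
print(pool2d(fmap, mode="avg"))   # [[0, -1.75], [-1.25, -0.5]]
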


Why Pooling ?

▪ Subsampling pixels will not change the object (a subsampled bird is still a bird).

▪ We can subsample the pixels to make the image smaller.

▪ Therefore, fewer parameters are needed to characterize the image.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Important properties of CNN

▪ Sparse Connectivity

▪ Shared weights

▪ Equivariant representation



Properties of CNN

Each value of the 4 x 4 feature map is connected to only 9 inputs of the 6 x 6 image (the 3 x 3 window covered by Filter 1), not to all of them: this is Sparse Connectivity, and it means far fewer parameters than a fully connected layer.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Properties of CNN

Is sparse connectivity good?

Ian Goodfellow et al., 2016


Properties of CNN

The same filter values (weights) are reused at every position of the 6 x 6 image: Shared Weights, giving even fewer parameters.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Equivariance to translation

▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)) or if the output changes in the same way as the
input.

▪ This is achieved by the concept of weight sharing.

▪ As the same weights are shared across the image, if an object occurs anywhere in the image it will be detected irrespective of its position in the image.

Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science



CNN vs Fully Connected NN

▪ A CNN compresses the fully connected NN in two ways:

▪ Reducing the number of connections

▪ Shared weights

▪ Max pooling further reduces the parameters to characterize an image.



Convolutional Neural Network (CNN) : Non-linearity with activation

(Convolution + ReLU) -> Pooling -> ... -> Flattened -> Fully connected feedforward network -> cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
LeNet-5 Architecture for handwritten text recognition

Layer hyperparameters and parameter counts:

• Conv1: S=1, F=5, K=6, P=2        #Param = ((5*5*1)+1) * 6 = 156
• Pool1: S=2, F=2, K=6, P=0        #Param = 0
• Conv2: S=1, F=5, K=16, P=0       #Param = ((5*5*6)+1) * 16 = 2416
• Pool2: S=2, F=2, K=16, P=0       #Param = 0
• Conv3/FC (120 units):            #Param = ((5*5*16)*120) + 120 = 48120
• FC (84 units):                   #Param = (84*120) + 84 = 10164
• Output (10 units):               #Param = (84*10) + 10 = 850

Activations: tanh in the earlier layers, sigmoid at the output.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
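
A minimal Keras sketch of this architecture (assuming 28x28x1 inputs with padding="same" on the first convolution, which is equivalent to the 32x32 padded input of the original paper); model.summary() reproduces the per-layer parameter counts listed above (156, 2416, 48120, 10164, 850).

import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of LeNet-5; layer settings follow the S/F/K/P values on the slide.
model = tf.keras.Sequential([
    layers.Conv2D(6, kernel_size=5, padding="same", activation="tanh",
                  input_shape=(28, 28, 1)),                # 156 parameters
    layers.AveragePooling2D(pool_size=2, strides=2),       # 0 parameters
    layers.Conv2D(16, kernel_size=5, activation="tanh"),   # 2416 parameters
    layers.AveragePooling2D(pool_size=2, strides=2),       # 0 parameters
    layers.Conv2D(120, kernel_size=5, activation="tanh"),  # 48120 parameters
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),                   # 10164 parameters
    layers.Dense(10, activation="sigmoid"),                # 850 parameters
])
model.summary()
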


LeNet-5 Architecture for handwritten number recognition

Source: http://yann.lecun.com/



ImageNet Dataset

More than 14 million images, 22,000 image categories.

Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
ImageNet Large Scale Visual Recognition Challenge

• 1000 ImageNet categories

(Figure: winning entries by year, including ZFNet.)


AlexNet (2012)

▪ Used the ReLU activation function instead of sigmoid and tanh.

▪ Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.

▪ Implemented dropout layers.


AlexNet Architecture

Parameter counts per convolution layer (pooling layers have 0 parameters):

• Conv1: ((11*11*3)+1) * 96  = 34944
• Conv2: ((5*5*96)+1) * 256  = 614656
• Conv3: ((3*3*256)+1) * 384 = 885120
• Conv4: ((3*3*384)+1) * 384 = 1327488
• Conv5: ((3*3*384)+1) * 256 = 884992

Total #Param (including the fully connected layers): ~62M

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
ZFNet Architecture (2013)

• Used filters of size 7x7 instead of 11x11 in AlexNet.

• Used Deconvnet to visualize the intermediate results.

Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
ZFNet

Visualizing and Understanding Deep Neural Networks by Matt Zeiler (YouTube talk)


VGGNet Architecture (2014)

• Used filters of size 3x3 in all the convolution layers.

• 3 conv layers back-to-back have an effective receptive field of 7x7 (two stacked 3x3 convolutions cover 5x5, three cover 7x7).

• Also called VGG-16 as it has 16 weight layers.

• This work reinforced the notion that convolutional neural networks have to have a deep network of layers in order for this hierarchical representation of visual data to work.

Image Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR14).


GoogLeNet Architecture (2014)

• Most of the architectures discussed till now apply either of the following after each convolution operation:
  • Max Pooling
  • 3x3 convolution
  • 5x5 convolution

• Idea: Why can't we apply them all together at the same time and concatenate the feature maps?

• Problem: This would result in a large number of computations.

• Specifically, each element of the output requires O(F x F x D) computations.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


GoogLeNet Architecture (2014)

• Solution: Apply 1x1 convolutions.

• A 1x1 convolution aggregates along the depth.

• So, if we apply D1 1x1 convolutions (D1 < D), we will get an output of size W x H x D1.

• The total number of computations per output element then reduces to O(F x F x D1).

• We could then apply the subsequent 3x3 and 5x5 filters on this reduced output.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
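
A back-of-the-envelope sketch of the saving (the concrete sizes are hypothetical, chosen only to make the O(F*F*D) vs O(F*F*D1) comparison tangible): multiplications needed for a 5x5 convolution on a 28x28x192 input, with and without first reducing the depth to 16 using 1x1 convolutions.

# Hypothetical inception-style example: 28x28x192 input, 32 output maps of a 5x5 conv.
W, H, D = 28, 28, 192      # input width, height, depth
F, K = 5, 32               # 5x5 filters, 32 of them
D1 = 16                    # depth after the 1x1 "bottleneck"

direct = (W * H * K) * (F * F * D)                 # each output element costs O(F*F*D)
bottleneck = (W * H * D1) * (1 * 1 * D) \
           + (W * H * K) * (F * F * D1)            # 1x1 reduction + cheaper 5x5 conv

print(f"direct 5x5      : {direct:,} multiplications")
print(f"1x1 then 5x5    : {bottleneck:,} multiplications")
print(f"reduction factor: {direct / bottleneck:.1f}x")
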


GoogLeNet Architecture (2014)

• Also, we might want to use different dimensionality reductions (different numbers of 1x1 convolutions) before the 3x3 and 5x5 filters.

• We can also add a max-pooling layer followed by a 1x1 convolution.

• After this, we concatenate all these layers. This is called the Inception module.

• GoogLeNet contains many such Inception modules.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


GoogLeNet Architecture (2014)

• Uses global average pooling instead of flattening before the classifier.

• 12 times fewer parameters and 2 times more computations than AlexNet.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).


ResNet Architecture (2015)

Effect of increasing the number of layers of a shallow CNN, experimented on the CIFAR dataset: the deeper plain network (shallow CNN + additional layers) performs worse than the shallow CNN itself.

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).


ResNet Architecture (2015)

ResNet-34

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
ResNet Architecture (2015)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Sequence Models

DSE 3151 DEEP LEARNING

Dr. Abhilash K Pai


Assistant Professor,
Dept. of Data Science and Computer Applications
MIT Manipal
Examples of Sequence Data

▪ Speech Recognition (e.g., audio -> "Mary had a little lamb")

▪ Music Generation

▪ Sentiment Classification (e.g., "It's an average movie")

▪ DNA Sequence Analysis (e.g., AGCCCCTGTGAGGAACTAG)

▪ Machine Translation (e.g., ARE YOU FEELING SLEEPY -> क्या आपको नींद आ रही है)

▪ Video Activity Recognition (e.g., WAVING)

▪ Named Entity Recognition (e.g., identifying "Alice" and "Bob" in "Alice wants to discuss Deep Learning with Bob")


Issues with using ANN/CNN on sequential data

• In feedforward and convolutional neural networks the size of the input was always fixed.

• Further, each input to the network was independent of the previous or future inputs.

• In many applications with sequence data, the input is not of a fixed size.

• Further successive inputs may not be independent of each other.



Modelling Sequence Learning Problems: Introduction

Example tasks: Auto-complete, Part-of-Speech tagging, Movie review (sentiment), Action recognition (some outputs may be "don't care").

• The model needs to look at a sequence of inputs and produce an output (or outputs).

• For this purpose, let's consider each input to correspond to one time step.

• Next, build a network for each time step/input, where each network performs the same task (e.g., auto-complete: input = character, output = character).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


How to Model Sequence Learning Problems?

1. Model the dependence between inputs.
   • E.g.: The next word after an 'adjective' is most probably a 'noun'.

2. Account for a variable number of inputs.
   • A sentence can have an arbitrary number of words.
   • A video can have an arbitrary number of frames.

3. Make sure that the function executed at each time step is the same.
   • Because at each time step we are doing the same task.


Modelling Sequence Learning Problems using Recurrent Neural Networks (RNN): Introduction

• Consider the network at each time step to be a fully connected network that computes the output of that time step from its input alone.

• Since we want the same function to be executed at each time step, we should share the same network (i.e., the same parameters at each time step).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN): Introduction

• If the input sequence is of length 'n', we would create 'n' networks, one for each input, as seen previously.

• But how do we model the dependencies between the inputs?

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN)

Solution: Add a recurrent connection in the network.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Recurrent Neural Networks (RNN)

• So, the RNN equation:

  $s_i = \sigma(x_i U + s_{i-1} W + b)$
  $y_i = \mathcal{O}(s_i V + c)$

  where U, W, V, b, c are the parameters of the network, shared across time steps.

• The dimensions of each term are as follows:

  x_i : [1 x no. of input neurons]
  s_i : [1 x no. of neurons in the hidden state]
  W   : [no. of neurons in the hidden state x no. of neurons in the hidden state]
  U   : [no. of input neurons x no. of neurons in the hidden state]
  V   : [no. of neurons in the hidden state x no. of neurons in the output state]
  b   : [1 x no. of neurons in the hidden state]
  c   : [1 x no. of neurons in the output state]

• At time step i = 0 there are no previous inputs, so the initial state is typically assumed to be all zeros.

• Since the output s_i at time step i is a function of all the inputs from previous time steps, we could say it has a form of memory.

• A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
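
A sketch of the forward pass implied by these equations and dimensions (row-vector convention as in the list above; the sizes and random parameters are my own toy choices): the same U, W, V, b, c are reused at every time step, and the state s is initialised to zeros.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5      # sizes and sequence length (arbitrary)

# Shared parameters, one set for all time steps.
U = rng.normal(size=(n_in, n_hidden))      # input  -> hidden
W = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (recurrence)
V = rng.normal(size=(n_hidden, n_out))     # hidden -> output
b = np.zeros((1, n_hidden))
c = np.zeros((1, n_out))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x_seq = rng.normal(size=(T, 1, n_in))      # a toy input sequence
s = np.zeros((1, n_hidden))                # s_0: no previous input, so all zeros

for t in range(T):
    s = np.tanh(x_seq[t] @ U + s @ W + b)  # s_t = sigma(x_t U + s_{t-1} W + b)
    y = softmax(s @ V + c)                 # y_t = O(s_t V + c)
    print("t =", t, "y_t =", np.round(y, 2))
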


Recurrent Neural Networks (RNN)

• Compact representation of an RNN: the recurrent connection is drawn as a loop from the hidden state back to itself.

• Unrolling the network through time = representing the network against the time axis (the same representation as seen previously).

• At each time step t (also called a frame), the RNN receives the input x_t as well as the output from the previous step, y_{t-1}.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Input and Output Sequences

• Seq-to-Seq
• Vector-to-Seq
• Seq-to-Vector


Recurrent Neural Networks (RNN) : Example

• Problem: Given the temperatures of yesterday and today, predict tomorrow's temperature.

• Unroll the feedback loop by making a copy of the network for each input value.

• Problem: Given the temperatures of 3 days (today, yesterday, and the day before yesterday), predict tomorrow's temperature.

• So, the number of unrolled networks = the number of inputs.

Source: https://www.youtube.com/c/joshstarmer
UNIT-4
Deep Learning: Basics of Deep Learning, Machine Learning Vs Deep
Learning, Fundamental Deep Learning Algorithm-Convolution Neural
Network (CNN).

Q) Describe the Motivation for Deep Learning.


Simple machine learning algorithms work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects. Deep learning was designed to overcome these and other obstacles.

Q) Define Deep Learning(DL).


Deep learning is an aspect of artificial intelligence (AI) that aims to simulate the activity of the human brain, specifically pattern recognition, by passing input through various layers of a neural network.
Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs.

Q)Give brief historical background of Deep Learning.


All the algorithms that are used in deep learning are largely inspired by the way neurons and neural networks function and process data in the brain.
This image is one of the very first pictures of a neuron. It was drawn by Santiago Ramon y Cajal back in 1899, based on what he saw after placing a pigeon's brain under the microscope. He is now known as the father of modern neuroscience.

Fig. Human Neural Functioning

It is possible to mimic certain parts of neurons, such as dendrites, cell bodies and axons, using simplified mathematical models of what limited knowledge we have of their inner workings: signals can be received from the dendrites and sent down the axon once enough signals have been received. This outgoing signal can then be used as another input for other neurons, repeating the process. Some signals are more important than others and can trigger some neurons to fire more easily. Connections can become stronger or weaker, and new connections can appear while others cease to exist.

Fig. Biological Neuron


An artificial neuron behaves in the same way as a biological neuron. It consists of a soma (cell body, for processing information), dendrites (inputs), and an axon terminal to pass on the output of this neuron to other neurons. The end of the axon can branch off to connect to many other neurons.

Q) Differentiate a Biological neuron and and Artificial neuron.


Biological neuron          Artificial neuron
dendrites                  inputs
synapses                   weights or interconnections
axon                       output
cell body (soma)           summation and threshold

Q) Define Artificial Neuron(AN). Explain the computation/processing of


AN with an example.

Neural Networks are networks used in Deep Learning that work similar to
the human nervous system.
An artificial neuron is a mathematical function conceived as a model
of biological neurons, a neural network. Artificial neurons are elementary
units in an artificial neural network.
The artificial neuron receives one or more inputs and sums them to produce
an output by applying some activation function.
Fig. Artificial Neuron

For the above general model of an artificial neural network, the net input can be calculated as follows:

  y_in = (x1*w1 + x2*w2 + x3*w3 + ... + xm*wm) + bias
  i.e., y_in = Σ (xi*wi) + bias, summed over i = 1..m

where Xi is the set of features and Wi is the set of weights.
Bias is the information which can impact the output without being dependent on any feature.
The output can be calculated by applying the activation function over the net input:

  Y = F(y_in)

Each artificial neuron has an internal state, which is called an activation signal. Output signals, which are produced after combining the input signals and the activation rule, may be sent to other units.
Q) Construct a single layer neural network for implementing OR, AND,
NOT gates.

Let us take the activation function to be the binary step function: f(x) = 1 if x > 0, and 0 otherwise.

The AND function can be implemented with weights w1 = w2 = 1 and bias = -1.5.

The output of this neuron is: a = f( -1.5*1 + x1*1 + x2*1 )

Calculation for the summation:
  x1 = 0, x2 = 0  =>  f(-1.5 + 0 + 0) = f(-1.5) = 0
  x1 = 0, x2 = 1  =>  f(-1.5 + 0 + 1) = f(-0.5) = 0
  x1 = 1, x2 = 0  =>  f(-1.5 + 1 + 0) = f(-0.5) = 0
  x1 = 1, x2 = 1  =>  f(-1.5 + 1 + 1) = f(+0.5) = 1

The truth table for this implementation is that of AND.

The OR function can be implemented with weights w1 = w2 = 1 and bias = -0.5.

The output of this neuron is: a = f( -0.5 + x1 + x2 )

The truth table for this implementation is that of OR.

The NOT function can be implemented with weight w1 = -2 and bias = 1.

The output of this neuron is: a = f( 1 - 2*x1 )

The truth table for this implementation is that of NOT.
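
Since the truth tables above are given as figures, here is a short script (following the exact expressions in these notes) that prints them.

# Verify the single-neuron AND, OR and NOT implementations from the notes.
def f(x):
    """Binary step activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

AND = lambda x1, x2: f(-1.5 + x1 * 1 + x2 * 1)
OR  = lambda x1, x2: f(-0.5 + x1 + x2)
NOT = lambda x1:     f(1 - 2 * x1)

print("x1 x2 | AND OR")
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, " |", AND(x1, x2), " ", OR(x1, x2))

print("x1 | NOT")
for x1 in (0, 1):
    print(x1, " |", NOT(x1))
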

Q) Explain the need for multi-layered neural network with an example.

1. XOR:
XOR(A,B) = (A+B)*(AB)'
This sort of relationship cannot be modeled using a single neuron. Thus we will use a multi-layer network.
The idea behind using multiple layers is that complex relations can be broken into simpler functions and combined.
2. XNOR:
Let's break down the XNOR function.

X1 XNOR X2 = NOT ( X1 XOR X2 )
           = NOT [ (A+B).(A'+B') ]
           = (A+B)' + (A'+B')'
           = (A'.B') + (A.B)

A neuron to model A'.B':
The output of this neuron is: a = f( 0.5 - x1 - x2 )
The truth table for this function is that of A'.B'.

The different outputs represent different units:
a1: implements A'.B'
a2: implements A.B
a3: implements OR, which works on a1 and a2, thus effectively (A'.B' + A.B)
The functionality can be verified using the truth table.
Q) Define activation function. Explain different types of activation functions.

• Activation Functions are an extremely important feature of Artificial Neural Networks. They basically decide whether a neuron should be activated or not, and they limit the output signal to a finite value.

• The activation function performs a non-linear transformation of the input, making the network capable of learning more complex relations between input and output, i.e., more complex patterns.

• Without an activation function, the neural network is just a linear regression model, as it performs only a summation of the products of inputs and weights.

E.g., in the figure below, image 2 requires a complex (curved) relation, unlike the simple linear relation in image 1.

Fig. Illustrating the need of an Activation Function for a complex problem.

An activation function must be efficient and should reduce the computation time, because a neural network is sometimes trained on millions of data points.

Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions

1. Binary Step Function


A binary step function is a threshold-based activation function. If the input
value is above or below a certain threshold, the neuron is activated and
sends exactly the same signal to the next layer.We decide some threshold
value to decide output that neuron should be activated or deactivated.It is
very simple and useful to classify binary problems or classifier.
Eg.f(x) = 1 if x > 0 else 0 if x <= 0

2. Linear or Identity Activation Function


As you can see the function is a line or linear. Therefore, the output of the
functions will not be confined between any range.

Fig: Linear Activation Function


Equation: f(x) = x
Range : (-infinity to infinity)
It doesn‟t help with the complexity or various parameters of usual data that
is fed to the neural networks
3. Non-linear Activation Function
The non-linear activation functions are the most used activation functions. Nonlinearity helps make the graph look something like this.

Fig: Non-linear Activation Function

The main terminologies needed to understand nonlinear functions are:
Derivative or Differential: change in the y-axis w.r.t. change in the x-axis. It is also known as the slope.
Monotonic function: a function which is either entirely non-increasing or non-decreasing.

The non-linear activation functions are mainly divided on the basis of their range or curves.

Advantages of non-linear functions over the linear function:
• Differentiation is possible for all the non-linear functions.
• Stacking of networks is possible, which helps us in creating deep neural nets.
• It makes it easy for the model to generalize.

3.1 Sigmoid(Logistic AF)(σ):


The main reason why we use the sigmoid function is that its output lies between 0 and 1.
It is especially used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

Fig: Sigmoid Function (S-shaped Curve)


The function is differentiable and monotonic, but its derivative is not monotonic.
The logistic sigmoid function can cause a neural network to get stuck during training.
Advantages
1. Easy to understand and apply
2. Easy to train on small dataset
3. Smooth gradient, preventing “jumps” in output values.
4. Output values bound between 0 and 1, normalizing the output of each
neuron.
Disadvantages:
 Vanishing gradient—for very high or very low values of X, there is
almost no change to the prediction, causing a vanishing gradient
problem. This can result in the network refusing to learn further, or
being too slow to reach an accurate prediction.
 Outputs not zero centered.
 Computationally expensive
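A minimal NumPy sketch (illustrative) of the sigmoid and its derivative; the nearly flat derivative for large |x| is what causes the vanishing gradient mentioned above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)          # peaks at 0.25 when x = 0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, round(float(sigmoid(x)), 4), round(float(sigmoid_derivative(x)), 6))
# at x = +/-10 the derivative is about 0.000045, i.e. the gradient has nearly vanished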

3.2 TanH(Hyperbolic Tangent AF):

TanH is like the logistic sigmoid, but works in a better way. The range of the TanH function is from -1 to +1.

TanH is often preferred over the sigmoid neuron because it is zero centered. The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero on the tanh graph.

tanh(x) = 2 * sigmoid(2x) - 1

Fig. Sigmoid Vs Tanh

The function is differentiable and monotonic, but its derivative is not monotonic.
Advantages
 Zero centered—making it easier to model inputs that have strongly
negative, neutral, and strongly positive values.
Disadvantages
 Like the Sigmoid function, it also suffers from the vanishing gradient problem
 Hard to train on small datasets
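The relation tanh(x) = 2 * sigmoid(2x) - 1 quoted above can be checked numerically; a short sketch (illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    lhs = np.tanh(x)
    rhs = 2 * sigmoid(2 * x) - 1       # the identity quoted above
    print(x, round(float(lhs), 6), round(float(rhs), 6))   # the two values match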

3.3 ReLU(Rectified Linear Unit):

The ReLU is the most used activation function. It is used in almost all convolutional neural networks, in the hidden layers only.
The ReLU is half rectified (from the bottom): f(z) = 0 if z < 0, and f(z) = z otherwise, i.e.
R(z) = max(0, z)
The range is 0 to infinity.

Advantages
 Avoids vanishing gradient problem.
 Computationally efficient—allows the network to converge very
quickly
 Non-linear—although it looks like a linear function, ReLU has a
derivative function and allows for backpropagation

Disadvantages
 Can only be used within the hidden layers
 Hard to train on small datasets; it needs a lot of data to learn the non-linear behavior.
 The Dying ReLU problem—when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.

Both the function and its derivative are monotonic.

All negative values are converted to zero immediately; this happens so abruptly that the function can neither map nor fit negative inputs properly, which creates a problem.
Leaky ReLU Activation Function

The Leaky ReLU activation function is needed to solve the 'Dying ReLU' problem.
In Leaky ReLU we do not set all negative inputs to zero but map them to a small value near zero, which solves the major issue of the ReLU activation function.

R(z) = max(0.1*z, z)

Advantages
 Prevents dying ReLU problem—this variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation,
even for negative input values
 Otherwise like ReLU
Disadvantages
 Results not consistent—leaky ReLU does not provide consistent
predictions for negative input values.
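A minimal NumPy sketch (illustrative) comparing ReLU with Leaky ReLU, using the 0.1 slope from R(z) = max(0.1*z, z):

import numpy as np

def relu(z):
    return np.maximum(0, z)                 # zero for negative inputs

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)    # small positive slope for negative inputs

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))         # [0.  0.  0.  0.5 3. ]
print(leaky_relu(z))   # [-0.3  -0.05  0.  0.5  3. ]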

3.4 Softmax:

 Sigmoid is able to handle only two cases (class labels).
 Softmax can handle multiple cases. The softmax function squeezes the output for each class to between 0 and 1, with their sum equal to 1.
 It is ideally used in the final output layer of the classifier, where we are actually trying to obtain the probabilities.
 Softmax produces multiple outputs for an input array. For this reason, we can build neural network models that classify more than 2 classes instead of only a binary-class solution.

σ(z)_i = e^(z_i) / Σ_{j=1..K} e^(z_j), where:
σ = the softmax function
z = the input vector, with z_i its i-th element
e^(z_i) = standard exponential function applied to the i-th input element
K = number of classes in the multi-class classifier
e^(z_j) = standard exponential function applied to each element z_j, summed in the denominator
Advantages
Able to handle multiple classes, whereas other activation functions handle only one class—it normalizes the output for each class to between 0 and 1, dividing each exponential by their sum so that the probabilities add up to 1, giving the probability of the input value being in a specific class.
Useful for output neurons—typically Softmax is used only for the output
layer, for neural networks that need to classify inputs into multiple
categories.
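A minimal NumPy sketch (illustrative) of the softmax function; subtracting the maximum is an assumed numerical-stability trick that does not change the result:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(np.round(p, 3), float(p.sum()))   # approximately [0.659 0.242 0.099], summing to 1.0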

Q) Explain about Deep feedforward networks or feedforward neural


networks or multilayer perceptron (MLP).
A deep neural network is a neural network with at least two hidden layers. Deep neural networks use sophisticated mathematical modeling to process data in different ways. Where many traditional machine learning algorithms apply a single (often linear) transformation to the data, deep learning algorithms stack many layers of processing in a hierarchy.

Fig. Deep Feedforward Network


Deep learning creates many layers of neurons, attempting to learn structured representations, layer by layer.

The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y.

A feedforward network defines a mapping y = f (x; θ) and learns the value of


the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations
used to define f, and finally to the output y. There are no feedback
connections in which outputs of the model are fed back into itself.

When feedforward neural networks are extended to include feedback


connections, they are called recurrent neural networks.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network.

Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is
associated with a directed acyclic graph describing how the functions are
composed together.

For example, we might have three functions f(1), f(2), and f(3) connected in a chain, to form f(x) = f(3)(f(2)(f(1)(x))). This chain structure is the most commonly used structure of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer (a hidden layer used to learn intermediate representations), and so on. The final layer of a feedforward network is called the output layer and provides the output of the network. The overall length of the chain gives the depth of the model, and the number of neurons in a layer gives the width of that layer. It is from this terminology that the name “deep learning” arises.
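A minimal NumPy sketch (illustrative; the layer sizes and random weights are assumed) of the chain f(x) = f(3)(f(2)(f(1)(x))):

import numpy as np

def layer(x, W, b, activation=np.tanh):
    # one layer: affine transformation followed by a non-linearity
    return activation(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input vector with 3 features (assumed)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # first layer f(1)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # second (hidden) layer f(2)
W3, b3 = rng.normal(size=(1, 2)), np.zeros(1)     # output layer f(3)

y = layer(layer(layer(x, W1, b1), W2, b2), W3, b3, activation=lambda z: z)
print(y.shape)   # (1,) -- the chain has depth 3, with layer widths 4, 2 and 1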

Q) Differentiate ML & DL.


1. Data dependencies for Performance:
When the data is small, deep learning algorithms don't perform that well. This is because deep learning algorithms need a large amount of data to understand it perfectly. On the other hand, traditional machine learning algorithms, with their handcrafted rules, prevail in this scenario.
2. Hardware dependencies
Deep learning algorithms heavily depend on high-end machines, contrary to
traditional machine learning algorithms, which can work on low-end
machines. Deep learning algorithms inherently do a large amount of matrix
multiplication operations. These operations can be efficiently optimized
using a GPU.

3. Feature engineering:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Feature engineering turns your inputs into things the algorithm can understand.

In machine learning, most of the applied features need to be identified by an expert and then hand-coded as per the domain and data type. Features can be pixel values, shape, texture, position and orientation. The performance of most machine learning algorithms depends on how accurately the features are identified and extracted.

Deep learning algorithms try to learn high-level features from data.

Deep learning reduces the task of developing a new feature extractor for every problem. For example, a convolutional NN will try to learn low-level features such as edges and lines in its early layers, then parts of faces, and then a high-level representation of a face.

4. Problem Solving approach


When solving a problem using a traditional machine learning algorithm, it is generally recommended to break the problem down into different parts, solve them individually and combine the results. Deep learning, in contrast, advocates solving the problem end-to-end.
Eg. Suppose you have a multiple-object detection task: identify what each object is and where it is present in the image.
In a typical ML approach, you would divide the problem into two steps, object detection and object recognition.
On the contrary, in the deep learning approach, you would do the process end-to-end.

5. Execution time
Usually, a deep learning algorithm takes a long time to train. This is
because there are so many parameters in a deep learning algorithm that
training them takes longer than usual. Whereas machine learning
comparatively takes much less time to train, ranging from a few seconds to
a few hours.
This in turn is completely reversed at test time. At test time, a deep learning algorithm takes much less time to run, whereas if you compare it with k-nearest neighbors (an ML algorithm), the test time increases with the size of the data. This is not applicable to all machine learning algorithms, though, as some of them have small testing times too.

6. Interpretability:
Suppose we use deep learning to give automated scoring to essays. The performance it gives in scoring is quite excellent and is near human performance. But there is an issue: it does not reveal why it has given that score. Mathematically you can find out which nodes of a deep neural network were activated, but we don't know what these neurons were supposed to model and what these layers of neurons were doing collectively. So we fail to interpret the results.
On the other hand, machine learning algorithms like decision trees give us crisp rules as to why they chose what they chose, so it is particularly easy to interpret the reasoning behind them. Therefore, algorithms like decision trees and linear/logistic regression are primarily used in industry for interpretability.

Characteristic-wise comparison (ML vs DL):
Data dependencies for performance — ML: requires less data for identifying rules; DL: requires a large amount of data for better performance.
Hardware dependencies — ML: works on low-end machines; DL: heavily depends on high-end machines.
Feature engineering — ML: features need to be identified by an expert and then hand-coded as per the domain and data type; DL: tries to learn high-level features from data, reducing the task of developing a new feature extractor for every problem.
Problem solving approach — ML: breaks the problem into parts, finds and combines the solutions; DL: solves the problem end-to-end.
Execution time — ML: takes much less time for training, but may take more time for testing depending on the algorithm (e.g. KNN); DL: takes more time for training and less time for testing.
Interpretability — ML: easy to interpret the results; DL: fails to interpret the results.
Q) Explain various applications of Deep Learning.

There are various interesting applications of Deep Learning that have turned things that were impossible a decade ago into reality. Some of them are:
1. Color restoration, where a given image in greyscale is automatically
turned into a colored one.
2. Recognizing hand written message.
3. Adding sound to a silent video that matches with the scene taking
place.
4. Self-driving cars
5. Computer Vision: for applications like vehicle number plate
identification and facial recognition.
6. Information Retrieval: for applications like search engines, both text
search, and image search.
7. Marketing: for applications like automated email marketing, target
identification
8. Medical Diagnosis: for applications like cancer identification, anomaly
detection
9. Natural Language Processing: for applications like sentiment analysis,
photo tagging
10. Online Advertising, etc

Q) Briefly explain about loss function in neural networks.

A neural network uses optimization strategies to minimize the error of the algorithm. The way we actually compute this error is by using a loss function, which quantifies how well or how badly the model is performing. Loss functions are divided into two broad categories, regression loss and classification loss.

1. Regression Loss Function


Regression Loss is used when we are predicting continuous values like the
price of a house or sales of a company.
Eg. Mean Squared Error
Mean Squared Error is the mean of the squared differences between the actual and predicted values: MSE = (1/n) * Σ (y_i − ŷ_i)^2. If the difference is large, the model penalizes it heavily, as we are computing the squared difference.
2. Binary Classification Loss Function
Suppose we are dealing with a Yes/No situation like “a person has diabetes
or not”, in this kind of scenario Binary Classification Loss Function is used.
Eg. Binary Cross Entropy Loss
It gives the probability value between 0 and 1 for a classification task.
Cross-Entropy calculates the average difference between the predicted and
actual probabilities.
3. Multi-Class Classification Loss Function
If we take a dataset like Iris, where we need to predict the three class labels Setosa, Versicolor and Virginica, and in general whenever the target variable has more than two classes, the Multi-Class Classification Loss function is used.
Eg. Categorical Cross Entropy Loss:
These are similar to binary classification cross-entropy, used for multi-class
classification problems.
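A minimal NumPy sketch (illustrative; the sample values are made up) of the three loss functions mentioned above:

import numpy as np

def mse(y_true, y_pred):
    # mean of the squared differences between actual and predicted values
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded, y_pred holds one row of class probabilities per sample
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))                                # 0.25
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))               # ~0.164
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))  # ~0.223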

Q) Explain briefly about gradient descent algorithm.


A deep learning neural network learns to map a set of inputs to a set of outputs from training data. We cannot calculate the perfect weights for a neural network analytically; they must be found through an iterative optimization procedure.
Gradient descent is an iterative optimization algorithm for finding the
minimum of a function.
To find the minimum of a function using gradient descent, one takes
steps proportional to the negative of the gradient of the function at the
current point.
The “gradient” in gradient descent refers to an error gradient. The
model with a given set of weights is used to make predictions and the error
for those predictions is calculated.
Eg.

Fig. Gradient Descent


The gradient is given by the slope of the tangent at w = 0.2, and then the
magnitude of the step is controlled by a parameter called the learning rate.
The larger the learning rate, the bigger the step we take, and the smaller the
learning rate, the smaller the step we take. Then we take the step and we
move to w1.
Now when choosing the learning rate, we have to be very careful as a large
learning rate can lead to big steps and eventually missing the minimum.
On the other hand, a small learning rate can result in very small steps and
therefore causing the algorithm to take a long time to find the minimum
point.
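The update rule is w_new = w_old – learning_rate * dE/dw. A minimal Python sketch (illustrative) on an assumed toy loss E(w) = (w – 3)^2, starting from w = 0.2 as in the figure:

learning_rate = 0.1
w = 0.2                                # starting point

for step in range(50):
    gradient = 2 * (w - 3)             # dE/dw for E(w) = (w - 3)**2
    w = w - learning_rate * gradient   # step in the direction of the negative gradient
print(round(w, 4))                     # approaches the minimum at w = 3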

Q) Explain about Back propagation algorithm.

Back-propagation is the essence of neural net training. It is the method of


fine-tuning the weights of a neural net based on the error rate obtained in
the previous epoch (i.e., iteration). Proper tuning of the weights allows you to
reduce error rates and to make the model reliable by increasing its
generalization.

The algorithm effectively trains a neural network by applying the chain rule. In simple terms, after each forward pass through the network, back propagation performs a backward pass while adjusting the model's parameters (weights and biases).

Algorithm:
1. Initialize the weights and biases.
2. Iteratively repeat the following steps until a defined number of iterations is completed or the error falls below a threshold value:
i. Calculate network output using forward propagation.
ii. Calculate error between actual and predicted values.
iii. Propagate the error back into the network and update weights
and biases using the equations:

Fig. illustrating BP
Example:
Forward Propagation (with x1 = 0.1, w1 = 0.15, b1 = 0.4, w2 = 0.45, b2 = 0.65, as implied by the update steps below):

Therefore,
z1 = 0.415 a1 = 0.6023 z2 = 0.9210 a2 = 0.7153

Let us consider:
epochs = 1000, threshold = 0.001, learning rate = 0.4, T = 0.25

E = 1/2 (T - a2)^2 = 0.1083

Eqn # 1: z1 = x1·w1 + b1
Eqn # 2: a1 = σ(z1) = 1/(1 + e^(-z1))
Eqn # 3: z2 = a1·w2 + b2
Eqn # 4: a2 = σ(z2) = 1/(1 + e^(-z2))
Eqn # 5: E = 1/2 (T - a2)^2
Updating w2:

= 0.45 - 0.4(-(0.25-0.7153))*(0.7153(1-0.7153))*(0.6023)
= 0.45 - 0.4*0.05706
= 0.427
Updating b2:
= 0.65 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*1
= 0.65 - 0.4*0.0948
= 0.612
Updating w1:

= 0.15 - 0.4*(-(0.25-0.7153))*(0.7153(1 -0.7153))*0.45*0.6023(1-


0.6023)*0.1
= 0.15 - 0.4*0.001021
= 0.1496
Updating b1:

= 0.40 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*0.45*0.6023(1-
0.6023)*1
= 0.40-0.4*0.01021
= 0.3959

Therefore we continue the next iteration (feedforward) with the updated values of w1, b1, w2 and b2:
w1 = 0.1496, b1 = 0.3959, w2 = 0.427, b2 = 0.612, with the same input x1 = 0.1.
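The arithmetic above can be reproduced with a short Python script; a minimal sketch (illustrative) of one forward pass and one round of weight updates using the same values:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, T, lr = 0.1, 0.25, 0.4
w1, b1, w2, b2 = 0.15, 0.4, 0.45, 0.65

# forward pass
z1 = x1 * w1 + b1; a1 = sigmoid(z1)    # 0.415, 0.6023
z2 = a1 * w2 + b2; a2 = sigmoid(z2)    # 0.9210, 0.7153
E = 0.5 * (T - a2) ** 2                # 0.1083

# backward pass (chain rule)
delta2 = -(T - a2) * a2 * (1 - a2)     # dE/dz2
w2_new = w2 - lr * delta2 * a1         # 0.427
b2_new = b2 - lr * delta2              # 0.612
delta1 = delta2 * w2 * a1 * (1 - a1)   # dE/dz1 (uses the old w2, as in the worked example)
w1_new = w1 - lr * delta1 * x1         # 0.1496
b1_new = b1 - lr * delta1              # 0.3959

print(round(w1_new, 4), round(b1_new, 4), round(w2_new, 3), round(b2_new, 3))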

Q) What is Vanishing Gradient problem?


As more layers using certain activation functions are added to neural networks, the gradients of the loss function approach zero, making the network hard to train.

Eg. In the problem below, the derivatives with respect to the weights are very small.
So when we do back propagation, we keep multiplying factors that are less than 1 by each other, and the gradients become smaller and smaller as we move backward through the network.
This means the neurons in the earlier layers learn very slowly. The result is a training process that takes too long, and the prediction accuracy is compromised.
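A tiny Python sketch (illustrative; the per-layer pre-activation of 0.5 and weight of 0.5 are assumed) showing how the product of per-layer factors shrinks towards zero:

import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1 - s)                       # never larger than 0.25

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_derivative(0.5) * 0.5    # each backward step multiplies by a factor < 1
    print(layer, grad)
# after 10 layers the gradient is about 5e-10, so the earliest layers barely learn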

Q) Explain in detail about the CNN model.

MLPs use one perceptron for each input (e.g. each pixel in an image, multiplied by 3 in the RGB case). The number of weights rapidly becomes unmanageable for large images: for a 224 x 224 pixel image with 3 color channels there are around 150,528 weights that must be trained! As a result, difficulties arise whilst training and overfitting can occur.

A Convolutional neural network (CNN) is a neural network that has one or


more convolutional layers and is used mainly for image processing,
classification, segmentation.

Fig. CNN Architecture


Input layer:
The input to a CNN is mostly an image (n x m x 1 for a grayscale image, n x m x 3 for a colored image).

Fig. RGB image as input


Convolution layer:
Here, we basically define filters and compute the convolution between the defined filters and each of the 3 channel images.

Fig. convolution operation

In the same way we apply the filter to the remaining images (the above is for the red channel; then we do the same for green and blue). We can apply more than one filter; the more filters we use, the better we can preserve the spatial dimensions.

We use convolution instead of feeding the flattened image directly into a fully connected layer, because otherwise we end up with a massive number of parameters that need to be optimized, which is computationally expensive.
Eg. A fully connected neuron over a 5x5x1 image needs 25 weights without convolution; a 2x2 convolution filter needs only 4 shared weights and produces an (n-f+1) x (n-f+1) = 4x4 = 16-value feature map.

By using convolution we can also prevent overfitting of the model.

It is worthwhile to use a ReLU activation function in the convolution layer, which passes only positive values and sets negative values to zero.
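A minimal NumPy sketch (illustrative; the 5x5 image and the 2x2 filter values are assumed) of a 'valid' convolution followed by ReLU, producing the (n-f+1) x (n-f+1) feature map discussed above:

import numpy as np

def conv2d_valid(image, kernel):
    # 'valid' convolution (no padding, stride 1): output is (n - f + 1) x (n - f + 1)
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)          # toy 5x5x1 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])              # assumed 2x2 filter (4 shared weights)
feature_map = np.maximum(0, conv2d_valid(image, kernel))  # ReLU keeps only positive values
print(feature_map.shape)                                  # (4, 4), i.e. 16 output values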

Pooling layer:
The pooling layer's objective is to reduce the spatial dimensions of the data propagating through the network.
1. Max Pooling is the most common: for each section of the image we scan, we keep the highest value.

Fig. Max Pooling with stride = 2


Max pooling provides spatial invariance, which enables the neural network to recognize objects in an image even if the object does not exactly resemble the original object.

2. Average Pooling: here, we take the average of the area we scan.

Fig. Average Pooling with stride = 2
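A minimal NumPy sketch (illustrative; the 4x4 input values are made up) of max and average pooling with a 2x2 window and stride 2:

import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [2., 1., 9., 8.],
              [0., 3., 4., 7.]])
print(pool2d(x, mode="max"))   # [[6. 5.] [3. 9.]]
print(pool2d(x, mode="avg"))   # [[3.5 2. ] [1.5 7. ]]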

Fully Connected Layer:


Here, we flatten the output of the last convolutional (or pooling) layer and connect every node of the current layer with every node of the next layer.

This layer basically takes the output of the preceding layer, whether it is a convolutional layer, a ReLU layer or a pooling layer, and outputs an n-dimensional vector, where n is the number of classes pertaining to the problem.

Fig. Fully Connected Layer

Q) Differentiate Shallow NN and Deep NN.

Shallow Neural Network vs Deep Neural Network:
A shallow neural network consists of only one hidden layer, whereas a deep neural network consists of more than one hidden layer.
A shallow neural network takes input as vectors only, whereas a deep neural network can take raw data like images and text as input.
