Deep Learning
DSE 5251, M. Tech Data Science
Dr. Abhilash K Pai
Department of Data Science and Computer Applications
MIT Manipal
Introduction to AI
• Evolution of AI
• Goal: to create machines that think and perceive the world like humans.
• Initially, AI solved problems that can be described by a list of formal mathematical rules (e.g., playing chess).
• The challenge lies in tasks that are easy for people to perform but difficult to describe formally, i.e., problems that we solve intuitively (e.g., identifying spoken words, recognizing people in images).
Machine Learning
• AI systems need the ability to acquire their own knowledge by extracting patterns from raw data. This capability is known as machine learning.
Representations matter
Deep Learning
• Solution is to use machine learning to discover not only the mapping
from representation to output but also the representation itself.
• This approach is known as representation learning.
• Learned representations also allow AI systems to rapidly adapt to new tasks with minimal human intervention.
Deep Learning
• While designing algorithms for learning features, the goal is usually to separate the factors of variation.
• Unobserved objects or unobserved forces in the physical world that affect observable
quantities.
• They may also exist as constructs in the human mind that provide useful simplifying
explanations or inferred causes of the observed data.
• Ex: In voice data – speaker’s accent, gender, age
• However, many factors of variation influence every single piece of data we observe.
• Also, it can be very difficult to extract such high-level, abstract features from raw data.
Deep Learning, Machine Learning and AI
Deep Learning and Machine Learning
(Source: softwaretestinghelp.com)
Learning Multiple Components
Neural Network Examples
Scale drives deep learning progress
DSE 5251 Deep Learning : Reference Materials
1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
Neural Networks: From ground up
Credits:
Most of the content in these slides is adapted from:
CS7015 Deep Learning, Dept. of CSE, IIT Madras
by Dr. Mitesh Khapra
Biological Neuron
McCulloch-Pitts (MP) Neuron
Implementing Boolean functions using MP Neuron
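A minimal sketch of the idea (the threshold values below are the standard choices for these gates, not taken from the slides):

```python
import numpy as np

def mp_neuron(x, threshold):
    """McCulloch-Pitts neuron: fires (outputs 1) when the sum of its
    binary inputs reaches the threshold."""
    return int(np.sum(x) >= threshold)

# AND of 2 inputs: fire only when both inputs are 1 (threshold = 2).
# OR  of 2 inputs: fire when at least one input is 1 (threshold = 1).
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", mp_neuron(x, 2), "OR:", mp_neuron(x, 1))
```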
Perceptron
MP Neuron vs Perceptron
Boolean function using Perceptron : Example
Perceptron Learning Algorithm
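A minimal sketch of the perceptron learning algorithm, assuming the standard error-driven update (add a misclassified positive example to w, subtract a misclassified negative one); the slides' exact notation may differ:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron learning: update the weights only on mistakes."""
    # absorb the bias as an extra always-1 input
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= 0 else 0
            if pred != yi:                      # misclassified point
                w += (1 if yi == 1 else -1) * xi
    return w

# Learn boolean OR (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
print(train_perceptron(X, y))   # weights and bias that separate OR
```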
The XOR Conundrum
Non-linear!
Solving XOR using Multi-Layer Perceptrons (MLP)
Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2ⁿ perceptrons and one output layer containing 1 perceptron.
Going beyond Binary Inputs and Outputs
Need for activation functions
• The thresholding logic used by a perceptron is very harsh!
Now let us see what we get by taking two such sigmoid functions (with different b) and subtracting one from the other
Representation power of MLP
Can we come up with a neural network to represent this operation of subtracting one sigmoid function from another?
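A sketch of this construction (the weights and biases below are illustrative, chosen only to place the two sigmoid thresholds at x = 1 and x = 3):

```python
import numpy as np

def sigmoid(x, w, b):
    # a logistic neuron with weight w and bias b
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(0, 4, 9)
# Two steep sigmoids whose thresholds (-b/w) differ; their difference
# is close to 1 between the thresholds and close to 0 elsewhere:
# a localized "tower".
tower = sigmoid(x, w=10, b=-10) - sigmoid(x, w=10, b=-30)
print(np.round(tower, 2))   # peaks for x between 1 and 3
```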
Representation power of MLP
So far, we have the case where there is only one input.
What if we have more than one input, like the below profile?
A Typical Machine Learning Set-up
Learning Parameters
Gradient Descent
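A minimal sketch of the update rule w := w − η · ∂L/∂w on a toy one-parameter loss (start point, learning rate, and step count are illustrative):

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, whose gradient is
# 2*(w - 3).
w, eta = 0.0, 0.1
for step in range(50):
    grad = 2 * (w - 3)
    w = w - eta * grad          # w := w - eta * dL/dw
print(round(w, 4))              # converges towards the minimum at w = 3
```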
Feed Forward Neural Networks: Introduction
• How to compute ?
The choice of the loss function depends on the problem at hand
Feed Forward Neural Networks: Loss and Activation Functions
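A sketch of the two most common choices (squared error for real-valued regression outputs, cross-entropy for probabilities produced by a classifier); the sample values are illustrative:

```python
import numpy as np

def squared_error(y, y_hat):
    # typical loss for regression
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y_onehot, p):
    # typical loss for classification; eps avoids log(0)
    return -np.sum(y_onehot * np.log(p + 1e-12))

print(squared_error(np.array([1.2]), np.array([0.9])))
print(cross_entropy(np.array([1, 0, 0]), np.array([0.7, 0.2, 0.1])))
```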
Feed Forward Neural Networks
• How to compute ?
Feed Forward Neural Networks: Backpropagation
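A compact sketch of backpropagation through a one-hidden-layer network with tanh hidden units and a squared-error loss (sizes, data, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2));  y = rng.normal(size=(8, 1))
W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)
eta = 0.1

for _ in range(100):
    # forward pass
    h = np.tanh(X @ W1 + b1)          # hidden activations
    y_hat = h @ W2 + b2               # linear output
    # backward pass: chain rule, layer by layer
    d_out = (y_hat - y) / len(X)      # dL/d(y_hat) for squared error
    dW2 = h.T @ d_out;  db2 = d_out.sum(0)
    d_h = (d_out @ W2.T) * (1 - h**2) # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h;    db1 = d_h.sum(0)
    # gradient descent update
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
```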
Contour Plot
Gradient descent and its variants
Plateaus and Flat Regions
Momentum-based Gradient Descent
Nesterov Accelerated Gradient Descent
• Recall that:
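The recalled update rules appear as equations on the slide; as a sketch, momentum accumulates vₜ = γ·vₜ₋₁ + η·∇w, while NAG computes the gradient at the look-ahead point w − γ·vₜ₋₁ (the γ and η values below are illustrative):

```python
# Momentum vs. Nesterov (NAG) updates on the toy loss L(w) = (w - 3)^2.
def grad(w):
    return 2 * (w - 3)

w_m = w_n = 0.0
v_m = v_n = 0.0
gamma, eta = 0.9, 0.1
for _ in range(30):
    # momentum: exponentially-decaying history of gradients
    v_m = gamma * v_m + eta * grad(w_m)
    w_m -= v_m
    # NAG: "look ahead" before computing the gradient, which curbs
    # momentum's overshooting
    v_n = gamma * v_n + eta * grad(w_n - gamma * v_n)
    w_n -= v_n
print(round(w_m, 3), round(w_n, 3))
```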
• Line Search
Choosing a learning rate
• Annealing-based methods:
  • Step Decay:
    • Halve the learning rate after every 5 epochs, or
    • Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch.
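A sketch of the first rule (halve the learning rate after every 5 epochs):

```python
# Step decay schedule (a sketch).
eta0 = 0.1
for epoch in range(20):
    eta = eta0 * (0.5 ** (epoch // 5))
    # ... run one training epoch with learning rate eta ...
    if epoch % 5 == 0:
        print(f"epoch {epoch}: eta = {eta}")
```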
• Intuition: decay the learning rate for parameters in proportion to their update history.
  • For sparse features, the accumulated update history is small.
  • For dense features, the accumulated update history is large.
• Make the learning rate inversely proportional to the update history, i.e., if a feature has been updated fewer times, give it a larger learning rate, and vice versa.
Adagrad
• Update rule for Adagrad
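The update rule itself appears as an equation on the slide; a sketch of the standard form, wₜ₊₁ = wₜ − η/(√vₜ + ε) · ∇wₜ, where vₜ accumulates squared gradients:

```python
import numpy as np

# Adagrad on the toy loss L(w) = (w - 3)^2 (values are illustrative).
def grad(w):
    return 2 * (w - 3)

w, v, eta, eps = 0.0, 0.0, 1.0, 1e-8
for _ in range(100):
    g = grad(w)
    v += g ** 2                           # accumulate update history
    w -= (eta / (np.sqrt(v) + eps)) * g   # rarely-updated parameters
                                          # keep a larger effective step
print(round(w, 3))
```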
▪ Example: Consider a discrete signal xₜ which represents the position of a spaceship at time t, recorded by a laser sensor.
▪ Considering that the most recent measurements are more important, we would like to take a weighted average over xₜ. The new estimate at time t is computed as:

$s_t = \sum_{a=0}^{\infty} x_{t-a}\, w_{-a} = (x * w)_t$

where x is the input and w is the filter/mask/kernel; this operation is called convolution.
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.
▪ We just slide the filter over the input and compute the value of sₜ based on a window around xₜ:

x: 1.00 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20
s: 1.80 1.96 ...

▪ Use cases of 1-D convolution: audio signal processing, stock market analysis, time series analysis, etc.

Content adapted from: CS7015 Deep Learning, Dept. of CSE, IIT Madras
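A sketch of sliding a small window filter over the signal (the 3-tap weights below are illustrative, weighting the most recent sample most):

```python
import numpy as np

x = np.array([1.00, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])
w = np.array([0.2, 0.3, 0.5])              # most recent sample weighted most
# np.convolve flips its kernel, so pre-flipping w gives the sliding
# dot product (cross-correlation) used above
s = np.convolve(x, w[::-1], mode="valid")
print(np.round(s, 2))
```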
In 2-D, we convolve an input image I with a filter K of size m × n:

$S_{ij} = (I * K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\,j-b}\, K_{a,b}$

[Figure: a 5 × 5 binary image with the pixel of interest highlighted]
Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras
[Figures: an input image after applying a smoothing filter and a sharpening filter]
(Source: https://developers.google.com/)
Common edge-detection filters:

Prewitt:
Sx = [ -1 0 1 ]     Sy = [  1  1  1 ]
     [ -1 0 1 ]          [  0  0  0 ]
     [ -1 0 1 ]          [ -1 -1 -1 ]

Sobel:
Sx = [ -1 0 1 ]     Sy = [  1  2  1 ]
     [ -2 0 2 ]          [  0  0  0 ]
     [ -1 0 1 ]          [ -1 -2 -1 ]

Laplacian:          Roberts:
[ 0  1  0 ]         Sx = [ 0  1 ]    Sy = [ 1  0 ]
[ 1 -4  1 ]              [ -1 0 ]         [ 0 -1 ]
[ 0  1  0 ]

[Figures: the input image after applying horizontal and vertical edge-detection filters]
The Convolution Operation - 2D

Filter 1 (3 × 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Input image (6 × 6):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

With stride = 1, the dot product of the filter with the first 3 × 3 window gives 3; shifting one column right gives −1, and so on.

Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
With stride = 2, the kernel shifts two units per slide, so the same filter and input produce a smaller output (the first two values along the top row are 3 and −3).
Sliding Filter 1 over the whole 6 × 6 input with stride = 1 gives a 4 × 4 feature map:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
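A sketch that reproduces the feature maps above (note that, as in CNN practice, this computes cross-correlation, i.e., the kernel is not flipped):

```python
import numpy as np

img = np.array([[1,0,0,0,0,1],
                [0,1,0,0,1,0],
                [0,0,1,1,0,0],
                [1,0,0,0,1,0],
                [0,1,0,0,1,0],
                [0,0,1,0,1,0]])
f = np.array([[ 1,-1,-1],
              [-1, 1,-1],
              [-1,-1, 1]])

def conv2d(x, k, stride=1):
    H, W = x.shape; F = k.shape[0]
    out = [[np.sum(x[i:i+F, j:j+F] * k)          # dot product per window
            for j in range(0, W - F + 1, stride)]
           for i in range(0, H - F + 1, stride)]
    return np.array(out)

print(conv2d(img, f, stride=1))   # the 4 x 4 feature map above
print(conv2d(img, f, stride=2))   # a smaller 2 x 2 feature map
```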
[Figure: a colour image consists of three channels: R, G, B]
(Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science)
[Figure: convolving a multi-channel input with K filters]

• K filters = K feature maps.
• Depth of feature map = No. of feature maps = No. of filters.
1. Depth of an Input Image = No. of channels in the Input Image = Depth of a filter
2. Assuming square filters, Spatial Extent (F) of a filter is the size of the filter
[Figure: a 3 × 3 convolution (conv3x3) reduces a 4 × 4 input to a 2 × 2 output]
(Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science)
For an input of size W₁ × H₁ × D₁, a convolution layer with K filters of spatial extent F, stride S, and zero-padding P produces an output of size W₂ × H₂ × D₂, where:

$W_2 = \frac{W_1 - F + 2P}{S} + 1, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1, \quad D_2 = K$
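A sketch of the formula as a helper function:

```python
# Output-size calculator for a convolution layer.
def conv_output_size(W1, H1, F, S, P, K):
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K          # D2 = number of filters K

# e.g. a 6x6 input, 3x3 filter, stride 1, no padding -> 4x4 output
print(conv_output_size(6, 6, F=3, S=1, P=0, K=1))
```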
A typical CNN architecture: [Convolution → Pooling], repeated many times, followed by a fully connected feedforward network that produces the final prediction (e.g., cat | dog).
Filter 1:            Filter 2:
 1 -1 -1             -1  1 -1
-1  1 -1             -1  1 -1
-1 -1  1             -1  1 -1

Pooling operations:
• Max Pooling
• Average Pooling

Feature maps produced by Filter 1 and Filter 2:
 3 -1 -3 -1          -1 -1 -1 -1
-3  1  0 -3          -1 -1 -2  1
-3 -3  0  1          -1 -1 -2  1
 3 -2 -2 -1          -1  0 -4  3

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
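A sketch of 2 × 2 max and average pooling with stride 2, applied to the Filter 1 feature map above:

```python
import numpy as np

def pool(x, size=2, stride=2, op=np.max):
    # apply op (max or mean) to each size x size window
    H, W = x.shape
    return np.array([[op(x[i:i+size, j:j+size])
                      for j in range(0, W - size + 1, stride)]
                     for i in range(0, H - size + 1, stride)])

fmap = np.array([[ 3,-1,-3,-1],
                 [-3, 1, 0,-3],
                 [-3,-3, 0, 1],
                 [ 3,-2,-2,-1]])
print(pool(fmap, op=np.max))    # max pooling  -> [[3, 0], [3, 1]]
print(pool(fmap, op=np.mean))   # average pooling
```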
Stride?
[Figure: subsampling the pixels of a "bird" image still yields a recognizable bird]
▪ Sparse Connectivity
▪ Shared weights
▪ Equivariant representation
[Figure: a 6 × 6 image feeding a convolution layer, where each output neuron connects only to a 3 × 3 window of inputs]

Fewer parameters! Each output connects to only 9 inputs, not to all of them (Sparse Connectivity).

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Properties of CNN

[Figure: the same 3 × 3 kernel weights are reused at every position of the 6 × 6 image]

Even fewer parameters! The same weights are shared across all windows (Shared Weights).

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Equivariance to translation
▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)), i.e., if the output changes in the same way as the input.
▪ As the same weights are shared across the image, if an object occurs in any image it will be detected irrespective of its position in the image.
Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science
A complete CNN: [Convolution + ReLU → Pooling], repeated, followed by a fully connected feedforward network (e.g., cat | dog).
LeNet-5 (LeCun et al., 1998): layer-wise parameter counts

• C1 (conv 5×5, 6 filters):  #Param. = ((5*5*1)+1) * 6 = 156
• S2 (pooling):              #Param. = 0
• C3 (conv 5×5, 16 filters): #Param. = ((5*5*6)+1) * 16 = 2416
• S4 (pooling):              #Param. = 0
• C5/FC1:                    #Param. = (5*5*16)*120 + 120 = 48120
• FC2:                       #Param. = 84*120 + 84 = 10164
• Output:                    #Param. = 84*10 + 10 = 850

Hidden layers use tanh; the output layer uses sigmoid.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
(Source: http://yann.lecun.com/)
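A sketch of the counting rule behind these numbers (each filter has F·F·(input channels) weights plus one bias; each dense layer has n_in·n_out weights plus n_out biases):

```python
def conv_params(f, in_ch, out_ch):
    return ((f * f * in_ch) + 1) * out_ch     # +1 for each filter's bias

def dense_params(n_in, n_out):
    return n_in * n_out + n_out               # weights + biases

print(conv_params(5, 1, 6))          # C1: 156
print(conv_params(5, 6, 16))         # C3: 2416
print(dense_params(5 * 5 * 16, 120)) # FC1: 48120
print(dense_params(120, 84))         # FC2: 10164
print(dense_params(84, 10))          # output: 850
```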
AlexNet Architecture (2012)

• Conv1: #Param. = ((11*11*3)+1) * 96 = 34944
• Conv2: #Param. = ((5*5*96)+1) * 256 = 614656
• Conv3: #Param. = ((3*3*256)+1) * 384 = 885120
• Conv4: #Param. = ((3*3*384)+1) * 384 = 1327488
• Conv5: #Param. = ((3*3*384)+1) * 256 = 884992
• Pooling layers: #Param. = 0

Total #Param. ≈ 62M

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
ZFNet Architecture (2013)

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
• This work reinforced the notion that convolutional neural networks need a deep stack of layers for the hierarchical representation of visual data to work.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR'14).

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
Effect of increasing the number of layers of a shallow CNN, experimented on the CIFAR dataset.
ResNet-34
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Sequence Models
▪ Music Generation
▪ Sentiment Classification
▪ Machine Translation
▪ Speech Recognition
▪ Named Entity Recognition: "Alice wants to discuss about Deep Learning with Bob"
Introduction

• In feedforward and convolutional neural networks, the size of the input was always fixed.
• Further, each input to the network was independent of the previous or future inputs.
• In many applications with sequence data, the input is not of a fixed size.
• Example tasks: Auto-complete, Part-of-Speech (P-o-S) tagging, Movie Review classification, Action Recognition.
• The model needs to look at a sequence of inputs and produce an output (or outputs).
• For this purpose, let us consider each input to correspond to one time step.
• Next, build a network for each time step/input, where each network performs the same task (e.g., auto-complete: input = character, output = character).
• Make sure that the function executed at each time step is the same, because at each time step we are doing the same task.

[Figure: one network per time step; legend: output layer, hidden layer, input layer]
• If the input sequence is of length n, we would create n networks, one for each input, as seen previously.
• At time step i = 0 there are no previous inputs, so they are typically assumed to be all zeros.
• Since the output sᵢ at time step i is a function of all the inputs from previous time steps, we could say it has a form of memory.
• A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).
• Unrolling the network through time = representing the network against the time axis.
• At each time step t (also called a frame), the RNN receives the input xₜ as well as the output from the previous step, yₜ₋₁.
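A sketch of one recurrent cell unrolled over a short sequence, assuming the common update sₜ = tanh(U·xₜ + W·sₜ₋₁ + b) with the same parameters reused at every step (all sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))   # input -> hidden
W = rng.normal(size=(4, 4))   # hidden -> hidden (shared across steps)
V = rng.normal(size=(2, 4))   # hidden -> output
b = np.zeros(4)

s = np.zeros(4)               # state at t = 0: assumed all zeros
for x_t in rng.normal(size=(5, 3)):     # a length-5 input sequence
    s = np.tanh(U @ x_t + W @ s + b)    # same function at every step
    y_t = V @ s                         # output at this time step
print(y_t)
```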
Seq-to-Seq
Vector-to-Seq
Seq-to-Vector
Recurrent Neural Networks (RNN): Example

[Figures: a small RNN with inputs and outputs between 0 and 1 (e.g., yesterday's and today's temperature) unrolled over successive days]

Unrolling the feedback loop by making a copy of the NN for each input value.

Source: https://www.youtube.com/c/joshstarmer
UNIT-4
Deep Learning: Basics of Deep Learning, Machine Learning vs Deep Learning, Fundamental Deep Learning Algorithm: Convolutional Neural Network (CNN).
Neural Networks are networks used in Deep Learning that work similarly to the human nervous system.
An artificial neuron is a mathematical function conceived as a model of biological neurons in a neural network. Artificial neurons are elementary units in an artificial neural network.
The artificial neuron receives one or more inputs, computes a weighted sum, and produces an output by applying an activation function.
Fig. Artificial Neuron
For the above general model of an artificial neural network, the net input can be calculated as follows:

y_in = (x₁·w₁ + x₂·w₂ + x₃·w₃ + … + x_m·w_m) + bias

i.e., $y_{in} = \sum_{i=1}^{m} x_i w_i + \text{bias}$

where xᵢ is the set of features and wᵢ is the set of weights. Bias is information that can impact the output without being dependent on any feature.

The output can be calculated by applying the activation function over the net input:

Y = F(y_in)
Each AN has an internal state, which is called an activation signal. Output
signals, which are produced after combining the input signals and activation
rule, may be sent to other units.
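A sketch of these two steps, using a simple step function as the activation F (the inputs, weights, and bias are illustrative):

```python
import numpy as np

def neuron(x, w, bias):
    y_in = np.dot(x, w) + bias        # y_in = sum(x_i * w_i) + bias
    return 1 if y_in >= 0 else 0      # Y = F(y_in), here a step function

print(neuron(np.array([1, 0, 1]), np.array([0.5, -0.2, 0.3]), -0.6))
```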
Eg.
Q) Construct a single layer neural network for implementing OR, AND,
NOT gates.
1. XOR:
XOR(A, B) = (A + B)·(A·B)′
This sort of relationship cannot be modeled using a single neuron. Thus we will use a multi-layer network.
The idea behind using multiple layers is that complex relations can be
broken into simpler functions and combined.
2. XNOR function looks like:
Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions
TanH is also like the logistic sigmoid, but better: the range of the TanH function is from -1 to +1.
TanH is often preferred over the sigmoid neuron because it is zero-centred.
The advantage is that negative inputs are mapped strongly negative and zero inputs are mapped near zero in the tanh graph.
tanh(x) = 2 * sigmoid(2x) - 1
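The identity can be checked numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# verifies tanh(x) = 2*sigmoid(2x) - 1 elementwise
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```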
The ReLU is the most used activation function. It is used in the hidden layers of almost all convolutional neural networks.
The ReLU is half rectified (from the bottom):
f(z) = 0, if z < 0
     = z, otherwise
i.e., R(z) = max(0, z)
The range is 0 to ∞.
Advantages
• Avoids the vanishing gradient problem.
• Computationally efficient: allows the network to converge very quickly.
• Non-linear: although it looks like a linear function, ReLU has a derivative function and allows for backpropagation.
Disadvantages
• Can only be used in hidden layers.
• Hard to train on small datasets; needs much data to learn non-linear behavior.
• The Dying ReLU problem: when inputs approach zero, or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.
We need the Leaky ReLU activation function to solve the 'Dying ReLU' problem.
With Leaky ReLU we do not force all negative inputs to zero but to a value near zero, which solves the major issue of the ReLU activation function.
R(z) = max(0.1*z, z)
Advantages
• Prevents the dying ReLU problem: this variation of ReLU has a small positive slope in the negative area, so it enables backpropagation even for negative input values.
• Otherwise like ReLU.
Disadvantages
• Results are not consistent: leaky ReLU does not provide consistent predictions for negative input values.
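A sketch of the two functions side by side (0.1 is the leak factor used above; other small slopes such as 0.01 are also common):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.1):
    return np.maximum(alpha * z, z)   # small positive slope for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.   0.   0.   1.5]
print(leaky_relu(z))  # [-0.2 -0.05 0.   1.5]
```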
3.4 Softmax:

$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

where:
σ = softmax
z = input vector
e^{z_i} = standard exponential function applied to element i of the input vector
K = number of classes in the multi-class classifier
Advantages
• Able to handle multiple classes, unlike other activation functions which handle only one class: it normalizes the outputs for each class between 0 and 1, with the sum of the probabilities being equal to 1, giving the probability of the input value belonging to a specific class.
• Useful for output neurons: typically Softmax is used only for the output layer, for neural networks that need to classify inputs into multiple categories.
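A numeric sketch showing the normalization (the scores are illustrative; subtracting the max improves numerical stability without changing the result):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])                         # raw class scores
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # stable softmax
print(np.round(p, 3), p.sum())   # e.g. [0.659 0.242 0.099], sums to 1.0
```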
These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations
used to define f, and finally to the output y. There are no feedback
connections in which outputs of the model are fed back into itself.
Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is
associated with a directed acyclic graph describing how the functions are
composed together.
For example, we might have three functions f⁽¹⁾, f⁽²⁾, and f⁽³⁾ connected in a chain, to form f(x) = f⁽³⁾(f⁽²⁾(f⁽¹⁾(x))). This chain structure is the most commonly used structure of neural networks. In this case, f⁽¹⁾ is called the first layer of the network (the input layer, used to feed the input into the network); f⁽²⁾ is called the second layer (a hidden layer, used to train the neural network), and so on. The final layer of a feedforward network is called the output layer, which provides the output of the network. The overall length of the chain gives the depth of the model, and the width of the model is the number of neurons in a layer. It is from this terminology that the name "deep learning" arises.
3. Feature engineering:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Feature engineering turns your inputs into things the algorithm can understand.
5. Execution time
Usually, a deep learning algorithm takes a long time to train. This is because there are so many parameters in a deep learning algorithm that training them takes longer than usual. Machine learning, in comparison, takes much less time to train, ranging from a few seconds to a few hours.
This in turn is completely reversed at testing time. At test time, a deep learning algorithm takes much less time to run, whereas for k-nearest neighbors (a machine learning algorithm) test time increases with the size of the data. This is not applicable to all machine learning algorithms, though, as some of them have small testing times too.
6. Interpretability:
Suppose we use deep learning to give automated scoring to essays. The performance it gives in scoring is quite excellent and is near human performance. But there is an issue: it does not reveal why it has given that score. Mathematically, you can find out which nodes of a deep neural network were activated, but we don't know what these neurons were supposed to model and what these layers of neurons were doing collectively. So we fail to interpret the results.
On the other hand, machine learning algorithms like decision trees
give us crisp rules as to why it chose what it chose, so it is particularly easy
to interpret the reasoning behind it. Therefore, algorithms like decision trees
and linear/logistic regression are primarily used in industry for
interpretability.
Characteristic           | Machine Learning (ML)                                      | Deep Learning (DL)
-------------------------|------------------------------------------------------------|--------------------------------------------------------
Data dependencies        | Requires less data for identifying rules                   | Requires a large amount of data for better performance
Hardware dependencies    | Works on low-end machines                                  | Heavily depends on high-end machines
Feature engineering      | Features need to be identified by an expert and then hand-coded per the domain and data type | Tries to learn high-level features from data; reduces the task of developing a new feature extractor for every problem
Problem-solving approach | Breaks the problem into parts, finds and combines the solutions | Solves the problem end-to-end
Execution time           | Takes much less time for training but may take more time for testing, depending on the algorithm (e.g., KNN) | Takes more time for training and less time for testing
Interpretability         | Easy to interpret the results                              | Fails to interpret the results
Q) Explain various applications of Deep Learning.
There are various interesting applications of Deep Learning that turned things that were impossible a decade ago into reality. Some of them are:
1. Color restoration, where a given image in greyscale is automatically turned into a colored one.
2. Recognizing handwritten messages.
3. Adding sound to a silent video that matches the scene taking place.
4. Self-driving cars
4. Self-driving cars
5. Computer Vision: for applications like vehicle number plate
identification and facial recognition.
6. Information Retrieval: for applications like search engines, both text
search, and image search.
7. Marketing: for applications like automated email marketing, target
identification
8. Medical Diagnosis: for applications like cancer identification, anomaly
detection
9. Natural Language Processing: for applications like sentiment analysis,
photo tagging
10. Online Advertising, etc
Algorithm:
1. Initialize the weights and biases.
2. Iteratively repeat the following steps a defined number of times or until a threshold value is reached:
   i. Calculate the network output using forward propagation.
   ii. Calculate the error between actual and predicted values.
   iii. Propagate the error back into the network and update the weights and biases using the equations:
Fig. illustrating BP
Example:

Let us consider: epochs = 1000, threshold = 0.001, learning rate = 0.4, T = 0.25.

Forward Propagation:
z1 = 0.415, a1 = 0.6023, z2 = 0.9210, a2 = 0.7153

Therefore, E = ½(T − a2)² = 0.1083
Eqn 1: z₁ = x₁·w₁ + b₁
Eqn 2: a₁ = σ(z₁) = 1/(1 + e^(−z₁))
Eqn 3: z₂ = a₁·w₂ + b₂
Eqn 4: a₂ = σ(z₂) = 1/(1 + e^(−z₂))
Eqn 5: E = ½(T − a₂)²
Updating w2:
w2 = 0.45 − 0.4 * (−(0.25 − 0.7153)) * (0.7153 * (1 − 0.7153)) * 0.6023
   = 0.45 − 0.4 * 0.05706
   = 0.427
Updating b2:
b2 = 0.65 − 0.4 * (−(0.25 − 0.7153)) * (0.7153 * (1 − 0.7153)) * 1
   = 0.65 − 0.4 * 0.0948
   = 0.612
Updating w1:
w1 = 0.40 − 0.4 * (−(0.25 − 0.7153)) * (0.7153 * (1 − 0.7153)) * 0.45 * (0.6023 * (1 − 0.6023)) * 1
   = 0.40 − 0.4 * 0.01021
   = 0.3959
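A sketch reproducing the worked example in code; x = 1 is implied by the trailing factor of 1 in the update of w1, and b1 = 0.015 is inferred so that z1 = 0.415:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, T, lr = 1.0, 0.25, 0.4
w1, b1, w2, b2 = 0.40, 0.015, 0.45, 0.65   # b1 inferred so z1 = 0.415

# forward pass
z1 = x * w1 + b1;  a1 = sigmoid(z1)        # z1 = 0.415, a1 ~= 0.6023
z2 = a1 * w2 + b2; a2 = sigmoid(z2)        # z2 ~= 0.9210, a2 ~= 0.7153
E = 0.5 * (T - a2) ** 2                    # E ~= 0.1083

# backward pass (chain rule)
delta2 = -(T - a2) * a2 * (1 - a2)         # dE/dz2
w2_new = w2 - lr * delta2 * a1             # ~= 0.427
b2_new = b2 - lr * delta2                  # ~= 0.612
delta1 = delta2 * w2 * a1 * (1 - a1)       # dE/dz1
w1_new = w1 - lr * delta1 * x              # ~= 0.3959
print(w1_new, w2_new, b2_new)
```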
Eg. In the below problem, the derivatives with respect to the weights are very small.
So when we do backpropagation, we keep multiplying factors that are less than 1 by each other, and the gradients become smaller and smaller as we move backward in the network.
This means the neurons in the earlier layers learn very slowly. The result is
a training process that takes too long and prediction accuracy is
compromised.
MLPs use one perceptron for each input (e.g., a pixel in an image, multiplied by 3 in the RGB case). The number of weights rapidly becomes unmanageable for large images: for a 224 x 224 pixel image with 3 color channels, there are around 150,528 weights per neuron that must be trained! As a result, difficulties arise while training, and overfitting can occur.
In the same way, we apply the filter to the remaining channels (the above is for the red channel; we then do the same for the green and blue channels). We can apply more than one filter: the more filters we use, the better we can preserve the spatial dimensions.
Pooling layer:
The pooling layer's objective is to reduce the spatial dimensions of the data propagating through the network.
1. Max Pooling is the most common: for each section of the image we scan, we keep the highest value.