Module #1: Fundamentals of Neural Networks
Table of Contents
1 Course Logistics
2 Introduction to Deep Learning
3 Applications of Deep Learning
4 Key Components of DL Problem
5 Artificial Neural Network
6 Perceptron
7 Single Perceptron for Linear Regression
8 Multi Layer Perceptron (MLP)
What we Learn...
Text Books
T1. Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola.
https://d2l.ai/chapter_introduction/index.html
Evaluation Schedule
EC3: Comprehensive exam (regular, open book), 2.5 hrs, 30% weightage. Check the calendar for the date.
Lab Sessions
LMS
Refer to Canvas for the following:
Handout
Schedule for Webinars
Schedule of quizzes and assignments.
Evaluation scheme
Session Slide Deck
Demo Lab Sheets
Quiz-I, Quiz-II
Assignment-I, Assignment-II
Sample QPs
All submissions for graded components must be the result of your original effort. Copying and pasting verbatim from any source, whether online or from your peers, is strictly prohibited. The use of unauthorized sources or materials, as well as collusion or unauthorized collaboration to gain an unfair advantage, is also strictly prohibited. Please note that for plagiarism we will not distinguish between the person sharing their resources and the one receiving them; the consequences will apply equally to both parties.
In case of queries regarding the course...
What is Deep Learning?
[Venn diagram: Deep Learning is a subset of Neural Networks, which are a subset of Machine Learning, which is a subset of Artificial Intelligence; these fields overlap with Data Science.]
AI – ML – DL
Deep (Machine) Learning
Why Deep Learning?
Deep Learning: Timeline
Breakthroughs with Neural Networks
Image Segmentation and Recognition
Image Recognition
Applications of Deep Learning
Many more applications...
Core Components of DL Problem
The data that we can learn from.
A model of how to transform the data.
An objective function that quantifies how well (or badly) the model is doing.
An algorithm to adjust the model’s parameters to optimize the objective
function.
[Diagram: the four components (data, model, loss function, learning algorithm) interacting in a cycle.]
1. Data
Collection of examples.
Data has to be converted to a useful and suitable numerical representation.
Each example (or data point, data instance, sample) typically consists of a set
of attributes called features (or covariates), from which the model must make
its predictions.
In supervised learning problems, the attribute to be predicted is designated as the label (or target).
Mathematically, a set of m examples: Data = D = {X, t}.
Dimensionality of data
- Each example has the same number of numerical values; such data consists of fixed-length vectors. E.g., images.
- The constant length of these vectors is called the dimensionality of the data.
- Text data, in contrast, consists of varying-length examples.
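As a small illustration of fixed- versus varying-length examples (a sketch assuming NumPy; the arrays are stand-ins, not course data):

```python
import numpy as np

# An image example: every 28x28 grayscale image flattens to a fixed-length
# vector of 784 numbers, so the dimensionality of this data is 784.
image = np.random.rand(28, 28)   # stand-in for real pixel values
x_image = image.reshape(-1)      # fixed-length feature vector
print(x_image.shape)             # (784,)

# Text examples: sentences tokenize to different numbers of tokens,
# so there is no single fixed dimensionality.
sentences = ["deep learning", "neural networks are fun"]
tokens = [s.split() for s in sentences]
print([len(t) for t in tokens])  # [2, 4] -- varying lengths
```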
2. Model
Model denotes the computational machinery for ingesting data of one type,
and spitting out predictions of a possibly different type.
Deep learning models consist of many successive transformations of the data, chained together from top to bottom; hence the name deep learning.
3. Objective Function
3. Loss Functions
4. Optimization Algorithms
Example of the Framework
Create a Model
- Define a flexible program whose behavior is determined by a number of parameters.
- To determine the best possible set of parameters, use the data. The parameters should improve the performance of the program with respect to some measure of performance on the task of interest.
- After fixing the parameters, we call the program a model.
  E.g., the model receives a snippet of audio as input and generates a selection among {yes, no} as output.
- The set of all distinct programs (input-output mappings) that we can produce just by manipulating the parameters is called a family of models.
  E.g., we expect the same model family to be suitable for "Alexa" recognition and "Hey Siri" recognition because they seem, intuitively, to be similar tasks.
Example of the Framework
The meta-program that uses our dataset to choose the parameters is called a
learning algorithm.
In machine learning, learning is the process by which we discover the right setting of the parameters, coercing the desired behavior from our model.
Train the model with data.
What are Neural Networks?
It begins with the brain.
Humans learn, solve problems, recognize patterns, create, think deeply about things, meditate, and much more.
Humans learn through association. [Refer to Associationism for more details.]
Observation: The Brain
Brain: Interconnected Neurons
Connectionist Machines
[Diagram: inputs x0, x1, x2 feed a summation unit Σ and an activation f, producing the output ŷ.]
Properties of Artificial Neural Nets (ANNs)
When to consider Neural Networks?
Perceptron
One type of ANN system is based on a unit called a perceptron.
A perceptron takes a vector of real-valued inputs, calculates a linear
combination of these inputs, then outputs a 1 if the result is greater than
some threshold and -1 otherwise.
[Diagram: inputs x0, x1, x2 with weights w0, w1, w2 feed a summation unit z and a threshold unit h, producing ŷ.]

z = Σ_{i=0}^{2} w_i x_i
h = 1 if Σ_i w_i x_i > 0
h = −1 if Σ_i w_i x_i ≤ 0
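A minimal sketch of this unit in NumPy (the weight and input values here are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron(x, w):
    # Threshold unit: output 1 if w . x > 0, else -1.
    # x includes the constant input x0 = 1, so w[0] acts as the bias w0.
    z = np.dot(w, x)
    return 1 if z > 0 else -1

w = np.array([-0.5, 1.0, 1.0])   # illustrative weights (assumed)
x = np.array([1.0, 0.0, 1.0])    # x0 = 1 always, then x1, x2
print(perceptron(x, w))          # 1, since z = -0.5 + 0 + 1.0 = 0.5 > 0
```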
NOT Gate
Question:
How to represent the NOT gate using a Perceptron?
What are the parameters for the NOT Perceptron?
Data is given below (0/1 rewritten as −1/+1):

A   NOT A          x1    t
0   1        →     −1    1
1   0        →      1   −1
Perceptron for NOT gate
Perceptron equation is ŷ = w0 x0 + w1 x1, where x0 = 1 always.
h > 0 for the output to be 1.
[Diagram: two-input perceptron with inputs x0, x1, weights w0, w1, summation z, and threshold unit h producing ŷ.]
For each row of the truth table, the constraints are:

x1    t     constraint                simplifies to
−1    1     w0 x0 + w1(−1) > 0        w0 − w1 > 0
 1   −1     w0 x0 + w1(1) ≤ 0         w0 + w1 ≤ 0

One solution is w0 = 1 and w1 = −1 (an intuitive solution).
This gives a linear decision boundary.
AND Logic Gate
Question:
How to represent the AND gate using a Perceptron?
What are the parameters for the AND Perceptron?
Data is given below (0/1 rewritten as −1/+1):

A   B   A AND B          x1    x2    t
0   0   0          →     −1    −1   −1
0   1   0          →     −1     1   −1
1   0   0          →      1    −1   −1
1   1   1          →      1     1    1
Perceptron for AND gate
[Diagram: three-input perceptron with inputs x0, x1, x2 and weights w0, w1, w2.]
Perceptron equation is ŷ = w0 x0 + w1 x1 + w2 x2.
h > 0 for the output to be 1.
For each row of the truth table, constraint equations analogous to the NOT case can be written.
[Figure: the resulting linear decision boundary in the (x1, x2) plane.]
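A quick check of one solution in plain Python (the weights w0 = −1, w1 = w2 = 1 are my assumption of one valid setting, not taken from the slides):

```python
def predict(x1, x2, w0=-1.0, w1=1.0, w2=1.0):
    # x0 = 1 always; output 1 if z > 0, else -1
    z = w0 + w1 * x1 + w2 * x2
    return 1 if z > 0 else -1

# AND data in the -1/+1 encoding; the prediction should match target t
for x1, x2, t in [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]:
    print(x1, x2, predict(x1, x2), t)
```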
OR Logic Gate
Question:
How to represent the OR gate using a Perceptron?
What are the parameters for the OR Perceptron?
Data is given below (0/1 rewritten as −1/+1):

A   B   A OR B          x1    x2    t
0   0   0         →     −1    −1   −1
0   1   1         →     −1     1    1
1   0   1         →      1    −1    1
1   1   1         →      1     1    1
Perceptron for OR gate
[Diagram: three-input perceptron with inputs x0, x1, x2 and weights w0, w1, w2.]
Perceptron equation is ŷ = w0 x0 + w1 x1 + w2 x2.
h > 0 for the output to be 1.
For each row of the truth table, constraint equations analogous to the NOT case can be written.
[Figure: the resulting linear decision boundary in the (x1, x2) plane.]
Perceptron Learning Rule
w_i ← w_i + Δw_i
Δw_i = η (t − ŷ) x_i
where:
t = target value
ŷ = perceptron output
η = learning rate
Convergence of Perceptron Learning Algorithm
Perceptron Learning for NOT gate
The equations are:
Hypothesis: z = w0 + w1 x1
Activation: h = sign(z)
Compute output: ŷ = h = sign(w0 + w1 x1)
Compute update: Δw = η (t − ŷ) x
Update: w_new ← w_old + Δw
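A minimal training loop for the NOT data following these equations (a sketch; the zero initialization, learning rate, and epoch count are assumptions):

```python
import numpy as np

# NOT-gate data in the -1/+1 encoding: x1 = -1 -> t = 1, x1 = 1 -> t = -1
X = np.array([[1.0, -1.0],    # each row is [x0 = 1, x1]
              [1.0,  1.0]])
T = np.array([1.0, -1.0])

w = np.zeros(2)   # assumed initialization
eta = 0.1         # assumed learning rate

for epoch in range(10):
    for x, t in zip(X, T):
        y_hat = 1.0 if np.dot(w, x) > 0 else -1.0   # yhat = h = sign(z)
        w += eta * (t - y_hat) * x                   # w <- w + eta (t - yhat) x

print(w)   # converges to [0.2, -0.2]: w0 > 0 and w1 < 0, as derived earlier
```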
Demo Python Code
https://colab.research.google.com/drive/
1DUVcOoUIWhl8GQKc6AWR1wi0LaMeNkgD?usp=sharing
Students, please note:
The Python notebook is shared with anyone who has the link, but access is restricted to BITS email IDs. Please do not access it from a non-BITS email ID and send access requests; access for non-BITS email IDs will not be granted.
Exercise
Representational Power of Perceptron
A Perceptron represents a hyperplane decision surface in the n-dimensional
space of examples.
The Perceptron outputs a 1 for examples lying on one side of the hyperplane
and outputs a -1 for examples lying on the other side.
Linearly Separable Data
Two sets of data points in a two-dimensional space are said to be linearly separable when they can be completely separated by a single straight line.
If a straight line can be drawn to separate all the examples belonging to class +1 from all the examples belonging to class −1, then the two-dimensional data are linearly separable.
An infinite number of straight lines can then be drawn to separate class +1 from class −1.
Perceptron for Linearly Separable Data
[Diagram: perceptron with inputs 1, x1, ..., xn and weights w0, w1, ..., wn.]

x = [1, x1, ..., xn]^T,   w = [w0, w1, ..., wn]^T

w^T x = w0 + w1 x1 + · · · + wn xn

ŷ = 1 if w^T x > 0, and ŷ = −1 otherwise.

Challenge: How to learn these n + 1 parameters w0, ..., wn?
Solution: Use the Perceptron learning algorithm.
Perceptron Training
w_i ← w_i + Δw_i
where
Δw_i = η (t − o) x_i
and:
t = c(x) is the target value
o is the perceptron output
η is a small constant (e.g., 0.1) called the learning rate

The perceptron training rule will converge if:
- the training data is linearly separable, and
- η is sufficiently small.
Perceptron and its Learning - Review
Data
- Truth tables or a set of examples
Model
- Perceptron
Objective Function
- Deviation of the desired output t from the computed output ŷ
Learning Algorithm
- Perceptron learning algorithm
Linear Regression Example
Suppose that we wish to estimate the prices of houses (in dollars) based on
their area (in square feet) and age (in years).
The linearity assumption says that the target (price) can be expressed as a weighted sum of the features (area and age):

price = w_area · area + w_age · age + b
Data
The dataset is called a training dataset or training set.
Each row is called an example (or data point, data instance, sample).
The thing we are trying to predict is called a label (or target).
The independent variables upon which the predictions are based are called
features (or covariates).
m: number of training examples
i: index of the i-th example
x^(i): features of the i-th example; the inputs are [x^(1), x^(2), x^(3), ..., x^(m)]
y^(i): label corresponding to the i-th example; the labels are [y^(1), y^(2), y^(3), ..., y^(m)]
Affine transformations and Linear Models
Affine transformations are
ŷ = w1 x1 + w2 x2 + ... + wd xd + b = w^T x + b
For the whole dataset, with design matrix X (one example per row): ŷ = Xw + b
Squared Error Loss Function
The most popular loss function in regression problems is the squared error.
For each example,

loss^(i)(w, b) = (1/2) (ŷ^(i) − y^(i))²

For the entire dataset of m examples, average (or, equivalently, sum) the losses:

L(w, b) = (1/m) Σ_{i=1}^{m} loss^(i)(w, b) = (1/m) Σ_{i=1}^{m} (1/2) (ŷ^(i) − y^(i))²

When training the model, find parameters (w_opt, b_opt) that minimize the total loss across all training examples.
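A small NumPy sketch of these loss formulas (the data values are made up for illustration):

```python
import numpy as np

def squared_error_loss(w, b, X, y):
    # L(w, b): mean over examples of 0.5 * (y_hat - y)^2
    y_hat = X @ w + b                  # affine model, one example per row
    return np.mean(0.5 * (y_hat - y) ** 2)

# Made-up data: 3 examples, 2 features (e.g., area and age)
X = np.array([[100.0, 5.0], [150.0, 2.0], [80.0, 20.0]])
y = np.array([200.0, 320.0, 130.0])
print(squared_error_loss(np.array([2.0, -1.0]), 10.0, X, y))
```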
Single-Layer Neural Network
[Diagram: a single output neuron ŷ1 fully connected to the inputs x1, x2, ..., xd.]
Non-linearly Separable Data
XOR Logic Gate
Question:
How to represent the XOR gate using a Perceptron?
What are the parameters for the XOR Perceptron?
Challenge: the data is non-linearly separable.
[Figure: decision boundary for XOR; no single straight line separates the two classes.]
Data is given below:

A   B   A XOR B
0   0   0
0   1   1
1   0   1
1   1   0
MLP for XOR gate
Question: How to represent the XOR gate using a Perceptron?
Use a Multilayer Perceptron (MLP).
Introduce another layer between the input and output.
This in-between layer is called the hidden layer.

[Network: hidden unit n1 has weights (1, 1) and threshold 1; hidden unit n2 has weights (−1, −1) and threshold −1; output unit n3 takes (n1, n2) with weights (1, 1) and threshold 2, computing x1 XOR x2. A unit outputs 1 when its weighted sum reaches its threshold.]
MLP for XOR gate
[Network as above: n1 = [x1 + x2 ≥ 1], n2 = [−x1 − x2 ≥ −1], n3 = [n1 + n2 ≥ 2].]

x1  x2  n1 (th = 1)               n2 (th = −1)                  n3 (th = 2)
0   0   0·1 + 0·1 = 0 < 1         0·(−1) + 0·(−1) = 0 ≥ −1      0·1 + 1·1 = 1 < 2
        n1 = 0                    n2 = 1                        n3 = 0
0   1   0·1 + 1·1 = 1 ≥ 1         0·(−1) + 1·(−1) = −1 ≥ −1     1·1 + 1·1 = 2 ≥ 2
        n1 = 1                    n2 = 1                        n3 = 1
1   0   1·1 + 0·1 = 1 ≥ 1         1·(−1) + 0·(−1) = −1 ≥ −1     1·1 + 1·1 = 2 ≥ 2
        n1 = 1                    n2 = 1                        n3 = 1
1   1   1·1 + 1·1 = 2 ≥ 1         1·(−1) + 1·(−1) = −2 < −1     1·1 + 0·1 = 1 < 2
        n1 = 1                    n2 = 0                        n3 = 0
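The table translates directly into a few lines of Python (a sketch for verification, using the thresholds decoded above):

```python
def step(z, threshold):
    # A unit fires (outputs 1) when its weighted sum reaches its threshold
    return 1 if z >= threshold else 0

def xor_mlp(x1, x2):
    n1 = step(x1 * 1 + x2 * 1, 1)      # OR-like hidden unit
    n2 = step(x1 * -1 + x2 * -1, -1)   # NAND-like hidden unit
    n3 = step(n1 * 1 + n2 * 1, 2)      # AND of the two hidden units
    return n3

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))  # outputs 0, 1, 1, 0
```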
Solution of XOR Data
Data
- Truth table
Model
- Multi-layered Perceptron
Challenge
- How to learn the parameters and thresholds?
Solution for learning
- Use the gradient descent algorithm
Gradient Descent
To understand, consider a simpler linear unit, where
ŷ = w0 + w1 x1 + · · · + wn xn

Let's learn the w_i's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (t_d − ŷ_d)²

∂E/∂w_i = ∂/∂w_i (1/2) Σ_d (t_d − ŷ_d)²
        = (1/2) Σ_d ∂/∂w_i (t_d − ŷ_d)²
        = (1/2) Σ_d 2 (t_d − ŷ_d) · ∂/∂w_i (t_d − ŷ_d)
        = Σ_d (t_d − ŷ_d) · ∂/∂w_i (t_d − w · x_d)

∂E/∂w_i = Σ_d (t_d − ŷ_d) (−x_{i,d})
Gradient Descent Algorithm
Gradient-Descent(training examples ⟨x, t⟩, learning rate η)
Initialize each w_i to some small random value.
Until the termination condition is met, do:
- Initialize each Δw_i to zero.
- For each ⟨x, t⟩ in the training examples, do:
  - Input the instance x to the unit and compute the output ŷ.
  - For each linear-unit weight w_i, compute
    ∂E/∂w_i = Σ_d (t_d − ŷ_d) (−x_{i,d})
    Δw_i = −η ∂E/∂w_i
- For each linear-unit weight w_i, do:
  w_i ← w_i + Δw_i
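The same procedure in NumPy for a linear unit (a sketch; the toy data, initialization, learning rate, and fixed-epoch termination are assumptions):

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.01, epochs=500):
    # Linear unit y_hat = w . x; X carries a leading column of ones,
    # so w[0] plays the role of the bias weight w0.
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random init
    for _ in range(epochs):                       # simple termination rule
        y_hat = X @ w
        grad = -(t - y_hat) @ X                   # dE/dw_i = sum_d (t_d - yhat_d)(-x_i,d)
        w -= eta * grad                           # delta_w = -eta dE/dw
    return w

# Toy data: t = 2 + 3 x1 plus small noise (made up for illustration)
x1 = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x1), x1])
t = 2 + 3 * x1 + np.random.default_rng(1).normal(scale=0.01, size=20)
print(batch_gradient_descent(X, t))   # approximately [2, 3]
```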
Incremental Gradient Descent Algorithm
Do until satisfied:
- For each training example d in D:
  - Compute the gradient ∇E_d[w]
  - w ← w − η ∇E_d[w]

E_d[w] ≡ (1/2) (t_d − ŷ_d)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
Minibatch Stochastic Gradient Descent (SGD)
Apply the gradient descent algorithm to a random minibatch of examples every time we need to compute the update.
In each iteration:
1. Step 1: randomly sample a minibatch B consisting of a fixed number of training examples.
2. Step 2: compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters.
3. Step 3: multiply the gradient by a predetermined positive value η and subtract the resulting term from the current parameter values.

w ← w − (η/|B|) Σ_{i∈B} ∂_w loss^(i)(w, b) = w − (η/|B|) Σ_{i∈B} x^(i) (w^T x^(i) + b − y^(i))

b ← b − (η/|B|) Σ_{i∈B} ∂_b loss^(i)(w, b) = b − (η/|B|) Σ_{i∈B} (w^T x^(i) + b − y^(i))
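The three steps in NumPy (a sketch; the synthetic data, batch size, and learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X w_true + b_true + noise (made up)
m, d = 1000, 2
X = rng.normal(size=(m, d))
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + rng.normal(scale=0.01, size=m)

w, b = np.zeros(d), 0.0
eta, batch_size = 0.03, 10                              # assumed hyperparameters

for epoch in range(3):
    for _ in range(m // batch_size):
        idx = rng.choice(m, batch_size, replace=False)  # Step 1: sample B
        err = X[idx] @ w + b - y[idx]                   # w^T x^(i) + b - y^(i)
        w -= (eta / batch_size) * (X[idx].T @ err)      # Steps 2-3: update w
        b -= (eta / batch_size) * err.sum()             # ... and update b

print(w, b)   # close to [2.0, -3.4] and 4.2
```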
Training SGD
Initialize parameters (w, b).
Repeat until done:
- Compute the gradient
  g ← (1/|B|) Σ_{i∈B} ∂_(w,b) loss^(i)(w, b)
- Update the parameters
  (w, b) ← (w, b) − η g

PS: The number of epochs and the learning rate are both hyperparameters. Setting hyperparameters requires some adjustment by trial and error.
Numerical Example of SGD
The function is y = (x + 5)².
When will it be at its minimum?
Use the gradient descent algorithm.
Assume the starting point x = 3 and learning rate η = 0.01.
Equations:

dy/dx = d/dx (x + 5)² = 2(x + 5)

x ← x − η · dy/dx

Epoch 1:
x ← 3 − 0.01 · 2(3 + 5) = 2.84
Epoch 2:
x ← 2.84 − 0.01 · 2(2.84 + 5) = 2.68
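The same iteration in a few lines of Python, reproducing the epochs above:

```python
def dy_dx(x):
    return 2 * (x + 5)   # derivative of (x + 5)^2

x, eta = 3.0, 0.01       # starting point and learning rate from the example
for epoch in range(1, 6):
    x -= eta * dy_dx(x)
    print(f"Epoch {epoch}: x = {x:.4f}")
# Epoch 1: x = 2.8400; Epoch 2: x = 2.6832; x keeps moving toward -5,
# the minimizer of (x + 5)^2.
```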
Further Reading