
Deep Learning
CS60010

Abir Das
Assistant Professor
Computer Science and Engineering Department
Indian Institute of Technology Kharagpur

http://cse.iitkgp.ac.in/~adas/

Biological Neural Network

Image courtesy: F. A. Makinde et al., "Prediction of crude oil viscosity using feed-forward
back-propagation neural network (FFBPNN)", Petroleum & Coal, 2012


Artificial Neuron
[Figure: a single artificial neuron – inputs $x_1, \dots, x_d$ (and a constant 1) are weighted by $w_1, \dots, w_d$ (and $w_0 = b$), summed into the pre-activation $a(\boldsymbol{x})$, and passed through $g(\cdot)$ to give the output $y$.]

$\boldsymbol{w} = [w_1 \; w_2 \; \dots \; w_d]^T$ and $\boldsymbol{x} = [x_1 \; x_2 \; \dots \; x_d]^T$

$a(\boldsymbol{x}) = b + \sum_{i=1}^{d} w_i x_i = [\boldsymbol{w}^T \;\; b] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

$y = g(a(\boldsymbol{x}))$

Terminologies:-
$\boldsymbol{x}$: input, $\boldsymbol{w}$: weights, $b$: bias
$a(\boldsymbol{x})$: pre-activation (input activation)
$g(\cdot)$: activation function
$y$: activation (output activation)
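As a quick illustration of these terminologies, here is a minimal NumPy sketch of a single artificial neuron; the choice of tanh as $g$ is arbitrary, not fixed by the slide:

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """Single artificial neuron: pre-activation a(x) = b + w.x, output y = g(a(x))."""
    a = b + np.dot(w, x)   # pre-activation (input activation)
    return g(a)            # activation (output activation)

x = np.array([0.5, -1.0, 2.0])   # input
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))           # y = g(a(x))
```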


Perceptron

Image courtesy: https://blogs.umass.edu/comphon/2017/06/15/did-frank-rosenblatt-invent-deep-learning-in-1962/


Perceptron
Input $\boldsymbol{x} \in \mathbb{R}^d$ and a binary label $y$ – Binary Classification

$g(a) = \begin{cases} 1, & a \ge 0 \\ 0, & a < 0 \end{cases}$   (Rosenblatt, 1957)

To make things simpler, the response is taken as

$g(a) = \begin{cases} 1, & a \ge 0 \\ -1, & a < 0 \end{cases}$

The perceptron classification rule, thus, translates to

$y = \begin{cases} 1, & \boldsymbol{w}^T\boldsymbol{x} + b \ge 0 \\ -1, & \boldsymbol{w}^T\boldsymbol{x} + b < 0 \end{cases}$

Remember: $\boldsymbol{w}^T\boldsymbol{x} + b = 0$ represents a hyperplane.


Perceptron Learning Algorithm

Convergence Theorem:-
For a finite and linearly separable set of data, the perceptron learning algorithm
will find a linear separator in at most $1/\gamma^2$ iterations, where the maximum length of any
data point is $1$ and $\gamma$ is the maximum margin of the linear separators.
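The algorithm referenced here is the classic mistake-driven update (add a misclassified positive example to $\boldsymbol{w}$, subtract a misclassified negative one), which is also what the proof below analyzes. A minimal NumPy sketch under those assumptions, with labels in $\{+1, -1\}$ and the bias folded into the weight vector:

```python
import numpy as np

def perceptron_train(X, y, max_iters=1000):
    """Mistake-driven perceptron learning.
    X: (n, d) data matrix, y: (n,) labels in {+1, -1}."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # fold bias b into w as an extra weight
    w = np.zeros(d + 1)                    # start from the zero vector
    for _ in range(max_iters):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:         # misclassified (or on the boundary)
                w += yi * xi               # w <- w + x for positives, w - x for negatives
                mistakes += 1
        if mistakes == 0:                  # converged: every point classified correctly
            break
    return w[:-1], w[-1]                   # weights and bias

# Example: a linearly separable toy set (AND-like data with labels in {+1, -1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, +1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))                  # matches y once the algorithm has converged
```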


Perceptron Learning Algorithm – Convergence Proof


Some Important Brush-ups:-

[Figure: left – a hyperplane $\boldsymbol{w}^T\boldsymbol{x} = a$ with normal vector $\boldsymbol{w}$ at distance $a$ from the origin $(0,0)$; right – the same hyperplane separating points $\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}$ into the two half-spaces.]

For a hyperplane $\boldsymbol{w}^T\boldsymbol{x} = a$, $\boldsymbol{w}$ is the normal vector (often assumed to be a unit normal vector) to the hyperplane and $a$ is the distance from the origin.

Thus a hyperplane with normal vector $\boldsymbol{w}$ can separate two sets of points by examining whether $\boldsymbol{w}^T\boldsymbol{x} - a$ is $> 0$ or $< 0$. (Remember: in our notation, $-a$ is nothing but $b$.)
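A tiny numerical check of this side test; the specific unit normal, offset $a$, and points below are only illustrative:

```python
import numpy as np

w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit normal vector of the hyperplane w^T x = a
a = 1.0                                   # distance of the hyperplane from the origin
points = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, -1.0]])
print(points @ w - a)                     # sign of w^T x - a tells which side each point lies on
```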


Perceptron Learning Algorithm – Convergence Proof


• Thus a simple classification rule can be $\boldsymbol{w}^T\boldsymbol{x} + b \ge 0$ or $\boldsymbol{w}^T\boldsymbol{x} + b < 0$.

• For simplicity, let us assume that the hyperplane passes through the origin – this does not
affect the problem in our hand as it just shifts the origin. But that means $b = 0$. Thus the
classification rule becomes $\boldsymbol{w}^T\boldsymbol{x} \ge 0$ or $\boldsymbol{w}^T\boldsymbol{x} < 0$.

• There may not exist such a hyperplane that can separate the two sets of data. If it exists, then
the data is known to be linearly separable.

• A more formal definition is – two sets $P$ and $N$ of $d$-dimensional points are said to be linearly
separable if there exist real numbers $w_1, w_2, \dots, w_d, \theta$ such that every point $\boldsymbol{x} \in P$ satisfies
$\sum_{i=1}^{d} w_i x_i \ge \theta$ and every point $\boldsymbol{x} \in N$ satisfies $\sum_{i=1}^{d} w_i x_i < \theta$.


Perceptron Learning Algorithm – Convergence Proof


Without loss of generality we will also assume the following while proving the convergence
theorem.
• We normalize all the points so that $\|\boldsymbol{x}\| \le 1$, i.e., the maximum length of any data point is 1. Notice that
this does not affect the algorithm or the solution, as $\operatorname{sign}\!\left(\boldsymbol{w}^T \tfrac{\boldsymbol{x}}{\|\boldsymbol{x}\|}\right) = \operatorname{sign}\!\left(\boldsymbol{w}^T \boldsymbol{x}\right)$.

• Margin of separation $\gamma$: this is the minimum distance between any data point and the
separating hyperplane. The hyperplane that maximizes this margin is defined to be the maximum margin separator.


Perceptron Learning Algorithm – Convergence Proof


• Let $\boldsymbol{w}^*$ be the (unit) normal to the maximum margin linear separator. We want our $\boldsymbol{w}$ in the perceptron
algorithm to get closer and closer to this (sort of) ideal $\boldsymbol{w}^*$ at each step.
• This means that the inner product between $\boldsymbol{w}$ and $\boldsymbol{w}^*$ should increase with each update.
• Remember the update rule: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \boldsymbol{x}$ for a misclassified positive example, $\boldsymbol{w} \leftarrow \boldsymbol{w} - \boldsymbol{x}$ for a misclassified negative example.
• So, for a misclassified positive training example,

$(\boldsymbol{w} + \boldsymbol{x})^T \boldsymbol{w}^* = \underbrace{\boldsymbol{w}^T\boldsymbol{w}^*}_{\text{previous value}} + \underbrace{\boldsymbol{x}^T\boldsymbol{w}^*}_{\text{this is at least } \gamma}$

• Similarly, for a negative training example,

$(\boldsymbol{w} - \boldsymbol{x})^T \boldsymbol{w}^* = \boldsymbol{w}^T\boldsymbol{w}^* - \boldsymbol{x}^T\boldsymbol{w}^*$, where $\boldsymbol{x}^T\boldsymbol{w}^*$ is negative as the example is negative, and thus $-\boldsymbol{x}^T\boldsymbol{w}^* \ge \gamma$.

• So, in both cases $\boldsymbol{w}^T\boldsymbol{w}^*$ is increasing by at least $\gamma$, which means the length of $\boldsymbol{w}$ increases with each
update → Observation 1

Perceptron Learning Algorithm – Convergence Proof


• The length of $\boldsymbol{w}$ is given by $\|\boldsymbol{w}\|^2 = \boldsymbol{w}^T\boldsymbol{w}$. Let us see how this changes with each update.

• Let us consider the positive-example case ($\boldsymbol{w} \leftarrow \boldsymbol{w} + \boldsymbol{x}$) first.

$\|\boldsymbol{w} + \boldsymbol{x}\|^2 = \|\boldsymbol{w}\|^2 + \underbrace{2\,\boldsymbol{w}^T\boldsymbol{x}}_{\text{negative}} + \underbrace{\|\boldsymbol{x}\|^2}_{\le\, 1}$

The middle term is negative, as an update occurs only on a misclassification and, for a positive example,
misclassification means $\boldsymbol{w}^T\boldsymbol{x} < 0$. The last term is $\le 1$, as we assumed every point is rescaled to the unit ball.

• So, $\|\boldsymbol{w} + \boldsymbol{x}\|^2 \le \|\boldsymbol{w}\|^2 + 1$

Perceptron Learning Algorithm – Convergence Proof


• Let us now consider the negative-example case ($\boldsymbol{w} \leftarrow \boldsymbol{w} - \boldsymbol{x}$).

• Using similar arguments, a misclassified negative example means $\boldsymbol{w}^T\boldsymbol{x} \ge 0$, so $-2\,\boldsymbol{w}^T\boldsymbol{x} \le 0$, and also $\|\boldsymbol{x}\|^2 \le 1$.

• So, here also $\|\boldsymbol{w} - \boldsymbol{x}\|^2 \le \|\boldsymbol{w}\|^2 + 1$

• So, the length of the updated vector is increasing, but it is increasing in a controlled way. →
Observation 2


Perceptron Learning Algorithm – Convergence Proof


• Say after $T$ iterations, $\boldsymbol{w}$ is denoted as $\boldsymbol{w}^{(T)}$.

• So observation 1 implies $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \ge T\gamma$ and observation 2 implies $\|\boldsymbol{w}^{(T)}\|^2 \le T$ [remember the initial value of $\boldsymbol{w}$ is $\boldsymbol{0}$].

• Now we have to apply the Cauchy-Schwarz inequality.

• $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \le \|\boldsymbol{w}^{(T)}\|\,\|\boldsymbol{w}^*\|$, but $\|\boldsymbol{w}^*\| = 1$, so $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \le \|\boldsymbol{w}^{(T)}\|$
• $T\gamma \le \boldsymbol{w}^{(T)T}\boldsymbol{w}^*$ [observation 1]
• This implies $T\gamma \le \|\boldsymbol{w}^{(T)}\| \le \sqrt{T}$ [observation 2]
• So, $T \le \dfrac{1}{\gamma^2}$


Popular Choices of Activation Functions


Linear activation function: $g(a) = a$
• Unbounded

Sigmoid activation function: $g(a) = \sigma(a) = \dfrac{1}{1 + e^{-a}}$
• Bounded in $(0, 1)$
• Always positive

tanh activation function: $g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
• Bounded in $(-1, 1)$
• Can be positive or negative

ReLU activation function: $g(a) = \max(0, a)$
• Bounded below by 0
• But not upper-bounded
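A minimal NumPy sketch of these four activation functions, using the standard definitions assumed above:

```python
import numpy as np

def linear(a):
    return a                         # unbounded

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))  # bounded in (0, 1), always positive

def tanh(a):
    return np.tanh(a)                # bounded in (-1, 1)

def relu(a):
    return np.maximum(0.0, a)        # bounded below by 0, not upper-bounded

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, g(a))
```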


Multilayer Neural Network


[Figure: the same single neuron (inputs $x_1, \dots, x_d$, weights $w_1, \dots, w_d$, bias $w_0 = b$, activation $g(a)$, output $y$) drawn twice, before and after a slight change of notation.]

A bit of notation change

Multilayer Neural Network

[Figure: Input Layer ($x_1, x_2, \dots, x_d$ and a constant 1), a Hidden Layer of $g(a)$ units, and an Output Layer producing $y$.]

Intuition:-
o Hidden Layer: Extracts a better representation of the input data
o Output Layer: Does the classification
Multilayer Neural Network

[Figure: Input Layer ($x_1, \dots, x_d$ and a constant 1) feeding the first Hidden Layer unit through weights $w^{(1)}_{1,1}, \dots, w^{(1)}_{1,d}$ and bias $w^{(1)}_{1,0} = b^{(1)}_1$.]

Pre-activation of the first hidden unit:

$a^{(1)}_1(\boldsymbol{x}) = b^{(1)}_1 + \sum_{i=1}^{d} w^{(1)}_{1,i}\, x_i = \left[\boldsymbol{w}^{(1)}_1 \;\; b^{(1)}_1\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

where $\boldsymbol{w}^{(1)}_1 = \left[w^{(1)}_{1,1}, w^{(1)}_{1,2}, \dots, w^{(1)}_{1,d}\right]$ is a row vector and $\boldsymbol{x} = [x_1 \; x_2 \; \dots \; x_d]^T$ is a column vector.

Activation of the first hidden unit: $h^{(1)}_1 = g\!\left(a^{(1)}_1(\boldsymbol{x})\right)$

Multilayer Neural Network

The first hidden unit computes, as before,

$a^{(1)}_1(\boldsymbol{x}) = b^{(1)}_1 + \sum_{i=1}^{d} w^{(1)}_{1,i}\, x_i = \left[\boldsymbol{w}^{(1)}_1 \;\; b^{(1)}_1\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_1 = g\!\left(a^{(1)}_1(\boldsymbol{x})\right)$

and similarly the second hidden unit, with weights $w^{(1)}_{2,1}, \dots, w^{(1)}_{2,d}$ and bias $w^{(1)}_{2,0} = b^{(1)}_2$, computes

$a^{(1)}_2(\boldsymbol{x}) = b^{(1)}_2 + \sum_{i=1}^{d} w^{(1)}_{2,i}\, x_i = \left[\boldsymbol{w}^{(1)}_2 \;\; b^{(1)}_2\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_2 = g\!\left(a^{(1)}_2(\boldsymbol{x})\right)$

Multilayer Neural Network

Every hidden unit follows the same pattern; the last ($M$-th) hidden unit, with weights $w^{(1)}_{M,1}, \dots, w^{(1)}_{M,d}$ and bias $w^{(1)}_{M,0} = b^{(1)}_M$, computes

$a^{(1)}_M(\boldsymbol{x}) = b^{(1)}_M + \sum_{i=1}^{d} w^{(1)}_{M,i}\, x_i = \left[\boldsymbol{w}^{(1)}_M \;\; b^{(1)}_M\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_M = g\!\left(a^{(1)}_M(\boldsymbol{x})\right)$

Multilayer Neural Network

Stacking all $M$ hidden-unit pre-activations gives the matrix form:

$\begin{bmatrix} a^{(1)}_1(\boldsymbol{x}) \\ a^{(1)}_2(\boldsymbol{x}) \\ \vdots \\ a^{(1)}_M(\boldsymbol{x}) \end{bmatrix} = \begin{bmatrix} \boldsymbol{w}^{(1)}_1 & b^{(1)}_1 \\ \boldsymbol{w}^{(1)}_2 & b^{(1)}_2 \\ \vdots & \vdots \\ \boldsymbol{w}^{(1)}_M & b^{(1)}_M \end{bmatrix} \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix} \qquad\Longrightarrow\qquad \boldsymbol{a}^{(1)}(\boldsymbol{x}) = \left[\boldsymbol{W}^{(1)} \;\; \boldsymbol{b}^{(1)}\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

where $\boldsymbol{W}^{(1)}$ is a matrix and $\boldsymbol{a}^{(1)}(\boldsymbol{x})$, $\boldsymbol{b}^{(1)}$, $\boldsymbol{x}$ are column vectors.

Hidden Layer Activation: $\boldsymbol{h}^{(1)} = \boldsymbol{g}\!\left(\boldsymbol{a}^{(1)}\right) = \begin{bmatrix} h^{(1)}_1 \\ h^{(1)}_2 \\ \vdots \\ h^{(1)}_M \end{bmatrix} = \begin{bmatrix} g(a^{(1)}_1) \\ g(a^{(1)}_2) \\ \vdots \\ g(a^{(1)}_M) \end{bmatrix}$
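A minimal NumPy sketch of this matrix form for one hidden layer; the tanh choice for $g$ and the random weights are only illustrative:

```python
import numpy as np

def hidden_layer(x, W1, b1, g=np.tanh):
    """One hidden layer in matrix form: a = W x + b, h = g(a).
    x: (d,) input, W1: (M, d) weight matrix, b1: (M,) bias vector."""
    a1 = W1 @ x + b1      # pre-activation vector a^(1)(x)
    h1 = g(a1)            # elementwise activation h^(1) = g(a^(1))
    return h1

rng = np.random.default_rng(0)
d, M = 4, 3
x = rng.normal(size=d)
W1, b1 = rng.normal(size=(M, d)), rng.normal(size=M)
print(hidden_layer(x, W1, b1))   # h^(1), a length-M vector
```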

Multilayer Neural Network

Output Layer Pre-activation:

$a^{(2)}_1\!\left(\boldsymbol{h}^{(1)}\right) = \left[\boldsymbol{w}^{(2)}_1 \;\; b^{(2)}_1\right] \begin{bmatrix} \boldsymbol{h}^{(1)} \\ 1 \end{bmatrix}$

Output Layer Activation:

$y = o\!\left(a^{(2)}_1\!\left(\boldsymbol{h}^{(1)}\right)\right)$

$o$ can have many options:
• sigmoid
• tanh
• ReLU, etc.
One special choice is the softmax function.
Multilayer Neural Network

Output Layer with softmax – with $C$ output units having pre-activations $a^{(2)}_1, \dots, a^{(2)}_C$:

$y_1 = o_1\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_1\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}, \quad y_2 = o_2\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_2\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}, \quad \dots, \quad y_C = o_C\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_C\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}$

Softmax function: exponentiate each component of the output pre-activation vector and elementwise divide by the sum of the exponentials.
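A minimal NumPy sketch of the softmax; subtracting the maximum pre-activation is a standard numerical-stability trick (not on the slide) and does not change the result, since softmax is invariant to adding a constant:

```python
import numpy as np

def softmax(a):
    """Softmax over the output pre-activation vector a^(2)."""
    z = np.exp(a - np.max(a))   # shift by max(a) for numerical stability
    return z / np.sum(z)

a2 = np.array([2.0, 1.0, 0.1])
y = softmax(a2)
print(y, y.sum())   # components sum to 1, so they can be read as class probabilities
```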


Training a Neural Network – Loss Function


[Figure: the network's softmax outputs $y_1, y_2, \dots, y_C$ – the probability that $\boldsymbol{x}$ belongs to class $1, 2, \dots, C$ respectively.]

So we can aim to maximize the probability corresponding to the correct class $c$ for any example:

$\max \; \boldsymbol{y}_c$

This can be equivalently expressed as minimizing $-\log \boldsymbol{y}_c$, known as the cross-entropy loss.
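A minimal sketch of the cross-entropy loss for a single example, assuming $\boldsymbol{y}$ is the softmax output and $c$ is the index of the correct class:

```python
import numpy as np

def cross_entropy(y, c):
    """Cross-entropy loss for one example: -log of the predicted probability
    of the true class c, where y is the softmax output vector."""
    return -np.log(y[c])

y = np.array([0.7, 0.2, 0.1])   # softmax output for a 3-class problem
print(cross_entropy(y, c=0))    # small loss: the correct class already has high probability
print(cross_entropy(y, c=2))    # large loss: the correct class has low probability
```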

Forward Pass in a Nutshell


$\boldsymbol{x} \;\xrightarrow{\;\boldsymbol{W}^{(1)},\,\boldsymbol{b}^{(1)}\;}\; \boldsymbol{a}^{(1)} \;\xrightarrow{\;g(\cdot)\;}\; \boldsymbol{h}^{(1)} \;\xrightarrow{\;\boldsymbol{W}^{(2)},\,\boldsymbol{b}^{(2)}\;}\; \boldsymbol{a}^{(2)} \;\xrightarrow{\;g(\cdot)\;}\; \boldsymbol{h}^{(2)} \;\cdots\; \xrightarrow{\;\boldsymbol{W}^{(L)},\,\boldsymbol{b}^{(L)}\;}\; \boldsymbol{a}^{(L)} \;\xrightarrow{\;o(\cdot)\;}\; \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta})$

$\boldsymbol{\theta}$ is the collection of all learnable parameters, i.e., all $\boldsymbol{W}^{(l)}$ and $\boldsymbol{b}^{(l)}$.

Hidden layer pre-activation:
$\boldsymbol{a}^{(l)}(\boldsymbol{x}) = \boldsymbol{b}^{(l)} + \boldsymbol{W}^{(l)} \boldsymbol{h}^{(l-1)}(\boldsymbol{x})$, with $\boldsymbol{h}^{(0)}(\boldsymbol{x}) = \boldsymbol{x}$, for $l = 1, \dots, L$

Hidden layer activation:
$\boldsymbol{h}^{(l)}(\boldsymbol{x}) = \boldsymbol{g}\!\left(\boldsymbol{a}^{(l)}(\boldsymbol{x})\right)$, for $l = 1, \dots, L-1$

Output layer activation:
$\boldsymbol{h}^{(L)}(\boldsymbol{x}) = \boldsymbol{o}\!\left(\boldsymbol{a}^{(L)}(\boldsymbol{x})\right) = \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta})$, for $l = L$
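A minimal NumPy sketch of this forward pass for a network with tanh hidden layers and a softmax output; the layer sizes and random parameters below are only illustrative:

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

def forward(x, Ws, bs, g=np.tanh, o=softmax):
    """Forward pass through an L-layer network.
    Ws, bs: lists of weight matrices W^(1..L) and bias vectors b^(1..L)."""
    h = x                                   # h^(0) = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers l = 1 .. L-1
        a = W @ h + b                       # pre-activation a^(l)
        h = g(a)                            # activation h^(l)
    aL = Ws[-1] @ h + bs[-1]                # output pre-activation a^(L)
    return o(aL)                            # y = f(x, theta)

rng = np.random.default_rng(0)
d, M, C = 4, 5, 3                           # input dim, hidden width, number of classes
Ws = [rng.normal(size=(M, d)), rng.normal(size=(C, M))]
bs = [np.zeros(M), np.zeros(C)]
print(forward(rng.normal(size=d), Ws, bs))  # a length-C probability vector
```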


Backpropagation
$\boldsymbol{h}^{(l-1)} \xrightarrow{\;\boldsymbol{W}^{(l)},\,\boldsymbol{b}^{(l)}\;} \boldsymbol{a}^{(l)} \xrightarrow{\;g(\cdot)\;} \boldsymbol{h}^{(l)} \xrightarrow{\;\boldsymbol{W}^{(l+1)},\,\boldsymbol{b}^{(l+1)}\;} \boldsymbol{a}^{(l+1)} \xrightarrow{\;g(\cdot)\;} \boldsymbol{h}^{(l+1)} \;\cdots\; \xrightarrow{\;\boldsymbol{W}^{(L)},\,\boldsymbol{b}^{(L)}\;} \boldsymbol{a}^{(L)} \xrightarrow{\;o(\cdot)\;} \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta}) \;\rightarrow\; J\!\left(\boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta}), c\right)$

where $c$ is the true class of the example.

For gradient descent we need to compute $\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}}$ and $\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i}$.

Backpropagation consists of intelligent use of the chain rule of differentiation for computation of these gradients.


Backpropagation
$\boldsymbol{W}^{(l)}_{i,j}$ influences $J$ through $\boldsymbol{a}^{(l)}_i$. That means $J$ is a function of $\boldsymbol{a}^{(l)}_i$ and $\boldsymbol{a}^{(l)}_i$, in turn, is a function of $\boldsymbol{W}^{(l)}_{i,j}$.

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} \dfrac{\partial \boldsymbol{a}^{(l)}_i}{\partial \boldsymbol{W}^{(l)}_{i,j}}$

$\boldsymbol{a}^{(l)}_i = \sum_j \boldsymbol{W}^{(l)}_{i,j}\, \boldsymbol{h}^{(l-1)}_j + \boldsymbol{b}^{(l)}_i = \boldsymbol{W}^{(l)}_{i,1}\, \boldsymbol{h}^{(l-1)}_1 + \boldsymbol{W}^{(l)}_{i,2}\, \boldsymbol{h}^{(l-1)}_2 + \dots + \boldsymbol{W}^{(l)}_{i,j}\, \boldsymbol{h}^{(l-1)}_j + \dots + \boldsymbol{b}^{(l)}_i$

$\Rightarrow \dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} \dfrac{\partial \boldsymbol{a}^{(l)}_i}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)

We know $\boldsymbol{h}^{(l-1)}_j$ from the forward pass, so we need $\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$.

And similarly $\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)


Backpropagation
$\boldsymbol{a}^{(l)}_i$ influences $J$ through $\boldsymbol{h}^{(l)}_i$. That means $J$ is a function of $\boldsymbol{h}^{(l)}_i$ and $\boldsymbol{h}^{(l)}_i$, in turn, is a function of $\boldsymbol{a}^{(l)}_i$.

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} \dfrac{\partial \boldsymbol{h}^{(l)}_i}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

We know $g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ from the forward pass, so we need $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}$.


Backpropagation
$\boldsymbol{h}^{(l)}_i$ influences $J$ through all of $\boldsymbol{a}^{(l+1)}_1, \dots, \boldsymbol{a}^{(l+1)}_{M_{l+1}}$, where $M_{l+1}$ is the number of neurons in hidden layer $l+1$. That means $J$ is a function of each $\boldsymbol{a}^{(l+1)}_m$ and each $\boldsymbol{a}^{(l+1)}_m$, in turn, is a function of $\boldsymbol{h}^{(l)}_i$.

$\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} = \sum_{m=1}^{M_{l+1}} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l+1)}_m} \dfrac{\partial \boldsymbol{a}^{(l+1)}_m}{\partial \boldsymbol{h}^{(l)}_i}$

Remember: $\boldsymbol{a}^{(l+1)}_m = \sum_j \boldsymbol{W}^{(l+1)}_{m,j}\, \boldsymbol{h}^{(l)}_j + \boldsymbol{b}^{(l+1)}_m = \boldsymbol{W}^{(l+1)}_{m,1}\, \boldsymbol{h}^{(l)}_1 + \boldsymbol{W}^{(l+1)}_{m,2}\, \boldsymbol{h}^{(l)}_2 + \dots + \boldsymbol{W}^{(l+1)}_{m,i}\, \boldsymbol{h}^{(l)}_i + \dots + \boldsymbol{b}^{(l+1)}_m$

$\Rightarrow \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} = \sum_{m=1}^{M_{l+1}} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l+1)}_m}\, \boldsymbol{W}^{(l+1)}_{m,i}$ … (4)


Backpropagation
[Figure: local computational graph of one unit – $\boldsymbol{h}^{(l-1)}_j$ is multiplied by $\boldsymbol{W}^{(l)}_{i,j}$, $\boldsymbol{b}^{(l)}_i$ is added to give $\boldsymbol{a}^{(l)}_i$, and $g(\cdot)$ produces $\boldsymbol{h}^{(l)}_i$. The backward path multiplies $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}$ by $g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ to get $\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing into the bias $\boldsymbol{b}^{(l)}_i$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing into the weight $\boldsymbol{W}^{(l)}_{i,j}$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing back to $\boldsymbol{h}^{(l-1)}_j$.]

For all $i = 1 : M_l$:

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)

and $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l-1)}_i} = \sum_{m=1}^{M_l} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_m}\, \boldsymbol{W}^{(l)}_{m,i}$ … (4)
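A minimal NumPy sketch of one backward step through layer $l$ using equations (1)–(4); the shapes and the tanh derivative below are only illustrative:

```python
import numpy as np

def layer_backward(dJ_da_next, W_next, a_l, h_prev, g_prime):
    """One backward step through layer l, following equations (1)-(4).
    dJ_da_next: dJ/da^(l+1) vector, W_next: W^(l+1) matrix,
    a_l: a^(l) vector, h_prev: h^(l-1) vector, g_prime: derivative of g."""
    dJ_dh = W_next.T @ dJ_da_next          # (4): sum_m dJ/da^(l+1)_m * W^(l+1)_{m,i}
    dJ_da = dJ_dh * g_prime(a_l)           # (3): elementwise multiply by g'(a^(l)_i)
    dJ_db = dJ_da                          # (2): bias gradient equals dJ/da^(l)_i
    dJ_dW = np.outer(dJ_da, h_prev)        # (1): dJ/dW^(l)_{i,j} = dJ/da^(l)_i * h^(l-1)_j
    return dJ_dW, dJ_db, dJ_da

# Toy shapes: layer l has 3 units, layer l+1 has 2, layer l-1 has 4
rng = np.random.default_rng(0)
dJ_da_next = rng.normal(size=2)
W_next = rng.normal(size=(2, 3))
a_l, h_prev = rng.normal(size=3), rng.normal(size=4)
tanh_prime = lambda a: 1.0 - np.tanh(a) ** 2
dW, db, da = layer_backward(dJ_da_next, W_next, a_l, h_prev, tanh_prime)
print(dW.shape, db.shape, da.shape)        # (3, 4) (3,) (3,)
```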

Training a Multilayer Neural Network for Non-linearly Separable Data

[Figure: Training Data]

Learned Decision Boundary with Single Hidden Layer

[Figure: learned decision boundaries for # of hidden neurons = 2, 4, 8, 16, 32, 64, 128, 256]

Learned Decision Boundary with Two Hidden Layers

[Figure: learned decision boundaries for # of neurons in each hidden layer = 2, 4, 8, 16, 32]
