
Deep Learning
CS60010

Abir Das
Assistant Professor
Computer Science and Engineering Department
Indian Institute of Technology Kharagpur

http://cse.iitkgp.ac.in/~adas/

Biological Neural Network

Image courtesy: F. A. Makinde et al., "Prediction of crude oil viscosity using feed-forward
back-propagation neural network (FFBPNN)", Petroleum & Coal, 2012


Artificial Neuron
[Figure: a single artificial neuron – inputs $x_1, \dots, x_d$ (and a constant 1) are weighted by $w_1, \dots, w_d$ (and $w_0 = b$), summed into the pre-activation $a(\boldsymbol{x})$, and passed through $g(\cdot)$ to give the output $y$.]

$\boldsymbol{w} = [w_1 \; w_2 \; \dots \; w_d]^T$ and $\boldsymbol{x} = [x_1 \; x_2 \; \dots \; x_d]^T$

$a(\boldsymbol{x}) = b + \sum_{i=1}^{d} w_i x_i = [\boldsymbol{w}^T \;\; b] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

$y = g(a(\boldsymbol{x}))$

Terminologies:-
$\boldsymbol{x}$: input, $\boldsymbol{w}$: weights, $b$: bias
$a(\boldsymbol{x})$: pre-activation (input activation)
$g(\cdot)$: activation function
$y$: activation (output activation)
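As a quick illustration of these terminologies, here is a minimal NumPy sketch of a single artificial neuron; the choice of tanh as $g$ is arbitrary, not fixed by the slide:

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """Single artificial neuron: pre-activation a(x) = b + w.x, output y = g(a(x))."""
    a = b + np.dot(w, x)   # pre-activation (input activation)
    return g(a)            # activation (output activation)

x = np.array([0.5, -1.0, 2.0])   # input
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))           # y = g(a(x))
```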


Perceptron

Image courtesy: https://blogs.umass.edu/comphon/2017/06/15/did-frank-rosenblatt-invent-deep-learning-in-1962/


Perceptron
Input $\boldsymbol{x} \in \mathbb{R}^d$ and a binary label $y$ – Binary Classification

$g(a) = \begin{cases} 1, & a \ge 0 \\ 0, & a < 0 \end{cases}$   (Rosenblatt, 1957)

To make things simpler, the response is taken as

$g(a) = \begin{cases} 1, & a \ge 0 \\ -1, & a < 0 \end{cases}$

The perceptron classification rule, thus, translates to

$y = \begin{cases} 1, & \boldsymbol{w}^T\boldsymbol{x} + b \ge 0 \\ -1, & \boldsymbol{w}^T\boldsymbol{x} + b < 0 \end{cases}$

Remember: $\boldsymbol{w}^T\boldsymbol{x} + b = 0$ represents a hyperplane.


Perceptron Learning Algorithm

Convergence Theorem:-
For a finite and linearly separable set of data, the perceptron learning algorithm
will find a linear separator in at most $1/\gamma^2$ iterations, where the maximum length of any
data point is $1$ and $\gamma$ is the maximum margin of the linear separators.
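The algorithm referenced here is the classic mistake-driven update (add a misclassified positive example to $\boldsymbol{w}$, subtract a misclassified negative one), which is also what the proof below analyzes. A minimal NumPy sketch under those assumptions, with labels in $\{+1, -1\}$ and the bias folded into the weight vector:

```python
import numpy as np

def perceptron_train(X, y, max_iters=1000):
    """Mistake-driven perceptron learning.
    X: (n, d) data matrix, y: (n,) labels in {+1, -1}."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # fold bias b into w as an extra weight
    w = np.zeros(d + 1)                    # start from the zero vector
    for _ in range(max_iters):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:         # misclassified (or on the boundary)
                w += yi * xi               # w <- w + x for positives, w - x for negatives
                mistakes += 1
        if mistakes == 0:                  # converged: every point classified correctly
            break
    return w[:-1], w[-1]                   # weights and bias

# Example: a linearly separable toy set (AND-like data with labels in {+1, -1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, +1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))                  # matches y once the algorithm has converged
```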


Perceptron Learning Algorithm – Convergence Proof


Some Important Brush-ups:-

[Figure: left – a hyperplane $\boldsymbol{w}^T\boldsymbol{x} = a$ with normal vector $\boldsymbol{w}$ at distance $a$ from the origin $(0,0)$; right – the same hyperplane separating points $\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \boldsymbol{x}^{(3)}$ into the two half-spaces.]

For a hyperplane $\boldsymbol{w}^T\boldsymbol{x} = a$, $\boldsymbol{w}$ is the normal vector (often assumed to be a unit normal vector) to the hyperplane and $a$ is the distance from the origin.

Thus a hyperplane with normal vector $\boldsymbol{w}$ can separate two sets of points by examining whether $\boldsymbol{w}^T\boldsymbol{x} - a$ is $> 0$ or $< 0$. (Remember: in our notation, $-a$ is nothing but $b$.)
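A tiny numerical check of this side test; the specific unit normal, offset $a$, and points below are only illustrative:

```python
import numpy as np

w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # unit normal vector of the hyperplane w^T x = a
a = 1.0                                   # distance of the hyperplane from the origin
points = np.array([[2.0, 2.0], [0.0, 0.0], [1.0, -1.0]])
print(points @ w - a)                     # sign of w^T x - a tells which side each point lies on
```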


Perceptron Learning Algorithm – Convergence Proof


• Thus a simple classification rule can be $\boldsymbol{w}^T\boldsymbol{x} + b \ge 0$ or $\boldsymbol{w}^T\boldsymbol{x} + b < 0$.

• For simplicity, let us assume that the hyperplane passes through the origin – this does not
affect the problem in our hand as it just shifts the origin. But that means $b = 0$. Thus the
classification rule becomes $\boldsymbol{w}^T\boldsymbol{x} \ge 0$ or $\boldsymbol{w}^T\boldsymbol{x} < 0$.

• There may not exist such a hyperplane that can separate the two sets of data. If it exists, then
the data is known to be linearly separable.

• A more formal definition is – two sets $P$ and $N$ of $d$-dimensional points are said to be linearly
separable if there exist real numbers $w_1, w_2, \dots, w_d, \theta$ such that every point $\boldsymbol{x} \in P$ satisfies
$\sum_{i=1}^{d} w_i x_i \ge \theta$ and every point $\boldsymbol{x} \in N$ satisfies $\sum_{i=1}^{d} w_i x_i < \theta$.


Perceptron Learning Algorithm – Convergence Proof


Without loss of generality we will also assume the following while proving the convergence
theorem.
• We normalize all the points so that $\|\boldsymbol{x}\| \le 1$, i.e., the maximum length of any data point is 1. Notice that
this does not affect the algorithm or the solution, as $\operatorname{sign}\!\left(\boldsymbol{w}^T \tfrac{\boldsymbol{x}}{\|\boldsymbol{x}\|}\right) = \operatorname{sign}\!\left(\boldsymbol{w}^T \boldsymbol{x}\right)$.

• Margin of separation $\gamma$: this is the minimum distance between any data point and the
separating hyperplane. The hyperplane that maximizes this margin is defined to be the maximum margin separator.


Perceptron Learning Algorithm – Convergence Proof


• Let $\boldsymbol{w}^*$ be the (unit) normal to the maximum margin linear separator. We want our $\boldsymbol{w}$ in the perceptron
algorithm to get closer and closer to this (sort of) ideal $\boldsymbol{w}^*$ at each step.
• This means that the inner product between $\boldsymbol{w}$ and $\boldsymbol{w}^*$ should increase with each update.
• Remember the update rule: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \boldsymbol{x}$ for a misclassified positive example, $\boldsymbol{w} \leftarrow \boldsymbol{w} - \boldsymbol{x}$ for a misclassified negative example.
• So, for a misclassified positive training example,

$(\boldsymbol{w} + \boldsymbol{x})^T \boldsymbol{w}^* = \underbrace{\boldsymbol{w}^T\boldsymbol{w}^*}_{\text{previous value}} + \underbrace{\boldsymbol{x}^T\boldsymbol{w}^*}_{\text{this is at least } \gamma}$

• Similarly, for a negative training example,

$(\boldsymbol{w} - \boldsymbol{x})^T \boldsymbol{w}^* = \boldsymbol{w}^T\boldsymbol{w}^* - \boldsymbol{x}^T\boldsymbol{w}^*$, where $\boldsymbol{x}^T\boldsymbol{w}^*$ is negative as the example is negative, and thus $-\boldsymbol{x}^T\boldsymbol{w}^* \ge \gamma$.

• So, in both cases $\boldsymbol{w}^T\boldsymbol{w}^*$ is increasing by at least $\gamma$, which means the length of $\boldsymbol{w}$ increases with each
update → Observation 1

Perceptron Learning Algorithm – Convergence Proof


• The length of $\boldsymbol{w}$ is given by $\|\boldsymbol{w}\|^2 = \boldsymbol{w}^T\boldsymbol{w}$. Let us see how this changes with each update.

• Let us consider the positive-example case ($\boldsymbol{w} \leftarrow \boldsymbol{w} + \boldsymbol{x}$) first.

$\|\boldsymbol{w} + \boldsymbol{x}\|^2 = \|\boldsymbol{w}\|^2 + \underbrace{2\,\boldsymbol{w}^T\boldsymbol{x}}_{\text{negative}} + \underbrace{\|\boldsymbol{x}\|^2}_{\le\, 1}$

The middle term is negative, as an update occurs only on a misclassification and, for a positive example,
misclassification means $\boldsymbol{w}^T\boldsymbol{x} < 0$. The last term is $\le 1$, as we assumed every point is rescaled to the unit ball.

• So, $\|\boldsymbol{w} + \boldsymbol{x}\|^2 \le \|\boldsymbol{w}\|^2 + 1$

Perceptron Learning Algorithm – Convergence Proof


• Let us now consider the negative-example case ($\boldsymbol{w} \leftarrow \boldsymbol{w} - \boldsymbol{x}$).

• Using similar arguments, a misclassified negative example means $\boldsymbol{w}^T\boldsymbol{x} \ge 0$, so $-2\,\boldsymbol{w}^T\boldsymbol{x} \le 0$, and also $\|\boldsymbol{x}\|^2 \le 1$.

• So, here also $\|\boldsymbol{w} - \boldsymbol{x}\|^2 \le \|\boldsymbol{w}\|^2 + 1$

• So, the length of the updated vector is increasing, but it is increasing in a controlled way. →
Observation 2


Perceptron Learning Algorithm – Convergence Proof


• Say after $T$ iterations, $\boldsymbol{w}$ is denoted as $\boldsymbol{w}^{(T)}$.

• So observation 1 implies $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \ge T\gamma$ and observation 2 implies $\|\boldsymbol{w}^{(T)}\|^2 \le T$ [remember the initial value of $\boldsymbol{w}$ is $\boldsymbol{0}$].

• Now we have to apply the Cauchy-Schwarz inequality.

• $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \le \|\boldsymbol{w}^{(T)}\|\,\|\boldsymbol{w}^*\|$, but $\|\boldsymbol{w}^*\| = 1$, so $\boldsymbol{w}^{(T)T}\boldsymbol{w}^* \le \|\boldsymbol{w}^{(T)}\|$
• $T\gamma \le \boldsymbol{w}^{(T)T}\boldsymbol{w}^*$ [observation 1]
• This implies $T\gamma \le \|\boldsymbol{w}^{(T)}\| \le \sqrt{T}$ [observation 2]
• So, $T \le \dfrac{1}{\gamma^2}$


Popular Choices of Activation Functions


Linear activation function: $g(a) = a$
• Unbounded

Sigmoid activation function: $g(a) = \sigma(a) = \dfrac{1}{1 + e^{-a}}$
• Bounded in $(0, 1)$
• Always positive

tanh activation function: $g(a) = \tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
• Bounded in $(-1, 1)$
• Can be positive or negative

ReLU activation function: $g(a) = \max(0, a)$
• Bounded below by 0
• But not upper-bounded
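A minimal NumPy sketch of these four activation functions, using the standard definitions assumed above:

```python
import numpy as np

def linear(a):
    return a                         # unbounded

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))  # bounded in (0, 1), always positive

def tanh(a):
    return np.tanh(a)                # bounded in (-1, 1)

def relu(a):
    return np.maximum(0.0, a)        # bounded below by 0, not upper-bounded

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (linear, sigmoid, tanh, relu):
    print(g.__name__, g(a))
```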


Multilayer Neural Network


[Figure: the same single neuron (inputs $x_1, \dots, x_d$, weights $w_1, \dots, w_d$, bias $w_0 = b$, activation $g(a)$, output $y$) drawn twice, before and after a slight change of notation.]

A bit of notation change

Multilayer Neural Network

[Figure: Input Layer ($x_1, x_2, \dots, x_d$ and a constant 1), a Hidden Layer of $g(a)$ units, and an Output Layer producing $y$.]

Intuition:-
o Hidden Layer: Extracts a better representation of the input data
o Output Layer: Does the classification
Multilayer Neural Network

[Figure: Input Layer ($x_1, \dots, x_d$ and a constant 1) feeding the first Hidden Layer unit through weights $w^{(1)}_{1,1}, \dots, w^{(1)}_{1,d}$ and bias $w^{(1)}_{1,0} = b^{(1)}_1$.]

Pre-activation of the first hidden unit:

$a^{(1)}_1(\boldsymbol{x}) = b^{(1)}_1 + \sum_{i=1}^{d} w^{(1)}_{1,i}\, x_i = \left[\boldsymbol{w}^{(1)}_1 \;\; b^{(1)}_1\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

where $\boldsymbol{w}^{(1)}_1 = \left[w^{(1)}_{1,1}, w^{(1)}_{1,2}, \dots, w^{(1)}_{1,d}\right]$ is a row vector and $\boldsymbol{x} = [x_1 \; x_2 \; \dots \; x_d]^T$ is a column vector.

Activation of the first hidden unit: $h^{(1)}_1 = g\!\left(a^{(1)}_1(\boldsymbol{x})\right)$

Multilayer Neural Network

The first hidden unit computes, as before,

$a^{(1)}_1(\boldsymbol{x}) = b^{(1)}_1 + \sum_{i=1}^{d} w^{(1)}_{1,i}\, x_i = \left[\boldsymbol{w}^{(1)}_1 \;\; b^{(1)}_1\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_1 = g\!\left(a^{(1)}_1(\boldsymbol{x})\right)$

and similarly the second hidden unit, with weights $w^{(1)}_{2,1}, \dots, w^{(1)}_{2,d}$ and bias $w^{(1)}_{2,0} = b^{(1)}_2$, computes

$a^{(1)}_2(\boldsymbol{x}) = b^{(1)}_2 + \sum_{i=1}^{d} w^{(1)}_{2,i}\, x_i = \left[\boldsymbol{w}^{(1)}_2 \;\; b^{(1)}_2\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_2 = g\!\left(a^{(1)}_2(\boldsymbol{x})\right)$

Multilayer Neural Network

Every hidden unit follows the same pattern; the last ($M$-th) hidden unit, with weights $w^{(1)}_{M,1}, \dots, w^{(1)}_{M,d}$ and bias $w^{(1)}_{M,0} = b^{(1)}_M$, computes

$a^{(1)}_M(\boldsymbol{x}) = b^{(1)}_M + \sum_{i=1}^{d} w^{(1)}_{M,i}\, x_i = \left[\boldsymbol{w}^{(1)}_M \;\; b^{(1)}_M\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}, \qquad h^{(1)}_M = g\!\left(a^{(1)}_M(\boldsymbol{x})\right)$

Multilayer Neural Network

Stacking all $M$ hidden-unit pre-activations gives the matrix form:

$\begin{bmatrix} a^{(1)}_1(\boldsymbol{x}) \\ a^{(1)}_2(\boldsymbol{x}) \\ \vdots \\ a^{(1)}_M(\boldsymbol{x}) \end{bmatrix} = \begin{bmatrix} \boldsymbol{w}^{(1)}_1 & b^{(1)}_1 \\ \boldsymbol{w}^{(1)}_2 & b^{(1)}_2 \\ \vdots & \vdots \\ \boldsymbol{w}^{(1)}_M & b^{(1)}_M \end{bmatrix} \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix} \qquad\Longrightarrow\qquad \boldsymbol{a}^{(1)}(\boldsymbol{x}) = \left[\boldsymbol{W}^{(1)} \;\; \boldsymbol{b}^{(1)}\right] \begin{bmatrix} \boldsymbol{x} \\ 1 \end{bmatrix}$

where $\boldsymbol{W}^{(1)}$ is a matrix and $\boldsymbol{a}^{(1)}(\boldsymbol{x})$, $\boldsymbol{b}^{(1)}$, $\boldsymbol{x}$ are column vectors.

Hidden Layer Activation: $\boldsymbol{h}^{(1)} = \boldsymbol{g}\!\left(\boldsymbol{a}^{(1)}\right) = \begin{bmatrix} h^{(1)}_1 \\ h^{(1)}_2 \\ \vdots \\ h^{(1)}_M \end{bmatrix} = \begin{bmatrix} g(a^{(1)}_1) \\ g(a^{(1)}_2) \\ \vdots \\ g(a^{(1)}_M) \end{bmatrix}$
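A minimal NumPy sketch of this matrix form for one hidden layer; the tanh choice for $g$ and the random weights are only illustrative:

```python
import numpy as np

def hidden_layer(x, W1, b1, g=np.tanh):
    """One hidden layer in matrix form: a = W x + b, h = g(a).
    x: (d,) input, W1: (M, d) weight matrix, b1: (M,) bias vector."""
    a1 = W1 @ x + b1      # pre-activation vector a^(1)(x)
    h1 = g(a1)            # elementwise activation h^(1) = g(a^(1))
    return h1

rng = np.random.default_rng(0)
d, M = 4, 3
x = rng.normal(size=d)
W1, b1 = rng.normal(size=(M, d)), rng.normal(size=M)
print(hidden_layer(x, W1, b1))   # h^(1), a length-M vector
```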

Multilayer Neural Network

Output Layer Pre-activation:

$a^{(2)}_1\!\left(\boldsymbol{h}^{(1)}\right) = \left[\boldsymbol{w}^{(2)}_1 \;\; b^{(2)}_1\right] \begin{bmatrix} \boldsymbol{h}^{(1)} \\ 1 \end{bmatrix}$

Output Layer Activation:

$y = o\!\left(a^{(2)}_1\!\left(\boldsymbol{h}^{(1)}\right)\right)$

$o$ can have many options:
• sigmoid
• tanh
• ReLU, etc.
One special choice is the softmax function.
Multilayer Neural Network

Output Layer with softmax – with $C$ output units having pre-activations $a^{(2)}_1, \dots, a^{(2)}_C$:

$y_1 = o_1\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_1\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}, \quad y_2 = o_2\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_2\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}, \quad \dots, \quad y_C = o_C\!\left(\boldsymbol{a}^{(2)}\right) = \dfrac{\exp\!\left(a^{(2)}_C\right)}{\sum_{c} \exp\!\left(a^{(2)}_c\right)}$

Softmax function: exponentiate each component of the output pre-activation vector and elementwise divide by the sum of the exponentials.
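A minimal NumPy sketch of the softmax; subtracting the maximum pre-activation is a standard numerical-stability trick (not on the slide) and does not change the result, since softmax is invariant to adding a constant:

```python
import numpy as np

def softmax(a):
    """Softmax over the output pre-activation vector a^(2)."""
    z = np.exp(a - np.max(a))   # shift by max(a) for numerical stability
    return z / np.sum(z)

a2 = np.array([2.0, 1.0, 0.1])
y = softmax(a2)
print(y, y.sum())   # components sum to 1, so they can be read as class probabilities
```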


Training a Neural Network – Loss Function


[Figure: the network's softmax outputs $y_1, y_2, \dots, y_C$ – the probability that $\boldsymbol{x}$ belongs to class $1, 2, \dots, C$ respectively.]

So we can aim to maximize the probability corresponding to the correct class $c$ for any example:

$\max \; \boldsymbol{y}_c$

This can be equivalently expressed as minimizing $-\log \boldsymbol{y}_c$, known as the cross-entropy loss.
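A minimal sketch of the cross-entropy loss for a single example, assuming $\boldsymbol{y}$ is the softmax output and $c$ is the index of the correct class:

```python
import numpy as np

def cross_entropy(y, c):
    """Cross-entropy loss for one example: -log of the predicted probability
    of the true class c, where y is the softmax output vector."""
    return -np.log(y[c])

y = np.array([0.7, 0.2, 0.1])   # softmax output for a 3-class problem
print(cross_entropy(y, c=0))    # small loss: the correct class already has high probability
print(cross_entropy(y, c=2))    # large loss: the correct class has low probability
```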

Forward Pass in a Nutshell


$\boldsymbol{x} \;\xrightarrow{\;\boldsymbol{W}^{(1)},\,\boldsymbol{b}^{(1)}\;}\; \boldsymbol{a}^{(1)} \;\xrightarrow{\;g(\cdot)\;}\; \boldsymbol{h}^{(1)} \;\xrightarrow{\;\boldsymbol{W}^{(2)},\,\boldsymbol{b}^{(2)}\;}\; \boldsymbol{a}^{(2)} \;\xrightarrow{\;g(\cdot)\;}\; \boldsymbol{h}^{(2)} \;\cdots\; \xrightarrow{\;\boldsymbol{W}^{(L)},\,\boldsymbol{b}^{(L)}\;}\; \boldsymbol{a}^{(L)} \;\xrightarrow{\;o(\cdot)\;}\; \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta})$

$\boldsymbol{\theta}$ is the collection of all learnable parameters, i.e., all $\boldsymbol{W}^{(l)}$ and $\boldsymbol{b}^{(l)}$.

Hidden layer pre-activation:
$\boldsymbol{a}^{(l)}(\boldsymbol{x}) = \boldsymbol{b}^{(l)} + \boldsymbol{W}^{(l)} \boldsymbol{h}^{(l-1)}(\boldsymbol{x})$, with $\boldsymbol{h}^{(0)}(\boldsymbol{x}) = \boldsymbol{x}$, for $l = 1, \dots, L$

Hidden layer activation:
$\boldsymbol{h}^{(l)}(\boldsymbol{x}) = \boldsymbol{g}\!\left(\boldsymbol{a}^{(l)}(\boldsymbol{x})\right)$, for $l = 1, \dots, L-1$

Output layer activation:
$\boldsymbol{h}^{(L)}(\boldsymbol{x}) = \boldsymbol{o}\!\left(\boldsymbol{a}^{(L)}(\boldsymbol{x})\right) = \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta})$, for $l = L$
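A minimal NumPy sketch of this forward pass for a network with tanh hidden layers and a softmax output; the layer sizes and random parameters below are only illustrative:

```python
import numpy as np

def softmax(a):
    z = np.exp(a - np.max(a))
    return z / np.sum(z)

def forward(x, Ws, bs, g=np.tanh, o=softmax):
    """Forward pass through an L-layer network.
    Ws, bs: lists of weight matrices W^(1..L) and bias vectors b^(1..L)."""
    h = x                                   # h^(0) = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers l = 1 .. L-1
        a = W @ h + b                       # pre-activation a^(l)
        h = g(a)                            # activation h^(l)
    aL = Ws[-1] @ h + bs[-1]                # output pre-activation a^(L)
    return o(aL)                            # y = f(x, theta)

rng = np.random.default_rng(0)
d, M, C = 4, 5, 3                           # input dim, hidden width, number of classes
Ws = [rng.normal(size=(M, d)), rng.normal(size=(C, M))]
bs = [np.zeros(M), np.zeros(C)]
print(forward(rng.normal(size=d), Ws, bs))  # a length-C probability vector
```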


Backpropagation
$\boldsymbol{h}^{(l-1)} \xrightarrow{\;\boldsymbol{W}^{(l)},\,\boldsymbol{b}^{(l)}\;} \boldsymbol{a}^{(l)} \xrightarrow{\;g(\cdot)\;} \boldsymbol{h}^{(l)} \xrightarrow{\;\boldsymbol{W}^{(l+1)},\,\boldsymbol{b}^{(l+1)}\;} \boldsymbol{a}^{(l+1)} \xrightarrow{\;g(\cdot)\;} \boldsymbol{h}^{(l+1)} \;\cdots\; \xrightarrow{\;\boldsymbol{W}^{(L)},\,\boldsymbol{b}^{(L)}\;} \boldsymbol{a}^{(L)} \xrightarrow{\;o(\cdot)\;} \boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta}) \;\rightarrow\; J\!\left(\boldsymbol{f}(\boldsymbol{x}, \boldsymbol{\theta}), c\right)$

where $c$ is the true class of the example.

For gradient descent we need to compute $\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}}$ and $\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i}$.

Backpropagation consists of intelligent use of the chain rule of differentiation for computation of these gradients.


Backpropagation
$\boldsymbol{W}^{(l)}_{i,j}$ influences $J$ through $\boldsymbol{a}^{(l)}_i$. That means $J$ is a function of $\boldsymbol{a}^{(l)}_i$ and $\boldsymbol{a}^{(l)}_i$, in turn, is a function of $\boldsymbol{W}^{(l)}_{i,j}$.

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} \dfrac{\partial \boldsymbol{a}^{(l)}_i}{\partial \boldsymbol{W}^{(l)}_{i,j}}$

$\boldsymbol{a}^{(l)}_i = \sum_j \boldsymbol{W}^{(l)}_{i,j}\, \boldsymbol{h}^{(l-1)}_j + \boldsymbol{b}^{(l)}_i = \boldsymbol{W}^{(l)}_{i,1}\, \boldsymbol{h}^{(l-1)}_1 + \boldsymbol{W}^{(l)}_{i,2}\, \boldsymbol{h}^{(l-1)}_2 + \dots + \boldsymbol{W}^{(l)}_{i,j}\, \boldsymbol{h}^{(l-1)}_j + \dots + \boldsymbol{b}^{(l)}_i$

$\Rightarrow \dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} \dfrac{\partial \boldsymbol{a}^{(l)}_i}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)

We know $\boldsymbol{h}^{(l-1)}_j$ from the forward pass, so we need $\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$.

And similarly $\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)


Backpropagation
$\boldsymbol{a}^{(l)}_i$ influences $J$ through $\boldsymbol{h}^{(l)}_i$. That means $J$ is a function of $\boldsymbol{h}^{(l)}_i$ and $\boldsymbol{h}^{(l)}_i$, in turn, is a function of $\boldsymbol{a}^{(l)}_i$.

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} \dfrac{\partial \boldsymbol{h}^{(l)}_i}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

We know $g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ from the forward pass, so we need $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}$.


Backpropagation
$\boldsymbol{h}^{(l)}_i$ influences $J$ through all of $\boldsymbol{a}^{(l+1)}_1, \dots, \boldsymbol{a}^{(l+1)}_{M_{l+1}}$, where $M_{l+1}$ is the number of neurons in hidden layer $l+1$. That means $J$ is a function of each $\boldsymbol{a}^{(l+1)}_m$ and each $\boldsymbol{a}^{(l+1)}_m$, in turn, is a function of $\boldsymbol{h}^{(l)}_i$.

$\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} = \sum_{m=1}^{M_{l+1}} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l+1)}_m} \dfrac{\partial \boldsymbol{a}^{(l+1)}_m}{\partial \boldsymbol{h}^{(l)}_i}$

Remember: $\boldsymbol{a}^{(l+1)}_m = \sum_j \boldsymbol{W}^{(l+1)}_{m,j}\, \boldsymbol{h}^{(l)}_j + \boldsymbol{b}^{(l+1)}_m = \boldsymbol{W}^{(l+1)}_{m,1}\, \boldsymbol{h}^{(l)}_1 + \boldsymbol{W}^{(l+1)}_{m,2}\, \boldsymbol{h}^{(l)}_2 + \dots + \boldsymbol{W}^{(l+1)}_{m,i}\, \boldsymbol{h}^{(l)}_i + \dots + \boldsymbol{b}^{(l+1)}_m$

$\Rightarrow \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i} = \sum_{m=1}^{M_{l+1}} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l+1)}_m}\, \boldsymbol{W}^{(l+1)}_{m,i}$ … (4)


Backpropagation
[Figure: local computational graph of one unit – $\boldsymbol{h}^{(l-1)}_j$ is multiplied by $\boldsymbol{W}^{(l)}_{i,j}$, $\boldsymbol{b}^{(l)}_i$ is added to give $\boldsymbol{a}^{(l)}_i$, and $g(\cdot)$ produces $\boldsymbol{h}^{(l)}_i$. The backward path multiplies $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}$ by $g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ to get $\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing into the bias $\boldsymbol{b}^{(l)}_i$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing into the weight $\boldsymbol{W}^{(l)}_{i,j}$.]

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)


Backpropagation
[Figure: the same local computational graph, now also showing the gradient flowing back to $\boldsymbol{h}^{(l-1)}_j$.]

For all $i = 1 : M_l$:

$\dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{h}^{(l)}_i}\, g'\!\left(\boldsymbol{a}^{(l)}_i\right)$ … (3)

$\dfrac{\partial J}{\partial \boldsymbol{b}^{(l)}_i} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}$ … (2)

$\dfrac{\partial J}{\partial \boldsymbol{W}^{(l)}_{i,j}} = \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_i}\, \boldsymbol{h}^{(l-1)}_j$ … (1)

and $\dfrac{\partial J}{\partial \boldsymbol{h}^{(l-1)}_i} = \sum_{m=1}^{M_l} \dfrac{\partial J}{\partial \boldsymbol{a}^{(l)}_m}\, \boldsymbol{W}^{(l)}_{m,i}$ … (4)
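A minimal NumPy sketch of one backward step through layer $l$ using equations (1)–(4); the shapes and the tanh derivative below are only illustrative:

```python
import numpy as np

def layer_backward(dJ_da_next, W_next, a_l, h_prev, g_prime):
    """One backward step through layer l, following equations (1)-(4).
    dJ_da_next: dJ/da^(l+1) vector, W_next: W^(l+1) matrix,
    a_l: a^(l) vector, h_prev: h^(l-1) vector, g_prime: derivative of g."""
    dJ_dh = W_next.T @ dJ_da_next          # (4): sum_m dJ/da^(l+1)_m * W^(l+1)_{m,i}
    dJ_da = dJ_dh * g_prime(a_l)           # (3): elementwise multiply by g'(a^(l)_i)
    dJ_db = dJ_da                          # (2): bias gradient equals dJ/da^(l)_i
    dJ_dW = np.outer(dJ_da, h_prev)        # (1): dJ/dW^(l)_{i,j} = dJ/da^(l)_i * h^(l-1)_j
    return dJ_dW, dJ_db, dJ_da

# Toy shapes: layer l has 3 units, layer l+1 has 2, layer l-1 has 4
rng = np.random.default_rng(0)
dJ_da_next = rng.normal(size=2)
W_next = rng.normal(size=(2, 3))
a_l, h_prev = rng.normal(size=3), rng.normal(size=4)
tanh_prime = lambda a: 1.0 - np.tanh(a) ** 2
dW, db, da = layer_backward(dJ_da_next, W_next, a_l, h_prev, tanh_prime)
print(dW.shape, db.shape, da.shape)        # (3, 4) (3,) (3,)
```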

Training a Multilayer Neural Network for Non-linearly Separable Data

[Figure: Training Data]

Learned Decision Boundary with Single Hidden Layer

[Figure: learned decision boundaries for # of hidden neurons = 2, 4, 8, 16, 32, 64, 128, 256]

Learned Decision Boundary with Two Hidden Layers

[Figure: learned decision boundaries for # of neurons in each hidden layer = 2, 4, 8, 16, 32]
