
AI VIETNAM

All-in-One Course

Insight into
Logistic Regression

Quang-Vinh Dinh
Ph.D. in Computer Science
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Demo
Implementation - One Sample

1) Pick a sample (x, y) from the training data

2) Compute the output ŷ
   z = wx + b
   ŷ = σ(z) = 1 / (1 + e^(−z))

3) Compute loss
   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)

4) Compute derivatives
   ∂L/∂w = x(ŷ − y)        ∂L/∂b = ŷ − y

5) Update parameters
   w = w − η ∂L/∂w          b = b − η ∂L/∂b

Question: if the number of features changes, which functions are affected?
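A minimal NumPy sketch of these five steps for one sample and one feature (my own illustration, not code from the slides; the sample and parameter values are assumed):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.4, 0.0        # one sample: feature and label (assumed values)
w, b = 0.5, 0.1        # current parameters
eta = 0.01             # learning rate

z = w * x + b                                              # 2) linear part
y_hat = sigmoid(z)                                         #    sigmoid output
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)    # 3) BCE loss
dw, db = x * (y_hat - y), (y_hat - y)                      # 4) derivatives
w, b = w - eta * dw, b - eta * db                          # 5) update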
Demo

1) Pick a sample (x, y) from the training data

2) Compute the output ŷ
   z = w1·x1 + w2·x2 + b
   ŷ = σ(z) = 1 / (1 + e^(−z))

3) Compute loss
   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)

4) Compute derivatives (How to solve the problem?)
   ∂L/∂w_i = x_i(ŷ − y)        ∂L/∂b = ŷ − y

5) Update parameters
   w_i = w_i − η ∂L/∂w_i        b = b − η ∂L/∂b
Vector/Matrix Operations

Transpose
   v = [v1, …, vn]^T    →    v^T = [v1  …  vn]
   A = [[a11 … a1n], …, [am1 … amn]]    →    A^T = [[a11 … am1], …, [a1n … amn]]
   Example:  [[1, 2], [3, 4]]^T = [[1, 3], [2, 4]]

Multiply with a number
   αu = α[u1, …, un]^T = [αu1, …, αun]^T
   Example:  2 · [1, 2, 3]^T = [2, 4, 6]^T
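These operations map directly onto NumPy (a small sketch using the example values above):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.T)                   # transpose: [[1 3], [2 4]]

u = np.array([1, 2, 3])
print(2 * u)                 # scalar multiplication: [2 4 6]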
Vector/Matrix Operations

Dot product
   v = [v1, …, vn]^T,   u = [u1, …, un]^T
   v · u = v1·u1 + … + vn·un
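In NumPy the dot product is np.dot (or the @ operator); the vectors below are assumed example values that reproduce the slide's result of 8:

import numpy as np

v = np.array([1.0, 2.0])
u = np.array([2.0, 3.0])
print(np.dot(v, u))          # 1*2 + 2*3 = 8.0
print(v @ u)                 # same result with the @ operator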
Vectorization

Traditional steps (one sample (x, y), one feature):
1) Pick a sample (x, y) from the training data
2) Compute the output ŷ:   z = wx + b,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
4) Compute derivatives:   ∂L/∂w = x(ŷ − y),   ∂L/∂b = ŷ − y
5) Update parameters:   w = w − η ∂L/∂w,   b = b − η ∂L/∂b   (η is the learning rate)

What will we do? Fold the bias into the parameter vector and prepend a 1 to the input:
   x = [1, x]^T,   θ = [b, w]^T,   θ^T = [b  w]
Then step 2 becomes a dot product:
   z = wx + b·1 = [b  w][1, x]^T = θ^T x
   ŷ = σ(z) = 1 / (1 + e^(−z))
   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
(z, ŷ, and L are still plain numbers.)
The two derivatives share the common factor (ŷ − y):
   ∂L/∂b = ŷ − y = (ŷ − y) × 1
   ∂L/∂w = x(ŷ − y) = (ŷ − y) × x
Stacking them in the same order as θ = [b, w]^T:
   ∇θL = [∂L/∂b, ∂L/∂w]^T = [(ŷ − y) × 1, (ŷ − y) × x]^T = (ŷ − y)[1, x]^T
   →  ∇θL = x(ŷ − y)
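The stacked gradient is a single NumPy expression (a sketch with assumed sample values):

import numpy as np

x = np.array([1.0, 1.4])                    # [1, x]: input with the bias feature prepended
theta = np.array([0.1, 0.5])                # [b, w]
y = 0.0

y_hat = 1.0 / (1.0 + np.exp(-(theta @ x)))  # y_hat = sigmoid(theta^T x)
grad = x * (y_hat - y)                      # gradient of L with respect to theta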
Vectorization:   z = θ^T x,   x = [1, x]^T,   θ = [b, w]^T,   ∇θL = [∂L/∂b, ∂L/∂w]^T

The two update rules
   b = b − η ∂L/∂b,   w = w − η ∂L/∂w
likewise collapse into a single vector update:
   →  θ = θ − η ∇θL   (η is the learning rate)
Vectorization: traditional vs vectorized

1) Pick a sample (x, y) from the training data   (vectorized: x = [1, x]^T)
2) Compute output ŷ
   traditional:  z = wx + b                          vectorized:  z = θ^T x = x^T θ
   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss
   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
4) Compute derivative
   traditional:  ∂L/∂w = x(ŷ − y),  ∂L/∂b = ŷ − y    vectorized:  ∇θL = x(ŷ − y)
5) Update parameters
   traditional:  w = w − η ∂L/∂w,  b = b − η ∂L/∂b   vectorized:  θ = θ − η ∇θL
   (η is the learning rate)
Vectorization
❖ Implementation (using Numpy)

1) Pick a sample (x, y) from the training data
2) Compute output ŷ:   z = θ^T x = x^T θ,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
4) Compute derivative:   ∇θL = x(ŷ − y)
5) Update parameters:   θ = θ − η ∇θL   (η is the learning rate)

# Given X and y
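The slide leaves only the placeholder comment above; a minimal completion of the per-sample (stochastic) update in NumPy could look like the sketch below. The toy data, epoch count, and learning rate are assumptions for illustration, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Given X and y (assumed toy data; the first column of X is the bias feature 1)
X = np.array([[1.0, 1.4, 0.2],
              [1.0, 4.1, 1.3]])
y = np.array([0.0, 1.0])

theta = np.array([0.1, 0.5, -0.1])   # [b, w1, w2]
eta = 0.01                           # learning rate

for epoch in range(50):
    for xi, yi in zip(X, y):                                        # 1) pick one sample
        y_hat = sigmoid(xi @ theta)                                 # 2) output
        loss = -yi * np.log(y_hat) - (1 - yi) * np.log(1 - y_hat)   # 3) BCE loss
        grad = xi * (y_hat - yi)                                    # 4) gradient
        theta = theta - eta * grad                                  # 5) update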
Worked example. Given
   x = [1, x1, x2]^T = [1, 1.4, 0.2]^T,   θ = [b, w1, w2]^T = [0.1, 0.5, −0.1]^T,   label y = 0,   η = 0.01

1) Pick the sample (x, y) from the training data.
2) Compute output:   ŷ = σ(θ^T x) = 0.6856
3) Compute loss:   L = −y log ŷ − (1 − y) log(1 − ŷ) = 1.1573
4) Compute derivative:
   ∇θL = x(ŷ − y) = [1, 1.4, 0.2]^T × 0.6856 = [0.6856, 0.9599, 0.1371]^T = [L′_b, L′_w1, L′_w2]^T
5) Update parameters:
   θ − η∇θL = [0.1, 0.5, −0.1]^T − η [0.6856, 0.9599, 0.1371]^T ≈ [0.0931, 0.4904, −0.1014]^T
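These numbers can be reproduced with a few lines of NumPy (my own check of the slide's arithmetic):

import numpy as np

x = np.array([1.0, 1.4, 0.2])
theta = np.array([0.1, 0.5, -0.1])
y, eta = 0.0, 0.01

y_hat = 1.0 / (1.0 + np.exp(-(theta @ x)))                # 0.6856
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)   # 1.1573
grad = x * (y_hat - y)                                    # [0.6856, 0.9599, 0.1371]
print(y_hat, loss, grad, theta - eta * grad)              # update ≈ [0.0931, 0.4904, -0.1014]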
Logistic Regression - Stochastic

Dataset sample:   x = [1, 1.4, 0.2]^T,   y = 0

1) Pick a sample (x, y) from the training data
2) Compute output ŷ:   z = θ^T x,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(θ) = −y log ŷ − (1 − y) log(1 − ŷ)
4) Compute derivative:   ∇θL = x(ŷ − y)
5) Update parameters:   θ = θ − η ∇θL   (η is the learning rate)

Demo
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Optimization for One+ Samples
❖ Equations for partial gradients

Model and per-sample squared error, with training data (x^(1) = 1, y^(1) = 5) and (x^(2) = 2, y^(2) = 7):
   f(x^(i)) = a·x^(i) + b
   g(f^(i)) = (f^(i) − y^(i))²

Partial derivatives (chain rule through the computation x → f → g):
   df/da = x,   df/db = 1,   dg/df = 2(f − y)
   dg^(i)/da = (dg/df^(i)) · (df^(i)/da) = 2x(f − y)
   dg^(i)/db = (dg/df^(i)) · (df^(i)/db) = 2(f − y)

While searching for the optimal a and b, at any given time a and b have concrete values.
❖ Optimization for a composite function

Find a and b so that g(f(x)) is minimum:
   f(x^(i)) = a·x^(i) + b,   g(f^(i)) = (f^(i) − y^(i))²,   (x^(1) = 1, y^(1) = 5), (x^(2) = 2, y^(2) = 7)

Partial derivative functions:
   dg/da = (dg/df)(df/da) = 2x(f − y)
   dg/db = (dg/df)(df/db) = 2(f − y)

With two samples, the gradients are summed over the samples:
   Σ_i dg^(i)/da = (dg^(1)/df^(1))(df^(1)/da) + (dg^(2)/df^(2))(df^(2)/da)
   Σ_i dg^(i)/db = (dg^(1)/df^(1))(df^(1)/db) + (dg^(2)/df^(2))(df^(2)/db)
Optimization
❖ How to use gradient information

Option 1: use the gradient from sample 1 (info 1) for an update at time t, then the gradient from sample 2 (info 2) for an update at time t+1.
Option 2: combine info 1 and info 2 into a single update.

Summary 1 (one update per sample):
   dg/da = (dg/df)(df/da) = 2x(f − y)
   dg/db = (dg/df)(df/db) = 2(f − y)
   Initialize a, b
   Compute the partial gradients at a, b
   Move a, b opposite to dg/da, dg/db
   η = 0.01
Summary 2 (one update from the combined info 1 + info 2):
   dg/da = (dg/df)(df/da) = 2x(f − y)
   dg/db = (dg/df)(df/db) = 2(f − y)
   Initialize a, b
   Compute the partial gradients at a, b (summed over the samples)
   Move a, b opposite to dg/da, dg/db
The slides run this procedure once with η = 0.01 and once with η = 0.001.
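A short sketch of this procedure for the two samples (x, y) = (1, 5) and (2, 7), summing the per-sample gradients before each update (my own illustration of the combined-update scheme; the initialization and step count are assumptions):

# gradient descent for f(x) = a*x + b with squared error g = (f - y)^2
data = [(1.0, 5.0), (2.0, 7.0)]       # (x, y) pairs from the slides
a, b = 0.0, 0.0                       # assumed initialization
eta = 0.01                            # learning rate (the slides also try 0.001)

for step in range(10_000):
    da = sum(2 * x * (a * x + b - y) for x, y in data)   # dg/da summed over samples
    db = sum(2 * (a * x + b - y) for x, y in data)       # dg/db summed over samples
    a, b = a - eta * da, b - eta * db                     # move opposite to the gradient

print(a, b)   # converges toward a = 2, b = 3, since 2*1 + 3 = 5 and 2*2 + 3 = 7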
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Logistic Regression (m samples)
❖ Construct formulas

Dataset (m = 2 samples, one row of x per sample):
   x = [[1, x1^(1), x2^(1)], [1, x1^(2), x2^(2)]] = [[1, 1.5, 0.2], [1, 4.1, 1.3]]
   θ = [b, w1, w2]^T = [0.1, 0.5, −0.1]^T
   y = [0, 1]^T

2) Compute output ŷ
   z = [z^(1), z^(2)]^T = [w1·x1^(1) + w2·x2^(1) + b,  w1·x1^(2) + w2·x2^(2) + b]^T = xθ

From the NumPy perspective the whole batch is one matrix-vector product:
   z = xθ = [0.83, 2.02]^T
   ŷ = σ(z) = [1/(1 + e^(−z^(1))), 1/(1 + e^(−z^(2)))]^T = [0.6963, 0.8828]^T   (≈ [0.69, 0.88])
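In NumPy this batched forward pass is a single matrix-vector product (a sketch reproducing the numbers above):

import numpy as np

x = np.array([[1.0, 1.5, 0.2],
              [1.0, 4.1, 1.3]])
theta = np.array([0.1, 0.5, -0.1])    # [b, w1, w2]

z = x @ theta                         # [0.83, 2.02]
y_hat = 1.0 / (1.0 + np.exp(-z))      # [0.6963, 0.8828]
print(z, y_hat)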
Logistic Regression (m samples)
❖ Construct formulas

3) Compute loss

The batch loss is the mean of the per-sample binary cross-entropy losses:
   L^(1)(ŷ^(1), y^(1)) = −y^(1) log ŷ^(1) − (1 − y^(1)) log(1 − ŷ^(1))
   L^(2)(ŷ^(2), y^(2)) = −y^(2) log ŷ^(2) − (1 − y^(2)) log(1 − ŷ^(2))
   L(ŷ, y) = (L^(1) + L^(2)) / m

In vector form, with x = [[1, 1.5, 0.2], [1, 4.1, 1.3]], y = [0, 1]^T, θ = [0.1, 0.5, −0.1]^T:
   L(ŷ, y) = (1/m) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
4) Compute derivative

Per-sample derivatives (writing x0^(i) = 1 for the bias column):
   sample 1:   ∂L^(1)/∂b = ŷ^(1) − y^(1),   ∂L^(1)/∂w1 = x1^(1)(ŷ^(1) − y^(1)),   ∂L^(1)/∂w2 = x2^(1)(ŷ^(1) − y^(1))
   sample 2:   ∂L^(2)/∂b = ŷ^(2) − y^(2),   ∂L^(2)/∂w1 = x1^(2)(ŷ^(2) − y^(2)),   ∂L^(2)/∂w2 = x2^(2)(ŷ^(2) − y^(2))

Averaging over the batch:
   ∂L/∂b  = (∂L^(1)/∂b + ∂L^(2)/∂b) / m   = (1/m)(1·(ŷ^(1) − y^(1)) + 1·(ŷ^(2) − y^(2)))
          = (1/m)[x0^(1)  x0^(2)] [ŷ^(1) − y^(1),  ŷ^(2) − y^(2)]^T
   ∂L/∂w1 = (∂L^(1)/∂w1 + ∂L^(2)/∂w1) / m = (1/m)[x1^(1)  x1^(2)] [ŷ^(1) − y^(1),  ŷ^(2) − y^(2)]^T
   ∂L/∂w2 = (∂L^(1)/∂w2 + ∂L^(2)/∂w2) / m = (1/m)[x2^(1)  x2^(2)] [ŷ^(1) − y^(1),  ŷ^(2) − y^(2)]^T
4) Compute derivative (stacked form)

Stacking the three rows gives the full gradient; the matrix on the left is exactly x^T, whose columns are the samples x^(1) and x^(2):
   ∇θL = [∂L/∂b, ∂L/∂w1, ∂L/∂w2]^T
       = (1/m) [[x0^(1), x0^(2)], [x1^(1), x1^(2)], [x2^(1), x2^(2)]] (ŷ − y)
       = (1/m) x^T (ŷ − y)
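In code the whole derivation collapses to one line (a sketch using the running example's values):

import numpy as np

x = np.array([[1.0, 1.5, 0.2],
              [1.0, 4.1, 1.3]])
y = np.array([0.0, 1.0])
y_hat = np.array([0.6963, 0.8828])    # from the forward pass above
m = x.shape[0]

grad = x.T @ (y_hat - y) / m          # (1/m) x^T (y_hat - y)
print(grad)                           # ≈ [0.2896, 0.2822, -0.0065]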
5) Update parameters

   ∇θL = [∂L/∂b, ∂L/∂w1, ∂L/∂w2]^T = (1/m) x^T (ŷ − y)

   b  = b  − η ∂L/∂b
   w1 = w1 − η ∂L/∂w1        which is the single vector update        θ = θ − η ∇θL
   w2 = w2 − η ∂L/∂w2
Logistic Regression - Minibatch

Mini-batch m = 2:   x = [[1, x1^(1), x2^(1)], [1, x1^(2), x2^(2)]],   θ = [b, w1, w2]^T

1) Pick m samples from the training data
2) Compute output ŷ:   z = xθ,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(ŷ, y) = (1/m) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
4) Compute derivative:   ∇θL = (1/m) x^T (ŷ − y)
5) Update parameters:   θ = θ − η ∇θL   (η is the learning rate)
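The five steps fit in a single mini-batch update function in NumPy (my own sketch of the algorithm above; the function name is mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_step(x, y, theta, eta):
    """One mini-batch update; x is (m, n_features + 1) with a leading column of 1s."""
    m = x.shape[0]
    y_hat = sigmoid(x @ theta)                                       # 2) output
    loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / m    # 3) BCE loss
    grad = x.T @ (y_hat - y) / m                                     # 4) gradient
    return theta - eta * grad, loss                                  # 5) update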
Worked example (mini-batch, m = 2)

Dataset:   x = [[1, 1.5, 0.2], [1, 4.1, 1.3]],   y = [0, 1]^T
Model:     θ = [0.1, 0.5, −0.1]^T

2) Compute output:
   ŷ = σ(xθ) = σ([0.83, 2.02]^T) = [0.6963, 0.8828]^T

3) Compute loss:   L(θ) = (1/m) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ] = …
3) Compute loss:
   L(θ) = (1/m) [ −[0  1] [log 0.6963, log 0.8828]^T − [1  0] [log(1 − 0.6963), log(1 − 0.8828)]^T ]
        = (1/m) [ −log 0.8828 − log(1 − 0.6963) ]
        = (0.1246 + 1.1917) / 2
        = 0.65815
4) Compute derivative:
   ∇θL = (1/m) x^T (ŷ − y)
       = (1/2) [[1, 1], [1.5, 4.1], [0.2, 1.3]] ([0.6963, 0.8828]^T − [0, 1]^T)
       = (1/2) [[1, 1], [1.5, 4.1], [0.2, 1.3]] [0.6963, −0.1172]^T
       = [0.28961, 0.28217, −0.0064]^T
5) Update parameters (η = 0.01):
   θ − η∇θL = [0.1, 0.5, −0.1]^T − η [0.28961, 0.28217, −0.0064]^T ≈ [0.0971, 0.4971, −0.099]^T
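A quick NumPy check of this worked example (self-contained; it recomputes the same quantities):

import numpy as np

x = np.array([[1.0, 1.5, 0.2],
              [1.0, 4.1, 1.3]])
y = np.array([0.0, 1.0])
theta = np.array([0.1, 0.5, -0.1])
eta, m = 0.01, 2

y_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))                       # [0.6963, 0.8828]
loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / m    # ≈ 0.6581
grad = x.T @ (y_hat - y) / m                                     # ≈ [0.2896, 0.2822, -0.0065]
print(loss, theta - eta * grad)                                  # update ≈ [0.097, 0.497, -0.100]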
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Logistic Regression - Batch

Use all N training samples at once; x has one row per sample:
   x = [[1, x1^(1), x2^(1)], [1, x1^(2), x2^(2)], [1, x1^(3), x2^(3)], [1, x1^(4), x2^(4)]],   θ = [b, w1, w2]^T

1) Pick all the samples from the training data
2) Compute output ŷ:   z = xθ,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(ŷ, y) = (1/N) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
4) Compute derivative:   ∇θL = (1/N) x^T (ŷ − y)
5) Update parameters:   θ = θ − η ∇θL   (η is the learning rate)
Logistic Regression - Batch

Dataset (N = 4):
   x = [[1, 1.4, 0.2], [1, 1.5, 0.2], [1, 3.0, 1.1], [1, 4.1, 1.3]],   y = [0, 0, 1, 1]^T,   θ = [b, w1, w2]^T

The five steps are the same as above, applied to the whole dataset:
1) Pick all the samples from the training data
2) Compute output ŷ:   z = xθ,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss:   L(ŷ, y) = (1/N) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
4) Compute derivative:   ∇θL = (1/N) x^T (ŷ − y)
5) Update parameters:   θ = θ − η ∇θL
Worked example (batch, N = 4), with θ = [0.1, 0.5, −0.1]^T:

2) Compute output:
   ŷ = σ(xθ) = σ([[1, 1.4, 0.2], [1, 1.5, 0.2], [1, 3.0, 1.1], [1, 4.1, 1.3]] [0.1, 0.5, −0.1]^T)
             = σ([0.78, 0.83, 1.49, 2.02]^T)
             = [0.6856, 0.6963, 0.8160, 0.8828]^T
3) Compute loss (with y = [0, 0, 1, 1]^T):
   L(θ) = (1/N) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
        = (1/N) [ −[0, 0, 1, 1] log([0.6856, 0.6963, 0.8160, 0.8828]^T)
                  −[1, 1, 0, 0] log([0.3144, 0.3037, 0.1840, 0.1172]^T) ]
        = (1/N) [ −log 0.8160 − log 0.8828 − log 0.3144 − log 0.3037 ]
        = 0.6691
4) Compute derivative:
   ∇θL = (1/N) x^T (ŷ − y)
       = (1/4) [[1, 1, 1, 1], [1.4, 1.5, 3.0, 4.1], [0.2, 0.2, 1.1, 1.3]] ([0.6856, 0.6963, 0.8160, 0.8828]^T − [0, 0, 1, 1]^T)
       = (1/4) [[1, 1, 1, 1], [1.4, 1.5, 3.0, 4.1], [0.2, 0.2, 1.1, 1.3]] [0.6856, 0.6963, −0.184, −0.1172]^T
       = [0.2702, 0.2431, −0.019]^T
5) Update parameters (η = 0.01):
   θ − η∇θL = [0.1, 0.5, −0.1]^T − η [0.2702, 0.2431, −0.019]^T ≈ [0.0973, 0.4976, −0.0998]^T
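Putting the whole batch procedure together as a training loop on the four-sample dataset (a sketch; the learning rate and number of epochs here are assumptions, not values from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[1.0, 1.4, 0.2],
              [1.0, 1.5, 0.2],
              [1.0, 3.0, 1.1],
              [1.0, 4.1, 1.3]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, 0.5, -0.1])
eta, n = 0.1, len(y)

for epoch in range(1000):
    y_hat = sigmoid(x @ theta)                                       # forward pass on all N samples
    loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / n    # batch BCE loss
    grad = x.T @ (y_hat - y) / n                                     # (1/N) x^T (y_hat - y)
    theta = theta - eta * grad                                       # gradient step

print(loss, theta)   # the loss decreases from the initial 0.6691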
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Hessian Matrices
❖ Definition

The Hessian matrix (or Hessian) is a square matrix of second-order partial derivatives of a scalar-valued function.
https://en.wikipedia.org/wiki/Hessian_matrix

Given f(x, y), f: R² → R:
   H_f = [[∂²f/∂x²,  ∂²f/∂x∂y],
          [∂²f/∂x∂y, ∂²f/∂y²]]

Example: f(x, y) = x² + 2x²y + y³
   ∂f/∂x = 2x + 4xy,   ∂f/∂y = 2x² + 3y²
   H_f = [[2 + 4y, 4x],
          [4x,     6y]]
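The example Hessian can be checked symbolically with SymPy (my own verification sketch, assuming SymPy is available):

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2 * x**2 * y + y**3

print(sp.hessian(f, (x, y)))   # Matrix([[4*y + 2, 4*x], [4*x, 6*y]])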
Binary Cross-Entropy
❖ Convex function

Model and loss:
   z = θ^T x,   ŷ = σ(z) = 1 / (1 + e^(−z)),   L = −y log ŷ − (1 − y) log(1 − ŷ)

Derivative (chain rule):   ∂L/∂θ_i = (∂L/∂ŷ)(∂ŷ/∂z)(∂z/∂θ_i)
   ∂L/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ) = (ŷ − y) / (ŷ(1 − ŷ))
   ∂ŷ/∂z = ŷ(1 − ŷ)
   ∂z/∂θ_i = x_i
   ⟹ ∂L/∂θ_i = x_i(ŷ − y)

Second derivative:
   ∂²L/∂θ_i² = ∂/∂θ_i [x_i(ŷ − y)] = x_i²(ŷ − ŷ²) ≥ 0
since x_i² ≥ 0 and ŷ − ŷ² = ŷ(1 − ŷ) ∈ [0, 1/4]. The second derivative is never negative, so the BCE loss is convex in the parameters.
Logistic Regression - MSE
❖ Construct loss

Model and loss:
   z = θ^T x = x^T θ,   ŷ = σ(z) = 1 / (1 + e^(−z)),   L = (ŷ − y)²

Derivative:   ∂L/∂θ_i = (∂L/∂ŷ)(∂ŷ/∂z)(∂z/∂θ_i)
   ∂L/∂ŷ = 2(ŷ − y),   ∂ŷ/∂z = ŷ(1 − ŷ),   ∂z/∂θ_i = x_i
   ⟹ ∂L/∂θ_i = 2x_i(ŷ − y)ŷ(1 − ŷ)
Mean Squared Error - second derivative

   ∂L/∂θ_i = 2x_i(ŷ − y)ŷ(1 − ŷ) = 2x_i(−ŷ³ + ŷ² − yŷ + yŷ²)

   ∂²L/∂θ_i² = 2x_i ∂/∂θ_i (−ŷ³ + ŷ² − yŷ + yŷ²)
             = 2x_i (−3ŷ² + 2ŷ − y + 2yŷ) · x_i ŷ(1 − ŷ)
             = 2x_i² ŷ(1 − ŷ)(−3ŷ² + 2ŷ − y + 2yŷ)
Mean Squared Error - sign of the second derivative

   ∂²L/∂θ_i² = 2x_i² ŷ(1 − ŷ)(−3ŷ² + 2ŷ − y + 2yŷ)

   x_i² ≥ 0 and ŷ(1 − ŷ) ∈ [0, 1/4], so the sign is determined by f(ŷ) = −3ŷ² + 2ŷ − y + 2yŷ:
   for y = 0:   f(ŷ) = −3ŷ² + 2ŷ
   for y = 1:   f(ŷ) = −3ŷ² + 4ŷ − 1
Both expressions change sign on (0, 1), so the second derivative can become negative: the MSE loss with a sigmoid output is not convex in the parameters.
MSE and BCE
❖ Visualization: loss curves for Mean Squared Error and Binary Cross-Entropy (figure)
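The comparison can be reproduced by plotting each loss as a function of a single parameter for one sample with a sigmoid output (my own reconstruction of what the figure likely shows; the sample values are assumptions):

import numpy as np
import matplotlib.pyplot as plt

x, y = 1.0, 0.0                           # assumed single sample
theta = np.linspace(-10, 10, 400)         # sweep one weight
y_hat = 1.0 / (1.0 + np.exp(-theta * x))  # sigmoid output

bce = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
mse = (y_hat - y) ** 2

plt.plot(theta, bce, label='Binary Cross-Entropy')
plt.plot(theta, mse, label='Mean Squared Error')
plt.xlabel('theta')
plt.ylabel('loss')
plt.legend()
plt.show()   # BCE is convex in theta; MSE flattens out and is not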
Sigmoid and Tanh Functions

   sigmoid(x) = 1 / (1 + e^(−x)),        tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Relationship between the two:
   sigmoid(2x) = 1 / (1 + e^(−2x)),      tanh(x) = 2 · 1/(1 + e^(−2x)) − 1
   ⟹ tanh(x) = 2 · sigmoid(2x) − 1
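A quick numeric check of this identity (my own sketch):

import numpy as np

x = np.linspace(-5.0, 5.0, 11)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True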
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Tanh Function

Two equivalent rewritings of tanh:
   tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = (e^(2x) − 1) / (e^(2x) + 1) = 1 − 2 / (e^(2x) + 1)

   tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = (1 − e^(−2x)) / (1 + e^(−2x))
           = −(e^(−2x) − 1) / (e^(−2x) + 1) = 2 / (e^(−2x) + 1) − 1
Tanh Function - derivative (quotient rule)

   tanh′(x) = [ (e^x − e^(−x)) / (e^x + e^(−x)) ]′
            = [ (e^x + e^(−x))(e^x + e^(−x)) − (e^x − e^(−x))(e^x − e^(−x)) ] / (e^x + e^(−x))²
            = [ (e^x + e^(−x))² − (e^x − e^(−x))² ] / (e^x + e^(−x))²
            = 1 − [ (e^x − e^(−x)) / (e^x + e^(−x)) ]²
            = 1 − tanh²(x)
Tanh Function - derivative (from tanh(x) = 2/(e^(−2x) + 1) − 1)

   tanh′(x) = [ 2 / (e^(−2x) + 1) − 1 ]′
            = 4e^(−2x) / (e^(−2x) + 1)²
            = 4 [ (e^(−2x) + 1) − 1 ] / (e^(−2x) + 1)²
            = 4 / (e^(−2x) + 1) − 4 / (e^(−2x) + 1)²
            = 1 − [ 2 / (e^(−2x) + 1) − 1 ]²
            = 1 − tanh²(x)
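The result can be checked numerically with a central finite difference (my own sketch):

import numpy as np

x = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = 1 - np.tanh(x) ** 2                                # 1 - tanh^2(x)
print(np.allclose(numeric, analytic, atol=1e-8))              # True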
Logistic Regression - Tanh
❖ Construct loss

Model and loss:
   z = θ^T x = x^T θ
   ŷ = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))        (ŷ ∈ (−1, 1))
   ŷ_s = (ŷ + 1) / 2                                     (rescaled to (0, 1))
   L = −y log ŷ_s − (1 − y) log(1 − ŷ_s)

Derivative:   ∂L/∂θ_i = (∂L/∂ŷ_s)(∂ŷ_s/∂ŷ)(∂ŷ/∂z)(∂z/∂θ_i)
   ∂L/∂ŷ_s = −y/ŷ_s + (1 − y)/(1 − ŷ_s) = (ŷ_s − y) / (ŷ_s(1 − ŷ_s))
   ∂ŷ_s/∂ŷ = 1/2,   ∂ŷ/∂z = 1 − ŷ²,   ∂z/∂θ_i = x_i
   ⟹ ∂L/∂θ_i = x_i (ŷ_s − y)(1 − ŷ²) / (2ŷ_s(1 − ŷ_s))
Logistic Regression - Tanh (continued)

Substituting ŷ_s = (ŷ + 1)/2 into the derivative:
   ∂L/∂θ_i = x_i ((ŷ + 1)/2 − y)(1 − ŷ²) / (2 · (ŷ + 1)/2 · (1 − (ŷ + 1)/2))
           = x_i (ŷ + 1 − 2y)(1 − ŷ²) / ((ŷ + 1)(1 − ŷ))
           = x_i (ŷ + 1 − 2y)

Since ŷ + 1 − 2y = 2(ŷ_s − y), the gradient keeps the same "prediction minus target" structure as in the sigmoid case.
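A numerical check of the closed-form gradient against finite differences (my own sketch; the sample and parameter values are assumptions):

import numpy as np

def loss(theta, x, y):
    y_hat = np.tanh(theta @ x)      # tanh output in (-1, 1)
    y_s = (y_hat + 1) / 2           # rescaled to (0, 1)
    return -y * np.log(y_s) - (1 - y) * np.log(1 - y_s)

x = np.array([1.0, 1.4, 0.2])       # sample with a bias feature, values assumed
y = 0.0
theta = np.array([0.1, 0.5, -0.1])

analytic = x * (np.tanh(theta @ x) + 1 - 2 * y)   # x_i (y_hat + 1 - 2y)

eps = 1e-6
numeric = np.array([(loss(theta + eps * e, x, y) - loss(theta - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-6), analytic)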
Summary

Logistic regression uses the sigmoid function y = 1/(1 + e^(−x)), which maps any input x ∈ (−∞, +∞) to a value in (0, 1). The full (batch) training procedure:

1) Pick all the samples from the training data
2) Compute output ŷ:   z = xθ,   ŷ = σ(z) = 1 / (1 + e^(−z))
3) Compute loss (binary cross-entropy):   L(θ) = (1/N) [ −y^T log ŷ − (1 − y)^T log(1 − ŷ) ]
4) Compute derivative:   ∇θL = (1/N) x^T (ŷ − y)
5) Update parameters:   θ = θ − η ∇θL   (η is the learning rate)
