Step 1: Function Set

Function set, including all different $w$ and $b$:
$$f_{w,b}(x) = P_{w,b}(C_1 \mid x) = \sigma(z), \qquad z = w \cdot x + b = \sum_i w_i x_i + b$$

Sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: each input $x_i$ is multiplied by its weight $w_i$; the weighted sum plus the bias $b$ gives $z$, which the sigmoid squashes into $f_{w,b}(x) = P_{w,b}(C_1 \mid x)$.]
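A minimal sketch of this function set in Python (NumPy assumed; the feature vector, weights, and bias below are illustrative):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z}), squashing z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w, b):
    """Logistic regression: f_{w,b}(x) = P(C1 | x) = sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Every choice of (w, b) is one member of the function set.
x = np.array([1.0, 2.0])
print(f(x, w=np.array([0.5, -0.3]), b=0.1))
print(f(x, w=np.array([1.0, 1.0]), b=-2.0))
```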
Logistic Regression v.s. Linear Regression

Step 1: Logistic regression: $f_{w,b}(x) = \sigma\big(\sum_i w_i x_i + b\big)$; the output is between 0 and 1. Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$; the output can be any value.
Step 2 and Step 3 are compared below, after each is worked out for logistic regression.
Step 2: Goodness of a Function

Training data: $x^1, x^2, x^3, \ldots, x^N$ with classes $C_1, C_1, C_2, \ldots, C_1$.

Assume the training data is generated from $f_{w,b}(x) = P_{w,b}(C_1 \mid x)$. The likelihood of the data is
$$L(w, b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \big(1 - f_{w,b}(x^3)\big) \cdots f_{w,b}(x^N)$$

The most likely $w$ and $b$ maximize the likelihood, or equivalently minimize its negative log:
$$w^*, b^* = \arg\max_{w,b} L(w, b) = \arg\min_{w,b} -\ln L(w, b)$$

Encode the labels as $\hat{y}^n = 1$ for class 1 and $\hat{y}^n = 0$ for class 2. Then every factor of $-\ln L(w, b)$ takes the same form:
$$-\ln f_{w,b}(x^1) = -\big[\hat{y}^1 \ln f(x^1) + (1 - \hat{y}^1) \ln\big(1 - f(x^1)\big)\big] \qquad (\hat{y}^1 = 1)$$
$$-\ln f_{w,b}(x^2) = -\big[\hat{y}^2 \ln f(x^2) + (1 - \hat{y}^2) \ln\big(1 - f(x^2)\big)\big] \qquad (\hat{y}^2 = 1)$$
$$-\ln\big(1 - f_{w,b}(x^3)\big) = -\big[\hat{y}^3 \ln f(x^3) + (1 - \hat{y}^3) \ln\big(1 - f(x^3)\big)\big] \qquad (\hat{y}^3 = 0)$$
$$\cdots$$

Each term is a cross entropy:
$$C\big(f(x^n), \hat{y}^n\big) = -\big[\hat{y}^n \ln f(x^n) + (1 - \hat{y}^n) \ln\big(1 - f(x^n)\big)\big]$$
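A minimal sketch of this loss in Python (NumPy assumed; the training data and parameters below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(X, y_hat, w, b):
    """-ln L(w, b) = sum over n of the cross entropy C(f(x^n), y_hat^n)."""
    f = sigmoid(X @ w + b)  # f_{w,b}(x^n) for every example
    return -np.sum(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))

# x^1, x^2 in class 1 (y_hat = 1), x^3 in class 2 (y_hat = 0)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y_hat = np.array([1.0, 1.0, 0.0])
print(neg_log_likelihood(X, y_hat, w=np.array([0.5, 0.5]), b=0.0))
```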
Question: why don't we simply use square error, as in linear regression? (The loss-surface comparison below answers this.)
Step 3: Find the Best Function

Differentiate the loss with respect to each weight $w_i$:
$$\frac{\partial\big({-\ln L(w,b)}\big)}{\partial w_i} = \sum_n -\left[\hat{y}^n \frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} + \big(1 - \hat{y}^n\big)\frac{\partial \ln\big(1 - f_{w,b}(x^n)\big)}{\partial w_i}\right]$$

Recall $f_{w,b}(x) = \sigma(z) = 1 / \big(1 + \exp(-z)\big)$ with $z = w \cdot x + b = \sum_i w_i x_i + b$, so $\partial z / \partial w_i = x_i$.

First term (chain rule):
$$\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\,\sigma(z)\big(1 - \sigma(z)\big) = 1 - \sigma(z)$$
$$\Rightarrow \frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \big(1 - f_{w,b}(x^n)\big)\, x_i^n$$

Second term:
$$\frac{\partial \ln\big(1 - \sigma(z)\big)}{\partial z} = -\frac{1}{1 - \sigma(z)}\frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)}\,\sigma(z)\big(1 - \sigma(z)\big) = -\sigma(z)$$
$$\Rightarrow \frac{\partial \ln\big(1 - f_{w,b}(x^n)\big)}{\partial w_i} = -f_{w,b}(x^n)\, x_i^n$$

Combining the two terms:
$$\frac{\partial\big({-\ln L(w,b)}\big)}{\partial w_i} = \sum_n -\big(\hat{y}^n - f_{w,b}(x^n)\big)\, x_i^n$$

Gradient descent update:
$$w_i \leftarrow w_i - \eta \sum_n -\big(\hat{y}^n - f_{w,b}(x^n)\big)\, x_i^n$$

The larger the difference between the target $\hat{y}^n$ and the output $f_{w,b}(x^n)$, the larger the update.
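A minimal sketch of this update rule in Python (NumPy assumed; the learning rate, step count, and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y_hat, eta=0.1, steps=1000):
    """Gradient descent on -ln L(w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        f = sigmoid(X @ w + b)   # f_{w,b}(x^n) for every example
        err = y_hat - f          # larger difference -> larger update
        w -= eta * -(X.T @ err)  # w_i <- w_i - eta * sum_n -(y_hat^n - f^n) x_i^n
        b -= eta * -np.sum(err)  # the bias follows the same rule with x_i^n = 1
    return w, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_hat = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic(X, y_hat)
print(sigmoid(X @ w + b))  # close to the targets 1, 1, 0, 0
```

Note that this is exactly the same update rule as in linear regression; only the definition of $f_{w,b}$ differs.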
Logistic Regression v.s. Linear Regression

Step 1: Logistic regression: $f_{w,b}(x) = \sigma\big(\sum_i w_i x_i + b\big)$. Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$.
Step 2: Logistic regression minimizes the total cross entropy; linear regression minimizes the total square error.

[Figure: total loss plotted against two parameters $w_1$ and $w_2$, comparing the two losses for logistic regression. The cross entropy surface is steep when the model is far from the target; the square error surface is nearly flat there, so gradient descent stalls.]
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
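This is also the answer to the square-error question above. A small numeric check (a sketch; the target and logit below are illustrative): when the model is confidently wrong, the square error gradient vanishes while the cross entropy gradient stays large.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y_hat, z = 1.0, -10.0   # target is class 1, but the model is confidently wrong
f = sigmoid(z)          # ~ 4.5e-5

grad_ce = f - y_hat                      # d(cross entropy)/dz
grad_se = 2 * (f - y_hat) * f * (1 - f)  # d(square error)/dz has the extra factor sigma'(z) = f(1-f)

print(grad_ce)  # ~ -1.0   -> a large, useful update
print(grad_se)  # ~ -9e-5  -> almost no update, training stalls
```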
Discriminative v.s. Generative

Both approaches use the same model $P(C_1 \mid x) = \sigma(w \cdot x + b)$: the discriminative approach finds $w$ and $b$ directly by minimizing cross entropy, while the generative approach estimates class priors and class-conditional probabilities and derives $w$ and $b$ from them. On the same data they can give different answers.

Toy example. Each example has two binary features. Training data (13 examples): class 1: $(1, 1)$, one example; class 2: $(1, 0) \times 4$, $(0, 1) \times 4$, $(0, 0) \times 4$.

A naive Bayes generative model (features assumed independent given the class) estimates:
$$P(C_1) = \frac{1}{13}, \qquad P(x_1 = 1 \mid C_1) = 1, \qquad P(x_2 = 1 \mid C_1) = 1$$
$$P(C_2) = \frac{12}{13}, \qquad P(x_1 = 1 \mid C_2) = \frac{1}{3}, \qquad P(x_2 = 1 \mid C_2) = \frac{1}{3}$$

For the test point $x = (1, 1)$:
$$P(C_1 \mid x) = \frac{P(x \mid C_1)\, P(C_1)}{P(x \mid C_1)\, P(C_1) + P(x \mid C_2)\, P(C_2)} = \frac{1 \cdot \frac{1}{13}}{1 \cdot \frac{1}{13} + \frac{1}{9} \cdot \frac{12}{13}} = \frac{3}{7} < 0.5$$

so the generative model assigns $(1, 1)$ to class 2, even though the only identical training example belongs to class 1.
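A minimal sketch of this generative computation in Python (the counts mirror the toy data above):

```python
from fractions import Fraction as F

# Class 1: one example (1,1).  Class 2: (1,0) x4, (0,1) x4, (0,0) x4.
p_c1, p_c2 = F(1, 13), F(12, 13)
p_x1_c1, p_x2_c1 = F(1), F(1)        # P(x1=1|C1), P(x2=1|C1)
p_x1_c2, p_x2_c2 = F(1, 3), F(1, 3)  # P(x1=1|C2), P(x2=1|C2)

# Naive Bayes: P(x|C) = P(x1|C) * P(x2|C); test point x = (1, 1)
px_c1 = p_x1_c1 * p_x2_c1
px_c2 = p_x1_c2 * p_x2_c2
posterior = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)
print(posterior, float(posterior))  # 3/7 ~ 0.43 < 0.5 -> class 2
```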
Multi-class Classification (3 classes as example)

$$C_1: w^1, b^1 \qquad z_1 = w^1 \cdot x + b^1$$
$$C_2: w^2, b^2 \qquad z_2 = w^2 \cdot x + b^2$$
$$C_3: w^3, b^3 \qquad z_3 = w^3 \cdot x + b^3$$

Softmax:
$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$$

The outputs satisfy $0 < y_i < 1$ and $\sum_i y_i = 1$, so they can serve as probabilities: $y_i = P(C_i \mid x)$.

Example: $z_1 = 3$, $z_2 = 1$, $z_3 = -3$ give $e^{z_1} \approx 20$, $e^{z_2} \approx 2.7$, $e^{z_3} \approx 0.05$, hence $y_1 \approx 0.88$, $y_2 \approx 0.12$, $y_3 \approx 0$.
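A minimal sketch of softmax in Python, reproducing the numbers above (NumPy assumed):

```python
import numpy as np

def softmax(z):
    """y_i = e^{z_i} / sum_j e^{z_j}.  Subtracting max(z) avoids overflow
    and does not change the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = softmax(np.array([3.0, 1.0, -3.0]))
print(y.round(2))  # [0.88 0.12 0.  ]
print(y.sum())     # 1.0 -- a valid probability distribution
```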
Multi-class Classification (3 classes as example)

$$z_1 = w^1 \cdot x + b^1, \qquad z_2 = w^2 \cdot x + b^2, \qquad z_3 = w^3 \cdot x + b^3$$

Softmax turns $(z_1, z_2, z_3)$ into the output $y = (y_1, y_2, y_3)$, which is compared with the target $\hat{y}$ by cross entropy:
$$-\sum_{i=1}^{3} \hat{y}_i \ln y_i$$

The target is one-hot:
If $x \in$ class 1: $\hat{y} = (1, 0, 0)^\top$. If $x \in$ class 2: $\hat{y} = (0, 1, 0)^\top$. If $x \in$ class 3: $\hat{y} = (0, 0, 1)^\top$.
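A minimal sketch of this loss in Python (NumPy assumed; the logits and target are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    """- sum_i y_hat_i * ln(y_i) with a one-hot target y_hat."""
    return -np.sum(y_hat * np.log(y))

z = np.array([3.0, 1.0, -3.0])     # z_i = w^i . x + b^i for some x
y_hat = np.array([1.0, 0.0, 0.0])  # x belongs to class 1
print(cross_entropy(softmax(z), y_hat))  # ~ 0.13: the model already favors class 1
```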
Limitation of Logistic Regression

$$z = w_1 x_1 + w_2 x_2 + b, \qquad y = \sigma(z)$$
Predict class 1 when $y \geq 0.5$ (i.e. $z \geq 0$) and class 2 when $y < 0.5$ ($z < 0$).

Can we fit the following data?

x1   x2   Label
0    0    Class 2
0    1    Class 1
1    0    Class 1
1    1    Class 2

[Figure: the four points in the $x_1$-$x_2$ plane; the regions $y \geq 0.5$ and $y < 0.5$ are always separated by a straight line.]
Limitation of Logistic Regression
• No, we can't. The boundary between $y \geq 0.5$ and $y < 0.5$ is the straight line $w_1 x_1 + w_2 x_2 + b = 0$, and no straight line puts $(0, 1)$ and $(1, 0)$ on one side and $(0, 0)$ and $(1, 1)$ on the other.
Limitation of Logistic Regression
• Feature transformation: for example, let $x_1'$ be the distance to $\binom{0}{0}$ and $x_2'$ the distance to $\binom{1}{1}$, transforming $(x_1, x_2)$ into $(x_1', x_2')$:
$$(0, 0) \to (0, \sqrt{2}), \quad (0, 1) \to (1, 1), \quad (1, 0) \to (1, 1), \quad (1, 1) \to (\sqrt{2}, 0)$$
In the transformed space the two classes are separable by a straight line. However, it is not always easy to find a good transformation by hand.
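A minimal sketch of this transformation in Python (NumPy assumed):

```python
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = ["Class 2", "Class 1", "Class 1", "Class 2"]

x1p = np.linalg.norm(points - np.array([0.0, 0.0]), axis=1)  # distance to (0,0)
x2p = np.linalg.norm(points - np.array([1.0, 1.0]), axis=1)  # distance to (1,1)

for (x1, x2), a, c, lab in zip(points, x1p, x2p, labels):
    print(f"({x1:.0f},{x2:.0f}) -> ({a:.2f},{c:.2f})  {lab}")
# Both class 1 points map to (1, 1); the class 2 points map to (0, 1.41) and
# (1.41, 0), so a single line such as x1' + x2' = 1.7 now separates the classes.
```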
Limitation of Logistic Regression
• Cascading logistic regression models: let one layer of logistic regression units learn the feature transformation, and let a final logistic regression unit do the classification.

[Figure: two logistic units map $(x_1, x_2)$ to new features $(x_1', x_2')$; at the four input points $x_1'$ takes values such as 0.73, 0.27, 0.27, and 0.05, so the transformed points become linearly separable, and a final unit with weights $w_1, w_2$ computes $z$ and the output $y$.]
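A minimal sketch of such a cascade in Python, with hand-picked (not learned) weights that happen to solve the data above (NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade(x1, x2):
    """Two logistic units transform the features; a third one classifies."""
    x1p = sigmoid(20 * x1 + 20 * x2 - 30)      # x1' ~ AND(x1, x2)
    x2p = sigmoid(20 * x1 + 20 * x2 - 10)      # x2' ~ OR(x1, x2)
    return sigmoid(-20 * x1p + 20 * x2p - 10)  # class 1 iff OR and not AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = cascade(x1, x2)
    print((x1, x2), "Class 1" if y >= 0.5 else "Class 2")
# (0,0) Class 2, (0,1) Class 1, (1,0) Class 1, (1,1) Class 2
```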
Deep Learning!

Each logistic regression unit in such a cascade is called a "neuron", and a network of cascaded neurons is a neural network.
Reference
• Bishop, Pattern Recognition and Machine Learning, Chapter 4.3