[Figure: a linear score function $f = Wx$ compared with a two-layer network $f = W_2\,h(W_1 x)$ on CIFAR-10; the prediction feeds into a loss function (Softmax, Hinge). Credit: Li/Karpathy/Johnson]
Loss functions
• Softmax loss function: evaluates the score of the ground-truth class for the image, normalized over all class scores:
$$L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$
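A minimal NumPy sketch of this loss; `scores` and `y` are hypothetical names for the raw class scores and the ground-truth class index:

```python
import numpy as np

def softmax_loss(scores, y):
    # Shift the scores for numerical stability (does not change the result)
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # Negative log-probability of the ground-truth class
    return -np.log(probs[y])

scores = np.array([3.2, 5.1, -1.7])  # raw scores for 3 classes
print(softmax_loss(scores, y=0))     # loss when class 0 is the ground truth
```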
Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Saturated neurons kill the gradient flow.
• Outputs are not zero-centered (more on zero-mean data later).
tanh
• Zero-centered, but still saturates.

ReLU: $\text{ReLU}(x) = \max(0, x)$
• Large and consistent gradients (no saturation for $x > 0$).
• What happens if a ReLU outputs zero? The gradient through it is zero as well, so the neuron stops updating (a "dying ReLU").
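A small NumPy sketch (function names are mine) comparing the gradients of the three activations, illustrating saturation and the dying-ReLU effect:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))  # ~0 at |x| = 10: saturated neurons kill the gradient flow
print(tanh_grad(x))     # zero-centered activation, but the gradient still saturates
print(relu_grad(x))     # constant 1 for x > 0, exactly 0 otherwise (dying ReLU)
```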
Gradient for the $k$-th batch ($k$ now refers to the $k$-th iteration; $m$ is the number of training samples in the current batch):
$$\nabla_\theta L = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i$$
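A minimal sketch of this averaged batch gradient, using a toy per-sample squared loss as an assumed example:

```python
import numpy as np

# Per-sample loss L_i = 0.5 * (theta @ x_i - y_i)^2, so its gradient is (theta @ x_i - y_i) * x_i
def grad_Li(theta, x_i, y_i):
    return (theta @ x_i - y_i) * x_i

def batch_gradient(theta, X, y):
    m = X.shape[0]  # number of training samples in the current batch
    return sum(grad_Li(theta, X[i], y[i]) for i in range(m)) / m

theta = np.zeros(3)
X = np.random.randn(8, 3)   # minibatch of m = 8 samples
y = np.random.randn(8)
print(batch_gradient(theta, X, y))
```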
Gradient descent with momentum uses a (model) velocity $v$, commonly accumulated as
$$v^{k+1} = \beta \cdot v^k + \nabla_\theta L(\theta^k)$$
and updates the parameters with
$$\theta^{k+1} = \theta^k - \alpha \cdot v^{k+1}$$
where $\alpha$ is the learning rate. Hyperparameters are $\alpha$ and $\beta$; $\beta$ is often set to 0.9.
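A sketch of the momentum update above on a toy quadratic loss (NumPy; names and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    # v^{k+1} = beta * v^k + grad;  theta^{k+1} = theta^k - alpha * v^{k+1}
    v = beta * v + grad
    theta = theta - alpha * v
    return theta, v

theta = np.zeros(3)
v = np.zeros_like(theta)
for _ in range(100):
    grad = 2 * theta + 1          # toy gradient of L(theta) = theta^2 + theta
    theta, v = sgd_momentum_step(theta, v, grad)
print(theta)                      # approaches the minimizer at -0.5
```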
RMSProp
$$s^{k+1} = \beta \cdot s^k + (1 - \beta)\,[\nabla_\theta L \circ \nabla_\theta L]$$
where $\circ$ denotes element-wise multiplication, and
$$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\nabla_\theta L}{\sqrt{s^{k+1}} + \epsilon}$$
Hyperparameters: $\alpha$ (needs tuning!), $\beta$ (often 0.9), and $\epsilon$ (typically $10^{-8}$).
Why does this help? Consider an elongated loss surface with small gradients in the X-direction and large gradients in the Y-direction. $s$ accumulates the (uncentered) variance of the gradients, i.e. the second moment. Since we divide by the root of the accumulated squared gradients:
• the division in the Y-direction will be large, damping the oscillating steps;
• the division in the X-direction will be small, preserving progress there.
As a result, we can increase the learning rate!
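A sketch of the RMSProp update on a toy loss with very different gradient scales per direction (names mine):

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.01, beta=0.9, eps=1e-8):
    # Accumulate the uncentered second moment of the gradients
    s = beta * s + (1 - beta) * grad * grad
    # Per-parameter step: directions with large accumulated gradients get divided down
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
for _ in range(200):
    grad = np.array([0.01, 10.0]) * theta   # small X-gradient, large Y-gradient
    theta, s = rmsprop_step(theta, s, grad)
print(theta)  # both coordinates make progress despite very different gradient scales
```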
Adam combines both ideas, tracking a first moment $m$ (mean of the gradients) and a second moment $v$ (the uncentered variance of the gradients):
$$m^{k+1} = \beta_1 \cdot m^k + (1 - \beta_1)\,\nabla_\theta L(\theta^k)$$
$$v^{k+1} = \beta_2 \cdot v^k + (1 - \beta_2)\,[\nabla_\theta L(\theta^k) \circ \nabla_\theta L(\theta^k)]$$
Since $m$ and $v$ are initialized at zero, both are bias-corrected:
$$\hat{m}^{k+1} = \frac{m^{k+1}}{1 - \beta_1^{k+1}}, \qquad \hat{v}^{k+1} = \frac{v^{k+1}}{1 - \beta_2^{k+1}}$$
Update:
$$\theta^{k+1} = \theta^k - \alpha \cdot \frac{\hat{m}^{k+1}}{\sqrt{\hat{v}^{k+1}} + \epsilon}$$
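A sketch of the Adam update above on a toy loss; the default hyperparameter values follow common practice and are not from the slides:

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moments of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: both moments start at zero, so early estimates are too small
    m_hat = m / (1 - beta1 ** (k + 1))
    v_hat = v / (1 - beta2 ** (k + 1))
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, 1.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for k in range(500):
    grad = 2 * theta                  # toy gradient of L(theta) = ||theta||^2
    theta, m, v = adam_step(theta, m, v, grad, k)
print(theta)                          # moves toward the minimum at 0
```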
Training NNs
[Figure: training vs. validation error curves illustrating overfitting. Extracted from Deep Learning by Adam Gibson and Josh Patterson, O'Reilly Media Inc., 2017. Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html]

Overfitting: the model keeps fitting the training data better while the validation error stops improving and starts to rise.

Dropout [Srivastava 2014]: during the forward pass at training time, randomly set activations to zero.
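A minimal sketch of dropout in the forward pass, using the inverted-dropout variant common in practice (the scaling by 1/p is an implementation choice, not from the slide):

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    if not train:
        return x                      # no dropout at test time
    # Randomly zero activations; scale by 1/p so the expected activation is unchanged
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

h = np.random.randn(4, 8)             # hypothetical hidden activations
print(dropout_forward(h, p=0.5))
```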
How to deal with images?
[Figure: convolving a 32×32×3 image with 5×5 filters yields 28×28 activation maps, one per filter.]

With stride 1 and no padding, the output size is $(32 - 5)/1 + 1 = 28$.
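A one-liner check of this output-size arithmetic (assuming stride 1 and no padding, as above):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Standard convolution output-size formula
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(32, 5))  # -> 28, matching the 32x32 -> 28x28 example
```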
Max pool with 2×2 filters and stride 2:

Input (4×4):
3 1 3 5
6 0 7 9
3 2 1 4
0 2 4 3

‘Pooled’ output (2×2):
6 9
3 4

Each output entry is the maximum over one 2×2 block of the input, e.g. $\max(3, 1, 6, 0) = 6$ for the top-left block.
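The same pooling operation as a NumPy sketch:

```python
import numpy as np

def max_pool_2x2(x):
    # Split a (H, W) array into non-overlapping 2x2 blocks and take each block's max
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[3, 1, 3, 5],
              [6, 0, 7, 9],
              [3, 2, 1, 4],
              [0, 2, 4, 3]])
print(max_pool_2x2(x))  # [[6 9] [3 4]]
```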
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height shrink while the number of filters grows (see the sketch below).
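A sketch of this Conv -> Pool pattern in PyTorch; the channel counts and kernel sizes are illustrative assumptions, not from the lecture:

```python
import torch.nn as nn

# Conv -> Pool -> Conv -> Pool -> Conv -> FC for 32x32x3 inputs (e.g. CIFAR-10)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16, 16 filters
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8, 32 filters
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),             # fully connected classifier
)
```

Note how the spatial size drops (32 -> 16 -> 8) while the filter count grows (16 -> 32 -> 64), exactly the pattern in the bullet above.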
Inception block: apply several filter sizes in parallel (e.g. 1×1, 3×3, 5×5 convolutions and pooling) to the same inputs and concatenate the resulting activation maps. A more expressive model!
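A sketch of a naive Inception-style block in PyTorch; the branch widths are my own illustrative choices:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel branches over the same input, concatenated along the channel axis
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b2 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.b3 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.b4 = nn.MaxPool2d(3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 32, 28, 28)
print(InceptionBlock(32)(x).shape)  # [1, 80, 28, 28]: 16+16+16+32 channels
```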
Basic structure of an RNN
• We want to have a notion of "time" or "sequence".
• At each time step, the hidden state is updated from the previous hidden state and the current input, and an output is produced from it.
• Same parameters for each time step = generalization!
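A minimal vanilla-RNN step in NumPy, assuming the common update $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$ with output $y_t = W_{hy} h_t$ (weight names are mine):

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy):
    # The same parameters W_hh, W_xh, W_hy are reused at every time step
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h
    return h, y

hidden, inp, out = 8, 4, 3
W_hh = np.random.randn(hidden, hidden) * 0.1
W_xh = np.random.randn(hidden, inp) * 0.1
W_hy = np.random.randn(out, hidden) * 0.1

h = np.zeros(hidden)
for x in np.random.randn(5, inp):     # a sequence of 5 inputs
    h, y = rnn_step(h, x, W_hh, W_xh, W_hy)
print(h.shape, y.shape)               # hidden state and output at the last step
```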
Long Short-Term Memory Units
• LSTM
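As a reference, a sketch of the standard LSTM cell update in NumPy; the single stacked weight matrix and the gate ordering are implementation choices of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # One stacked weight matrix producing all four gates: input, forget, output, candidate
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c = f * c_prev + i * g            # cell state: gated long-term memory
    h = o * np.tanh(c)                # hidden state / output
    return h, c

H, D = 8, 4
W = np.random.randn(4 * H, H + D) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(np.random.randn(D), h, c, W, b)
print(h.shape, c.shape)
```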