All-in-One Course
Model Generalization
Quang-Vinh Dinh
Ph.D. in Computer Science
Year 2023
[Figure: CIFAR-10 sample images with class labels, including airplane, automobile, bird, ship, truck]
Model Generalization

Dataset: CIFAR-10. Baseline network (feature-map shapes):
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → Flatten → Dense Layer-512 + ReLU → Dense Layer-10 + Softmax
Each stage: (3x3) Convolution, padding='same', stride=1, + ReLU, followed by (2x2) max pooling.

Data normalization (convert to 0-mean and 1-deviation):
$$\bar{X} = \frac{X - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_i X_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_i (X_i - \mu)^2}$$

Training setup: Adam, lr=1e-3; He initialization.

[Figure: training vs. testing accuracy over epochs; the aim of the following tricks is to reduce the gap between the two curves]
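The per-channel normalization above can be done with torchvision; a minimal sketch is below. The mean/std values are the commonly quoted CIFAR-10 statistics, assumed here rather than taken from the slides.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# 0-mean / 1-std normalization per channel (assumed CIFAR-10 statistics)
normalize = transforms.Compose([
    transforms.ToTensor(),                                   # scales pixels to [0, 1]
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),      # per-channel mean
                         std=(0.2470, 0.2435, 0.2616)),      # per-channel std
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=normalize)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```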
❖ Trick 1: ‘Learn hard’ – randomly add noise to training data
Example: speed-limit sign detection
Adding noise to the training images in PyTorch:
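A minimal sketch of one way to inject noise during training; Gaussian noise with std 0.05 is an assumption here, not a value from the slides.

```python
import torch

def add_gaussian_noise(x, std=0.05):
    """Return a noisy copy of the input batch (assumed Gaussian noise and std)."""
    return x + torch.randn_like(x) * std

# In the training loop, perturb only the training inputs ('learn hard'):
# for images, labels in train_loader:
#     outputs = model(add_gaussian_noise(images))
```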
Result: val_accuracy increases from ~78% to ~80%.
Network Training
❖ Trick 2: Batch normalization
Input data for one node in a batch normalization layer:
$$X = \{X_1, \ldots, X_m\}, \qquad m \text{ is the mini-batch size}$$

Compute the mean and variance:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (X_i - \mu)^2$$

Normalize $X_i$ ($\epsilon$ is a very small value):
$$\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Scale and shift $\hat{X}_i$ ($\gamma$ and $\beta$ are two learnable parameters):
$$Y_i = \gamma \hat{X}_i + \beta$$

Notes: the preceding layer does not need a bias term when BN is used; $\mu$ and $\sigma$ are computed in the forward pass, while $\gamma$ and $\beta$ are updated in the backward pass.
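As a rough PyTorch sketch (an assumption of how such a block could look, not the slide's code): batch normalization after a convolution, with the convolution's bias omitted since BN's beta makes it redundant.

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU block; bias=False because BN's beta plays that role.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding="same", stride=1, bias=False),
    nn.BatchNorm2d(32),   # learnable gamma (weight) and beta (bias); eps defaults to 1e-5
    nn.ReLU(),
)
```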
What if the learned parameters turn out to be $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$? Then $Y_i = \gamma \hat{X}_i + \beta = X_i$: the layer recovers the original input, so batch normalization can always fall back to the identity mapping if that is what training prefers.
❖ Trick 2: Batch normalization – worked example

Settings: $\epsilon = 10^{-5}$, $\gamma_U = \gamma_V = 1.0$, $\beta_U = \beta_V = 0.0$.

Node U: $X_U = \{1, 3, 9, 4, 6, 7\}$, so $\mu_U = 5.0$ and $\sigma_U = 2.64$ ($\sigma_U^2 = 7.0$), giving
$$Y_U = \{-1.51, -0.75, 1.51, -0.37, 0.37, 0.75\}$$

Node V: $X_V = \{6, 4, 7, 5, 6, 2\}$, so $\mu_V = 5.0$ and $\sigma_V = 1.63$ ($\sigma_V^2 = 2.67$), giving
$$Y_V = \{0.61, -0.61, 1.22, 0.0, 0.61, -1.83\}$$

With $\gamma = 1$ and $\beta = 0$ the outputs are simply the normalized values $\hat{X}_i = (X_i - \mu)/\sqrt{\sigma^2 + \epsilon}$; $\gamma$ and $\beta$ are then updated during training.
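These numbers can be reproduced with a few lines of torch code; a small check, assuming gamma = 1 and beta = 0 as above:

```python
import torch

eps = 1e-5
for name, values in {"U": [1., 3., 9., 4., 6., 7.],
                     "V": [6., 4., 7., 5., 6., 2.]}.items():
    x = torch.tensor(values)
    mu = x.mean()                              # mini-batch mean
    var = x.var(unbiased=False)                # biased variance, as in BN
    y = (x - mu) / torch.sqrt(var + eps)       # gamma = 1, beta = 0
    print(name, round(mu.item(), 2), round(var.sqrt().item(), 2),
          [round(v, 2) for v in y.tolist()])

# Matches the slide's numbers up to rounding:
# U 5.0 2.65 [-1.51, -0.76, 1.51, -0.38, 0.38, 0.76]
# V 5.0 1.63 [0.61, -0.61, 1.22, 0.0, 0.61, -1.84]
```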
❖ Trick 2: Batch normalization for 2D data
[Figure: a Batch-Norm layer applied to a mini-batch of two samples, batch-size = 2, sample_shape = (2, 2, 2); the statistics are computed per channel over the batch and spatial dimensions]
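A minimal PyTorch sketch of the same setup, assuming the slide's shapes (a batch of 2 samples, each of shape (2, 2, 2) = (channels, H, W)):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 2, 2, 2)          # (batch, channels, H, W): 2 samples of shape (2, 2, 2)
bn = nn.BatchNorm2d(num_features=2)  # one (gamma, beta) pair per channel
y = bn(x)
print(y.shape)                       # torch.Size([2, 2, 2, 2])
# Per channel, y has (approximately) zero mean and unit variance over batch + spatial dims:
print(y[:, 0].mean().item(), y[:, 0].var(unbiased=False).item())
```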
❖ Trick 2: Batch normalization – backward pass

[Figure: computation graph $X_i \to \mu, \sigma^2 \to \hat{X}_i \to Y_i$ with parameters $\gamma$ and $\beta$, and the upstream gradient $\partial L / \partial Y_i$ flowing back]

Forward (for one node, mini-batch size $m$):
$$\mu = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (X_i - \mu)^2, \qquad \hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad Y_i = \gamma \hat{X}_i + \beta$$

Backward, given the upstream gradients $\partial L / \partial Y_i$:
$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}\,\hat{X}_i, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}, \qquad \frac{\partial L}{\partial \hat{X}_i} = \frac{\partial L}{\partial Y_i}\,\gamma$$

$$\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,(X_i - \mu)\left(-\frac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2}$$

$$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,\frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2}\,\frac{1}{m}\sum_{i=1}^{m} \bigl(-2(X_i - \mu)\bigr)$$

$$\frac{\partial L}{\partial X_i} = \frac{\partial L}{\partial \hat{X}_i}\,\frac{\partial \hat{X}_i}{\partial X_i} + \frac{\partial L}{\partial \mu}\,\frac{\partial \mu}{\partial X_i} + \frac{\partial L}{\partial \sigma^2}\,\frac{\partial \sigma^2}{\partial X_i}$$

where
$$\frac{\partial \hat{X}_i}{\partial X_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}, \qquad \frac{\partial \sigma^2}{\partial X_i} = \frac{2(X_i - \mu)}{m}, \qquad \frac{\partial \mu}{\partial X_i} = \frac{1}{m}$$
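A small numerical check of these backward formulas against PyTorch autograd; the scalar loss and mini-batch size are assumptions made only for the test.

```python
import torch

m, eps = 8, 1e-5
x = torch.randn(m, requires_grad=True)
gamma = torch.tensor(1.5, requires_grad=True)
beta = torch.tensor(0.3, requires_grad=True)

# Forward pass of one BN node
mu = x.mean()
var = ((x - mu) ** 2).mean()
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta
loss = (y ** 2).sum()                 # any scalar loss works for the check
loss.backward()

# Manual gradients from the formulas above
dy = 2 * y.detach()                   # dL/dY_i for this particular loss
x_d, mu_d, var_d = x.detach(), mu.detach(), var.detach()
dgamma = (dy * x_hat.detach()).sum()
dbeta = dy.sum()
dxhat = dy * gamma.detach()
dvar = (dxhat * (x_d - mu_d) * -0.5 * (var_d + eps) ** -1.5).sum()
dmu = (dxhat * -1 / torch.sqrt(var_d + eps)).sum() + dvar * (-2 * (x_d - mu_d)).mean()
dx = dxhat / torch.sqrt(var_d + eps) + dmu / m + dvar * 2 * (x_d - mu_d) / m

print(torch.allclose(dx, x.grad, atol=1e-4),
      torch.allclose(dgamma, gamma.grad, atol=1e-4),
      torch.allclose(dbeta, beta.grad, atol=1e-4))   # expected: True True True
```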
Model Generalization
❖ Trick 2: Batch normalization
Different mini-batches very likely have different statistics: $(\mu_1, \sigma_1) \neq (\mu_2, \sigma_2)$ for mini-batch 1 and mini-batch 2. Normalizing with per-batch statistics therefore adds noise to the output of the BN layers, which is one reason BN helps generalization.
The same network with batch normalization added to the convolution blocks:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → Flatten → Dense Layer-512 + ReLU → output
Each block: (3x3) Convolution, padding='same', stride=1, + ReLU, followed by (2x2) max pooling.

If the activation at a layer $j$ is masked element-wise by a binary mask $D$:
$$a = D \odot \sigma(Z), \qquad \frac{\partial L}{\partial \sigma(Z)} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial \sigma(Z)} = \frac{\partial L}{\partial a} \odot D$$
Overfitting
Dropout
Given a dropping rate $r$, the kept activations are scaled by
$$scale = \frac{1}{1 - r}$$
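PyTorch's nn.Dropout implements exactly this "inverted dropout" scaling; a minimal sketch:

```python
import torch
import torch.nn as nn

# During training, elements are zeroed with probability r and survivors are
# scaled by 1 / (1 - r); during evaluation the layer is the identity.
r = 0.3
drop = nn.Dropout(p=r)

x = torch.ones(8)
drop.train()
print(drop(x))   # surviving entries equal 1 / (1 - 0.3) ≈ 1.43, dropped entries are 0
drop.eval()
print(drop(x))   # all ones: dropout is disabled at inference time
```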
Model Generalization
❖ Trick 3: Dropout
Two dropout layers (rate 30%) are added to the network:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output

Result: with dropout, val_accuracy increases from ~84% to ~86.6%.
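A hedged sketch of where the two 30% dropout layers could sit; the exact placement in the slide's model is an assumption, here in the fully connected head after flattening the (256, 2, 2) feature map.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.3),            # dropout 30%
    nn.Linear(256 * 2 * 2, 512),  # Dense Layer-512
    nn.ReLU(),
    nn.Dropout(p=0.3),            # dropout 30%
    nn.Linear(512, 10),           # Dense Layer-10
)
```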
❖ Trick 4: Kernel regularization
$L_2$ regularization adds a penalty $\lambda \lVert W \rVert^2$ to the loss: smaller weights → simpler models.
In PyTorch:
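A minimal sketch of kernel ($L_2$) regularization via the optimizer's weight_decay argument; the strength 1e-4 and the stand-in model are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model; the slide's CNN would go here
# weight_decay applies an L2 penalty on all parameters passed to the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```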
Applying dropout (30%) and kernel regularization together on the same network
(input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output; conv blocks: (3x3) Convolution, padding='same', stride=1, + ReLU):
val_accuracy increases from ~86.6% to ~87.5%.
❖ Trick 5: Data augmentation
A normal case: the training data covers only part of the data distribution, and the testing data may lie outside it.
A perfect case: unlimited training data that covers the whole distribution. But this is impractical!
[Figure: data distribution with training and testing data for both cases]
Increase the amount of data by altering the training data: horizontal flip, rotate, crop and resize.
[Figure: data distribution, where the augmented training data covers more of the distribution than the original training data]
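A sketch of these augmentations with torchvision transforms; the parameter values are assumptions, not taken from the slides.

```python
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                   # horizontal flip
    transforms.RandomRotation(degrees=15),                    # rotate
    transforms.RandomResizedCrop(size=32, scale=(0.8, 1.0)),  # crop and resize
    transforms.ToTensor(),
])
```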
❖ What we have
Batch normalization
Dropout
Kernel regularization
Data augmentation: horizontal flip + crop-and-resize
Optimization
❖ Learning rate
Model Generalization
❖ Trick 6: Reduce learning rate
$$\eta = \eta_0 \times \gamma^{\,\text{epoch}}$$
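This schedule corresponds to PyTorch's ExponentialLR; a minimal sketch, where the decay factor gamma = 0.95 and the stand-in model are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(5):
    # ... one epoch of training ...
    scheduler.step()                       # lr becomes 1e-3 * 0.95**(epoch + 1)
    print(epoch, scheduler.get_last_lr())
```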
Result: val_accuracy reaches ~90.6%, train_accuracy reaches ~98%.
❖ Discussion: Predict training and test accuracy when using more data augmentation
Augmentations considered: horizontal flip, rotate, crop and resize, add noise.
❖ Trick 7: Increase model capacity (and use more data augmentation)
input (3,32,32) → (64,16,16) → (128,8,8) → (256,4,4) → (512,2,2) → output
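One way to realize the wider network, as a sketch: every convolution's channel width is doubled to match the feature-map shapes above. The block structure is assumed from the earlier slides, not taken from the slide's code.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    """(3x3) conv + BN + ReLU followed by (2x2) max pooling, halving the spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding="same", bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Channel widths doubled from (32, 64, 128, 256) to (64, 128, 256, 512)
backbone = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256), conv_block(256, 512),
)
```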
❖ Trick 8: Using skip-connection
input (3,32,32) → (64,16,16) → (128,8,8) → (256,4,4) → (512,2,2) → output, with a skip-connection (the '+' in the diagram, an element-wise addition) around each block
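A hedged sketch of one block with a skip-connection; the slide does not show the exact block, so the 1x1 projection used to match shapes is an assumption.

```python
import torch.nn as nn

class SkipBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 conv so the shortcut matches the body's output shape
        self.shortcut = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))   # the '+' in the diagram
```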
Result: val_accuracy reaches ~93%, train_accuracy reaches ~97%.
❖ Increase model capacity once more
input (3,32,32) → (128,16,16) → (256,8,8) → (512,4,4) → (1024,2,2) → output, again with a skip-connection around each block
Result: val_accuracy reaches ~94.5%, train_accuracy reaches ~98.3%.
Summary
❖ How to increase validation accuracy
Trick 1: ‘Learn hard’ – randomly add noise to the training data.
Trick 2: Use batch normalization: $\hat{X}_i = \dfrac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Trick 3: Use dropout.
Trick 5: Data augmentation.
Kernel ($L_2$) regularization: $L = CE + \lambda \lVert W \rVert^2$
(*) Use pretrained models.