
AI VIETNAM

All-in-One Course

Model Generalization

Quang-Vinh Dinh
Ph.D. in Computer Science

Year 2023
Network Training

CIFAR-10 dataset (a complex dataset)
Color images, resolution 32x32
10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
Training set: 50,000 samples
Testing set: 10,000 samples
Model Generalization

Baseline network on CIFAR-10:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output
Each stage: (3x3) convolution, padding='same', stride=1, + ReLU, then (2x2) max pooling; the head is Flatten, Dense layer-512 + ReLU, Dense layer-10 + Softmax.
Training setup: Adam, lr=1e-3; He initialization.

Data normalization (convert to 0-mean and 1-deviation):
$\bar{X} = \frac{X - \mu}{\sigma}$, with $\mu = \frac{1}{n}\sum_i X_i$ and $\sigma = \sqrt{\frac{1}{n}\sum_i (X_i - \mu)^2}$

Training vs. testing accuracy over epochs: the network robustly fits the training set, but the testing curve lags behind (overfitting). The aim of the tricks below is to reduce this gap.
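As a minimal sketch of this normalization step in PyTorch (the per-channel mean/std constants below are the commonly quoted CIFAR-10 training-set statistics, not values taken from the slides):

```python
import torchvision.transforms as transforms

# Commonly quoted per-channel statistics of the CIFAR-10 training set
# (assumed values; recompute them on your own copy of the data for exactness).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

transform = transforms.Compose([
    transforms.ToTensor(),  # uint8 HWC in [0,255] -> float CHW in [0,1]
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),  # X_bar = (X - mu) / sigma
])
```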
Model Generalization
❖ Trick 1: 'Learn hard' – randomly add noise to training data

Motivating example: speed-limit sign detection.
Model Generalization
❖ Trick 1: 'Learn hard' – randomly add noise to training data

Add noise to the training images (in PyTorch). With this trick, val_accuracy increases from ~78% to ~80%.
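The slide's PyTorch snippet does not survive extraction; a minimal sketch of the idea might look like this (the Gaussian form and the noise level std=0.1 are illustrative assumptions, not values from the slides):

```python
import torch

def add_noise(images, std=0.1):
    """'Learn hard': perturb a batch of images with Gaussian noise.

    std=0.1 is an illustrative noise level; tune it on your data.
    """
    return images + torch.randn_like(images) * std

# Sketch of use inside the training loop (training data only):
# for images, labels in train_loader:
#     outputs = model(add_noise(images))
#     ...
```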
Network Training
❖ Trick 2: Batch normalization

Input data for a node in a batch-normalization layer:
$X = \{X_1, \dots, X_m\}$, where $m$ is the mini-batch size.

Compute the mean and variance:
$\mu = \frac{1}{m}\sum_{i=1}^{m} X_i$,  $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(X_i - \mu)^2$

Normalize $X_i$:
$\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\epsilon$ is a very small value.

Scale and shift $\hat{X}_i$:
$Y_i = \gamma\hat{X}_i + \beta$, where $\gamma$ and $\beta$ are two learnable parameters.

Notes:
The preceding layer does not need a bias when using BN (the shift $\beta$ absorbs it).
$\mu$ and $\sigma$ are computed in the forward pass; $\gamma$ and $\beta$ are updated in the backward pass.
What if $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$? Then $Y_i = X_i$: the scale-and-shift step lets the network undo the normalization whenever the identity is what the data calls for.
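A small sketch of these four equations in code, cross-checked against PyTorch's built-in layer (torch.nn.BatchNorm1d uses exactly this biased batch variance in training mode; shapes are illustrative):

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch X with shape (m, num_features)."""
    mu = X.mean(dim=0)                  # mu = (1/m) * sum_i X_i
    var = X.var(dim=0, unbiased=False)  # sigma^2 = (1/m) * sum_i (X_i - mu)^2
    X_hat = (X - mu) / torch.sqrt(var + eps)
    return gamma * X_hat + beta         # Y_i = gamma * X_hat_i + beta

# Cross-check against PyTorch's layer (training mode; gamma=1, beta=0 by default)
X = torch.randn(8, 4)
Y = batch_norm(X, torch.ones(4), torch.zeros(4))
bn = torch.nn.BatchNorm1d(4, eps=1e-5)
torch.testing.assert_close(Y, bn(X), atol=1e-5, rtol=1e-4)
```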
Network Training
❖ Trick 2: Batch normalization – worked example

Two nodes $U$ and $V$, mini-batch size $m = 6$, $\epsilon = 10^{-5}$, $\gamma_U = \gamma_V = 1.0$, $\beta_U = \beta_V = 0.0$.

$X_U = [1, 3, 9, 4, 6, 7]$: $\mu_U = 5.0$, $\sigma_U^2 = 7.0$ ($\sigma_U \approx 2.646$), so
$\hat{U} = Y_U = [-1.51, -0.75, 1.51, -0.37, 0.37, 0.75]$

$X_V = [6, 4, 7, 5, 6, 2]$: $\mu_V = 5.0$, $\sigma_V^2 \approx 2.67$ ($\sigma_V \approx 1.633$), so
$\hat{V} = Y_V = [0.61, -0.61, 1.22, 0.0, 0.61, -1.83]$

$\gamma$ and $\beta$ are updated in the training process.
Network Training
❖ Trick 2: Batch normalization for 2D data

For feature maps of shape (B, C, H, W), the BN layer computes C means and C variances, each over the H*W*B values of one channel.
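A quick sketch to verify this per-channel behaviour with torch.nn.BatchNorm2d (the tensor shape is an illustrative choice):

```python
import torch
import torch.nn as nn

x = torch.randn(16, 3, 8, 8)                # (B, C, H, W)
mu = x.mean(dim=(0, 2, 3))                  # C means, each over B*H*W values
var = x.var(dim=(0, 2, 3), unbiased=False)  # C variances, likewise

y = nn.BatchNorm2d(3)(x)  # training mode: uses exactly these batch statistics
print(y.mean(dim=(0, 2, 3)))                 # ~0 for every channel
print(y.var(dim=(0, 2, 3), unbiased=False))  # ~1 for every channel
```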
Network Training
❖ Trick 2: Batch normalization for 2D data – worked example

batch-size = 2, sample_shape = (C, H, W) = (2, 2, 2), $\epsilon = 10^{-5}$. The per-channel batch statistics are $\mu = [2.0, 3.0]$ and $\sigma^2 = [4.0, 3.7]$; with $\gamma = 1.0$ and $\beta = 0.0$, every element is normalized with its own channel's statistics, $\hat{X} = (X - \mu_c)/\sqrt{\sigma_c^2 + \epsilon}$, and the output equals $\hat{X}$. (Figure: the input tensors $X$ for sample 1 and sample 2 and the resulting normalized tensors $\hat{X}$, element by element.)
Network Training
❖ Trick 2: Batch normalization – backward pass

The forward pass builds the computation graph $X_i \to (\mu, \sigma^2) \to \hat{X}_i \to Y_i$, with $\gamma$ and $\beta$ entering at the scale-and-shift step; the gradient arriving from the next layer is $\frac{\partial L}{\partial Y_i}$.

Forward equations: $\mu = \frac{1}{m}\sum_{i=1}^{m} X_i$,  $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(X_i - \mu)^2$,  $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$,  $Y_i = \gamma\hat{X}_i + \beta$  ($\epsilon$ is a very small value; $\gamma$ and $\beta$ are two learnable parameters).
Backpropagating $\frac{\partial L}{\partial Y_i}$ through this graph:

$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}\hat{X}_i$,  $\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}$,  $\frac{\partial L}{\partial \hat{X}_i} = \frac{\partial L}{\partial Y_i}\gamma$

$\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,(X_i - \mu)\left(-\frac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2}$

$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,\frac{-1}{\sqrt{\sigma^2 + \epsilon}} - \frac{\partial L}{\partial \sigma^2}\,\frac{1}{m}\sum_{i=1}^{m} 2(X_i - \mu)$

$\frac{\partial L}{\partial X_i} = \frac{\partial L}{\partial \hat{X}_i}\frac{\partial \hat{X}_i}{\partial X_i} + \frac{\partial L}{\partial \mu}\frac{\partial \mu}{\partial X_i} + \frac{\partial L}{\partial \sigma^2}\frac{\partial \sigma^2}{\partial X_i}$,
with $\frac{\partial \hat{X}_i}{\partial X_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}$,  $\frac{\partial \sigma^2}{\partial X_i} = \frac{2(X_i - \mu)}{m}$,  $\frac{\partial \mu}{\partial X_i} = \frac{1}{m}$.
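As a sanity check, here is a sketch that implements exactly these formulas and compares the result with autograd; the toy loss $L = \sum_i Y_i R_i$ and the tensor shapes are assumptions made only for the test:

```python
import torch

EPS = 1e-5

def bn_backward_input(dY, X, gamma):
    """dL/dX_i assembled from the formulas above."""
    m = X.shape[0]
    mu = X.mean(dim=0)
    var = ((X - mu) ** 2).mean(dim=0)
    std = torch.sqrt(var + EPS)
    dX_hat = dY * gamma                                   # dL/dX_hat_i
    dvar = (dX_hat * (X - mu) * -0.5 * std ** -3).sum(0)  # dL/dsigma^2
    dmu = (-dX_hat / std).sum(0) - dvar * (2.0 / m) * (X - mu).sum(0)  # dL/dmu
    return dX_hat / std + dvar * 2.0 * (X - mu) / m + dmu / m

# Verify against autograd with the toy loss L = (Y * R).sum(), so dL/dY = R
X = torch.randn(8, 4, dtype=torch.double, requires_grad=True)
gamma = torch.ones(4, dtype=torch.double)
R = torch.randn(8, 4, dtype=torch.double)
mu, var = X.mean(dim=0), ((X - X.mean(dim=0)) ** 2).mean(dim=0)
Y = gamma * (X - mu) / torch.sqrt(var + EPS)
(Y * R).sum().backward()
torch.testing.assert_close(bn_backward_input(R, X.detach(), gamma), X.grad)
```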
Model Generalization
❖ Trick 2: Batch normalization (why it also regularizes)

Two different mini-batches very likely have different statistics: $(\mu_1, \sigma_1) \neq (\mu_2, \sigma_2)$. Since every mini-batch is normalized with its own $\mu$ and $\sigma^2$, batch normalization effectively adds noise to the output of the BN layers – the same 'learn hard' effect as Trick 1.
Model Generalization
❖ Trick 2: Batch normalization

Add batch normalization to each stage of the network:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output
Stages: (3x3) convolution, padding='same', stride=1, + ReLU; batch normalization; (2x2) max pooling. Head: Flatten, Dense layer-512 + ReLU, Dense layer-10 + Softmax.

With batch normalization, val_accuracy increases from ~80.9% to ~84%.
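One stage of this network might be written as below; treat it as a sketch, since the slide does not pin down the ordering of BN relative to ReLU and pooling (conv → BN → ReLU → pool is one common choice). The bias=False follows the earlier note that convolutions before BN need no bias:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # bias=False: with BN, the conv bias is redundant (beta absorbs it)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding='same', bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),  # (2x2) max pooling halves H and W
    )

# input (3,32,32) -> (32,16,16) -> (64,8,8) -> (128,4,4) -> (256,2,2)
features = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64),
    conv_block(64, 128), conv_block(128, 256),
)
```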
Model Generalization
❖ Trick 3: Dropout

Apply dropout 50% to layer $i$ (between layer $i$ and layer $j$): ~50% of the nodes in the $i$-th layer, selected at random, are set to zero (a kind of noise adding).
Model Generalization
❖ Trick 3: Dropout

Apply dropout 50% to layer $i$: with dropout mask $D$,
$a = D \odot \sigma(Z)$
and in the backward pass
$\frac{\partial L}{\partial \sigma} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial \sigma} = \frac{\partial L}{\partial a} \odot D$
so the gradient of every dropped node is zeroed as well. As before, ~50% of the randomly selected nodes in the $i$-th layer are set to zero.
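A from-scratch sketch of this forward/backward pair; sigmoid stands in for the activation $\sigma$ (the slide only writes $\sigma(Z)$), and the 1/(1−r) rescaling from the next slide is deliberately left out to mirror the formula above:

```python
import torch

def dropout_forward(z, r=0.5, training=True):
    """a = D ⊙ σ(Z); sigmoid is an illustrative choice for σ."""
    sigma = torch.sigmoid(z)
    if not training:
        return sigma, None
    D = (torch.rand_like(sigma) >= r).float()  # keep each node with prob. 1-r
    return D * sigma, D

def dropout_backward(dL_da, D):
    """∂L/∂σ = ∂L/∂a ⊙ D: dropped nodes also get zero gradient."""
    return dL_da * D
```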
Overfitting
❖ Dropout

Given a dropping rate $r$:
Randomly set input units to 0 with a frequency of $r$.
Apply dropout only in training mode.
Scale the surviving units by $scale = \frac{1}{1 - r}$, so the expected activation stays the same between training and inference.
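torch.nn.Dropout behaves exactly this way, as a short check shows:

```python
import torch
import torch.nn as nn

r = 0.5
drop = nn.Dropout(p=r)
x = torch.ones(10)

drop.train()    # training mode: zero units with frequency r,
print(drop(x))  # survivors scaled by 1/(1-r) -> entries are 0.0 or 2.0

drop.eval()     # evaluation mode: dropout is the identity
print(drop(x))  # all ones
```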
Model Generalization
❖ Trick 3: Dropout

A from-scratch dropout implementation is walked through at https://deepnotes.io/dropout (~50% of the randomly selected nodes in the $i$-th layer are set to zero).
Model Generalization
❖ Trick 3: Dropout

Insert dropout 30% at four points in the network:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output

With dropout, val_accuracy increases from ~84% to ~86.6%.
Model Generalization
❖ Trick 4: Kernel regularization

$L = \text{crossentropy} + \lambda \lVert W \rVert^2$  ($L_2$ regularization)

The penalty prevents the network from focusing on specific features; smaller weights mean simpler models. In PyTorch this is typically passed as the optimizer's weight_decay argument.
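A minimal sketch (the model is a stand-in and λ = 1e-4 is an illustrative value; note that for Adam, weight_decay folds the L2 term into the gradient, while AdamW decouples it):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the CNN used in the slides

# weight_decay adds the gradient of the lambda*||W||^2 penalty to every
# parameter update (up to the factor-of-2 convention).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```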
Model Generalization
❖ Trick 4: Kernel regularizer

Keep the dropout-30% network and add kernel regularization to the (3x3) convolutions (padding='same', stride=1, + ReLU):
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output

With kernel regularization, val_accuracy increases from ~86.6% to ~87.5%.
Model Generalization
❖ Trick 5: Data augmentation

A normal case: the training data cover only part of the data distribution, so some testing data fall in regions the model never saw.
A perfect case: unlimited training data covering the whole distribution. But that is impractical!
Model Generalization
❖ Trick 5: Data augmentation

Increase the data by altering the training data: horizontal flip, rotate, crop and resize. The augmented training data fill more of the data distribution around the original samples, closer to where the testing data lie.
Model Generalization
❖ Trick 5: Data augmentation

With horizontal flip + crop-and-resize, val_accuracy reaches ~90.6%.
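In torchvision, this pair of augmentations might be configured as below (the flip probability and crop-scale range are illustrative choices, not values from the slides):

```python
import torchvision.transforms as transforms

# Horizontal flip + crop-and-resize for 32x32 CIFAR-10 training images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```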
Model Generalization
❖ What we have

Tricks so far: add noise to the training data, batch normalization, dropout, kernel regularization, data augmentation (horizontal flip + crop-and-resize).
val_accuracy reaches ~90.6%; train_accuracy reaches ~92%.
Idea: try to increase train_accuracy and expect val_accuracy to increase too → increase the model capacity.
Optimization
❖ Learning rate

(Figures: the effect of the learning rate on optimization.)
Model Generalization
❖ Trick 6: Reduce the learning rate

$\eta = \eta_0 \times \gamma^{epoch}$
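This exponential schedule is available out of the box in PyTorch; a sketch with an assumed decay factor γ = 0.9 and a stand-in model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # eta_0 = 1e-3
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(30):
    # ... one epoch of training ...
    scheduler.step()  # eta = eta_0 * gamma^epoch
```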
Model Generalization
❖ Trick 6: Reduce the learning rate

With learning-rate decay, val_accuracy reaches ~90.6% and train_accuracy reaches ~98%.
Model Generalization
❖ Discussion: predict the training and test accuracy when using more data augmentation (horizontal flip, rotate, crop and resize, add noise)
Model Generalization
❖ Trick 7: Increase model capacity (and use more data augmentation)

Double the channel count of every stage:
input (3,32,32) → (64,16,16) → (128,8,8) → (256,4,4) → (512,2,2) → output

val_accuracy reaches ~93%; train_accuracy reaches ~96%.
Model Generalization
❖ Trick 8: Use skip-connections

input → + → + → + → + → output
(3,32,32)  (64,16,16)  (128,8,8)  (256,4,4)  (512,2,2)

Each '+' is a skip-connection that adds a stage's input to its output. The stages combine (3x3) convolutions with padding='same' (stride=1 + ReLU and stride=2 + ReLU); the head keeps (2x2) max pooling, Flatten, Dense layer-512 + ReLU, and Dense layer-10 + Softmax.
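One such stage might look like the sketch below. The 1x1 stride-2 shortcut convolution is an assumed detail (needed to match the changed channel count and spatial size before the add), and padding=1 replaces padding='same', which PyTorch does not accept for strided convolutions; placing ReLU before rather than after the add is likewise a design choice:

```python
import torch.nn as nn

class SkipStage(nn.Module):
    """A stride-1 conv + a stride-2 conv with a skip-connection around them."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Assumed 1x1 projection so the shortcut matches the body's shape
        self.shortcut = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # the '+' in the diagram

# input (3,32,32) -> (64,16,16) -> (128,8,8) -> (256,4,4) -> (512,2,2)
stages = nn.Sequential(SkipStage(3, 64), SkipStage(64, 128),
                       SkipStage(128, 256), SkipStage(256, 512))
```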
Model Generalization
❖ Trick 8: Use skip-connections

With skip-connections, val_accuracy reaches ~93% and train_accuracy reaches ~97%.
Model Generalization
❖ Increase model capacity once more

input → + → + → + → + → output
(3,32,32)  (128,16,16)  (256,8,8)  (512,4,4)  (1024,2,2)

Same building blocks as before: (3x3) convolutions with padding='same' (stride=1 + ReLU and stride=2 + ReLU), (2x2) max pooling, Flatten, Dense layer-512 + ReLU, Dense layer-10 + Softmax.
Model Generalization
❖ Increase model capacity once more

val_accuracy reaches ~94.5%; train_accuracy reaches ~98.3%.
Summary
❖ How to increase validation accuracy

Trick 1: 'Learn hard' – randomly add noise to the training data.
Trick 2: Use batch normalization: $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Trick 3: Use dropout.
Trick 4: Kernel regularization: $L = CE + \lambda\lVert W \rVert^2$  ($L_2$ regularization).
Trick 5: Data augmentation.
(*) Use pretrained models.