All-in-One Course
Model Generalization
Quang-Vinh Dinh
Ph.D. in Computer Science
Year 2023
[Figure: CIFAR-10 sample images with class labels, including airplane, automobile, bird, ship, truck]
Model Generalization

Dataset: CIFAR-10. Baseline network (feature-map shapes):
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → Flatten → Dense Layer-512 + ReLU → Dense Layer-10 + Softmax
Each stage: (3x3) Convolution, padding='same', stride=1, + ReLU, followed by (2x2) max pooling.

Data normalization (convert to 0-mean and 1-deviation):
$$\bar{X} = \frac{X - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_i X_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_i (X_i - \mu)^2}$$

Training setup: Adam, lr=1e-3; He initialization.

[Figure: training vs. testing accuracy over epochs; the aim of the following tricks is to reduce the gap between the two curves]
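The per-channel normalization above can be done with torchvision; a minimal sketch is below. The mean/std values are the commonly quoted CIFAR-10 statistics, assumed here rather than taken from the slides.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# 0-mean / 1-std normalization per channel (assumed CIFAR-10 statistics)
normalize = transforms.Compose([
    transforms.ToTensor(),                                   # scales pixels to [0, 1]
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),      # per-channel mean
                         std=(0.2470, 0.2435, 0.2616)),      # per-channel std
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=normalize)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```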
❖ Trick 1: ‘Learn hard’ – randomly add noise to training data
Example: speed-limit sign detection
Adding noise to the training images in PyTorch:
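A minimal sketch of one way to inject noise during training; Gaussian noise with std 0.05 is an assumption here, not a value from the slides.

```python
import torch

def add_gaussian_noise(x, std=0.05):
    """Return a noisy copy of the input batch (assumed Gaussian noise and std)."""
    return x + torch.randn_like(x) * std

# In the training loop, perturb only the training inputs ('learn hard'):
# for images, labels in train_loader:
#     outputs = model(add_gaussian_noise(images))
```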
Result: val_accuracy increases from ~78% to ~80%.
Network Training
❖ Trick 2: Batch normalization
Input data for one node in a batch normalization layer:
$$X = \{X_1, \ldots, X_m\}, \qquad m \text{ is the mini-batch size}$$

Compute the mean and variance:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (X_i - \mu)^2$$

Normalize $X_i$ ($\epsilon$ is a very small value):
$$\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

Scale and shift $\hat{X}_i$ ($\gamma$ and $\beta$ are two learnable parameters):
$$Y_i = \gamma \hat{X}_i + \beta$$

Notes: the preceding layer does not need a bias term when BN is used; $\mu$ and $\sigma$ are computed in the forward pass, while $\gamma$ and $\beta$ are updated in the backward pass.
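As a rough PyTorch sketch (an assumption of how such a block could look, not the slide's code): batch normalization after a convolution, with the convolution's bias omitted since BN's beta makes it redundant.

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU block; bias=False because BN's beta plays that role.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding="same", stride=1, bias=False),
    nn.BatchNorm2d(32),   # learnable gamma (weight) and beta (bias); eps defaults to 1e-5
    nn.ReLU(),
)
```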
What if the learned parameters turn out to be $\gamma = \sqrt{\sigma^2 + \epsilon}$ and $\beta = \mu$? Then $Y_i = \gamma \hat{X}_i + \beta = X_i$: the layer recovers the original input, so batch normalization can always fall back to the identity mapping if that is what training prefers.
❖ Trick 2: Batch normalization – worked example

Settings: $\epsilon = 10^{-5}$, $\gamma_U = \gamma_V = 1.0$, $\beta_U = \beta_V = 0.0$.

Node U: $X_U = \{1, 3, 9, 4, 6, 7\}$, so $\mu_U = 5.0$ and $\sigma_U = 2.64$ ($\sigma_U^2 = 7.0$), giving
$$Y_U = \{-1.51, -0.75, 1.51, -0.37, 0.37, 0.75\}$$

Node V: $X_V = \{6, 4, 7, 5, 6, 2\}$, so $\mu_V = 5.0$ and $\sigma_V = 1.63$ ($\sigma_V^2 = 2.67$), giving
$$Y_V = \{0.61, -0.61, 1.22, 0.0, 0.61, -1.83\}$$

With $\gamma = 1$ and $\beta = 0$ the outputs are simply the normalized values $\hat{X}_i = (X_i - \mu)/\sqrt{\sigma^2 + \epsilon}$; $\gamma$ and $\beta$ are then updated during training.
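These numbers can be reproduced with a few lines of torch code; a small check, assuming gamma = 1 and beta = 0 as above:

```python
import torch

eps = 1e-5
for name, values in {"U": [1., 3., 9., 4., 6., 7.],
                     "V": [6., 4., 7., 5., 6., 2.]}.items():
    x = torch.tensor(values)
    mu = x.mean()                              # mini-batch mean
    var = x.var(unbiased=False)                # biased variance, as in BN
    y = (x - mu) / torch.sqrt(var + eps)       # gamma = 1, beta = 0
    print(name, round(mu.item(), 2), round(var.sqrt().item(), 2),
          [round(v, 2) for v in y.tolist()])

# Matches the slide's numbers up to rounding:
# U 5.0 2.65 [-1.51, -0.76, 1.51, -0.38, 0.38, 0.76]
# V 5.0 1.63 [0.61, -0.61, 1.22, 0.0, 0.61, -1.84]
```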
❖ Trick 2: Batch normalization for 2D data
[Figure: a Batch-Norm layer applied to a mini-batch of two samples, batch-size = 2, sample_shape = (2, 2, 2); the statistics are computed per channel over the batch and spatial dimensions]
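A minimal PyTorch sketch of the same setup, assuming the slide's shapes (a batch of 2 samples, each of shape (2, 2, 2) = (channels, H, W)):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 2, 2, 2)          # (batch, channels, H, W): 2 samples of shape (2, 2, 2)
bn = nn.BatchNorm2d(num_features=2)  # one (gamma, beta) pair per channel
y = bn(x)
print(y.shape)                       # torch.Size([2, 2, 2, 2])
# Per channel, y has (approximately) zero mean and unit variance over batch + spatial dims:
print(y[:, 0].mean().item(), y[:, 0].var(unbiased=False).item())
```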
❖ Trick 2: Batch normalization – backward pass

[Figure: computation graph $X_i \to \mu, \sigma^2 \to \hat{X}_i \to Y_i$ with parameters $\gamma$ and $\beta$, and the upstream gradient $\partial L / \partial Y_i$ flowing back]

Forward (for one node, mini-batch size $m$):
$$\mu = \frac{1}{m}\sum_{i=1}^{m} X_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (X_i - \mu)^2, \qquad \hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad Y_i = \gamma \hat{X}_i + \beta$$

Backward, given the upstream gradients $\partial L / \partial Y_i$:
$$\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}\,\hat{X}_i, \qquad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial Y_i}, \qquad \frac{\partial L}{\partial \hat{X}_i} = \frac{\partial L}{\partial Y_i}\,\gamma$$

$$\frac{\partial L}{\partial \sigma^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,(X_i - \mu)\left(-\frac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2}$$

$$\frac{\partial L}{\partial \mu} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{X}_i}\,\frac{-1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2}\,\frac{1}{m}\sum_{i=1}^{m} \bigl(-2(X_i - \mu)\bigr)$$

$$\frac{\partial L}{\partial X_i} = \frac{\partial L}{\partial \hat{X}_i}\,\frac{\partial \hat{X}_i}{\partial X_i} + \frac{\partial L}{\partial \mu}\,\frac{\partial \mu}{\partial X_i} + \frac{\partial L}{\partial \sigma^2}\,\frac{\partial \sigma^2}{\partial X_i}$$

where
$$\frac{\partial \hat{X}_i}{\partial X_i} = \frac{1}{\sqrt{\sigma^2 + \epsilon}}, \qquad \frac{\partial \sigma^2}{\partial X_i} = \frac{2(X_i - \mu)}{m}, \qquad \frac{\partial \mu}{\partial X_i} = \frac{1}{m}$$
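A small numerical check of these backward formulas against PyTorch autograd; the scalar loss and mini-batch size are assumptions made only for the test.

```python
import torch

m, eps = 8, 1e-5
x = torch.randn(m, requires_grad=True)
gamma = torch.tensor(1.5, requires_grad=True)
beta = torch.tensor(0.3, requires_grad=True)

# Forward pass of one BN node
mu = x.mean()
var = ((x - mu) ** 2).mean()
x_hat = (x - mu) / torch.sqrt(var + eps)
y = gamma * x_hat + beta
loss = (y ** 2).sum()                 # any scalar loss works for the check
loss.backward()

# Manual gradients from the formulas above
dy = 2 * y.detach()                   # dL/dY_i for this particular loss
x_d, mu_d, var_d = x.detach(), mu.detach(), var.detach()
dgamma = (dy * x_hat.detach()).sum()
dbeta = dy.sum()
dxhat = dy * gamma.detach()
dvar = (dxhat * (x_d - mu_d) * -0.5 * (var_d + eps) ** -1.5).sum()
dmu = (dxhat * -1 / torch.sqrt(var_d + eps)).sum() + dvar * (-2 * (x_d - mu_d)).mean()
dx = dxhat / torch.sqrt(var_d + eps) + dmu / m + dvar * 2 * (x_d - mu_d) / m

print(torch.allclose(dx, x.grad, atol=1e-4),
      torch.allclose(dgamma, gamma.grad, atol=1e-4),
      torch.allclose(dbeta, beta.grad, atol=1e-4))   # expected: True True True
```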
Model Generalization
❖ Trick 2: Batch normalization
Different mini-batches very likely have different statistics: $(\mu_1, \sigma_1) \neq (\mu_2, \sigma_2)$ for mini-batch 1 and mini-batch 2. Normalizing with per-batch statistics therefore adds noise to the output of the BN layers, which is one reason BN helps generalization.
The same network with batch normalization added to the convolution blocks:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → Flatten → Dense Layer-512 + ReLU → output
Each block: (3x3) Convolution, padding='same', stride=1, + ReLU, followed by (2x2) max pooling.

If the activation at a layer $j$ is masked element-wise by a binary mask $D$:
$$a = D \odot \sigma(Z), \qquad \frac{\partial L}{\partial \sigma(Z)} = \frac{\partial L}{\partial a}\,\frac{\partial a}{\partial \sigma(Z)} = \frac{\partial L}{\partial a} \odot D$$
Overfitting
Dropout
Given a dropping rate $r$, the kept activations are scaled by
$$scale = \frac{1}{1 - r}$$
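PyTorch's nn.Dropout implements exactly this "inverted dropout" scaling; a minimal sketch:

```python
import torch
import torch.nn as nn

# During training, elements are zeroed with probability r and survivors are
# scaled by 1 / (1 - r); during evaluation the layer is the identity.
r = 0.3
drop = nn.Dropout(p=r)

x = torch.ones(8)
drop.train()
print(drop(x))   # surviving entries equal 1 / (1 - 0.3) ≈ 1.43, dropped entries are 0
drop.eval()
print(drop(x))   # all ones: dropout is disabled at inference time
```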
Model Generalization
❖ Trick 3: Dropout
Two dropout layers (rate 30%) are added to the network:
input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output

Result: with dropout, val_accuracy increases from ~84% to ~86.6%.
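A hedged sketch of where the two 30% dropout layers could sit; the exact placement in the slide's model is an assumption, here in the fully connected head after flattening the (256, 2, 2) feature map.

```python
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.3),            # dropout 30%
    nn.Linear(256 * 2 * 2, 512),  # Dense Layer-512
    nn.ReLU(),
    nn.Dropout(p=0.3),            # dropout 30%
    nn.Linear(512, 10),           # Dense Layer-10
)
```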
❖ Trick 4: Kernel regularization
$L_2$ regularization adds a penalty $\lambda \lVert W \rVert^2$ to the loss: smaller weights → simpler models.
In PyTorch:
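A minimal sketch of kernel ($L_2$) regularization via the optimizer's weight_decay argument; the strength 1e-4 and the stand-in model are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in model; the slide's CNN would go here
# weight_decay applies an L2 penalty on all parameters passed to the optimizer.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```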
Applying dropout (30%) and kernel regularization together on the same network
(input (3,32,32) → (32,16,16) → (64,8,8) → (128,4,4) → (256,2,2) → output; conv blocks: (3x3) Convolution, padding='same', stride=1, + ReLU):
val_accuracy increases from ~86.6% to ~87.5%.
❖ Trick 5: Data augmentation
A normal case: the training data covers only part of the data distribution, and the testing data may lie outside it.
A perfect case: unlimited training data that covers the whole distribution. But this is impractical!
[Figure: data distribution with training and testing data for both cases]
Increase the amount of data by altering the training data: horizontal flip, rotate, crop and resize.
[Figure: data distribution, where the augmented training data covers more of the distribution than the original training data]
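A sketch of these augmentations with torchvision transforms; the parameter values are assumptions, not taken from the slides.

```python
import torchvision.transforms as transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                   # horizontal flip
    transforms.RandomRotation(degrees=15),                    # rotate
    transforms.RandomResizedCrop(size=32, scale=(0.8, 1.0)),  # crop and resize
    transforms.ToTensor(),
])
```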
❖ What we have
Batch normalization
Dropout
Kernel regularization
Data augmentation: horizontal flip + crop-and-resize
Optimization
❖ Learning rate
Model Generalization
❖ Trick 6: Reduce learning rate
$$\eta = \eta_0 \times \gamma^{\,\text{epoch}}$$
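This schedule corresponds to PyTorch's ExponentialLR; a minimal sketch, where the decay factor gamma = 0.95 and the stand-in model are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                   # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(5):
    # ... one epoch of training ...
    scheduler.step()                       # lr becomes 1e-3 * 0.95**(epoch + 1)
    print(epoch, scheduler.get_last_lr())
```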
Result: val_accuracy reaches ~90.6%, train_accuracy reaches ~98%.
❖ Discussion: Predict training and test accuracy when using more data augmentation
Augmentations considered: horizontal flip, rotate, crop and resize, add noise.
❖ Trick 7: Increase model capacity (and use more data augmentation)
input (3,32,32) → (64,16,16) → (128,8,8) → (256,4,4) → (512,2,2) → output
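One way to realize the wider network, as a sketch: every convolution's channel width is doubled to match the feature-map shapes above. The block structure is assumed from the earlier slides, not taken from the slide's code.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    """(3x3) conv + BN + ReLU followed by (2x2) max pooling, halving the spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding="same", bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Channel widths doubled from (32, 64, 128, 256) to (64, 128, 256, 512)
backbone = nn.Sequential(
    conv_block(3, 64), conv_block(64, 128), conv_block(128, 256), conv_block(256, 512),
)
```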
❖ Trick 8: Using skip-connection
input (3,32,32) → (64,16,16) → (128,8,8) → (256,4,4) → (512,2,2) → output, with a skip-connection (the '+' in the diagram, an element-wise addition) around each block
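A hedged sketch of one block with a skip-connection; the slide does not show the exact block, so the 1x1 projection used to match shapes is an assumption.

```python
import torch.nn as nn

class SkipBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        # 1x1 conv so the shortcut matches the body's output shape
        self.shortcut = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))   # the '+' in the diagram
```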
Result: val_accuracy reaches ~93%, train_accuracy reaches ~97%.
❖ Increase model capacity once more
input (3,32,32) → (128,16,16) → (256,8,8) → (512,4,4) → (1024,2,2) → output, again with a skip-connection around each block
Result: val_accuracy reaches ~94.5%, train_accuracy reaches ~98.3%.
Summary
❖ How to increase validation accuracy
Trick 1: ‘Learn hard’ – randomly add noise to the training data.
Trick 2: Use batch normalization: $\hat{X}_i = \dfrac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Trick 3: Use dropout.
Trick 5: Data augmentation.
Kernel ($L_2$) regularization: $L = CE + \lambda \lVert W \rVert^2$
(*) Use pretrained models.