
Batch Normalization

Shusen Wang
Feature Scaling for Linear Models
Why Feature Scaling?
People's feature vectors: x_1, ⋯, x_n ∈ ℝ².
• The 1st dimension is a person's income (in dollars).
• Assume it is drawn from the Gaussian distribution N(3000, 400²).
• The 2nd dimension is a person's height (in inches).
• Assume it is drawn from the Gaussian distribution N(69, 3²).

• Hessian matrix of the least squares regression model:
      H = (1/n) Σ_{i=1}^n x_i x_iᵀ = [9137.3, 206.6; 206.6, 4.8] × 10³.
• Condition number: λ_max(H) / λ_min(H) = 9.2×10⁴.

A bad condition number means slow convergence of gradient descent!
Why Feature Scaling?
People's feature vectors: x_1, ⋯, x_n ∈ ℝ².
• The 1st dimension is a person's income (in thousands of dollars).
• Assume it is drawn from the Gaussian distribution N(3, 0.4²).
• The 2nd dimension is a person's height (in feet).
• Assume it is drawn from the Gaussian distribution N(5.75, 0.25²).
• Change the units of measurement.
• Hessian matrix of linear regression:
      H = (1/n) Σ_{i=1}^n x_i x_iᵀ = [9.1, 17.2; 17.2, 33.1].
• Condition number: λ_max(H) / λ_min(H) = 281.7. (As opposed to 9.2×10⁴.)
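A minimal NumPy sketch (not from the slides) that simulates both unit choices above and compares the condition numbers of the resulting Hessians; the exact values depend on the random draw, but the first should be several orders of magnitude larger than the second.

# Sketch: effect of feature units on the condition number of the Hessian.
import numpy as np

n = 100_000
rng = np.random.default_rng(0)

# Income in dollars, height in inches.
X_raw = np.column_stack([rng.normal(3000, 400, n), rng.normal(69, 3, n)])
# Income in thousands of dollars, height in feet.
X_scaled = np.column_stack([rng.normal(3, 0.4, n), rng.normal(5.75, 0.25, n)])

for X in (X_raw, X_scaled):
    H = X.T @ X / n                      # Hessian of least squares regression
    print(np.linalg.cond(H))             # lambda_max(H) / lambda_min(H)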
Feature Scaling for 1D Data
Assume the samples x_1, ⋯, x_n are one-dimensional.

• Min-max normalization:  x_i' = (x_i − min_j x_j) / (max_j x_j − min_j x_j).

• After the scaling, the samples x_1', ⋯, x_n' are in [0, 1].

Feature Scaling for 1D Data
Assume the samples x_1, ⋯, x_n are one-dimensional.

• Standardization:  x_i' = (x_i − μ̂) / σ̂.
  • μ̂ = (1/n) Σ_{i=1}^n x_i is the sample mean.
  • σ̂² = (1/(n−1)) Σ_{i=1}^n (x_i − μ̂)² is the sample variance.

• After the scaling, the samples x_1', ⋯, x_n' have zero mean and unit variance.
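A minimal NumPy sketch of the two 1D scalings above; the sample values are made up for illustration.

import numpy as np

x = np.array([3.1, 2.4, 3.8, 2.9, 3.3])          # hypothetical 1D samples

# Min-max normalization: rescale the samples to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the sample mean, divide by the sample std
# (ddof=1 gives the 1/(n-1) sample variance defined above).
x_standardized = (x - x.mean()) / x.std(ddof=1)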
Feature Scaling for High-Dim Data

• Independently perform feature scaling for every feature.


• E.g., when scaling the “height” feature, ignore the “income” feature.
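A minimal NumPy sketch of per-feature scaling on an n×2 matrix (income and height simulated for illustration): each column is standardized independently.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(3000, 400, 5),    # income (dollars)
                     rng.normal(69, 3, 5)])       # height (inches)

# axis=0 scales each feature on its own, ignoring the other column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)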
Batch Normalization
Batch Normalization: Standardization of Hidden Layers

• Let x^(k) ∈ ℝᵈ be the output of the k-th hidden layer.

• μ̂ ∈ ℝᵈ: sample mean of x^(k), evaluated on a batch of samples.
• σ̂ ∈ ℝᵈ: sample std of x^(k), evaluated on a batch of samples.
  (μ̂ and σ̂ are non-trainable. Just record them in the forward pass; use them in the backpropagation.)
• γ ∈ ℝᵈ: scaling parameter (trainable).
• β ∈ ℝᵈ: shifting parameter (trainable).

• Standardization:  z_j^(k) = (x_j^(k) − μ̂_j) / (σ̂_j + 0.001), for j = 1, ⋯, d.
• Scale and shift:  x_j^(k+1) = z_j^(k) ⋅ γ_j + β_j, for j = 1, ⋯, d.
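A minimal NumPy sketch of this forward pass, assuming the layer output is stored as a matrix X of shape (batch_size, d); the function and variable names are made up for illustration.

import numpy as np

def batchnorm_forward(X, gamma, beta, eps=0.001):
    # X: (batch_size, d); gamma, beta: (d,) trainable vectors.
    mu = X.mean(axis=0)                  # sample mean over the batch (non-trainable)
    sigma = X.std(axis=0)                # sample std over the batch (non-trainable)
    Z = (X - mu) / (sigma + eps)         # standardization
    out = Z * gamma + beta               # scale and shift
    return out, Z, sigma                 # keep Z and sigma for backpropagation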
Backpropagation for Batch Normalization Layer

• Standardization:  z_j^(k) = (x_j^(k) − μ̂_j) / (σ̂_j + 0.001), for j = 1, ⋯, d.
• Scale and shift:  x_j^(k+1) = z_j^(k) ⋅ γ_j + β_j, for j = 1, ⋯, d.

[Diagram]  x^(k) ──(standardization, using the non-trainable μ̂, σ̂)──► z^(k) ──(scaling & shifting, using the trainable γ, β)──► x^(k+1)
Backpropagation for Batch Normalization Layer

We know the gradient ∂L/∂x_j^(k+1) of the loss L from backpropagation (from the top down to x^(k+1)).

• Use  ∂L/∂γ_j = (∂L/∂x_j^(k+1)) ⋅ (∂x_j^(k+1)/∂γ_j) = (∂L/∂x_j^(k+1)) ⋅ z_j^(k)  to update γ_j.
• Use  ∂L/∂β_j = (∂L/∂x_j^(k+1)) ⋅ (∂x_j^(k+1)/∂β_j) = ∂L/∂x_j^(k+1)  to update β_j.
Backpropagation for Batch Normalization Layer

• Compute  ∂L/∂z_j^(k) = (∂L/∂x_j^(k+1)) ⋅ (∂x_j^(k+1)/∂z_j^(k)) = (∂L/∂x_j^(k+1)) ⋅ γ_j.

• Compute  ∂L/∂x_j^(k) = (∂L/∂z_j^(k)) ⋅ (∂z_j^(k)/∂x_j^(k)) = (∂L/∂z_j^(k)) ⋅ 1/(σ̂_j + 0.001), and pass it to the lower layers.
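A minimal NumPy sketch of these backward formulas, following the simplified derivation above in which μ̂ and σ̂ are treated as recorded constants; names are made up for illustration.

import numpy as np

def batchnorm_backward(dout, Z, sigma, gamma, eps=0.001):
    # dout = dL/dx^(k+1), shape (batch_size, d); Z and sigma were recorded
    # in the forward pass.
    dgamma = (dout * Z).sum(axis=0)      # dL/dgamma, summed over the batch
    dbeta = dout.sum(axis=0)             # dL/dbeta
    dZ = dout * gamma                    # dL/dz
    dX = dZ / (sigma + eps)              # dL/dx^(k), passed to the lower layers
    return dX, dgamma, dbeta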
Batch Normalization Layer in Keras
Batch Normalization Layer
• Let x^(k) ∈ ℝᵈ be the output of the k-th hidden layer.
• μ̂, σ̂ ∈ ℝᵈ: non-trainable parameters.
• γ, β ∈ ℝᵈ: trainable parameters.

• Standardization:  z_j^(k) = (x_j^(k) − μ̂_j) / (σ̂_j + 0.001), for j = 1, ⋯, d.
• Scale and shift:  x_j^(k+1) = z_j^(k) ⋅ γ_j + β_j, for j = 1, ⋯, d.
Batch Normalization Layer
• Let x^(k) ∈ ℝᵈ be the output of the k-th hidden layer.
• μ̂, σ̂ ∈ ℝᵈ: non-trainable parameters.
• γ, β ∈ ℝᵈ: trainable parameters.

Difficulty: there are 4d parameters, which must be stored in memory.

d can be very large!

Example:
• The 1st Conv layer in the VGG16 net outputs a 150×150×64 tensor.
• That makes d = 150×150×64 ≈ 1.44M, so a single Batch Normalization layer would need 4d ≈ 5.76M parameters.
Batch Normalization Layer

Solution: make the 4 parameter tensors 1×1×64 instead of 150×150×64.

• How?
• Use one scalar parameter per slice (e.g., per 150×150 matrix) of the output tensor.
• Of course, you could instead make the parameters 150×1×1 or 1×150×1.
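A minimal NumPy sketch (illustrative only, not Keras internals) of the channel-wise version: one scalar per channel, so the four parameter tensors have shape (1, 1, 1, C) instead of (1, H, W, C).

import numpy as np

def batchnorm_channelwise(X, gamma, beta, eps=0.001):
    # X: (batch, H, W, C); gamma, beta: shape (1, 1, 1, C).
    mu = X.mean(axis=(0, 1, 2), keepdims=True)     # one mean per channel
    sigma = X.std(axis=(0, 1, 2), keepdims=True)   # one std per channel
    Z = (X - mu) / (sigma + eps)
    return Z * gamma + beta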
CNN for Digit Recognition

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Conv2D(10, (5, 5), input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(20, (5, 5)))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(100))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.Dense(10, activation='softmax'))

# print the summary of the model.
model.summary()


CNN for Digit Recognition

Train the model (with Batch Normalization) on MNIST (𝑛 = 50,000).
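A hedged sketch of what the training step might look like; the preprocessing, optimizer, and epoch count below are assumptions, not shown in the slides.

# Assumed training setup: load MNIST, scale pixels to [0, 1], one-hot the
# labels, then compile and fit the model defined above.
from keras.datasets import mnist
from keras.utils import to_categorical

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
# validation_split=1/6 leaves n = 50,000 samples for training.
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=1/6)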


CNN for Digit Recognition

Train the model (without Batch Normalization) on MNIST (n = 50,000).

Without Batch Normalization, it takes 10x more epochs to converge.
Summary
Feature Scaling

• Make the scales of all the features comparable.
• Why? A better-conditioned Hessian matrix ⇒ faster convergence.
• Methods:
  • Min-max normalization: scale the features to [0, 1].
  • Standardization: every feature has zero mean and unit variance.
Batch Normalization

• Feature standardization for the hidden layers.
• Why? Faster convergence.
• 2 trainable parameter vectors: shifting (β) and scaling (γ).
• 2 non-trainable parameter vectors: mean (μ̂) and standard deviation (σ̂).
• Keras provides layers.BatchNormalization().
• Put the BN layer after Conv (or Dense) and before Activation.
