Shusen Wang
Feature Scaling for Linear Models
Why Feature Scaling?
People's feature vectors: $\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^2$.
• The 1st dimension is a person's income (in dollars).
• Assume it is drawn at random from the Gaussian distribution $N(3000, 400^2)$.
• The 2nd dimension is a person's height (in inches).
• Assume it is drawn at random from the Gaussian distribution $N(69, 3^2)$.
The two features differ in scale by orders of magnitude, so without scaling, the income dimension would dominate distances and gradients when fitting a linear model.
• Min-max normalization: $x_i' = \dfrac{x_i - \min_k x_k}{\max_k x_k - \min_k x_k}$.
• Standardization: $x_i' = \dfrac{x_i - \hat{\mu}}{\hat{\sigma}}$.
• $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean.
• $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$ is the sample variance.
• After the scaling, the samples $x_1', \cdots, x_n'$ have zero mean and unit variance.
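As a quick illustration of the two schemes, here is a minimal NumPy sketch (the sample size and variable names are my own) that scales the income feature from the example above and checks that standardization yields zero mean and unit variance:

import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(3000, 400, size=1000)  # 1st feature: income in dollars, ~ N(3000, 400^2)

# Min-max normalization: squeeze the samples into [0, 1]
income_mm = (income - income.min()) / (income.max() - income.min())

# Standardization: subtract the sample mean, divide by the sample standard deviation
mu_hat = income.mean()
sigma_hat = income.std(ddof=1)             # ddof=1 matches the 1/(n-1) sample variance above
income_std = (income - mu_hat) / sigma_hat

print(income_std.mean(), income_std.var(ddof=1))  # ~0.0 and ~1.0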
Feature Scaling for High-Dim Data
[Diagram: $\mathbf{x}^k$ → standardization (using non-trainable $\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}$) → $\mathbf{z}^k$ → scaling & shifting (using trainable $\boldsymbol{\gamma}, \boldsymbol{\beta}$) → $\mathbf{x}^{k+1}$]
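A minimal NumPy sketch of this two-step pipeline (the batch size, dimension, and variable names are my own; the 0.001 stabilizer follows the formulas on the next slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))        # a batch of 32 outputs x^k of the k-th layer, d = 4

# Standardization with the non-trainable statistics
mu_hat = x.mean(axis=0)             # per-dimension mean
sigma_hat = x.std(axis=0)           # per-dimension standard deviation
z = (x - mu_hat) / (sigma_hat + 0.001)

# Scaling & shifting with the trainable parameters (typical initialization shown)
gamma = np.ones(4)
beta = np.zeros(4)
x_next = z * gamma + beta           # x^{k+1}, the input to the next layer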
Backpropagation for Batch Normalization Layer
• Standardization: $z_j^k = \dfrac{x_j^k - \hat{\mu}_j}{\hat{\sigma}_j + 0.001}$, for $j = 1, \cdots, d$.
• Scale and shift: $x_j^{k+1} = z_j^k \cdot \gamma_j + \beta_j$, for $j = 1, \cdots, d$.
We know $\frac{\partial L}{\partial x_j^{k+1}}$ from the backpropagation (from the top down to $\mathbf{x}^{k+1}$).
• Use $\frac{\partial L}{\partial \gamma_j} = \frac{\partial L}{\partial x_j^{k+1}} \cdot \frac{\partial x_j^{k+1}}{\partial \gamma_j} = \frac{\partial L}{\partial x_j^{k+1}} \cdot z_j^k$ to update $\gamma_j$;
• Use $\frac{\partial L}{\partial \beta_j} = \frac{\partial L}{\partial x_j^{k+1}} \cdot \frac{\partial x_j^{k+1}}{\partial \beta_j} = \frac{\partial L}{\partial x_j^{k+1}}$ to update $\beta_j$.
• Compute $\frac{\partial L}{\partial z_j^k} = \frac{\partial L}{\partial x_j^{k+1}} \cdot \frac{\partial x_j^{k+1}}{\partial z_j^k} = \frac{\partial L}{\partial x_j^{k+1}} \cdot \gamma_j$.
• Compute $\frac{\partial L}{\partial x_j^k} = \frac{\partial L}{\partial z_j^k} \cdot \frac{\partial z_j^k}{\partial x_j^k} = \frac{\partial L}{\partial z_j^k} \cdot \dfrac{1}{\hat{\sigma}_j + 0.001}$ and pass it to the lower layers.
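To sanity-check these four formulas, here is a small sketch of my own: it runs the forward pass, applies the backward formulas above under a toy loss $L = \sum_j x_j^{k+1}$ (so $\frac{\partial L}{\partial x_j^{k+1}} = 1$), and compares $\frac{\partial L}{\partial \gamma_1}$ against a finite-difference estimate. $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\sigma}}$ are held fixed, since they are non-trainable:

import numpy as np

rng = np.random.default_rng(1)
d = 5
x = rng.normal(size=(8, d))                        # a batch of 8 inputs x^k
mu_hat, sigma_hat = x.mean(axis=0), x.std(axis=0)  # non-trainable, held fixed below
gamma, beta = rng.normal(size=d), rng.normal(size=d)

def forward(gamma, beta):
    z = (x - mu_hat) / (sigma_hat + 0.001)   # standardization
    return z, z * gamma + beta               # scale and shift -> x^{k+1}

z, x_next = forward(gamma, beta)
dL_dx_next = np.ones_like(x_next)            # dL/dx^{k+1} for L = sum(x^{k+1})

# Backward pass, following the slide formulas (summed over the batch):
dL_dgamma = (dL_dx_next * z).sum(axis=0)     # dL/dx_j^{k+1} * z_j^k
dL_dbeta  = dL_dx_next.sum(axis=0)           # dL/dx_j^{k+1}
dL_dz     = dL_dx_next * gamma               # dL/dx_j^{k+1} * gamma_j
dL_dx     = dL_dz / (sigma_hat + 0.001)      # passed to the lower layers

# Finite-difference check on gamma_1:
eps = 1e-6
gamma_plus = gamma.copy()
gamma_plus[0] += eps
numeric = (forward(gamma_plus, beta)[1].sum() - x_next.sum()) / eps
print(dL_dgamma[0], numeric)                 # the two numbers should agree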
Batch Normalization Layer in Keras
Batch Normalization Layer
• Let $\mathbf{x}^k \in \mathbb{R}^d$ be the output of the $k$-th hidden layer.
• $\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}} \in \mathbb{R}^d$: non-trainable parameters.
• $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^d$: trainable parameters.
• Standardization: $z_j^k = \dfrac{x_j^k - \hat{\mu}_j}{\hat{\sigma}_j + 0.001}$, for $j = 1, \cdots, d$.
• Scale and shift: $x_j^{k+1} = z_j^k \cdot \gamma_j + \beta_j$, for $j = 1, \cdots, d$.
Example:
• The 1st Conv Layer in VGG16 Net outputs a 150×150×64 tensor, so $d = 150 \times 150 \times 64 = 1.44$M.
• The number of parameters in a single Batch Normalization Layer would then be $4d = 5.76$M.
Solution:
• Make the $4d$ parameters $1 \times 1 \times 64$ instead of $150 \times 150 \times 64$, i.e., share $\hat{\mu}_j, \hat{\sigma}_j, \gamma_j, \beta_j$ across the two spatial dimensions, leaving only $4 \times 64 = 256$ parameters.
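A quick way to confirm the count (a sketch of mine, assuming TensorFlow 2.x and the tf.keras imports used in the listing below):

from tensorflow.keras import layers, models

m = models.Sequential()
m.add(layers.BatchNormalization(input_shape=(150, 150, 64)))  # normalizes over the channel axis by default
m.summary()
# Expected: 4 * 64 = 256 parameters in total:
# gamma and beta (2 * 64, trainable) plus the moving mean and variance (2 * 64, non-trainable).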
from tensorflow.keras import layers, models  # assumes TensorFlow 2.x

model = models.Sequential()
model.add(layers.Conv2D(10, (5, 5), input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())   # normalize the conv output before its activation
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(20, (5, 5)))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(100))
model.add(layers.BatchNormalization())   # also applies after a Dense layer
model.add(layers.Activation('relu'))
model.add(layers.Dense(10, activation='softmax'))
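Calling model.summary() on the model above lists, for each BatchNormalization layer, half of its parameters as trainable ($\boldsymbol{\gamma}, \boldsymbol{\beta}$) and half as non-trainable (the moving estimates of $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\sigma}}$):

model.summary()  # per layer: trainable (gamma, beta) vs. non-trainable (moving statistics) counts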