Convolutional Neural Networks
CS 535 Deep Learning, Winter 2020
Fuxin Li
[Figure: example image-classification outputs — “grass”, “motorcycle”, “person”, “panda”, “dog”]
Neural Networks: Challenges in Image Classification
• Extremely high dimensionality! A 256×256 color image already has 65,536 × 3 = 196,608 dimensions
• One hidden layer with 500 hidden units requires 65,536 × 3 × 500 connections (98.3 million parameters)
The convolution operator
[Figure: a Sobel filter 𝐺𝑥 convolved with an image: 𝐼 ∗ 𝐺𝑥]
2D Convolution with Padding
Zero-pad the 3×3 input, slide the 3×3 filter over every position, and take the elementwise product-and-sum at each location (the deep-learning convention, i.e. cross-correlation):

 0  0  0  0  0
 0  1  3  1  0       -2 -2  1        2 -1 -6
 0  0 -1  1  0   ∗   -2  0  1   =    4 -3 -5
 0  2  2 -1  0        1  1  1        1 -2 -4
 0  0  0  0  0

Sample entries (only nonzero products shown):
• Top-left: 3 × 1 + (−1) × 1 = 2
• Top-middle: 1 × (−2) + 1 × 1 + (−1) × 1 + 1 × 1 = −1

What if the padding were not zero? Changing the top padding row to 0 0 3 3 0 turns the top-right output from −6 into −18 — outputs near the border depend on the padded values.
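This worked example can be checked with a short NumPy sketch (the function name `conv2d_same` is my own; it implements zero-padded cross-correlation, which is what deep-learning “convolution” layers actually compute):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation ("same" output size)."""
    H, W = image.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.zeros((H + 2 * ph, W + 2 * pw))
    padded[ph:ph + H, pw:pw + W] = image
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # elementwise product of the window and the filter, then sum
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 3, 1], [0, -1, 1], [2, 2, -1]], dtype=float)
kernel = np.array([[-2, -2, 1], [-2, 0, 1], [1, 1, 1]], dtype=float)
print(conv2d_same(image, kernel))
# [[ 2. -1. -6.]
#  [ 4. -3. -5.]
#  [ 1. -2. -4.]]
```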
Filter size and Input/Output size
An m×m filter applied to an N×N input (no padding) produces an (N−m+1)×(N−m+1) output.
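The size rule generalizes to padding p and stride s as ⌊(N + 2p − m)/s⌋ + 1; a tiny sketch (the helper name is my own):

```python
def conv_output_size(n, m, padding=0, stride=1):
    # floor((N + 2p - m) / s) + 1; with p = 0, s = 1 this is N - m + 1
    return (n + 2 * padding - m) // stride + 1

print(conv_output_size(28, 5))               # 24 = 28 - 5 + 1
print(conv_output_size(224, 3, padding=1))   # 224: "same" convolution
```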
For each pixel
• In a color image, each pixel starts with 3 channels: R G B
• Each filter looks at the pixel and its 8 neighbors (a 3×3 window)
• After convolving with e.g. 64 filters — ReLU(𝐼 ∗ 𝑓1), …, ReLU(𝐼 ∗ 𝑓64) — each pixel carries 64 feature channels: Ch 1, Ch 2, Ch 3, …, Ch 64
• A single channel can act as a detector, e.g. a circle detector
Convolution, stacked
• First layer: 3x3x3 filters + ReLU, e.g. 64 filters
• Next layer: 3x3x64 filters — note the different dimensionality for filters in this layer: a filter’s depth must match the number of input channels
What’s the shape of weights and input?
• Input: 224 x 224 x 3
• Level 1: e.g. 64 filters → weights 3x3x3x64 (+ReLU) → Output 1: 224 x 224 x 64
• Level 2: 128 filters → weights 3x3x64x128 (+ReLU) → Output 2: 224 x 224 x 128
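The shapes above can be verified with a naive multi-channel convolution layer (a didactic sketch of my own, not an efficient implementation; a small 8×8 input stands in for 224×224):

```python
import numpy as np

def conv_layer(x, w):
    """'Same' multi-channel cross-correlation + ReLU.
    x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    H, W, cin = x.shape
    k, _, cin_w, cout = w.shape
    assert cin == cin_w, "filter depth must match input channels"
    p = k // 2
    xp = np.zeros((H + 2 * p, W + 2 * p, cin))
    xp[p:p + H, p:p + W] = x
    out = np.zeros((H, W, cout))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            # contract the (k, k, C_in) patch against all C_out filters at once
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)  # ReLU

x = np.random.randn(8, 8, 3)          # stand-in for the 224 x 224 x 3 input
w1 = np.random.randn(3, 3, 3, 64)     # level-1 weights: 3x3x3x64
w2 = np.random.randn(3, 3, 64, 128)   # level-2 weights: 3x3x64x128
h1 = conv_layer(x, w1)
h2 = conv_layer(h1, w2)
print(h1.shape, h2.shape)  # (8, 8, 64) (8, 8, 128)
```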
Dramatic reduction in the number of parameters
• Think of a fully-connected layer on a 256 x 256 color image with 500 hidden units
• Num. of params = 65,536 × 3 × 500 = 98.3 million
• 1 convolutional layer on a 256 x 256 image with 11x11 filters and 500 hidden channels?
• Num. of params = 11 × 11 × 3 × 500 = 181,500
• 2 convolutional layers on a 256 x 256 image with 11x11- and 3x3-sized filters and 500 hidden channels in each layer?
• Num. of params = 181,500 + 3 × 3 × 500 × 500 = 2.43 million
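The three counts above as plain arithmetic (weights only, biases omitted):

```python
fc_params = 256 * 256 * 3 * 500                  # fully-connected layer
conv1_params = 11 * 11 * 3 * 500                 # one 11x11 conv layer, 500 filters
conv2_params = conv1_params + 3 * 3 * 500 * 500  # plus a 3x3, 500 -> 500 layer
print(fc_params, conv1_params, conv2_params)
# 98304000 181500 2431500
```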
Back to images
• Why are images much harder than digits?
• Much more deformation
• Much more noise
• Noisy backgrounds
Pooling
• Localized max-pooling (stride 2) helps achieve some location invariance
• It also filters out irrelevant background information
e.g. 𝑥out = max(𝑥11, 𝑥12, 𝑥21, 𝑥22)
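A stride-2 2×2 max-pooling sketch matching the formula above (function name is my own):

```python
import numpy as np

def max_pool2x2(x):
    # non-overlapping 2x2 windows, stride 2: x_out = max(x11, x12, x21, x22)
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [0, 5, 1, 1],
              [2, 2, 0, 4],
              [0, 1, 3, 1]], dtype=float)
print(max_pool2x2(x))
# [[5. 2.]
#  [2. 4.]]
```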
Deformation enabled by max-pooling
[Figure: max-pooled responses feed a new filter in the next layer, tolerating small shifts of the pattern]
Strides
• Reduce image size via strides
• Stride = 1: convolution on every pixel
• Stride = 2: convolution on every other pixel
• Stride = 0.5: convolution on every half pixel (requires interpolation; Long et al. 2015)
[Figure: stride = 2 example]
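Integer strides amount to evaluating the convolution only every `stride` pixels; a minimal sketch (“valid” convolution, names my own):

```python
import numpy as np

def conv2d_strided(image, kernel, stride=2):
    # "valid" cross-correlation evaluated every `stride` pixels
    H, W = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
box = np.ones((3, 3)) / 9                        # 3x3 box filter
print(conv2d_strided(img, box, stride=1).shape)  # (4, 4)
print(conv2d_strided(img, box, stride=2).shape)  # (2, 2): roughly halved
```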
The VGG Network
[Figure: convolution stages at 224 x 224 (64 filters), 112 x 112, 56 x 56, 28 x 28, 14 x 14, and 7 x 7, followed by fully connected layers and class outputs: airplane, dog, car, ……, SUV, minivan, sign, pole]
Forward pass:
Compute 𝐙 = 𝑓(𝐗; 𝐖) = 𝐗 ∗ 𝐖

Backward pass:
Given ∂𝐿/∂𝐙, compute ∂𝐿/∂𝐗 = (∂𝐿/∂𝐙)(∂𝐙/∂𝐗) and ∂𝐿/∂𝐖 = (∂𝐿/∂𝐙)(∂𝐙/∂𝐖)
So what are ∂𝐙/∂𝐗 = ? and ∂𝐙/∂𝐖 = ?
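The backward pass can be sanity-checked numerically. For a single “valid” correlation layer 𝐙 = 𝐗 ∗ 𝐖 with the toy loss 𝐿 = Σ𝐙, the gradient ∂𝐿/∂𝐖 is 𝐗 correlated with ∂𝐿/∂𝐙; the sketch below (all names mine) compares it to a finite difference:

```python
import numpy as np

def corr2d(x, w):
    # "valid" 2D cross-correlation
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
w = rng.standard_normal((3, 3))

dZ = np.ones_like(corr2d(x, w))   # dL/dZ for the toy loss L = sum(Z)
dW = corr2d(x, dZ)                # backward pass: dL/dW = corr(X, dL/dZ)

# finite-difference check on one weight entry
eps = 1e-6
w_pert = w.copy()
w_pert[1, 1] += eps
dW_num = (corr2d(x, w_pert).sum() - corr2d(x, w).sum()) / eps
print(abs(dW_num - dW[1, 1]) < 1e-4)  # True
```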
Historical Remarks: MNIST

LeNet

Classifier                                                       Preprocessing             Test error (%)   Reference
Convolutional net LeNet-1                                        subsampling to 16x16 px   1.7              LeCun et al. 1998
Convolutional net LeNet-4                                        none                      1.1              LeCun et al. 1998
Convolutional net LeNet-4, K-NN instead of last layer            none                      1.1              LeCun et al. 1998
Convolutional net LeNet-4, local learning instead of last layer  none                      1.1              LeCun et al. 1998
Convolutional net LeNet-5 [no distortions]                       none                      0.95             LeCun et al. 1998
Convolutional net, cross-entropy [elastic distortions]           none                      0.4              Simard et al., ICDAR 2003

[Figure: the 82 errors made by LeNet-5]
Convolution/ReLU/Pooling
• ReLU rectifier
• Max-pooling
• Replaceable
• Local response normalization
• Hurts performance rather than helps
• Dropout regularization (to be discussed)
• Replaceable by some other regularization techniques
Graham. Fractional Max-pooling. arXiv:1412.6071v4
arXiv!
• Because of how fast the field evolves, most deep learning papers appear on
arXiv first
• http://arxiv.org/list/cs.CV/recent
• http://arxiv.org/list/cs.CL/recent
Takeaways:
1) Long-range correlation
2) Local correlation stronger than non-local
Un-max-pooling
• Instead of mapping
pixels to features,
map the other way
around