
COE101

Introductory
Artificial Intelligence
College of Engineering

Deep Learning -
Convolutional
Neural Networks
What are they?
• A class of neural networks that can take in an input image, assign importance (learnable weights and biases) to different parts of it, and differentiate one image from another
• Not fully connected: each neuron connects only to a local region of the input
• A much lower number of weights and biases to learn → much more manageable
A Typical Convolutional Neural Network Architecture

Applications
• Classification and retrieval
• Object detection and segmentation
• Self-driving cars: deep learning and computer vision regress three values (steering angle, acceleration, and brake levels) from images and sensor output


• Face recognition [Taigman et al. 2014]
• Activity detection [Simonyan et al. 2014]
• Pose estimation and gaming [Toshev and Szegedy 2014; Guo et al. 2014]


• Bioimaging, space exploration, and safe-driving detection [Levy et al. 2016; Sermanet et al. 2011; Dieleman et al. 2014; Ciresan et al.]
• Environmental applications, mapping, and GIS: whale recognition (Kaggle challenge); mapping [Mnih and Hinton, 2010]
• Captioning [Vinyals et al., 2015; Karpathy and Fei-Fei, 2015]. Generated captions range from no errors to minor errors to only somewhat related, e.g.:
  "A white teddy bear sitting in the grass"
  "A man in a baseball uniform throwing a ball"
  "A woman is holding a cat in her hand"
  "A man riding a wave on top of a surfboard"
  "A cat sitting on a suitcase on the floor"
  "A woman standing on a beach holding a surfboard"
Style Transfer
Deep Learning
Architectures
Architecture Over the Years
CNN Architectures: LeNet
 The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990s.
 The best known is the LeNet architecture, which was used to read zip codes, digits, etc.
CNN Architectures: AlexNet
 The first work that popularized Convolutional Networks in Computer Vision.
 This network has a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked directly on top of each other.
CNN Architectures: GoogLeNet
 Main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
CNN Architectures: VGGNet
 Main contribution is in showing that the depth of the network is a critical component for good performance.
 Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
CNN Architectures: ResNet
 Very deep networks using residual connections: each residual block adds its input x to the output of a small stack of layers, computing F(x) + x via an identity skip connection, followed by a ReLU.
 Features special skip connections and heavy use of batch normalization.
 ResNets are currently by far the state-of-the-art Convolutional Neural Network models and are the default choice for using ConvNets in practice.
CNN Architectures: Comparing Complexity
 AlexNet: smaller compute, still memory-heavy, lower accuracy.
 VGG: highest memory, most operations.
 GoogLeNet: most efficient.
 ResNet: moderate efficiency depending on model, highest accuracy.
Figures from "An Analysis of Deep Neural Network Models for Practical Applications," 2017. Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
Architecture
Overview

Convolution Layer
 Input: a 32x32x3 image (32 height x 32 width x 3 depth); convolution preserves spatial structure.
 Convolve a 5x5x3 filter with the image, i.e. "slide over the image spatially, computing dot products."
 Filters always extend the full depth of the input volume.
 Each position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).
 Convolving (sliding) the filter over all spatial locations produces a 28x28x1 activation map.
 Consider a second, green filter: convolving it the same way produces a second 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate 28x28 activation maps.
We stack these up to get a "new image" of size 28x28x6!
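As a quick sketch of the stacking above (random filter values for illustration only; real filters are learned during training, and the per-filter bias is omitted for simplicity):

```python
import numpy as np

# Hypothetical 32x32x3 input and 6 random 5x5x3 filters (illustrative values)
image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)

out = np.zeros((28, 28, 6))          # (32 - 5)/1 + 1 = 28
for k in range(6):                   # one activation map per filter
    for i in range(28):
        for j in range(28):
            # dot product between the filter and a 5x5x3 chunk of the image
            out[i, j, k] = np.sum(image[i:i+5, j:j+5, :] * filters[k])

print(out.shape)  # (28, 28, 6): six activation maps stacked into a "new image"
```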


Preview
A ConvNet is a sequence of Convolution Layers, interspersed with activation functions:

32x32x3 -> [CONV, ReLU; e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU; e.g. 10 5x5x6 filters] -> 24x24x10 -> ...
Preview
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
 stride 1 => (7 - 3)/1 + 1 = 5
 stride 2 => (7 - 3)/2 + 1 = 3
 stride 3 => (7 - 3)/3 + 1 = 2.33, which doesn't fit!
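The output-size rule can be wrapped in a small helper (a hypothetical function name, not from the slides) that also flags strides that don't fit:

```python
def conv_output_size(n, f, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    if (n - f) % stride != 0:
        raise ValueError(f"stride {stride} does not fit: N - F = {n - f}")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises ValueError: (7 - 3)/3 + 1 is not an integer
```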
In practice: common to zero pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
7x7 output! With padding P, the output size becomes (N - F + 2P)/stride + 1 = (7 - 3 + 2)/1 + 1 = 7.

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
 F = 5 => zero pad with 2
 F = 7 => zero pad with 3
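With padding the formula generalizes to (N - F + 2P)/stride + 1; a short check (illustrative helper, not from the slides) confirms that zero-padding with (F-1)/2 at stride 1 preserves the input size:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size with zero padding: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

# Zero-padding with (F - 1)/2 at stride 1 preserves the 7x7 spatial size
for f in (3, 5, 7):
    pad = (f - 1) // 2
    print(f, pad, conv_output_size(7, f, stride=1, pad=pad))  # output stays 7
```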
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good and doesn't work well, which is why zero padding is used in practice.
Convolutional Layer Example
 Input volume: 32x32x3
 10 5x5 filters with stride 1, pad 2
 Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10
 Number of parameters in this layer?
 Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760
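The parameter count generalizes as follows (hypothetical helper for illustration):

```python
def conv_params(f, depth_in, num_filters):
    """Parameters in a conv layer: each filter has F*F*D_in weights + 1 bias."""
    per_filter = f * f * depth_in + 1
    return per_filter * num_filters

print(conv_params(5, 3, 10))  # 760: 76 params per filter * 10 filters
```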
Convolutional Layer Summary
• Accepts a volume of size W1 x H1 x D1
• Requires four hyperparameters:
  Number of filters K
  Their spatial extent F
  The stride S
  The amount of zero padding P
• Produces a volume of size W2 x H2 x D2 where:
  W2 = (W1 - F + 2P)/S + 1
  H2 = (H1 - F + 2P)/S + 1
  D2 = K
• With parameter sharing, it introduces F*F*D1 weights per filter, for a total of F*F*D1*K weights and K biases.
• In the output volume, the d-th depth slice (of size W2 x H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.

Common settings:
 K = powers of 2, e.g. 32, 64, 128, 512
 F = 3, S = 1, P = 1
 F = 5, S = 1, P = 2
 F = 5, S = 2, P = whatever fits
 F = 1, S = 1, P = 0
Pooling Layer
 Makes the representations smaller and more manageable
 Operates over each activation map independently

MAX POOLING
Single depth slice (4x4), max pooled with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        6 8
3 2 1 0   =>   3 4
1 2 3 4
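The max-pooling example above can be reproduced in NumPy (a sketch; the reshape trick assumes the input size is divisible by the pool size):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: split into 2x2 blocks, take the max of each
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]
```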
Pooling Layer Summary
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
  Their spatial extent F
  The stride S
• Produces a volume of size W2 x H2 x D2 where:
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1
  D2 = D1
• Introduces zero parameters since it computes a fixed function of the input.
• Note that it is not common to use zero-padding for pooling layers.

Common settings:
 F = 2, S = 2
 F = 3, S = 2
Fully Connected Layer
Contains neurons that connect to the entire input volume, as in ordinary Neural Networks.

Batch Size
• The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
• Think of a batch as a for-loop iterating over one or more samples and making predictions.
• At the end of the batch, the predictions are compared to the expected output variables and an error is calculated.
• A training dataset can be divided into one or more batches.
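A minimal sketch of dividing a dataset into batches (toy data, not a full training loop):

```python
import numpy as np

# Toy dataset: 10 samples, 1 feature (illustrative values only)
X = np.arange(10).reshape(10, 1)
batch_size = 4

sizes = []
for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]
    # forward pass + error calculation on this batch would go here,
    # followed by one update of the model parameters
    sizes.append(len(batch))

print(sizes)  # [4, 4, 2]: the last batch may be smaller
```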
CNN Example
1. Convolutional Layer

Input (5x5):       Filter (2x2):
1 1 1 0 0          1 -1
0 1 1 1 0          1  0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Sliding the filter over the input with stride 1 ((5 - 2)/1 + 1 = 4) produces a 4x4 feature map:

 0  1  2  1
-1  0  1  2
 0 -1  1  1
 0  0  1  1
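The worked example can be verified in NumPy (assuming, as in the slides, the 5x5 binary input and the 2x2 filter [[1, -1], [1, 0]]):

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, -1],
                   [1,  0]])

# Slide the 2x2 filter over the 5x5 input with stride 1: (5 - 2)/1 + 1 = 4
out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

print(out)
# [[ 0  1  2  1]
#  [-1  0  1  2]
#  [ 0 -1  1  1]
#  [ 0  0  1  1]]
```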
CNN Example
2. ReLU Activation

Applying ReLU (max(0, x)) elementwise:

 0  1  2  1             0 1 2 1
-1  0  1  2    ReLU     0 0 1 2
 0 -1  1  1    =>       0 0 1 1
 0  0  1  1             0 0 1 1
CNN Example
3. Pooling Layer

Max pooling with a 2x2 filter and stride 2:

0 1 2 1
0 0 1 2        1 2
0 0 1 1   =>   0 1
0 0 1 1
CNN Example
4. Fully Connected Layer

The 2x2 matrix from the Pooling Layer
1 2
0 1
is flattened into the vector [1, 2, 0, 1] and fed as input to the Fully Connected Layer.
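The remaining steps of the example (ReLU, max pooling, flattening) can be chained in a few lines of NumPy, starting from the feature map computed by the convolutional layer:

```python
import numpy as np

feature_map = np.array([[ 0,  1, 2, 1],
                        [-1,  0, 1, 2],
                        [ 0, -1, 1, 1],
                        [ 0,  0, 1, 1]])

relu = np.maximum(feature_map, 0)                   # 2. ReLU: clamp negatives to 0
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 3. 2x2 max pool, stride 2
flat = pooled.flatten()                             # 4. flatten for the FC layer

print(flat)  # [1 2 0 1]
```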
Convolutional Neural Networks
Summary
• ConvNets stack CONV, POOL, and FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Typical architectures look like:
  [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX
  where N is usually up to ~5, M is large, and 0 <= K <= 2
• But recent advances such as ResNet/GoogLeNet challenge this paradigm
