
COE101

Introductory
Artificial Intelligence
College of Engineering

Deep Learning -
Convolutional
Neural Networks
What are they?
• A class of neural networks that can take in an input image, assign importance (learnable weights and biases) to different parts of it, and differentiate one image from another
• Not fully connected: each neuron connects only to a local region of the input
• A much lower number of weights and biases to learn → much more manageable
A Typical Convolutional Neural Network Architecture

Applications
• Classification and retrieval
• Object detection and segmentation
• Self-driving cars: deep learning and computer vision regress three values (steering angle, acceleration, and brake levels) from images and sensor output


• Face recognition [Taigman et al. 2014]
• Activity detection [Simonyan et al. 2014]
• Pose estimation and gaming [Toshev and Szegedy 2014; Guo et al. 2014]


• Bioimaging, space exploration, and safe-driving detection [Levy et al. 2016; Sermanet et al. 2011; Dieleman et al. 2014; Ciresan et al.]
• Environmental applications, mapping, and GIS: whale recognition (Kaggle challenge); mapping [Mnih and Hinton, 2010]
• Captioning [Vinyals et al., 2015; Karpathy and Fei-Fei, 2015]. Generated captions range from no errors to minor errors to only somewhat related, e.g.:
  "A white teddy bear sitting in the grass"
  "A man in a baseball uniform throwing a ball"
  "A woman is holding a cat in her hand"
  "A man riding a wave on top of a surfboard"
  "A cat sitting on a suitcase on the floor"
  "A woman standing on a beach holding a surfboard"
Style Transfer
Deep Learning
Architectures
Architecture Over the Years
CNN Architectures: LeNet
 The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990s.
 The best known is the LeNet architecture, which was used to read zip codes, digits, etc.
CNN Architectures: AlexNet
 The first work that popularized Convolutional Networks in Computer Vision.
 This network has a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked directly on top of each other.
CNN Architectures: GoogLeNet
 Main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
CNN Architectures: VGGNet
 Main contribution is in showing that the depth of the network is a critical component for good performance.
 Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end.
CNN Architectures: ResNet
 Very deep networks using residual connections: each residual block adds its input x to the output of a small stack of layers, computing F(x) + x via an identity skip connection, followed by a ReLU.
 Features special skip connections and heavy use of batch normalization.
 ResNets are currently by far the state-of-the-art Convolutional Neural Network models and are the default choice for using ConvNets in practice.
CNN Architectures: Comparing Complexity
 AlexNet: smaller compute, still memory-heavy, lower accuracy.
 VGG: highest memory, most operations.
 GoogLeNet: most efficient.
 ResNet: moderate efficiency depending on model, highest accuracy.
Figures from "An Analysis of Deep Neural Network Models for Practical Applications," 2017. Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
Architecture
Overview

Convolution Layer
 Input: a 32x32x3 image (32 height x 32 width x 3 depth); convolution preserves spatial structure.
 Convolve a 5x5x3 filter with the image, i.e. "slide over the image spatially, computing dot products."
 Filters always extend the full depth of the input volume.
 Each position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias).
 Convolving (sliding) the filter over all spatial locations produces a 28x28x1 activation map.
 Consider a second, green filter: convolving it the same way produces a second 28x28x1 activation map.
For example, if we had 6 5x5 filters, we'd get 6 separate 28x28 activation maps.
We stack these up to get a "new image" of size 28x28x6!
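As a quick sketch of the stacking above (random filter values for illustration only; real filters are learned during training, and the per-filter bias is omitted for simplicity):

```python
import numpy as np

# Hypothetical 32x32x3 input and 6 random 5x5x3 filters (illustrative values)
image = np.random.rand(32, 32, 3)
filters = np.random.rand(6, 5, 5, 3)

out = np.zeros((28, 28, 6))          # (32 - 5)/1 + 1 = 28
for k in range(6):                   # one activation map per filter
    for i in range(28):
        for j in range(28):
            # dot product between the filter and a 5x5x3 chunk of the image
            out[i, j, k] = np.sum(image[i:i+5, j:j+5, :] * filters[k])

print(out.shape)  # (28, 28, 6): six activation maps stacked into a "new image"
```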


Preview
A ConvNet is a sequence of Convolution Layers, interspersed with activation functions:

32x32x3 -> [CONV, ReLU; e.g. 6 5x5x3 filters] -> 28x28x6 -> [CONV, ReLU; e.g. 10 5x5x6 filters] -> 24x24x10 -> ...
Preview
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
A closer look at spatial dimensions:

7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
 stride 1 => (7 - 3)/1 + 1 = 5
 stride 2 => (7 - 3)/2 + 1 = 3
 stride 3 => (7 - 3)/3 + 1 = 2.33, which doesn't fit!
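The output-size rule can be wrapped in a small helper (a hypothetical function name, not from the slides) that also flags strides that don't fit:

```python
def conv_output_size(n, f, stride):
    """Spatial output size of a conv layer: (N - F) / stride + 1."""
    if (n - f) % stride != 0:
        raise ValueError(f"stride {stride} does not fit: N - F = {n - f}")
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises ValueError: (7 - 3)/3 + 1 is not an integer
```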
In practice: common to zero pad the border.
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
7x7 output! With padding P, the output size becomes (N - F + 2P)/stride + 1 = (7 - 3 + 2)/1 + 1 = 7.

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
 F = 5 => zero pad with 2
 F = 7 => zero pad with 3
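With padding the formula generalizes to (N - F + 2P)/stride + 1; a short check (illustrative helper, not from the slides) confirms that zero-padding with (F-1)/2 at stride 1 preserves the input size:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size with zero padding: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

# Zero-padding with (F - 1)/2 at stride 1 preserves the 7x7 spatial size
for f in (3, 5, 7):
    pad = (f - 1) // 2
    print(f, pad, conv_output_size(7, f, stride=1, pad=pad))  # output stays 7
```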
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good and doesn't work well, which is why zero padding is used in practice.
Convolutional Layer Example
 Input volume: 32x32x3
 10 5x5 filters with stride 1, pad 2
 Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10
 Number of parameters in this layer?
 Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76*10 = 760
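The parameter count generalizes as follows (hypothetical helper for illustration):

```python
def conv_params(f, depth_in, num_filters):
    """Parameters in a conv layer: each filter has F*F*D_in weights + 1 bias."""
    per_filter = f * f * depth_in + 1
    return per_filter * num_filters

print(conv_params(5, 3, 10))  # 760: 76 params per filter * 10 filters
```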
Convolutional Layer Summary
• Accepts a volume of size W1 x H1 x D1
• Requires four hyperparameters:
  Number of filters K
  Their spatial extent F
  The stride S
  The amount of zero padding P
• Produces a volume of size W2 x H2 x D2 where:
  W2 = (W1 - F + 2P)/S + 1
  H2 = (H1 - F + 2P)/S + 1
  D2 = K
• With parameter sharing, it introduces F*F*D1 weights per filter, for a total of F*F*D1*K weights and K biases.
• In the output volume, the d-th depth slice (of size W2 x H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.

Common settings:
 K = powers of 2, e.g. 32, 64, 128, 512
 F = 3, S = 1, P = 1
 F = 5, S = 1, P = 2
 F = 5, S = 2, P = whatever fits
 F = 1, S = 1, P = 0
Pooling Layer
 Makes the representations smaller and more manageable
 Operates over each activation map independently

MAX POOLING
Single depth slice (4x4), max pooled with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        6 8
3 2 1 0   =>   3 4
1 2 3 4
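The max-pooling example above can be reproduced in NumPy (a sketch; the reshape trick assumes the input size is divisible by the pool size):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: split into 2x2 blocks, take the max of each
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 8]
#  [3 4]]
```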
Pooling Layer Summary
• Accepts a volume of size W1 x H1 x D1
• Requires two hyperparameters:
  Their spatial extent F
  The stride S
• Produces a volume of size W2 x H2 x D2 where:
  W2 = (W1 - F)/S + 1
  H2 = (H1 - F)/S + 1
  D2 = D1
• Introduces zero parameters since it computes a fixed function of the input.
• Note that it is not common to use zero-padding for pooling layers.

Common settings:
 F = 2, S = 2
 F = 3, S = 2
Fully Connected Layer
Contains neurons that connect to the entire input volume, as in ordinary Neural Networks.

Batch Size
• The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
• Think of a batch as a for-loop iterating over one or more samples and making predictions.
• At the end of the batch, the predictions are compared to the expected output variables and an error is calculated.
• A training dataset can be divided into one or more batches.
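A minimal sketch of dividing a dataset into batches (toy data, not a full training loop):

```python
import numpy as np

# Toy dataset: 10 samples, 1 feature (illustrative values only)
X = np.arange(10).reshape(10, 1)
batch_size = 4

sizes = []
for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]
    # forward pass + error calculation on this batch would go here,
    # followed by one update of the model parameters
    sizes.append(len(batch))

print(sizes)  # [4, 4, 2]: the last batch may be smaller
```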
CNN Example
1. Convolutional Layer

Input (5x5):       Filter (2x2):
1 1 1 0 0          1 -1
0 1 1 1 0          1  0
0 0 1 1 1
0 0 1 1 0
0 1 1 0 0

Sliding the filter over the input with stride 1 ((5 - 2)/1 + 1 = 4) produces a 4x4 feature map:

 0  1  2  1
-1  0  1  2
 0 -1  1  1
 0  0  1  1
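The worked example can be verified in NumPy (assuming, as in the slides, the 5x5 binary input and the 2x2 filter [[1, -1], [1, 0]]):

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, -1],
                   [1,  0]])

# Slide the 2x2 filter over the 5x5 input with stride 1: (5 - 2)/1 + 1 = 4
out = np.zeros((4, 4), dtype=int)
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

print(out)
# [[ 0  1  2  1]
#  [-1  0  1  2]
#  [ 0 -1  1  1]
#  [ 0  0  1  1]]
```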
CNN Example
2. ReLU Activation

Applying ReLU (max(0, x)) elementwise:

 0  1  2  1             0 1 2 1
-1  0  1  2    ReLU     0 0 1 2
 0 -1  1  1    =>       0 0 1 1
 0  0  1  1             0 0 1 1
CNN Example
3. Pooling Layer

Max pooling with a 2x2 filter and stride 2:

0 1 2 1
0 0 1 2        1 2
0 0 1 1   =>   0 1
0 0 1 1
CNN Example
4. Fully Connected Layer

The 2x2 matrix from the Pooling Layer
1 2
0 1
is flattened into the vector [1, 2, 0, 1] and fed as input to the Fully Connected Layer.
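The remaining steps of the example (ReLU, max pooling, flattening) can be chained in a few lines of NumPy, starting from the feature map computed by the convolutional layer:

```python
import numpy as np

feature_map = np.array([[ 0,  1, 2, 1],
                        [-1,  0, 1, 2],
                        [ 0, -1, 1, 1],
                        [ 0,  0, 1, 1]])

relu = np.maximum(feature_map, 0)                   # 2. ReLU: clamp negatives to 0
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))  # 3. 2x2 max pool, stride 2
flat = pooled.flatten()                             # 4. flatten for the FC layer

print(flat)  # [1 2 0 1]
```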
Convolutional Neural Networks
Summary
• ConvNets stack CONV, POOL, and FC layers
• Trend towards smaller filters and deeper architectures
• Trend towards getting rid of POOL/FC layers (just CONV)
• Typical architectures look like:
  [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX
  where N is usually up to ~5, M is large, and 0 <= K <= 2
• But recent advances such as ResNet/GoogLeNet challenge this paradigm
