
BIL 722: Advanced Topics in Computer Vision

(Deep Learning for Computer Vision)


Spring 2016

Week 3: Convolutional Neural Nets


Aykut Erdem

AlexNet, Krizhevsky et al., 2012


Administrative
Paper presentations will be starting next week!

• 8 papers not chosen yet


• Any volunteers for Theano, Keras, CNTK, Marvin
tutorials?

2
Presentations
• You can use any presentation tool (e.g., Powerpoint, Keynote,
LaTex) provided that the tool has options to export the slides
to PDF.
• Each presentation should be clear, well organized and very
technical, and roughly 30 minutes long.
• You are allowed to reuse material that already exists on the web
as long as you clearly cite the source of the media that you
have used in your presentation.
• Extra credit will be awarded to those students who also
conduct some experiments demonstrating how the method
works in practice.

3
Presentations
Deadline:
• You should meet with me 3-4 days before the
presentation date to discuss your slides
• The presentation should be submitted by the night before
the class
• Presentations grading rubric on the webpage

4
Suggested Presentation Outline
• High-level overview of the paper (main contributions)
• Problem statement and motivation (clear definition of the
problem, why it is interesting and important)
• Key technical ideas (overview of the approach)
• Experimental set-up (datasets, evaluation metrics, applications)
• Strengths and weaknesses (discussion of the results obtained)
• Connections with other work (how it relates to other
approaches, its similarities and differences)
• Future direction (open research questions)
5
Homework
Due March 15, 2016
(by 12:30pm)

• Fine-tuning a pre-trained model to


classify cultural events on the image
data from ChaLearn Looking at
People 2015 Challenge (CVPR 2015)
• The purpose is to help you learn
about the fundamentals of training
and understanding convolutional
networks:
- applying dropout, batch normalization and data augmentation to reduce
overfitting,
- combining models into ensembles to improve the performance,
- using transfer learning to adapt a pre-trained model to a new dataset,
- using data gradients to visualize saliency maps
6
A bit of history
Hubel & Wiesel:
1959: Receptive fields of single neurones in the cat's striate cortex
1962: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex
1968 ...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
https://youtu.be/8VdFf3egwfg?t=1m10s 7
8
A bit of history
Topographical mapping
in the cortex:
nearby cells in cortex
represented
nearby regions in the
visual field
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

9
10
Hierarchical organization
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
The “Halle Berry” Neuron
Invariant visual representation by single neurons
in the human brain [Quiroga et al., Nature, 2005]

A single unit in the right


anterior hippocampus that
responds to pictures of the
actress Halle Berry
11
A bit of history “sandwich” architecture (SCSCSC…)
simple cells: modifiable parameters
complex cells: perform pooling

Neocognitron
[Fukushima 1980]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

12
A bit of history
Gradient-based learning
applied to document
recognition
[LeCun, Bottou, Bengio, Haffner
1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

LeNet-5

13
A bit of history
ImageNet Classification with Deep
Convolutional Neural Networks
[Krizhevsky, Sutskever, Hinton, 2012]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“AlexNet”
14
Convolutional Neural Networks
(First without the brain stuff)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

15
Convolution Layer
32x32x3 image

32 height
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 width
3 depth

16
Convolution Layer
32x32x3 image

5x5x3 filter
32

Convolve the filter with the image


i.e. “slide over the image spatially,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

computing dot products”

32
3

17
Convolution Layer
Filters always extend the full
depth of the input volume
32x32x3 image

5x5x3 filter
32

Convolve the filter with the image


i.e. “slide over the image spatially,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

computing dot products”

32
3

18
Convolution Layer
32x32x3 image
5x5x3 filter
32

1 number:
the result of taking a dot product between the
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

filter and a small 5x5x3 chunk of the image


32 (i.e. 5*5*3 = 75-dimensional dot product + bias)
3

19
Convolution Layer
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

20
Convolution Layer
consider a second, green filter
32x32x3 image activation maps
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

21
For example, if we had 6 5x5 filters, we’ll get 6
separate activation maps:
activation maps

32

28

Convolution Layer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 6

We stack these up to get a “new image” of size 28x28x6!

22
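A minimal numpy sketch (not course-provided code) of what this layer computes: sliding each 5x5x3 filter over a 32x32x3 image at stride 1 gives one 28x28 activation map, and stacking 6 such maps gives the 28x28x6 output volume. Array names and shapes are illustrative.

```python
import numpy as np

def conv_forward(x, filters, biases, stride=1):
    """Naive convolution: x is HxWxC, filters is KxFxFxC, biases has length K."""
    H, W, C = x.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                      # one activation map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]  # 75-dim dot product + bias
    return out

x = np.random.randn(32, 32, 3)   # 32x32x3 image
w = np.random.randn(6, 5, 5, 3)  # 6 filters of size 5x5x3
b = np.zeros(6)
print(conv_forward(x, w, b).shape)  # (28, 28, 6)
```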
Preview: ConvNet is a sequence of Convolutional
Layers, interspersed with activation functions

32 28

CONV,
ReLU
e.g. 6
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

5x5x3
32 filters 28
3 6

23
Preview: ConvNet is a sequence of Convolutional
Layers, interspersed with activation functions

32 28 24

….
CONV, CONV, CONV,
ReLU ReLU ReLU
e.g. 6 e.g. 10
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

24
25
Preview
[From recent Yann LeCun slides]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
26
Preview
[From recent Yann LeCun slides]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
one filter =>
one activation map
example 5x5 filters
(32 total)

We call the layer convolutional


because it is related to convolution
of two signals:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

elementwise multiplication
and sum of a filter and the
signal (image)

27
28
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

29
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

30
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

31
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

33
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

34
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

35
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

36
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

37
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

38
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
7
doesn’t fit!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

cannot apply 3x3 filter on


7x7 input with stride 3.

39
N
Output size:
(N - F) / stride + 1
F
e.g. N = 7, F = 3:
N stride 1 => (7 - 3)/1 + 1 = 5
F stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

40
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
(recall:)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(N - F) / stride + 1

41
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

42
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

43
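A quick sanity check (a sketch, not slide-provided code): the output-size formula (N - F + 2P)/stride + 1 reproduces the numbers above, and padding with (F-1)/2 at stride 1 preserves the spatial size.

```python
def conv_output_size(N, F, stride, pad=0):
    size = (N - F + 2 * pad) / stride + 1
    assert size == int(size), "filter does not fit"  # e.g. N=7, F=3, stride 3
    return int(size)

print(conv_output_size(7, 3, 1))          # 5
print(conv_output_size(7, 3, 2))          # 3
print(conv_output_size(7, 3, 1, pad=1))   # 7: zero-padding with (F-1)/2 preserves size
print(conv_output_size(32, 5, 1, pad=2))  # 32, as in the example on the next slides
```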
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters
shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t
work well.

32 28 24

….
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

CONV, CONV, CONV,


ReLU ReLU ReLU
e.g. 6 e.g. 10
5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

44
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size: ?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

45
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size:


(32+2*2-5)/1+1 = 32 spatially, so
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32x32x10

46
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

47
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?


each filter has 5*5*3 + 1 = 76 params
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(+1 for bias)


=> 76*10 = 760

48
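A small illustrative sketch of the parameter count for this layer: each 5x5x3 filter has 75 weights plus a bias, and there are 10 filters.

```python
F, C, K = 5, 3, 10                 # filter size, input depth, number of filters
params_per_filter = F * F * C + 1  # +1 for the bias
total_params = params_per_filter * K
print(params_per_filter, total_params)  # 76 760
```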
49
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Common settings:

K = (powers of 2, e.g. 32, 64, 128, 512)


- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

50
(btw, 1x1 convolution layers make perfect sense)

1x1 CONV
56 with 32 filters
56
(each filter has size
1x1x64, and performs a
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

64-dimensional dot
56 product)
56
64 32

51
52
Example: CONV layer in Torch
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
53
Example: CONV layer in Caffe
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
54
Example: CONV layer in Lasagne
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter
32
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

1 number:
32 the result of taking a dot product between
the filter and this part of the image
3
(i.e. 5*5*3 = 75-dimensional dot product)

55
The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter
32

It’s just a neuron with local


connectivity...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

1 number:
32 the result of taking a dot product between
the filter and this part of the image
3
(i.e. 5*5*3 = 75-dimensional dot product)

56
The brain/neuron view of CONV Layer

32

28 An activation map is a 28x28 sheet of neuron


outputs:
1. Each is connected to a small region in the input
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

2. All of them share parameters

32
28 “5x5 filter” -> “5x5 receptive field for each neuron”
3

57
The brain/neuron view of CONV Layer

32

E.g. with 5 filters,


28 CONV layer consists of
neurons arranged in a 3D grid
(28x28x5)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

There will be 5 different


32 28 neurons all looking at the same
3 region in the input volume
5

58
59
two more layers to go: POOL/FC
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

60
Max Pooling

Single depth slice (x):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 =>
6 8
3 4
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

61
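A minimal numpy sketch of 2x2 max pooling with stride 2 on the single depth slice above (the helper and array names are illustrative).

```python
import numpy as np

def max_pool(x, F=2, stride=2):
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - F + 1, stride):
        for j in range(0, W - F + 1, stride):
            out[i // stride, j // stride] = x[i:i+F, j:j+F].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))  # [[6. 8.]
                    #  [3. 4.]]
```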
62
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
63
Common settings:

F = 2, S = 2
F = 3, S = 2
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary
Neural Networks
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

64
[ConvNetJS demo: training on CIFAR-10]

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

65
Case Study: LeNet-5 [LeCun et al., 1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Conv filters were 5x5, applied at stride 1


Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

66
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

67
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: What is the total number of parameters in this layer?

68
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Parameters: (11*11*3)*96 = 35K

69
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

70
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: what is the number of parameters in this layer?

71
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Parameters: 0!

72
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
After POOL1: 27x27x96
...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

73
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1


[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
74
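A short sketch (layer list taken from the slide above, code itself illustrative) that re-derives the spatial sizes of the simplified AlexNet with the (N - F + 2P)/S + 1 formula.

```python
# (name, filter size F, stride S, pad P, output depth) for the CONV/POOL layers above
layers = [("CONV1", 11, 4, 0, 96),  ("POOL1", 3, 2, 0, 96),
          ("CONV2",  5, 1, 2, 256), ("POOL2", 3, 2, 0, 256),
          ("CONV3",  3, 1, 1, 384), ("CONV4", 3, 1, 1, 384),
          ("CONV5",  3, 1, 1, 256), ("POOL3", 3, 2, 0, 256)]

N = 227  # input is 227x227x3
for name, F, S, P, depth in layers:
    N = (N - F + 2 * P) // S + 1
    print(name, f"{N}x{N}x{depth}")
# CONV1 55x55x96, POOL1 27x27x96, CONV2 27x27x256, POOL2 13x13x256,
# CONV3/4 13x13x384, CONV5 13x13x256, POOL3 6x6x256
```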
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture: Details/Retrospectives:


[227x227x3] INPUT - first use of ReLU
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 - used Norm layers (not common
[27x27x96] MAX POOL1: 3x3 filters at stride 2 anymore)
[27x27x96] NORM1: Normalization layer - heavy data augmentation
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 - dropout 0.5
[13x13x256] MAX POOL2: 3x3 filters at stride 2 - batch size 128
[13x13x256] NORM2: Normalization layer - SGD Momentum 0.9
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 - Learning rate 1e-2, reduced by 10
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 manually when val accuracy plateaus
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 - L2 weight decay 5e-4
[6x6x256] MAX POOL3: 3x3 filters at stride 2
- 7 CNN ensemble: 18.2% -> 15.4%
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
75
Case Study: ZFNet [Zeiler and Fergus, 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%


76
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

best model
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

11.2% top 5 error in ILSVRC 2013


->
7.3% top 5 error

77
(not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most memory is in the early CONV layers; most params are in the late FC layers.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

80
Case Study: GoogLeNet
[Szegedy et al.,
2014]

Inception module
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

ILSVRC 2014 winner (6.7% top 5 error)

81
Case Study: GoogLeNet
Fun features:

- Only 5 million params!


(Removes FC layers
completely)

Compared to AlexNet:
- 12X less params
- 2x more compute
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- 6.67% (vs. 16.4%)

82
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner
(3.6% top 5 error)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Slide from Kaiming He’s recent presentation


https://www.youtube.com/watch?v=1PGLj-uKT1w 83
84
(slide from Kaiming He’s recent presentation)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
85
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner
(3.6% top 5 error)

2-3 weeks of training


on 8 GPU machine

at runtime: faster
than a VGGNet!
(even though it has
8x more layers)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(slide from Kaiming He’s recent presentation)


86
Case Study: ResNet [He et al., 2015]

224x224x3

spatial dimension
only 56x56!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

87
88
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]

- Batch Normalization after every CONV layer


- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

89
90
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(this trick is also used in GoogLeNet)

91
92
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
93
Case Study Bonus: DeepMind’s
AlphaGo
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)

94
Summary

- ConvNets stack CONV,POOL,FC layers


- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

where N is usually up to ~5, M is large, 0 <= K <= 2.


- but recent advances such as ResNet/GoogLeNet
challenge this paradigm

95
Tips and Tricks

96
• Shuffle the training samples

• Use Dropout and Batch


Normalization for regularization

97
Input representation
“Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image”
• Centered (0-mean) RGB values.
(figure: an input image (256x256), minus the mean input image)
slide by Alex Krizhevsky
98
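A minimal sketch of the preprocessing described above: compute the mean training image and subtract it from every input. The `train_images` array is a random stand-in for the real data.

```python
import numpy as np

# Stand-in for the training set: 100 random 256x256 RGB images.
train_images = np.random.randint(0, 256, size=(100, 256, 256, 3), dtype=np.uint8)
mean_image = train_images.mean(axis=0)                     # the 256x256x3 mean input image
centered = train_images.astype(np.float32) - mean_image    # 0-mean RGB values
```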
Data Augmentation
• Our neural net has 60M
real-valued parameters and
650,000 neurons
• It overfits a lot. Therefore,
they train on 224x224
patches extracted randomly
from 256x256 images, and
also their horizontal
reflections.
“This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly inter-dependent.” [Krizhevsky et al. 2012]
slide by Alex Krizhevsky
99
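A sketch of the crop-and-flip augmentation described above: take a random 224x224 patch of a 256x256 image and flip it horizontally half the time. The helper and names are illustrative.

```python
import numpy as np

def random_crop_flip(img, crop=224):
    H, W, _ = img.shape                     # e.g. a 256x256x3 image
    top = np.random.randint(0, H - crop + 1)
    left = np.random.randint(0, W - crop + 1)
    patch = img[top:top+crop, left:left+crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]              # horizontal reflection
    return patch
```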
Data Augmentation
• Alter the intensities of the RGB channels in training images.
“Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1…This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.”
slide by Alex Krizhevsky
[Krizhevsky et al. 2012] 100
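A sketch of the PCA color augmentation quoted above, under the assumption that `pixels` is an (N, 3) array of RGB values collected from the training set: add a random multiple of each principal component, scaled by its eigenvalue.

```python
import numpy as np

pixels = np.random.rand(10000, 3)                    # stand-in for all training RGB values
eigvals, eigvecs = np.linalg.eigh(np.cov(pixels.T))  # 3 eigenvalues / eigenvectors of the RGB covariance

def pca_color_jitter(img, sigma=0.1):
    alphas = np.random.normal(0.0, sigma, size=3)    # one alpha per principal component
    shift = eigvecs @ (alphas * eigvals)             # 3-vector added to every pixel
    return img + shift

augmented = pca_color_jitter(np.random.rand(224, 224, 3))
```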


101
Data Augmentation
Horizontal flips
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Data Augmentation
Get creative!

Random mix/combinations of :
- translation
- rotation
- stretching
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- shearing,
- lens distortions, … (go crazy)

102
Data augmentation improves human
learning, not just deep learning

If you're trying to improve your golf swing or master that tricky guitar
chord progression, here's some good news from researchers at Johns
Hopkins University: You may be able to double how quickly you learn
skills like these by introducing subtle variations into your practice
routine.

The received wisdom on learning motor skills goes something like this:
You need to build up "muscle memory" in order to perform mechanical
tasks, like playing musical instruments or sports, quickly and efficiently.
And the way you do that is via rote repetition — return hundreds of
tennis serves, play that F major scale over and over until your fingers
bleed, etc.

The wisdom on this isn't necessarily wrong, but the Hopkins research suggests it's incomplete. Rather than doing the same thing over and over, you might be able to learn things even faster — like, twice as fast — if you change up your routine. Practicing your baseball swing? Change the size and weight of your bat. Trying to nail a 12-bar blues in A major on the guitar? Spend 20 minutes playing the blues in E major, too. Practice your backhand using tennis rackets of varying size and weight.

https://www.washingtonpost.com/news/wonk/wp/2016/02/12/how-to-learn-new-skills-twice-as-fast/
103
Convolutions alone are not enough!
Convolutions alone cannot handle this: scale, shift, rotation, color space.
slide by Alex Smola
104


Invariance and Covariance
[Figure from Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, 2014: canonical distance in layers 1 and 7 and P(true class) as a function of vertical translation (pixels), scale (ratio), and rotation (degrees), for example classes Lawn Mower, Shih-Tzu, African Crocodile, African Grey, and Entertainment Center.]

Figure 4. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the origin…

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, 2014]
105
Invariance and Covariance
[Table 3 (from the paper): relative variance and intrinsic dimensionality averaged over experiments for different object categories and viewpoints (3D orientation, translation, and scale); each cell gives relative variance (top) and intrinsic dimension (bottom) for pool5, fc6 and fc7 features of the Places, AlexNet and VGG networks, split into viewpoint and style factors.]

[Figure 3: PCA embeddings for 2D position on AlexNet. Figure 4: PCA embeddings for different factors (lighting, scale, object color, background color) using AlexNet pool5 features on “car” images; colors in (a) correspond to location of the light source (green – center).]

These results support the intuition that higher layers are more invariant to viewpoint; the residual feature is less important in higher layers, indicating style and viewpoint are more easily separable in those layers.

Understanding deep features with computer-generated imagery, [Aubry and Russell, '15]
slide by Joan Bruna
106
Transfer Learning

“You need a lot of data if you want to


train/use CNNs”
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

107
Transfer Learning with ConvNets

• Keep layers 1-7 of our ImageNet-trained model fixed
• Train a new softmax classifier on top using the training images of the new dataset

[Slide shows tables and text from the paper: on Caltech-101 the ImageNet-pretrained convnet reaches 83.8 ± 0.5 / 86.5 ± 0.5 % accuracy with 15 / 30 training images per class, versus 22.8 ± 1.5 / 46.5 ± 1.7 % for a convnet trained from scratch; on Caltech-256 it beats the state of the art (Bo et al., 2013) by a significant margin, 74.2% vs 55.2% for 60 training images/class; on PASCAL VOC 2012 it reaches a mean of 79.0% against 74.3% and 82.2% for the two leading methods; Table 7 shows that features from higher layers generally produce more discriminative features.]

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, 2014]
108
Stability: Transfer Learning with ConvNets
• A ConvNet trained on a (large enough) dataset generalizes to other visual tasks

Figure 4. t-SNE map of 20,000 Flickr test images based on features extracted from the last layer of an AlexNet trained with K = 1,000. A full-resolution map is presented in the supplemental material. The inset shows a cluster of sports.

[Also shown: Figure 6, a t-SNE map of 10,000 words based on their learned embeddings, and a discussion of one-versus-all vs. multiclass logistic loss (precision@10 of 16.43 vs 17.98 with a dictionary of K = 1,000 words).]

“Learning visual features from Large Weakly supervised Data”, [Joulin et al., '15]
slide by Joan Bruna
109
Transfer Learning with ConvNets

1. Train on
Imagenet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

110
Transfer Learning with ConvNets

2. Small dataset:
1. Train on feature extractor
Imagenet

Freeze
these
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Train
this

111
Transfer Learning with ConvNets

2. Small dataset: 3. Medium dataset:


1. Train on feature extractor finetuning
Imagenet
more data = retrain more of
the network (or all of it)
Freeze these
Freeze
these
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Train Train this


this

112
Transfer Learning with ConvNets

2. Small dataset: 3. Medium dataset:


1. Train on feature extractor finetuning
Imagenet
more data = retrain more of
the network (or all of it)
Freeze these
Freeze tip: use only ~1/10th of
these the original learning rate
in finetuning top layer,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

and ~1/100th on
intermediate layers

Train Train this


this

113
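A framework-agnostic sketch (not from the slides) of the fine-tuning recipe above: keep early layers frozen, fine-tune intermediate layers with a much smaller learning rate, and train the new top layer at the base rate. `params` and `grads` stand in for whatever your framework exposes.

```python
import numpy as np

params = {"conv1": np.random.randn(96, 11, 11, 3),  # early layer: frozen
          "fc7":   np.random.randn(4096, 4096),     # intermediate layer: fine-tuned slowly
          "fc8":   np.random.randn(10, 4096)}       # new classifier for the new dataset
grads = {k: np.random.randn(*v.shape) for k, v in params.items()}  # stand-in gradients

base_lr = 1e-3               # e.g. ~1/10th of the rate used during pretraining
lr_mult = {"conv1": 0.0,     # frozen: no update
           "fc7":   0.1,     # roughly ~1/100th of the original rate overall
           "fc8":   1.0}     # new top layer uses the full fine-tuning rate

for name in params:
    params[name] -= base_lr * lr_mult[name] * grads[name]  # plain SGD step
```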
Today ConvNets are everywhere
Classification Retrieval
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Krizhevsky 2012]

114
Today ConvNets are everywhere
Detection Segmentation
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]

115
Today ConvNets are everywhere

NVIDIA Tegra X1
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

self-driving cars

116
Today ConvNets are everywhere
[Taigman et al. 2014]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Simonyan et al. 2014] [Goodfellow 2014]

117
Today ConvNets are everywhere

[Toshev, Szegedy 2014]


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Mnih 2013]

118
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Ciresan et al. 2013] [Sermanet et al. 2011]


[Ciresan et al.]

119
Today ConvNets are everywhere

[Denil et al. 2014]


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Turaga et al., 2010]

120
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Whale recognition, Kaggle Challenge Mnih and Hinton, 2010

121
Today ConvNets are everywhere
Image
Captioning
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Vinyals et al., 2015]

122
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

reddit.com/r/deepdream

123
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Deep Neural Networks Rival the Representation of Primate IT


Cortex for Core Visual Object Recognition
[Cadieu et al., 2014]
124
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Deep Neural Networks Rival the Representation of Primate IT


Cortex for Core Visual Object Recognition
[Cadieu et al., 2014]
125
Understanding ConvNets

126
Understanding ConvNets
- Visualize patches that maximally activate neurons
- Visualize the weights
- Visualize the representation space (e.g. with t-SNE)
- Occlusion experiments
- Human experiment comparisons
- Deconv approaches (single backward pass)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- Optimization over image approaches (optimization)

127
Visualize patches that maximally
activate neurons
one-stream AlexNet

pool5
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Rich feature hierarchies for accurate object detection and semantic segmentation
[Girshick, Donahue, Darrell, Malik]
128
Visualize the filters/kernels
(raw weights)
one-stream AlexNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

conv1

only interpretable on the first layer :(


129
Visualize the layer 1
weights
filters/kernels
(raw weights) layer 2
weights

you can still do it


for higher layers,
it’s just not that
interesting
layer 3
(these are taken
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

weights
from ConvNetJS
CIFAR-10 demo)

130
131
The gabor-like filters fatigue
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Visualizing the representation
fc7 layer

4096-dimensional “code” for an image


(layer immediately before the classifier)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

can collect the code for many images

132
Visualizing the representation

t-SNE visualization
[van der Maaten & Hinton]

Embed high-dimensional points so


that locally, pairwise distances are
conserved

i.e. similar things end up in similar


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

places. dissimilar things end up


wherever

Right: Example embedding of


MNIST digits (0-9) in 2D
133
t-SNE
visualization:
two images are
placed nearby if
their CNN codes
are close. See
more:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

http://cs.stanford.edu/people/
karpathy/cnnembed/

134
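A sketch of how such an embedding could be produced, assuming `codes` holds one 4096-dimensional fc7 vector per image; this uses scikit-learn's TSNE, and the parameters are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

codes = np.random.randn(500, 4096)                # stand-in for fc7 codes of 500 images
xy = TSNE(n_components=2).fit_transform(codes)    # (500, 2) coordinates for plotting
```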
Occlusion experiments
[Zeiler & Fergus 2013]

(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

135
Occlusion experiments
[Zeiler & Fergus 2013]

(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

136
Visualizing Activations
http://yosinski.com/deepvis

YouTube video
https://www.youtube.com/watch?v=A
gkfIQ4IGaM
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(4min)

137
138
Deconv approaches
1. Feed image into net

Q: how can we compute the gradient of any


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

arbitrary neuron in the network w.r.t. the image?

139
140
Deconv approaches
1. Feed image into net
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Deconv approaches
1. Feed image into net

2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

141
Deconv approaches
1. Feed image into net

2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“Guided
backpropagation:”
instead

142
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

143
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Backward pass for a ReLU (will be changed in Guided


Backprop)

144
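A numpy sketch of the modified ReLU backward pass used by guided backpropagation (a sketch, not the authors' code): the ordinary backward pass routes the gradient only where the forward input was positive, and guided backprop additionally zeroes out negative incoming gradients.

```python
import numpy as np

def relu_backward(dout, x):
    return dout * (x > 0)                  # standard ReLU backward pass

def guided_relu_backward(dout, x):
    return dout * (x > 0) * (dout > 0)     # also drop negative gradients coming from above
```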
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

145
[Striving for Simplicity: The all convolutional net,
Springenberg, Dosovitskiy, et al., 2015]
Visualization of patterns
learned by the layer
conv6 (top) and layer
conv9 (bottom) of the
network trained on
ImageNet.

Each row corresponds to


one filter.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The visualization using


“guided backpropagation”
is based on the top 10
image patches activating
this filter taken from the
ImageNet dataset.

146
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

bit weird

147
[Visualizing and Understanding Convolutional Networks
Zeiler & Fergus, 2013]
Visualizing arbitrary neurons along the way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

148
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

149
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

150
Optimization to Image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: can we find an image that maximizes


some class score?
151
Optimization to Image

score for class c (before Softmax)


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: can we find an image that maximizes


some class score?
152
Optimization to Image

1. feed in
zeros. zero image

2. set the gradient of the scores vector to be


[0,0,....1,....,0], then backprop to image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

153
Optimization to Image

1. feed in
zeros. zero image

2. set the gradient of the scores vector to be


[0,0,....1,....,0], then backprop to image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

3. do a small “image update”


4. forward the image through
the network. score for class c (before Softmax)
5. go back to 2.

154
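A schematic sketch of the loop above. `score_and_grad` stands in for a forward/backward pass through a real ConvNet that returns the (pre-softmax) score of class c and its gradient with respect to the image; here it is a dummy linear model so the snippet runs.

```python
import numpy as np

# Dummy stand-in "network": linear class scores over a small 32x32x3 image.
W = np.random.randn(10, 32 * 32 * 3) * 1e-3

def score_and_grad(x, c):
    scores = W @ x.ravel()
    return scores[c], W[c].reshape(x.shape)   # d(score_c)/dx for a linear model

x = np.zeros((32, 32, 3))        # 1. feed in a zero image
c, lr, reg = 3, 1.0, 1e-4        # class of interest, step size, L2 regularization
for step in range(100):
    s, g = score_and_grad(x, c)  # 2./4. forward pass + backprop of the class score to the image
    x += lr * (g - reg * x)      # 3. small "image update" (gradient ascent on the score)
```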
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

1. Find images that maximize some class score:


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

155
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

1. Find images that maximize some class score:


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

156
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

2. Visualize the
Data gradient:

(note that the gradient on


data has three channels.
M=?
Here they visualize M, s.t.:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(at each pixel take abs val, and max


over channels)

157
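A one-line sketch of the map M described above, assuming `grad` is the class-score gradient with respect to the input image, with shape (H, W, 3): take the absolute value and the max over channels at each pixel.

```python
import numpy as np

grad = np.random.randn(224, 224, 3)   # stand-in for the data gradient
M = np.abs(grad).max(axis=2)          # (224, 224) saliency map
```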
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

2. Visualize the
Data gradient:

(note that the gradient on


data has three channels.
Here they visualize M, s.t.:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(at each pixel take abs val, and max


over channels)

158
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

- Use grabcut for


segmentation
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

159
We can in fact do this for arbitrary neurons
along the ConvNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Repeat:
1. Forward an image
2. Set activations in layer of interest to all zero, except
for a 1.0 for a neuron of interest
3. Backprop to image
4. Do an “image update”
160
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]

Proposed a different form of regularizing the image

More explicit scheme:


Repeat:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- Update the image x with gradient from some unit of


interest
- Blur x a bit
- Take any pixel with small norm to zero (to encourage
sparsity)
161
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
http://yosinski.com/deepvis
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

162
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

163
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

164
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

165
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

Class #34, Leatherback Turtle Class #76, Tarantula

166
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

Class #144, Pelican Class #944, Artichoke

167
Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al., '16]

Algorithm 1 Multifaceted Feature Visualization
Input: a set of images U and a number of facets k
1. For each image in U, compute the high-level (here fc7) hidden code i
2. Reduce the dimensionality of each code i from 4096 to 50 via PCA.
3. Run t-SNE visualization on the entire set of codes i to produce a 2-D embedding (examples in Fig. 4).
4. Locate k clusters in the embedding via k-means.
for each cluster:
5. Compute a mean image x0 by averaging the 15 images nearest to the cluster centroid.
6. Run activation maximization (see Section 2.2), but initialize it with x0 instead of a random image.
Output: a set of facet visualizations {x1, x2, ..., xk}.

Note that here we only visualize 10 facets per neuron, but it is possible to visualize fewer or more facets by changing k. We compute a mean image by averaging m = 15 images (Algorithm 1, step 5) as it works the best compared to m = {1, 50, 100, 200}.
168
Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al.,’16]
Multifaceted Feature Visualization

169
Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al.,’16]

Figure 6. Multifaceted visualization of example neuron feature detectors from all eight layers of a deep convolutional neural network.
The images reflect the true sizes of the receptive fields at different layers. For each neuron, we show visualizations of 4 different
170
Question: Given a CNN code, is it
possible to reconstruct the original
image?
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

171
Find an image such that:
- Its code is similar to a given code
- It “looks natural” (image prior regularization)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

172
Understanding Deep Image Representations by Inverting Them
[Mahendran and Vedaldi, 2014]

reconstructions
original from the 1000
image log probabilities
for ImageNet
(ILSVRC)
classes
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

173
Reconstructions from the representation after last pooling
layer (immediately before the first Fully Connected layer)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

174
175
Reconstructions from intermediate layers
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Multiple reconstructions. Images in
quadrants all “look” the same to the CNN
(same code)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

176
177
DeepDream https://github.com/google/deepdream
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
178
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream: set dx = x :)

jitter regularizer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“image update”

179
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

this creates a feedback loop: e.g. any slightly detected dog face will be made more
and more dog like over time

180
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

181
inception_3b/5x5_reduce
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

182
Bonus videos
Deep Dream Grocery Trip
https://www.youtube.com/watch?v=DgPaCWJL7XI

Deep Dreaming Fear & Loathing in Las Vegas: the Great San Francisco Acid Wave
https://www.youtube.com/watch?v=oyxSerkkP4o
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

183
NeuralStyle
[A Neural Algorithm of Artistic Style
by Leon A. Gatys, Alexander S. Ecker, and
Matthias Bethge, 2015]
good implementation by Justin in Torch:
https://github.com/jcjohnson/neural-style
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

184
185
make your own easily on deepart.io
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Step 1: Extract content targets (ConvNet activations of
all layers for the given content image)

content activations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

e.g.
at CONV5_1 layer we would have a [14x14x512] array of target activations

186
Step 2: Extract style targets (Gram matrices of ConvNet
activations of all layers for the given style image)

style gram matrices


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

e.g.
at CONV1 layer (with [224x224x64] activations) would give a [64x64] Gram
matrix of all pairwise activation covariances (summed across spatial locations)

187
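A small sketch of the style-target computation: for activations of shape H x W x C, flatten the spatial locations and take all pairwise channel products, summed over positions. Names are illustrative.

```python
import numpy as np

def gram_matrix(acts):
    H, W, C = acts.shape          # e.g. 224x224x64 at the CONV1 layer
    F = acts.reshape(H * W, C)    # one row per spatial location
    return F.T @ F                # CxC Gram matrix, summed across spatial locations

print(gram_matrix(np.random.randn(224, 224, 64)).shape)  # (64, 64)
```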
Step 3: Optimize over image to have:
- The content of the content image (activations match
content)
- The style of the style image (Gram matrices of
activations match style)

(+Total Variation regularization (maybe))

match content
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

match style

188
We can pose an optimization over the input
image to maximize any class score.
That seems useful.

Question: Can we use this to “fool” ConvNets?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

spoiler alert: yeah

189
[Intriguing properties of neural networks,
Szegedy et al., 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

correct +distort ostrich correct +distort ostrich

190
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]

>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

191
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]

>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

192
These kinds of results were around even
before ConvNets…
[Exploring the Representation Capabilities of the HOG Descriptor,
Tatu et al., 2011]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Identical HOG represention

193
Explaining and Harnessing Adversarial Examples
[Goodfellow, Shlens & Szegedy, 2014]

“primary cause of neural networks’


vulnerability to adversarial
perturbation is their linear nature”
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

194
Lets fool a binary linear classifier:
(logistic regression)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

195
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

196
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


i.e. the classifier is 95% certain that this is class 0 example.

197
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x ? ? ? ? ? ? ? ? ? ?

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


i.e. the classifier is 95% certain that this is class 0 example.

198
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


-1.5+1.5+3.5+2.5+2.5-1.5+1.5-3.5-4.5+1.5 = 2
=> probability of class 1 is now 1/(1+e^(-(2))) = 0.88
i.e. we improved the class 1 probability from 5% to 88%

199
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5

class 1 score before: This was only with 10


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

input dimensions. A
=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474 224x224 input image
-1.5+1.5+3.5+2.5+2.5-1.5+1.5-3.5-4.5+1.5 = 2 has 150,528.
=> probability of class 1 is now 1/(1+e^(-(2))) = 0.88
i.e. we improved the class 1 probability from 5% to 88% (It’s significantly easier
with more numbers,
need smaller nudge for
each)
200
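A numpy sketch reproducing the numbers above: nudging every dimension of x by 0.5 in the direction of the corresponding weight raises the class-1 score from -3 to 2, and the sigmoid probability from about 0.05 to 0.88.

```python
import numpy as np

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
print(w @ x, sigmoid(w @ x))          # -3.0  0.0474...

x_adv = x + 0.5 * np.sign(w)          # tiny nudge of each dimension along the weight sign
print(x_adv)                          # [ 1.5 -1.5  3.5 -2.5  2.5  1.5  1.5 -3.5  4.5  1.5]
print(w @ x_adv, sigmoid(w @ x_adv))  # 2.0  0.8808...
```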
Blog post: Breaking Linear Classifiers
on ImageNet
Recall CIFAR-10
linear classifiers:

ImageNet classifiers:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

201
mix in a tiny bit of
Goldfish classifier weights

+ =
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

100% Goldfish

202
203
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
204
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
