
BIL 722: Advanced Topics in Computer Vision

(Deep Learning for Computer Vision)


Spring 2016

Week 3: Convolutional Neural Nets


Aykut Erdem

AlexNet, Krizhevsky et al., 2012


Administrative
Paper presentations will be starting next week!

• 8 papers not chosen yet


• Any volunteers for Theano, Keras, CNTK, Marvin
tutorials?

2
Presentations
• You can use any presentation tool (e.g., Powerpoint, Keynote,
LaTex) provided that the tool has options to export the slides
to PDF.
• Each presentation should be clear, well organized and very
technical, and roughly 30 minutes long.
• You are allowed to reuse material that already exists on the web
as long as you clearly cite the source of the media that you
have used in your presentation.
• Extra credit will be awarded to those students who also
conduct some experiments demonstrating how the method
works in practice.

3
Presentations
Deadline:
• You should meet with me 3-4 days before the
presentation date to discuss your slides
• The presentation should be submitted by the night before
the class
• Presentations grading rubric on the webpage

4
Suggested Presentation Outline
• High-level overview of the paper (main contributions)
• Problem statement and motivation (clear definition of the
problem, why it is interesting and important)
• Key technical ideas (overview of the approach)
• Experimental set-up (datasets, evaluation metrics, applications)
• Strengths and weaknesses (discussion of the results obtained)
• Connections with other work (how it relates to other
approaches, its similarities and differences)
• Future direction (open research questions)
5
Homework
Due March 15, 2016
(by 12:30pm)

• Fine-tuning a pre-trained model to


classify cultural events on the image
data from ChaLearn Looking at
People 2015 Challenge (CVPR 2015)
• The purpose is to help you learn
about the fundamentals of training
and understanding convolutional
networks:
- applying dropout, batch normalization and data augmentation to reduce
overfitting,
- combining models into ensembles to improve the performance,
- using transfer learning to adapt a pre-trained model to a new dataset,
- using data gradients to visualize saliency maps
6
A bit of history
Hubel & Wiesel:
1959: Receptive fields of single neurones in the cat's striate cortex
1962: Receptive fields, binocular interaction and functional architecture in the cat's visual cortex
1968 ...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
https://youtu.be/8VdFf3egwfg?t=1m10s 7
8
A bit of history
Topographical mapping
in the cortex:
nearby cells in cortex
represented
nearby regions in the
visual field
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

9
10
Hierarchical organization
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
The “Halle Berry” Neuron
Invariant visual representation by single neurons
in the human brain [Quiroga et al., Nature, 2005]

A single unit in the right


anterior hippocampus that
responds to pictures of the
actress Halle Berry
11
A bit of history “sandwich” architecture (SCSCSC…)
simple cells: modifiable parameters
complex cells: perform pooling

Neocognitron
[Fukushima 1980]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

12
A bit of history
Gradient-based learning
applied to document
recognition
[LeCun, Bottou, Bengio, Haffner
1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

LeNet-5

13
A bit of history
ImageNet Classification with Deep
Convolutional Neural Networks
[Krizhevsky, Sutskever, Hinton, 2012]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“AlexNet”
14
Convolutional Neural Networks
(First without the brain stuff)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

15
Convolution Layer
32x32x3 image

32 height
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 width
3 depth

16
Convolution Layer
32x32x3 image

5x5x3 filter
32

Convolve the filter with the image


i.e. “slide over the image spatially,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

computing dot products”

32
3

17
Convolution Layer
Filters always extend the full
depth of the input volume
32x32x3 image

5x5x3 filter
32

Convolve the filter with the image


i.e. “slide over the image spatially,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

computing dot products”

32
3

18
Convolution Layer
32x32x3 image
5x5x3 filter
32

1 number:
the result of taking a dot product between the
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

filter and a small 5x5x3 chunk of the image


32 (i.e. 5*5*3 = 75-dimensional dot product + bias)
3

19
Convolution Layer
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

20
Convolution Layer
consider a second, green filter
32x32x3 image activation maps
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

21
For example, if we had 6 5x5 filters, we’ll get 6
separate activation maps:
activation maps

32

28

Convolution Layer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 6

We stack these up to get a “new image” of size 28x28x6!

22
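A minimal numpy sketch (not course-provided code) of what this layer computes: sliding each 5x5x3 filter over a 32x32x3 image at stride 1 gives one 28x28 activation map, and stacking 6 such maps gives the 28x28x6 output volume. Array names and shapes are illustrative.

```python
import numpy as np

def conv_forward(x, filters, biases, stride=1):
    """Naive convolution: x is HxWxC, filters is KxFxFxC, biases has length K."""
    H, W, C = x.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                      # one activation map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]  # 75-dim dot product + bias
    return out

x = np.random.randn(32, 32, 3)   # 32x32x3 image
w = np.random.randn(6, 5, 5, 3)  # 6 filters of size 5x5x3
b = np.zeros(6)
print(conv_forward(x, w, b).shape)  # (28, 28, 6)
```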
Preview: ConvNet is a sequence of Convolutional
Layers, interspersed with activation functions

32 28

CONV,
ReLU
e.g. 6
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

5x5x3
32 filters 28
3 6

23
Preview: ConvNet is a sequence of Convolutional
Layers, interspersed with activation functions

32 28 24

….
CONV, CONV, CONV,
ReLU ReLU ReLU
e.g. 6 e.g. 10
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

24
25
Preview
[From recent Yann LeCun slides]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
26
Preview
[From recent Yann LeCun slides]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
one filter =>
one activation map
example 5x5 filters
(32 total)

We call the layer convolutional


because it is related to convolution
of two signals:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

elementwise multiplication
and sum of a filter and the
signal (image)

27
28
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:
activation map
32x32x3 image
5x5x3 filter
32

28

convolve (slide) over all


spatial locations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32 28
3 1

29
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

30
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

31
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

33
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter

=> 5x5 output


7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

34
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2

7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

35
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

36
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

37
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
7
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

38
A closer look at spatial dimensions:
7
7x7 input (spatially)
assume 3x3 filter
applied with stride 3?
7
doesn’t fit!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

cannot apply 3x3 filter on


7x7 input with stride 3.

39
N
Output size:
(N - F) / stride + 1
F
e.g. N = 7, F = 3:
N stride 1 => (7 - 3)/1 + 1 = 5
F stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

40
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
(recall:)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(N - F) / stride + 1

41
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

42
In practice: Common to zero pad
the border
e.g. input 7x7
3x3 filter, applied with stride 1
pad with 1 pixel border => what is the output?
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2. (will preserve size spatially)
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

43
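A quick sanity check (a sketch, not slide-provided code): the output-size formula (N - F + 2P)/stride + 1 reproduces the numbers above, and padding with (F-1)/2 at stride 1 preserves the spatial size.

```python
def conv_output_size(N, F, stride, pad=0):
    size = (N - F + 2 * pad) / stride + 1
    assert size == int(size), "filter does not fit"  # e.g. N=7, F=3, stride 3
    return int(size)

print(conv_output_size(7, 3, 1))          # 5
print(conv_output_size(7, 3, 2))          # 3
print(conv_output_size(7, 3, 1, pad=1))   # 7: zero-padding with (F-1)/2 preserves size
print(conv_output_size(32, 5, 1, pad=2))  # 32, as in the example on the next slides
```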
Remember back to…
E.g. 32x32 input convolved repeatedly with 5x5 filters
shrinks volumes spatially!
(32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn’t
work well.

32 28 24

….
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

CONV, CONV, CONV,


ReLU ReLU ReLU
e.g. 6 e.g. 10
5x5x3 5x5x6
32 filters 28 filters 24
3 6 10

44
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size: ?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

45
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Output volume size:


(32+2*2-5)/1+1 = 32 spatially, so
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

32x32x10

46
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

47
Examples time:

Input volume: 32x32x3


10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?


each filter has 5*5*3 + 1 = 76 params
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(+1 for bias)


=> 76*10 = 760

48
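A small illustrative sketch of the parameter count for this layer: each 5x5x3 filter has 75 weights plus a bias, and there are 10 filters.

```python
F, C, K = 5, 3, 10                 # filter size, input depth, number of filters
params_per_filter = F * F * C + 1  # +1 for the bias
total_params = params_per_filter * K
print(params_per_filter, total_params)  # 76 760
```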
49
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Common settings:

K = (powers of 2, e.g. 32, 64, 128, 512)


- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

50
(btw, 1x1 convolution layers make perfect sense)

1x1 CONV
56 with 32 filters
56
(each filter has size
1x1x64, and performs a
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

64-dimensional dot
56 product)
56
64 32

51
52
Example: CONV layer in Torch
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
53
Example: CONV layer in Caffe
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
54
Example: CONV layer in Lasagne
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter
32
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

1 number:
32 the result of taking a dot product between
the filter and this part of the image
3
(i.e. 5*5*3 = 75-dimensional dot product)

55
The brain/neuron view of CONV Layer

32x32x3 image
5x5x3 filter
32

It’s just a neuron with local


connectivity...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

1 number:
32 the result of taking a dot product between
the filter and this part of the image
3
(i.e. 5*5*3 = 75-dimensional dot product)

56
The brain/neuron view of CONV Layer

32

28 An activation map is a 28x28 sheet of neuron


outputs:
1. Each is connected to a small region in the input
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

2. All of them share parameters

32
28 “5x5 filter” -> “5x5 receptive field for each neuron”
3

57
The brain/neuron view of CONV Layer

32

E.g. with 5 filters,


28 CONV layer consists of
neurons arranged in a 3D grid
(28x28x5)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

There will be 5 different


32 28 neurons all looking at the same
3 region in the input volume
5

58
59
two more layers to go: POOL/FC
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

60
Max Pooling

Single depth slice (x):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 =>
6 8
3 4
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

61
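A minimal numpy sketch of 2x2 max pooling with stride 2 on the single depth slice above (the helper and array names are illustrative).

```python
import numpy as np

def max_pool(x, F=2, stride=2):
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - F + 1, stride):
        for j in range(0, W - F + 1, stride):
            out[i // stride, j // stride] = x[i:i+F, j:j+F].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))  # [[6. 8.]
                    #  [3. 4.]]
```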
62
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
63
Common settings:

F = 2, S = 2
F = 3, S = 2
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary
Neural Networks
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

64
[ConvNetJS demo: training on CIFAR-10]

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

65
Case Study: LeNet-5 [LeCun et al., 1998]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Conv filters were 5x5, applied at stride 1


Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]

66
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

67
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: What is the total number of parameters in this layer?

68
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Parameters: (11*11*3)*96 = 35K

69
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

70
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: what is the number of parameters in this layer?

71
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Parameters: 0!

72
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
After POOL1: 27x27x96
...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

73
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1


[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
74
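A short sketch (layer list taken from the slide above, code itself illustrative) that re-derives the spatial sizes of the simplified AlexNet with the (N - F + 2P)/S + 1 formula.

```python
# (name, filter size F, stride S, pad P, output depth) for the CONV/POOL layers above
layers = [("CONV1", 11, 4, 0, 96),  ("POOL1", 3, 2, 0, 96),
          ("CONV2",  5, 1, 2, 256), ("POOL2", 3, 2, 0, 256),
          ("CONV3",  3, 1, 1, 384), ("CONV4", 3, 1, 1, 384),
          ("CONV5",  3, 1, 1, 256), ("POOL3", 3, 2, 0, 256)]

N = 227  # input is 227x227x3
for name, F, S, P, depth in layers:
    N = (N - F + 2 * P) // S + 1
    print(name, f"{N}x{N}x{depth}")
# CONV1 55x55x96, POOL1 27x27x96, CONV2 27x27x256, POOL2 13x13x256,
# CONV3/4 13x13x384, CONV5 13x13x256, POOL3 6x6x256
```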
Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture: Details/Retrospectives:


[227x227x3] INPUT - first use of ReLU
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0 - used Norm layers (not common
[27x27x96] MAX POOL1: 3x3 filters at stride 2 anymore)
[27x27x96] NORM1: Normalization layer - heavy data augmentation
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 - dropout 0.5
[13x13x256] MAX POOL2: 3x3 filters at stride 2 - batch size 128
[13x13x256] NORM2: Normalization layer - SGD Momentum 0.9
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1 - Learning rate 1e-2, reduced by 10
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1 manually when val accuracy plateaus
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1 - L2 weight decay 5e-4
[6x6x256] MAX POOL3: 3x3 filters at stride 2
- 7 CNN ensemble: 18.2% -> 15.4%
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
75
Case Study: ZFNet [Zeiler and Fergus, 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%


76
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

best model
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

11.2% top 5 error in ILSVRC 2013


->
7.3% top 5 error

77
(not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: most memory is in the early CONV layers; most params are in the late FC layers.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

80
Case Study: GoogLeNet
[Szegedy et al.,
2014]

Inception module
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

ILSVRC 2014 winner (6.7% top 5 error)

81
Case Study: GoogLeNet
Fun features:

- Only 5 million params!


(Removes FC layers
completely)

Compared to AlexNet:
- 12X less params
- 2x more compute
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- 6.67% (vs. 16.4%)

82
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner
(3.6% top 5 error)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Slide from Kaiming He’s recent presentation


https://www.youtube.com/watch?v=1PGLj-uKT1w 83
84
(slide from Kaiming He’s recent presentation)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
85
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner
(3.6% top 5 error)

2-3 weeks of training


on 8 GPU machine

at runtime: faster
than a VGGNet!
(even though it has
8x more layers)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(slide from Kaiming He’s recent presentation)


86
Case Study: ResNet [He et al., 2015]

224x224x3

spatial dimension
only 56x56!
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

87
88
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]

- Batch Normalization after every CONV layer


- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

89
90
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Case Study: ResNet [He et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(this trick is also used in GoogLeNet)

91
92
[He et al., 2015]
Case Study: ResNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
93
Case Study Bonus: DeepMind’s
AlphaGo
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters , stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)

94
Summary

- ConvNets stack CONV,POOL,FC layers


- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like
[(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

where N is usually up to ~5, M is large, 0 <= K <= 2.


- but recent advances such as ResNet/GoogLeNet
challenge this paradigm

95
Tips and Tricks

96
• Shuffle the training samples

• Use Dropout and Batch


Normalization for regularization

97
Input representation
“Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image”
• Centered (0-mean) RGB values.
(figure: an input image (256x256), minus the mean input image)
slide by Alex Krizhevsky
98
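A minimal sketch of the preprocessing described above: compute the mean training image and subtract it from every input. The `train_images` array is a random stand-in for the real data.

```python
import numpy as np

# Stand-in for the training set: 100 random 256x256 RGB images.
train_images = np.random.randint(0, 256, size=(100, 256, 256, 3), dtype=np.uint8)
mean_image = train_images.mean(axis=0)                     # the 256x256x3 mean input image
centered = train_images.astype(np.float32) - mean_image    # 0-mean RGB values
```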
Data Augmentation
• Our neural net has 60M
real-valued parameters and
650,000 neurons
• It overfits a lot. Therefore,
they train on 224x224
patches extracted randomly
from 256x256 images, and
also their horizontal
reflections.
“This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly inter-dependent.” [Krizhevsky et al. 2012]
slide by Alex Krizhevsky
99
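A sketch of the crop-and-flip augmentation described above: take a random 224x224 patch of a 256x256 image and flip it horizontally half the time. The helper and names are illustrative.

```python
import numpy as np

def random_crop_flip(img, crop=224):
    H, W, _ = img.shape                     # e.g. a 256x256x3 image
    top = np.random.randint(0, H - crop + 1)
    left = np.random.randint(0, W - crop + 1)
    patch = img[top:top+crop, left:left+crop]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]              # horizontal reflection
    return patch
```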
Data Augmentation
• Alter the intensities of the RGB channels in training images.
“Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1…This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.”
slide by Alex Krizhevsky
[Krizhevsky et al. 2012] 100
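A sketch of the PCA color augmentation quoted above, under the assumption that `pixels` is an (N, 3) array of RGB values collected from the training set: add a random multiple of each principal component, scaled by its eigenvalue.

```python
import numpy as np

pixels = np.random.rand(10000, 3)                    # stand-in for all training RGB values
eigvals, eigvecs = np.linalg.eigh(np.cov(pixels.T))  # 3 eigenvalues / eigenvectors of the RGB covariance

def pca_color_jitter(img, sigma=0.1):
    alphas = np.random.normal(0.0, sigma, size=3)    # one alpha per principal component
    shift = eigvecs @ (alphas * eigvals)             # 3-vector added to every pixel
    return img + shift

augmented = pca_color_jitter(np.random.rand(224, 224, 3))
```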


101
Data Augmentation
Horizontal flips
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Data Augmentation
Get creative!

Random mix/combinations of :
- translation
- rotation
- stretching
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- shearing,
- lens distortions, … (go crazy)

102
Data augmentation improves human
learning, not just deep learning

If you're trying to improve your golf swing or master that tricky guitar
chord progression, here's some good news from researchers at Johns
Hopkins University: You may be able to double how quickly you learn
skills like these by introducing subtle variations into your practice
routine.

The received wisdom on learning motor skills goes something like this:
You need to build up "muscle memory" in order to perform mechanical
tasks, like playing musical instruments or sports, quickly and efficiently.
And the way you do that is via rote repetition — return hundreds of
tennis serves, play that F major scale over and over until your fingers
bleed, etc.

The wisdom on this isn't necessarily wrong, but the Hopkins research suggests it's incomplete. Rather than doing the same thing over and over, you might be able to learn things even faster — like, twice as fast — if you change up your routine. Practicing your baseball swing? Change the size and weight of your bat. Trying to nail a 12-bar blues in A major on the guitar? Spend 20 minutes playing the blues in E major, too. Practice your backhand using tennis rackets of varying size and weight.

https://www.washingtonpost.com/news/wonk/wp/2016/02/12/how-to-learn-new-skills-twice-as-fast/
103
Convolutions alone are not enough!
Convolutions alone cannot handle this: scale, shift, rotation, color space.
slide by Alex Smola
104


Invariance and Covariance
[Figure from Visualizing and Understanding Convolutional Networks, Zeiler and Fergus, 2014: canonical distance in layers 1 and 7 and P(true class) as a function of vertical translation (pixels), scale (ratio), and rotation (degrees), for example classes Lawn Mower, Shih-Tzu, African Crocodile, African Grey, and Entertainment Center.]

Figure 4. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the origin…

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, 2014]
105
Invariance and Covariance
[Table 3 (from the paper): relative variance and intrinsic dimensionality averaged over experiments for different object categories and viewpoints (3D orientation, translation, and scale); each cell gives relative variance (top) and intrinsic dimension (bottom) for pool5, fc6 and fc7 features of the Places, AlexNet and VGG networks, split into viewpoint and style factors.]

[Figure 3: PCA embeddings for 2D position on AlexNet. Figure 4: PCA embeddings for different factors (lighting, scale, object color, background color) using AlexNet pool5 features on “car” images; colors in (a) correspond to location of the light source (green – center).]

These results support the intuition that higher layers are more invariant to viewpoint; the residual feature is less important in higher layers, indicating style and viewpoint are more easily separable in those layers.

Understanding deep features with computer-generated imagery, [Aubry and Russell, '15]
slide by Joan Bruna
106
Transfer Learning

“You need a lot of data if you want to


train/use CNNs”
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

107
Transfer Learning with ConvNets

• Keep layers 1-7 of our ImageNet-trained model fixed
• Train a new softmax classifier on top using the training images of the new dataset

[Slide shows tables and text from the paper: on Caltech-101 the ImageNet-pretrained convnet reaches 83.8 ± 0.5 / 86.5 ± 0.5 % accuracy with 15 / 30 training images per class, versus 22.8 ± 1.5 / 46.5 ± 1.7 % for a convnet trained from scratch; on Caltech-256 it beats the state of the art (Bo et al., 2013) by a significant margin, 74.2% vs 55.2% for 60 training images/class; on PASCAL VOC 2012 it reaches a mean of 79.0% against 74.3% and 82.2% for the two leading methods; Table 7 shows that features from higher layers generally produce more discriminative features.]

Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, 2014]
108
Stability: Transfer Learning with ConvNets
• A ConvNet trained on a (large enough) dataset generalizes to other visual tasks

Figure 4. t-SNE map of 20,000 Flickr test images based on features extracted from the last layer of an AlexNet trained with K = 1,000. A full-resolution map is presented in the supplemental material. The inset shows a cluster of sports.

[Also shown: Figure 6, a t-SNE map of 10,000 words based on their learned embeddings, and a discussion of one-versus-all vs. multiclass logistic loss (precision@10 of 16.43 vs 17.98 with a dictionary of K = 1,000 words).]

“Learning visual features from Large Weakly supervised Data”, [Joulin et al., '15]
slide by Joan Bruna
109
Transfer Learning with ConvNets

1. Train on
Imagenet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

110
Transfer Learning with ConvNets

2. Small dataset:
1. Train on feature extractor
Imagenet

Freeze
these
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Train
this

111
Transfer Learning with ConvNets

2. Small dataset: 3. Medium dataset:


1. Train on feature extractor finetuning
Imagenet
more data = retrain more of
the network (or all of it)
Freeze these
Freeze
these
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Train Train this


this

112
Transfer Learning with ConvNets

2. Small dataset: 3. Medium dataset:


1. Train on feature extractor finetuning
Imagenet
more data = retrain more of
the network (or all of it)
Freeze these
Freeze tip: use only ~1/10th of
these the original learning rate
in finetuning top layer,
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

and ~1/100th on
intermediate layers

Train Train this


this

113
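A framework-agnostic sketch (not from the slides) of the fine-tuning recipe above: keep early layers frozen, fine-tune intermediate layers with a much smaller learning rate, and train the new top layer at the base rate. `params` and `grads` stand in for whatever your framework exposes.

```python
import numpy as np

params = {"conv1": np.random.randn(96, 11, 11, 3),  # early layer: frozen
          "fc7":   np.random.randn(4096, 4096),     # intermediate layer: fine-tuned slowly
          "fc8":   np.random.randn(10, 4096)}       # new classifier for the new dataset
grads = {k: np.random.randn(*v.shape) for k, v in params.items()}  # stand-in gradients

base_lr = 1e-3               # e.g. ~1/10th of the rate used during pretraining
lr_mult = {"conv1": 0.0,     # frozen: no update
           "fc7":   0.1,     # roughly ~1/100th of the original rate overall
           "fc8":   1.0}     # new top layer uses the full fine-tuning rate

for name in params:
    params[name] -= base_lr * lr_mult[name] * grads[name]  # plain SGD step
```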
Today ConvNets are everywhere
Classification Retrieval
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Krizhevsky 2012]

114
Today ConvNets are everywhere
Detection Segmentation
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Faster R-CNN: Ren, He, Girshick, Sun 2015] [Farabet et al., 2012]

115
Today ConvNets are everywhere

NVIDIA Tegra X1
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

self-driving cars

116
Today ConvNets are everywhere
[Taigman et al. 2014]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Simonyan et al. 2014] [Goodfellow 2014]

117
Today ConvNets are everywhere

[Toshev, Szegedy 2014]


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Mnih 2013]

118
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Ciresan et al. 2013] [Sermanet et al. 2011]


[Ciresan et al.]

119
Today ConvNets are everywhere

[Denil et al. 2014]


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Turaga et al., 2010]

120
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Whale recognition, Kaggle Challenge Mnih and Hinton, 2010

121
Today ConvNets are everywhere
Image
Captioning
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

[Vinyals et al., 2015]

122
Today ConvNets are everywhere
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

reddit.com/r/deepdream

123
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Deep Neural Networks Rival the Representation of Primate IT


Cortex for Core Visual Object Recognition
[Cadieu et al., 2014]
124
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Deep Neural Networks Rival the Representation of Primate IT


Cortex for Core Visual Object Recognition
[Cadieu et al., 2014]
125
Understanding ConvNets

126
Understanding ConvNets
- Visualize patches that maximally activate neurons
- Visualize the weights
- Visualize the representation space (e.g. with t-SNE)
- Occlusion experiments
- Human experiment comparisons
- Deconv approaches (single backward pass)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- Optimization over image approaches (optimization)

127
Visualize patches that maximally
activate neurons
one-stream AlexNet

pool5
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Rich feature hierarchies for accurate object detection and semantic segmentation
[Girshick, Donahue, Darrell, Malik]
128
Visualize the filters/kernels
(raw weights)
one-stream AlexNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

conv1

only interpretable on the first layer :(


129
Visualize the layer 1
weights
filters/kernels
(raw weights) layer 2
weights

you can still do it


for higher layers,
it’s just not that
interesting
layer 3
(these are taken
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

weights
from ConvNetJS
CIFAR-10 demo)

130
131
The gabor-like filters fatigue
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Visualizing the representation
fc7 layer

4096-dimensional “code” for an image


(layer immediately before the classifier)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

can collect the code for many images

132
Visualizing the representation

t-SNE visualization
[van der Maaten & Hinton]

Embed high-dimensional points so


that locally, pairwise distances are
conserved

i.e. similar things end up in similar


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

places. dissimilar things end up


wherever

Right: Example embedding of


MNIST digits (0-9) in 2D
133
t-SNE
visualization:
two images are
placed nearby if
their CNN codes
are close. See
more:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

http://cs.stanford.edu/people/
karpathy/cnnembed/

134
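A sketch of how such an embedding could be produced, assuming `codes` holds one 4096-dimensional fc7 vector per image; this uses scikit-learn's TSNE, and the parameters are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

codes = np.random.randn(500, 4096)                # stand-in for fc7 codes of 500 images
xy = TSNE(n_components=2).fit_transform(codes)    # (500, 2) coordinates for plotting
```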
Occlusion experiments
[Zeiler & Fergus 2013]

(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

135
Occlusion experiments
[Zeiler & Fergus 2013]

(as a function of
the position of the
square of zeros in
the original image)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

136
Visualizing Activations
http://yosinski.com/deepvis

YouTube video
https://www.youtube.com/watch?v=A
gkfIQ4IGaM
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(4min)

137
138
Deconv approaches
1. Feed image into net

Q: how can we compute the gradient of any


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

arbitrary neuron in the network w.r.t. the image?

139
140
Deconv approaches
1. Feed image into net
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Deconv approaches
1. Feed image into net

2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

141
Deconv approaches
1. Feed image into net

2. Pick a layer, set the gradient there to be all zero except for one 1
for some neuron of interest
3. Backprop to image:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“Guided
backpropagation:”
instead

142
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

143
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Backward pass for a ReLU (will be changed in Guided


Backprop)

144
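A numpy sketch of the modified ReLU backward pass used by guided backpropagation (a sketch, not the authors' code): the ordinary backward pass routes the gradient only where the forward input was positive, and guided backprop additionally zeroes out negative incoming gradients.

```python
import numpy as np

def relu_backward(dout, x):
    return dout * (x > 0)                  # standard ReLU backward pass

def guided_relu_backward(dout, x):
    return dout * (x > 0) * (dout > 0)     # also drop negative gradients coming from above
```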
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

145
[Striving for Simplicity: The all convolutional net,
Springenberg, Dosovitskiy, et al., 2015]
Visualization of patterns
learned by the layer
conv6 (top) and layer
conv9 (bottom) of the
network trained on
ImageNet.

Each row corresponds to


one filter.
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The visualization using


“guided backpropagation”
is based on the top 10
image patches activating
this filter taken from the
ImageNet dataset.

146
Deconv approaches
[Visualizing and Understanding Convolutional Networks, Zeiler and Fergus 2013]
[Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,
Simonyan et al., 2014]
[Striving for Simplicity: The all convolutional net, Springenberg, Dosovitskiy, et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

bit weird

147
[Visualizing and Understanding Convolutional Networks
Zeiler & Fergus, 2013]
Visualizing arbitrary neurons along the way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

148
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

149
Visualizing arbitrary neurons along the
way to the top...
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

150
Optimization to Image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: can we find an image that maximizes


some class score?
151
Optimization to Image

score for class c (before Softmax)


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Q: can we find an image that maximizes


some class score?
152
Optimization to Image

1. feed in
zeros. zero image

2. set the gradient of the scores vector to be


[0,0,....1,....,0], then backprop to image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

153
Optimization to Image

1. feed in
zeros. zero image

2. set the gradient of the scores vector to be


[0,0,....1,....,0], then backprop to image
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

3. do a small “image update”


4. forward the image through
the network. score for class c (before Softmax)
5. go back to 2.

154
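A schematic sketch of the loop above. `score_and_grad` stands in for a forward/backward pass through a real ConvNet that returns the (pre-softmax) score of class c and its gradient with respect to the image; here it is a dummy linear model so the snippet runs.

```python
import numpy as np

# Dummy stand-in "network": linear class scores over a small 32x32x3 image.
W = np.random.randn(10, 32 * 32 * 3) * 1e-3

def score_and_grad(x, c):
    scores = W @ x.ravel()
    return scores[c], W[c].reshape(x.shape)   # d(score_c)/dx for a linear model

x = np.zeros((32, 32, 3))        # 1. feed in a zero image
c, lr, reg = 3, 1.0, 1e-4        # class of interest, step size, L2 regularization
for step in range(100):
    s, g = score_and_grad(x, c)  # 2./4. forward pass + backprop of the class score to the image
    x += lr * (g - reg * x)      # 3. small "image update" (gradient ascent on the score)
```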
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

1. Find images that maximize some class score:


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

155
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

1. Find images that maximize some class score:


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

156
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

2. Visualize the
Data gradient:

(note that the gradient on


data has three channels.
M=?
Here they visualize M, s.t.:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(at each pixel take abs val, and max


over channels)

157
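A one-line sketch of the map M described above, assuming `grad` is the class-score gradient with respect to the input image, with shape (H, W, 3): take the absolute value and the max over channels at each pixel.

```python
import numpy as np

grad = np.random.randn(224, 224, 3)   # stand-in for the data gradient
M = np.abs(grad).max(axis=2)          # (224, 224) saliency map
```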
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

2. Visualize the
Data gradient:

(note that the gradient on


data has three channels.
Here they visualize M, s.t.:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

(at each pixel take abs val, and max


over channels)

158
Deep Inside Convolutional Networks: Visualising Image Classification Models and
Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014

- Use grabcut for


segmentation
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

159
We can in fact do this for arbitrary neurons
along the ConvNet
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Repeat:
1. Forward an image
2. Set activations in layer of interest to all zero, except
for a 1.0 for a neuron of interest
3. Backprop to image
4. Do an “image update”
160
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]

Proposed a different form of regularizing the image

More explicit scheme:


Repeat:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

- Update the image x with gradient from some unit of


interest
- Blur x a bit
- Take any pixel with small norm to zero (to encourage
sparsity)
161
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
http://yosinski.com/deepvis
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

162
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

163
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

164
Understanding Neural Networks Through Deep Visualization
[Yosinski et al., 2015]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

165
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

Class #34, Leatherback Turtle Class #76, Tarantula

166
http://mtyka.github.io/deepdream/2016/02/05/bilateral-class-vis.html

Class #144, Pelican Class #944, Artichoke

167
Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al., '16]

Algorithm 1 Multifaceted Feature Visualization
Input: a set of images U and a number of facets k
1. For each image in U, compute the high-level (here fc7) hidden code i
2. Reduce the dimensionality of each code i from 4096 to 50 via PCA.
3. Run t-SNE visualization on the entire set of codes i to produce a 2-D embedding (examples in Fig. 4).
4. Locate k clusters in the embedding via k-means.
for each cluster:
5. Compute a mean image x0 by averaging the 15 images nearest to the cluster centroid.
6. Run activation maximization (see Section 2.2), but initialize it with x0 instead of a random image.
Output: a set of facet visualizations {x1, x2, ..., xk}.

Note that here we only visualize 10 facets per neuron, but it is possible to visualize fewer or more facets by changing k. We compute a mean image by averaging m = 15 images (Algorithm 1, step 5) as it works the best compared to m = {1, 50, 100, 200}.
168
Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al.,’16]
Multifaceted Feature Visualization

169
Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
[Nguyen et al.,’16]

Figure 6. Multifaceted visualization of example neuron feature detectors from all eight layers of a deep convolutional neural network.
The images reflect the true sizes of the receptive fields at different layers. For each neuron, we show visualizations of 4 different
170
Question: Given a CNN code, is it
possible to reconstruct the original
image?
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

171
Find an image such that:
- Its code is similar to a given code
- It “looks natural” (image prior regularization)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

172
Understanding Deep Image Representations by Inverting Them
[Mahendran and Vedaldi, 2014]

reconstructions
original from the 1000
image log probabilities
for ImageNet
(ILSVRC)
classes
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

173
Reconstructions from the representation after last pooling
layer (immediately before the first Fully Connected layer)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

174
175
Reconstructions from intermediate layers
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Multiple reconstructions. Images in
quadrants all “look” the same to the CNN
(same code)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

176
177
DeepDream https://github.com/google/deepdream
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
178
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
DeepDream: set dx = x :)

jitter regularizer
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

“image update”

179
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

this creates a feedback loop: e.g. any slightly detected dog face will be made more
and more dog like over time

180
inception_4c/output
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

181
inception_3b/5x5_reduce
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

DeepDream modifies the image in a way that “boosts” all activations, at any layer

182
Bonus videos
Deep Dream Grocery Trip
https://www.youtube.com/watch?v=DgPaCWJL7XI

Deep Dreaming Fear & Loathing in Las Vegas: the Great San Francisco Acid Wave
https://www.youtube.com/watch?v=oyxSerkkP4o
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

183
NeuralStyle
[A Neural Algorithm of Artistic Style
by Leon A. Gatys, Alexander S. Ecker, and
Matthias Bethge, 2015]
good implementation by Justin in Torch:
https://github.com/jcjohnson/neural-style
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

184
185
make your own easily on deepart.io
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Step 1: Extract content targets (ConvNet activations of
all layers for the given content image)

content activations
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

e.g.
at CONV5_1 layer we would have a [14x14x512] array of target activations

186
Step 2: Extract style targets (Gram matrices of ConvNet
activations of all layers for the given style image)

style gram matrices


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

e.g.
at CONV1 layer (with [224x224x64] activations) would give a [64x64] Gram
matrix of all pairwise activation covariances (summed across spatial locations)

187
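A small sketch of the style-target computation: for activations of shape H x W x C, flatten the spatial locations and take all pairwise channel products, summed over positions. Names are illustrative.

```python
import numpy as np

def gram_matrix(acts):
    H, W, C = acts.shape          # e.g. 224x224x64 at the CONV1 layer
    F = acts.reshape(H * W, C)    # one row per spatial location
    return F.T @ F                # CxC Gram matrix, summed across spatial locations

print(gram_matrix(np.random.randn(224, 224, 64)).shape)  # (64, 64)
```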
Step 3: Optimize over image to have:
- The content of the content image (activations match
content)
- The style of the style image (Gram matrices of
activations match style)

(+Total Variation regularization (maybe))

match content
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

match style

188
We can pose an optimization over the input
image to maximize any class score.
That seems useful.

Question: Can we use this to “fool” ConvNets?


slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

spoiler alert: yeah

189
[Intriguing properties of neural networks,
Szegedy et al., 2013]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

correct +distort ostrich correct +distort ostrich

190
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]

>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

191
[Deep Neural Networks are Easily Fooled: High Confidence
Predictions for Unrecognizable Images
Nguyen, Yosinski, Clune, 2014]

>99.6%
confidences
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

192
These kinds of results were around even
before ConvNets…
[Exploring the Representation Capabilities of the HOG Descriptor,
Tatu et al., 2011]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Identical HOG represention

193
Explaining and Harnessing Adversarial Examples
[Goodfellow, Shlens & Szegedy, 2014]

“primary cause of neural networks’


vulnerability to adversarial
perturbation is their linear nature”
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

194
Lets fool a binary linear classifier:
(logistic regression)
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

195
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

196
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


i.e. the classifier is 95% certain that this is class 0 example.

197
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x ? ? ? ? ? ? ? ? ? ?

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


i.e. the classifier is 95% certain that this is class 0 example.

198
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5

class 1 score before:


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474


-1.5+1.5+3.5+2.5+2.5-1.5+1.5-3.5-4.5+1.5 = 2
=> probability of class 1 is now 1/(1+e^(-(2))) = 0.88
i.e. we improved the class 1 probability from 5% to 88%

199
Lets fool a binary linear classifier:

x 2 -1 3 -2 2 2 1 -4 5 1 input example

w -1 -1 1 -1 1 -1 1 1 -1 1 weights

adversarial x 1.5 -1.5 3.5 -2.5 2.5 1.5 1.5 -3.5 4.5 1.5

class 1 score before: This was only with 10


-2 + 1 + 3 + 2 + 2 - 2 + 1 - 4 - 5 + 1 = -3
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

input dimensions. A
=> probability of class 1 is 1/(1+e^(-(-3))) = 0.0474 224x224 input image
-1.5+1.5+3.5+2.5+2.5-1.5+1.5-3.5-4.5+1.5 = 2 has 150,528.
=> probability of class 1 is now 1/(1+e^(-(2))) = 0.88
i.e. we improved the class 1 probability from 5% to 88% (It’s significantly easier
with more numbers,
need smaller nudge for
each)
200
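A numpy sketch reproducing the numbers above: nudging every dimension of x by 0.5 in the direction of the corresponding weight raises the class-1 score from -3 to 2, and the sigmoid probability from about 0.05 to 0.88.

```python
import numpy as np

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
print(w @ x, sigmoid(w @ x))          # -3.0  0.0474...

x_adv = x + 0.5 * np.sign(w)          # tiny nudge of each dimension along the weight sign
print(x_adv)                          # [ 1.5 -1.5  3.5 -2.5  2.5  1.5  1.5 -3.5  4.5  1.5]
print(w @ x_adv, sigmoid(w @ x_adv))  # 2.0  0.8808...
```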
Blog post: Breaking Linear Classifiers
on ImageNet
Recall CIFAR-10
linear classifiers:

ImageNet classifiers:
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

201
mix in a tiny bit of
Goldfish classifier weights

+ =
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

100% Goldfish

202
203
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
204
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
