
Lecture 9:

CNN Architectures

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 1 April 27, 2021


Administrative
- A1 grades will hopefully be released by today; check Piazza for the regrade policy
- Project proposal grades will be released mid-week

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 2 April 27, 2021


Administrative
- A2 is due Friday April 30th, 11:59pm

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 3 April 27, 2021


Administrative: Midterm
Covers material through Lecture 10 (Thu April 29).

- Exam is on Tues, May 4 and is worth 15% of your grade.
- Available for 24 hours on Gradescope, from May 4, 12 PM PDT to May 5, 11:59 AM PDT.
- Take it in one consecutive 3-hour timeframe.
- Exam will be designed for 1.5 hours.
- Open book and open internet, but no collaboration.
- Only make private Piazza posts during those 24 hours.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 4 April 27, 2021
Administrative

Midterm review session: Fri April 30th discussion section

Sample midterm has been released on Piazza.

OAE accommodations: If you have not received an email from us, please reach out to the staff mailing list ASAP.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 5 April 27, 2021


Last time: fancy optimizers

SGD

SGD+Momentum

RMSProp

Adam
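
The update equations themselves are not reproduced in this transcript; as a quick reference, here is a minimal NumPy sketch of the four update rules (hyperparameter names lr, rho, beta1, beta2, eps are illustrative, not from the slide):

```python
import numpy as np

def sgd(w, dw, lr=1e-2):
    return w - lr * dw

def sgd_momentum(w, dw, v, lr=1e-2, rho=0.9):
    v = rho * v + dw                                   # accumulate a velocity term
    return w - lr * v, v

def rmsprop(w, dw, s, lr=1e-3, decay=0.99, eps=1e-8):
    s = decay * s + (1 - decay) * dw**2                # running average of squared grads
    return w - lr * dw / (np.sqrt(s) + eps), s

def adam(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw                   # first moment
    v = beta2 * v + (1 - beta2) * dw**2                # second moment
    m_hat = m / (1 - beta1**t)                         # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```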

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 6 April 27, 2021


Last time: learning rate scheduling
Step: Reduce learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = (α_0 / 2) (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Inverse sqrt: α_t = α_0 / √t

α_0: Initial learning rate
α_t: Learning rate at epoch t
T: Total number of epochs
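
A small sketch of these schedules as a Python helper (the 30/60/90 epoch boundaries and the base LR are just the ResNet-style example above, not fixed choices):

```python
import math

def lr_at_epoch(t, T, lr0=0.1, schedule="cosine"):
    if schedule == "step":                      # ResNet-style step decay
        drops = sum(t >= e for e in (30, 60, 90))
        return lr0 * (0.1 ** drops)
    if schedule == "cosine":
        return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))
    if schedule == "linear":
        return lr0 * (1 - t / T)
    if schedule == "inv_sqrt":
        return lr0 / math.sqrt(max(t, 1))
    raise ValueError(schedule)
```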

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 7 April 27, 2021


Last time: dropout as a regularizer
Example forward pass with a 3-layer network using dropout
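
The slide's code listing is not reproduced in this transcript; below is a minimal NumPy sketch of a 3-layer forward pass with vanilla dropout (layer sizes and initialization are illustrative). At test time the activations are scaled by p instead of being dropped:

```python
import numpy as np

p = 0.5                        # probability of keeping a unit active
D, H, C = 512, 256, 10         # illustrative layer sizes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((H, D)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((H, H)) * 0.01, np.zeros(H)
W3, b3 = rng.standard_normal((C, H)) * 0.01, np.zeros(C)

def train_step(x):
    h1 = np.maximum(0, W1 @ x + b1)
    u1 = rng.random(h1.shape) < p       # first dropout mask
    h1 *= u1                            # drop!
    h2 = np.maximum(0, W2 @ h1 + b2)
    u2 = rng.random(h2.shape) < p       # second dropout mask
    h2 *= u2
    return W3 @ h2 + b3

def predict(x):
    h1 = np.maximum(0, W1 @ x + b1) * p   # scale by p: average out the randomness
    h2 = np.maximum(0, W2 @ h1 + b2) * p
    return W3 @ h2 + b3
```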

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 8 April 27, 2021


Regularization: A common pattern
Training: Add some kind of randomness

Testing: Average out randomness (sometimes approximate)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 9 April 27, 2021


Regularization: A common pattern
Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)

Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 10 April 27, 2021
Regularization: Data Augmentation

Load image and label (“cat”) → CNN → Compute loss

This image by Nikita is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 11 April 27, 2021


Regularization: Data Augmentation

Load image and label (“cat”) → Transform image → CNN → Compute loss

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 12 April 27, 2021


Data Augmentation
Horizontal Flips

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 13 April 27, 2021




Data Augmentation
Random crops and scales
Training: sample random crops / scales
ResNet:
1. Pick random L in range [256, 480]
2. Resize training image, short side = L
3. Sample random 224 x 224 patch

Testing: average a fixed set of crops


ResNet:
1. Resize image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, plus their horizontal flips
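
A sketch of the training-time recipe above using PIL (the library choice and the helper name are assumptions, not part of the slide):

```python
import random
from PIL import Image

def resnet_train_crop(img: Image.Image) -> Image.Image:
    """ResNet-style augmentation: scale jitter + random 224x224 crop + flip."""
    L = random.randint(256, 480)                       # 1. pick random L in [256, 480]
    w, h = img.size
    scale = L / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)  # 2. short side = L
    w, h = img.size
    x = random.randint(0, w - 224)                     # 3. sample random 224x224 patch
    y = random.randint(0, h - 224)
    img = img.crop((x, y, x + 224, y + 224))
    if random.random() < 0.5:                          # random horizontal flip
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    return img
```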

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 15 April 27, 2021


Data Augmentation
Color Jitter
Simple: Randomize
contrast and brightness

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 16 April 27, 2021


Data Augmentation
Color Jitter
Simple: Randomize contrast and brightness

More Complex:
1. Apply PCA to all [R, G, B] pixels in training set
2. Sample a “color offset” along principal component directions
3. Add offset to all pixels of a training image
(As seen in [Krizhevsky et al. 2012], ResNet, etc)
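
A rough NumPy sketch in the spirit of this PCA-based color jitter (the alpha_std value and exact scaling are assumptions; the original AlexNet recipe differs in details):

```python
import numpy as np

def pca_color_augment(images, alpha_std=0.1, rng=np.random.default_rng()):
    """images: float array (N, H, W, 3) in [0, 1]."""
    pixels = images.reshape(-1, 3)
    pixels = pixels - pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)              # 3x3 covariance of R, G, B
    eigvals, eigvecs = np.linalg.eigh(cov)          # principal component directions
    out = np.empty_like(images)
    for i, img in enumerate(images):
        alphas = rng.normal(0.0, alpha_std, size=3)     # sample a "color offset"
        offset = eigvecs @ (alphas * eigvals)           # offset along the PCs
        out[i] = np.clip(img + offset, 0.0, 1.0)        # add to every pixel
    return out
```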

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 17 April 27, 2021


Data Augmentation
Get creative for your problem!
Examples of data augmentations:
- translation
- rotation
- stretching
- shearing,
- lens distortions, … (go crazy)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 18 April 27, 2021


Automatic Data Augmentation

Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data”, CVPR 2019

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 19 April 27, 2021


Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 20 April 27, 2021


Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 21 April 27, 2021


Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling

Graham, “Fractional Max Pooling”, arXiv 2014

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 22 April 27, 2021


Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth (will become more
clear in next week's lecture)

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 23 April 27, 2021


Regularization: Cutout
Training: Set random image regions to zero
Testing: Use full image
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Works very well for small datasets like CIFAR, less common for large datasets like ImageNet

DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 24 April 27, 2021


Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup

Randomly blend the pixels of pairs of training images, e.g. 40% cat, 60% dog
CNN target label: cat: 0.4, dog: 0.6

Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018
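
A minimal NumPy sketch of the idea (the Beta-distributed blend weight follows the mixup paper's setup; the alpha value here is an assumption):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng()):
    """x: (N, H, W, 3) images, y_onehot: (N, C) one-hot labels."""
    lam = rng.beta(alpha, alpha)            # blend weight, e.g. 0.4 cat / 0.6 dog
    perm = rng.permutation(len(x))          # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```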

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 25 April 27, 2021


Regularization - In practice
Training: Add random noise
Testing: Marginalize over the noise
Examples:
Dropout
Batch Normalization
Data Augmentation
DropConnect
Fractional Max Pooling
Stochastic Depth
Cutout / Random Crop
Mixup

In practice:
- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 26 April 27, 2021


Choosing Hyperparameters
(without tons of GPUs)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 27 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss

Turn off weight decay, sanity check loss at initialization


e.g. log(C) for softmax with C classes
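
For example, with a softmax over C classes and random initialization, the predictions are roughly uniform, so the expected initial loss is about log(C):

```python
import math

C = 10                          # e.g. 10 classes
expected_loss = math.log(C)     # -log(1/C): softmax loss with uniform predictions
print(f"expect ~{expected_loss:.2f} at initialization")   # ~2.30 for C = 10
```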

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 28 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample

Try to train to 100% training accuracy on a small sample of


training data (~5-10 minibatches); fiddle with architecture,
learning rate, weight initialization

Loss not going down? LR too low, bad initialization


Loss explodes to Inf or NaN? LR too high, bad initialization
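
A hedged PyTorch sketch of this step (model and small_loader are placeholders you would supply; the hyperparameters and step counts are illustrative):

```python
import torch
from torch import nn, optim

def overfit_small_sample(model: nn.Module, small_loader, lr=1e-3, steps=500):
    """Try to drive training accuracy to 100% on ~5-10 minibatches."""
    opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)   # no weight decay here
    loss_fn = nn.CrossEntropyLoss()
    batches = list(small_loader)              # the small, fixed subset
    for step in range(steps):
        x, y = batches[step % len(batches)]
        logits = model(x)
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 50 == 0:
            acc = (logits.argmax(dim=1) == y).float().mean().item()
            print(f"step {step}: loss {loss.item():.3f}, train acc {acc:.2f}")
```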

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 29 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down

Use the architecture from the previous step, use all training
data, turn on small weight decay, find a learning rate that
makes the loss drop significantly within ~100 iterations

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 30 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of learning rate and weight decay around


what worked from Step 3, train a few models for ~1-5 epochs.

Good weight decay to try: 1e-4, 1e-5, 0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 31 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer

Pick best models from Step 4, train them for longer (~10-20
epochs) without learning rate decay

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 32 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 33 April 27, 2021


[Plot: train / val accuracy vs. time] Accuracy still going up: you need to train longer

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 34 April 27, 2021


[Plot: train / val accuracy vs. time] Huge train / val gap means overfitting! Increase regularization, get more data

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 35 April 27, 2021


[Plot: train / val accuracy vs. time] No gap between train / val means underfitting: train longer, use a bigger model

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 36 April 27, 2021


Look at learning curves!
[Plots: training loss and train / val accuracy]

Losses may be noisy; use a scatter plot and also plot a moving average to see trends better

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 37 April 27, 2021


Cross-validation

We develop "command centers" to visualize all our models training with different hyperparameters

Check out Weights & Biases

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 38 April 27, 2021


You can plot all your loss curves for different hyperparameters on a single plot

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 39 April 27, 2021


Don't look at accuracy or loss curves for too long!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 40 April 27, 2021


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss and accuracy curves
Step 7: GOTO step 5

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 41 April 27, 2021


Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)

This image by Paolo Guereta is licensed under CC-BY 2.0

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 42 April 27, 2021


Random Search vs. Grid Search
“Random Search for Hyper-Parameter Optimization”, Bergstra and Bengio, 2012

[Figure: Grid Layout vs. Random Layout, each over an Important Parameter (x) and an Unimportant Parameter (y)]
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 43 April 27, 2021


Today

Wrapping up previous topics:


- Data augmentation to improve test time performance
- Transfer learning

New:
- CNN architecture design

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 44 April 27, 2021


Transfer learning

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 45 April 27, 2021


“You need a lot of data if you want to train/use CNNs”

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 46 April 27, 2021


“You need a lot of data if you want to train/use CNNs” [stamped: BUSTED]
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 47 April 27, 2021
Transfer Learning with CNNs

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 48 April 27, 2021


Transfer Learning with CNNs

AlexNet:
64 x 3 x 11 x 11

(More on this in Lecture 13)


Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 49 April 27, 2021
Transfer Learning with CNNs

Test image L2 Nearest neighbors in feature space

(More on this in Lecture 13)


Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 50 April 27, 2021
Transfer Learning with CNNs
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

1. Train on Imagenet
FC-1000
FC-4096
FC-4096

MaxPool
Conv-512
Conv-512

MaxPool
Conv-512
Conv-512

MaxPool
Conv-256
Conv-256

MaxPool
Conv-128
Conv-128

MaxPool
Conv-64
Conv-64

Image

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 51 April 27, 2021


Transfer Learning with CNNs
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

1. Train on ImageNet: the full network (the Conv-64 through Conv-512 blocks, then FC-4096, FC-4096, FC-1000).

2. Small Dataset (C classes): reuse the same network, but
   - Reinitialize the final FC-1000 layer as FC-C and train it
   - Freeze all the layers below

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 52 April 27, 2021




Transfer Learning with CNNs
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

1. Train on ImageNet.

2. Small Dataset (C classes): reinitialize the final FC-1000 layer as FC-C and train it; freeze all the layers below.

3. Bigger dataset: train these last few (FC) layers and freeze the earlier conv layers; with a bigger dataset, train more layers.
   Lower the learning rate when finetuning; 1/10 of the original LR is a good starting point.
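
A sketch of steps 2-3 in PyTorch/torchvision (the layer index follows torchvision's VGG-16, and the weights argument requires a recent torchvision; older versions use pretrained=True):

```python
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # C classes in the new dataset
model = models.vgg16(weights="IMAGENET1K_V1")      # 1. start from ImageNet-trained weights

for p in model.parameters():                       # 2. freeze everything...
    p.requires_grad = False

model.classifier[6] = nn.Linear(4096, num_classes)  # ...and reinitialize FC-1000 -> FC-C
# (the new layer's parameters have requires_grad=True, so only it trains)

# 3. with more data, unfreeze the last few layers and finetune them with ~1/10 the LR
for p in model.classifier.parameters():
    p.requires_grad = True
```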

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 54 April 27, 2021






What to do with a pretrained network (layers near the top, the FC layers, are more specific; layers near the bottom, Conv-64, are more generic):

- very little data + very similar dataset: use a linear classifier on the top layer
- very little data + very different dataset: you’re in trouble… try a linear classifier from different stages
- quite a lot of data + very similar dataset: finetune a few layers
- quite a lot of data + very different dataset: finetune a larger number of layers

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 57 April 27, 2021






Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015. Reproduced with permission.

Image Captioning (CNN + RNN): CNN pretrained on ImageNet, word vectors pretrained with word2vec
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015. Figure copyright IEEE, 2015. Reproduced for educational purposes.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 60 April 27, 2021


Transfer learning with CNNs is pervasive…
(it’s the norm, not an exception)

1. Train CNN on ImageNet
2. Fine-tune (1) for object detection on Visual Genome
3. Train BERT language model on lots of text
4. Combine (2) and (3), train for joint image / language modeling
5. Fine-tune (4) for image captioning, visual question answering, etc.

Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA”, CVPR 2020. Figure copyright Luowei Zhou, 2020. Reproduced with permission.
Krishna et al, “Visual Genome: Connecting language and vision using crowdsourced dense image annotations”, IJCV 2017
Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv 2018

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 61 April 27, 2021


Transfer learning with CNNs -
Architecture matters

We will discuss different architectures in detail today

Girshick, “The Generalized R-CNN Framework for Object Detection”, ICCV 2017 Tutorial on Instance-Level Visual Recognition

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 62 April 27, 2021


Transfer learning with CNNs is pervasive…
But recent results show it might not always be necessary!
Training from scratch can work just as well as training from a pretrained ImageNet model for object detection.

But it takes 2-3x as long to train.

They also find that collecting more data is better than finetuning on a related task.

He et al, “Rethinking ImageNet Pre-training”, ICCV 2019


Figure copyright Kaiming He, 2019. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 63 April 27, 2021


Takeaway for your projects and beyond:

Source: AI & Deep Learning Memes For Back-propagated Poets

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 64 April 27, 2021


Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?

1. Find a very large dataset that has similar data, train a big ConvNet there
2. Transfer learn to your dataset

Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own:

TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
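
For instance, a minimal sketch of pulling a pretrained model from the PyTorch model zoo (the weights string assumes a recent torchvision; older versions use pretrained=True):

```python
import torch
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")   # downloads ImageNet-pretrained weights
model.eval()

x = torch.randn(1, 3, 224, 224)                    # a dummy preprocessed image batch
with torch.no_grad():
    scores = model(x)                              # (1, 1000) class scores
print(scores.argmax(dim=1))
```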

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 65 April 27, 2021


CNN Architectures

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 66 April 27, 2021


Review: LeNet-5
[LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1


Subsampling (Pooling) layers were 2x2 applied at stride 2
i.e. architecture is [CONV-POOL-CONV-POOL-FC-FC]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 67 April 27, 2021


Review: Convolution
[Figure: a 3x3x3 filter sliding over a 32x32x3 image]
Stride: downsample output activations
Padding: preserve input spatial dimensions in output activations

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 68 April 27, 2021


Review: Convolution
[Figure: a Convolution Layer maps a 32x32x3 input to 32x32x6 activation maps (6 filters)]
Each conv filter outputs a “slice” in the activation maps

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 69 April 27, 2021


Review: Pooling
Single depth slice (x, y):
  1 1 2 4
  5 6 7 8
  3 2 1 0
  1 2 3 4
max pool with 2x2 filters and stride 2:
  6 8
  3 4
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 70 April 27, 2021
Today: CNN Architectures

Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet

Also....
- SENet
- Wide ResNet
- ResNeXT
- DenseNet
- MobileNets
- NASNet
- EfficientNet

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 71 April 27, 2021


ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

[Bar chart of ILSVRC winners and their depth: Lin et al (shallow), Sanchez & Perronnin (shallow), Krizhevsky et al / AlexNet (8 layers), Zeiler & Fergus (8 layers), Simonyan & Zisserman / VGG (19 layers), Szegedy et al / GoogLeNet (22 layers), He et al / ResNet (152 layers), Shao et al (152 layers), Hu et al / SENet (152 layers), Russakovsky et al]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 72 April 27, 2021


ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

First CNN-based winner: AlexNet (Krizhevsky et al, 8 layers)
[Same ILSVRC winners chart as above]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 73 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 74 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


W’ = (W - F + 2P) / S + 1
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 75 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


W’ = (W - F + 2P) / S + 1
First layer (CONV1): 96 11x11 filters applied at stride 4
Output volume: [55x55x96] (55x55 spatial, 96 depth)
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 76 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


Output volume: [55x55x96]

Q: What is the total number of parameters in this layer? (Each filter is 11x11x3.)

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 77 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


Output volume: [55x55x96]
Parameters: (11*11*3 + 1)*96 = 35K

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 78 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
W’ = (W - F + 2P) / S + 1

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 79 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
W’ = (W - F + 2P) / S + 1

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96

Q: what is the number of parameters in this layer?

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 80 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96
Parameters: 0!

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
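
The output-size and parameter-count formulas used in these slides, as a small helper for checking the numbers:

```python
def conv_output_size(W, F, S, P=0):
    """Spatial output size: W' = (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

def conv_params(F, C_in, num_filters):
    """Each filter has F*F*C_in weights plus one bias."""
    return (F * F * C_in + 1) * num_filters

print(conv_output_size(227, 11, 4))   # CONV1: 55
print(conv_params(11, 3, 96))         # CONV1: 34,944 (~35K)
print(conv_output_size(55, 3, 2))     # POOL1: 27 (and 0 parameters)
```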

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 81 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images


After CONV1: 55x55x96
After POOL1: 27x27x96
...

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 82 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:


[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores) Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 83 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture: as listed on the previous slide.

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 84 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

Historical note: Trained on GTX 580 GPU with only 3 GB of memory. Network spread across 2 GPUs, half the neurons (feature maps) on each GPU, so CONV1 output is [55x55x48] x 2.

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 85 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

CONV1, CONV2, CONV4, CONV5: connections only with feature maps on the same GPU.

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 86 April 27, 2021


Case Study: AlexNet
[Krizhevsky et al. 2012]

CONV3, FC6, FC7, FC8: connections with all feature maps in the preceding layer, communication across GPUs.

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 87 April 27, 2021




ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

ZFNet (Zeiler & Fergus, 8 layers): improved hyperparameters over AlexNet
[Same ILSVRC winners chart as above]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 89 April 27, 2021


ZFNet [Zeiler and Fergus, 2013]

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 90 April 27, 2021


ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Deeper Networks: VGG (Simonyan & Zisserman, 19 layers) and GoogLeNet (Szegedy et al, 22 layers)
[Same ILSVRC winners chart as above]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 91 April 27, 2021


Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Small filters, Deeper networks

8 layers (AlexNet) -> 16 - 19 layers (VGG16Net)

Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2

11.7% top 5 error in ILSVRC’13 (ZFNet) -> 7.3% top 5 error in ILSVRC’14

[Architecture columns: AlexNet, VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 92 April 27, 2021


Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

[Architecture columns: AlexNet, VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 93 April 27, 2021


Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer.

Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?

[Architecture columns: AlexNet, VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 94 April 27, 2021


Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?

[Animation: Input → Conv1 (3x3) → A1 → Conv2 (3x3) → A2 → Conv3 (3x3) → A3]

[Architecture columns: VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 95 April 27, 2021










Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer: the third conv sees a [7x7] region of the input.

[Architecture columns: AlexNet, VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 100 April 27, 2021
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Q: Why use smaller filters? (3x3 conv)

Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer.

But deeper, more non-linearities.

And fewer parameters: 3 * (3^2 C^2) vs. 7^2 C^2 for C channels per layer.

[Architecture columns: AlexNet, VGG16, VGG19]
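
A quick numeric check of the parameter comparison for, say, C = 256 channels:

```python
C = 256                                    # channels per layer (illustrative)
params_three_3x3 = 3 * (3 * 3 * C * C)     # 3 * (3^2 C^2), ignoring biases
params_one_7x7 = 7 * 7 * C * C             # 7^2 C^2
print(params_three_3x3, params_one_7x7)    # 1,769,472 vs 3,211,264
```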

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 101 April 27, 2021
VGG16, layer by layer (activation memory and params, not counting biases):

INPUT:     [224x224x3]   memory: 224*224*3=150K     params: 0
CONV3-64:  [224x224x64]  memory: 224*224*64=3.2M    params: (3*3*3)*64 = 1,728
CONV3-64:  [224x224x64]  memory: 224*224*64=3.2M    params: (3*3*64)*64 = 36,864
POOL2:     [112x112x64]  memory: 112*112*64=800K    params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M   params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M   params: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]   memory: 56*56*128=400K     params: 0
CONV3-256: [56x56x256]   memory: 56*56*256=800K     params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]   memory: 56*56*256=800K     params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]   memory: 56*56*256=800K     params: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]   memory: 28*28*256=200K     params: 0
CONV3-512: [28x28x512]   memory: 28*28*512=400K     params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]   memory: 28*28*512=400K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]   memory: 28*28*512=400K     params: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]   memory: 14*14*512=100K     params: 0
CONV3-512: [14x14x512]   memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]     memory: 7*7*512=25K        params: 0
FC:        [1x1x4096]    memory: 4096               params: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]    memory: 4096               params: 4096*4096 = 16,777,216
FC:        [1x1x1000]    memory: 1000               params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

Note: Most memory is in the early CONV layers; most params are in the late FC layers.
Common layer names (as in the VGG16 diagram): conv1-1 … conv5-3, fc6, fc7, fc8.
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 102 April 27, 2021
Case Study: VGGNet
[Simonyan and Zisserman, 2014]

Details:
- ILSVRC’14 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks

[Architecture columns: AlexNet, VGG16, VGG19]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 106 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Deeper Networks: VGG (Simonyan & Zisserman, 19 layers) and GoogLeNet (Szegedy et al, 22 layers)
[Same ILSVRC winners chart as above]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 107 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Deeper networks, with computational efficiency

- ILSVRC’14 classification winner (6.7% top 5 error)
- 22 layers
- Only 5 million parameters!
  12x less than AlexNet
  27x less than VGG-16
- Efficient “Inception” module
- No FC layers

[Figure: Inception module]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 108 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

“Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other

[Figure: Inception module]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 109 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Naive Inception module: apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3 max)
Concatenate all filter outputs together channel-wise.

[Figure: Previous Layer → {1x1 conv, 3x3 conv, 5x5 conv, 3x3 max pool} → Filter concatenation]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 110 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Q: What is the problem with this? [Hint: Computational complexity]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 111 April 27, 2021
Case Study: GoogLeNet [Szegedy et al., 2014]
Q: What is the problem with this? [Hint: Computational complexity]

Example (naive Inception module): module input 28x28x256 → {1x1 conv, 128; 3x3 conv, 192; 5x5 conv, 96; 3x3 pool} → filter concatenation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 112 April 27, 2021
Q1: What are the output sizes of all the different filter operations?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 113 April 27, 2021
A1: 1x1 conv, 128 → 28x28x128; 3x3 conv, 192 → 28x28x192; 5x5 conv, 96 → 28x28x96; 3x3 pool → 28x28x256

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 114 April 27, 2021
Q2: What is the output size after filter concatenation?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 115 April 27, 2021
A2: 28x28x(128+192+96+256) = 28x28x672

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 116 April 27, 2021
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96]  28x28x96x5x5x256
Total: 854M ops

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 117 April 27, 2021
Very expensive compute!

The pooling layer also preserves feature depth, which means total depth after concatenation can only grow at every layer.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 118 April 27, 2021
Output after filter concatenation: 28x28x(128+192+96+256) = 28x28x672

Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature channel size

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 119 April 27, 2021
Review: 1x1 convolutions
56x56x64 input → 1x1 CONV with 32 filters → 56x56x32 output
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 120 April 27, 2021
Review: 1x1 convolutions
Alternatively, interpret it as applying the same FC layer on each input pixel (FC: 1x1x64 → 1x1x32)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 121 April 27, 2021
Review: 1x1 convolutions
Preserves spatial dimensions, reduces depth!
Projects depth to a lower dimension (a combination of feature maps): 56x56x64 → 56x56x32
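
A one-line PyTorch check of this behavior (shapes as in the figure; NCHW layout):

```python
import torch
from torch import nn

x = torch.randn(1, 64, 56, 56)                  # NCHW: 64 channels, 56x56 spatial
bottleneck = nn.Conv2d(64, 32, kernel_size=1)   # 1x1 CONV with 32 filters
print(bottleneck(x).shape)                      # torch.Size([1, 32, 56, 56]): depth reduced, spatial preserved
```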

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 122 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Naive Inception module:
Previous Layer → {1x1 conv, 3x3 conv, 5x5 conv, 3x3 max pool} → Filter concatenation

Inception module with dimension reduction:
Previous Layer → {1x1 conv; 1x1 conv → 3x3 conv; 1x1 conv → 5x5 conv; 3x3 max pool → 1x1 conv} → Filter concatenation

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 123 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

The added 1x1 convolutions act as “bottleneck” layers.

[Diagram: Naive Inception module (left) vs. Inception module with dimension reduction (right), with the 1x1 conv bottleneck layers highlighted.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 124 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Using the same parallel layers as the naive example, and adding “1x1 conv, 64 filter” bottlenecks:

Module input: 28x28x256
Module output: 28x28x480

Conv Ops:
[1x1 conv, 64]   28x28x64x1x1x256
[1x1 conv, 64]   28x28x64x1x1x256
[1x1 conv, 128]  28x28x128x1x1x256
[3x3 conv, 192]  28x28x192x3x3x64
[5x5 conv, 96]   28x28x96x5x5x64
[1x1 conv, 64]   28x28x64x1x1x256
Total: 358M ops

Compared to 854M ops for the naive version. The bottleneck can also reduce depth after the pooling layer.

[Diagram: Inception module with dimension reduction: 1x1 conv (128); 1x1 conv (64) then 3x3 conv (192); 1x1 conv (64) then 5x5 conv (96); 3x3 pool then 1x1 conv (64); all filter-concatenated.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 125 April 27, 2021
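The sketch below is a hedged PyTorch rendering of an Inception module with these bottlenecks, using the channel counts from the example above; the real GoogLeNet varies the counts per stage and includes BatchNorm/ReLU, which are omitted here for brevity:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with 1x1 "bottleneck" dimension reduction
    (channel counts follow the 28x28x256 example above)."""
    def __init__(self, in_ch=256):
        super().__init__()
        # Branch 1: 1x1 conv, 128
        self.b1 = nn.Conv2d(in_ch, 128, kernel_size=1)
        # Branch 2: 1x1 bottleneck (64) -> 3x3 conv, 192
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1),
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 bottleneck (64) -> 5x5 conv, 96
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1),
            nn.Conv2d(64, 96, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pool -> 1x1 conv, 64 (reduces depth after pooling)
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 64, kernel_size=1),
        )

    def forward(self, x):
        # Filter concatenation along the channel dimension:
        # 128 + 192 + 96 + 64 = 480 output channels
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 480, 28, 28])
```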
Case Study: GoogLeNet
[Szegedy et al., 2014]

Stack Inception modules with dimension reduction on top of each other.

[Diagram: Inception module]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 126 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet
architecture

Stem Network:
Conv-Pool-
2x Conv-Pool

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 127 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet
architecture

Stacked Inception
Modules

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 128 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet
architecture

Classifier output

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 129 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet architecture: classifier output.

Note: after the last convolutional layer, a global average pooling layer is used that spatially averages across each feature map (HxWxC to 1x1xC), before the final FC layer. No longer multiple expensive FC layers!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 130 April 27, 2021
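A minimal sketch of that global-average-pooling head in PyTorch (the 7x7x1024 feature shape is assumed here for illustration):

```python
import torch
import torch.nn as nn

# Global average pooling head: spatially average each feature map
# (H x W x C -> 1 x 1 x C), then a single FC layer gives class scores.
features = torch.randn(1, 1024, 7, 7)     # assumed shape of the last conv output
gap = nn.AdaptiveAvgPool2d(output_size=1) # -> (1, 1024, 1, 1)
fc = nn.Linear(1024, 1000)                # one small classifier layer
scores = fc(gap(features).flatten(1))     # -> (1, 1000)
print(scores.shape)
```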
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet
architecture

Auxiliary classification outputs to inject additional gradient at lower layers


(AvgPool-1x1Conv-FC-FC-Softmax)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 131 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

Full GoogLeNet
architecture

22 total layers with weights


(parallel layers count as 1 layer => 2 layers per Inception module. Don’t count auxiliary output layers)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 132 April 27, 2021
Case Study: GoogLeNet
[Szegedy et al., 2014]

A deeper network, with computational efficiency:

- 22 layers
- Efficient “Inception” module
- Avoids expensive FC layers
- 12x fewer params than AlexNet
- 27x fewer params than VGG-16
- ILSVRC’14 classification winner (6.7% top-5 error)

[Diagram: Inception module]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 133 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
“Revolution of Depth”

[Bar chart: winning model depth by year: shallow (Lin et al.; Sanchez & Perronnin), 8 layers (AlexNet, Krizhevsky et al.; Zeiler & Fergus), 19 layers (VGG, Simonyan & Zisserman), 22 layers (GoogLeNet, Szegedy et al.), 152 layers (ResNet, He et al.; Shao et al.; SENet, Hu et al.); human performance from Russakovsky et al.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 134 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Very deep networks using residual connections

- 152-layer model for ImageNet
- ILSVRC’15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

[Diagram, left: residual block. Input X goes through conv, relu, conv to produce F(x), which is added to the identity shortcut X to give F(x) + x, followed by relu.]

[Diagram, right: full ResNet stack. Input, 7x7 conv 64 /2, Pool, stages of 3x3 conv layers (64, 128 /2, ..., with periodic stride-2 downsampling), Pool, FC 1000, Softmax.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 135 April 27, 2021
Case Study: ResNet
[He et al., 2015]

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 136 April 27, 2021
Case Study: ResNet
[He et al., 2015]

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

[Plots: training error (left) and test error (right) vs. iterations; the 56-layer network sits above the 20-layer network in both plots.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 137 April 27, 2021
Case Study: ResNet
[He et al., 2015]

What happens when we continue stacking deeper layers on a “plain” convolutional neural network?

[Plots: training error (left) and test error (right) vs. iterations; the 56-layer network sits above the 20-layer network in both plots.]

The 56-layer model performs worse on both training and test error
-> The deeper model performs worse, but it’s not caused by overfitting!

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 138 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Fact: Deep models have more representation power (more parameters) than shallower models.

Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 139 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Fact: Deep models have more representation power (more parameters) than shallower models.

Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.

What should the deeper model learn, to be at least as good as the shallower model?
A solution by construction is copying the learned layers from the shallower model and setting the additional layers to identity mappings.

[Diagram: two stacks that compute H(x): a shallower conv-relu-conv stack, and a deeper one whose extra layers are set to the identity.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 140 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Solution: Use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

[Diagram: “Plain” layers. Input X goes through conv, relu, conv and produces H(x) directly.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 141 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Solution: Use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

Identity mapping: H(x) = F(x) + x, so H(x) = x if F(x) = 0.

[Diagram, left: “Plain” layers. X goes through conv, relu, conv and produces H(x) directly.]
[Diagram, right: Residual block. X goes through conv, relu, conv to produce F(x), which is added to the identity shortcut X to give H(x) = F(x) + x, followed by relu.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 142 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Solution: Use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.

Identity mapping: H(x) = F(x) + x, so H(x) = x if F(x) = 0.

Use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly.

[Diagram, left: “Plain” layers, computing H(x) directly.]
[Diagram, right: Residual block, computing F(x) and adding the identity shortcut to give H(x) = F(x) + x.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 143 April 27, 2021
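A minimal PyTorch sketch of such a residual block: two 3x3 convs learn F(x) and the identity shortcut is added before the final relu. The conv-BN-ReLU ordering here is the common convention, assumed for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 convs learn the residual F(x); the block outputs relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first 3x3 conv + relu
        out = self.bn2(self.conv2(out))        # second 3x3 conv produces F(x)
        return F.relu(out + x)                 # add the identity shortcut, then relu

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```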
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers

[Diagram, left: residual block. X goes through 3x3 conv, relu, 3x3 conv to produce F(x), added to the identity shortcut to give F(x) + x, followed by relu.]

[Diagram, right: full ResNet stack. Input, 7x7 conv 64 /2, Pool, stacked 3x3 conv layers (64, then 128 with stride 2, ..., up to 512), Pool, FC 1000, Softmax.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 144 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each spatial dimension), reducing the activation volume by half

[Diagram: residual block plus the full ResNet stack; e.g. “3x3 conv, 128, /2” doubles the filters to 128 while downsampling spatially with stride 2, relative to the 64-filter stage below it.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 145 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each spatial dimension)
- Additional conv layer at the beginning (stem: 7x7 conv, 64, /2)

[Diagram: residual block plus the full ResNet stack, with the beginning conv layer highlighted.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 146 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each spatial dimension)
- Additional conv layer at the beginning (stem)
- No FC layers at the end (only an FC 1000 to output class scores), with a global average pooling layer after the last conv layer
- (In theory, you can train a ResNet with input images of variable sizes)

[Diagram: residual block plus the full ResNet stack, with the global average pooling layer and the single FC 1000 classifier highlighted.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 147 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Total depths of 18, 34, 50, 101, or 152 layers for ImageNet.

[Diagram: the full ResNet stack, as above.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 148 April 27, 2021
Case Study: ResNet
[He et al., 2015]
For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet).

[Diagram: bottleneck residual block on a 28x28x256 input: 1x1 conv, 64; BN, relu; 3x3 conv, 64; BN, relu; 1x1 conv, 256; giving a 28x28x256 output.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 149 April 27, 2021
Case Study: ResNet
[He et al., 2015]
For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet):
- a 1x1 conv with 64 filters projects the 28x28x256 input down to 28x28x64,
- the 3x3 conv then operates over only 64 feature maps,
- a 1x1 conv with 256 filters projects back to 256 feature maps (28x28x256).

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 150 April 27, 2021
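A hedged PyTorch sketch of this bottleneck block for the 28x28x256 example; the exact BN/ReLU placement and the projection shortcut used at stage boundaries are simplified away:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 conv reduces 256 -> 64 channels, the 3x3 conv runs on only 64 maps,
    and a final 1x1 conv projects back to 256 for the shortcut addition."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, bottleneck, 1, bias=False)   # 256 -> 64
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, channels, 1, bias=False)   # 64 -> 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))    # 28x28x256 -> 28x28x64
        out = F.relu(self.bn2(self.conv2(out)))  # 3x3 conv over 64 feature maps
        out = self.bn3(self.conv3(out))          # project back to 28x28x256
        return F.relu(out + x)                   # identity shortcut

print(BottleneckBlock()(torch.randn(1, 256, 28, 28)).shape)  # (1, 256, 28, 28)
```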
Case Study: ResNet
[He et al., 2015]

Training ResNet in practice:

- Batch Normalization after every CONV layer


- Xavier initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 151 April 27, 2021
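A minimal sketch of that training setup in PyTorch; the hyperparameters are copied from the list above, while the torchvision model and the plateau-based LR schedule are stand-ins for illustration, not the authors' original code:

```python
import torch
import torch.nn as nn
import torchvision

# Hyperparameters copied from the slide; model and data are stand-ins.
model = torchvision.models.resnet18(num_classes=1000)   # BatchNorm after every conv
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
# "Divide the learning rate by 10 when validation error plateaus":
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=5)
criterion = nn.CrossEntropyLoss()   # note: no dropout anywhere in the network

# One fake training step (the slide's mini-batch size is 256; a small batch is
# used here only so the snippet runs quickly).
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step(loss.item())         # in practice, step on the validation error
```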
Case Study: ResNet
[He et al., 2015]

Experimental Results
- Able to train very deep
networks without degrading
(152 layers on ImageNet, 1202
on Cifar)
- Deeper networks now achieve
lower training error as expected
- Swept 1st place in all ILSVRC
and COCO 2015 competitions

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 152 April 27, 2021
Case Study: ResNet
[He et al., 2015]

Experimental Results
- Able to train very deep
networks without degrading
(152 layers on ImageNet, 1202
on Cifar)
- Deeper networks now achieve
lower training error as expected
- Swept 1st place in all ILSVRC
and COCO 2015 competitions

ILSVRC 2015 classification winner (3.6% top-5 error) -- better than “human performance”! (Russakovsky 2014)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 153 April 27, 2021
Comparing complexity...

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 154 April 27, 2021
Comparing complexity... Inception-v4: Resnet + Inception!

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 155 April 27, 2021
Comparing complexity...

VGG: most parameters, most operations

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 156 April 27, 2021
Comparing complexity...

GoogLeNet: most efficient

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 157 April 27, 2021
Comparing complexity...

AlexNet: smaller compute, but still memory-heavy and lower accuracy

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 158 April 27, 2021
Comparing complexity...

ResNet: moderate efficiency depending on the model, highest accuracy

An Analysis of Deep Neural Network Models for Practical Applications, 2017.

Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 159 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Network ensembling

[Bar chart: winning model depth by year, as before: shallow, 8 layers (AlexNet), 19 layers (VGG), 22 layers (GoogLeNet), 152 layers (ResNet, followed by the ensemble and SENet winners).]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 160 April 27, 2021
Improving ResNets...
“Good Practices for Deep Feature Fusion”
[Shao et al. 2016]

- Multi-scale ensembling of Inception, Inception-ResNet, ResNet, and Wide ResNet models
- ILSVRC’16 classification winner

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 161 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Adaptive feature map reweighting

[Bar chart: winning model depth by year, as before: shallow, 8 layers (AlexNet), 19 layers (VGG), 22 layers (GoogLeNet), 152 layers (ResNet, followed by the ensemble and SENet winners).]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 162 April 27, 2021
Improving ResNets...
Squeeze-and-Excitation Networks (SENet)
[Hu et al. 2017]

- Add a “feature recalibration” module that learns to adaptively reweight feature maps
- Global information (a global average pooling layer) plus 2 FC layers is used to determine the feature map weights
- ILSVRC’17 classification winner (using ResNeXt-152 as a base architecture)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 163 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

[Bar chart: winning model depth by year, as before.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 164 April 27, 2021
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Completion of the challenge:
Annual ImageNet competition no longer held after 2017 -> now moved to Kaggle.

[Bar chart: winning model depth by year, as before.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 165 April 27, 2021
But research into CNN architectures is still flourishing

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 166 April 27, 2021
Improving ResNets...

Identity Mappings in Deep Residual Networks
[He et al. 2016]

- Improved ResNet block design from the creators of ResNet
- Creates a more direct path for propagating information throughout the network
- Gives better performance

[Diagram: “pre-activation” residual block, with BN and ReLU placed before each conv rather than after.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 167 April 27, 2021
Improving ResNets...

Wide Residual Networks
[Zagoruyko et al. 2016]

- Argues that residuals are the important factor, not depth
- Uses wider residual blocks (F x k filters instead of F filters in each layer)
- A 50-layer wide ResNet outperforms the 152-layer original ResNet
- Increasing width instead of depth is more computationally efficient (parallelizable)

[Diagram: basic residual block (two 3x3 convs with F filters) vs. wide residual block (two 3x3 convs with F x k filters).]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 168 April 27, 2021
Improving ResNets...
Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
[Xie et al. 2016]

- Also from the creators of ResNet
- Increases the width of the residual block through multiple parallel pathways (“cardinality”)
- Parallel pathways are similar in spirit to the Inception module

[Diagram: a 256-d bottleneck residual block (1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256) vs. a ResNeXt block with 32 parallel paths, each 1x1 conv, 4; 3x3 conv, 4; 1x1 conv, 256, summed before the shortcut.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 169 April 27, 2021
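A hedged PyTorch sketch of a ResNeXt-style block: the 32 parallel pathways are implemented as a single grouped 3x3 convolution (groups = cardinality), the standard equivalent formulation; BatchNorm is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    """The 32 parallel pathways are expressed as one grouped 3x3 convolution
    (groups = cardinality, 4 channels per group), bracketed by 1x1 convs."""
    def __init__(self, channels=256, cardinality=32, group_width=4):
        super().__init__()
        width = cardinality * group_width                       # 32 * 4 = 128
        self.conv1 = nn.Conv2d(channels, width, 1, bias=False)  # 256 -> 128
        self.conv2 = nn.Conv2d(width, width, 3, padding=1,
                               groups=cardinality, bias=False)  # 32 parallel paths
        self.conv3 = nn.Conv2d(width, channels, 1, bias=False)  # 128 -> 256

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        out = self.conv3(out)
        return F.relu(out + x)   # residual shortcut, as in ResNet

print(ResNeXtBlock()(torch.randn(1, 256, 14, 14)).shape)  # (1, 256, 14, 14)
```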
Other ideas...
Densely Connected Convolutional Networks (DenseNet)
[Huang et al. 2017]

- Dense blocks where each layer is connected to every other layer in a feedforward fashion
- Alleviates vanishing gradients, strengthens feature propagation, encourages feature reuse
- Showed that a shallower 50-layer network can outperform a deeper 152-layer ResNet

[Diagram: a Dense Block, where each conv layer's input is the concatenation of all previous layers' outputs, and a full DenseNet built from Dense Blocks separated by Conv/Pool transition layers, ending in Pool, FC, Softmax.]

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - April 27, 2021
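A minimal PyTorch sketch of a dense block; the growth rate and layer count are illustrative, and real DenseNets also use BN/ReLU/1x1 bottlenecks plus transition layers between blocks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps and
    appends `growth` new maps of its own."""
    def __init__(self, in_channels, num_layers=4, growth=32):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth, growth, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for conv in self.layers:
            # concatenate everything produced so far, then add new feature maps
            features.append(F.relu(conv(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

print(DenseBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64 + 4*32, 32, 32)
```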


Efficient networks...

MobileNets: Efficient Convolutional Neural Networks for Mobile Applications
[Howard et al. 2017]

- Depthwise separable convolutions replace standard convolutions by factorizing them into a depthwise convolution and a 1x1 (pointwise) convolution
- Much more efficient, with little loss in accuracy
- Follow-up MobileNetV2 work in 2018 (Sandler et al.)
- ShuffleNet: Zhang et al., CVPR 2018

[Diagram: standard block (3x3 conv, C->C, compute 9C^2HW, with BatchNorm/Pool) vs. MobileNet block (depthwise 3x3 conv with groups=C, compute 9CHW, followed by a pointwise 1x1 conv C->C, compute C^2HW). Total compute: 9C^2HW vs. 9CHW + C^2HW.]
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - April 27, 2021
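A minimal PyTorch sketch contrasting a standard 3x3 convolution with its depthwise-separable replacement (the channel count C = 64 is illustrative):

```python
import torch
import torch.nn as nn

# Standard 3x3 conv (C -> C) vs. its depthwise-separable replacement.
# Per spatial position the standard conv costs ~9*C^2 multiply-adds (9*C^2*H*W
# overall); the depthwise 3x3 (groups=C) costs ~9*C and the pointwise 1x1
# costs ~C^2, i.e. 9*C*H*W + C^2*H*W overall.
C = 64

standard = nn.Conv2d(C, C, kernel_size=3, padding=1)

depthwise_separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C),  # depthwise: one 3x3 filter per channel
    nn.Conv2d(C, C, kernel_size=1),                       # pointwise: 1x1 conv mixes channels
)

x = torch.randn(1, C, 56, 56)
print(standard(x).shape, depthwise_separable(x).shape)    # identical output shapes

n_std = sum(p.numel() for p in standard.parameters())
n_sep = sum(p.numel() for p in depthwise_separable.parameters())
print(n_std, n_sep)   # the separable version has far fewer parameters
```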
Learning to search for network architectures...
Neural Architecture Search with Reinforcement Learning (NAS)
[Zoph et al. 2016]

- A “controller” network learns to design a good network architecture (it outputs a string corresponding to a network design)
- Iterate:
  1) Sample an architecture from the search space
  2) Train the architecture to get a “reward” R corresponding to its accuracy
  3) Compute the gradient of the sample probability and scale it by R to update the controller parameters (i.e., increase the likelihood of good architectures being sampled, decrease the likelihood of bad ones)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 172 April 27, 2021
Learning to search for network architectures...
Learning Transferable Architectures for Scalable Image
Recognition
[Zoph et al. 2017]
- Applying neural architecture search (NAS) to a
large dataset like ImageNet is expensive
- Design a search space of building blocks
(“cells”) that can be flexibly stacked
- NASNet: Use NAS to find best cell structure
on smaller CIFAR-10 dataset, then transfer
architecture to ImageNet
- Many follow-up works in this
space e.g. AmoebaNet (Real et
al. 2019) and ENAS (Pham,
Guan et al. 2018)

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 173 April 27, 2021
But sometimes smart heuristic is better than NAS ...

EfficientNet: Smart Compound Scaling


[Tan and Le. 2019]
- Increase network capacity by scaling width,
depth, and resolution, while balancing
accuracy and efficiency.
- Search for optimal set of compound scaling
factors given a compute budget (target
memory & flops).
- Scale up using smart heuristic rules

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 174 April 27, 2021
Efficient networks...

https://openai.com/blog/ai-and-efficiency/

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 175 April 27, 2021
Summary: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet

Also....
- SENet
- Wide ResNet
- ResNeXt
- DenseNet
- MobileNets
- NASNet

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 176 April 27, 2021
Main takeaways
AlexNet showed that you can use CNNs to train computer vision models.
ZFNet and VGG showed that bigger networks work better.
GoogLeNet was one of the first to focus on efficiency, using 1x1 bottleneck
convolutions and global average pooling instead of FC layers.
ResNet showed us how to train extremely deep networks
- Limited only by GPU & memory!
- Showed diminishing returns as networks got bigger
After ResNet: CNNs surpassed the human benchmark on ImageNet, and the focus
shifted to efficient networks:
- Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet
Neural Architecture Search can now automate architecture design.

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 177 April 27, 2021
Summary: CNN Architectures
- Many popular architectures available in model zoos
- ResNet and SENet currently good defaults to use
- Networks have gotten increasingly deep over time
- Many other aspects of network architectures are also continuously
being investigated and improved

- Next time: Recurrent neural networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 178 April 27, 2021
Next time: Recurrent Neural Networks

Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 179 April 27, 2021
