CNN Architectures
Quick recap from last time:
- Optimizers: SGD, SGD+Momentum, RMSProp, Adam
- Learning rate schedules: cosine, linear, inverse sqrt
- Stochastic regularization (e.g. batch normalization). Training: normalize using statistics from random (sometimes approximate) minibatches. Testing: average out the randomness.
Data augmentation: load the image and its label (e.g. "cat"), apply a random transform to the image, then compute the loss with the CNN; the label stays the same.
Cubuk et al., "AutoAugment: Learning Augmentation Strategies from Data", CVPR 2019
Use the architecture from the previous step, use all the training data, turn on a small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.
Pick the best models from Step 4 and train them for longer (~10-20 epochs) without learning rate decay.
[Figures: training and validation curves plotted over time.]
We develop "command centers" to visualize all of our models training with different hyperparameters.
[Figure: grid search vs. random search over one important and one unimportant hyperparameter. Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017.]
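The point of the Bergstra et al. illustration can be made concrete with a tiny sketch: with the same budget of 9 trials, grid search only ever tries 3 distinct values of the important hyperparameter, while random search tries 9. (A minimal sketch; the function and variable names are ours, not from the lecture.)

```python
import random

def grid_points(n_per_axis):
    """All (important, unimportant) pairs on an evenly spaced grid in [0, 1]^2."""
    vals = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return [(a, b) for a in vals for b in vals]

def random_points(n, rng):
    """n points sampled uniformly at random in [0, 1]^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]

rng = random.Random(0)
grid = grid_points(3)         # 9 trials on a 3x3 grid
rand = random_points(9, rng)  # 9 random trials

# Distinct values tried along the *important* axis:
grid_important = {a for a, _ in grid}   # only 3 distinct values
rand_important = {a for a, _ in rand}   # 9 distinct values
print(len(grid_important), len(rand_important))  # 3 9
```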
Previously: "how to train/use CNNs"
New:
- CNN architecture design
Fei-Fei Li, Ranjay Krishna, Danfei Xu Lecture 9 - 47 April 27, 2021
Transfer Learning with CNNs
AlexNet first conv layer: 64 filters of size 3 x 11 x 11.
1. Train on ImageNet.
[Figure: a VGG-style network, Image, then Conv-64, Conv-64, MaxPool, Conv-128, Conv-128, MaxPool, Conv-256, Conv-256, MaxPool, Conv-512, Conv-512, MaxPool, Conv-512, Conv-512, MaxPool, FC-4096, FC-4096, FC-1000, shown once per transfer learning step.]
Donahue et al, "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition", ICML 2014
Lower conv layers learn more generic features; higher layers are more specific to the pretraining task. What to do on a new dataset:

                        similar dataset             very different dataset
very little data        Use a linear classifier     You're in trouble... Try a linear
                        on the top layer            classifier on features from
                                                    different stages
quite a lot of data     Finetune a few layers       Finetune a larger number of layers
Zhou et al, "Unified Vision-Language Pre-Training for Image Captioning and VQA", CVPR 2020. Figure copyright Luowei Zhou, 2020. Reproduced with permission.
Krishna et al, "Visual Genome: Connecting language and vision using crowdsourced dense image annotations", IJCV 2017
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv 2018
Girshick, "The Generalized R-CNN Framework for Object Detection", ICCV 2017 Tutorial on Instance-Level Visual Recognition
Pretrained models are available in model zoos:
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision
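The transfer learning decision grid above can be written down as a tiny lookup helper. (A hypothetical sketch; the helper and its key names are ours, only the advice itself comes from the table.)

```python
# Hypothetical helper encoding the transfer-learning decision grid above.
# Keys: (amount of new data, similarity of new dataset to pretraining data).
STRATEGY = {
    ("very little", "similar"):   "train a linear classifier on the top layer",
    ("very little", "different"): "you're in trouble... try linear classifiers "
                                  "on features from different stages",
    ("a lot",       "similar"):   "finetune a few layers",
    ("a lot",       "different"): "finetune a larger number of layers",
}

def transfer_strategy(data_amount, similarity):
    """Look up the recommended transfer-learning strategy."""
    return STRATEGY[(data_amount, similarity)]

print(transfer_strategy("very little", "similar"))
```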
Padding: preserve the input spatial dimensions in the output activations. For example, a 3x3 filter on a 32x32x3 input with 1 pixel of zero padding gives a 32x32 output.
Convolution layer: each conv filter outputs one "slice" (activation map) of the output volume, so 6 filters applied to a 32x32x3 input produce a 32x32x6 output.
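The padding rule above follows from the standard conv output-size formula, (W - F + 2P) / S + 1; a quick sketch to check it:

```python
def conv_output_size(w, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    assert (w - f + 2 * pad) % stride == 0, "filter does not tile the input evenly"
    return (w - f + 2 * pad) // stride + 1

# "Same" padding for stride 1: P = (F - 1) / 2 preserves spatial dims.
print(conv_output_size(32, 3, stride=1, pad=1))   # 32
print(conv_output_size(32, 3, stride=1, pad=0))   # 30
print(conv_output_size(227, 11, stride=4, pad=0)) # 55 (AlexNet CONV1)
```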
Today: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- SENet - DenseNet
- Wide ResNet - MobileNets
- ResNeXT - NASNet
- EfficientNet
[Figure: ILSVRC winners by year. Lin et al; Sanchez & Perronnin; Krizhevsky et al (AlexNet); Zeiler & Fergus; Simonyan & Zisserman (VGG, 19 layers); Szegedy et al (GoogLeNet, 22 layers); He et al (ResNet, 152 layers); Shao et al; Hu et al (SENet); Russakovsky et al.]
Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
MAX POOL3
FC6
FC7
FC8
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
ZFNet: Improved hyperparameters over AlexNet.
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 16.4% -> 11.7%
11.7% top-5 error in ILSVRC'13 (ZFNet).
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Small filters, deeper networks: only 3x3 conv (stride 1, pad 1) and 2x2 max pool (stride 2), in 16 layers (VGG16) or 19 layers (VGG19).
Q: Why use smaller filters? (3x3 conv)
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer, but is deeper (more non-linearities) and has fewer parameters: 3 x (3*3*C*C) vs 7*7*C*C for C channels per layer.
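Both halves of that claim are easy to check numerically; a short sketch with C = 256 channels:

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer (biases ignored, as in the slides)."""
    return k * k * c_in * c_out

C = 256  # channel count held constant through the stack, as within a VGG stage
three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2
one_7x7   = conv_params(7, C, C)      # 49 * C^2

def receptive_field(num_layers, k):
    """Effective receptive field of a stack of stride-1 k x k convs."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1  # each layer adds (k - 1) pixels of context
    return rf

print(three_3x3, one_7x7)     # 1769472 3211264
print(receptive_field(3, 3))  # 7
```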
VGG16, layer by layer (memory = number of activations; params exclude biases):

INPUT:     [224x224x3]   memory: 224*224*3   = 150K  params: 0
CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M  params: (3*3*3)*64    = 1,728
CONV3-64:  [224x224x64]  memory: 224*224*64  = 3.2M  params: (3*3*64)*64   = 36,864
POOL2:     [112x112x64]  memory: 112*112*64  = 800K  params: 0
CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M  params: (3*3*64)*128  = 73,728
CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M  params: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]   memory: 56*56*128   = 400K  params: 0
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]   memory: 56*56*256   = 800K  params: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]   memory: 28*28*256   = 200K  params: 0
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]   memory: 28*28*512   = 400K  params: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]   memory: 14*14*512   = 100K  params: 0
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]   memory: 14*14*512   = 100K  params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]     memory: 7*7*512     = 25K   params: 0
FC:        [1x1x4096]    memory: 4096               params: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]    memory: 4096               params: 4096*4096    = 16,777,216
FC:        [1x1x1000]    memory: 1000               params: 4096*1000    = 4,096,000

Note: most memory is in the early CONV layers; most params are in the late FC layers.

TOTAL memory: 24M * 4 bytes ~= 96MB / image (forward pass only; roughly 2x for backward)
TOTAL params: 138M parameters
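The 138M total can be recomputed from the table; a short sketch summing the conv and FC weight counts (biases ignored, as above):

```python
# (c_in, c_out) for each 3x3 conv layer of VGG16, in order.
conv_cfg = [
    (3, 64), (64, 64),                    # conv1 block
    (64, 128), (128, 128),                # conv2 block
    (128, 256), (256, 256), (256, 256),   # conv3 block
    (256, 512), (512, 512), (512, 512),   # conv4 block
    (512, 512), (512, 512), (512, 512),   # conv5 block
]
conv_total = sum(3 * 3 * cin * cout for cin, cout in conv_cfg)
fc_total = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
total = conv_total + fc_total
print(total)  # 138344128, i.e. the ~138M parameters quoted in the table
```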
(The same table, with VGG's layer names attached: conv1-1 through conv5-3 for the conv stack, then fc6, fc7, fc8.)
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Details:
- ILSVRC'14: 2nd in classification, 1st in localization
- Training procedure similar to Krizhevsky et al., 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Case Study: GoogLeNet
[Szegedy et al., 2014]
Key idea: the "Inception module". Apply several parallel filter operations (1x1 conv, 3x3 conv, 5x5 conv, and 3x3 pooling) to the input from the previous layer, and concatenate all their outputs depth-wise ("filter concatenation").
Q: What is the problem with this? [Hint: computational complexity]
Example: module input 28x28x256, with 128 1x1 filters, 192 3x3 filters, 96 5x5 filters, and the pooling branch (which preserves depth). The concatenated output is 28x28x(128+192+96+256) = 28x28x672: the depth grows after every module, and the 3x3 and 5x5 convolutions over all 256 input channels are very expensive.
Review: 1x1 convolutions
A 1x1 CONV with 32 filters applied to a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product at every spatial position, producing a 56x56x32 output.
Alternatively, interpret it as applying the same FC layer (64 to 32) at each input pixel.
This preserves the spatial dimensions while reducing the depth!
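The "FC layer per pixel" interpretation is a few lines of plain Python; a toy sketch on a 2x2x3 input with 2 filters (our own example values, not from the slides):

```python
# Toy input volume, H x W x C_in = 2 x 2 x 3.
x = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]]
# Two 1x1x3 filters, i.e. one small FC layer shared across all pixels.
w = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]

def conv1x1(x, w):
    """Dot each filter with the C_in-vector at every spatial position."""
    return [[[sum(f[c] * px[c] for c in range(len(px))) for f in w]
             for px in row] for row in x]

y = conv1x1(x, w)
# Spatial dims preserved (2x2), depth reduced from 3 to 2.
print(len(y), len(y[0]), len(y[0][0]))  # 2 2 2
print(y[0][0])                          # [-2.0, 3.0]
```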
Case Study: GoogLeNet
[Szegedy et al., 2014]
Inception module with dimension reduction: insert 1x1 conv "bottleneck" layers before the expensive 3x3 and 5x5 convolutions, and after the pooling branch, to reduce depth.
Using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks (module input: 28x28x256; branch outputs: 28x28x128, 28x28x192, 28x28x96, 28x28x64; concatenated output: 28x28x480):
Conv ops:
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96]  28x28x96x5x5x64
[1x1 conv, 64]  28x28x64x1x1x256
Total: 358M ops, compared to 854M ops for the naive version. The bottleneck can also reduce depth after the pooling layer.
Case Study: GoogLeNet
[Szegedy et al., 2014]
Full GoogLeNet architecture:
- Stem network at the input: Conv-Pool-2x Conv-Pool
- A stack of Inception modules
- Classifier output: global average pooling (HxWxC to 1x1xC) followed by a single linear layer, instead of large FC layers
In summary:
- 22 layers
- Efficient "Inception" module
- Avoids expensive FC layers
- 12x fewer params than AlexNet
- 27x fewer params than VGG-16
- ILSVRC'14 classification winner (6.7% top-5 error)
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
“Revolution of Depth”
Case Study: ResNet
[He et al., 2015]
Very deep networks using residual connections: each residual block adds its input x to the output of its conv layers, computing F(x) + x followed by a relu.
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC'15 and COCO'15
Case Study: ResNet
[He et al., 2015]
What happens when we keep stacking deeper layers on a "plain" convolutional network?
[Figure: training error and test error vs. iterations; the 56-layer network has higher error than the 20-layer network on both plots.]
The 56-layer model performs worse on both training and test error, so the deeper model's worse performance is not caused by overfitting.
Hypothesis: it is an optimization problem; deeper models are harder to optimize. A deeper model should be able to do at least as well as a shallower one (copy the shallow model's layers and set the extra layers to identity), so the gap suggests the solver fails to find that solution.
Case Study: ResNet
[He et al., 2015]
Solution: Use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping.
"Plain" layers: x, conv, relu, conv, giving H(x); the layers must fit H(x) directly.
Residual block: x, conv, relu, conv, giving F(x); the block output is H(x) = F(x) + x via an identity skip connection, followed by a relu.
Identity mapping: H(x) = x if F(x) = 0.
So the layers fit the residual F(x) = H(x) - x instead of H(x) directly.
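The residual-block wiring above can be sketched in a few lines of plain Python (a toy on flat vectors; the relu and the stand-in residual function F replace the slide's conv layers):

```python
def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(x, F):
    """Compute relu(F(x) + x), as in the residual-block diagram above."""
    fx = F(x)
    return relu([a + b for a, b in zip(fx, x)])

# If the layers learn F(x) = 0, the block reduces to the identity mapping
# (for non-negative activations, as after a preceding relu):
zero_F = lambda v: [0.0] * len(v)
x = [1.0, 2.0, 3.0]
print(residual_block(x, zero_F))  # [1.0, 2.0, 3.0]
```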
Full ResNet architecture:
- Stack residual blocks, each with two 3x3 conv layers
- Periodically double the number of filters and downsample spatially using stride 2 (e.g. 3x3 conv, 128 / 2)
- Additional conv layer at the beginning (7x7 conv, 64, stride 2, then pool)
- No FC layers except the final FC 1000 to the output classes, fed by global average pooling after the last conv layer, then Softmax
- (In theory, you can train a ResNet with input images of variable sizes)
- Total depths of 34, 50, 101, or 152 layers for ImageNet
Case Study: ResNet
[He et al., 2015]
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet). For a 28x28x256 block input:
- 1x1 conv, 64 filters: projects 28x28x256 down to 28x28x64 (BN, relu)
- 3x3 conv, 64: operates over only 64 feature maps (BN, relu)
- 1x1 conv, 256 filters: projects back to 256 feature maps, giving the 28x28x256 output
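The savings from the bottleneck are easy to quantify; a sketch comparing its weight count against a basic block of two 3x3 convs at the full 256-channel depth (biases ignored):

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv layer."""
    return k * k * c_in * c_out

bottleneck = (
    conv_params(1, 256, 64)    # 1x1 conv: 256 -> 64
    + conv_params(3, 64, 64)   # 3x3 conv over only 64 feature maps
    + conv_params(1, 64, 256)  # 1x1 conv: 64 -> 256
)
basic = 2 * conv_params(3, 256, 256)  # two 3x3 convs at full depth
print(bottleneck, basic)  # 69632 1179648
```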
Case Study: ResNet
[He et al., 2015]
Experimental results:
- Able to train very deep networks without degradation (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions
Comparing complexity...
- Inception-v4: ResNet + Inception!
- VGG: most parameters, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, but still memory-heavy and lower accuracy
- ResNet: moderate efficiency depending on the model, highest accuracy
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Network ensembling
Improving ResNets...
“Good Practices for Deep Feature Fusion”
[Shao et al. 2016]
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Adaptive feature map reweighting
Improving ResNets...
Squeeze-and-Excitation Networks (SENet)
[Hu et al. 2017]
But research into CNN architectures is still flourishing
Improving ResNets...
Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
[Xie et al. 2016]
Other ideas...
Densely Connected Convolutional Networks (DenseNet)
- Dense blocks: each layer is connected to every other layer in a feedforward fashion (inputs are concatenated)
- Alleviates vanishing gradients, strengthens feature propagation, encourages feature reuse
- Showed that a relatively shallow 50-layer network can be competitive with much deeper ResNets
Learning to search for network architectures...
Learning Transferable Architectures for Scalable Image
Recognition
[Zoph et al. 2017]
- Applying neural architecture search (NAS) to a
large dataset like ImageNet is expensive
- Design a search space of building blocks
(“cells”) that can be flexibly stacked
- NASNet: Use NAS to find best cell structure
on smaller CIFAR-10 dataset, then transfer
architecture to ImageNet
- Many follow-up works in this
space e.g. AmoebaNet (Real et
al. 2019) and ENAS (Pham,
Guan et al. 2018)
But sometimes a smart heuristic is better than NAS...
Efficient networks...
https://openai.com/blog/ai-and-efficiency/
Summary: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also....
- SENet - DenseNet
- Wide ResNet - MobileNets
- ResNeXT - NASNet
Main takeaways
AlexNet showed that you can use CNNs to train Computer Vision models.
ZFNet and VGG showed that bigger networks work better
GoogLeNet is one of the first to focus on efficiency using 1x1 bottleneck
convolutions and global avg pool instead of FC layers
ResNet showed us how to train extremely deep networks
- Limited only by GPU & memory!
- Showed diminishing returns as networks got bigger
After ResNet: CNNs surpassed the human baseline on ImageNet, and the focus shifted to
efficient networks:
- Lots of tiny networks aimed at mobile devices: MobileNet, ShuffleNet
Neural Architecture Search can now automate architecture design
Summary: CNN Architectures
- Many popular architectures available in model zoos
- ResNet and SENet currently good defaults to use
- Networks have gotten increasingly deep over time
- Many other aspects of network architectures are also continuously
being investigated and improved
Next time: Recurrent Neural Networks