
Deep Residual Learning

MSRA @ ILSVRC & COCO 2015 competitions

Kaiming He
with Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun
Microsoft Research Asia (MSRA)
MSRA @ ILSVRC & COCO 2015 Competitions
1st places in all five main tracks
ImageNet Classification: "ultra-deep" (quote Yann) 152-layer nets
ImageNet Detection: 16% better than 2nd
ImageNet Localization: 27% better than 2nd
COCO Detection: 11% better than 2nd
COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Revolution of Depth

ImageNet Classification top-5 error (%):
ILSVRC'10  shallow                28.2
ILSVRC'11  shallow                25.8
ILSVRC'12  AlexNet, 8 layers      16.4
ILSVRC'13  8 layers               11.7
ILSVRC'14  VGG, 19 layers          7.3
ILSVRC'14  GoogleNet, 22 layers    6.7
ILSVRC'15  ResNet, 152 layers      3.57


Revolution of Depth
Engines of visual recognition

PASCAL VOC 2007 Object Detection mAP (%):
HOG, DPM              shallow      34
AlexNet (RCNN)        8 layers     58
VGG (RCNN)            16 layers    66
ResNet (Faster RCNN)* 101 layers   86


*w/ other improvements & more data

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012):
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012) vs. VGG, 19 layers (ILSVRC 2014) vs. GoogleNet, 22 layers (ILSVRC 2014)

[Figure: side-by-side architecture diagrams — AlexNet's 8-layer stack; VGG's 19-layer stack of 3x3 convs (64 -> 128 -> 256 -> 512, fc 4096 x 2, fc 1000); GoogleNet's Inception modules (1x1/3x3/5x5 convs plus max pooling, depth-concatenated) with auxiliary softmax classifiers.]
Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012) vs. VGG, 19 layers (ILSVRC 2014) vs. ResNet, 152 layers (ILSVRC 2015)

ResNet-152, in brief (stages of bottleneck blocks):
7x7 conv, 64, /2, pool/2
[1x1 conv, 64 / 3x3 conv, 64 / 1x1 conv, 256] x 3
[1x1 conv, 128 / 3x3 conv, 128 / 1x1 conv, 512] x 8 (first block: /2)
[1x1 conv, 256 / 3x3 conv, 256 / 1x1 conv, 1024] x 36 (first block: /2)
[1x1 conv, 512 / 3x3 conv, 512 / 1x1 conv, 2048] x 3 (first block: /2)
ave pool, fc 1000

(There was an animation here, scrolling through the full 152-layer diagram over several slides.)
Is learning better networks
as simple as stacking more layers?

Simply stacking layers?
CIFAR-10: train error (%) and test error (%) vs. iter. (1e4); the 56-layer net sits above the 20-layer net in both plots.

Plain nets: stacking 3x3 conv layers


56-layer net has higher training error and test error than 20-layer net

Simply stacking layers?
CIFAR-10: error (%) vs. iter. (1e4) for plain-20/32/44/56; the deeper the plain net, the higher the error.
ImageNet-1000: error (%) vs. iter. (1e4) for plain-18/34; the 34-layer net is worse than the 18-layer net.
(solid: test/val, dashed: train)

Overly deep plain nets have higher training error


A general phenomenon, observed in many datasets

a shallower model (18 layers) vs. a deeper counterpart (34 layers): the deeper model is the shallower model plus extra 3x3 conv layers

A deeper model should not have higher training error.

A solution by construction:
- original layers: copied from a learned shallower model
- extra layers: set as identity
=> at least the same training error (see the numeric sketch below)

Optimization difficulties: solvers cannot find the solution when going deeper.

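A minimal numeric sketch of the by-construction argument (illustrative only; the toy two-layer "model" is an assumption, not the paper's setup): appending an identity layer to a trained stack leaves its outputs, and hence its training error, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    """One weight layer followed by ReLU."""
    return np.maximum(0.0, x @ W)

# A "learned" shallower model: two layers with arbitrary weights.
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
x = rng.normal(size=(4, 8))
shallow_out = layer(layer(x, W1), W2)

# Deeper counterpart by construction: copy the original layers and set
# the extra layer's weights to the identity matrix. ReLU passes the
# already non-negative activations through untouched.
deep_out = layer(layer(layer(x, W1), W2), np.eye(8))

# Identical outputs, so identical training error is achievable; yet
# plain solvers fail to find solutions this good when going deeper.
assert np.allclose(shallow_out, deep_out)
```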
Deep Residual Learning

Plain net: H(x) is any desired mapping;
hope the 2 weight layers fit H(x)

x -> weight layer -> relu -> weight layer -> relu -> H(x)
(any two stacked layers)

Deep Residual Learning

Residual net: H(x) is any desired mapping;
instead of hoping the 2 weight layers fit H(x),
hope the 2 weight layers fit F(x),
and let H(x) = F(x) + x

x -> weight layer -> relu -> weight layer -> add x (identity shortcut) -> relu
the weight layers compute F(x); the block outputs H(x) = F(x) + x

Deep Residual Learning

F(x) is a residual mapping w.r.t. identity (same block as above: H(x) = F(x) + x).

If identity were optimal,
it is easy to set the weights to 0.

If the optimal mapping is closer to identity,
it is easier to find the small fluctuations.

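A minimal sketch of this residual block in PyTorch (illustrative; the BasicBlock name, BatchNorm placement, and sizes are assumptions, not code from the paper). Zeroing the weight layers reduces the block to the identity, matching the argument above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual block: output = relu(F(x) + x), with F = conv-bn-relu-conv-bn."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(out + x)  # identity shortcut: H(x) = F(x) + x

# With all weights set to 0, F(x) = 0 and the block is exactly the identity
# (for inputs in ReLU range), illustrating "easy to set the weights to 0".
block = BasicBlock(64).eval()
for p in block.parameters():
    nn.init.zeros_(p)
x = torch.randn(1, 64, 8, 8).clamp(min=0.0)
assert torch.allclose(block(x), x)
```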
Related Works: Residual Representations

VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
Encoding residual vectors; powerful shallower representations.

Product Quantization (IVF-ADC) [Jegou et al 2011]
Quantizing residual vectors; efficient nearest-neighbor search.

Multigrid & hierarchical preconditioning [Briggs et al 2000], [Szeliski 1990, 2006]
Solving residual sub-problems; efficient PDE solvers.

Network Design

plain net vs. ResNet: the same 34-layer stack (7x7 conv, 64, /2; pool, /2; 3x3 conv stages of 64 -> 128 -> 256 -> 512 filters, halving the spatial size between stages; avg pool; fc 1000), with an identity shortcut added around every two layers in the ResNet.

Keep it simple.

Our basic design (VGG-style):
- all 3x3 conv (almost)
- spatial size /2 => # filters x2
- simple design, just deep!

Other remarks:
- no max pooling (almost)
- no hidden fc
- no dropout

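A short sketch of these design rules (hypothetical helper names; stage depths chosen to mimic the 34-layer plain net): almost everything is a 3x3 conv, and each stride-2 stage halves the spatial size while doubling the filter count, keeping per-stage time complexity roughly constant.

```python
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """The (almost) only building block: 3x3 conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def make_plain_net(stage_depths=(6, 8, 12, 6), width: int = 64) -> nn.Sequential:
    """7x7 stem + four 3x3-conv stages + avg pool + fc; no hidden fc, no dropout.

    1 stem conv + (6 + 8 + 12 + 6) stage convs + 1 fc = 34 weight layers.
    """
    layers = [
        nn.Conv2d(3, width, 7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
    ]
    in_ch = width
    for i, depth in enumerate(stage_depths):
        out_ch = width * 2 ** i        # spatial size /2 => # filters x2
        stride = 1 if i == 0 else 2    # first stage keeps the resolution
        layers.append(conv3x3(in_ch, out_ch, stride=stride))
        layers.extend(conv3x3(out_ch, out_ch) for _ in range(depth - 1))
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, 1000)]
    return nn.Sequential(*layers)
```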
Training
All plain/residual nets are trained from scratch

All plain/residual nets use Batch Normalization

Standard hyper-parameters & augmentation (see the sketch below)

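A hedged sketch of that recipe in PyTorch, using the paper's ImageNet settings (SGD with momentum 0.9, weight decay 1e-4, learning rate 0.1 divided by 10 when the error plateaus); the tiny model and synthetic loader are stand-ins so the loop runs end to end.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: swap in a plain/residual net (trained from scratch, with BN)
# and an augmented ImageNet or CIFAR-10 loader.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Epoch-based stand-in for "divide the lr by 10 when the error plateaus".
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(90):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
    scheduler.step()
```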
CIFAR-10 experiments
CIFAR-10 plain nets: error (%) vs. iter. (1e4) for plain-20/32/44/56; deeper plain nets have higher error.
CIFAR-10 ResNets: error (%) vs. iter. (1e4) for ResNet-20/32/44/56/110; deeper ResNets have lower error.
(solid: test, dashed: train)

Deep ResNets can be trained without difficulties


Deeper ResNets have lower training error, and also lower test error

ImageNet experiments
ImageNet plain nets: error (%) vs. iter. (1e4); plain-34 is worse than plain-18.
ImageNet ResNets: error (%) vs. iter. (1e4); ResNet-34 is better than ResNet-18.
(solid: test/val, dashed: train)

Deep ResNets can be trained without difficulties


Deeper ResNets have lower training error, and also lower test error

ImageNet experiments

A practical design for going deeper:

all-3x3 block (64-d):
3x3, 64 -> relu -> 3x3, 64; add identity -> relu

bottleneck block (256-d), similar complexity (used for ResNet-50/101/152):
1x1, 64 -> relu -> 3x3, 64 -> relu -> 1x1, 256; add identity -> relu

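A minimal PyTorch sketch of the bottleneck block (illustrative; names and BN placement follow common implementations, not slide code): the first 1x1 conv reduces 256-d to 64-d, the 3x3 conv operates on the thin maps, and the last 1x1 conv restores 256-d before the shortcut addition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus an identity shortcut."""

    def __init__(self, channels: int = 256, bottleneck: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.restore = nn.Conv2d(bottleneck, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.reduce(x)))  # 256-d -> 64-d
        out = F.relu(self.bn2(self.conv(out)))  # 3x3 on the thin 64-d maps
        out = self.bn3(self.restore(out))       # 64-d -> 256-d
        return F.relu(out + x)                  # H(x) = F(x) + x

x = torch.randn(1, 256, 14, 14)
assert Bottleneck()(x).shape == x.shape  # similar cost to two 3x3, 64 convs
```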
ImageNet experiments

Deeper ResNets have lower error (10-crop testing, top-5 val error, %):
ResNet-34   7.4
ResNet-50   6.7
ResNet-101  6.1
ResNet-152  5.7

(even ResNet-152 has lower time complexity than VGG-16/19)
ImageNet experiments

ImageNet Classification top-5 error (%):
ILSVRC'10  shallow                28.2
ILSVRC'11  shallow                25.8
ILSVRC'12  AlexNet, 8 layers      16.4
ILSVRC'13  8 layers               11.7
ILSVRC'14  VGG, 19 layers          7.3
ILSVRC'14  GoogleNet, 22 layers    6.7
ILSVRC'15  ResNet, 152 layers      3.57


Just classification?
A treasure from ImageNet is on learning features.

Features matter. (quote [Girshick et al. 2014], the R-CNN paper)

task                                  2nd-place winner   MSRA   margin (relative)
ImageNet Localization (top-5 error)   12.0               9.0    27%
ImageNet Detection (mAP@.5)           53.6               62.1   16% (8.5% absolute)
COCO Detection (mAP@.5:.95)           33.5               37.3   11%
COCO Segmentation (mAP@.5:.95)        25.1               28.2   12%

Our results are all based on ResNet-101.

Our features are well transferable.
Object Detection (brief)

Simply Faster R-CNN + ResNet

[Figure: Faster R-CNN pipeline — image -> CNN -> feature map -> Region Proposal Net -> proposals -> RoI pooling -> classifier]

COCO detection results:
baseline     mAP@.5   mAP@.5:.95
VGG-16       41.5     21.5
ResNet-101   48.4     27.2

(ResNet has a 28% relative gain)
Object Detection (brief)
RPN learns proposals by extremely deep nets
We use only 300 proposals (no SS/EB/MCG!)

Add what is just missing in Faster R-CNN


Iterative localization
Context modeling
Multi-scale testing

All are based on CNN features; all are end-to-end (train and/or inference)

All benefit more from deeper features: cumulative gains!

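For orientation only, a hedged sketch of running a present-day Faster R-CNN with a ResNet backbone via recent torchvision (the competition entries used ResNet-101; torchvision ships a ResNet-50-FPN variant). rpn_post_nms_top_n_test=300 mirrors the 300-proposal setting above.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# ResNet-50-FPN stand-in for the slides' ResNet-101 backbone,
# keeping only 300 RPN proposals at test time (no SS/EB/MCG).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT", rpn_post_nms_top_n_test=300)
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (detections,) = model([image])

# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
for box, label, score in zip(detections["boxes"],
                             detections["labels"],
                             detections["scores"]):
    if score > 0.5:
        print(int(label), float(score), box.tolist())
```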
Our results on COCO: too many objects, let's check carefully!
*the original image is from the COCO dataset
Instance Segmentation (brief)

Solely CNN-based (features matter)

Differentiable RoI warping layer (w.r.t. box coordinates)

Multi-task cascades; exact end-to-end training

[Figure: cascade on a shared conv feature map — CONVs -> box instances (RoIs); RoI warping + pooling, FCs -> mask instances (for each RoI); masking, FCs -> categorized instances (for each RoI), e.g. person, person, person, horse]
Conclusions

Deeper is still better
Features matter!
Faster R-CNN is just amazing

MSRA team: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, Jian Sun

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
Jifeng Dai, Kaiming He, & Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015.
