
Deep Residual Learning

MSRA @ ILSVRC & COCO 2015 competitions

Kaiming He
with Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun
Microsoft Research Asia (MSRA)
MSRA @ ILSVRC & COCO 2015 Competitions
1st places in all five main tracks
ImageNet Classification: "ultra-deep" (quote Yann) 152-layer nets
ImageNet Detection: 16% better than 2nd
ImageNet Localization: 27% better than 2nd
COCO Detection: 11% better than 2nd
COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Revolution of Depth

ImageNet Classification top-5 error (%):
ILSVRC'10  shallow                28.2
ILSVRC'11  shallow                25.8
ILSVRC'12  AlexNet, 8 layers      16.4
ILSVRC'13  8 layers               11.7
ILSVRC'14  VGG, 19 layers          7.3
ILSVRC'14  GoogleNet, 22 layers    6.7
ILSVRC'15  ResNet, 152 layers      3.57


Revolution of Depth
Engines of visual recognition

PASCAL VOC 2007 Object Detection mAP (%):
HOG, DPM              shallow      34
AlexNet (RCNN)        8 layers     58
VGG (RCNN)            16 layers    66
ResNet (Faster RCNN)* 101 layers   86


*w/ other improvements & more data

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012):
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012) vs. VGG, 19 layers (ILSVRC 2014) vs. GoogleNet, 22 layers (ILSVRC 2014)

[Figure: side-by-side architecture diagrams — AlexNet's 8-layer stack; VGG's 19-layer stack of 3x3 convs (64 -> 128 -> 256 -> 512, fc 4096 x 2, fc 1000); GoogleNet's Inception modules (1x1/3x3/5x5 convs plus max pooling, depth-concatenated) with auxiliary softmax classifiers.]
Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012) vs. VGG, 19 layers (ILSVRC 2014) vs. ResNet, 152 layers (ILSVRC 2015)

ResNet-152, in brief (stages of bottleneck blocks):
7x7 conv, 64, /2, pool/2
[1x1 conv, 64 / 3x3 conv, 64 / 1x1 conv, 256] x 3
[1x1 conv, 128 / 3x3 conv, 128 / 1x1 conv, 512] x 8 (first block: /2)
[1x1 conv, 256 / 3x3 conv, 256 / 1x1 conv, 1024] x 36 (first block: /2)
[1x1 conv, 512 / 3x3 conv, 512 / 1x1 conv, 2048] x 3 (first block: /2)
ave pool, fc 1000

(There was an animation here, scrolling through the full 152-layer diagram over several slides.)
Is learning better networks
as simple as stacking more layers?

Simply stacking layers?
CIFAR-10: train error (%) and test error (%) vs. iter. (1e4); the 56-layer net sits above the 20-layer net in both plots.

Plain nets: stacking 3x3 conv layers


56-layer net has higher training error and test error than 20-layer net

Simply stacking layers?
CIFAR-10: error (%) vs. iter. (1e4) for plain-20/32/44/56; the deeper the plain net, the higher the error.
ImageNet-1000: error (%) vs. iter. (1e4) for plain-18/34; the 34-layer net is worse than the 18-layer net.
(solid: test/val, dashed: train)

Overly deep plain nets have higher training error


A general phenomenon, observed in many datasets

a shallower model (18 layers) vs. a deeper counterpart (34 layers): the deeper model is the shallower model plus extra 3x3 conv layers

A deeper model should not have higher training error.

A solution by construction:
- original layers: copied from a learned shallower model
- extra layers: set as identity
=> at least the same training error (see the numeric sketch below)

Optimization difficulties: solvers cannot find the solution when going deeper.

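A minimal numeric sketch of the by-construction argument (illustrative only; the toy two-layer "model" is an assumption, not the paper's setup): appending an identity layer to a trained stack leaves its outputs, and hence its training error, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    """One weight layer followed by ReLU."""
    return np.maximum(0.0, x @ W)

# A "learned" shallower model: two layers with arbitrary weights.
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
x = rng.normal(size=(4, 8))
shallow_out = layer(layer(x, W1), W2)

# Deeper counterpart by construction: copy the original layers and set
# the extra layer's weights to the identity matrix. ReLU passes the
# already non-negative activations through untouched.
deep_out = layer(layer(layer(x, W1), W2), np.eye(8))

# Identical outputs, so identical training error is achievable; yet
# plain solvers fail to find solutions this good when going deeper.
assert np.allclose(shallow_out, deep_out)
```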
Deep Residual Learning

Plain net: H(x) is any desired mapping;
hope the 2 weight layers fit H(x)

x -> weight layer -> relu -> weight layer -> relu -> H(x)
(any two stacked layers)

Deep Residual Learning

Residual net: H(x) is any desired mapping;
instead of hoping the 2 weight layers fit H(x),
hope the 2 weight layers fit F(x),
and let H(x) = F(x) + x

x -> weight layer -> relu -> weight layer -> add x (identity shortcut) -> relu
the weight layers compute F(x); the block outputs H(x) = F(x) + x

Deep Residual Learning

F(x) is a residual mapping w.r.t. identity (same block as above: H(x) = F(x) + x).

If identity were optimal,
it is easy to set the weights to 0.

If the optimal mapping is closer to identity,
it is easier to find the small fluctuations.

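A minimal sketch of this residual block in PyTorch (illustrative; the BasicBlock name, BatchNorm placement, and sizes are assumptions, not code from the paper). Zeroing the weight layers reduces the block to the identity, matching the argument above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual block: output = relu(F(x) + x), with F = conv-bn-relu-conv-bn."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(out + x)  # identity shortcut: H(x) = F(x) + x

# With all weights set to 0, F(x) = 0 and the block is exactly the identity
# (for inputs in ReLU range), illustrating "easy to set the weights to 0".
block = BasicBlock(64).eval()
for p in block.parameters():
    nn.init.zeros_(p)
x = torch.randn(1, 64, 8, 8).clamp(min=0.0)
assert torch.allclose(block(x), x)
```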
Related Works: Residual Representations

VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
Encoding residual vectors; powerful shallower representations.

Product Quantization (IVF-ADC) [Jegou et al 2011]
Quantizing residual vectors; efficient nearest-neighbor search.

Multigrid & hierarchical preconditioning [Briggs et al 2000], [Szeliski 1990, 2006]
Solving residual sub-problems; efficient PDE solvers.

Network Design

plain net vs. ResNet: the same 34-layer stack (7x7 conv, 64, /2; pool, /2; 3x3 conv stages of 64 -> 128 -> 256 -> 512 filters, halving the spatial size between stages; avg pool; fc 1000), with an identity shortcut added around every two layers in the ResNet.

Keep it simple.

Our basic design (VGG-style):
- all 3x3 conv (almost)
- spatial size /2 => # filters x2
- simple design, just deep!

Other remarks:
- no max pooling (almost)
- no hidden fc
- no dropout

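A short sketch of these design rules (hypothetical helper names; stage depths chosen to mimic the 34-layer plain net): almost everything is a 3x3 conv, and each stride-2 stage halves the spatial size while doubling the filter count, keeping per-stage time complexity roughly constant.

```python
import torch.nn as nn

def conv3x3(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """The (almost) only building block: 3x3 conv -> BN -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def make_plain_net(stage_depths=(6, 8, 12, 6), width: int = 64) -> nn.Sequential:
    """7x7 stem + four 3x3-conv stages + avg pool + fc; no hidden fc, no dropout.

    1 stem conv + (6 + 8 + 12 + 6) stage convs + 1 fc = 34 weight layers.
    """
    layers = [
        nn.Conv2d(3, width, 7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(width),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
    ]
    in_ch = width
    for i, depth in enumerate(stage_depths):
        out_ch = width * 2 ** i        # spatial size /2 => # filters x2
        stride = 1 if i == 0 else 2    # first stage keeps the resolution
        layers.append(conv3x3(in_ch, out_ch, stride=stride))
        layers.extend(conv3x3(out_ch, out_ch) for _ in range(depth - 1))
        in_ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, 1000)]
    return nn.Sequential(*layers)
```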
Training
All plain/residual nets are trained from scratch

All plain/residual nets use Batch Normalization

Standard hyper-parameters & augmentation (see the sketch below)

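A hedged sketch of that recipe in PyTorch, using the paper's ImageNet settings (SGD with momentum 0.9, weight decay 1e-4, learning rate 0.1 divided by 10 when the error plateaus); the tiny model and synthetic loader are stand-ins so the loop runs end to end.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: swap in a plain/residual net (trained from scratch, with BN)
# and an augmented ImageNet or CIFAR-10 loader.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
train_loader = DataLoader(data, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Epoch-based stand-in for "divide the lr by 10 when the error plateaus".
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(90):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
    scheduler.step()
```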
CIFAR-10 experiments
CIFAR-10 plain nets: error (%) vs. iter. (1e4) for plain-20/32/44/56; deeper plain nets have higher error.
CIFAR-10 ResNets: error (%) vs. iter. (1e4) for ResNet-20/32/44/56/110; deeper ResNets have lower error.
(solid: test, dashed: train)

Deep ResNets can be trained without difficulties


Deeper ResNets have lower training error, and also lower test error

ImageNet experiments
ImageNet plain nets: error (%) vs. iter. (1e4); plain-34 is worse than plain-18.
ImageNet ResNets: error (%) vs. iter. (1e4); ResNet-34 is better than ResNet-18.
(solid: test/val, dashed: train)

Deep ResNets can be trained without difficulties


Deeper ResNets have lower training error, and also lower test error

ImageNet experiments

A practical design for going deeper:

all-3x3 block (64-d):
3x3, 64 -> relu -> 3x3, 64; add identity -> relu

bottleneck block (256-d), similar complexity (used for ResNet-50/101/152):
1x1, 64 -> relu -> 3x3, 64 -> relu -> 1x1, 256; add identity -> relu

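A minimal PyTorch sketch of the bottleneck block (illustrative; names and BN placement follow common implementations, not slide code): the first 1x1 conv reduces 256-d to 64-d, the 3x3 conv operates on the thin maps, and the last 1x1 conv restores 256-d before the shortcut addition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, plus an identity shortcut."""

    def __init__(self, channels: int = 256, bottleneck: int = 64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.restore = nn.Conv2d(bottleneck, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.reduce(x)))  # 256-d -> 64-d
        out = F.relu(self.bn2(self.conv(out)))  # 3x3 on the thin 64-d maps
        out = self.bn3(self.restore(out))       # 64-d -> 256-d
        return F.relu(out + x)                  # H(x) = F(x) + x

x = torch.randn(1, 256, 14, 14)
assert Bottleneck()(x).shape == x.shape  # similar cost to two 3x3, 64 convs
```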
ImageNet experiments

Deeper ResNets have lower error (10-crop testing, top-5 val error, %):
ResNet-34   7.4
ResNet-50   6.7
ResNet-101  6.1
ResNet-152  5.7

(even ResNet-152 has lower time complexity than VGG-16/19)
ImageNet experiments

ImageNet Classification top-5 error (%):
ILSVRC'10  shallow                28.2
ILSVRC'11  shallow                25.8
ILSVRC'12  AlexNet, 8 layers      16.4
ILSVRC'13  8 layers               11.7
ILSVRC'14  VGG, 19 layers          7.3
ILSVRC'14  GoogleNet, 22 layers    6.7
ILSVRC'15  ResNet, 152 layers      3.57


Just classification?
A treasure from ImageNet is on learning features.

Features matter. (quote [Girshick et al. 2014], the R-CNN paper)

task                                  2nd-place winner   MSRA   margin (relative)
ImageNet Localization (top-5 error)   12.0               9.0    27%
ImageNet Detection (mAP@.5)           53.6               62.1   16% (8.5% absolute)
COCO Detection (mAP@.5:.95)           33.5               37.3   11%
COCO Segmentation (mAP@.5:.95)        25.1               28.2   12%

Our results are all based on ResNet-101.

Our features are well transferable.
Object Detection (brief)

Simply Faster R-CNN + ResNet

[Figure: Faster R-CNN pipeline — image -> CNN -> feature map -> Region Proposal Net -> proposals -> RoI pooling -> classifier]

COCO detection results:
baseline     mAP@.5   mAP@.5:.95
VGG-16       41.5     21.5
ResNet-101   48.4     27.2

(ResNet has a 28% relative gain)
Object Detection (brief)
RPN learns proposals by extremely deep nets
We use only 300 proposals (no SS/EB/MCG!)

Add what is just missing in Faster R-CNN


Iterative localization
Context modeling
Multi-scale testing

All are based on CNN features; all are end-to-end (train and/or inference)

All benefit more from deeper features: cumulative gains!

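For orientation only, a hedged sketch of running a present-day Faster R-CNN with a ResNet backbone via recent torchvision (the competition entries used ResNet-101; torchvision ships a ResNet-50-FPN variant). rpn_post_nms_top_n_test=300 mirrors the 300-proposal setting above.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# ResNet-50-FPN stand-in for the slides' ResNet-101 backbone,
# keeping only 300 RPN proposals at test time (no SS/EB/MCG).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT", rpn_post_nms_top_n_test=300)
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (detections,) = model([image])

# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
for box, label, score in zip(detections["boxes"],
                             detections["labels"],
                             detections["scores"]):
    if score > 0.5:
        print(int(label), float(score), box.tolist())
```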
Our results on COCO: too many objects, let's check carefully!
*the original image is from the COCO dataset
Instance Segmentation (brief)

Solely CNN-based (features matter)

Differentiable RoI warping layer (w.r.t. box coordinates)

Multi-task cascades; exact end-to-end training

[Figure: cascade on a shared conv feature map — CONVs -> box instances (RoIs); RoI warping + pooling, FCs -> mask instances (for each RoI); masking, FCs -> categorized instances (for each RoI), e.g. person, person, person, horse]
Conclusions

Deeper is still better
Features matter!
Faster R-CNN is just amazing

MSRA team: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, Jian Sun

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
Jifeng Dai, Kaiming He, & Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015.
