
8/18/2016 CS231n Convolutional Neural Networks for Visual Recognition

CS231n Convolutional Neural Networks for Visual Recognition

Table of Contents:

Architecture Overview
ConvNet Layers
Convolutional Layer
Pooling Layer
Normalization Layer
Fully-Connected Layer
Converting Fully-Connected Layers to Convolutional Layers
ConvNet Architectures
Layer Patterns
Layer Sizing Patterns
Case Studies (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet)
Computational Considerations
Additional References

Convolutional Neural Networks (CNNs / ConvNets)

Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer, and all the tips/tricks we developed for learning regular Neural Networks still apply.

So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

Architecture Overview

http://cs231n.github.io/convolutionalnetworks/

Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the output layer, and in classification settings it represents the class scores.

Regular Neural Nets don't scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful, and the huge number of parameters would quickly lead to overfitting.
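The counts above are just products of the input dimensions; a quick sanity check (the helper name is ours, not from the notes):

```python
def fc_weights_per_neuron(width, height, depth):
    """Number of weights a single fully-connected neuron needs
    to see every value of a width x height x depth input."""
    return width * height * depth

# CIFAR-10 input: 32x32x3
print(fc_weights_per_neuron(32, 32, 3))    # 3072
# A modestly larger image: 200x200x3
print(fc_weights_per_neuron(200, 200, 3))  # 120000
```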

3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:


Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).


A ConvNet is made up of Layers. Every Layer has a simple API: it transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters.

Layers used to build ConvNets

As we described above, a simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.

Example Architecture: Overview. We will go into more detail below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:

INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R, G, B.
CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in a volume such as [32x32x12] if we decided to use 12 filters.
RELU layer will apply an elementwise activation function, such as the max(0, x) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a volume such as [16x16x12].
FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.

In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and others don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.

In summary:


A ConvNet architecture is in the simplest case a list of Layers that transform the image volume into an output volume (e.g. holding the class scores)
There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't)
Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't)

The activations of an example ConvNet architecture. The initial volume stores the raw image pixels (left) and the last volume stores the class scores (right). Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGGNet, which we will discuss later.

We now describe the individual layers and the details of their hyperparameters and their connectivities.

Convolutional Layer

The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.


Overview and intuition without brain stuff. Let's first discuss what the CONV layer computes without brain/neuron analogies. The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.

The brain view. If you're a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter). We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme.

Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently, this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: the connections are local in space (along width and height), but always full along the entire depth of the input volume.

Example 1. For example, suppose that the input volume has size [32x32x3] (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.

Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).



Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input; see discussion of depth columns in the text below. Right: The neurons from the Neural Network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.

Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven't yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding. We discuss these next:

1. First, the depth of the output volume is a hyperparameter: it corresponds to the number of filters we would like to use, each learning to look for something different in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a depth column (some people also prefer the term fibre).
2. Second, we must specify the stride with which we slide the filter. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 (or uncommonly 3 or more, though this is rare in practice) then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
3. As we will soon see, sometimes it will be convenient to pad the input volume with zeros around the border. The size of this zero-padding is a hyperparameter. The nice feature of zero-padding is that it will allow us to control the spatial size of the output volumes (most commonly, as we'll see soon, we will use it to exactly preserve the spatial size of the input volume so the input and output width and height are the same).

We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons fit is given by (W - F + 2P)/S + 1. For example, for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output. Let's also see one more graphical example:
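The formula is easy to check in code; this small helper (the name is ours, not from the notes) also flags hyperparameter settings where the neurons don't fit evenly across the input:

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a conv layer: (W - F + 2P)/S + 1.
    Raises if the filter doesn't tile the input evenly."""
    numerator = W - F + 2 * P
    if numerator % S != 0:
        raise ValueError("invalid hyperparameters: (W - F + 2P) "
                         "must be divisible by S")
    return numerator // S + 1

print(conv_output_size(7, 3, S=1))  # 5
print(conv_output_size(7, 3, S=2))  # 3
```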

Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1 + 1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2 + 1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3.
The neuron weights are in this example [1, 0, -1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below).

Use of zero-padding. In the example above on the left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero-padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that is how many neurons would have fit across the original input. In general, setting zero-padding to be P = (F - 1)/2 when the stride is S = 1 ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures.

Constraints on strides. Note again that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size W = 10, no zero-padding is used (P = 0), and the filter size is F = 3, then it would be impossible to use stride S = 2, since (W - F + 2P)/S + 1 = (10 - 3 + 0)/2 + 1 = 4.5, i.e. not an integer, indicating that the neurons don't fit neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library could throw an exception, or zero pad the rest to make it fit, or crop the input to make it fit, or something. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions work out can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate.

Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F = 11, stride S = 4 and no zero-padding (P = 0). Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K = 96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.

Parameter Sharing. A parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.

It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: that if one feature is useful to compute at some spatial position (x, y), then it should also be useful to compute at a different position (x2, y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.
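The savings are easy to verify; the numbers below are exactly the ones from the example above:

```python
# First Conv layer of the example: 55x55x96 output, 11x11x3 filters.
neurons = 55 * 55 * 96
params_per_neuron = 11 * 11 * 3 + 1        # 363 weights + 1 bias

without_sharing = neurons * params_per_neuron
with_sharing = 96 * (11 * 11 * 3) + 96     # one filter + one bias per depth slice

print(without_sharing)  # 105705600
print(with_sharing)     # 34944
```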

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron's weights with the input volume (hence the name: Convolutional Layer). This is why it is common to refer to the sets of weights as a filter (or a kernel), that is convolved with the input.


Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: if detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume.

Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the inputs are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.

Numpy examples. To make the discussion above more concrete, let's express the same ideas in code and with a specific example. Suppose that the input volume is a numpy array X. Then:

A depth column (or a fibre) at position (x, y) would be the activations X[x, y, :].
A depth slice, or equivalently an activation map at depth d, would be the activations X[:, :, d].

Conv Layer Example. Suppose that the input volume X has shape X.shape: (11, 11, 4). Suppose further that we use no zero padding (P = 0), that the filter size is F = 5, and that the stride is S = 2. The output volume would therefore have spatial size (11 - 5)/2 + 1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it V), would then look as follows (only some of the elements are computed in this example):

V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0

Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays. Notice also that the weight vector W0 is the weight vector of that neuron and b0 is the bias. Here, W0 is assumed to be of shape W0.shape: (5, 5, 4), since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have:

V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1  (example of going along y)
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1  (or along both)

where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used. In the example above, we are for brevity leaving out some of the other operations the Conv Layer would perform to fill the other parts of the output array V. Additionally, recall that these activation maps are often passed elementwise through an activation function such as ReLU, but this is not shown here.
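The hand-written entries above generalize to two loops over the output's spatial positions; this sketch (our own, with random data standing in for the X, W0 and b0 of the example) fills one full activation map:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(11, 11, 4)   # input volume
W0 = np.random.randn(5, 5, 4)    # one filter, spanning the full input depth
b0 = 0.5                         # its bias
S = 2                            # stride

V = np.zeros((4, 4, 1))          # output: (11 - 5)/2 + 1 = 4
for x in range(4):
    for y in range(4):
        patch = X[x*S:x*S+5, y*S:y*S+5, :]
        V[x, y, 0] = np.sum(patch * W0) + b0

# Spot-check against one of the entries written out above:
assert np.isclose(V[2, 0, 0], np.sum(X[4:9, :5, :] * W0) + b0)
```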

Summary. To summarize, the Conv Layer:

Accepts a volume of size W1 x H1 x D1
Requires four hyperparameters:
    the number of filters K,
    their spatial extent F,
    the stride S,
    the amount of zero padding P.
Produces a volume of size W2 x H2 x D2, where:
    W2 = (W1 - F + 2P)/S + 1
    H2 = (H1 - F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
    D2 = K
With parameter sharing, it introduces F*F*D1 weights per filter, for a total of (F*F*D1)*K weights and K biases.
In the output volume, the d-th depth slice (of size W2 x H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.

A common setting of the hyperparameters is F = 3, S = 1, P = 1. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the ConvNet architectures section below.

Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size W1 = 5, H1 = 5, D1 = 3, and the CONV layer parameters are K = 2, F = 3, S = 2, P = 1. That is, we have two filters of size 3x3, and they are applied with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P = 1 is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.


Implementation as Matrix Multiplication. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:

1. The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227 - 11)/4 + 1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
2. The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix W_row of size [96 x 363].
3. The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
4. The result must finally be reshaped back to its proper output dimension [55x55x96].

This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API). Moreover, the same im2col idea can be reused to perform the pooling operation, which we discuss next.
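The four steps above can be sketched directly in numpy (our own minimal version, for square inputs, no padding; real implementations vectorize the patch extraction):

```python
import numpy as np

def conv_forward_im2col(X, W, b, stride):
    """X: (H, W, D) input; W: (K, F, F, D) filters; b: (K,) biases."""
    H, Wi, D = X.shape
    K, F, _, _ = W.shape
    out_h = (H - F) // stride + 1
    out_w = (Wi - F) // stride + 1

    # Step 1 (im2col): stretch each receptive field into a column.
    cols = []
    for x in range(out_h):
        for y in range(out_w):
            patch = X[x*stride:x*stride+F, y*stride:y*stride+F, :]
            cols.append(patch.reshape(-1))
    X_col = np.array(cols).T                   # (F*F*D, out_h*out_w)

    # Step 2: stretch each filter into a row.
    W_row = W.reshape(K, -1)                   # (K, F*F*D)

    # Steps 3-4: one big matrix multiply, then reshape back to a volume.
    res = W_row @ X_col + b[:, None]           # (K, out_h*out_w)
    return res.reshape(K, out_h, out_w).transpose(1, 2, 0)

V = conv_forward_im2col(np.random.randn(11, 11, 4),
                        np.random.randn(2, 5, 5, 4),
                        np.zeros(2), stride=2)
print(V.shape)  # (4, 4, 2)
```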

Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).

1x1 convolution. As an aside, several papers use 1x1 convolutions, as first investigated by Network in Network. Some people are at first confused to see 1x1 convolutions, especially when they come from a signal processing background. Normally signals are 2-dimensional so 1x1 convolutions do not make sense (it's just pointwise scaling). However, in ConvNets this is not the case because one must remember that we operate over 3-dimensional volumes, and that the filters always extend through the full depth of the input volume. For example, if the input is [32x32x3] then doing 1x1 convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels).
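Put differently, a 1x1 convolution with K filters is just a K x D weight matrix applied independently at every spatial position; a quick numpy check (our own toy example):

```python
import numpy as np

X = np.random.randn(32, 32, 3)   # input volume
W = np.random.randn(8, 3)        # eight 1x1x3 filters, one row each

# A 1x1 convolution is a dot product over depth at every (x, y):
out = X @ W.T                    # (32, 32, 8)

# Same thing, written as an explicit per-position dot product:
assert np.allclose(out[5, 7], W @ X[5, 7])
print(out.shape)  # (32, 32, 8)
```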

Dilated convolutions. A recent development (e.g. see the paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation. So far we've only discussed CONV filters that are contiguous. However, it's possible to have filters that have spaces between each cell, called dilation. As an example, in one dimension a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is dilation of 0. For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4]; in other words there is a gap of 1 between the applications. This can be very useful in some settings to use in conjunction with 0-dilated filters because it allows you to merge spatial information across the inputs much more aggressively with fewer layers. For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5). If we use dilated convolutions then this effective receptive field would grow much quicker.
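The one-dimensional example can be written out directly; the dilation convention here matches the text's (dilation 0 means contiguous taps), and the function name is our own:

```python
def dilated_conv1d_at(w, x, pos, dilation=0):
    """Apply a size-3 filter w to input x at position pos,
    with a gap of `dilation` between the filter taps."""
    step = dilation + 1
    return sum(w[i] * x[pos + i * step] for i in range(3))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]

# Dilation 0: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]
print(dilated_conv1d_at(w, x, 0, dilation=0))  # -2.0
# Dilation 1: w[0]*x[0] + w[1]*x[2] + w[2]*x[4]
print(dilated_conv1d_at(w, x, 0, dilation=1))  # -4.0
```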

Pooling Layer

It is common to periodically insert a Pooling layer in between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:

Accepts a volume of size W1 x H1 x D1
Requires two hyperparameters:
    their spatial extent F,
    the stride S.
Produces a volume of size W2 x H2 x D2, where:
    W2 = (W1 - F)/S + 1
    H2 = (H1 - F)/S + 1
    D2 = D1
Introduces zero parameters since it computes a fixed function of the input
Note that it is not common to use zero-padding for Pooling layers

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: a pooling layer with F = 3, S = 2 (also called overlapping pooling), and more commonly F = 2, S = 2. Pooling sizes with larger receptive fields are too destructive.

General pooling. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
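For the common F = 2, S = 2 case, max pooling can be written as a reshape followed by a max reduction over the two block axes (our own sketch; every depth slice is handled at once):

```python
import numpy as np

def max_pool_2x2(X):
    """2x2 max pooling with stride 2 on an (H, W, D) volume.
    Assumes H and W are even."""
    H, W, D = X.shape
    # Group each non-overlapping 2x2 spatial block together...
    blocks = X.reshape(H // 2, 2, W // 2, 2, D)
    # ...and take the max within each block, per depth slice.
    return blocks.max(axis=(1, 3))

X = np.random.randn(224, 224, 64)
print(max_pool_2x2(X).shape)  # (112, 112, 64)
```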



Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (a little 2x2 square).

Backpropagation. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.

Getting rid of pooling. Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of an architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using a larger stride in the CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.

Normalization Layer

Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intention of implementing inhibition schemes observed in the biological brain. However, these layers have since fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky's cuda-convnet library API.

Fully-connected layer


Neurons in a fully-connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the Neural Network section of the notes for more information.

Converting FC layers to CONV layers

It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers:

ForanyCONVlayerthereisanFClayerthatimplementsthesameforwardfunction.The
weightmatrixwouldbealargematrixthatismostlyzeroexceptforatcertainblocks(dueto
localconnectivity)wheretheweightsinmanyoftheblocksareequal(duetoparameter
sharing).
Conversely,anyFClayercanbeconvertedtoaCONVlayer.Forexample,anFClayerwith
K = 4096 K=4096thatislookingatsomeinputvolumeofsize7 7 512 77512canbe
equivalentlyexpressedasaCONVlayerwith
F = 7, P = 0, S = 1, K = 4096 F=7,P=0,S=1,K=4096.Inotherwords,wearesettingthe
filtersizetobeexactlythesizeoftheinputvolume,andhencetheoutputwillsimplybe
1 1 4096 114096sinceonlyasingledepthcolumnfitsacrosstheinputvolume,
givingidenticalresultastheinitialFClayer.
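To make the equivalence concrete, here is a small NumPy sketch (toy sizes and hypothetical variable names, not course code) checking that an FC layer over a whole volume computes the same numbers as a CONV layer whose single filter position covers that volume:

```python
import numpy as np

# Tiny stand-in for the 7x7x512 -> 4096 example: a 3x3x4 volume and K = 5 outputs.
H, W, D, K = 3, 3, 4, 5
rng = np.random.default_rng(1)
x = rng.standard_normal((H, W, D))
w = rng.standard_normal((K, H, W, D))  # each filter is as large as the input (F = H = W)
b = rng.standard_normal(K)

# FC view: flatten everything and do one matrix multiply plus bias.
fc_out = w.reshape(K, -1) @ x.reshape(-1) + b

# CONV view: with F equal to the input size (and P = 0, S = 1) only one
# depth column fits, so the output is 1x1xK.
conv_out = np.array([(w[k] * x).sum() + b[k] for k in range(K)])
```

Because flattening preserves the pairing of each weight with its input activation, the two views produce identical outputs.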

FC->CONV conversion. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we'll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally a last FC layer with 1000 neurons that computes the class scores. We can convert each of these three FC layers to CONV layers as described above:

Replace the first FC layer that looks at the [7x7x512] volume with a CONV layer that uses filter size F = 7, giving output volume [1x1x4096].
Replace the second FC layer with a CONV layer that uses filter size F = 1, giving output volume [1x1x4096].
Replace the last FC layer similarly, with F = 1, giving final output [1x1x1000].


Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix W in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

For example, if a 224x224 image gives a volume of size [7x7x512], i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give an equivalent volume of size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give a final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting an entire 6x6 array of class scores across the 384x384 image.
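The arithmetic above can be checked with the usual output-size formula, here wrapped in a small throwaway helper (hypothetical name, not from the notes):

```python
# CONV/POOL output size along one spatial dimension: (W - F + 2P)/S + 1.
def conv_out_size(w, f, p, s):
    return (w - f + 2 * p) // s + 1

downsample = 2 ** 5                      # five stride-2 pools: a factor of 32
spatial_384 = 384 // downsample          # 12, so the volume is [12x12x512]

# The converted first FC layer (F=7, P=0, S=1) then gives a 6x6 score map:
scores_384 = conv_out_size(spatial_384, f=7, p=0, s=1)
# On the original 224x224 input the same layer gives the usual single output:
scores_224 = conv_out_size(224 // downsample, f=7, p=0, s=1)
```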

Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.

Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.
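The resize-and-average trick from the last sentence might look like this in outline (hypothetical shapes, continuing the 384x384 example):

```python
import numpy as np

# Suppose the converted ConvNet produced a 6x6 map of 1000-way class
# scores over the larger image; random numbers stand in for real scores.
rng = np.random.default_rng(2)
score_map = rng.standard_normal((6, 6, 1000))

# Average the scores across all 36 spatial positions to get one
# score vector for the whole image, then pick the top class.
scores = score_map.mean(axis=(0, 1))
predicted_class = int(scores.argmax())
```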

Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: first over the original image, and second over the image shifted spatially by 16 pixels along both width and height.

An IPython Notebook on Net Surgery shows how to perform the conversion in practice, in code (using Caffe).

ConvNet Architectures

We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies an elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.

Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
INPUT -> CONV -> RELU -> FC
INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.
INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
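As a toy illustration (a hypothetical helper, not course code), the pattern can be expanded programmatically, which makes the role of N, M and K explicit:

```python
def convnet_pattern(n, m, k, pool=True):
    """Expand INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
    into a flat list of layer names."""
    layers = ["INPUT"]
    for _ in range(m):
        layers += ["CONV", "RELU"] * n
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * k
    layers.append("FC")
    return layers

# N = M = K = 0 degenerates to the linear classifier INPUT -> FC:
linear = convnet_pattern(0, 0, 0)
# N=1, M=2, K=1 gives INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC:
two_stage = convnet_pattern(1, 2, 1)
```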

Prefer a stack of small filter CONV to one large receptive field CONV layer. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have C channels, then it can be seen that the single 7x7 CONV layer would contain C x (7 x 7 x C) = 49C^2 parameters, while the three 3x3 CONV layers would only contain 3 x (C x (3 x 3 x C)) = 27C^2 parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
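The 49C^2 vs. 27C^2 comparison is easy to verify numerically, plugging in C = 64 as an example:

```python
# Parameter counts from the paragraph above, for C channels in and out.
C = 64
single_7x7 = C * (7 * 7 * C)       # one 7x7 CONV layer: 49 C^2
three_3x3 = 3 * (C * (3 * 3 * C))  # three stacked 3x3 CONV layers: 27 C^2
savings = single_7x7 - three_3x3   # the small-filter stack uses fewer parameters
```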


Recent departures. It should be noted that the conventional paradigm of a linear list of layers has recently been challenged, in Google's Inception architectures and also in current (state of the art) Residual Networks from Microsoft Research Asia. Both of these (see details below in the case studies section) feature more intricate and different connectivity structures.

Layer Sizing Patterns

Until now we've omitted mention of the common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:

The input layer (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), 224 (e.g. common ImageNet ConvNets), 384, and 512.

The conv layers should be using small filters (e.g. 3x3 or at most 5x5), using a stride of S = 1, and crucially, padding the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when F = 3, then using P = 1 will retain the original size of the input. When F = 5, use P = 2. For a general F, it can be seen that P = (F - 1)/2 preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.
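The rule P = (F - 1)/2 can be sanity-checked against the output-size formula with a throwaway snippet:

```python
# With stride S = 1 and P = (F - 1)/2, the output size
# (W - F + 2P)/S + 1 collapses back to W for any odd F.
def preserves_size(w, f):
    p = (f - 1) // 2
    return (w - f + 2 * p) // 1 + 1 == w

# Check a few common input sizes and filter sizes.
checks = [preserves_size(w, f) for w in (32, 224) for f in (3, 5, 7)]
```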

The pool layers are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. F = 2), and with a stride of 2 (i.e. S = 2). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2. It is very uncommon to see receptive field sizes for max pooling that are larger than 3, because the pooling is then too lossy and aggressive, which usually leads to worse performance.
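A tiny NumPy sketch of 2x2 max pooling with stride 2 makes the 75% reduction concrete:

```python
import numpy as np

# One 4x4 activation map, pooled with F = 2, S = 2.
x = np.arange(16.0).reshape(4, 4)

# Group the map into non-overlapping 2x2 blocks and take the max of each;
# axes 1 and 3 index positions inside each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))

kept_fraction = pooled.size / x.size  # 0.25: exactly 75% of activations discarded
```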

Reducing sizing headaches. The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of downsampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired.

Why use stride of 1 in CONV? Smaller strides work better in practice. Additionally, as already mentioned, stride 1 allows us to leave all spatial downsampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.

Why use padding? In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were not to zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.

Compromising based on memory constraints. In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filter sizes of 11x11 and a stride of 4.
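The "10 million activations / 72MB" figure is a back-of-the-envelope estimate that can be rechecked in a couple of lines:

```python
# Three [224x224x64] activation volumes, stored as 4-byte floats,
# with an equal amount again for the gradients (the "*2").
activations = 3 * 224 * 224 * 64        # ~9.6 million values
mb = activations * 4 * 2 / (1024 ** 2)  # ~73 MB per image
```

The exact value comes out slightly above 72MB; the figure in the text is rounded.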

Case studies

There are several architectures in the field of Convolutional Networks that have a name. The most common are:

LeNet. The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990s. Of these, the best known is the LeNet architecture that was used to read zip codes, digits, etc.
AlexNet. The first work that popularized Convolutional Networks in Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The network had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer always immediately followed by a POOL layer).
ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZF Net (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.
GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.
VGGNet. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Their pretrained model is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully-connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.
ResNet. Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation (video, slides), and some recent experiments that reproduce these networks in Torch. ResNets are currently by far state of the art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 10, 2016). In particular, also see more recent developments that tweak the original architecture from Kaiming He et al. Identity Mappings in Deep Residual Networks (published March 2016).

VGGNet in detail. Let's break down the VGGNet in more detail as a case study. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:

INPUT: [224x224x3] memory: 224*224*3=150K weights: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K weights: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K weights: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K weights: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K weights: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K weights: 0
FC: [1x1x4096] memory: 4096 weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 weights: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters

As is common with Convolutional Networks, notice that most of the memory (and also compute time) is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M.
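The totals in the table can be re-derived in a few lines (a sketch; biases are omitted, as in the table):

```python
# (in_channels, out_channels) for the 13 VGGNet CONV layers listed above.
conv_cfg = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]
conv_params = sum(3 * 3 * cin * cout for cin, cout in conv_cfg)

# The three FC layers at the top of the network.
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

total = conv_params + fc_params   # ~138M parameters
fc_fraction = fc_params / total   # most parameters live in the FC layers
```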

Computational Considerations

The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:

From the intermediate volume sizes: These are the raw number of activations at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
From the parameter sizes: These are the numbers that hold the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so.
Every ConvNet implementation has to maintain miscellaneous memory, such as the image data batches, perhaps their augmented versions, etc.

Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point number is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations.
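The conversion described above, as a tiny helper (the function name is made up for illustration):

```python
# Number of stored values -> gigabytes, assuming 4-byte floats by default.
def values_to_gb(n_values, bytes_per_value=4):
    return n_values * bytes_per_value / 1024 / 1024 / 1024

# e.g. the ~24M values of a single VGGNet forward pass, for a batch of 128:
batch_gb = values_to_gb(24e6 * 128)  # ~11.4 GB, too big for most GPUs
half_gb = values_to_gb(24e6 * 64)    # halving the batch size halves the memory
```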

Additional Resources

Additional resources related to implementation:

Soumith benchmarks for CONV performance
ConvNetJS CIFAR-10 demo allows you to play with ConvNet architectures and see the results and computations in real time, in the browser.
Caffe, one of the popular ConvNet libraries.
State of the art ResNets in Torch7

cs231n
karpathy@cs.stanford.edu