CHAPTER 5

Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You're dumbfounded, and tell your boss: "The customer is crazy!" Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve subproblems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of subcircuits doing operations like adding two numbers. The subcircuits for adding two numbers will, in turn, be built up out of sub-subcircuits for adding two bits. Very roughly speaking our circuit will look like:


That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the subtasks down into smaller units than I've described. But you get the general idea.

So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.

*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
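
To make the pairing idea concrete, here is a small Python sketch (my own illustration, not part of the book's code) that computes the parity of a list of bits the way the deep circuit does: XOR adjacent pairs, then pairs of pairs, and so on. For n bits it uses roughly n two-input gates arranged in about log2(n) layers, rather than the exponentially many gates a two-layer circuit would need.

def parity(bits):
    """Compute the parity of a list of 0/1 bits by XOR-ing adjacent pairs,
    layer by layer, just as a deep circuit of small gates would."""
    layer = list(bits)
    while len(layer) > 1:
        paired = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:     # an odd bit out is carried up unchanged
            paired.append(layer[-1])
        layer = paired
    return layer[0]

print(parity([1, 0, 1, 1, 0, 1]))   # prints 0, since there are four 1s
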

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.

*For certain problems and network architectures this is proved in On the number of response regions of deep feedforward networks with piecewise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm, stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.

The vanishing gradient problem

So, what goes wrong when we try to train a deep network?

To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

*I introduced the MNIST problem and data here and here.

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.

Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', …, '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.

We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.
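
If you'd like to rerun this little depth experiment in one go, here is a convenience loop (my own, not from the book); it simply repeats the training calls shown above for each of the four architectures, using the same hyper-parameters.

import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

for sizes in [[784, 30, 10],
              [784, 30, 30, 10],
              [784, 30, 30, 30, 10],
              [784, 30, 30, 30, 30, 10]]:
    print("Training a network with layer sizes %s" % sizes)
    net = network2.Network(sizes)
    # 30 epochs, mini-batches of 10, learning rate 0.1, lambda = 5.0,
    # monitoring accuracy on the validation data, exactly as in the text.
    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data,
            monitor_evaluation_accuracy=True)
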
This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

*See this later problem to understand how to build a hidden layer that does nothing.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.

To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.

The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ^1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ^2‖ measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.

With these definitions, and in the same configuration as was plotted above, we find ‖δ^1‖ = 0.07 and ‖δ^2‖ = 0.31. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.

What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
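
If you want to compute these per-layer speeds yourself, here is a rough sketch of the kind of calculation generate_gradient.py performs (this is my approximation, not the program itself). It assumes network2.Network exposes the backprop method used internally by SGD, returning per-layer bias and weight gradients for a single training example; that's true of the network2.py I'm describing, but check your copy.

import numpy as np
import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 30, 10])   # freshly initialized, untrained

# Average the bias gradients over a sample of training examples.
sample = training_data[:1000]
nabla_b = [np.zeros(b.shape) for b in net.biases]
for x, y in sample:
    delta_nabla_b, delta_nabla_w = net.backprop(x, y)
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_b = [nb / len(sample) for nb in nabla_b]

# net.biases covers the hidden layers and then the output layer; drop the
# last entry so we compare hidden layers only, as in the text.
for j, nb in enumerate(nabla_b[:-1]):
    print("Speed of learning, hidden layer %d: %f" % (j + 1, np.linalg.norm(nb)))
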
We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:

To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different from the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.
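
In case you want to set up a similar run yourself, the sketch below shows the basic recipe with the book's code: restrict to 1,000 training images and make the mini-batch as large as that whole set, which turns stochastic gradient descent into plain batch gradient descent. The learning rate and other details aren't spelled out in the text, so treat them as my guesses; recording the per-layer speeds at each epoch needs a calculation like the one sketched earlier (or generate_gradient.py).

import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
small_training_data = training_data[:1000]   # just 1,000 training images

net = network2.Network([784, 30, 30, 10])
# Mini-batch size equal to the data set size = full-batch gradient descent.
# The learning rate 0.1 is a guess; the text doesn't say what was used here.
net.SGD(small_training_data, 500, len(small_training_data), 0.1)
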
In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:

Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.

*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.

One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?

Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.

What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:

Here, w_1, w_2, … are the weights, b_1, b_2, … are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j-1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.

I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4.

The structure in the expression is as follows: there is a σ′(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

  ∂C/∂b_1 ≈ ΔC / Δb_1.   (114)

This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.

To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

  Δa_1 ≈ (∂σ(w_1 a_0 + b_1)/∂b_1) Δb_1   (115)
       = σ′(z_1) Δb_1.   (116)

That σ′(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

  Δz_2 ≈ (∂z_2/∂a_1) Δa_1   (117)
       = w_2 Δa_1.   (118)

Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

  Δz_2 ≈ σ′(z_1) w_2 Δb_1.   (119)

Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.

We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ′(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

  ΔC ≈ σ′(z_1) w_2 σ′(z_2) … σ′(z_4) (∂C/∂a_4) Δb_1.   (120)

Dividing by Δb_1 we do indeed get the desired expression for the gradient:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) … σ′(z_4) ∂C/∂a_4.   (121)

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4.   (122)

Excepting the very last term, this expression is a product of terms of the form w_j σ′(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ′:

[Plot: the derivative σ′ of the sigmoid function, which peaks at σ′(0) = 0.25 and falls off toward zero on either side.]

The derivative reaches a maximum at σ′(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ′(z_j) will usually satisfy |w_j σ′(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
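
To get a feel for the numbers, here is a little numerical sketch (mine, not part of the book's code). It draws the four weights and biases of the single-neuron chain from the standard Gaussian initialization, then multiplies together the σ′(z_j) and w_j factors appearing in the expression for ∂C/∂b_1, leaving aside the final ∂C/∂a_4 factor. The product typically comes out tiny, on the order of a thousandth or smaller.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

np.random.seed(0)
w = np.random.randn(4)   # weights w_1..w_4, standard Gaussian initialization
b = np.random.randn(4)   # biases b_1..b_4

a = np.random.rand()     # input activation a_0
product = 1.0
for j in range(4):
    z = w[j] * a + b[j]
    # sigma'(z_1) enters on its own; each later neuron contributes w_j * sigma'(z_j).
    product *= sigma_prime(z) if j == 0 else w[j] * sigma_prime(z)
    a = sigma(z)

print("Product of the sigma' and weight factors: %g" % product)
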
To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
  ∂C/∂b_3 = σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4

The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ′(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ′(z_j) in the product will no longer satisfy |w_j σ′(z_j)| < 1/4. Indeed, if the terms get large enough, greater than 1, then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.

The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ′(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ′(z_j) = 1/4). So, for instance, we want z_1 = w_1 a_0 + b_1 = 0. We can achieve this by setting b_1 = −100 a_0. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ′(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
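
As a quick check on the arithmetic (again my sketch, not the book's), here is the resulting product of factors in the expression for ∂C/∂b_1, leaving aside the final ∂C/∂a_4 factor:

first_factor = 0.25          # sigma'(z_1), with z_1 = 0
later_factor = 100 * 0.25    # each w_j * sigma'(z_j) = 25, for j = 2, 3, 4
print(first_factor * later_factor ** 3)   # 3906.25: the gradient has exploded
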

The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ′(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?

The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ′(z)|. To avoid the vanishing gradient problem we need |w σ′(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ′(z) term also depends on w: σ′(z) = σ′(wa + b), where a is the input activation. So when we make w large, we need to be careful that we're not simultaneously making σ′(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ′ you can see that this puts us off in the "wings" of the σ′ function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.

Problems

• Consider the product |w σ′(wa + b)|. Suppose |w σ′(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

  (2/|w|) ln( |w| (1 + √(1 − 4/|w|)) / 2 − 1 ).   (123)

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. (A quick numerical check is sketched just after these problems.) And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

• Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.
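
For part (3) of the first problem, a numerical check can be done along the following lines (a sketch of mine; it just evaluates the bound from equation (123) on a grid of |w| values):

import numpy as np

def width_bound(w):
    # The bound from equation (123) on the width of the interval of input
    # activations a for which |w * sigma'(w*a + b)| >= 1 can hold.
    return (2.0 / w) * np.log(w * (1.0 + np.sqrt(1.0 - 4.0 / w)) / 2.0 - 1.0)

ws = np.linspace(4.01, 20.0, 100000)
bounds = width_bound(ws)
best = np.argmax(bounds)
print("Maximum width %.3f at |w| = %.2f" % (bounds[best], ws[best]))
# Prints a maximum of about 0.45, attained at |w| close to 6.9.
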

Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

  δ^l = Σ′(z^l) (w^{l+1})^T Σ′(z^{l+1}) (w^{l+2})^T … Σ′(z^L) ∇_a C.   (124)

Here, Σ′(z^l) is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ′(z^j). What's more, the matrices Σ′(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ′(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
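
To see the same shrinking effect with full layers rather than single neurons, here is a short numpy sketch (mine, not the book's). It initializes a [784, 30, 30, 30, 30, 10] sigmoid network with standard Gaussian weights, then applies the backpropagation recurrence behind equation (124) to a dummy cost gradient of all ones, and prints the length of δ^l for each layer. Only the relative sizes matter, and with this initialization they typically fall off sharply as we move toward earlier layers.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

np.random.seed(1)
sizes = [784, 30, 30, 30, 30, 10]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]

# Forward pass for a random input, recording the weighted inputs z^l.
a = np.random.rand(sizes[0], 1)
zs = []
for w, b in zip(weights, biases):
    z = np.dot(w, a) + b
    zs.append(z)
    a = sigma(z)

# Backward pass. We use a dummy cost gradient of all ones, since only the
# relative sizes of the delta vectors matter here.
delta = np.ones((sizes[-1], 1)) * sigma_prime(zs[-1])
deltas = [delta]
for l in range(len(weights) - 1, 0, -1):
    delta = np.dot(weights[l].transpose(), delta) * sigma_prime(zs[l - 1])
    deltas.insert(0, delta)

for l, d in enumerate(deltas, start=1):
    print("||delta|| for layer %d (first hidden layer = 1): %g" % (l, np.linalg.norm(d)))
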

Other obstacles to deep learning

In this chapter we've focused on vanishing gradients and, more generally, unstable gradients as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).

These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research.

This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

Last update: Fri Jul 10 08:53:05 2015.

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
