CHAPTER 5
Why are deep neural networks hard to train?
Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:
You're dumbfounded, and tell your boss: "The customer is crazy!" Your boss replies: "I think they're crazy, too. But what the customer wants, they get."
In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.
But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve subproblems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.
For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of subcircuits doing operations like adding two numbers. The subcircuits for adding two numbers will, in turn, be built up out of sub-subcircuits for adding two bits. Very roughly speaking our circuit will look like:
That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the subtasks down into smaller units than I've described. But you get the general idea.
So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.

*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
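To make the pairwise construction concrete, here's a small Python sketch (my own illustration, not something from the book's code repository). It computes the parity of a list of bits layer by layer, XOR-ing adjacent pairs, then pairs of pairs, and so on, so the number of layers grows only logarithmically with the number of bits:

# A sketch of the "deep" parity circuit: XOR adjacent pairs of bits,
# then pairs of pairs, and so on.  With n input bits the circuit is
# only about log2(n) layers deep.
def parity_deep(bits):
    layer = list(bits)
    depth = 0
    while len(layer) > 1:
        pairs = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:      # an odd bit out is carried forward unchanged
            pairs.append(layer[-1])
        layer = pairs
        depth += 1
    return layer[0], depth

parity, depth = parity_deep([1, 0, 1, 1, 0, 1, 0, 1])
print("parity %d computed in %d layers" % (parity, depth))   # parity 1, in 3 layers

By contrast, a two-layer circuit built from the special AND/NAND gates described above essentially has to enumerate the exponentially many input patterns with odd parity.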
Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:
Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.

*For certain problems and network architectures this is proved in On the number of response regions of deep feed forward networks with piece-wise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm, stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.
That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.
The vanishing gradient problem

So, what goes wrong when we try to train a deep network?
To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

*I introduced the MNIST problem and data here and here.
If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.
Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])
This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', ..., '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.

We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.
Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.
This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

*See this later problem to understand how to build a hidden layer that does nothing.
So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.
To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.
To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.
The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?
To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ^1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ^2‖ measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.

With these definitions, and in the same configuration as was plotted above, we find ‖δ^1‖ = 0.07 and ‖δ^2‖ = 0.31. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.
What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
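If you'd like to see where numbers like these come from without reading generate_gradient.py, here's a small self-contained NumPy sketch, my own rough reconstruction rather than the book's program: it initializes a sigmoid network with Gaussian weights, backpropagates the gradient of a quadratic cost for a batch of random inputs and targets, and prints ‖δ^l‖ for each layer. The precise values will differ from those quoted above (the inputs here are random rather than MNIST images), but the pattern of earlier layers having smaller gradient norms should typically still show up:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(sizes, n_samples=1000):
    # Return the length of dC/db^l for each non-input layer of a freshly
    # initialized sigmoid network, using a quadratic cost and random data.
    rng = np.random.RandomState(0)
    weights = [rng.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.randn(m, 1) for m in sizes[1:]]
    x = rng.rand(sizes[0], n_samples)                                  # stand-in "images"
    y = np.eye(sizes[-1])[:, rng.randint(sizes[-1], size=n_samples)]   # one-hot targets

    # forward pass
    activations, zs = [x], []
    a = x
    for w, b in zip(weights, biases):
        z = w.dot(a) + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # backward pass for the quadratic cost
    delta = (activations[-1] - y) * a * (1 - a)
    deltas = [delta]
    for l in range(2, len(sizes)):
        sp = sigmoid(zs[-l]) * (1 - sigmoid(zs[-l]))
        delta = weights[-l + 1].T.dot(delta) * sp
        deltas.insert(0, delta)

    # average over the batch, then take the length of each layer's vector
    return [np.linalg.norm(d.mean(axis=1)) for d in deltas]

for l, norm in enumerate(layer_gradient_norms([784, 30, 30, 10]), start=1):
    print("layer %d: gradient norm %.4f" % (l, norm))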
We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:
To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different than the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000 image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.
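If you want to set up something similar yourself with network2, one plausible approximation (my own guess at the setup, not a listing from generate_gradient.py, and the learning rate and λ here simply reuse the values from earlier, which may not match what was actually used) is to train on just the first 1,000 images with a mini-batch size equal to that whole subset, so each epoch is a single full-batch gradient step:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data[:1000], 500, 1000, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

Note that network2's SGD doesn't itself record per-layer gradient norms; generate_gradient.py handles that part.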
In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here's the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:
Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.
*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.
One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f'(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?
Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.
What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:
Here, w_1, w_2, ... are the weights, b_1, b_2, ... are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j-1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.
We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.
I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ' is just the derivative of the σ function):

   ∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4.
The structure in the expression is as follows: there is a σ'(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.
Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

   ∂C/∂b_1 ≈ ΔC/Δb_1.   (114)
This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.

To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

   Δa_1 ≈ (∂σ(w_1 a_0 + b_1)/∂b_1) Δb_1   (115)
        = σ'(z_1) Δb_1.   (116)

That σ'(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

   Δz_2 ≈ (∂z_2/∂a_1) Δa_1   (117)
        = w_2 Δa_1.   (118)

Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

   Δz_2 ≈ σ'(z_1) w_2 Δb_1.   (119)

Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.
We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ'(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

   ΔC ≈ σ'(z_1) w_2 σ'(z_2) ... σ'(z_4) (∂C/∂a_4) Δb_1.   (120)

Dividing by Δb_1 we do indeed get the desired expression for the gradient:

   ∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) ... σ'(z_4) ∂C/∂a_4.   (121)
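If you want to check equation (121) rather than take it on faith, here's a short NumPy sketch, my own and not part of the book's code. It builds the four-neuron chain with random parameters (the input activation a_0 = 0.3, target y = 0.8, and quadratic cost C = (a_4 - y)^2/2 are arbitrary choices for illustration), evaluates the product on the right-hand side directly, and compares it against a finite-difference estimate of ∂C/∂b_1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

rng = np.random.RandomState(1)
w = rng.randn(5)     # w[1]..w[4] are used, matching the text's indexing; w[0] is unused
b = rng.randn(5)     # b[1]..b[4] likewise
a0, y = 0.3, 0.8     # input activation and target output

def forward(b1):
    # Feed a0 through the four-neuron chain, using b1 in place of b[1].
    a, zs = a0, []
    biases = [b1, b[2], b[3], b[4]]
    for j in range(1, 5):
        z = w[j] * a + biases[j - 1]
        zs.append(z)
        a = sigmoid(z)
    return a, zs

a4, zs = forward(b[1])
dC_da4 = a4 - y      # since C = (a4 - y)**2 / 2

# The right-hand side of equation (121)
product = sigmoid_prime(zs[0])
for j in range(2, 5):
    product *= w[j] * sigmoid_prime(zs[j - 1])
analytic = product * dC_da4

# Central finite-difference estimate of dC/db1
eps = 1e-6
cost = lambda a: 0.5 * (a - y) ** 2
numeric = (cost(forward(b[1] + eps)[0]) - cost(forward(b[1] - eps)[0])) / (2 * eps)

print("chain-rule product: %.10f  finite difference: %.10f" % (analytic, numeric))

The two numbers should agree to many decimal places, and the same check works for any choice of weights, biases, and target.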
Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

   ∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4.   (122)

Excepting the very last term, this expression is a product of terms of the form w_j σ'(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ':
[Plot: the derivative of the sigmoid function, with values ranging from 0 up to a peak of 0.25.]

The derivative reaches a maximum at σ'(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ'(z_j) will usually satisfy |w_j σ'(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
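Here's a quick numerical illustration of that decay, again my own sketch rather than anything from the book: draw weights from a standard Gaussian, pick weighted inputs of modest size so the σ' factors are near their best-case values, and watch the product of the w_j σ'(z_j) terms shrink as the number of layers grows:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

rng = np.random.RandomState(2)
for depth in [2, 5, 10, 20, 40]:
    w = rng.randn(depth)                  # standard Gaussian weights, |w_j| usually < 1
    z = rng.randn(depth)                  # weighted inputs of modest size
    terms = np.abs(w * sigmoid_prime(z))  # each |w_j * sigma'(z_j)| is usually < 1/4
    print("depth %2d: product of terms = %.3g" % (depth, terms.prod()))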
To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

   ∂C/∂b_1 = σ'(z_1) w_2 σ'(z_2) w_3 σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4
   ∂C/∂b_3 = σ'(z_3) w_4 σ'(z_4) ∂C/∂a_4
The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ'(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.
Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ'(z_j) in the product will no longer satisfy |w_j σ'(z_j)| < 1/4. Indeed, if the terms get large enough, greater than 1, then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.
The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ'(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ'(z_j) = 1/4). So, for instance, we want z_1 = w_1 a_0 + b_1 = 0. We can achieve this by setting b_1 = -100 a_0. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ'(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
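The same arithmetic is easy to check in a few lines of Python (my own sketch of the contrived setup described above, with an arbitrary starting activation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

a = 0.5          # whatever the input activation a_0 happens to be
product = 1.0
for j in range(1, 5):        # the four neurons in the chain
    w = 100.0
    b = -w * a               # chosen so the weighted input is exactly zero
    z = w * a + b            # z = 0, hence sigma'(z) = 1/4
    product *= w * sigmoid_prime(z)
    a = sigmoid(z)           # = 0.5, which feeds the next neuron
print("product of w * sigma'(z) terms: %g" % product)   # 25**4 = 390625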
The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.
Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ'(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?
The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ'(z)|. To avoid the vanishing gradient problem we need |w σ'(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ'(z) term also depends on w: σ'(z) = σ'(wa + b), where a is the input activation. So when we make w large, we need to be careful that we're not simultaneously making σ'(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ' you can see that this puts us off in the "wings" of the σ' function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.
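To get a feel for how narrow that range really is, here's a tiny scan (my own sketch; the particular values of w and b are arbitrary choices): fix a fairly large weight, pick a bias that centres the good region inside [0, 1], and measure the set of input activations a for which |w σ'(wa + b)| ≥ 1 holds:

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

w, b = 8.0, -4.0                      # a large weight; this b centres the window near a = 0.5
a = np.linspace(0.0, 1.0, 100001)     # scan the input activation over [0, 1]
ok = np.abs(w * sigmoid_prime(w * a + b)) >= 1.0
print("fraction of [0, 1] where |w*sigma'(wa+b)| >= 1: %.3f" % ok.mean())

Even with the bias tuned this favourably, only a little over 40 percent of the input range keeps the gradient from shrinking, and for larger w the window gets narrower still.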
Problems

Consider the product |w σ'(wa + b)|. Suppose |w σ'(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ'(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

   (2/|w|) ln(|w|(1 + √(1 − 4/|w|))/2 − 1).   (123)

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.
Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.
Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?
In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

   δ^l = Σ'(z^l) (w^{l+1})^T Σ'(z^{l+1}) (w^{l+2})^T ... Σ'(z^L) ∇_a C.   (124)

Here, Σ'(z^l) is a diagonal matrix whose entries are the σ'(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.
This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ'(z^j). What's more, the matrices Σ'(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ'(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
Other obstacles to deep learning

In this chapter we've focused on vanishing gradients, and, more generally, unstable gradients, as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.
As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).
As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).
These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research.

This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.