CHAPTER 5

Why are deep neural networks hard to train?

Imagine you're an engineer who has been asked to design a computer from scratch. One day you're working away in your office, designing logical circuits, setting out AND gates, OR gates, and so on, when your boss walks in with bad news. The customer has just added a surprising design requirement: the circuit for the entire computer must be just two layers deep:

You're dumbfounded, and tell your boss: "The customer is crazy!" Your boss replies: "I think they're crazy, too. But what the customer wants, they get."

In fact, there's a limited sense in which the customer isn't crazy. Suppose you're allowed to use a special logical gate which lets you AND together as many inputs as you want. And you're also allowed a many-input NAND gate, that is, a gate which can AND multiple inputs and then negate the output. With these special gates it turns out to be possible to compute any function at all using a circuit that's just two layers deep.

But just because something is possible doesn't make it a good idea. In practice, when solving circuit design problems (or most any kind of algorithmic problem), we usually start by figuring out how to solve subproblems, and then gradually integrate the solutions. In other words, we build up to a solution through multiple layers of abstraction.

For instance, suppose we're designing a logical circuit to multiply two numbers. Chances are we want to build it up out of subcircuits doing operations like adding two numbers. The subcircuits for adding two numbers will, in turn, be built up out of sub-subcircuits for adding two bits. Very roughly speaking our circuit will look like:


That is, our final circuit contains at least three layers of circuit elements. In fact, it'll probably contain more than three layers, as we break the subtasks down into smaller units than I've described. But you get the general idea.

So deep circuits make the process of design easier. But they're not just helpful for design. There are, in fact, mathematical proofs showing that for some functions very shallow circuits require exponentially more circuit elements to compute than do deep circuits. For instance, a famous 1984 paper by Furst, Saxe and Sipser* showed that computing the parity of a set of bits requires exponentially many gates, if done with a shallow circuit. On the other hand, if you use deeper circuits it's easy to compute the parity using a small circuit: you just compute the parity of pairs of bits, then use those results to compute the parity of pairs of pairs of bits, and so on, building up quickly to the overall parity. Deep circuits thus can be intrinsically much more powerful than shallow circuits.

*See Parity, Circuits, and the Polynomial-Time Hierarchy, by Merrick Furst, James B. Saxe, and Michael Sipser (1984).
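
To make the pairing idea concrete, here is a small Python sketch (my own illustration, not part of the book's code) that computes the parity of a list of bits the way the deep circuit does: XOR adjacent pairs, then pairs of pairs, and so on. For n bits it uses roughly n two-input gates arranged in about log2(n) layers, rather than the exponentially many gates a two-layer circuit would need.

def parity(bits):
    """Compute the parity of a list of 0/1 bits by XOR-ing adjacent pairs,
    layer by layer, just as a deep circuit of small gates would."""
    layer = list(bits)
    while len(layer) > 1:
        paired = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:     # an odd bit out is carried up unchanged
            paired.append(layer[-1])
        layer = paired
    return layer[0]

print(parity([1, 0, 1, 1, 0, 1]))   # prints 0, since there are four 1s
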

Up to now, this book has approached neural networks like the crazy customer. Almost all the networks we've worked with have just a single hidden layer of neurons (plus the input and output layers):

These simple networks have been remarkably useful: in earlier chapters we used networks like this to classify handwritten digits with better than 98 percent accuracy! Nonetheless, intuitively we'd expect networks with many more hidden layers to be more powerful:

Such networks could use the intermediate layers to build up multiple layers of abstraction, just as we do in Boolean circuits. For instance, if we're doing visual pattern recognition, then the neurons in the first layer might learn to recognize edges, the neurons in the second layer could learn to recognize more complex shapes, say triangles or rectangles, built up from edges. The third layer would then recognize still more complex shapes. And so on. These multiple layers of abstraction seem likely to give deep networks a compelling advantage in learning to solve complex pattern recognition problems. Moreover, just as in the case of circuits, there are theoretical results suggesting that deep networks are intrinsically more powerful than shallow networks*.

*For certain problems and network architectures this is proved in On the number of response regions of deep feedforward networks with piecewise linear activations, by Razvan Pascanu, Guido Montúfar, and Yoshua Bengio (2014). See also the more informal discussion in section 2 of Learning deep architectures for AI, by Yoshua Bengio (2009).

How can we train such deep networks? In this chapter, we'll try training deep networks using our workhorse learning algorithm, stochastic gradient descent by backpropagation. But we'll run into trouble, with our deep networks not performing much (if at all) better than shallow networks.

That failure seems surprising in the light of the discussion above. Rather than give up on deep networks, we'll dig down and try to understand what's making our deep networks hard to train. When we look closely, we'll discover that the different layers in our deep network are learning at vastly different speeds. In particular, when later layers in the network are learning well, early layers often get stuck during training, learning almost nothing at all. This stuckness isn't simply due to bad luck. Rather, we'll discover there are fundamental reasons the learning slowdown occurs, connected to our use of gradient-based learning techniques.

As we delve into the problem more deeply, we'll learn that the opposite phenomenon can also occur: the early layers may be learning well, but later layers can become stuck. In fact, we'll find that there's an intrinsic instability associated to learning by gradient descent in deep, many-layer neural networks. This instability tends to result in either the early or the later layers getting stuck during training.

This all sounds like bad news. But by delving into these difficulties, we can begin to gain insight into what's required to train deep networks effectively. And so these investigations are good preparation for the next chapter, where we'll use deep learning to attack image recognition problems.

The vanishing gradient problem

So, what goes wrong when we try to train a deep network?

To answer that question, let's first revisit the case of a network with just a single hidden layer. As per usual, we'll use the MNIST digit classification problem as our playground for learning and experimentation*.

*I introduced the MNIST problem and data here and here.

If you wish, you can follow along by training networks on your computer. It is also, of course, fine to just read along. If you do wish to follow live, then you'll need Python 2.7, Numpy, and a copy of the code, which you can get by cloning the relevant repository from the command line:

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. You'll need to change into the src subdirectory.

Then, from a Python shell we load the MNIST data:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

We set up our network:

>>> import network2
>>> net = network2.Network([784, 30, 10])

This network has 784 neurons in the input layer, corresponding to the 28 × 28 = 784 pixels in the input image. We use 30 hidden neurons, as well as 10 output neurons, corresponding to the 10 possible classifications for the MNIST digits ('0', '1', '2', …, '9').

Let's try training our network for 30 complete epochs, using mini-batches of 10 training examples at a time, a learning rate η = 0.1, and regularization parameter λ = 5.0. As we train we'll monitor the classification accuracy on the validation_data*:

>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

*Note that the networks take quite some time to train, up to a few minutes per training epoch, depending on the speed of your machine. So if you're running the code it's best to continue reading and return later, not to wait for the code to finish executing.

We get a classification accuracy of 96.48 percent (or thereabouts; it'll vary a bit from run to run), comparable to our earlier results with a similar configuration.

Now, let's add another hidden layer, also with 30 neurons in it, and try training with the same hyper-parameters:

>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

This gives an improved classification accuracy, 96.90 percent. That's encouraging: a little more depth is helping. Let's add another 30-neuron hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

That doesn't help at all. In fact, the result drops back down to 96.57 percent, close to our original shallow network. And suppose we insert one further hidden layer:

>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)

The classification accuracy drops again, to 96.53 percent. That's probably not a statistically significant drop, but it's not encouraging, either.
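
If you'd like to rerun this little depth experiment in one go, here is a convenience loop (my own, not from the book); it simply repeats the training calls shown above for each of the four architectures, using the same hyper-parameters.

import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

for sizes in [[784, 30, 10],
              [784, 30, 30, 10],
              [784, 30, 30, 30, 10],
              [784, 30, 30, 30, 30, 10]]:
    print("Training a network with layer sizes %s" % sizes)
    net = network2.Network(sizes)
    # 30 epochs, mini-batches of 10, learning rate 0.1, lambda = 5.0,
    # monitoring accuracy on the validation data, exactly as in the text.
    net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
            evaluation_data=validation_data,
            monitor_evaluation_accuracy=True)
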
This behaviour seems strange. Intuitively, extra hidden layers ought to make the network able to learn more complex classification functions, and thus do a better job classifying. Certainly, things shouldn't get worse, since the extra layers can, in the worst case, simply do nothing*. But that's not what's going on.

*See this later problem to understand how to build a hidden layer that does nothing.

So what is going on? Let's assume that the extra hidden layers really could help in principle, and the problem is that our learning algorithm isn't finding the right weights and biases. We'd like to figure out what's going wrong in our learning algorithm, and how to do better.

To get some insight into what's going wrong, let's visualize how the network learns. Below, I've plotted part of a [784, 30, 30, 10] network, i.e., a network with two hidden layers, each containing 30 hidden neurons. Each neuron in the diagram has a little bar on it, representing how quickly that neuron is changing as the network learns. A big bar means the neuron's weights and bias are changing rapidly, while a small bar means the weights and bias are changing slowly. More precisely, the bars denote the gradient ∂C/∂b for each neuron, i.e., the rate of change of the cost with respect to the neuron's bias. Back in Chapter 2 we saw that this gradient quantity controlled not just how rapidly the bias changes during learning, but also how rapidly the weights input to the neuron change, too. Don't worry if you don't recall the details: the thing to keep in mind is simply that these bars show how quickly each neuron's weights and bias are changing as the network learns.

To keep the diagram simple, I've shown just the top six neurons in the two hidden layers. I've omitted the input neurons, since they've got no weights or biases to learn. I've also omitted the output neurons, since we're doing layer-wise comparisons, and it makes most sense to compare layers with the same number of neurons. The results are plotted at the very beginning of training, i.e., immediately after the network is initialized. Here they are*:

*The data plotted is generated using the program generate_gradient.py. The same program is also used to generate the results quoted later in this section.

The network was initialized randomly, and so it's not surprising that there's a lot of variation in how rapidly the neurons learn. Still, one thing that jumps out is that the bars in the second hidden layer are mostly much larger than the bars in the first hidden layer. As a result, the neurons in the second hidden layer will learn quite a bit faster than the neurons in the first hidden layer. Is this merely a coincidence, or are the neurons in the second hidden layer likely to learn faster than neurons in the first hidden layer in general?

To determine whether this is the case, it helps to have a global way of comparing the speed of learning in the first and second hidden layers. To do this, let's denote the gradient as δ^l_j = ∂C/∂b^l_j, i.e., the gradient for the jth neuron in the lth layer*. We can think of the gradient δ^1 as a vector whose entries determine how quickly the first hidden layer learns, and δ^2 as a vector whose entries determine how quickly the second hidden layer learns. We'll then use the lengths of these vectors as (rough!) global measures of the speed at which the layers are learning. So, for instance, the length ‖δ^1‖ measures the speed at which the first hidden layer is learning, while the length ‖δ^2‖ measures the speed at which the second hidden layer is learning.

*Back in Chapter 2 we referred to this as the error, but here we'll adopt the informal term "gradient". I say "informal" because of course this doesn't explicitly include the partial derivatives of the cost with respect to the weights, ∂C/∂w.

With these definitions, and in the same configuration as was plotted above, we find ‖δ^1‖ = 0.07 and ‖δ^2‖ = 0.31. So this confirms our earlier suspicion: the neurons in the second hidden layer really are learning much faster than the neurons in the first hidden layer.

What happens if we add more hidden layers? If we have three hidden layers, in a [784, 30, 30, 30, 10] network, then the respective speeds of learning turn out to be 0.012, 0.060, and 0.283. Again, earlier hidden layers are learning much slower than later hidden layers. Suppose we add yet another layer with 30 hidden neurons. In that case, the respective speeds of learning are 0.003, 0.017, 0.070, and 0.285. The pattern holds: early layers learn slower than later layers.
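
If you want to compute these per-layer speeds yourself, here is a rough sketch of the kind of calculation generate_gradient.py performs (this is my approximation, not the program itself). It assumes network2.Network exposes the backprop method used internally by SGD, returning per-layer bias and weight gradients for a single training example; that's true of the network2.py I'm describing, but check your copy.

import numpy as np
import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 30, 10])   # freshly initialized, untrained

# Average the bias gradients over a sample of training examples.
sample = training_data[:1000]
nabla_b = [np.zeros(b.shape) for b in net.biases]
for x, y in sample:
    delta_nabla_b, delta_nabla_w = net.backprop(x, y)
    nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_b = [nb / len(sample) for nb in nabla_b]

# net.biases covers the hidden layers and then the output layer; drop the
# last entry so we compare hidden layers only, as in the text.
for j, nb in enumerate(nabla_b[:-1]):
    print("Speed of learning, hidden layer %d: %f" % (j + 1, np.linalg.norm(nb)))
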
We've been looking at the speed of learning at the start of training, that is, just after the networks are initialized. How does the speed of learning change as we train our networks? Let's return to look at the network with just two hidden layers. The speed of learning changes as follows:

To generate these results, I used batch gradient descent with just 1,000 training images, trained over 500 epochs. This is a bit different from the way we usually train: I've used no mini-batches, and just 1,000 training images, rather than the full 50,000-image training set. I'm not trying to do anything sneaky, or pull the wool over your eyes, but it turns out that using mini-batch stochastic gradient descent gives much noisier (albeit very similar, when you average away the noise) results. Using the parameters I've chosen is an easy way of smoothing the results out, so we can see what's going on.
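
In case you want to set up a similar run yourself, the sketch below shows the basic recipe with the book's code: restrict to 1,000 training images and make the mini-batch as large as that whole set, which turns stochastic gradient descent into plain batch gradient descent. The learning rate and other details aren't spelled out in the text, so treat them as my guesses; recording the per-layer speeds at each epoch needs a calculation like the one sketched earlier (or generate_gradient.py).

import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
small_training_data = training_data[:1000]   # just 1,000 training images

net = network2.Network([784, 30, 30, 10])
# Mini-batch size equal to the data set size = full-batch gradient descent.
# The learning rate 0.1 is a guess; the text doesn't say what was used here.
net.SGD(small_training_data, 500, len(small_training_data), 0.1)
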
In any case, as you can see the two layers start out learning at very different speeds (as we already know). The speed in both layers then drops very quickly, before rebounding. But through it all, the first hidden layer learns much more slowly than the second hidden layer.

What about more complex networks? Here are the results of a similar experiment, but this time with three hidden layers (a [784, 30, 30, 30, 10] network):

Again, early hidden layers learn much more slowly than later hidden layers. Finally, let's add a fourth hidden layer (a [784, 30, 30, 30, 30, 10] network), and see what happens when we train:

Again, early hidden layers learn much more slowly than later hidden layers. In this case, the first hidden layer is learning roughly 100 times slower than the final hidden layer. No wonder we were having trouble training these networks earlier!

We have here an important observation: in at least some deep neural networks, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers. And while we've seen this in just a single network, there are fundamental reasons why this happens in many neural networks. The phenomenon is known as the vanishing gradient problem*.

*See Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, by Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber (2001). This paper studied recurrent neural nets, but the essential phenomenon is the same as in the feedforward networks we are studying. See also Sepp Hochreiter's earlier Diploma Thesis, Untersuchungen zu dynamischen neuronalen Netzen (1991, in German).

Why does the vanishing gradient problem occur? Are there ways we can avoid it? And how should we deal with it in training deep neural networks? In fact, we'll learn shortly that it's not inevitable, although the alternative is not very attractive, either: sometimes the gradient gets much larger in earlier layers! This is the exploding gradient problem, and it's not much better news than the vanishing gradient problem. More generally, it turns out that the gradient in deep neural networks is unstable, tending to either explode or vanish in earlier layers. This instability is a fundamental problem for gradient-based learning in deep neural networks. It's something we need to understand, and, if possible, take steps to address.

One response to vanishing (or unstable) gradients is to wonder if they're really such a problem. Momentarily stepping away from neural nets, imagine we were trying to numerically minimize a function f(x) of a single variable. Wouldn't it be good news if the derivative f′(x) was small? Wouldn't that mean we were already near an extremum? In a similar way, might the small gradient in early layers of a deep network mean that we don't need to do much adjustment of the weights and biases?

Of course, this isn't the case. Recall that we randomly initialized the weights and biases in the network. It is extremely unlikely our initial weights and biases will do a good job at whatever it is we want our network to do. To be concrete, consider the first layer of weights in a [784, 30, 30, 30, 10] network for the MNIST problem. The random initialization means the first layer throws away most information about the input image. Even if later layers have been extensively trained, they will still find it extremely difficult to identify the input image, simply because they don't have enough information. And so it can't possibly be the case that not much learning needs to be done in the first layer. If we're going to train deep networks, we need to figure out how to address the vanishing gradient problem.

What's causing the vanishing gradient problem? Unstable gradients in deep neural nets

To get insight into why the vanishing gradient problem occurs, let's consider the simplest deep neural network: one with just a single neuron in each layer. Here's a network with three hidden layers:

Here, w_1, w_2, … are the weights, b_1, b_2, … are the biases, and C is some cost function. Just to remind you how this works, the output a_j from the jth neuron is σ(z_j), where σ is the usual sigmoid activation function, and z_j = w_j a_{j-1} + b_j is the weighted input to the neuron. I've drawn the cost C at the end to emphasize that the cost is a function of the network's output, a_4: if the actual output from the network is close to the desired output, then the cost will be low, while if it's far away, the cost will be high.

We're going to study the gradient ∂C/∂b_1 associated to the first hidden neuron. We'll figure out an expression for ∂C/∂b_1, and by studying that expression we'll understand why the vanishing gradient problem occurs.

I'll start by simply showing you the expression for ∂C/∂b_1. It looks forbidding, but it's actually got a simple structure, which I'll describe in a moment. Here's the expression (ignore the network, for now, and note that σ′ is just the derivative of the σ function):

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4.

The structure in the expression is as follows: there is a σ′(z_j) term in the product for each neuron in the network; a weight w_j term for each weight in the network; and a final ∂C/∂a_4 term, corresponding to the cost function at the end. Notice that I've placed each term in the expression above the corresponding part of the network. So the network itself is a mnemonic for the expression.

You're welcome to take this expression for granted, and skip to the discussion of how it relates to the vanishing gradient problem. There's no harm in doing this, since the expression is a special case of our earlier discussion of backpropagation. But there's also a simple explanation of why the expression is true, and so it's fun (and perhaps enlightening) to take a look at that explanation.

Imagine we make a small change Δb_1 in the bias b_1. That will set off a cascading series of changes in the rest of the network. First, it causes a change Δa_1 in the output from the first hidden neuron. That, in turn, will cause a change Δz_2 in the weighted input to the second hidden neuron. Then a change Δa_2 in the output from the second hidden neuron. And so on, all the way through to a change ΔC in the cost at the output. We have

  ∂C/∂b_1 ≈ ΔC / Δb_1.   (114)

This suggests that we can figure out an expression for the gradient ∂C/∂b_1 by carefully tracking the effect of each step in this cascade.

To do this, let's think about how Δb_1 causes the output a_1 from the first hidden neuron to change. We have a_1 = σ(z_1) = σ(w_1 a_0 + b_1), so

  Δa_1 ≈ (∂σ(w_1 a_0 + b_1)/∂b_1) Δb_1   (115)
       = σ′(z_1) Δb_1.   (116)

That σ′(z_1) term should look familiar: it's the first term in our claimed expression for the gradient ∂C/∂b_1. Intuitively, this term converts a change Δb_1 in the bias into a change Δa_1 in the output activation. That change Δa_1 in turn causes a change in the weighted input z_2 = w_2 a_1 + b_2 to the second hidden neuron:

  Δz_2 ≈ (∂z_2/∂a_1) Δa_1   (117)
       = w_2 Δa_1.   (118)

Combining our expressions for Δz_2 and Δa_1, we see how the change in the bias b_1 propagates along the network to affect z_2:

  Δz_2 ≈ σ′(z_1) w_2 Δb_1.   (119)

Again, that should look familiar: we've now got the first two terms in our claimed expression for the gradient ∂C/∂b_1.

We can keep going in this fashion, tracking the way changes propagate through the rest of the network. At each neuron we pick up a σ′(z_j) term, and through each weight we pick up a w_j term. The end result is an expression relating the final change ΔC in cost to the initial change Δb_1 in the bias:

  ΔC ≈ σ′(z_1) w_2 σ′(z_2) … σ′(z_4) (∂C/∂a_4) Δb_1.   (120)

Dividing by Δb_1 we do indeed get the desired expression for the gradient:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) … σ′(z_4) ∂C/∂a_4.   (121)

Why the vanishing gradient problem occurs: To understand why the vanishing gradient problem occurs, let's explicitly write out the entire expression for the gradient:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4.   (122)

Excepting the very last term, this expression is a product of terms of the form w_j σ′(z_j). To understand how each of those terms behaves, let's look at a plot of the function σ′:

[Plot: the derivative σ′ of the sigmoid function, which peaks at σ′(0) = 0.25 and falls off toward zero on either side.]

The derivative reaches a maximum at σ′(0) = 1/4. Now, if we use our standard approach to initializing the weights in the network, then we'll choose the weights using a Gaussian with mean 0 and standard deviation 1. So the weights will usually satisfy |w_j| < 1. Putting these observations together, we see that the terms w_j σ′(z_j) will usually satisfy |w_j σ′(z_j)| < 1/4. And when we take a product of many such terms, the product will tend to exponentially decrease: the more terms, the smaller the product will be. This is starting to smell like a possible explanation for the vanishing gradient problem.
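
To get a feel for the numbers, here is a little numerical sketch (mine, not part of the book's code). It draws the four weights and biases of the single-neuron chain from the standard Gaussian initialization, then multiplies together the σ′(z_j) and w_j factors appearing in the expression for ∂C/∂b_1, leaving aside the final ∂C/∂a_4 factor. The product typically comes out tiny, on the order of a thousandth or smaller.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

np.random.seed(0)
w = np.random.randn(4)   # weights w_1..w_4, standard Gaussian initialization
b = np.random.randn(4)   # biases b_1..b_4

a = np.random.rand()     # input activation a_0
product = 1.0
for j in range(4):
    z = w[j] * a + b[j]
    # sigma'(z_1) enters on its own; each later neuron contributes w_j * sigma'(z_j).
    product *= sigma_prime(z) if j == 0 else w[j] * sigma_prime(z)
    a = sigma(z)

print("Product of the sigma' and weight factors: %g" % product)
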
To make this all a bit more explicit, let's compare the expression for ∂C/∂b_1 to an expression for the gradient with respect to a later bias, say ∂C/∂b_3. Of course, we haven't explicitly worked out an expression for ∂C/∂b_3, but it follows the same pattern described above for ∂C/∂b_1. Here's the comparison of the two expressions:

  ∂C/∂b_1 = σ′(z_1) w_2 σ′(z_2) w_3 σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4
  ∂C/∂b_3 = σ′(z_3) w_4 σ′(z_4) ∂C/∂a_4

The two expressions share many terms. But the gradient ∂C/∂b_1 includes two extra terms, each of the form w_j σ′(z_j). As we've seen, such terms are typically less than 1/4 in magnitude. And so the gradient ∂C/∂b_1 will usually be a factor of 16 (or more) smaller than ∂C/∂b_3. This is the essential origin of the vanishing gradient problem.

Of course, this is an informal argument, not a rigorous proof that the vanishing gradient problem will occur. There are several possible escape clauses. In particular, we might wonder whether the weights w_j could grow during training. If they do, it's possible the terms w_j σ′(z_j) in the product will no longer satisfy |w_j σ′(z_j)| < 1/4. Indeed, if the terms get large enough, greater than 1, then we will no longer have a vanishing gradient problem. Instead, the gradient will actually grow exponentially as we move backward through the layers. Instead of a vanishing gradient problem, we'll have an exploding gradient problem.

The exploding gradient problem: Let's look at an explicit example where exploding gradients occur. The example is somewhat contrived: I'm going to fix parameters in the network in just the right way to ensure we get an exploding gradient. But even though the example is contrived, it has the virtue of firmly establishing that exploding gradients aren't merely a hypothetical possibility, they really can happen.

There are two steps to getting an exploding gradient. First, we choose all the weights in the network to be large, say w_1 = w_2 = w_3 = w_4 = 100. Second, we'll choose the biases so that the σ′(z_j) terms are not too small. That's actually pretty easy to do: all we need do is choose the biases to ensure that the weighted input to each neuron is z_j = 0 (and so σ′(z_j) = 1/4). So, for instance, we want z_1 = w_1 a_0 + b_1 = 0. We can achieve this by setting b_1 = −100 a_0. We can use the same idea to select the other biases. When we do this, we see that all the terms w_j σ′(z_j) are equal to 100 × 1/4 = 25. With these choices we get an exploding gradient.
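
As a quick check on the arithmetic (again my sketch, not the book's), here is the resulting product of factors in the expression for ∂C/∂b_1, leaving aside the final ∂C/∂a_4 factor:

first_factor = 0.25          # sigma'(z_1), with z_1 = 0
later_factor = 100 * 0.25    # each w_j * sigma'(z_j) = 25, for j = 2, 3, 4
print(first_factor * later_factor ** 3)   # 3906.25: the gradient has exploded
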

The unstable gradient problem: The fundamental problem here isn't so much the vanishing gradient problem or the exploding gradient problem. It's that the gradient in early layers is the product of terms from all the later layers. When there are many layers, that's an intrinsically unstable situation. The only way all layers can learn at close to the same speed is if all those products of terms come close to balancing out. Without some mechanism or underlying reason for that balancing to occur, it's highly unlikely to happen simply by chance. In short, the real problem here is that neural networks suffer from an unstable gradient problem. As a result, if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds.

Exercise

In our discussion of the vanishing gradient problem, we made use of the fact that |σ′(z)| < 1/4. Suppose we used a different activation function, one whose derivative could be much larger. Would that help us avoid the unstable gradient problem?

The prevalence of the vanishing gradient problem: We've seen that the gradient can either vanish or explode in the early layers of a deep network. In fact, when using sigmoid neurons the gradient will usually vanish. To see why, consider again the expression |w σ′(z)|. To avoid the vanishing gradient problem we need |w σ′(z)| ≥ 1. You might think this could happen easily if w is very large. However, it's more difficult than it looks. The reason is that the σ′(z) term also depends on w: σ′(z) = σ′(wa + b), where a is the input activation. So when we make w large, we need to be careful that we're not simultaneously making σ′(wa + b) small. That turns out to be a considerable constraint. The reason is that when we make w large we tend to make wa + b very large. Looking at the graph of σ′ you can see that this puts us off in the "wings" of the σ′ function, where it takes very small values. The only way to avoid this is if the input activation falls within a fairly narrow range of values (this qualitative explanation is made quantitative in the first problem below). Sometimes that will chance to happen. More often, though, it does not happen. And so in the generic case we have vanishing gradients.

Problems

• Consider the product |w σ′(wa + b)|. Suppose |w σ′(wa + b)| ≥ 1. (1) Argue that this can only ever occur if |w| ≥ 4. (2) Supposing that |w| ≥ 4, consider the set of input activations a for which |w σ′(wa + b)| ≥ 1. Show that the set of a satisfying that constraint can range over an interval no greater in width than

  (2/|w|) ln( |w| (1 + √(1 − 4/|w|)) / 2 − 1 ).   (123)

(3) Show numerically that the above expression bounding the width of the range is greatest at |w| ≈ 6.9, where it takes a value ≈ 0.45. (A quick numerical check is sketched just after these problems.) And so even given that everything lines up just perfectly, we still have a fairly narrow range of input activations which can avoid the vanishing gradient problem.

• Identity neuron: Consider a neuron with a single input, x, a corresponding weight, w_1, a bias b, and a weight w_2 on the output. Show that by choosing the weights and bias appropriately, we can ensure w_2 σ(w_1 x + b) ≈ x for x ∈ [0, 1]. Such a neuron can thus be used as a kind of identity neuron, that is, a neuron whose output is the same (up to rescaling by a weight factor) as its input. Hint: It helps to rewrite x = 1/2 + Δ, to assume w_1 is small, and to use a Taylor series expansion in w_1 Δ.
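
For part (3) of the first problem, a numerical check can be done along the following lines (a sketch of mine; it just evaluates the bound from equation (123) on a grid of |w| values):

import numpy as np

def width_bound(w):
    # The bound from equation (123) on the width of the interval of input
    # activations a for which |w * sigma'(w*a + b)| >= 1 can hold.
    return (2.0 / w) * np.log(w * (1.0 + np.sqrt(1.0 - 4.0 / w)) / 2.0 - 1.0)

ws = np.linspace(4.01, 20.0, 100000)
bounds = width_bound(ws)
best = np.argmax(bounds)
print("Maximum width %.3f at |w| = %.2f" % (bounds[best], ws[best]))
# Prints a maximum of about 0.45, attained at |w| close to 6.9.
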

Unstable gradients in more complex networks

We've been studying toy networks, with just one neuron in each hidden layer. What about more complex deep networks, with many neurons in each hidden layer?

In fact, much the same behaviour occurs in such networks. In the earlier chapter on backpropagation we saw that the gradient in the lth layer of an L layer network is given by:

  δ^l = Σ′(z^l) (w^{l+1})^T Σ′(z^{l+1}) (w^{l+2})^T … Σ′(z^L) ∇_a C.   (124)

Here, Σ′(z^l) is a diagonal matrix whose entries are the σ′(z) values for the weighted inputs to the lth layer. The w^l are the weight matrices for the different layers. And ∇_a C is the vector of partial derivatives of C with respect to the output activations.

This is a much more complicated expression than in the single-neuron case. Still, if you look closely, the essential form is very similar, with lots of pairs of the form (w^j)^T Σ′(z^j). What's more, the matrices Σ′(z^j) have small entries on the diagonal, none larger than 1/4. Provided the weight matrices w^j aren't too large, each additional term (w^j)^T Σ′(z^j) tends to make the gradient vector smaller, leading to a vanishing gradient. More generally, the large number of terms in the product tends to lead to an unstable gradient, just as in our earlier example. In practice, empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning.
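
To see the same shrinking effect with full layers rather than single neurons, here is a short numpy sketch (mine, not the book's). It initializes a [784, 30, 30, 30, 30, 10] sigmoid network with standard Gaussian weights, then applies the backpropagation recurrence behind equation (124) to a dummy cost gradient of all ones, and prints the length of δ^l for each layer. Only the relative sizes matter, and with this initialization they typically fall off sharply as we move toward earlier layers.

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    return sigma(z) * (1.0 - sigma(z))

np.random.seed(1)
sizes = [784, 30, 30, 30, 30, 10]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m, 1) for m in sizes[1:]]

# Forward pass for a random input, recording the weighted inputs z^l.
a = np.random.rand(sizes[0], 1)
zs = []
for w, b in zip(weights, biases):
    z = np.dot(w, a) + b
    zs.append(z)
    a = sigma(z)

# Backward pass. We use a dummy cost gradient of all ones, since only the
# relative sizes of the delta vectors matter here.
delta = np.ones((sizes[-1], 1)) * sigma_prime(zs[-1])
deltas = [delta]
for l in range(len(weights) - 1, 0, -1):
    delta = np.dot(weights[l].transpose(), delta) * sigma_prime(zs[l - 1])
    deltas.insert(0, delta)

for l, d in enumerate(deltas, start=1):
    print("||delta|| for layer %d (first hidden layer = 1): %g" % (l, np.linalg.norm(d)))
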

Other obstacles to deep learning

In this chapter we've focused on vanishing gradients and, more generally, unstable gradients as an obstacle to deep learning. In fact, unstable gradients are just one obstacle to deep learning, albeit an important fundamental obstacle. Much ongoing research aims to better understand the challenges that can occur when training deep networks. I won't comprehensively summarize that work here, but just want to briefly mention a couple of papers, to give you the flavor of some of the questions people are asking.

As a first example, in 2010 Glorot and Bengio* found evidence suggesting that the use of sigmoid activation functions can cause problems training deep networks. In particular, they found evidence that the use of sigmoids will cause the activations in the final hidden layer to saturate near 0 early in training, substantially slowing down learning. They suggested some alternative activation functions, which appear not to suffer as much from this saturation problem.

*Understanding the difficulty of training deep feedforward neural networks, by Xavier Glorot and Yoshua Bengio (2010). See also the earlier discussion of the use of sigmoids in Efficient BackProp, by Yann LeCun, Léon Bottou, Genevieve Orr and Klaus-Robert Müller (1998).

As a second example, in 2013 Sutskever, Martens, Dahl and Hinton* studied the impact on deep learning of both the random weight initialization and the momentum schedule in momentum-based stochastic gradient descent. In both cases, making good choices made a substantial difference in the ability to train deep networks.

*On the importance of initialization and momentum in deep learning, by Ilya Sutskever, James Martens, George Dahl and Geoffrey Hinton (2013).

These examples suggest that "What makes deep networks hard to train?" is a complex question. In this chapter, we've focused on the instabilities associated to gradient-based learning in deep networks. The results in the last two paragraphs suggest that there is also a role played by the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented. And, of course, choice of network architecture and other hyper-parameters is also important. Thus, many factors can play a role in making deep networks hard to train, and understanding all those factors is still a subject of ongoing research.

This all seems rather downbeat and pessimism-inducing. But the good news is that in the next chapter we'll turn that around, and develop several approaches to deep learning that to some extent manage to overcome or route around all these challenges.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.

Last update: Fri Jul 10 08:53:05 2015.

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
