
1/10/2017 Neural networks and deep learning

CHAPTER 1

Using neural nets to recognize handwritten digits

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices, V2, V3, V4, and V5, doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes, "a 9 has a loop at the top, and a vertical stroke in the bottom right", turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

By Michael Nielsen / Jan 2017
http://neuralnetworksanddeeplearning.com/chap1.html

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples, and then develop a system which can learn from those training examples.
In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging, it's no small feat to recognize handwritten digits, but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons. In this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$
\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}
$$

That's all there is to how a perceptron works!
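The rule in Equation (1) is nothing more than a weighted sum and a comparison, so it translates directly into a few lines of code. A quick sketch (the function and variable names here are mine, purely for illustration):

```python
def perceptron_output(weights, inputs, threshold):
    """Equation (1): output 1 if the weighted sum of the inputs
    exceeds the threshold, and 0 otherwise."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# A three-input perceptron with weights 2, 1, 1 and threshold 2:
print(perceptron_output([2, 1, 1], [1, 0, 1], 2))  # 1: weighted sum 3 > 2
print(perceptron_output([2, 1, 1], [0, 1, 1], 2))  # 0: weighted sum 2 is not > 2
```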

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car.)

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
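To make the arithmetic concrete, here is the festival decision sketched in Python, with the weights 6, 2, 2 from above and the two thresholds just discussed (the function name is mine):

```python
def go_to_festival(weather, partner, transit, threshold):
    """Weigh the three binary factors with weights w1=6, w2=2, w3=2
    and output 1 (go) if the weighted sum exceeds the threshold."""
    weighted_sum = 6 * weather + 2 * partner + 2 * transit
    return 1 if weighted_sum > threshold else 0

# Threshold 5: only the weather matters.
print(go_to_festival(1, 0, 0, threshold=5))  # 1: good weather alone suffices
print(go_to_festival(0, 1, 1, threshold=5))  # 0: partner and transit don't

# Threshold 3: good weather, or partner plus transit together, suffices.
print(go_to_festival(0, 1, 1, threshold=3))  # 1: sum 4 > 3
```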

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons, what we'll call the first layer of perceptrons, is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision-making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$
\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2}
$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
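In code, moving from the threshold form of Equation (1) to the bias form of Equation (2) is just a sign convention: a threshold of 5 becomes a bias of $-5$. A minimal sketch, using NumPy's dot product for $w \cdot x$ (the example weights are the festival weights from earlier):

```python
import numpy as np

def perceptron(w, x, b):
    """Perceptron rule in the bias form of Equation (2):
    output 1 if w.x + b > 0, and 0 otherwise."""
    return 1 if np.dot(w, x) + b > 0 else 0

# The festival neuron, now written with a bias of -5 instead of a threshold of 5:
w = np.array([6.0, 2.0, 2.0])
print(perceptron(w, np.array([1, 0, 0]), -5.0))  # 1: fires
print(perceptron(w, np.array([0, 1, 1]), -5.0))  # 0: doesn't fire
```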

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of 3. Here's our perceptron:

Then we see that the input 00 produces the output 1, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce the output 1. But the input 11 produces the output 0, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
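We can verify the whole truth table of this perceptron directly. A quick sketch:

```python
def nand_perceptron(x1, x2):
    """The perceptron from the text: two inputs with weight -2
    each and a bias of 3, which computes NAND."""
    weighted_sum = (-2) * x1 + (-2) * x2 + 3
    return 1 if weighted_sum > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', nand_perceptron(x1, x2))
# The table matches NAND: only the input 11 produces 0.
```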

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to 1 when both $x_1$ and $x_2$ are 1, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double output to the same place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to 3, and a single weight of $-4$, as marked:
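We can simulate this network in code by wiring up copies of the perceptron-NAND. The wiring below follows the standard NAND half-adder layout the figures above depict; the helper names are mine, for illustration:

```python
def nand(x1, x2):
    """NAND as a perceptron: weights -2, -2 and bias 3."""
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def half_adder(x1, x2):
    """Bitwise sum (x1 XOR x2) and carry bit (x1 AND x2),
    built entirely out of NAND gates."""
    a = nand(x1, x2)                           # shared intermediate gate
    bit_sum = nand(nand(x1, a), nand(x2, a))   # the bitwise sum
    carry = nand(a, a)                         # carry = x1 AND x2
    return bit_sum, carry

print(half_adder(1, 1))  # (0, 1): sum 0, carry 1
print(half_adder(1, 0))  # (1, 0): sum 1, carry 0
```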

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons, the input layer, to encode the inputs:


This notation for input perceptrons, in which we have an output, but no inputs,

is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output 1 if $b > 0$, and 0 if $b \le 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$ But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638 is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

$$
\sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}
$$

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$
\frac{1}{1+\exp\!\left(-\sum_j w_j x_j - b\right)}. \tag{4}
$$
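Equations (3) and (4) translate directly into code. A minimal sketch with NumPy; the particular weights, input, and bias are arbitrary illustrative values of mine:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron_output(w, x, b):
    """Output of a sigmoid neuron, Equation (4): sigma
    applied to the weighted input w.x + b."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([6.0, 2.0, 2.0])
x = np.array([1.0, 0.0, 0.0])
print(sigmoid_neuron_output(w, x, -5.0))  # about 0.731: neither a hard 0 nor a hard 1
```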

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
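This limiting behaviour is easy to check numerically. The sketch below just restates Equation (3) and evaluates it at a few values of $z$:

```python
import math

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive z: output is essentially 1, like a firing perceptron.
print(sigmoid(20.0))   # extremely close to 1
# Large negative z: output is essentially 0.
print(sigmoid(-20.0))  # extremely close to 0
# Modest z: genuinely in between, unlike a perceptron.
print(sigmoid(0.5))    # about 0.62
```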

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important: what really matters is the shape of the function when plotted. Here's the shape:

[Figure: plot of the sigmoid function, a smooth S-shaped curve rising from 0 toward 1 as $z$ runs from $-4$ to $4$.]

This shape is a smoothed out version of a step function:

[Figure: plot of the step function, jumping from 0 to 1 at $z = 0$.]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

$$
\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b, \tag{5}
$$

*Actually, when $w \cdot x + b = 0$ the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.

where the sum is over all the weights, $w_j$, and $\partial\, \text{output} / \partial w_j$ and $\partial\, \text{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
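Equation (5) is easy to check numerically: nudge the weights and bias by small amounts, and compare the actual change in a sigmoid neuron's output against the linear estimate. The sketch below uses arbitrary illustrative numbers, plus the standard calculus fact that $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])        # illustrative weights
x = np.array([0.5, 0.9])         # illustrative inputs
b = 0.1                          # illustrative bias
dw = np.array([0.001, -0.002])   # small changes in the weights
db = 0.0005                      # small change in the bias

# Actual change in the output.
actual = sigmoid(np.dot(w + dw, x) + b + db) - sigmoid(np.dot(w, x) + b)

# Equation (5): the partials of the output are sigma'(z) * x_j and sigma'(z),
# where sigma'(z) = sigma(z) * (1 - sigma(z)).
z = np.dot(w, x) + b
sprime = sigmoid(z) * (1 - sigmoid(z))
predicted = np.sum(sprime * x * dw) + sprime * db

print(actual, predicted)  # the two values agree to several decimal places
```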

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173 and 0.689 are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.


Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem, a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious (the first time I heard the term I thought it must have some deep philosophical or mathematical significance) but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple-layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have $4{,}096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network: information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.

A simple network to classify handwritten digits
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

into six separate images,

We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:


The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in-between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.
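Although we won't meet the actual program until later, a forward pass through a layered network like this is just Equation (4) applied one layer at a time. The sketch below uses random weights, so its output is meaningless until the network is trained; the variable names are mine:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function of Equation (3), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [784, 15, 10]  # input layer, hidden layer, output layer

# One weight matrix and one bias vector per non-input layer.
weights = [rng.standard_normal((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((y, 1)) for y in sizes[1:]]

def feedforward(a):
    """Apply sigma(w.a + b) for each layer in turn."""
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

image = rng.random((784, 1))   # stand-in for a scanned digit's pixel intensities
output = feedforward(image)
print(output.shape)            # (10, 1): one activation per output neuron
print(int(np.argmax(output)))  # the (untrained, so meaningless) guess
```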

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.

You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:


So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0: we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.


Learning with gradient descent
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from, a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation x to denote a training input. It'll be convenient to regard each training input x as a 28 × 28 = 784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y = y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector.
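To make the encoding concrete, here's a minimal sketch (Python 3 with Numpy, separate from the book's program; the helper name vectorized_result is just illustrative) of building such a desired-output vector:

```python
import numpy as np

def vectorized_result(j):
    """Return a 10-dimensional unit column vector with a 1.0 in the
    j-th position and zeroes elsewhere, encoding the digit j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(6)
# y is a (10, 1) column vector; y[6] is 1.0, every other entry is 0.0.
```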

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we're achieving this goal we define a cost function*:

    C(w, b) ≡ (1/2n) Σ_x ‖y(x) − a‖²   (6)

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven't explicitly indicated this dependence. The notation ‖v‖ just denotes the usual length function for a vector v. We'll call C the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that C(w, b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w, b) becomes small, i.e., C(w, b) ≈ 0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w, b) ≈ 0. By contrast, it's not doing so well when C(w, b) is large; that would mean that y(x) is not close to the output a for a large number of inputs. So the aim of our training algorithm will be to minimize the cost C(w, b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
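As a concrete illustration of the formula, the following sketch (Python 3 with Numpy; quadratic_cost and the toy data are illustrative, not part of the book's code) computes C for a handful of output/target pairs:

```python
import numpy as np

def quadratic_cost(outputs, desired):
    """C = (1/2n) * sum over x of ||y(x) - a||^2, where outputs holds
    the network outputs a and desired holds the targets y(x)."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, desired)) / (2.0 * n)

# Two toy training inputs with 10-dimensional one-hot targets.
desired = [np.zeros((10, 1)), np.zeros((10, 1))]
desired[0][3], desired[1][7] = 1.0, 1.0

perfect = [d.copy() for d in desired]    # network output equals target
assert quadratic_cost(perfect, desired) == 0.0

all_zero = [np.zeros((10, 1)), np.zeros((10, 1))]
# Each example contributes ||y - a||^2 = 1, so C = (1/(2*2)) * (1 + 1).
assert quadratic_cost(all_zero, desired) == 0.5
```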

Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w, b). This is a well-posed problem, but it's got a lot of distracting structure as currently posed: the interpretation of w and b as weights and biases, the σ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, C(v). This could be any real-valued function of many variables, v = v_1, v_2, …. Note that I've replaced the w and b notation by v to emphasize that this could be any function; we're not specifically thinking in the neural networks context anymore. To minimize C(v) it helps to imagine C as a function of just two variables, which we'll call v_1 and v_2:


What we'd like is to find where C achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, C, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where C is an extremum. With some luck that might work when C is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables; the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining C as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine C as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C; those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously; we're devising an algorithm to minimize C, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount Δv_1 in the v_1 direction, and a small amount Δv_2 in the v_2 direction. Calculus tells us that C changes as follows:

    ΔC ≈ (∂C/∂v_1) Δv_1 + (∂C/∂v_2) Δv_2.   (7)

We're going to find a way of choosing Δv_1 and Δv_2 so as to make ΔC negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define Δv to be the vector of changes in v, Δv ≡ (Δv_1, Δv_2)^T, where T is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of C to be the vector of partial derivatives, (∂C/∂v_1, ∂C/∂v_2)^T. We denote the gradient vector by ∇C, i.e.:

    ∇C ≡ (∂C/∂v_1, ∂C/∂v_2)^T.   (8)


In a moment we'll rewrite the change ΔC in terms of Δv and the gradient, ∇C. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the ∇C notation for the first time, people sometimes wonder how they should think about the ∇ symbol. What, exactly, does ∇ mean? In fact, it's perfectly fine to think of ∇C as a single mathematical object, the vector defined above, which happens to be written using two symbols. In this point of view, ∇ is just a piece of notational flag-waving, telling you "hey, ∇C is a gradient vector". There are more advanced points of view where ∇ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for ΔC can be rewritten as

    ΔC ≈ ∇C · Δv.   (9)

This equation helps explain why ∇C is called the gradient vector: ∇C relates changes in v to changes in C, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose Δv so as to make ΔC negative. In particular, suppose we choose

    Δv = −η∇C,   (10)

where η is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that ΔC ≈ −η∇C · ∇C = −η‖∇C‖². Because ‖∇C‖² ≥ 0, this guarantees that ΔC ≤ 0, i.e., C will always decrease, never increase, if we change v according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for Δv, then move the ball's position v by that amount:

    v → v' = v − η∇C.   (11)

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing C until (we hope) we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient ∇C, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing Δv just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate η to be small enough that Equation (9) is a good approximation. If we don't, we might end up with ΔC > 0, which obviously would not be good! At the same time, we don't want η to be too small, since that will make the changes Δv tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, η is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
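To see the update rule in action, here's a minimal sketch (Python 3 with Numpy, not part of the book's code) of gradient descent on the simple two-variable function C(v) = v_1² + v_2², whose gradient ∇C = (2v_1, 2v_2) we can write down by hand:

```python
import numpy as np

def grad_C(v):
    # Gradient of C(v) = v1^2 + v2^2.
    return 2 * v

v = np.array([3.0, -4.0])    # starting position of the "ball"
eta = 0.1                    # learning rate
for _ in range(100):
    v = v - eta * grad_C(v)  # the update rule v -> v' = v - eta * grad C

# After 100 steps the ball has rolled essentially to the minimum at (0, 0).
assert np.linalg.norm(v) < 1e-6
```

Try replacing eta = 0.1 with a value above 1.0: each step then overshoots the minimum and C grows instead of shrinking, which is exactly the ΔC > 0 failure described above.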

I've explained gradient descent when C is a function of just two variables. But, in fact, everything works just as well even when C is a function of many more variables. Suppose in particular that C is a function of m variables, v_1, …, v_m. Then the change ΔC in C produced by a small change Δv = (Δv_1, …, Δv_m)^T is

    ΔC ≈ ∇C · Δv,   (12)

where the gradient ∇C is the vector

    ∇C ≡ (∂C/∂v_1, …, ∂C/∂v_m)^T.   (13)

Just as for the two-variable case, we can choose

    Δv = −η∇C,   (14)

and we're guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule

    v → v' = v − η∇C.   (15)

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position v in order to find a minimum of the function C. The rule doesn't always work; several things can go wrong and prevent gradient descent from finding the global minimum of C, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move Δv in position so as to decrease C as much as possible. This is equivalent to minimizing ΔC ≈ ∇C · Δv. We'll constrain the size of the move so that ‖Δv‖ = ε for some small fixed ε > 0. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of Δv which minimizes ∇C · Δv is Δv = −η∇C, where η = ε/‖∇C‖ is determined by the size constraint ‖Δv‖ = ε. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.

Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when C is a function of two variables, and when it's a function of more than two variables. What happens when C is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives ∂²C/∂v_j∂v_k. If there are a million such v_j variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since ∂²C/∂v_j∂v_k = ∂²C/∂v_k∂v_j. Still, you get the point.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights w_k and biases b_l which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables v_j. In other words, our "position" now has components w_k and b_l, and the gradient vector ∇C has corresponding components ∂C/∂w_k and ∂C/∂b_l. Writing out the gradient descent update rule in terms of components, we have

    w_k → w_k' = w_k − η ∂C/∂w_k   (16)
    b_l → b_l' = b_l − η ∂C/∂b_l.   (17)

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form C = (1/n) Σ_x C_x, that is, it's an average over costs C_x ≡ ‖y(x) − a‖²/2 for individual training examples. In practice, to compute the gradient ∇C we need to compute the gradients ∇C_x separately for each training input, x, and then average them, ∇C = (1/n) Σ_x ∇C_x. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient ∇C by computing ∇C_x for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number m of randomly chosen training inputs. We'll label those random training inputs X_1, X_2, …, X_m, and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the ∇C_Xj will be roughly equal to the average over all ∇C_x, that is,

    (Σ_j ∇C_Xj)/m ≈ (Σ_x ∇C_x)/n = ∇C,   (18)

where the second sum is over the entire set of training data. Swapping sides we get

    ∇C ≈ (1/m) Σ_j ∇C_Xj,   (19)

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
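The following sketch (Python 3 with Numpy; the per-example gradient v − x comes from the illustrative cost C_x = ‖v − x‖²/2, not from a real network) checks that a mini-batch average does approximate the full-batch gradient:

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)

# For the toy per-example cost C_x = ||v - x||^2 / 2, the per-example
# gradient with respect to v is simply v - x.
v = np.array([0.5, -0.5])
training_inputs = [np.random.randn(2) for _ in range(60000)]

full_gradient = sum(v - x for x in training_inputs) / len(training_inputs)

mini_batch = random.sample(training_inputs, 100)   # m = 100
estimate = sum(v - x for x in mini_batch) / len(mini_batch)

# The estimate lands close to the true gradient, at a tiny fraction
# of the cost of summing over all 60,000 training inputs.
assert np.linalg.norm(estimate - full_gradient) < 0.5
```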

To connect this explicitly to learning in neural networks, suppose w_k and b_l denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

    w_k → w_k' = w_k − (η/m) Σ_j ∂C_Xj/∂w_k   (20)
    b_l → b_l' = b_l − (η/m) Σ_j ∂C_Xj/∂b_l,   (21)

where the sums are over all the training examples X_j in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.


Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor 1/n. People sometimes omit the 1/n, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the 1/m term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate η. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size n = 60,000, as in MNIST, and choose a mini-batch size of (say) m = 10, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect (there will be statistical fluctuations) but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease C, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, x, we update our weights and biases according to the rules w_k → w_k' = w_k − η ∂C_x/∂w_k and b_l → b_l' = b_l − η ∂C_x/∂b_l. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost C is, of course, a function of many variables (all the weights and biases) and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of ΔC to figure out how to move so as to decrease C. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits
Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

    git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set.

We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network, things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set*.

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

    class Network(object):

        def __init__(self, sizes):
            self.num_layers = len(sizes)
            self.sizes = sizes
            self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
            self.weights = [np.random.randn(y, x)
                            for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
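To make the resulting shapes concrete, here's a quick check (a Python 3 sketch, mirroring the initialization code above) for a [2, 3, 1] network:

```python
import numpy as np

sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x)
           for x, y in zip(sizes[:-1], sizes[1:])]

# One bias column vector per non-input layer.
assert [b.shape for b in biases] == [(3, 1), (1, 1)]
# One weight matrix per adjacent pair of layers: rows index the later
# layer's neurons, columns index the earlier layer's neurons.
assert [w.shape for w in weights] == [(3, 2), (1, 3)]
```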

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. It's a matrix such that w_jk is the weight for the connection between the k-th neuron in the second layer, and the j-th neuron in the third layer. This ordering of the j and k indices may seem strange; surely it'd make more sense to swap the j and k indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

    a' = σ(wa + b).   (22)

There's quite a bit going on in this equation, so let's unpack it piece by piece. a is the vector of activations of the second layer of neurons. To obtain a' we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa + b. (This is called vectorizing the function σ.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

    def sigmoid(z):
        return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.
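For example, a quick check (a Python 3 sketch, reusing the sigmoid definition above):

```python
import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

# On a scalar, sigmoid behaves as expected.
assert sigmoid(0.0) == 0.5

# On a Numpy array, it is applied to every element independently,
# preserving the array's shape.
z = np.array([[-30.0], [0.0], [30.0]])
out = sigmoid(z)
assert out.shape == (3, 1)
assert out[0, 0] < 1e-12
assert abs(out[1, 0] - 0.5) < 1e-12
assert out[2, 0] > 0.999
```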

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

*It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feed forward multiple inputs at once, and that is sometimes convenient.

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect: the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

    delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.
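Incidentally, "behaves as claimed" is something you can check. A standard sanity test for any gradient routine is to compare the analytic gradient against a finite-difference estimate. Here's a sketch of that check for a single sigmoid neuron with quadratic cost (my own illustration in Python 3, not part of network.py):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, x, y):
    """Quadratic cost C_x = (a - y)^2 / 2 for a one-weight sigmoid neuron."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def analytic_grad_w(w, b, x, y):
    """dC_x/dw worked out by the chain rule (what backprop computes)."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1 - a) * x

def numeric_grad_w(w, b, x, y, eps=1e-6):
    """Central finite-difference estimate of dC_x/dw."""
    return (cost(w + eps, b, x, y) - cost(w - eps, b, x, y)) / (2 * eps)

w, b, x, y = 0.7, -0.3, 1.5, 1.0
print(abs(analytic_grad_w(w, b, x, y) - numeric_grad_w(w, b, x, y)) < 1e-8)
# True: the two estimates agree to high precision
```

The same trick, applied coordinate by coordinate, is a useful debugging tool for full backpropagation implementations.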

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory: all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning


algorithm for a feedforward neural network. Gradients are calculated
using backpropagation. Note that I have focused on making the code
simple, easily readable, and easily modifiable. It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

http://neuralnetworksanddeeplearning.com/chap1.html 35/47
1/10/2017 Neural networks and deep learning
# Third-party libraries
import numpy as np

class Network(object):

def __init__(self, sizes):


"""The list ``sizes`` contains the number of neurons in the
respective layers of the network. For example, if the list
was [2, 3, 1] then it would be a three-layer network, with the
first layer containing 2 neurons, the second layer 3 neurons,
and the third layer 1 neuron. The biases and weights for the
network are initialized randomly, using a Gaussian
distribution with mean 0, and variance 1. Note that the first
layer is assumed to be an input layer, and by convention we
won't set any biases for those neurons, since biases are only
ever used in computing the outputs from later layers."""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]

def feedforward(self, a):


"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a)+b)
return a

def SGD(self, training_data, epochs, mini_batch_size, eta,


test_data=None):
"""Train the neural network using mini-batch stochastic
gradient descent. The ``training_data`` is a list of tuples
``(x, y)`` representing the training inputs and the desired
outputs. The other non-optional parameters are
self-explanatory. If ``test_data`` is provided then the
network will be evaluated against the test data after each
epoch, and partial progress printed out. This is useful for
tracking progress, but slows things down substantially."""
if test_data: n_test = len(test_data)
n = len(training_data)
for j in xrange(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k+mini_batch_size]
for k in xrange(0, n, mini_batch_size)]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print "Epoch {0}: {1} / {2}".format(
j, self.evaluate(test_data), n_test)
else:
print "Epoch {0} complete".format(j)

def update_mini_batch(self, mini_batch, eta):


"""Update the network's weights and biases by applying
gradient descent using backpropagation to a single mini batch.
The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
is the learning rate."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [w-(eta/len(mini_batch))*nw
for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/len(mini_batch))*nb
for b, nb in zip(self.biases, nabla_b)]

http://neuralnetworksanddeeplearning.com/chap1.html 36/47
1/10/2017 Neural networks and deep learning

def backprop(self, x, y):


"""Return a tuple ``(nabla_b, nabla_w)`` representing the
gradient for the cost function C_x. ``nabla_b`` and
``nabla_w`` are layer-by-layer lists of numpy arrays, similar
to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * \
sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on. It's a renumbering of the
# scheme in the book, used here to take advantage of the fact
# that Python can use negative indices in lists.
for l in xrange(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
return (nabla_b, nabla_w)

def evaluate(self, test_data):


"""Return the number of test inputs for which the neural
network outputs the correct result. Note that the neural
network's output is assumed to be the index of whichever
neuron in the final layer has the highest activation."""
test_results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)

def cost_derivative(self, output_activations, y):


"""Return the vector of partial derivatives \partial C_x /
\partial a for the output activations."""
return (output_activations-y)

#### Miscellaneous functions


def sigmoid(z):
"""The sigmoid function."""
return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z)*(1-sigmoid(z))
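Before running the full program on MNIST data, it can be reassuring to check the feedforward step in isolation. The following sketch (my own, in Python 3 syntax, assuming only numpy) sets up the same randomly initialized weights and biases as Network.__init__ for a tiny [2, 3, 1] network, and pushes a single input through:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)  # fixed seed so the run is repeatable
sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

a = np.array([[0.5], [0.8]])  # a single (2, 1) input column vector
for b, w in zip(biases, weights):
    a = sigmoid(np.dot(w, a) + b)

print(a.shape)               # (1, 1): one output neuron
print(0.0 < a[0, 0] < 1.0)   # True: sigmoid outputs lie in (0, 1)
```

Tracing the shapes is a good way to internalize the code: a (3, 2) weight matrix times a (2, 1) activation gives (3, 1), then a (1, 3) matrix collapses that to the final (1, 1) output.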

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute: for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about 95 percent (95.42 percent at its peak, "Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine-tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
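The push and pull of the learning rate is easy to see on a toy one-dimensional cost. In the sketch below (my own, in Python 3 syntax, not from the book's code) we minimize C(w) = w² by gradient descent at several values of η: a tiny η barely moves, a moderate η converges quickly, and too large an η makes things worse with every step:

```python
def descend(eta, steps=30, w=1.0):
    """Gradient descent on the cost C(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

# After 30 steps from w = 1.0:
# eta = 0.001 -> w still close to 1 (learning is painfully slow)
# eta = 0.1   -> w close to 0 (fast convergence)
# eta = 1.5   -> |w| huge (each step overshoots, and the cost diverges)
for eta in [0.001, 0.1, 1.5]:
    print(eta, abs(descend(eta)))
```

Real cost surfaces are far messier than a parabola, but the same three regimes (too slow, about right, divergent) are what the three MNIST runs above are showing.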


In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:
Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise

Try creating a network with just two layers (an input and an output layer, no hidden layer) with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings: it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data. For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
"""Return the MNIST data as a tuple containing the training data,
the validation data, and the test data.

The ``training_data`` is returned as a tuple with two entries.


The first entry contains the actual training images. This is a
numpy ndarray with 50,000 entries. Each entry is, in turn, a
numpy ndarray with 784 values, representing the 28 * 28 = 784
pixels in a single MNIST image.

The second entry in the ``training_data`` tuple is a numpy ndarray


containing 50,000 entries. Those entries are just the digit
values (0...9) for the corresponding images contained in the first
entry of the tuple.

The ``validation_data`` and ``test_data`` are similar, except


each contains only 10,000 images.

This is a nice data format, but for use in neural networks it's
helpful to modify the format of the ``training_data`` a little.
That's done in the wrapper function ``load_data_wrapper()``, see
below.
"""
f = gzip.open('../data/mnist.pkl.gz', 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
return (training_data, validation_data, test_data)

def load_data_wrapper():
"""Return a tuple containing ``(training_data, validation_data,
test_data)``. Based on ``load_data``, but the format is more
convenient for use in our implementation of neural networks.

In particular, ``training_data`` is a list containing 50,000

http://neuralnetworksanddeeplearning.com/chap1.html 41/47
1/10/2017 Neural networks and deep learning
2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray
containing the input image. ``y`` is a 10-dimensional
numpy.ndarray representing the unit vector corresponding to the
correct digit for ``x``.

``validation_data`` and ``test_data`` are lists containing 10,000


2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional
numpy.ndarry containing the input image, and ``y`` is the
corresponding classification, i.e., the digit values (integers)
corresponding to ``x``.

Obviously, this means we're using slightly different formats for


the training data and the validation / test data. These formats
turn out to be the most convenient for use in our neural network
code."""
tr_d, va_d, te_d = load_data()
training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
training_results = [vectorized_result(y) for y in tr_d[1]]
training_data = zip(training_inputs, training_results)
validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
validation_data = zip(validation_inputs, va_d[1])
test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
test_data = zip(test_inputs, te_d[1])
return (training_data, validation_data, test_data)

def vectorized_result(j):
"""Return a 10-dimensional unit vector with a 1.0 in the jth
position and zeroes elsewhere. This is used to convert a digit
(0...9) into a corresponding desired output from the neural
network."""
e = np.zeros((10, 1))
e[j] = 1.0
return e
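To see the shapes involved, here's what vectorized_result produces for the digit 3. This is a quick check of my own (the function body is copied from the listing above, run here with Python 3's print and assuming numpy):

```python
import numpy as np

def vectorized_result(j):
    """10-dimensional unit vector with a 1.0 in the jth position."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

v = vectorized_result(3)
print(v.shape)           # (10, 1)
print(v[3, 0], v.sum())  # 1.0 1.0: a single 1.0 at index 3, zeros elsewhere
```

This (10, 1) column vector is exactly the shape of the network's output layer, which is why the wrapper uses it as the desired output y during training.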

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, ..., 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code


up, so I won't explicitly write out the code (if you're interested it's in the GitHub repository). But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
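To give the flavor of that baseline without reproducing the repository code, here is a sketch of the average-darkness idea on made-up "images" (flat lists of pixel intensities). Everything here - the toy data and the helper names - is my own illustration, not the actual MNIST code:

```python
def darkness(image):
    """Total pixel intensity of an image (a list of values in [0, 1])."""
    return sum(image)

def train_darkness_model(training_data):
    """Average darkness for each digit, from (image, digit) pairs."""
    totals, counts = {}, {}
    for image, digit in training_data:
        totals[digit] = totals.get(digit, 0.0) + darkness(image)
        counts[digit] = counts.get(digit, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

def classify(image, avg_darkness):
    """Guess the digit whose average darkness is closest."""
    d = darkness(image)
    return min(avg_darkness, key=lambda digit: abs(avg_darkness[digit] - d))

# Toy data: pretend 1s blacken about 2 pixels and 8s about 6.
training_data = [([1.0, 1.0, 0.0, 0.0, 0.0, 0.0], 1),
                 ([0.9, 0.9, 0.2, 0.0, 0.0, 0.0], 1),
                 ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 8),
                 ([0.9, 0.9, 0.9, 0.9, 0.9, 0.9], 8)]
model = train_darkness_model(training_data)
print(classify([1.0, 0.8, 0.1, 0.0, 0.0, 0.0], model))  # 1
print(classify([0.9, 1.0, 0.9, 0.8, 1.0, 0.9], model))  # 8
```

On real MNIST data the classifier uses the same two steps (average darkness per digit, then nearest-average lookup), just with 784-pixel images and ten digit classes.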

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:


I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

    sophisticated algorithm ≤ simple learning algorithm + good training data.

Toward deep learning
While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team. Click on the images for more details.

We could attack this problem the same way we attacked handwriting recognition: by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.
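That kind of evidence-weighing is exactly what a single artificial neuron from earlier in the chapter does. As a toy illustration (entirely made up - the features, weights, and threshold are mine, not from any real face detector), here is a perceptron-style neuron combining the answers to the sub-questions:

```python
def looks_like_face(evidence, weights, bias):
    """Weigh yes/no answers (1/0) to sub-questions with a single neuron:
    output 1 ("face") if the weighted evidence clears the threshold."""
    total = sum(w * x for w, x in zip(weights, evidence)) + bias
    return 1 if total > 0 else 0

# Sub-questions: eye top-left? eye top-right? nose? mouth? hair?
weights = [2.0, 2.0, 1.5, 1.5, 0.5]  # hair gets the least weight
bias = -4.0                          # require several "yes" answers

print(looks_like_face([1, 1, 1, 1, 0], weights, bias))  # 1: face
print(looks_like_face([0, 0, 1, 0, 1], weights, bias))  # 0: not a face
```

Note that the low weight on hair already encodes one of the caveats discussed next: a bald face can still clear the threshold on the strength of the other features.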

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:


It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information as well ("Is the eyebrow in the top left, and above the iris?", that kind of thing) but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question (does this image show a face or not) into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure (two or more hidden layers) are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases (and thus, the hierarchy of concepts) from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained: people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015. Last update: Sun Jan 1 16:00:21 2017

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
