
UCL Tutorial on:

Deep Belief Nets

(An updated and extended version of my 2007 NIPS tutorial)

Geoffrey Hinton
Canadian Institute for Advanced Research
&
Department of Computer Science
University of Toronto
"&(edule for t(e Tutorial
*
2+00 , -+-0 Tutorial part .
*
-+-0 , -+/0 1uestions
*
-+/0 2 /+.0 Tea Brea3
*
/+.0 , 0+/0 Tutorial part 2
*
0+/0 , 4+00 1uestions
"ome t(in5s you 6ill learn in t(is tutorial
*
%o6 to learn multi2layer 5enerative models of unla7elled
data 7y learnin5 one layer of features at a time+
,
%o6 to add 8ar3ov 'andom 9ields in ea&( (idden layer+
*
%o6 to use 5enerative models to ma3e dis&riminative
trainin5 met(ods 6or3 mu&( 7etter for &lassifi&ation and
re5ression+
,
%o6 to extend t(is approa&( to $aussian !ro&esses and
(o6 to learn &omplex: domain2spe&ifi& 3ernels for a
$aussian !ro&ess+
*
%o6 to perform non2linear dimensionality redu&tion on very
lar5e datasets
,
%o6 to learn 7inary: lo62dimensional &odes and (o6 to
use t(em for very fast do&ument retrieval+
*
%o6 to learn multilayer 5enerative models of (i5(2
dimensional se;uential data+
A spectrum of machine learning tasks

Typical Statistics:
• Low-dimensional data (e.g. less than 100 dimensions)
• Lots of noise in the data
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
• The main problem is distinguishing true structure from noise.

Artificial Intelligence:
• High-dimensional data (e.g. more than 100 dimensions)
• The noise is not sufficient to obscure the structure in the data if we process it right.
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
• The main problem is figuring out a way to represent the complicated structure so that it can be learned.
Historical background: First generation neural networks
• Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
– There was a neat learning algorithm for adjusting the weights.
– But perceptrons are fundamentally limited in what they can learn to do.

[Figure: sketch of a typical perceptron from the 1960's — output units (e.g. class labels such as "Bomb" vs. "Toy"), fed by non-adaptive hand-coded features, fed by input units (e.g. pixels).]
"e&ond 5eneration neural net6or3s (<.=A0#
input ve&tor
(idden
layers
outputs
Ba&32propa5ate
error
si5nal to 5et
derivatives for
learnin5
Compare outputs 6it(
&orre&t ans6er to 5et
error si5nal
A temporary digression
• Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
– Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
• The feature computes how similar a test example is to that training example.
– Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
• But it's just a perceptron and has all the same limitations.
• In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• The learning time does not scale well.
– It is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima.
– These are often quite good, but for deep nets they are far from optimal.
Overcoming the limitations of back-propagation
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
– Adjust the weights to maximize the probability that a generative model would have produced the sensory input.
– Learn p(image) not p(label | image)
• If you want to do computer vision, first learn computer graphics.
• What kind of generative model should we learn?
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We get to observe some of the variables and we would like to solve two problems:
• The inference problem: Infer the states of the unobserved variables.
• The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: a directed net with stochastic hidden causes at the top and visible effects at the bottom.]

We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
"to&(asti& 7inary units
(Bernoulli varia7les#
*
T(ese (ave a state of .
or 0+
*
T(e pro7a7ility of
turnin5 on is determined
7y t(e 6ei5(ted input
from ot(er units (plus a
7ias#
0
0
.

+
= =
j
ji j i
i
w s b
s p
) exp( 1
) (
1
1

+
j
ji j i
w s b
) ( 1 =
i
s p
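To make this concrete, here is a minimal NumPy sketch of sampling a layer of stochastic binary units from the logistic of their total input. The function and variable names are illustrative, not from the tutorial.

    import numpy as np

    def sample_binary_units(states, weights, biases, rng):
        # p(s_i = 1) = logistic(b_i + sum_j s_j w_ji)
        total_input = biases + states @ weights
        p_on = 1.0 / (1.0 + np.exp(-total_input))
        # Turn each unit on with its own probability.
        return (rng.random(p_on.shape) < p_on).astype(float)

    rng = np.random.default_rng(0)
    s_parents = np.array([1.0, 0.0, 1.0])
    W = rng.standard_normal((3, 2))        # w_ji: 3 parents, 2 units
    print(sample_binary_units(s_parents, W, np.zeros(2), rng))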
Learning Deep Belief Nets
• It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
• It is hard to infer the posterior distribution over all possible configurations of hidden causes.
• It is hard to even get a sample from the posterior.
• So how can we learn deep belief nets that have millions of parameters?

[Figure: a directed net with stochastic hidden causes and visible effects.]
The learning rule for sigmoid belief nets
• Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
• For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.

$$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp\!\big(-\sum_j s_j w_{ji}\big)}$$

$$\Delta w_{ji} = \varepsilon \, s_j (s_i - p_i)$$

where ε is the learning rate, s_j is the sampled binary state of parent j, and s_i is the sampled binary state of unit i.
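A minimal NumPy sketch of this delta rule, assuming we already have sampled binary states for a unit's parents and for the unit itself (all names are illustrative):

    import numpy as np

    def sbn_weight_update(s_parents, s_children, W, lr=0.1):
        # p_i = logistic(sum_j s_j w_ji): the probability that the
        # sampled parent states would generate each child state.
        p = 1.0 / (1.0 + np.exp(-(s_parents @ W)))
        # delta w_ji = lr * s_j * (s_i - p_i)
        return W + lr * np.outer(s_parents, s_children - p)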
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake it reduces the probability that the house jumped because of a truck.

[Figure: two hidden causes, "truck hits house" and "earthquake", each with bias −10, both connected with weight +20 to the effect "house jumps", which has bias −20.]

Posterior over the two causes, given that the house jumped:
p(1,1) = .0001
p(1,0) = .4999
p(0,1) = .4999
p(0,0) = .0001
Why it is usually very hard to learn sigmoid belief nets one layer at a time
• To learn W, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically complicated because of "explaining away".
• Problem 2: The posterior depends on the prior as well as the likelihood.
– So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: a stack of hidden-variable layers above the data; W is the likelihood between the data and the first hidden layer, and everything above it defines the prior.]
"ome met(ods of learnin5
deep 7elief nets
*
8onte Carlo met(ods &an 7e used to sample
from t(e posterior+
,
But its painfully slo6 for lar5e: deep models+
*
n t(e .==0@s people developed variational
met(ods for learnin5 deep 7elief nets
,
T(ese only 5et approximate samples from t(e
posterior+
,
Nevet(eless: t(e learnin5 is still 5uaranteed to
improve a variational 7ound on t(e lo5
pro7a7ility of 5eneratin5 t(e o7served data+
The breakthrough that makes deep learning efficient
• To learn deep nets efficiently, we need to learn one layer of features at a time. This does not work well if we assume that the latent variables are independent in the prior:
– The latent variables are not independent in the posterior, so inference is hard for non-linear models.
– The learning tries to find independent causes using one hidden layer, which is not usually possible.
• We need a way of learning one layer at a time that takes into account the fact that we will be learning more hidden layers later.
– We solve this problem by using an undirected model.
Two types of generative neural network
• If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal, 1992).
• If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
– If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.
Restricted Boltzmann Machines
(Smolensky, 1986, called them "harmoniums")
• We restrict the connectivity to make learning easier.
– Only one layer of hidden units.
• We will deal with more layers later.
– No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
– This is a big advantage over directed belief nets.

[Figure: a bipartite net — a layer of hidden units j above a layer of visible units i.]
The Energy of a joint configuration
(ignoring terms to do with biases)

$$E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$$

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_ij is the weight between units i and j. The energy of a configuration with v on the visible units and h on the hidden units has a very simple derivative:

$$-\frac{\partial E(v,h)}{\partial w_{ij}} = v_i h_j$$
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
– The energy is determined by the weights and biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$p(v,h) \propto e^{-E(v,h)}$$

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The denominator is the partition function.
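For a toy RBM these probabilities can be computed exactly by enumerating every joint configuration. A brute-force NumPy sketch (ignoring biases, as above; feasible only for a handful of units):

    import numpy as np
    from itertools import product

    def rbm_joint_probs(W):
        nv, nh = W.shape
        configs, energies = [], []
        for v in product([0, 1], repeat=nv):
            for h in product([0, 1], repeat=nh):
                configs.append((v, h))
                energies.append(-np.array(v) @ W @ np.array(h))
        unnorm = np.exp(-np.array(energies))   # e^{-E(v,h)}
        Z = unnorm.sum()                       # the partition function
        return configs, unnorm / Z             # p(v,h)

    configs, probs = rbm_joint_probs(np.array([[1.0, -1.0], [0.5, 2.0]]))
    print(probs.sum())                         # 1.0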
A picture of the maximum likelihood learning algorithm for an RBM

[Figure: a Markov chain that alternates between updating all the hidden units j in parallel and all the visible units i in parallel, starting from a training vector at t = 0 and running to t = ∞, where it produces a "fantasy". The statistics ⟨v_i h_j⟩ are measured at t = 0 and at t = ∞.]

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
A quick way to learn an RBM

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1\big)$$

Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).
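Here is a minimal NumPy sketch of this CD-1 update for a biasless RBM on a minibatch; it illustrates the slide's recipe and is not the tutorial's own code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, lr, rng):
        # Positive phase: sample hidden states driven by the data.
        h0 = (rng.random((v0.shape[0], W.shape[1]))
              < sigmoid(v0 @ W)).astype(float)
        # One alternating Gibbs step: reconstruct, then re-infer.
        v1 = sigmoid(h0 @ W.T)          # mean-field reconstruction
        h1 = sigmoid(v1 @ W)
        # lr * (<v_i h_j>^0 - <v_i h_j>^1), averaged over the minibatch.
        return W + lr * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]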
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality), increment weights between an active pixel and an active feature. On the reconstruction (better than reality), decrement weights between an active pixel and an active feature.]

The final 50 x 256 weights: each neuron grabs a different feature.

[Figure: pairs of images — Data, and the Reconstruction from activated binary features.]
How well can we reconstruct the digit images from the binary feature activations?

[Figure: reconstructions of new test images from the digit class that the model was trained on, and of images from an unfamiliar digit class. The network tries to see every image as a 2.]
Three ways to combine probability density models (an underlying theme of the tutorial)
• Mixture: Take a weighted average of the distributions.
– It can never be sharper than the individual distributions. It's a very weak way to combine models.
• Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
– Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
• Composition: Use the values of the latent variables of one model as the data for the next model.
– Works well for learning multiple layers of representation, but only if the individual models are undirected.
Training a deep network
(the main reason RBM's are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
– The proof is slightly complicated.
– But it is based on a neat equivalence between an RBM and a deep directed model (described later).
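The greedy recipe is easy to state in code. A sketch reusing `sigmoid` and `cd1_update` from the CD-1 sketch above (layer sizes and epoch counts are placeholders):

    def train_stack(data, layer_sizes, epochs, lr, rng):
        weights, layer_input = [], data
        for n_hidden in layer_sizes:
            W = 0.01 * rng.standard_normal((layer_input.shape[1], n_hidden))
            for _ in range(epochs):
                W = cd1_update(layer_input, W, lr, rng)
            # Treat the feature activations as data for the next RBM.
            layer_input = sigmoid(layer_input @ W)
            weights.append(W)
        return weights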
The generative model after learning 3 layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: layers data, h1, h2, h3. The top two layers h2 and h3 form an RBM with weights W3; directed generative weights W2 and W1 map h2 → h1 → data.]
Why does greedy learning work?

An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
– In an RBM, the posterior over the hidden units is factorial for each visible vector.
– But the aggregated posterior over all training cases is not factorial (even if the data was generated by the RBM itself).
Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
– Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
– Task 2: Learn to model the aggregated posterior distribution over the hidden units.
– The RBM does a good job of task 1 and a moderately good job of task 2.
• Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

[Figure: the data distribution on the visible units is mapped to the aggregated posterior distribution on the hidden units. Task 2 is modeling p(h|W); Task 1 is modeling p(v|h,W).]
Why does greedy learning work?

The weights, W, in the bottom level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as

$$p(v) = \sum_h p(h)\, p(v \mid h)$$

If we leave p(v|h) alone and improve p(h), we will improve p(v).
To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.
Which distributions are factorial in a directed belief net?
• In a directed belief net with one hidden layer, the posterior over the hidden units p(h|v) is non-factorial (due to explaining away).
– The aggregated posterior is factorial if the data was generated by the directed model.
• It's the opposite way round from an undirected model, which has factorial posteriors and a non-factorial prior p(h) over the hiddens.
• The intuitions that people have from using directed models are very misleading for undirected models.
Why does greedy learning fail in a directed module?
• A directed module also converts its data distribution into an aggregated posterior.
– Task 1: The learning is now harder because the posterior for each training case is non-factorial.
– Task 2 is performed using an independent prior. This is a very bad approximation unless the aggregated posterior is close to factorial.
• A directed module attempts to make the aggregated posterior factorial in one step.
– This is too difficult and leads to a bad compromise. There is also no guarantee that the aggregated posterior is easier to model than the data distribution.

[Figure: the data distribution on the visible units is mapped to the aggregated posterior distribution on the hidden units. Task 2 is modeling p(h|W2); Task 1 is modeling p(v|h,W1).]
A model of digit recognition

[Figure: 28 x 28 pixel image → 500 neurons → 500 neurons; the second 500-neuron layer, together with 10 label neurons, feeds 2000 top-level neurons. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.]

The model learns to generate combinations of labels and images. To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
– Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top level RBM.
– Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
– Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
"(o6 t(e movie of t(e net6or3
5eneratin5 di5its
(availa7le at 666+&s+torontoO<(inton#
"amples 5enerated 7y lettin5 t(e asso&iative
memory run 6it( one la7el &lamped+ T(ere are
.000 iterations of alternatin5 $i77s samplin5
7et6een samples+
Examples of correctly recognized handwritten digits that the neural network had never seen before.
It's very good.
How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
• Generative model based on RBM's: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): ~1.6%
• Backprop with 500 --> 300 hiddens: ~1.6%
• K-Nearest Neighbor: ~3.3%
• See Le Cun et al. 1998 for more results.
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
Unsupervised "pre-training" also helps for models that have more data and better priors
• Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
• They also used convolutional multilayer neural networks that have some built-in, local translational invariance.

Back-propagation alone: 0.49%
Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (record)
Another view of why layer-by-layer learning works (Hinton, Osindero & Teh, 2006)
• There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
– This equivalence also gives insight into why contrastive divergence learning works.
An infinite sigmoid belief net that is equivalent to an RBM
• The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W.
– A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
– So this infinite directed net defines the same distribution as an RBM.

[Figure: an infinite stack of layers v0, h0, v1, h1, v2, h2, ... connected by replicated weights W (downwards) and W^T (upwards), etc.]
Inference in a directed net with replicated weights
• The variables in h0 are conditionally independent given v0.
– Inference is trivial. We just multiply v0 by W transpose.
– The model above h0 implements a complementary prior.
– Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
• Inference in the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

[Figure: the same infinite directed net (v0, h0, v1, h1, v2, h2, ... with weights W and W^T), with the inferred sample at each layer marked with a +.]
• The learning rule for a sigmoid belief net is:

$$\Delta w_{ij} \propto s_j (s_i - \hat{s}_i)$$

• With replicated weights this becomes:

$$s_j^0 (s_i^0 - s_i^1) \;+\; s_i^1 (s_j^0 - s_j^1) \;+\; s_j^1 (s_i^1 - s_i^2) \;+\; \cdots \;-\; s_j^\infty s_i^\infty$$

All of the intermediate terms cancel in pairs, leaving only the first and last terms.

[Figure: the infinite directed net with sampled states s_i^0, s_j^0, s_i^1, s_j^1, s_i^2, s_j^2, ... at successive layers, connected by W and W^T.]
Learning a deep directed network
• First learn with all the weights tied.
– This is exactly equivalent to learning an RBM.
– Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.

[Figure: the infinite net with all weights tied, which collapses to a single RBM between v0 and h0.]
• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
– This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

[Figure: the first layer of weights W is frozen in both directions; the layers above h0 still share tied weights.]
How many layers should we use and how wide should they be?
• There is no simple answer.
– Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
– Results are fairly robust against changes in the size of a layer, but the top layer should be big.
• Deep belief nets give their creator a lot of freedom.
– The best way to use that freedom depends on the task.
– With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).
What happens when the weights in higher layers become different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
– So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
– Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
– This improves the network's model of the data.
• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.
An improved version of Contrastive Divergence learning (if time permits)
• The main worry with CD is that there will be deep minima of the energy function far away from the data.
– To find these we need to run the Markov chain for a long time (maybe thousands of steps).
– But we cannot afford to run the chain for too long for each update of the weights.
• Maybe we can run the same Markov chain over many weight updates? (Neal, 1992)
– If the learning rate is very small, this should be equivalent to running the chain for many steps and then doing a bigger weight update.
Persistent CD
(Tijmen Tieleman, ICML 2008 & 2009)
• Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
• After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
– So the fantasies can get far from the data.
Contrastive divergence as an adversarial game
• Why does persistent CD work so well with only 100 negative examples to characterize the whole partition function?
– For all interesting problems the partition function is highly multi-modal.
– How does it manage to find all the modes without starting at the data?
The learning causes very fast mixing
• The learning interacts with the Markov chain.
• Persistent Contrastive Divergence cannot be analysed by viewing the learning as an outer loop.
– Wherever the fantasies outnumber the positive data, the free-energy surface is raised. This makes the fantasies rush around hyperactively.
How persistent CD moves between the modes of the model's distribution
• If a mode has more fantasy particles than data, the free-energy surface is raised until the fantasy particles escape.
– This can overcome free-energy barriers that would be too high for the Markov chain to jump.
• The free-energy surface is being changed to help mixing in addition to defining the model.
"ummary so far
*
'estri&ted Bolt>mann 8a&(ines provide a simple 6ay to
learn a layer of features 6it(out any supervision+
,
8aximum li3eli(ood learnin5 is &omputationally
expensive 7e&ause of t(e normali>ation term: 7ut
&ontrastive diver5en&e learnin5 is fast and usually
6or3s 6ell+
*
8any layers of representation &an 7e learned 7y treatin5
t(e (idden states of one 'B8 as t(e visi7le data for
trainin5 t(e next 'B8 (a &omposition of experts#+
*
T(is &reates 5ood 5enerative models t(at &an t(en 7e
fine2tuned+
,
Contrastive 6a3e2sleep &an fine2tune 5eneration+
BREAK
Overview of the rest of the tutorial
• How to fine-tune a greedily trained generative model to be better at discrimination.
• How to learn a kernel for a Gaussian process.
• How to use deep belief nets for non-linear dimensionality reduction and document retrieval.
• How to learn a generative hierarchy of conditional random fields.
• A more advanced learning module for deep belief nets that contains multiplicative interactions.
• How to learn deep models of sequential data.
Fine-tuning for discrimination
• First learn one layer at a time greedily.
• Then treat this as "pre-training" that finds a good initial set of weights which can be fine-tuned by a local search procedure.
– Contrastive wake-sleep is one way of fine-tuning the model to be better at generation.
• Backpropagation can be used to fine-tune the model for better discrimination.
– This overcomes many of the limitations of standard backpropagation.
Why backpropagation works better with greedy pre-training: The optimization view
• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we already have sensible feature detectors that should already be very helpful for the discrimination task.
– So the initial gradients are sensible and backprop only needs to perform a local search from a sensible starting point.
Why backpropagation works better with greedy pre-training: The overfitting view
• Most of the information in the final weights comes from modeling the distribution of input vectors.
– The input vectors generally contain a lot more information than the labels.
– The precious information in the labels is only used for the final fine-tuning.
– The fine-tuning only modifies the features slightly to get the category boundaries right. It does not need to discover features.
• This type of backpropagation works well even if most of the training data is unlabeled.
– The unlabeled data is still very useful for discovering good features.
First, model the distribution of digit images

[Figure: 28 x 28 pixel image → 500 units → 500 units → 2000 units. The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits.]

The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes.

But do the hidden features really help with digit discrimination?
Add 10 softmaxed units to the top and do backpropagation.
Results on the permutation-invariant MNIST task
• Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
• SVM (Decoste & Schoelkopf, 2002): 1.4%
• Generative model of joint density of images and labels (+ generative fine-tuning): 1.25%
• Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%
Learning Dynamics of Deep Nets
(the next 4 slides describe work by Yoshua Bengio's group)

[Figure: Effect of Unsupervised Pre-training — test error before and after fine-tuning. Erhan et al., AISTATS 2009.]

[Figure: Effect of Depth — test error with and without pre-training as the number of layers grows. Erhan et al., AISTATS 2009.]
Learning Trajectories in Function Space
(a 2-D visualization produced with t-SNE)
• Each point is a model in function space.
• Color = epoch.
• Top: trajectories without pre-training. Each trajectory converges to a different local min.
• Bottom: trajectories with pre-training.
• No overlap!

Erhan et al., AISTATS 2009
Why unsupervised pre-training makes sense

[Figure, left: the image directly generates the label. If image-label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?]

[Figure, right: hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway. If image-label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.]
Modeling real-valued data
• For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
– We can treat intermediate values as the probability that the pixel is inked.
• This will not work for real images.
– In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
– Mean-field logistic units cannot represent precise intermediate values.
Replacing binary variables by integer-valued variables
(Teh and Hinton, 2001)
• One way to model an integer-valued variable is to make N identical copies of a binary unit.
• All copies have the same probability of being "on": p = logistic(x).
– The total number of "on" copies is like the firing rate of a neuron.
– It has a binomial distribution with mean N p and variance N p(1−p).
A better way to implement integer values
• Make many copies of a binary unit.
• All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

b − 0.5, b − 1.5, b − 2.5, b − 3.5, ....
A fast approximation
• Contrastive divergence learning works well for the sum of binary units with offset biases.
• It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

$$\sum_{n=1}^{\infty} \operatorname{logistic}(x + 0.5 - n) \;\approx\; \log(1 + e^x)$$

output = max(0, x + randn * sqrt(logistic(x)))
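The slide's sampling rule for a noisy rectified linear unit is one line of NumPy (a sketch; `sigmoid` and `np` as in the earlier sketches):

    def sample_nrelu(x, rng):
        # max(0, x + Gaussian noise whose variance is logistic(x))
        return np.maximum(0.0, x + rng.standard_normal(x.shape)
                                   * np.sqrt(sigmoid(x)))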
How to train a bipartite network of rectified linear units
• Just use contrastive divergence to lower the energy of data and raise the energy of nearby configurations that the model prefers to the data:

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}\big)$$

Start with a training vector on the visible units.
Update all hidden units in parallel with sampling noise.
Update the visible units in parallel to get a "reconstruction".
Update the hidden units again.
3D Object Recognition: The NORB dataset

Stereo-pairs of grayscale images of toy objects.
• 6 lighting conditions, 162 viewpoints
• Five object instances per class in the training set
• A different set of five instances per class in the test set
• 24,300 training cases, 24,300 test cases

The five classes: Animals, Humans, Planes, Trucks, Cars.

[Figure: examples from the normalized-uniform version of NORB.]
"implifyin5 t(e data
*
Ha&( trainin5 &ase is a stereo2pair of =4x=4 ima5es+
,
T(e o7?e&t is &entered+
,
T(e ed5es of t(e ima5e are mainly 7lan3+
,
T(e 7a&35round is uniform and 7ri5(t+
*
To ma3e learnin5 faster used simplified t(e data:
,
T(ro6 a6ay one ima5e+
,
Enly use t(e middle 4/x4/ pixels of t(e ot(er
ima5e+
,
Do6nsample to -2x-2 7y avera5in5 / pixels+
"implifyin5 t(e data even more so t(at it &an
7e modeled 7y re&tified linear units
*
T(e intensity (isto5ram for ea&( -2x-2 ima5e (as a
s(arp pea3 for t(e 7ri5(t 7a&35round+
*
9ind t(is pea3 and &all it >ero+
*
Call all intensities 7ri5(ter t(an t(e 7a&35round >ero+
*
8easure intensities do6n6ards from t(e 7a&35round
intensity+
0
Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units

Full NORB (2 images of 96x96):
• Logistic regression on the raw pixels: 20.5%
• Gaussian SVM (trained by Leon Bottou): 11.6%
• Convolutional neural net (Le Cun's group): 6.0%
(convolutional nets have knowledge of translations built in)

Reduced NORB (1 image 32x32):
• Logistic regression on the raw pixels: 30.2%
• Logistic regression on first hidden layer: 14.9%
• Logistic regression on second hidden layer: 10.2%
[Figure: the receptive fields of some rectified linear hidden units.]
A standard type of real-valued visible unit
• We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

$$E(v,h) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}$$

[Figure: the quadratic term is a parabolic containment function centred at b_i; the total top-down input to a visible unit produces a linear energy-gradient that shifts the mean.]

Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
[Figure: a random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.]
Combining deep belief nets with Gaussian processes
• Deep belief nets can benefit a lot from unlabeled data when labeled data is scarce.
– They just use the labeled data for fine-tuning.
• Kernel methods, like Gaussian processes, work well on small labeled training sets but are slow for large training sets.
• So when there is a lot of unlabeled data and only a little labeled data, combine the two approaches:
– First learn a deep belief net without using the labels.
– Then apply a Gaussian process model to the deepest layer of features. This works better than using the raw data.
– Then use GP's to get the derivatives that are back-propagated through the deep belief net. This is a further win. It allows GP's to fine-tune complicated domain-specific kernels.
Learning to extract the orientation of a face patch
(Salakhutdinov & Hinton, NIPS 2007)

The training and test sets for predicting face orientation: 11,000 unlabeled cases; 100, 500, or 1000 labeled cases; the test face patches come from new people.
The root mean squared error in the orientation when combining GP's with deep belief nets

                GP on      GP on        GP on top-level
                the        top-level    features with
                pixels     features     fine-tuning
  100 labels    22.2       17.9         15.2
  500 labels    17.2       12.7          7.2
  1000 labels   16.3       11.2          6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
Deep Autoencoders
(Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBM's.
– Then "unroll" them.
– Then fine-tune with backprop.

[Figure: encoder 28x28 → 1000 neurons (W1) → 500 neurons (W2) → 250 neurons (W3) → 30 linear units (W4); the decoder mirrors it with the transposed weights W4^T ... W1^T back to 28x28.]
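A sketch of the "unrolling" step, assuming the four encoder weight matrices have already been trained as RBM's (names illustrative; `sigmoid` as in the earlier sketches):

    def unroll(rbm_weights):
        # Encoder uses the RBM weights; decoder starts as their transposes.
        encoder = list(rbm_weights)                           # W1 .. W4
        decoder = [W.T.copy() for W in reversed(rbm_weights)]
        return encoder, decoder       # then fine-tune all copies with backprop

    def encode(v, encoder):
        # Bottom-up pass; the 30-D code layer is kept linear.
        for W in encoder[:-1]:
            v = sigmoid(v @ W)
        return v @ encoder[-1]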
A comparison of methods for compressing digit images to 30 real numbers

[Figure: rows of digit images — real data, 30-D deep autoencoder, 30-D logistic PCA, 30-D PCA.]
Retrieving documents that are similar to a query document
• We can use an autoencoder to find low-dimensional codes for documents that allow fast and accurate retrieval of similar documents from a large set.
• We start by converting each document into a "bag of words". This is a 2000-dimensional vector that contains the counts for each of the 2000 commonest words.
How to compress the count vector
• We train the neural network to reproduce its input vector as its output.
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.

[Figure: input vector of 2000 word counts → 500 neurons → 250 neurons → 10 → 250 neurons → 500 neurons → output vector of 2000 reconstructed counts.]
Performance of the autoencoder at document retrieval
• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBM's. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.

[Figure: proportion of retrieved documents in the same class as the query vs. number of documents retrieved.]
[Figure: all documents compressed to 2 numbers using a type of PCA, with different colors for different document categories.]

[Figure: all documents compressed to 2 numbers (with the deep autoencoder), with different colors for different document categories.]
Finding binary codes for documents
• Train an auto-encoder using 30 logistic units for the code layer.
• During the fine-tuning stage, add noise to the inputs to the code units.
– The "noise" vector for each training case is fixed. So we still get a deterministic gradient.
– The noise forces their activities to become bimodal in order to resist the effects of the noise.
– Then we simply round the activities of the 30 code units to 1 or 0.

[Figure: 2000 word counts → 500 neurons → 250 neurons → 30-unit code layer (plus noise) → 250 neurons → 500 neurons → 2000 reconstructed counts.]
"emanti& (as(in5: Usin5 a deep autoen&oder as a
(as(2fun&tion for findin5 approximate mat&(es
("ala3(utdinov ) %inton: 2007#
(as(
fun&tion
Ksupermar3et sear&(L
How good is a shortlist found this way?
• We have only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong?
– A 20-D hypercube allows us to capture enough of the similarity structure of our document set.
• The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
– Locality sensitive hashing (the fastest other method) is 50 times slower and has worse precision-recall curves.
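A brute-force sketch of the shortlist lookup. In the real method the 20-bit code is used directly as a memory address, so no distances need computing; here Hamming distances stand in for the search of nearby addresses:

    import numpy as np

    def shortlist(query_code, all_codes, radius=2):
        # all_codes: (n_docs, 20) array of 0/1 code bits.
        dists = (all_codes != query_code).sum(axis=1)   # Hamming distance
        return np.flatnonzero(dists <= radius)          # candidate documents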
Generating the parts of an object
• One way to maintain the constraints between the parts is to generate each part very accurately.
– But this would require a lot of communication bandwidth.
• Sloppy top-down specification of the parts is less demanding
– but it messes up relationships between features
– so use redundant features and use lateral interactions to clean up the mess.
• Each transformed feature helps to locate the others.
– This allows a noisy channel.

[Figure: sloppy top-down activation of parts from pose parameters for a "square", followed by clean-up using known interactions between features with top-down support. It's like soldiers on a parade ground.]
"emi2restri&ted Bolt>mann 8a&(ines
*
Ce restri&t t(e &onne&tivity to ma3e
learnin5 easier+
*
Contrastive diver5en&e learnin5 re;uires
t(e (idden units to 7e in &onditional
e;uili7rium 6it( t(e visi7les+
,
But it does not re;uire t(e visi7le units
to 7e in &onditional e;uili7rium 6it( t(e
(iddens+
,
All 6e re;uire is t(at t(e visi7le units
are &loser to e;uili7rium in t(e
re&onstru&tions t(an in t(e data+
*
"o 6e &an allo6 &onne&tions 7et6een
t(e visi7les+
(idden
i
?
visi7le
Learning a semi-restricted Boltzmann Machine
1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel.
3. Repeatedly update all of the visible units in parallel using mean-field updates (with the hiddens fixed) to get a "reconstruction".
4. Update all of the hidden units again.

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1\big)$$

$$\Delta l_{ik} = \varepsilon \big(\langle v_i v_k \rangle^0 - \langle v_i v_k \rangle^1\big) \quad \text{(update for a lateral weight)}$$
Learning in Semi-restricted Boltzmann Machines
• Method 1: To form a reconstruction, cycle through the visible units updating each in turn using the top-down input from the hiddens plus the lateral input from the other visibles.
• Method 2: Use "mean field" visible units that have real values. Update them all in parallel.
– Use damping to prevent oscillations:

$$p_i^{t+1} = \lambda\, p_i^{t} + (1 - \lambda)\, \sigma(x_i)$$

where λ is the damping and x_i is the total input to unit i.
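A sketch of Method 2 in NumPy (`sigmoid` as before): parallel damped mean-field updates of the visibles, given fixed hidden states, visible-hidden weights `W`, and symmetric lateral weights `L` with a zero diagonal (all names illustrative):

    def damped_mean_field(v0, h, W, L, damping=0.5, n_iters=10):
        p = v0.copy()
        for _ in range(n_iters):
            total_input = h @ W.T + p @ L     # top-down plus lateral input
            # p^{t+1} = damping * p^t + (1 - damping) * logistic(input)
            p = damping * p + (1.0 - damping) * sigmoid(total_input)
        return p                              # the "reconstruction"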
Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)
• Stack of RBM's learned one at a time.
• 400 Gaussian visible units that see whitened image patches.
– Derived from 100,000 Van Hateren image patches, each 20x20.
• The hidden units are all binary.
– The lateral connections are learned when they are the visible units of their RBM.
• Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
– The already decided states in the level above determine the effective biases during mean-field settling.

[Figure: 400 Gaussian units → Hidden MRF with 2000 units → Hidden MRF with 500 units → 1000 top-level units (no MRF). The lower connections are directed; the top pair of layers has undirected connections.]
[Figure: Without lateral connections — real data vs. samples from the model. With lateral connections — real data vs. samples from the model.]
A funny way to use an MRF
• The lateral connections form an MRF.
• The MRF is used during learning and generation.
• The MRF is not used for inference.
– This is a novel idea so vision researchers don't like it.
• The MRF enforces constraints. During inference, constraints do not need to be enforced because the data obeys them.
– The constraints only need to be enforced during generation.
• Unobserved hidden units cannot enforce constraints.
– To enforce constraints requires lateral connections or observed descendants.
Why do we whiten data?
• Images typically have strong pair-wise correlations.
• Learning higher-order statistics is difficult when there are strong pair-wise correlations.
– Small changes in parameter values that improve the modeling of higher-order statistics may be rejected because they form a slightly worse model of the much stronger pair-wise statistics.
• So we often remove the second-order statistics before trying to learn the higher-order statistics.
Whitening the learning signal instead of the data
• Contrastive divergence learning can remove the effects of the second-order statistics on the learning without actually changing the data.
– The lateral connections model the second-order statistics.
– If a pixel can be reconstructed correctly using second-order statistics, it will be the same in the reconstruction as in the data.
– The hidden units can then focus on modeling higher-order structure that cannot be predicted by the lateral connections.
• For example, a pixel close to an edge, where interpolation from nearby pixels causes incorrect smoothing.
Towards a more powerful, multi-linear stackable learning module
• So far, the states of the units in one layer have only been used to determine the effective biases of the units in the layer below.
• It would be much more powerful to modulate the pair-wise interactions in the layer below.
– A good way to design a hierarchical system is to allow each level to determine the objective function of the level below.
• To modulate pair-wise interactions we need higher-order Boltzmann machines.
Higher order Boltzmann machines
(Sejnowski, ~1986)
• The usual energy function is quadratic in the states:

$$E = \text{bias terms} - \sum_{i<j} s_i s_j w_{ij}$$

• But we could use higher-order interactions:

$$E = \text{bias terms} - \sum_{i<j<k} s_i s_j s_k w_{ijk}$$

• Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
– Units i and j can also be viewed as switches that control the pairwise interactions between j and k or between i and k.
Using higher-order Boltzmann machines to model image transformations
(the unfactored version)
• A global transformation specifies which pixel goes to which other pixel.
• Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.

[Figure: image(t) and image(t+1) both connected, via three-way interactions, to units representing the image transformation.]
Factoring three-way multiplicative interactions

Unfactored, with cubically many parameters:

$$E = -\sum_{i,j,h} s_i\, s_j\, s_h\, w_{ijh}$$

Factored, with linearly many parameters per factor:

$$E = -\sum_f \sum_{i,j,h} s_i\, s_j\, s_h\, w_{if}\, w_{jf}\, w_{hf}$$
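The saving is easy to see in code. A sketch of the factored energy, where `Wi`, `Wj`, `Wh` each have one column per factor (illustrative names):

    import numpy as np

    def factored_energy(s_i, s_j, s_h, Wi, Wj, Wh):
        # E = -sum_f (s_i . Wi[:, f]) (s_j . Wj[:, f]) (s_h . Wh[:, f]);
        # equivalent to the unfactored -sum_{ijh} s_i s_j s_h w_ijh with
        # w_ijh = sum_f w_if w_jf w_hf, but linear in the number of factors.
        return -np.sum((s_i @ Wi) * (s_j @ Wj) * (s_h @ Wh))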
A picture of the low-rank tensor contributed by factor f

[Figure: a cube of weights built from the three weight vectors w_if, w_jf, w_hf of factor f.]

Each layer is a scaled version of the same matrix. The basis matrix is specified as an outer product with typical term w_if w_jf. So each active hidden unit contributes a scalar, w_hf, times the matrix specified by factor f.
Inference with factored three-way multiplicative interactions

The energy contributed by factor f:

$$E_f = -\,s_h\, w_{hf} \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$

How changing the binary state of unit h changes the energy contributed by factor f — which is what unit h needs to know in order to do Gibbs sampling:

$$E_f(s_h{=}1) - E_f(s_h{=}0) = -\,w_{hf} \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$
Belief propagation

[Figure: a factor f connected to units i, j, h through weight vectors w_if, w_jf, w_hf.]

The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
Learning with factored three-way multiplicative interactions

The message from factor f to unit h:

$$m_f^h = \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$

$$\Delta w_{hf} \;\propto\; \big\langle s_h\, m_f^h \big\rangle_{\text{data}} \;-\; \big\langle s_h\, m_f^h \big\rangle_{\text{model}}$$

[Figure: Roland data.]
Modeling the correlational structure of a static image by using two copies of the image
• Each factor sends the squared output of a linear filter to the hidden units.
• It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.
• The standard model drops out of doing belief propagation for a factored third-order energy function.

[Figure: Copy 1 and Copy 2 of the image connected to units i and j through w_if and w_jf; factor f connects to hidden unit h through w_hf.]
An advantage of modeling correlations between pixels rather than pixels
• During generation, a "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
– This gives some translational invariance.
– It also gives a lot of invariance to brightness and contrast.
– So the "vertical edge" unit is like a complex cell.
• By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.
A principle of hierarchical systems
• Each level in the hierarchy should not try to micro-manage the level below.
• Instead, it should create an objective function for the level below and leave the level below to optimize it.
– This allows the fine details of the solution to be decided locally where the detailed information is available.
• Objective functions are a good way to do abstraction.
Time series models
• Inference is difficult in directed models of time series if we use non-linear distributed representations in the hidden units.
– It is hard to fit Dynamic Bayes Nets to high-dimensional sequences (e.g. motion capture data).
• So people tend to avoid distributed representations and use much weaker methods (e.g. HMM's).
Time series models
• If we really need distributed representations (which we nearly always do), we can make inference much simpler by using three tricks:
– Use an RBM for the interactions between hidden and visible variables. This ensures that the main source of information wants the posterior to be factorial.
– Model short-range temporal information by allowing several previous frames to provide input to the hidden units and to the visible units.
• This leads to a temporal module that can be stacked.
– So we can use greedy learning to learn deep models of temporal structure.
An application to modeling motion capture data
(Taylor, Roweis & Hinton, 2007)
• Human motion can be captured by placing reflective markers on the joints and then using lots of infrared cameras to track the 3-D positions of the markers.
• Given a skeletal model, the 3-D positions of the markers can be converted into the joint angles plus 6 parameters that describe the 3-D position and the roll, pitch and yaw of the pelvis.
– We only represent changes in yaw because physics doesn't care about its value and we want to avoid circular variables.
The conditional RBM model
(a partially observed CRF)
• Start with a generic RBM.
• Add two types of conditioning connections.
• Given the data, the hidden units at time t are conditionally independent.
• The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities (such as when the foot hits the ground).

[Figure: visible frames at t−2, t−1, t feed both the visible units v and the hidden units h at time t.]
Causal generation from a learned model
• Keep the previous visible states fixed.
– They provide a time-dependent bias for the hidden units.
• Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
– This picks new hidden and visible states that are compatible with each other and with the recent history.
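A sketch of one generated frame, assuming autoregressive weights `A` (history to visibles), conditioning weights `B` (history to hiddens), and RBM weights `W` (`sigmoid` as before; all names illustrative):

    def crbm_generate_step(history, v_t, W, A, B, rng, n_gibbs=30):
        # The fixed history supplies time-dependent biases.
        bias_v = history @ A
        bias_h = history @ B
        for _ in range(n_gibbs):
            h = (rng.random(bias_h.shape)
                 < sigmoid(v_t @ W + bias_h)).astype(float)
            v_t = sigmoid(h @ W.T + bias_v)   # mean-field visibles
        return v_t                            # the new frame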
Higher level models
• Once we have trained the model, we can add layers like in a Deep Belief Network.
• The previous layer CRBM is kept, and its output, while driven by the data, is treated as a new kind of "fully observed" data.
• The next level CRBM has the same architecture as the first (though we can alter the number of units it uses) and is trained the same way.
• Upper levels of the network model more "abstract" concepts.
• This greedy learning procedure can be justified using a variational bound.

[Figure: a second CRBM layer k stacked on top of the first layer's hidden units j, which are driven by the visible frames at t−2, t−1, t.]
Learning with "style" labels
• As in the generative model of handwritten digits (Hinton et al. 2006), style labels can be provided as part of the input to the top layer.
• The labels are represented by turning on one unit in a group of units, but they can also be blended.

[Figure: the stacked CRBM with a group of style-label units l feeding the top layer k, alongside the frames at t−2, t−1, t.]
"(o6 demo@s of multiple styles of
6al3in5
These can be foun at
www.cs.toronto.eu/!gwta"lor/
Readings on deep belief nets
A reading list (that is still being updated) can be found at
www.cs.toronto.edu/~hinton/deeprefs.html