
UCL Tutorial on:

Deep Belief Nets

(An updated and extended version of my 2007 NIPS tutorial)

Geoffrey Hinton
Canadian Institute for Advanced Research
&
Department of Computer Science
University of Toronto
"&(edule for t(e Tutorial
*
2+00 , -+-0 Tutorial part .
*
-+-0 , -+/0 1uestions
*
-+/0 2 /+.0 Tea Brea3
*
/+.0 , 0+/0 Tutorial part 2
*
0+/0 , 4+00 1uestions
"ome t(in5s you 6ill learn in t(is tutorial
*
%o6 to learn multi2layer 5enerative models of unla7elled
data 7y learnin5 one layer of features at a time+
,
%o6 to add 8ar3ov 'andom 9ields in ea&( (idden layer+
*
%o6 to use 5enerative models to ma3e dis&riminative
trainin5 met(ods 6or3 mu&( 7etter for &lassifi&ation and
re5ression+
,
%o6 to extend t(is approa&( to $aussian !ro&esses and
(o6 to learn &omplex: domain2spe&ifi& 3ernels for a
$aussian !ro&ess+
*
%o6 to perform non2linear dimensionality redu&tion on very
lar5e datasets
,
%o6 to learn 7inary: lo62dimensional &odes and (o6 to
use t(em for very fast do&ument retrieval+
*
%o6 to learn multilayer 5enerative models of (i5(2
dimensional se;uential data+
A spectrum of machine learning tasks

Typical Statistics:
• Low-dimensional data (e.g. less than 100 dimensions)
• Lots of noise in the data
• There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
• The main problem is distinguishing true structure from noise.

Artificial Intelligence:
• High-dimensional data (e.g. more than 100 dimensions)
• The noise is not sufficient to obscure the structure in the data if we process it right.
• There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
• The main problem is figuring out a way to represent the complicated structure so that it can be learned.
Historical background: First generation neural networks
• Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features.
– There was a neat learning algorithm for adjusting the weights.
– But perceptrons are fundamentally limited in what they can learn to do.

[Figure: sketch of a typical perceptron from the 1960's — output units (e.g. class labels such as "Bomb" vs. "Toy"), fed by non-adaptive hand-coded features, fed by input units (e.g. pixels).]
"e&ond 5eneration neural net6or3s (<.=A0#
input ve&tor
(idden
layers
outputs
Ba&32propa5ate
error
si5nal to 5et
derivatives for
learnin5
Compare outputs 6it(
&orre&t ans6er to 5et
error si5nal
A temporary digression
• Vapnik and his co-workers developed a very clever type of perceptron called a Support Vector Machine.
– Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe.
• The feature computes how similar a test example is to that training example.
– Then a clever optimization technique is used to select the best subset of the features and to decide how to weight each feature when classifying a test case.
• But it's just a perceptron and has all the same limitations.
• In the 1990's, many researchers abandoned neural networks with multiple adaptive hidden layers because Support Vector Machines worked better.
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• The learning time does not scale well.
– It is very slow in networks with multiple hidden layers.
• It can get stuck in poor local optima.
– These are often quite good, but for deep nets they are far from optimal.
Overcoming the limitations of back-propagation
• Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input.
– Adjust the weights to maximize the probability that a generative model would have produced the sensory input.
– Learn p(image) not p(label | image)
• If you want to do computer vision, first learn computer graphics.
• What kind of generative model should we learn?
Belief Nets
• A belief net is a directed acyclic graph composed of stochastic variables.
• We get to observe some of the variables and we would like to solve two problems:
• The inference problem: Infer the states of the unobserved variables.
• The learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.

[Figure: a directed net with stochastic hidden causes at the top and visible effects at the bottom.]

We will use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.
"to&(asti& 7inary units
(Bernoulli varia7les#
*
T(ese (ave a state of .
or 0+
*
T(e pro7a7ility of
turnin5 on is determined
7y t(e 6ei5(ted input
from ot(er units (plus a
7ias#
0
0
.

+
= =
j
ji j i
i
w s b
s p
) exp( 1
) (
1
1

+
j
ji j i
w s b
) ( 1 =
i
s p
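To make this concrete, here is a minimal NumPy sketch of sampling a layer of stochastic binary units from the logistic of their total input. The function and variable names are illustrative, not from the tutorial.

    import numpy as np

    def sample_binary_units(states, weights, biases, rng):
        # p(s_i = 1) = logistic(b_i + sum_j s_j w_ji)
        total_input = biases + states @ weights
        p_on = 1.0 / (1.0 + np.exp(-total_input))
        # Turn each unit on with its own probability.
        return (rng.random(p_on.shape) < p_on).astype(float)

    rng = np.random.default_rng(0)
    s_parents = np.array([1.0, 0.0, 1.0])
    W = rng.standard_normal((3, 2))        # w_ji: 3 parents, 2 units
    print(sample_binary_units(s_parents, W, np.zeros(2), rng))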
Learning Deep Belief Nets
• It is easy to generate an unbiased example at the leaf nodes, so we can see what kinds of data the network believes in.
• It is hard to infer the posterior distribution over all possible configurations of hidden causes.
• It is hard to even get a sample from the posterior.
• So how can we learn deep belief nets that have millions of parameters?

[Figure: a directed net with stochastic hidden causes and visible effects.]
The learning rule for sigmoid belief nets
• Learning is easy if we can get an unbiased sample from the posterior distribution over hidden states given the observed data.
• For each unit, maximize the log probability that its binary state in the sample from the posterior would be generated by the sampled binary states of its parents.

$$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp\!\big(-\sum_j s_j w_{ji}\big)}$$

$$\Delta w_{ji} = \varepsilon \, s_j (s_i - p_i)$$

where ε is the learning rate, s_j is the sampled binary state of parent j, and s_i is the sampled binary state of unit i.
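A minimal NumPy sketch of this delta rule, assuming we already have sampled binary states for a unit's parents and for the unit itself (all names are illustrative):

    import numpy as np

    def sbn_weight_update(s_parents, s_children, W, lr=0.1):
        # p_i = logistic(sum_j s_j w_ji): the probability that the
        # sampled parent states would generate each child state.
        p = 1.0 / (1.0 + np.exp(-(s_parents @ W)))
        # delta w_ji = lr * s_j * (s_i - p_i)
        return W + lr * np.outer(s_parents, s_children - p)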
Explaining away (Judea Pearl)
• Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake it reduces the probability that the house jumped because of a truck.

[Figure: two hidden causes, "truck hits house" and "earthquake", each with bias −10, both connected with weight +20 to the effect "house jumps", which has bias −20.]

Posterior over the two causes, given that the house jumped:
p(1,1) = .0001
p(1,0) = .4999
p(0,1) = .4999
p(0,0) = .0001
Why it is usually very hard to learn sigmoid belief nets one layer at a time
• To learn W, we need the posterior distribution in the first hidden layer.
• Problem 1: The posterior is typically complicated because of "explaining away".
• Problem 2: The posterior depends on the prior as well as the likelihood.
– So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
• Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!

[Figure: a stack of hidden-variable layers above the data; W is the likelihood between the data and the first hidden layer, and everything above it defines the prior.]
"ome met(ods of learnin5
deep 7elief nets
*
8onte Carlo met(ods &an 7e used to sample
from t(e posterior+
,
But its painfully slo6 for lar5e: deep models+
*
n t(e .==0@s people developed variational
met(ods for learnin5 deep 7elief nets
,
T(ese only 5et approximate samples from t(e
posterior+
,
Nevet(eless: t(e learnin5 is still 5uaranteed to
improve a variational 7ound on t(e lo5
pro7a7ility of 5eneratin5 t(e o7served data+
The breakthrough that makes deep learning efficient
• To learn deep nets efficiently, we need to learn one layer of features at a time. This does not work well if we assume that the latent variables are independent in the prior:
– The latent variables are not independent in the posterior, so inference is hard for non-linear models.
– The learning tries to find independent causes using one hidden layer, which is not usually possible.
• We need a way of learning one layer at a time that takes into account the fact that we will be learning more hidden layers later.
– We solve this problem by using an undirected model.
Two types of generative neural network
• If we connect binary stochastic neurons in a directed acyclic graph we get a Sigmoid Belief Net (Radford Neal, 1992).
• If we connect binary stochastic neurons using symmetric connections we get a Boltzmann Machine (Hinton & Sejnowski, 1983).
– If we restrict the connectivity in a special way, it is easy to learn a Boltzmann machine.
Restricted Boltzmann Machines
(Smolensky, 1986, called them "harmoniums")
• We restrict the connectivity to make learning easier.
– Only one layer of hidden units.
• We will deal with more layers later.
– No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
– This is a big advantage over directed belief nets.

[Figure: a bipartite net — a layer of hidden units j above a layer of visible units i.]
The Energy of a joint configuration
(ignoring terms to do with biases)

$$E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$$

where v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_ij is the weight between units i and j. The energy of a configuration with v on the visible units and h on the hidden units has a very simple derivative:

$$-\frac{\partial E(v,h)}{\partial w_{ij}} = v_i h_j$$
Weights → Energies → Probabilities
• Each possible joint configuration of the visible and hidden units has an energy.
– The energy is determined by the weights and biases (as in a Hopfield net).
• The energy of a joint configuration of the visible and hidden units determines its probability:

$$p(v,h) \propto e^{-E(v,h)}$$

• The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:

$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The denominator is the partition function.
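For a toy RBM these probabilities can be computed exactly by enumerating every joint configuration. A brute-force NumPy sketch (ignoring biases, as above; feasible only for a handful of units):

    import numpy as np
    from itertools import product

    def rbm_joint_probs(W):
        nv, nh = W.shape
        configs, energies = [], []
        for v in product([0, 1], repeat=nv):
            for h in product([0, 1], repeat=nh):
                configs.append((v, h))
                energies.append(-np.array(v) @ W @ np.array(h))
        unnorm = np.exp(-np.array(energies))   # e^{-E(v,h)}
        Z = unnorm.sum()                       # the partition function
        return configs, unnorm / Z             # p(v,h)

    configs, probs = rbm_joint_probs(np.array([[1.0, -1.0], [0.5, 2.0]]))
    print(probs.sum())                         # 1.0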
A picture of the maximum likelihood learning algorithm for an RBM

[Figure: a Markov chain that alternates between updating all the hidden units j in parallel and all the visible units i in parallel, starting from a training vector at t = 0 and running to t = ∞, where it produces a "fantasy". The statistics ⟨v_i h_j⟩ are measured at t = 0 and at t = ∞.]

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
A quick way to learn an RBM

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1\big)$$

Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function (Carreira-Perpinan & Hinton, 2005).
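Here is a minimal NumPy sketch of this CD-1 update for a biasless RBM on a minibatch; it illustrates the slide's recipe and is not the tutorial's own code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, lr, rng):
        # Positive phase: sample hidden states driven by the data.
        h0 = (rng.random((v0.shape[0], W.shape[1]))
              < sigmoid(v0 @ W)).astype(float)
        # One alternating Gibbs step: reconstruct, then re-infer.
        v1 = sigmoid(h0 @ W.T)          # mean-field reconstruction
        h1 = sigmoid(v1 @ W)
        # lr * (<v_i h_j>^0 - <v_i h_j>^1), averaged over the minibatch.
        return W + lr * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]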
How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality), increment weights between an active pixel and an active feature. On the reconstruction (better than reality), decrement weights between an active pixel and an active feature.]

The final 50 x 256 weights: each neuron grabs a different feature.

[Figure: pairs of images — Data, and the Reconstruction from activated binary features.]
How well can we reconstruct the digit images from the binary feature activations?

[Figure: reconstructions of new test images from the digit class that the model was trained on, and of images from an unfamiliar digit class. The network tries to see every image as a 2.]
Three ways to combine probability density models (an underlying theme of the tutorial)
• Mixture: Take a weighted average of the distributions.
– It can never be sharper than the individual distributions. It's a very weak way to combine models.
• Product: Multiply the distributions at each point and then renormalize (this is how an RBM combines the distributions defined by each hidden unit).
– Exponentially more powerful than a mixture. The normalization makes maximum likelihood learning difficult, but approximations allow us to learn anyway.
• Composition: Use the values of the latent variables of one model as the data for the next model.
– Works well for learning multiple layers of representation, but only if the individual models are undirected.
Training a deep network
(the main reason RBM's are interesting)
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data.
– The proof is slightly complicated.
– But it is based on a neat equivalence between an RBM and a deep directed model (described later).
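The greedy recipe is easy to state in code. A sketch reusing `sigmoid` and `cd1_update` from the CD-1 sketch above (layer sizes and epoch counts are placeholders):

    def train_stack(data, layer_sizes, epochs, lr, rng):
        weights, layer_input = [], data
        for n_hidden in layer_sizes:
            W = 0.01 * rng.standard_normal((layer_input.shape[1], n_hidden))
            for _ in range(epochs):
                W = cd1_update(layer_input, W, lr, rng)
            # Treat the feature activations as data for the next RBM.
            layer_input = sigmoid(layer_input @ W)
            weights.append(W)
        return weights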
The generative model after learning 3 layers

To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
2. Perform a top-down pass to get states for all the other layers.

So the lower level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: layers data, h1, h2, h3. The top two layers h2 and h3 form an RBM with weights W3; directed generative weights W2 and W1 map h2 → h1 → data.]
Why does greedy learning work?

An aside: Averaging factorial distributions
• If you average some factorial distributions, you do NOT get a factorial distribution.
– In an RBM, the posterior over the hidden units is factorial for each visible vector.
– But the aggregated posterior over all training cases is not factorial (even if the data was generated by the RBM itself).
Why does greedy learning work?
• Each RBM converts its data distribution into an aggregated posterior distribution over its hidden units.
• This divides the task of modeling its data into two tasks:
– Task 1: Learn generative weights that can convert the aggregated posterior distribution over the hidden units back into the data distribution.
– Task 2: Learn to model the aggregated posterior distribution over the hidden units.
– The RBM does a good job of task 1 and a moderately good job of task 2.
• Task 2 is easier (for the next RBM) than modeling the original data because the aggregated posterior distribution is closer to a distribution that an RBM can model perfectly.

[Figure: the data distribution on the visible units is mapped to the aggregated posterior distribution on the hidden units. Task 2 is modeling p(h|W); Task 1 is modeling p(v|h,W).]
Why does greedy learning work?

The weights, W, in the bottom level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as

$$p(v) = \sum_h p(h)\, p(v \mid h)$$

If we leave p(v|h) alone and improve p(h), we will improve p(v).
To improve p(h), we need it to be a better model of the aggregated posterior distribution over hidden vectors produced by applying W to the data.
Which distributions are factorial in a directed belief net?
• In a directed belief net with one hidden layer, the posterior over the hidden units p(h|v) is non-factorial (due to explaining away).
– The aggregated posterior is factorial if the data was generated by the directed model.
• It's the opposite way round from an undirected model, which has factorial posteriors and a non-factorial prior p(h) over the hiddens.
• The intuitions that people have from using directed models are very misleading for undirected models.
Why does greedy learning fail in a directed module?
• A directed module also converts its data distribution into an aggregated posterior.
– Task 1: The learning is now harder because the posterior for each training case is non-factorial.
– Task 2 is performed using an independent prior. This is a very bad approximation unless the aggregated posterior is close to factorial.
• A directed module attempts to make the aggregated posterior factorial in one step.
– This is too difficult and leads to a bad compromise. There is also no guarantee that the aggregated posterior is easier to model than the data distribution.

[Figure: the data distribution on the visible units is mapped to the aggregated posterior distribution on the hidden units. Task 2 is modeling p(h|W2); Task 1 is modeling p(v|h,W1).]
A model of digit recognition

[Figure: 28 x 28 pixel image → 500 neurons → 500 neurons; the second 500-neuron layer, together with 10 label neurons, feeds 2000 top-level neurons. The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.]

The model learns to generate combinations of labels and images. To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by a few iterations of the top-level associative memory.
Fine-tuning with a contrastive version of the "wake-sleep" algorithm

After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
– Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top level RBM.
– Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
– Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
"(o6 t(e movie of t(e net6or3
5eneratin5 di5its
(availa7le at 666+&s+torontoO<(inton#
"amples 5enerated 7y lettin5 t(e asso&iative
memory run 6it( one la7el &lamped+ T(ere are
.000 iterations of alternatin5 $i77s samplin5
7et6een samples+
Examples of correctly recognized handwritten digits that the neural network had never seen before.
It's very good.
How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
• Generative model based on RBM's: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): ~1.6%
• Backprop with 500 --> 300 hiddens: ~1.6%
• K-Nearest Neighbor: ~3.3%
• See Le Cun et al. 1998 for more results.
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
Unsupervised "pre-training" also helps for models that have more data and better priors
• Ranzato et al. (NIPS 2006) used an additional 600,000 distorted digits.
• They also used convolutional multilayer neural networks that have some built-in, local translational invariance.

Back-propagation alone: 0.49%
Unsupervised layer-by-layer pre-training followed by backprop: 0.39% (record)
Another view of why layer-by-layer learning works (Hinton, Osindero & Teh, 2006)
• There is an unexpected equivalence between RBM's and directed networks with many layers that all use the same weights.
– This equivalence also gives insight into why contrastive divergence learning works.
An infinite sigmoid belief net that is equivalent to an RBM
• The distribution generated by this infinite directed net with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions, p(v|h) and p(h|v), that are both defined by W.
– A top-down pass of the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
– So this infinite directed net defines the same distribution as an RBM.

[Figure: an infinite stack of layers v0, h0, v1, h1, v2, h2, ... connected by replicated weights W (downwards) and W^T (upwards), etc.]
Inference in a directed net with replicated weights
• The variables in h0 are conditionally independent given v0.
– Inference is trivial. We just multiply v0 by W transpose.
– The model above h0 implements a complementary prior.
– Multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
• Inference in the directed net is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.

[Figure: the same infinite directed net (v0, h0, v1, h1, v2, h2, ... with weights W and W^T), with the inferred sample at each layer marked with a +.]
• The learning rule for a sigmoid belief net is:

$$\Delta w_{ij} \propto s_j (s_i - \hat{s}_i)$$

• With replicated weights this becomes:

$$s_j^0 (s_i^0 - s_i^1) \;+\; s_i^1 (s_j^0 - s_j^1) \;+\; s_j^1 (s_i^1 - s_i^2) \;+\; \cdots \;-\; s_j^\infty s_i^\infty$$

All of the intermediate terms cancel in pairs, leaving only the first and last terms.

[Figure: the infinite directed net with sampled states s_i^0, s_j^0, s_i^1, s_j^1, s_i^2, s_j^2, ... at successive layers, connected by W and W^T.]
Learning a deep directed network
• First learn with all the weights tied.
– This is exactly equivalent to learning an RBM.
– Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers.

[Figure: the infinite net with all weights tied, which collapses to a single RBM between v0 and h0.]
• Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together).
– This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data.

[Figure: the first layer of weights W is frozen in both directions; the layers above h0 still share tied weights.]
How many layers should we use and how wide should they be?
• There is no simple answer.
– Extensive experiments by Yoshua Bengio's group (described later) suggest that several hidden layers is better than one.
– Results are fairly robust against changes in the size of a layer, but the top layer should be big.
• Deep belief nets give their creator a lot of freedom.
– The best way to use that freedom depends on the task.
– With enough narrow layers we can model any distribution over binary vectors (Sutskever & Hinton, 2007).
What happens when the weights in higher layers become different from the weights in the first layer?
• The higher layers no longer implement a complementary prior.
– So performing inference using the frozen weights in the first layer is no longer correct. But it's still pretty good.
– Using this incorrect inference procedure gives a variational lower bound on the log probability of the data.
• The higher layers learn a prior that is closer to the aggregated posterior distribution of the first hidden layer.
– This improves the network's model of the data.
• Hinton, Osindero and Teh (2006) prove that this improvement is always bigger than the loss in the variational bound caused by using less accurate inference.
An improved version of Contrastive Divergence learning (if time permits)
• The main worry with CD is that there will be deep minima of the energy function far away from the data.
– To find these we need to run the Markov chain for a long time (maybe thousands of steps).
– But we cannot afford to run the chain for too long for each update of the weights.
• Maybe we can run the same Markov chain over many weight updates? (Neal, 1992)
– If the learning rate is very small, this should be equivalent to running the chain for many steps and then doing a bigger weight update.
Persistent CD
(Tijmen Tieleman, ICML 2008 & 2009)
• Use minibatches of 100 cases to estimate the first term in the gradient. Use a single batch of 100 fantasies to estimate the second term in the gradient.
• After each weight update, generate the new fantasies from the previous fantasies by using one alternating Gibbs update.
– So the fantasies can get far from the data.
Contrastive divergence as an adversarial game
• Why does persistent CD work so well with only 100 negative examples to characterize the whole partition function?
– For all interesting problems the partition function is highly multi-modal.
– How does it manage to find all the modes without starting at the data?
The learning causes very fast mixing
• The learning interacts with the Markov chain.
• Persistent Contrastive Divergence cannot be analysed by viewing the learning as an outer loop.
– Wherever the fantasies outnumber the positive data, the free-energy surface is raised. This makes the fantasies rush around hyperactively.
How persistent CD moves between the modes of the model's distribution
• If a mode has more fantasy particles than data, the free-energy surface is raised until the fantasy particles escape.
– This can overcome free-energy barriers that would be too high for the Markov chain to jump.
• The free-energy surface is being changed to help mixing in addition to defining the model.
"ummary so far
*
'estri&ted Bolt>mann 8a&(ines provide a simple 6ay to
learn a layer of features 6it(out any supervision+
,
8aximum li3eli(ood learnin5 is &omputationally
expensive 7e&ause of t(e normali>ation term: 7ut
&ontrastive diver5en&e learnin5 is fast and usually
6or3s 6ell+
*
8any layers of representation &an 7e learned 7y treatin5
t(e (idden states of one 'B8 as t(e visi7le data for
trainin5 t(e next 'B8 (a &omposition of experts#+
*
T(is &reates 5ood 5enerative models t(at &an t(en 7e
fine2tuned+
,
Contrastive 6a3e2sleep &an fine2tune 5eneration+
BREAK
Overview of the rest of the tutorial
• How to fine-tune a greedily trained generative model to be better at discrimination.
• How to learn a kernel for a Gaussian process.
• How to use deep belief nets for non-linear dimensionality reduction and document retrieval.
• How to learn a generative hierarchy of conditional random fields.
• A more advanced learning module for deep belief nets that contains multiplicative interactions.
• How to learn deep models of sequential data.
Fine-tuning for discrimination
• First learn one layer at a time greedily.
• Then treat this as "pre-training" that finds a good initial set of weights which can be fine-tuned by a local search procedure.
– Contrastive wake-sleep is one way of fine-tuning the model to be better at generation.
• Backpropagation can be used to fine-tune the model for better discrimination.
– This overcomes many of the limitations of standard backpropagation.
Why backpropagation works better with greedy pre-training: The optimization view
• Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer.
• We do not start backpropagation until we already have sensible feature detectors that should already be very helpful for the discrimination task.
– So the initial gradients are sensible and backprop only needs to perform a local search from a sensible starting point.
Why backpropagation works better with greedy pre-training: The overfitting view
• Most of the information in the final weights comes from modeling the distribution of input vectors.
– The input vectors generally contain a lot more information than the labels.
– The precious information in the labels is only used for the final fine-tuning.
– The fine-tuning only modifies the features slightly to get the category boundaries right. It does not need to discover features.
• This type of backpropagation works well even if most of the training data is unlabeled.
– The unlabeled data is still very useful for discovering good features.
First, model the distribution of digit images

[Figure: 28 x 28 pixel image → 500 units → 500 units → 2000 units. The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low-dimensional manifolds of the digits.]

The network learns a density model for unlabeled digit images. When we generate from the model we get things that look like real digits of all classes.

But do the hidden features really help with digit discrimination?
Add 10 softmaxed units to the top and do backpropagation.
Results on the permutation-invariant MNIST task
• Very carefully trained backprop net with one or two hidden layers (Platt; Hinton): 1.6%
• SVM (Decoste & Schoelkopf, 2002): 1.4%
• Generative model of joint density of images and labels (+ generative fine-tuning): 1.25%
• Generative model of unlabelled digits followed by gentle backpropagation (Hinton & Salakhutdinov, Science 2006): 1.15%
Learning Dynamics of Deep Nets
(the next 4 slides describe work by Yoshua Bengio's group)

[Figure: Effect of Unsupervised Pre-training — test error before and after fine-tuning. Erhan et al., AISTATS 2009.]

[Figure: Effect of Depth — test error with and without pre-training as the number of layers grows. Erhan et al., AISTATS 2009.]
Learning Trajectories in Function Space
(a 2-D visualization produced with t-SNE)
• Each point is a model in function space.
• Color = epoch.
• Top: trajectories without pre-training. Each trajectory converges to a different local min.
• Bottom: trajectories with pre-training.
• No overlap!

Erhan et al., AISTATS 2009
Why unsupervised pre-training makes sense

[Figure, left: the image directly generates the label. If image-label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity?]

[Figure, right: hidden "stuff" generates the image through a high-bandwidth pathway and the label through a low-bandwidth pathway. If image-label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.]
Modeling real-valued data
• For images of digits it is possible to represent intermediate intensities as if they were probabilities by using "mean-field" logistic units.
– We can treat intermediate values as the probability that the pixel is inked.
• This will not work for real images.
– In a real image, the intensity of a pixel is almost always almost exactly the average of the neighboring pixels.
– Mean-field logistic units cannot represent precise intermediate values.
Replacing binary variables by integer-valued variables
(Teh and Hinton, 2001)
• One way to model an integer-valued variable is to make N identical copies of a binary unit.
• All copies have the same probability of being "on": p = logistic(x).
– The total number of "on" copies is like the firing rate of a neuron.
– It has a binomial distribution with mean N p and variance N p(1−p).
A better way to implement integer values
• Make many copies of a binary unit.
• All copies have the same weights and the same adaptive bias, b, but they have different fixed offsets to the bias:

b − 0.5, b − 1.5, b − 2.5, b − 3.5, ....
A fast approximation
• Contrastive divergence learning works well for the sum of binary units with offset biases.
• It also works for rectified linear units. These are much faster to compute than the sum of many logistic units:

$$\sum_{n=1}^{\infty} \operatorname{logistic}(x + 0.5 - n) \;\approx\; \log(1 + e^x)$$

output = max(0, x + randn * sqrt(logistic(x)))
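The slide's sampling rule for a noisy rectified linear unit is one line of NumPy (a sketch; `sigmoid` and `np` as in the earlier sketches):

    def sample_nrelu(x, rng):
        # max(0, x + Gaussian noise whose variance is logistic(x))
        return np.maximum(0.0, x + rng.standard_normal(x.shape)
                                   * np.sqrt(sigmoid(x)))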
How to train a bipartite network of rectified linear units
• Just use contrastive divergence to lower the energy of data and raise the energy of nearby configurations that the model prefers to the data:

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}\big)$$

Start with a training vector on the visible units.
Update all hidden units in parallel with sampling noise.
Update the visible units in parallel to get a "reconstruction".
Update the hidden units again.
3D Object Recognition: The NORB dataset

Stereo-pairs of grayscale images of toy objects.
• 6 lighting conditions, 162 viewpoints
• Five object instances per class in the training set
• A different set of five instances per class in the test set
• 24,300 training cases, 24,300 test cases

The five classes: Animals, Humans, Planes, Trucks, Cars.

[Figure: examples from the normalized-uniform version of NORB.]
"implifyin5 t(e data
*
Ha&( trainin5 &ase is a stereo2pair of =4x=4 ima5es+
,
T(e o7?e&t is &entered+
,
T(e ed5es of t(e ima5e are mainly 7lan3+
,
T(e 7a&35round is uniform and 7ri5(t+
*
To ma3e learnin5 faster used simplified t(e data:
,
T(ro6 a6ay one ima5e+
,
Enly use t(e middle 4/x4/ pixels of t(e ot(er
ima5e+
,
Do6nsample to -2x-2 7y avera5in5 / pixels+
"implifyin5 t(e data even more so t(at it &an
7e modeled 7y re&tified linear units
*
T(e intensity (isto5ram for ea&( -2x-2 ima5e (as a
s(arp pea3 for t(e 7ri5(t 7a&35round+
*
9ind t(is pea3 and &all it >ero+
*
Call all intensities 7ri5(ter t(an t(e 7a&35round >ero+
*
8easure intensities do6n6ards from t(e 7a&35round
intensity+
0
Test set error rates on NORB after greedy learning of one or two hidden layers using rectified linear units

Full NORB (2 images of 96x96):
• Logistic regression on the raw pixels: 20.5%
• Gaussian SVM (trained by Leon Bottou): 11.6%
• Convolutional neural net (Le Cun's group): 6.0%
(convolutional nets have knowledge of translations built in)

Reduced NORB (1 image 32x32):
• Logistic regression on the raw pixels: 30.2%
• Logistic regression on first hidden layer: 14.9%
• Logistic regression on second hidden layer: 10.2%
[Figure: the receptive fields of some rectified linear hidden units.]
A standard type of real-valued visible unit
• We can model pixels as Gaussian variables. Alternating Gibbs sampling is still easy, though learning needs to be much slower.

$$E(v,h) = \sum_{i \in \text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2} \;-\; \sum_{j \in \text{hid}} b_j h_j \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}$$

[Figure: the quadratic term is a parabolic containment function centred at b_i; the total top-down input to a visible unit produces a linear energy-gradient that shifts the mean.]

Welling et al. (2005) show how to extend RBM's to the exponential family. See also Bengio et al. (2007).
[Figure: a random sample of 10,000 binary filters learned by Alex Krizhevsky on a million 32x32 color images.]
Combining deep belief nets with Gaussian processes
• Deep belief nets can benefit a lot from unlabeled data when labeled data is scarce.
– They just use the labeled data for fine-tuning.
• Kernel methods, like Gaussian processes, work well on small labeled training sets but are slow for large training sets.
• So when there is a lot of unlabeled data and only a little labeled data, combine the two approaches:
– First learn a deep belief net without using the labels.
– Then apply a Gaussian process model to the deepest layer of features. This works better than using the raw data.
– Then use GP's to get the derivatives that are back-propagated through the deep belief net. This is a further win. It allows GP's to fine-tune complicated domain-specific kernels.
Learning to extract the orientation of a face patch
(Salakhutdinov & Hinton, NIPS 2007)

The training and test sets for predicting face orientation: 11,000 unlabeled cases; 100, 500, or 1000 labeled cases; the test face patches come from new people.
The root mean squared error in the orientation when combining GP's with deep belief nets

                GP on      GP on        GP on top-level
                the        top-level    features with
                pixels     features     fine-tuning
  100 labels    22.2       17.9         15.2
  500 labels    17.2       12.7          7.2
  1000 labels   16.3       11.2          6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
Deep Autoencoders
(Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBM's.
– Then "unroll" them.
– Then fine-tune with backprop.

[Figure: encoder 28x28 → 1000 neurons (W1) → 500 neurons (W2) → 250 neurons (W3) → 30 linear units (W4); the decoder mirrors it with the transposed weights W4^T ... W1^T back to 28x28.]
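A sketch of the "unrolling" step, assuming the four encoder weight matrices have already been trained as RBM's (names illustrative; `sigmoid` as in the earlier sketches):

    def unroll(rbm_weights):
        # Encoder uses the RBM weights; decoder starts as their transposes.
        encoder = list(rbm_weights)                           # W1 .. W4
        decoder = [W.T.copy() for W in reversed(rbm_weights)]
        return encoder, decoder       # then fine-tune all copies with backprop

    def encode(v, encoder):
        # Bottom-up pass; the 30-D code layer is kept linear.
        for W in encoder[:-1]:
            v = sigmoid(v @ W)
        return v @ encoder[-1]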
A comparison of methods for compressing digit images to 30 real numbers

[Figure: rows of digit images — real data, 30-D deep autoencoder, 30-D logistic PCA, 30-D PCA.]
Retrieving documents that are similar to a query document
• We can use an autoencoder to find low-dimensional codes for documents that allow fast and accurate retrieval of similar documents from a large set.
• We start by converting each document into a "bag of words". This is a 2000-dimensional vector that contains the counts for each of the 2000 commonest words.
How to compress the count vector
• We train the neural network to reproduce its input vector as its output.
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.

[Figure: input vector of 2000 word counts → 500 neurons → 250 neurons → 10 → 250 neurons → 500 neurons → output vector of 2000 reconstructed counts.]
Performance of the autoencoder at document retrieval
• Train on bags of 2000 words for 400,000 training cases of business documents.
– First train a stack of RBM's. Then fine-tune with backprop.
• Test on a separate 400,000 documents.
– Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes.
– Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons).
• Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.

[Figure: proportion of retrieved documents in the same class as the query vs. number of documents retrieved.]
[Figure: all documents compressed to 2 numbers using a type of PCA, with different colors for different document categories.]

[Figure: all documents compressed to 2 numbers (with the deep autoencoder), with different colors for different document categories.]
Finding binary codes for documents
• Train an auto-encoder using 30 logistic units for the code layer.
• During the fine-tuning stage, add noise to the inputs to the code units.
– The "noise" vector for each training case is fixed. So we still get a deterministic gradient.
– The noise forces their activities to become bimodal in order to resist the effects of the noise.
– Then we simply round the activities of the 30 code units to 1 or 0.

[Figure: 2000 word counts → 500 neurons → 250 neurons → 30-unit code layer (plus noise) → 250 neurons → 500 neurons → 2000 reconstructed counts.]
"emanti& (as(in5: Usin5 a deep autoen&oder as a
(as(2fun&tion for findin5 approximate mat&(es
("ala3(utdinov ) %inton: 2007#
(as(
fun&tion
Ksupermar3et sear&(L
How good is a shortlist found this way?
• We have only implemented it for a million documents with 20-bit codes --- but what could possibly go wrong?
– A 20-D hypercube allows us to capture enough of the similarity structure of our document set.
• The shortlist found using binary codes actually improves the precision-recall curves of TF-IDF.
– Locality sensitive hashing (the fastest other method) is 50 times slower and has worse precision-recall curves.
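A brute-force sketch of the shortlist lookup. In the real method the 20-bit code is used directly as a memory address, so no distances need computing; here Hamming distances stand in for the search of nearby addresses:

    import numpy as np

    def shortlist(query_code, all_codes, radius=2):
        # all_codes: (n_docs, 20) array of 0/1 code bits.
        dists = (all_codes != query_code).sum(axis=1)   # Hamming distance
        return np.flatnonzero(dists <= radius)          # candidate documents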
Generating the parts of an object
• One way to maintain the constraints between the parts is to generate each part very accurately.
– But this would require a lot of communication bandwidth.
• Sloppy top-down specification of the parts is less demanding
– but it messes up relationships between features
– so use redundant features and use lateral interactions to clean up the mess.
• Each transformed feature helps to locate the others.
– This allows a noisy channel.

[Figure: sloppy top-down activation of parts from pose parameters for a "square", followed by clean-up using known interactions between features with top-down support. It's like soldiers on a parade ground.]
"emi2restri&ted Bolt>mann 8a&(ines
*
Ce restri&t t(e &onne&tivity to ma3e
learnin5 easier+
*
Contrastive diver5en&e learnin5 re;uires
t(e (idden units to 7e in &onditional
e;uili7rium 6it( t(e visi7les+
,
But it does not re;uire t(e visi7le units
to 7e in &onditional e;uili7rium 6it( t(e
(iddens+
,
All 6e re;uire is t(at t(e visi7le units
are &loser to e;uili7rium in t(e
re&onstru&tions t(an in t(e data+
*
"o 6e &an allo6 &onne&tions 7et6een
t(e visi7les+
(idden
i
?
visi7le
Learning a semi-restricted Boltzmann Machine
1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel.
3. Repeatedly update all of the visible units in parallel using mean-field updates (with the hiddens fixed) to get a "reconstruction".
4. Update all of the hidden units again.

$$\Delta w_{ij} = \varepsilon \big(\langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1\big)$$

$$\Delta l_{ik} = \varepsilon \big(\langle v_i v_k \rangle^0 - \langle v_i v_k \rangle^1\big) \quad \text{(update for a lateral weight)}$$
Learning in Semi-restricted Boltzmann Machines
• Method 1: To form a reconstruction, cycle through the visible units updating each in turn using the top-down input from the hiddens plus the lateral input from the other visibles.
• Method 2: Use "mean field" visible units that have real values. Update them all in parallel.
– Use damping to prevent oscillations:

$$p_i^{t+1} = \lambda\, p_i^{t} + (1 - \lambda)\, \sigma(x_i)$$

where λ is the damping and x_i is the total input to unit i.
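A sketch of Method 2 in NumPy (`sigmoid` as before): parallel damped mean-field updates of the visibles, given fixed hidden states, visible-hidden weights `W`, and symmetric lateral weights `L` with a zero diagonal (all names illustrative):

    def damped_mean_field(v0, h, W, L, damping=0.5, n_iters=10):
        p = v0.copy()
        for _ in range(n_iters):
            total_input = h @ W.T + p @ L     # top-down plus lateral input
            # p^{t+1} = damping * p^t + (1 - damping) * logistic(input)
            p = damping * p + (1.0 - damping) * sigmoid(total_input)
        return p                              # the "reconstruction"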
Results on modeling natural image patches using a stack of RBM's (Osindero and Hinton)
• Stack of RBM's learned one at a time.
• 400 Gaussian visible units that see whitened image patches.
– Derived from 100,000 Van Hateren image patches, each 20x20.
• The hidden units are all binary.
– The lateral connections are learned when they are the visible units of their RBM.
• Reconstruction involves letting the visible units of each RBM settle using mean-field dynamics.
– The already decided states in the level above determine the effective biases during mean-field settling.

[Figure: 400 Gaussian units → Hidden MRF with 2000 units → Hidden MRF with 500 units → 1000 top-level units (no MRF). The lower connections are directed; the top pair of layers has undirected connections.]
[Figure: Without lateral connections — real data vs. samples from the model. With lateral connections — real data vs. samples from the model.]
A funny way to use an MRF
• The lateral connections form an MRF.
• The MRF is used during learning and generation.
• The MRF is not used for inference.
– This is a novel idea so vision researchers don't like it.
• The MRF enforces constraints. During inference, constraints do not need to be enforced because the data obeys them.
– The constraints only need to be enforced during generation.
• Unobserved hidden units cannot enforce constraints.
– To enforce constraints requires lateral connections or observed descendants.
Why do we whiten data?
• Images typically have strong pair-wise correlations.
• Learning higher-order statistics is difficult when there are strong pair-wise correlations.
– Small changes in parameter values that improve the modeling of higher-order statistics may be rejected because they form a slightly worse model of the much stronger pair-wise statistics.
• So we often remove the second-order statistics before trying to learn the higher-order statistics.
Whitening the learning signal instead of the data
• Contrastive divergence learning can remove the effects of the second-order statistics on the learning without actually changing the data.
– The lateral connections model the second-order statistics.
– If a pixel can be reconstructed correctly using second-order statistics, it will be the same in the reconstruction as in the data.
– The hidden units can then focus on modeling higher-order structure that cannot be predicted by the lateral connections.
• For example, a pixel close to an edge, where interpolation from nearby pixels causes incorrect smoothing.
Towards a more powerful, multi-linear stackable learning module
• So far, the states of the units in one layer have only been used to determine the effective biases of the units in the layer below.
• It would be much more powerful to modulate the pair-wise interactions in the layer below.
– A good way to design a hierarchical system is to allow each level to determine the objective function of the level below.
• To modulate pair-wise interactions we need higher-order Boltzmann machines.
Higher order Boltzmann machines
(Sejnowski, ~1986)
• The usual energy function is quadratic in the states:

$$E = \text{bias terms} - \sum_{i<j} s_i s_j w_{ij}$$

• But we could use higher-order interactions:

$$E = \text{bias terms} - \sum_{i<j<k} s_i s_j s_k w_{ijk}$$

• Unit k acts as a switch. When unit k is on, it switches in the pairwise interaction between unit i and unit j.
– Units i and j can also be viewed as switches that control the pairwise interactions between j and k or between i and k.
Using higher-order Boltzmann machines to model image transformations
(the unfactored version)
• A global transformation specifies which pixel goes to which other pixel.
• Conversely, each pair of similar intensity pixels, one in each image, votes for a particular global transformation.

[Figure: image(t) and image(t+1) both connected, via three-way interactions, to units representing the image transformation.]
Factoring three-way multiplicative interactions

Unfactored, with cubically many parameters:

$$E = -\sum_{i,j,h} s_i\, s_j\, s_h\, w_{ijh}$$

Factored, with linearly many parameters per factor:

$$E = -\sum_f \sum_{i,j,h} s_i\, s_j\, s_h\, w_{if}\, w_{jf}\, w_{hf}$$
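The saving is easy to see in code. A sketch of the factored energy, where `Wi`, `Wj`, `Wh` each have one column per factor (illustrative names):

    import numpy as np

    def factored_energy(s_i, s_j, s_h, Wi, Wj, Wh):
        # E = -sum_f (s_i . Wi[:, f]) (s_j . Wj[:, f]) (s_h . Wh[:, f]);
        # equivalent to the unfactored -sum_{ijh} s_i s_j s_h w_ijh with
        # w_ijh = sum_f w_if w_jf w_hf, but linear in the number of factors.
        return -np.sum((s_i @ Wi) * (s_j @ Wj) * (s_h @ Wh))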
A picture of the low-rank tensor contributed by factor f

[Figure: a cube of weights built from the three weight vectors w_if, w_jf, w_hf of factor f.]

Each layer is a scaled version of the same matrix. The basis matrix is specified as an outer product with typical term w_if w_jf. So each active hidden unit contributes a scalar, w_hf, times the matrix specified by factor f.
Inference with factored three-way multiplicative interactions

The energy contributed by factor f:

$$E_f = -\,s_h\, w_{hf} \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$

How changing the binary state of unit h changes the energy contributed by factor f — which is what unit h needs to know in order to do Gibbs sampling:

$$E_f(s_h{=}1) - E_f(s_h{=}0) = -\,w_{hf} \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$
Belief propagation

[Figure: a factor f connected to units i, j, h through weight vectors w_if, w_jf, w_hf.]

The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
Learning with factored three-way multiplicative interactions

The message from factor f to unit h:

$$m_f^h = \Big(\sum_i s_i w_{if}\Big) \Big(\sum_j s_j w_{jf}\Big)$$

$$\Delta w_{hf} \;\propto\; \big\langle s_h\, m_f^h \big\rangle_{\text{data}} \;-\; \big\langle s_h\, m_f^h \big\rangle_{\text{model}}$$

[Figure: Roland data.]
Modeling the correlational structure of a static image by using two copies of the image
• Each factor sends the squared output of a linear filter to the hidden units.
• It is exactly the standard model of simple and complex cells. It allows complex cells to extract oriented energy.
• The standard model drops out of doing belief propagation for a factored third-order energy function.

[Figure: Copy 1 and Copy 2 of the image connected to units i and j through w_if and w_jf; factor f connects to hidden unit h through w_hf.]
An advantage of modeling correlations between pixels rather than pixels
• During generation, a "vertical edge" unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
– This gives some translational invariance.
– It also gives a lot of invariance to brightness and contrast.
– So the "vertical edge" unit is like a complex cell.
• By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.
A principle of hierarchical systems
• Each level in the hierarchy should not try to micro-manage the level below.
• Instead, it should create an objective function for the level below and leave the level below to optimize it.
– This allows the fine details of the solution to be decided locally where the detailed information is available.
• Objective functions are a good way to do abstraction.
Time series models
• Inference is difficult in directed models of time series if we use non-linear distributed representations in the hidden units.
– It is hard to fit Dynamic Bayes Nets to high-dimensional sequences (e.g. motion capture data).
• So people tend to avoid distributed representations and use much weaker methods (e.g. HMM's).
Time series models
• If we really need distributed representations (which we nearly always do), we can make inference much simpler by using three tricks:
– Use an RBM for the interactions between hidden and visible variables. This ensures that the main source of information wants the posterior to be factorial.
– Model short-range temporal information by allowing several previous frames to provide input to the hidden units and to the visible units.
• This leads to a temporal module that can be stacked.
– So we can use greedy learning to learn deep models of temporal structure.
An application to modeling motion capture data
(Taylor, Roweis & Hinton, 2007)
• Human motion can be captured by placing reflective markers on the joints and then using lots of infrared cameras to track the 3-D positions of the markers.
• Given a skeletal model, the 3-D positions of the markers can be converted into the joint angles plus 6 parameters that describe the 3-D position and the roll, pitch and yaw of the pelvis.
– We only represent changes in yaw because physics doesn't care about its value and we want to avoid circular variables.
The conditional RBM model
(a partially observed CRF)
• Start with a generic RBM.
• Add two types of conditioning connections.
• Given the data, the hidden units at time t are conditionally independent.
• The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities (such as when the foot hits the ground).

[Figure: visible frames at t−2, t−1, t feed both the visible units v and the hidden units h at time t.]
Causal generation from a learned model
• Keep the previous visible states fixed.
– They provide a time-dependent bias for the hidden units.
• Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
– This picks new hidden and visible states that are compatible with each other and with the recent history.
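A sketch of one generated frame, assuming autoregressive weights `A` (history to visibles), conditioning weights `B` (history to hiddens), and RBM weights `W` (`sigmoid` as before; all names illustrative):

    def crbm_generate_step(history, v_t, W, A, B, rng, n_gibbs=30):
        # The fixed history supplies time-dependent biases.
        bias_v = history @ A
        bias_h = history @ B
        for _ in range(n_gibbs):
            h = (rng.random(bias_h.shape)
                 < sigmoid(v_t @ W + bias_h)).astype(float)
            v_t = sigmoid(h @ W.T + bias_v)   # mean-field visibles
        return v_t                            # the new frame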
Higher level models
• Once we have trained the model, we can add layers like in a Deep Belief Network.
• The previous layer CRBM is kept, and its output, while driven by the data, is treated as a new kind of "fully observed" data.
• The next level CRBM has the same architecture as the first (though we can alter the number of units it uses) and is trained the same way.
• Upper levels of the network model more "abstract" concepts.
• This greedy learning procedure can be justified using a variational bound.

[Figure: a second CRBM layer k stacked on top of the first layer's hidden units j, which are driven by the visible frames at t−2, t−1, t.]
Learning with "style" labels
• As in the generative model of handwritten digits (Hinton et al. 2006), style labels can be provided as part of the input to the top layer.
• The labels are represented by turning on one unit in a group of units, but they can also be blended.

[Figure: the stacked CRBM with a group of style-label units l feeding the top layer k, alongside the frames at t−2, t−1, t.]
"(o6 demo@s of multiple styles of
6al3in5
These can be foun at
www.cs.toronto.eu/!gwta"lor/
Readings on deep belief nets
A reading list (that is still being updated) can be found at
www.cs.toronto.edu/~hinton/deeprefs.html