
Bayesian Ying-Yang Theory for Empirical Learning, Regularization

and Model Selection: General Formulation


Lei Xu
Computer Science and Engineering Department
The Chinese University of Hong Kong, Shatin, NT, Hong Kong, P.R.China
Email: lxu@cse.cuhk.edu.hk

Abstract

The Bayesian Ying-Yang (BYY) learning system and theory developed in recent years by the present author since Xu (1995&96) is further elaborated in a general formulation, focusing on systematically introducing the key points of the theory for empirical learning, data smoothing based regularization, structural regularization, and model selection. Moreover, discussions are made on the relationship and difference between this theory and existing approaches, especially the Helmholtz machine, information theory, as well as information geometry theory.

1. Bayesian Ying-Yang learning system

It is widely believed that perception tasks involve both bottom-up and top-down connections between the observable space X and the invisible space Y. The BYY learning system takes this viewpoint for modeling various tasks of perception and unsupervised learning (Xu, 1995&96; 1997a&b). We use the joint distribution p(x,y) to describe such connections between X and Y. Two different models M_1, M_2 are built for describing p(x,y) because it has two complementary Bayesian representations:

p_{M_1}(x,y) = p_{M_{y|x}} p_{M_x},   p_{M_2}(x,y) = p_{M_{x|y}} p_{M_y}.   (1)

Interestingly, this corresponds to the famous ancient Chinese Ying-Yang philosophy¹: p_{M_x} models a distribution p(x) on the visible space X and thus is a Yang, while p_{M_y} models a distribution p(y) on the invisible space Y and thus is a Ying. p_{M_{y|x}} models a Yang transform from an observation into an invisible code or representation through a distribution p(y|x), while p_{M_{x|y}} models the inverse mapping y → x, a Ying transform from y to x, through the distribution p(x|y). Moreover, M_x, M_{y|x} form the Yang model M_1 as described by p_{M_1}(x,y), and M_y, M_{x|y} form the Ying model M_2 as described by p_{M_2}(x,y). We call such a pair of Ying-Yang models a Bayesian Ying-Yang (BYY) perception system. The word 'perception' is usually omitted for simplicity.

[This project was supported by the HK RGC Earmarked CUHK Grant 3391963.]
[¹ It should be "Yin" in the Mainland Chinese spelling. However, "Ying" is used here to keep the spirit of symmetry. The key point of the philosophy is that every normal entity in the universe involves a harmonious integration between a "Ying" part and a "Yang" part. Here, the word "harmony" consists of two aspects, namely a best matching between the Ying and the Yang, and the compactness or degree of consolidation of the matched Ying-Yang pair. Usually, a visible or physical component is called "Yang", and an invisible or spiritual component is called "Ying", which represents a static pair of Ying-Yang spaces. Moreover, there is another dynamic pair of "Ying-Yang" that describes the interactions between this pair of static spaces. E.g., a male animal is a 'Yang' that functions as a transform from a physical body into an invisible code, and a female animal is a 'Ying' that transfers an invisible code into a next-generation physical body.]

Based on a training set D_x = {x_i}_{i=1}^N, the task of specifying all the aspects of p_{M_{y|x}}, p_{M_y}, p_{M_{x|y}}, p_{M_x} is called learning in a broad sense.

Typically, p_{M_x} is estimated from D_x by

p_0(x) = (1/N) Σ_{i=1}^N δ(x - x_i),   p_{h_x}(x) = (1/N) Σ_{i=1}^N K_{h_x}(x - x_i),   (2)

where p_0(x) is called the empirical estimate and p_{h_x} the Parzen window estimate. p_{h_x}(x) can be regarded as a smoothed modification of p_0(x), obtained by using a kernel K_{h_x}(·) with a smoothing parameter h_x > 0 to blur out each impulse at x_i. A typical kernel is the gaussian G(x, x_i, h_x I_d) with mean x_i and covariance h_x I, where I is an identity matrix. When h_x = 0, p_{h_x}(x) returns to p_0(x).

Also, p_{M_x} can be estimated from a parametric model, e.g., a mixture of gaussians.
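As an illustration of eq.(2), the following sketch (added here for illustration and not part of the original paper; the one-dimensional data set is a hypothetical placeholder) evaluates the Parzen window estimate p_{h_x}(x) with a gaussian kernel and shows how a larger h_x blurs each impulse more, while h_x → 0 approaches the empirical estimate p_0.

import numpy as np

def parzen_density(x, data, h):
    """Parzen window estimate p_h(x) = (1/N) sum_i G(x; x_i, h) with a gaussian kernel of variance h."""
    data = np.asarray(data, dtype=float)
    k = np.exp(-0.5 * (x - data) ** 2 / h) / np.sqrt(2.0 * np.pi * h)
    return k.mean()

# hypothetical one-dimensional training set D_x = {x_i}
D_x = [0.1, 0.4, 0.35, 0.9, 1.2]
for h in (1.0, 0.1, 0.01):
    print(h, parzen_density(0.4, D_x, h))   # smaller h concentrates mass near the samples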
The specifications of p_{M_{y|x}}, p_{M_{x|y}} and p_{M_y} consist of the following three issues:

(1) Specification of the space Y. For tasks of decision, classification, recognition, etc., each y ∈ Y is a discrete number that takes only k values, e.g., y = 1, ..., k. For tasks of encoding, perception, recognition, etc., each y ∈ Y is either a k-bit binary code or a k-dimensional real vector. The integer k is called the representation scale, since it acts as a scale indicator of the complexity of the representation form of y.

(2) Model design. For each a ∈ {x|y, y|x, y}, we use P_a to denote the family of all the distributions of the form p(a), i.e., we have p(x|y) ∈ P_{x|y}, p(y) ∈ P_y and p(y|x) ∈ P_{y|x}. Similarly, we use P_{M_a} ⊆ P_a to denote the family of all the distributions p_{M_a}(a). If P_{M_a} = P_a, we say that p_{M_a}(a) is model-free and thus simply denote it by p(a). If P_{M_a} ⊂ P_a, we say that p_{M_a}(a) is designed in a model M_a with a given structure, which specifies a distribution function form p_{M_a}(a|θ_a), or simply p(a|θ_a), with a set θ_a of unknown parameters, i.e., P_{M_a} = {p(a|θ_a) : for all θ_a ∈ Θ_a under a given M_a}, where Θ_a is the set of all values that θ_a takes.

Similar to the case of M_x, for each a ∈ {x|y, y|x, y} we may also have a set of models {M_a^{(n_a)}} that share a same structure but differ in scale, such that P_{M_a^{(n_a)}} ⊂ P_{M_a^{(n_a+1)}}, with P_{M_a^{(n_a)}} = {p(a|θ_a^{(n_a)}) : for all θ_a^{(n_a)} ∈ Θ_a^{(n_a)} under M_a^{(n_a)}}.

The designs of p_{M_{y|x}}, p_{M_{x|y}}, p_{M_y} are considered coordinately according to implementation facility, structural constraints and prior knowledge, subject to two basic requirements. The first is that the designs should be consistent with the representation form of y, which is prespecified according to the nature of the perception tasks. The next requirement is that a combination of the designed structures of p_{M_{y|x}}, p_{M_{x|y}}, p_{M_y} should not lead to a Ying-Yang system of structural indeterminacy, i.e., one for which the subsequent learning problem becomes either trivial or indefinite. We encounter structural indeterminacy when any pair among M_{y|x}, M_{x|y}, M_y is free of model. Thus, at most one of M_{y|x}, M_{x|y}, M_y can be free of model in a Ying-Yang system.

(3) Parameter learning. We use M = {M_a : a ∈ {x, x|y, y|x, y}} to denote a designed model structure, with the implication that all the scales k and n_a, a ∈ {x, x|y, y|x, y}, are prefixed in a certain way. Then, we use θ = {θ_a : a ∈ {x, x|y, y|x, y}} to denote all the unknown parameters in M. Particularly, we use parameter learning to refer to the task of determining θ at a specific value θ*, and correspondingly use M* to denote a specification of M by θ*.

(4) Model selection. It refers to the task of selecting a particular j* among a set of given models {M*(j)}. The set {M(j)} can be obtained from a design of the same structure but with each j corresponding to a particular setting of the scales k and n_a, a ∈ {x, x|y, y|x, y}. We use representation scale selection to refer to the specific case where j is obtained by enumerating k only, with all n_a, a ∈ {x, x|y, y|x, y}, prefixed in a certain way. Similarly, we use structural scale selection to refer to the specific case where j is obtained by enumerating n_a, a ∈ {x, x|y, y|x, y}, under a given k. Sometimes, we can even get the set {M(j)} from a same structure of fixed scales by enumerating only a number of specific values that θ takes.
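Before moving on, a small numerical illustration of the two complementary representations in eq.(1) may help. The sketch below is an addition for illustration (it is not from the original paper) and assumes a hypothetical discrete toy joint distribution: starting from one joint p(x,y), the Yang factorization p(y|x)p(x) and the Ying factorization p(x|y)p(y) reproduce the same joint, which is exactly the matching that BYY learning enforces between two separately designed models.

import numpy as np

# a hypothetical joint distribution over discrete X (rows) and Y (columns)
p_xy = np.array([[0.20, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.30]])
p_x = p_xy.sum(axis=1)                      # p(x)
p_y = p_xy.sum(axis=0)                      # p(y)
p_y_given_x = p_xy / p_x[:, None]           # Yang transform p(y|x)
p_x_given_y = p_xy / p_y[None, :]           # Ying transform p(x|y)

yang_joint = p_y_given_x * p_x[:, None]     # p_{M_1}(x,y) = p(y|x) p(x)
ying_joint = p_x_given_y * p_y[None, :]     # p_{M_2}(x,y) = p(x|y) p(y)
assert np.allclose(yang_joint, ying_joint)  # the two Bayesian representations coincide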
2. BYY learning theory: fundamentals

We use 'Ying-Yang harmony' as a fundamental principle for building a learning theory in the BYY learning system. Specifically, this theory contains two key points:

(1) Model design is made under a general framework that consists of a Ying model p_{M_2} = p_{M_{x|y}} p_{M_y} and a Yang model p_{M_1} = p_{M_{y|x}} p_{M_x}, with the four components p_{M_{y|x}}, p_{M_{x|y}}, p_{M_x} and p_{M_y} designed as in Sec.1.

(2) Parameter learning and model selection are made to let the Ying model p_{M_2} and the Yang model p_{M_1} be in best harmony, in the sense that both the mismatch between the two models and the diversification of the resulting Ying-Yang system are minimized.

Mathematically, we need a functional H(M) to measure the degree of harmony between p_{M_1} and p_{M_2}. We start by considering the functional given in Xu (1997) and rewrite it into a general form

H(M) = Q(p_{M_1}, p_{M_2}),   Q(p, q) = ∫ p(u) ln q(u) du.   (3)

To get better insights, we further decompose H(M) into the following equivalent form:

-H(M) = D_f(M) + D_I(M),   D_I(M) = -Q(p_{M_1}, p_{M_1}),
D_f(M) = D_f(p_{M_1}, p_{M_2}),   D_f(p, q) = ∫ p(u) ln [ p(u)/q(u) ] du ≥ 0.   (4)

For brevity, the notations Q(p, q), D_f(p, q) will be frequently used throughout the whole paper. In fact, the above D_f(p, q) is the well-known Kullback divergence.

Minimizing -H(M) consists of minimizing both D_f(M) and D_I(M). D_f(M) describes the mismatch between the Ying and the Yang models and reaches its minimum when p_{M_1} = p_{M_2}. Minimizing D_f(M) alone will indifferently take the minimum zero for different specifications M*(r) = {M_1^{*(r)}, M_2^{*(r)}}, r = 1, ..., r_M, that all satisfy p_{M_1^{*(r)}}(x,y) = p_{M_2^{*(r)}}(x,y). This problem is handled by minimizing D_I(M), which here is the entropy of the Yang model only, i.e., the system's diversification is observed from the perspective of the Yang model. For a distribution, the larger its entropy, the more the distribution is diversified. As will be introduced in Sec.4, we can also use the entropy of the Ying model as D_I(M). Particularly, when the mismatch disappears at p_{M_1} = p_{M_2}, the two choices become an identical expression of the system's diversification on X, Y.

We observe from eq.(3) that H(M) has convexity with respect to p_{M_1} only, but not to p_{M_2}. Thus, it is not well defined to maximize H(M) with respect to p_{M_2} directly. For this reason, we maximize H(M) stepwise by either Learning procedure I or Learning procedure II, described as follows.

Learning procedure I
It consists of parameter learning and model selection.
• Parameter learning. It is made by

D_f(M*) = min_θ D_f(M),   or   θ* = arg min_θ D_f(M).   (5)

It can be solved by an iterative algorithm that updates each θ_a, a ∈ {x, x|y, y|x, y}, to minimize D_f(M), with the other parameters fixed alternatively.

When p_{M_x} = p_{h_x} by eq.(2) is used, we have

min_{θ^-, h_x} D_f(M),   θ^- = {θ_y, θ_{y|x}, θ_{x|y}},
D_f(M) = D_{f^-}(M) + Q(p_{h_x}, p_{h_x}),   (6)

where Q(p_{h_x}, p_{h_x}) is irrelevant to θ^-. At a fixed h_x, we equivalently have

min_{θ^-} D_{f^-}(M),   D_{f^-}(M) = ∫ p_{h_x}(x) R(x) dx,
R(x) = ∫ p_{M_{y|x}} ln [ p_{M_{y|x}} / (p_{M_{x|y}} p_{M_y}) ] dy.   (7)

Particularly, when p_0(x) is used as p_{M_x}, we have

min_{θ^-} D_{f_0^-}(M),   D_{f_0^-}(M) = (1/N) Σ_{i=1}^N R(x_i).   (8)
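For a discrete inner representation y, the quantities in eqs.(7)&(8) can be computed directly. The following sketch (added for illustration and not from the paper; the parametric components used are hypothetical stand-ins) evaluates R(x_i) = Σ_y p(y|x_i) ln [ p(y|x_i) / (p(x_i|y) p(y)) ] and the empirical objective D_{f_0^-}(M) = (1/N) Σ_i R(x_i).

import numpy as np

def R(p_y_given_x, p_x_given_y, p_y):
    """R(x) of eq.(7) for discrete y: sum_y p(y|x) ln[ p(y|x) / (p(x|y) p(y)) ]."""
    return float(np.sum(p_y_given_x * np.log(p_y_given_x / (p_x_given_y * p_y))))

def empirical_objective(samples, yang, ying, prior):
    """D_{f0^-}(M) of eq.(8): average of R(x_i) over the training set.
    yang(x)  -> vector p(y|x) over the k values of y   (Yang component M_{y|x})
    ying(x)  -> vector p(x|y) over the k values of y   (Ying component M_{x|y})
    prior    -> vector p(y)                             (Ying component M_y)
    """
    return float(np.mean([R(yang(x), ying(x), prior) for x in samples]))

# hypothetical components with k = 2
prior = np.array([0.5, 0.5])
yang = lambda x: np.array([1.0, 1.0 + x * x]) / (2.0 + x * x)
ying = lambda x: np.array([np.exp(-0.5 * x * x), np.exp(-0.5 * (x - 1.0) ** 2)]) / np.sqrt(2 * np.pi)
print(empirical_objective([0.1, 0.4, 0.9], yang, ying, prior))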

In this case, eq.(8) is exactly the conventional empirical learning that is made directly on D_x = {x_i}_{i=1}^N.

We call eq.(6) smoothed learning, because a best h_x* is searched such that p_{h_x*} by eq.(2) can smooth out the disturbances in p_0. Thus, the smoothed learning can be regarded as a regularization of empirical learning. This point can be better understood by considering the gaussian kernel, i.e., each impulse blurred into G(x, x_i, h_x I_d). In this case, we have p_{h_x*}(x) = ∫ p_0(x') G(x - x', 0, h_x* I) dx', which is the distribution of x = x' + ε for two independent variables, x' from p_0 and ε from G(ε, 0, h_x* I). In other words, eq.(7) is equivalent to eq.(8) on a noise-blurred data set of a large enough size N', obtained by adding noise samples into D_x. It is well understood in the literature that adding noise into data is a type of regularization.

The regularization role of the smoothed learning can also be understood by analyzing D_{f^-}(M). From the 2nd-order Taylor expansion (Xu, 1999a), we get

D_{f^-}(M) = D_{f_0^-}(M) + 0.5 h_x R_H,   R_H = (1/N) Σ_{i=1}^N Tr[ H_R(x_i) ],   (9)

where H_R(x) is the Hessian of R(x). That is, the smoothed learning consists of the empirical learning plus a regularization term, weighted by h_x, that minimizes the trace of the Hessian, a 'volume' or complexity of R(x).

A typical implementation of the smoothed learning eq.(6) is to iterate the updating of h_x and θ^- alternatively. That is, with h_x fixed, update θ^- to reduce D_{f^-}(M); with θ^- fixed, update h_x to reduce D_f(M). The detailed methods are referred to Xu (1999a).
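A simple way to see the data-smoothing regularization at work is to approximate the integral in eq.(7) by sampling from p_{h_x}, i.e., by adding gaussian noise of variance h_x to the training points, as discussed above. The sketch below is an added illustration (it is not the algorithm of Xu, 1999a): in practice the R_of_x argument would be the R(x) of eq.(7); here a trivial quadratic stand-in is used just to exercise the routine and to show the Hessian-trace effect of eq.(9).

import numpy as np

def smoothed_objective(samples, R_of_x, h_x, n_noise=100, seed=0):
    """Monte-Carlo estimate of D_{f^-}(M) = ∫ p_{h_x}(x) R(x) dx in eq.(7):
    average R over noise-blurred copies x_i + e, with e ~ G(0, h_x)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for x in samples:
        noisy = x + np.sqrt(h_x) * rng.standard_normal(n_noise) if h_x > 0 else np.full(n_noise, float(x))
        total += np.mean([R_of_x(z) for z in noisy])
    return total / len(samples)

# stand-in for R(x); its Hessian trace is 2, so h_x > 0 adds roughly 0.5*h_x*2 as in eq.(9)
toy_R = lambda x: (x - 0.5) ** 2
print(smoothed_objective([0.1, 0.4, 0.9], toy_R, h_x=0.0))    # empirical objective, eq.(8)
print(smoothed_objective([0.1, 0.4, 0.9], toy_R, h_x=0.05))   # smoothed objective, eq.(7)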
• Model selection. Given a set of models {M*(j)}, j = 1, ..., N_M, with their parameters specified after parameter learning, we select the best one by

j* = arg min_j J(j),   J(j) = -H(M*(j)),   with H(M) given by eq.(3).   (10)

When p_{M_x} = p_0(x), we have

H_0(M) = (1/N) Σ_{i=1}^N ∫ p_{M_{y|x}}(y|x_i) ln [ p_{M_{x|y}}(x_i|y) p_{M_y}(y) ] dy.   (11)

Learning procedure II
First, we divide θ into two subsets θ' and θ - {θ'}, and minimize D_f(M) with respect to θ', which results in a constraint ∂D_f(M)/∂θ' = 0, shortly denoted by F(θ) = 0. Second, we do parameter learning by minimizing -H(M) with respect to θ under this constraint. Third, we do model selection by eq.(10). In this case, we have

• Harmony oriented parameter learning, i.e.,

min_{θ: F(θ) = 0} -H(M)   (12)

for best harmony, which introduces structural regularization into merely minimizing the mismatch by eq.(5). Eq.(12) is well defined as long as the constraint F(θ) = 0 is strong enough, which depends on the selection of θ' and highly relates to the system structure. In Sec.3, we will further discuss eq.(12) on two specific structures.

For models {M*(j)}, j = 1, ..., N_M, resulting from eq.(12), we can also use eq.(10) for model selection.

3. Three typical architectures

Bi-directional architecture
We have parametric models

p_{M_{y|x}} = p(y|x, θ_{y|x}, n_{y|x}),   p_{M_{x|y}} = p(x|y, θ_{x|y}, n_{x|y}),   (13)

where the structural scales n_{y|x}, n_{x|y} are usually prefixed and thus omitted.

We can implement either the smoothed learning or the empirical learning by eq.(6) or eq.(8). For more insight, we rewrite D_{f^-}(M), D_{f_0^-}(M) in eqs.(7)&(8) into

D_{f^-}(M) = ∫ p_{h_x}(x) D_f(x) dx,   D_{f_0^-}(M) = (1/N) Σ_{i=1}^N D_f(x_i),
D_f(x) = ∫ p_{M_{y|x}} ln [ p_{M_{y|x}} / p_{M_{y|x}^*} ] dy - ln p_{M_x^*}(x).   (14)

In Sec.5, we will discuss the relation of this D_{f_0^-}(M) to the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995). In eq.(14), p_{M_{y|x}^*}, p_{M_x^*} are contributed by M_2 as the effective counterparts of p_{M_{y|x}}, p_{M_x} of M_1, given as follows:

p_{M_x^*}(x) = ∫ p_{M_{x|y}}(x|y) p_{M_y}(y) dy,   p_{M_{y|x}^*}(y|x) = p_{M_{x|y}}(x|y) p_{M_y}(y) / p_{M_x^*}(x).   (15)

After parameter learning, we can use eq.(10) for model selection. However, the learning eq.(12) is usually difficult to implement since it is not easy to get F(θ) = 0.

Backward architecture
We let p_{M_{x|y}} be parametric as in eq.(13), while p_{M_{y|x}} is either free of model with no structural constraint, i.e., p_{M_{y|x}} = p(y|x), p(y|x) ∈ P_{y|x}, or a parametric model of a large freedom such that P_{M_{y|x}^*} = {p_{M_{y|x}^*}(y|x) : all p_{M_{x|y}} ∈ P_{M_{x|y}}, p_{M_y} ∈ P_{M_y}} is a subset of P_{M_{y|x}}.

On this architecture, the minimization of D_f(M) with respect to p_{M_{y|x}} results in

p_{M_{y|x}} = p_{M_{y|x}^*},   min_{θ_{x|y}, θ_y, h_x} D_f(p_{M_x}, p_{M_x^*}),   (16)

which can be implemented either directly by a gradient descent approach or by iterating two steps:

Step 1: with θ_{x|y}, θ_y fixed, update h_x to reduce D_f(p_{M_x}, p_{M_x^*}), and get p_{M_{y|x}^*} by eq.(15);   (17)
Step 2: with h_x fixed and p_{M_{y|x}} = p_{M_{y|x}^*} fixed, update θ_{x|y}, θ_y to reduce -H(M).

When p_{M_x} = p_{h_x}, from eq.(7) and eq.(8) we have

min_{θ_{x|y}, θ_y} D_{f^-}(M),   D_{f^-}(M) = -∫ p_{h_x}(x) ln p_{M_x^*}(x) dx,   (18)
min_{θ_{x|y}, θ_y} D_{f_0^-}(M),   D_{f_0^-}(M) = -(1/N) Σ_{i=1}^N ln p_{M_x^*}(x_i).

Thus, we see that the empirical learning becomes exactly the maximum likelihood (ML) learning on the marginal distribution p_{M_x^*} of the Ying model. Moreover, eq.(17) with h_x = 0 fixed becomes the generalized Expectation Maximization (GEM) algorithm for ML learning (Dempster, et al, 1977).
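To make the ML connection concrete, consider the common special case where the Ying model is a gaussian mixture. The sketch below is an illustration under that assumption (not code from the paper): it forms the marginal p_{M_x^*}(x) = Σ_y p(x|y) p(y) of eq.(15), returns the corresponding posterior, which is the free Yang component p_{M_{y|x}^*} of eq.(15), and evaluates D_{f_0^-}(M) of eq.(18), which is exactly the negative log-likelihood minimized by the EM/GEM algorithm.

import numpy as np

def gaussian(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def ying_marginal_and_posterior(x, weights, means, variances):
    """p_{M_x^*}(x) = sum_y p(x|y) p(y) and p_{M_{y|x}^*}(y|x), eq.(15), for a gaussian mixture."""
    joint = np.array([w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)])
    marginal = joint.sum()
    return marginal, joint / marginal

def empirical_Df0(samples, weights, means, variances):
    """D_{f0^-}(M) of eq.(18): the negative log-likelihood that ML/EM minimizes."""
    return -float(np.mean([np.log(ying_marginal_and_posterior(x, weights, means, variances)[0])
                           for x in samples]))

# hypothetical two-component Ying model
print(empirical_Df0([0.1, 0.4, 0.9, 1.3], weights=[0.3, 0.7], means=[0.0, 1.0], variances=[0.2, 0.3]))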

The smoothed learning eq.(6) with θ^- = {θ_y, θ_{x|y}} and D_{f^-}(M) given by eq.(18) is a regularized extension of the ML learning, which can be seen explicitly in the case of a gaussian kernel from eq.(9), with H_R derived in Xu (1999a).

Also, putting p_{M_{y|x}} = p_{M_{y|x}^*} into eq.(3) results in

H(M) = ∫ p_{M_x} Q(p_{M_{y|x}^*}, p_{M_{y|x}^*}) dx + Q(p_{M_x}, p_{M_x}) - D_f(p_{M_x}, p_{M_x^*}).   (19)

Based on this, we can perform model selection by eq.(10), as well as the harmony oriented parameter learning eq.(12) with F(θ) = 0 being p_{M_{y|x}} = p_{M_{y|x}^*}.

When p_0(x) is used as p_{M_x}, eq.(19) becomes

H_0(M) = (1/N) Σ_{i=1}^N ln p_{M_x^*}(x_i) + (1/N) Σ_{i=1}^N ∫ p_{M_{y|x}^*}(y|x_i) ln p_{M_{y|x}^*}(y|x_i) dy.   (20)

It consists of the ML learning on p_{M_x^*} and a regularization by the second term, which minimizes the uncertainty of the mapping x → y.

Similarly, the smoothed H(M) by eq.(19) also includes another regularization, which can be understood in the case of a gaussian kernel from H^-(M) = H_0^-(M) + 0.5 h_x (H_R + H_{y|x}), where H_{y|x} relates to the Hessian with respect to x of -∫ p_{M_{y|x}^*} ln p_{M_{y|x}^*} dy.

Forward architecture
M_{y|x} is a parametric model as given in eq.(13), while M_{x|y} is flexible such that p_{M_{x|y}} is either free of model, i.e., p_{M_{x|y}} = p(x|y), p(x|y) ∈ P_{x|y}, or a parametric model of a large freedom with P_{M_{x|y}^*} ⊆ P_{M_{x|y}}, where P_{M_{x|y}^*} = {p_{M_{x|y}^*} : all p_{M_{y|x}} ∈ P_{M_{y|x}}, p_{M_x} ∈ P_{M_x}}, and

p_{M_y^*}(y) = ∫ p_{M_{y|x}}(y|x) p_{M_x}(x) dx,   p_{M_{x|y}^*}(x|y) = p_{M_{y|x}}(y|x) p_{M_x}(x) / p_{M_y^*}(y),   (21)

where p_{M_{x|y}^*}, p_{M_y^*} are contributed by M_1 as the counterparts of p_{M_{x|y}}, p_{M_y} of M_2.

In this case, the minimization of D_f(M) with respect to p_{M_{x|y}} results in

p_{M_{x|y}} = p_{M_{x|y}^*},   min_{θ_{y|x}, θ_y, h_x} D_f(p_{M_y^*}, p_{M_y}),   (22)

which, again, can be solved either directly by a gradient descent approach or by alternatively updating one of p_{M_y^*}, p_{M_y} with the other temporarily fixed.

When p_{M_x} = p_0(x), p_{M_y^*} in eq.(21) becomes

p_{M_y^*}(y) = (1/N) Σ_{i=1}^N p_{M_{y|x}}(y|x_i).   (23)

When p_{M_x} = p_{h_x} by eq.(2) is used in p_{M_y^*} by eq.(21), the smoothed learning eq.(6) with θ^- = {θ_y, θ_{y|x}} is a regularized extension of the empirical learning with p_{M_y^*} by eq.(23). Again, this can be seen clearly in the case of a gaussian kernel from eq.(9), with R_H derived as shown in (Xu, 1999a).

Putting p_{M_{x|y}} = p_{M_{x|y}^*} into eq.(3), we can get

H(M) = ∫ p_{M_y^*} Q(p_{M_{x|y}^*}, p_{M_{x|y}^*}) dy + Q(p_{M_y^*}, p_{M_y}),
or   H(M) = ∫ p_{M_y^*} Q(p_{M_{x|y}^*}, p_{M_{x|y}^*}) dy + Q(p_{M_y^*}, p_{M_y^*}) - D_f(p_{M_y^*}, p_{M_y}).   (24)

It consists of the minimization of the mismatch between M_y^*, M_y and a regularization that minimizes the uncertainty of the mapping x → y.

Based on H(M) by eq.(24), we can implement model selection by eq.(10) and the harmony oriented parameter learning eq.(12), with F(θ) = 0 being p_{M_{x|y}} = p_{M_{x|y}^*}, which introduces a structural regularization into the empirical learning.

4. The theory in a general formulation

Generalized definition for H(M)
We consider a general measure called the f-divergence, which was first introduced by Csiszar in 1967; a nice introduction can be found in (Devroye, et al, 1996):

D_f(p, q) = ∫ q(u) f( p(u)/q(u) ) du,   with f(·) convex and f(1) = 0,   (25)

i.e., the Kullback divergence in eq.(4) is its special case, obtained with f(u) = u ln u. We can use the f-divergence to measure the mismatch between the Ying and Yang models.

Moreover, to get a general definition of the system's diversification, we propose to use a relative concept to describe the diversification of a given distribution p(x). First, we choose a distribution u_x as a standard that stands for the most diversified distribution, e.g.,

u_x = uniform on X,  for a finite support X;
u_x = G(x, m, I_d),  for the support R^d;
u_x = Π_{i=1}^d λ_i e^{-λ_i x_i},  for the support (R^+)^d;   with R = (-∞, +∞), R^+ = [0, +∞),   (27)

where G(x, m, I_d) denotes a gaussian with mean m = ∫ x p(x) dx and identity covariance matrix I_d.

Second, we regard that p(x) is more diversified if it is closer to u_x, or equivalently, less diversified if it is farther away from u_x. Thus, we can define

D_I(p) = -D_f(p, u_x) : Definition-1, or shortly Def-1;
D_I(p) = -D_f(u_x, p) : Definition-2, or shortly Def-2.   (28)

Obviously, D_I(p) ≤ 0 and reaches its largest value 0 when p(x) = u_x(x). The smaller the D_I(p), the less diversified p(x) is. The two definitions become the same for a symmetrical D_f(p, q) = D_f(q, p).
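The sketch below (added for illustration on discrete distributions; it is not from the paper) computes the f-divergence of eq.(25), checks that f(u) = u ln u recovers the Kullback divergence of eq.(4), and evaluates the two relative diversification measures of eq.(28) against a uniform standard u_x.

import numpy as np

def f_divergence(p, q, f):
    """D_f(p, q) = sum_u q(u) f(p(u)/q(u)) for discrete p, q, as in eq.(25)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

kl_f = lambda u: u * np.log(u)                      # f(u) = u ln u gives the Kullback divergence
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
assert np.isclose(f_divergence(p, q, kl_f), np.sum(p * np.log(p / q)))

u_x = np.full(3, 1.0 / 3.0)                         # the most diversified standard on a finite X
D_I_def1 = -f_divergence(p, u_x, kl_f)              # Def-1 of eq.(28): -D_f(p, u_x)
D_I_def2 = -f_divergence(u_x, p, kl_f)              # Def-2 of eq.(28): -D_f(u_x, p)
print(D_I_def1, D_I_def2)                           # both are <= 0 and equal 0 only at p = u_x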

When f(u) = u ln u and X is a finite support, with u_x given by the first choice in eq.(27), we have

D_I(p) = -Q(p, p) - ln V(X),  for Def-1;
D_I(p) = Q(u_x, p) + ln V(X),  for Def-2,   (29)

where V(X) is the constant volume of X and usually can be ignored. Thus, D_I(p) by Definition-1 returns to the entropy, which is consistent with eq.(4), while D_I(p) by Definition-2 is equivalent to Q(u_x, p).

On the joint support Z = X × Y, the most diversified standard u_{x,y} can be one of the two choices:

(a) let z = [x, y] and u(z) be given by eq.(27);
(b) u_{x,y} = u_y p_{M_x}(x), with u_y given by eq.(27).   (30)

In summary, we extend the definition of harmony degree by eq.(4) into the following general one:

-H(M) = D_f(M) + D_I(M),   D_f(M) = D_f(p_{M_1}, p_{M_2}) by eq.(25),   D_I(M) = D_I(p_{M_1}) by eq.(28) with u_{x,y} by eq.(30).   (31)

We can use this D_f(M) in eq.(5) for parameter learning, and this H(M) in eq.(10) for model selection and in eq.(12) for harmony oriented parameter learning. However, here the use of p_0(x) as p_{M_x} usually does not apply for those cases of f(u) ≠ u ln u.

On backward and forward architectures
On the backward architecture, we still have eq.(16) but with D_f(p, q) given by eq.(25), and eq.(19) becomes

-H(M) = D_f(p_{M_x}, p_{M_x^*}) + D_I(M),   (32)
D_I(M) = -∫ p_{M_x} D_f(p_{M_{y|x}^*}, u_{M_x}(y|x)) dx,  for Def-1;
D_I(M) = -∫ p_{M_x} D_f(u_{M_x}(y|x), p_{M_{y|x}^*}) dx,  for Def-2,

where u_{M_x}(y|x) = u_{x,y}/p_{M_x} can be regarded as the effective standard of the most diversified distributions of the form p(y|x), from the perspective of p_{M_x}. Thus, this D_I(M) is consistent with our original definition of diversification. Particularly, when u_{x,y} is given by the second choice in eq.(30), we have u_{M_x}(y|x) = u_y.

On the forward architecture, we still have eq.(22) but with D_f(p, q) given by eq.(25), and eq.(24) becomes

-H(M) = D_f(p_{M_y^*}, p_{M_y}) + D_I(M),   (33)
D_I(M) = -∫ p_{M_y^*} D_f(p_{M_{x|y}^*}, u_{M_y}(x|y)) dy,  for Def-1;
D_I(M) = -∫ p_{M_y^*} D_f(u_{M_y}(x|y), p_{M_{x|y}^*}) dy,  for Def-2,

where u_{M_y}(x|y) = u_{x,y}/p_{M_y^*} can be regarded as the effective standard of the most diversified distributions of the form p(x|y), from the perspective of p_{M_y^*}. This D_I(M) is also consistent with our original definition of diversification.

Since p_{M_y^*} is well defined by eq.(23) even when p_{M_x} = p_0(x), the empirical learning is applicable here, which is a feature that only the forward architecture has.

Yang-dominated versus Ying-dominated learning
H(M), D_f(M) are not symmetric in the positions of p_{M_1} and p_{M_2}, since usually Q(p, q) ≠ Q(q, p) and D_f(p, q) ≠ D_f(q, p). Up to now, we have discussed only the case with p_{M_1} in the 1st position, i.e., it acts as the measure that dominates the integrals. So, we call all these cases BYY Yang-dominated learning. Similarly, we have BYY Ying-dominated learning by switching the positions of p_{M_1}, p_{M_2} in H(M), D_f(M) by eq.(3), eq.(4), and eq.(31), and by replacing p_{M_1} with p_{M_2} in D_I(M) in eq.(4), eq.(31). Specifically, for the backward architecture, we switch p_{M_x}, p_{M_x^*} in D_f(M) by eq.(16) and eq.(32), and use D_f(p_{M_x^*}, p_{M_x}) to replace both D_f(p_{M_x}, p_{M_x^*}) and -H(M) in eq.(17). Moreover, in eq.(19), we get

H(M) = ∫ p_{M_x^*} Q(p_{M_{y|x}^*}, p_{M_{y|x}^*}) dx + Q(p_{M_x^*}, p_{M_x^*}) - D_f(p_{M_x^*}, p_{M_x}).   (34)

For the forward architecture, we switch p_{M_y^*}, p_{M_y} in D_f(M) by eq.(22) and eq.(33). Moreover, we get

H(M) = ∫ p_{M_y} Q(p_{M_{x|y}^*}, p_{M_{x|y}^*}) dy + Q(p_{M_y}, p_{M_y}) - D_f(p_{M_y}, p_{M_y^*}).   (35)

Empirical learning versus smoothed learning
The smoothed learning applies to all the cases of both Yang-dominated and Ying-dominated learning. However, the empirical learning applies only to Yang-dominated learning with f = u ln u, but usually not to Yang-dominated learning with f ≠ u ln u, nor to the cases of Ying-dominated learning. For the empirical learning, we can get rid of the computational difficulty of the integrals over x by replacing them with summations, while for the smoothed learning we have to deal with these integrals by Monte-Carlo techniques.
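The difference between the two dominated forms amounts to which side of the pair supplies the measure that dominates the integrals. The sketch below (a toy discrete illustration added here, not from the paper; the two joints are hypothetical) compares the Yang-dominated mismatch D_f(p_{M_1}, p_{M_2}) with the Ying-dominated one D_f(p_{M_2}, p_{M_1}) obtained by switching the positions of p_{M_1} and p_{M_2}, using the Kullback case f(u) = u ln u.

import numpy as np

def kl(p, q):
    """Kullback divergence D_f(p, q) with f(u) = u ln u, for discrete joint distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# hypothetical Yang joint p_{M_1}(x,y) = p(y|x)p(x) and Ying joint p_{M_2}(x,y) = p(x|y)p(y)
p_M1 = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_M2 = np.array([[0.25, 0.15],
                 [0.20, 0.40]])

print("Yang-dominated mismatch:", kl(p_M1, p_M2))   # p_{M_1} dominates the sum/integral
print("Ying-dominated mismatch:", kl(p_M2, p_M1))   # positions switched
# the two values differ in general, since D_f(p, q) != D_f(q, p)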
5. Relations to the literature

In addition to providing a number of new learning models, special cases of the BYY learning system and theory also lead us to several existing learning approaches in the literature, with different insights and some new results. A detailed list with discussions is given in Xu (1999b) for unsupervised learning, in Xu (1999c) for supervised learning, and in Xu (1999d&e) for temporal modeling.

Relations to information theory and information geometry
Generally speaking, the idea of matching two distributions by the Kullback divergence KL(p, q) = D_f(p, q) defined in eq.(4) can be traced back for decades in information theory. Different specific problems of matching two specific distributions were studied in various areas of mathematics, physics, and engineering, and the alternative minimizations min_{p∈P} KL(p, q) and min_{q∈Q} KL(p, q) were also often used in various specific formulations. A general study has been made by Csiszar & Tusnady (1984) for two general p, q, in which the geometry structure of KL(p, q) has been discovered and the convergence property of the alternative minimization has been studied. Another general study has been made by Amari (1985, 1995), in which the differential geometry structure of KL(p, q) has been found and the relation of the alternative minimization to the e-geodesic and m-geodesic has been set up.

In the BYY learning, the matching of the Ying and Yang distributions is also one important ingredient and thus naturally relates to these previous studies.

Being different from these studies, there are a number of new features in the BYY framework.

(a) In addition to the matching of two distributions, BYY learning also consists of another important ingredient, namely, minimizing the complexity or diversification of the system formed by the pair, for structural regularization and model selection. This issue has not been considered by the above previous studies.

(b) In addition to using the Kullback divergence, the general f-divergence D_f(p, q) is also considered as a development of the two-distribution-matching idea, which has also not been studied in the above previous studies. Actually, for a general D_f(p, q), whether the above mentioned nice geometry structures still exist may be an interesting open topic.

(c) Even when pinning on matching the Ying and Yang distributions by the Kullback divergence, the BYY learning theory differs from the above previous studies by focusing on the particular complementary decomposition structure given in eq.(1) and by studying systematically the relationship among the specifically designed four components, as well as their specifications, for various statistical learning purposes. In some special case, e.g., the backward architecture with p_{M_x} = p_0, from eq.(18) we have maximum likelihood learning and the EM algorithm, which is the same result obtained from the incomplete data theory (Dempster, et al, 1977), the alternative minimization (Csiszar & Tusnady, 1984) and the projections along the e-geodesic and m-geodesic (Amari, 1985, 1995). However, the case p_{M_x} = p_{h_x} in eq.(18), and also many other cases of the three architectures in Sec.3, have not been systematically explored by the above previous studies, to our best knowledge.

In other words, though all the three involve the basic information-theoretic idea of matching two distributions by the Kullback divergence, the studies are made from different perspectives. Csiszar & Tusnady (1984) and Amari (1985, 1995) both consider the general properties of KL(p, q) for general p, q, one focusing on the general geometry structure and one on the differential geometry structure. However, the matching of the Ying and Yang distributions by the BYY learning considers the statistical-learning-aimed structure eq.(1) and its inner relationships among components. It may be interesting to explore how to use the elegant general tools of Csiszar & Tusnady (1984) and Amari (1985, 1995) to study the specific structure of the BYY learning.

Relations to Helmholtz machine learning
The spirit of considering both the bottom-up and top-down connections in learning can also be traced back for decades, and has been used in many learning models. Among them, the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995) is one that is also made from the statistical viewpoint. As shown in Xu (1998b), for a specific design on the bi-directional architecture, the empirical learning eq.(8) with D_{f_0^-}(M) given by eq.(14) will reduce to Helmholtz machine learning of a one-hidden-layer structure.

Though both the Helmholtz machine and BYY learning explore the above bi-directional spirit from the statistical viewpoint, there are a number of different features in BYY learning, which are summarized into the following three aspects:

(i) Only a special design in the above discussed case (c) relates to the Helmholtz machine, while the above cases (a)&(b) are not considered in the Helmholtz machine.

(ii) Pinning on the case (c), BYY learning is a general framework that considers the complementary structure eq.(1) by systematically studying various designs, relationships and specifications of the four components for different learning purposes, while in the Helmholtz machine the matching of the two joint distributions in the complementary structure eq.(1) has not been explicitly and systematically considered. Instead, the Helmholtz machine learning principle considers maximum likelihood learning on the marginal distribution of x by the top-down passage, together with matching the two distributions on x → y, with one directly implemented by the bottom-up passage and the other indirectly induced from the top-down passage, respectively. The principle is only equivalent to the empirical learning eq.(8) on the bi-directional architecture with D_{f_0^-}(M) given by eq.(14), but equivalent neither to the smoothed learning on the bi-directional architecture nor to various other cases on the forward and backward architectures.

(iii) Even when further pinning on the above mentioned special case where the learning principles of the Helmholtz machine and BYY learning coincide, BYY learning considers several designs on p_{M_{y|x}}, p_{M_y}, p_{M_{x|y}} for different learning purposes, while the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995) considers mainly one special design of layered structures with binary representation in each layer, focusing on developing efficient learning algorithms such as the wake-sleep algorithm and the mean field algorithm.

REFERENCES
Amari, S. (1985), Differential Geometry Methods in Statistics, Lecture Notes in Statistics 28, Springer.
Amari, S. (1995), "Information geometry of the EM and em algorithms for neural networks", Neural Networks 8, No.9, 1379-1408.
Csiszar, I., & Tusnady, G. (1984), "Information geometry and alternating minimization procedures", Statistics and Decisions, Supplementary Issue, No.1, 205-237.
Dempster, A.P., et al (1977), "Maximum likelihood from incomplete data via the EM algorithm", J. of Royal Statistical Society, B39, 1-38.
Devroye, L., et al (1996), A Probabilistic Theory of Pattern Recognition, Springer, 1996.
Xu, L. (1999a), "Further results on BYY data smoothing learning", in the same Proc. IJCNN99.
Xu, L. (1999b), "Data mining, unsupervised learning and Bayesian Ying-Yang theory", in the same Proc. IJCNN99.
Xu, L. (1999c), "Temporal Bayesian Ying-Yang dependence reduction, blind source separation, and principal independent components", in the same Proc. IJCNN99.
Xu, L. (1999d), "Temporal Bayesian Ying-Yang Learning and Its Applications to Extended Kalman Filtering, Hidden Markov Model, and Sensor-Motor Integration", in the same Proc. IJCNN99.
Xu, L. (1998b), "Bayesian Kullback Ying-Yang dependence reduction theory", Neurocomputing 22, No.1-3, 81-112.
Xu, L. (1997a), "Bayesian Ying-Yang machine, clustering and number of clusters", Pattern Recognition Letters 18, No.11-13, 1167-1178.
Xu, L. (1995&96), "A unified learning scheme: Bayesian-Kullback YING-YANG machine", Advances in Neural Information Processing Systems 8, eds. D.S. Touretzky, et al, MIT Press, 1996, 444-450. A part of its preliminary version in Proc. ICONIP95, Peking, Oct 30 - Nov 3, 1995, 977-988.
