case of $M_a$, for each $a \in \{x|y, y|x, y\}$ we may also have a set of models $\{M_a^{(n_a)}\}$ that share a same structure but in different scales, such that $\mathcal{P}_{M_a^{(n_a)}} \subset \mathcal{P}_{M_a^{(n_a+1)}}$, with $\mathcal{P}_{M_a^{(n_a)}} = \{p(a|\theta_a^{(n_a)}) : \text{for all } \theta_a^{(n_a)} \in \Theta_a^{(n_a)} \text{ under } M_a^{(n_a)}\}$.

The designs on $p_{M_{y|x}}, p_{M_x}, p_{M_{x|y}}, p_{M_y}$ are considered coordinately according to implementation facility, structural constraints and a priori knowledge, subject to two basic requirements. The first is that the designs should be consistent with the representation form of $y$, which is prespecified according to the nature of the perception tasks. The second requirement is that a combination of the designed structures of $p_{M_{y|x}}, p_{M_{x|y}}, p_{M_y}$ should not lead to a Ying-Yang system with structural indeterminacy, i.e., one whose subsequent learning problem becomes either trivial or indefinite. We encounter structural indeterminacy when any pair among $M_{y|x}, M_{x|y}, M_y$ is free of model. Thus, at most one of $M_{y|x}, M_{x|y}, M_y$ can be free of model in a Ying-Yang system.

(3) Parameter learning. We use $M = \{M_a : a \in \{x, x|y, y|x, y\}\}$ to denote a designed model structure, with the implication that all the scales $k$ and $n_a$, $a \in \{x, x|y, y|x, y\}$, are prefixed in a certain way. Then we use $\theta = \{\theta_a : a \in \{x, x|y, y|x, y\}\}$ to denote all the unknown parameters in $M$. Particularly, we use parameter learning to refer to the task of determining $\theta$ at a specific value $\theta^*$, and correspondingly use $M^*$ to denote a specification of $M$ by $\theta^*$.

(4) Model selection. It refers to the task of selecting a particular $j^*$ among a set of given models $\{M^{(j)}\}$. The set $\{M^{(j)}\}$ can be obtained from a design of the same structure, with each $j$ corresponding to a particular setting of the scales $k$ and $n_a$, $a \in \{x, x|y, y|x, y\}$. We use representation scale selection to refer to the specific case that $j$ is obtained by enumerating $k$ only, with all $n_a$, $a \in \{x, x|y, y|x, y\}$, prefixed in a certain way. Similarly, we use structural scale selection to refer to the specific case that $j$ is obtained by enumerating $n_a$, $a \in \{x, x|y, y|x, y\}$, under a given $k$. Sometimes we can even get the set $\{M^{(j)}\}$ from a same structure of fixed scales by enumerating only a number of specific values that $\theta$ takes.

2. BYY learning theory: fundamentals

We use "Ying-Yang harmony" as a fundamental principle for building a learning theory in the BYY learning system. Specifically, this theory contains two key points:

(1) Model design is made under a general framework that consists of a Ying model $p_{M_2} = p_{M_{x|y}} p_{M_y}$ and a Yang model $p_{M_1} = p_{M_{y|x}} p_{M_x}$, with the four components $p_{M_{y|x}}, p_{M_x}, p_{M_{x|y}}, p_{M_y}$ designed as in Sec.1.

(2) Parameter learning and model selection are made to let the Ying model $p_{M_2}$ and the Yang model $p_{M_1}$ be in best harmony, in the sense that both the mismatch between the two models and the diversification of the resulting Ying-Yang system are minimized.

Mathematically, we need a functional $H(M)$ to measure the degree of harmony between $p_{M_1}$ and $p_{M_2}$. We start by considering the functional $-J_2$ given in Xu (1997) and rewrite it into the general form
$$H(M) = Q(p_{M_1}, p_{M_2}), \quad Q(p, q) = \int p(u)\ln q(u)\,du. \qquad (3)$$
To get better insight, we further decompose $H(M)$ into the following equivalent form:
$$-H(M) = D_f(M) + D_I(M), \quad D_I(M) = -Q(p_{M_1}, p_{M_1}), \qquad (4)$$
$$D_f(M) = D_f(p_{M_1}, p_{M_2}), \quad D_f(p, q) = \int p(u)\ln\frac{p(u)}{q(u)}\,du \ge 0.$$
For brevity, the notations $Q(p,q)$ and $D_f(p,q)$ will be used frequently throughout the whole paper. In fact, the above $D_f(p,q)$ is the well-known Kullback divergence.

Minimizing $-H(M)$ consists of minimizing both $D_f(M)$ and $D_I(M)$. $D_f(M)$ describes the mismatch between the Ying and Yang models and reaches its minimum when $p_{M_1} = p_{M_2}$. Minimizing $D_f(M)$ alone will indifferently take the minimum zero for different specifications $M^{*(r)} = \{M_1^{*(r)}, M_2^{*(r)}\}$, $r = 1, \ldots, r_M$, that all satisfy $p_{M_1^{*(r)}}(x,y) = p_{M_2^{*(r)}}(x,y)$. This problem is handled by minimizing $D_I(M)$, which here is the entropy of the Yang model only, i.e., the system's diversification is observed from the perspective of the Yang model. For a distribution, the larger its entropy, the more the distribution is diversified. As will be introduced in Sec.4, we can also use the entropy of the Ying model as $D_I(M)$. Particularly, when the mismatch disappears at $p_{M_1} = p_{M_2}$, the two choices become an identical expression of the system's diversification on $X, Y$.

We observe from eq.(3) that $H(M)$ is concave with respect to $p_{M_2}$ but only linear in $p_{M_1}$, so it is not well defined to maximize $H(M)$ with respect to $p_{M_1}$ directly. For this reason, we maximize $H(M)$ stepwise, by either Learning procedure I or Learning procedure II, described as follows.

Learning procedure I
It consists of parameter learning and model selection.
• Parameter learning. It is made by
$$D_f(M^*) = \min_\theta D_f(M), \quad \text{or} \quad \theta^* = \arg\min_\theta D_f(M). \qquad (5)$$
It can be solved by an iterative algorithm that updates each $\theta_a$, $a \in \{x, x|y, y|x, y\}$, to minimize $D_f(M)$ with the other parameters fixed, alternately.
When $p_{M_x} = p_{h_x}$ by eq.(2) is used, we have
$$\min_{\theta^-} D_f(M), \quad \theta^- = \{\theta_y, \theta_{y|x}, \theta_{x|y}\}, \quad D_f(M) = D_f^-(M) + Q(p_{h_x}, p_{h_x}), \qquad (6)$$
where $Q(p_{h_x}, p_{h_x})$ is irrelevant to $\theta^-$. At a fixed $h_x$, we equivalently have
$$\min_{\theta^-} D_f^-(M), \quad D_f^-(M) = \int p_{h_x}(x)\,R(x)\,dx, \quad R(x) = \int p_{M_{y|x}}(y|x)\ln\frac{p_{M_{y|x}}(y|x)}{p_{M_{x|y}}(x|y)\,p_{M_y}(y)}\,dy. \qquad (7)$$
Particularly, when $p_0(x)$ is used as $p_{M_x}$, we have
$$\min_{\theta^-} D_f^0(M), \quad D_f^0(M) = \frac{1}{N}\sum_{i=1}^{N} R(x_i). \qquad (8)$$
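To make eqs.(3) and (4) concrete, the following small numerical sketch (our own illustration, not part of the original derivation; the grid size and the random distributions are arbitrary) evaluates $Q(p,q)$ and $D_f(p,q)$ and checks the decomposition $-H(M) = D_f(M) + D_I(M)$ for two discretized joint distributions standing in for the Yang model $p_{M_1}$ and the Ying model $p_{M_2}$.

```python
import numpy as np

def Q(p, q):
    """Cross term Q(p, q) = sum_u p(u) ln q(u), the discrete analogue of eq.(3)."""
    return float(np.sum(p * np.log(q)))

def D_f(p, q):
    """Kullback divergence D_f(p, q) = sum_u p(u) ln(p(u)/q(u)) >= 0."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
p1 = rng.random((4, 3)); p1 /= p1.sum()   # stand-in for the Yang joint p_M1(x, y)
p2 = rng.random((4, 3)); p2 /= p2.sum()   # stand-in for the Ying joint p_M2(x, y)

H  = Q(p1, p2)        # harmony measure, eq.(3)
Df = D_f(p1, p2)      # mismatch between Yang and Ying
DI = -Q(p1, p1)       # diversification (entropy of the Yang model)

assert abs(-H - (Df + DI)) < 1e-12   # decomposition of eq.(4)
print(f"H = {H:.4f}, D_f = {Df:.4f}, D_I = {DI:.4f}")
```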
In this case, eq.(7) reduces to the conventional empirical learning that is made directly on $D_x = \{x_i\}_{i=1}^N$.

We call eq.(6) smoothed learning, because a best $h_x^*$ is searched such that $p_{h_x^*}$ by eq.(2) can smooth out the disturbances in $p_0$. Thus, the smoothed learning can be regarded as a regularization of empirical learning. This point can be better understood by considering the gaussian kernel $K_{h_x}(x, x_i) = G(x; x_i, h_x^2 I)$. In this case, $p_{h_x^*}(x) = \int p_0(x')\,G(x - x'; 0, h_x^2 I)\,dx'$ is the distribution of $x = x' + \varepsilon$ for two independent variables, $x'$ from $p_0$ and $\varepsilon$ from $G(\varepsilon; 0, h_x^2 I)$. In other words, eq.(7) is equivalent to eq.(8) on a noise-blurred data set of a large enough size $N'$, obtained by adding noise samples into $D_x$. It is well understood in the literature that adding noise into the data is a type of regularization.
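The noise-blurring argument above can be mimicked directly in code. The sketch below is our own illustration (the function name, sample counts and the value of $h$ are arbitrary): it builds the enlarged data set of samples $x = x' + \varepsilon$, $\varepsilon \sim G(0, h^2 I)$, on which the empirical form eq.(8) approximates the smoothed form eq.(7).

```python
import numpy as np

def noise_blurred_dataset(X, h, copies=10, seed=0):
    """Replace each sample x_i by `copies` noisy versions x_i + e with
    e ~ G(0, h^2 I), mimicking the smoothed density p_h of eq.(2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    noise = rng.normal(scale=h, size=(copies, n, d))
    return (X[None, :, :] + noise).reshape(copies * n, d)

# usage: D_x stands for the observed data set {x_i}; h plays the role of h_x
D_x = np.random.default_rng(1).normal(size=(100, 2))
D_blurred = noise_blurred_dataset(D_x, h=0.3, copies=20)   # 2000 blurred samples
```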
The regularization role of the smoothed learning can also be understood by analyzing $D_f^-(M)$. From the 2nd-order Taylor expansion (Xu, 1999a), we get
$$D_f^-(M) = D_f^0(M) + 0.5\,h_x^2\,R_H, \quad R_H = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Tr}\,[H_R(x_i)], \qquad (9)$$
where $H_R(x)$ is the Hessian of $R(x)$. That is, the smoothed learning consists of the empirical learning plus a regularization term, weighted by the smoothing parameter $h_x$, that minimizes the trace of the Hessian, a 'volume' or complexity measure of $R(x)$.

A typical implementation of the smoothed learning eq.(5) is to iterate the updating of $h_x$ and $\theta^-$ alternately. That is, with $h_x$ fixed, update $\theta^-$ to reduce $D_f^-(M)$; with $\theta^-$ fixed, update $h_x$ to reduce $D_f(M)$. The detailed methods are referred to Xu (1999a).
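As a rough sketch of this alternation (ours, with toy stand-in objectives; the real $D_f^-(M)$ and $D_f(M)$ of eqs.(6)-(7) depend on the chosen architecture), the loop below updates $\theta^-$ with $h_x$ fixed and then $h_x$ with $\theta^-$ fixed, using numerical gradients.

```python
import numpy as np

def alternate_smoothing(df_smooth, df_total, theta0, h0,
                        steps=200, lr_theta=0.05, lr_h=0.01, eps=1e-5):
    """With h fixed, step theta downhill on D_f^-(M); with theta fixed,
    step h downhill on D_f(M). Gradients are finite differences."""
    theta, h = np.asarray(theta0, dtype=float), float(h0)
    for _ in range(steps):
        g = np.array([(df_smooth(theta + eps * e, h) -
                       df_smooth(theta - eps * e, h)) / (2 * eps)
                      for e in np.eye(theta.size)])
        theta = theta - lr_theta * g                     # theta^- step at fixed h_x
        gh = (df_total(theta, h + eps) - df_total(theta, h - eps)) / (2 * eps)
        h = max(1e-4, h - lr_h * gh)                     # h_x step at fixed theta^-
    return theta, h

# toy stand-ins for the two objectives, just to exercise the loop
df_s = lambda th, h: float(np.sum((th - 1.0) ** 2)) + 0.1 * h
df_t = lambda th, h: df_s(th, h) + (h - 0.5) ** 2
theta_star, h_star = alternate_smoothing(df_s, df_t, theta0=[0.0, 0.0], h0=1.0)
```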
• Model selection. Given a set of models $\{M^{*(j)}\}$, $j = 1, \ldots, N_M$, with their parameters specified after parameter learning, we select the best one by
$$j^* = \arg\min_j J(j), \quad J(j) = -H(M^{*(j)}), \qquad (10)$$
where $H(M)$ is given by eq.(3). When $p_{M_x} = p_0(x)$, we have
$$J(j) = -\frac{1}{N}\sum_{i=1}^{N}\int p_{M_{y|x}}(y|x_i)\ln\big[p_{M_{x|y}}(x_i|y)\,p_{M_y}(y)\big]\,dy. \qquad (11)$$

Learning procedure II
First, we divide $\theta$ into two subsets $\theta'$ and $\theta - \{\theta'\}$ and minimize $D_f(M)$ with respect to $\theta'$, which results in a constraint $\partial D_f(M)/\partial\theta' = 0$, shortly denoted by $F(\theta) = 0$. Second, we do parameter learning by minimizing $-H(M)$ with respect to $\theta$ under this constraint. Third, we do model selection by eq.(10). In this case, we have
• Harmony-oriented parameter learning, i.e.,
$$\min_\theta\,[-H(M)] \quad \text{subject to } F(\theta) = 0. \qquad (12)$$

3. Three typical architectures

Bi-directional architecture
We have the parametric models
$$p_{M_{y|x}} = p(y|x, \theta_{y|x}, n_{y|x}), \quad p_{M_{x|y}} = p(x|y, \theta_{x|y}, n_{x|y}), \qquad (13)$$
where the structural scales $n_{y|x}, n_{x|y}$ are usually prefixed and thus omitted. We can implement either the smoothed learning or the empirical learning by eq.(6) or eq.(8). For more insight we rewrite $D_f^-(M)$, $D_f^0(M)$ in eqs.(7)&(8) as
$$D_f^-(M) = \int p_{h_x}(x)\,D_f(x)\,dx, \quad D_f^0(M) = \frac{1}{N}\sum_{i=1}^{N} D_f(x_i),$$
$$D_f(x) = \int p_{M_{y|x}}(y|x)\ln\frac{p_{M_{y|x}}(y|x)}{p_{M^*_{y|x}}(y|x)}\,dy - \ln p_{M^*_x}(x). \qquad (14)$$
In Sec.5 we will discuss the relation of this $D_f^0(M)$ to the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995). In eq.(14), $p_{M^*_{y|x}}$ and $p_{M^*_x}$ are contributed by $M_2$ as the effective counterparts of $p_{M_{y|x}}$, $p_{M_x}$ of $M_1$, given as follows:
$$p_{M^*_x}(x) = \int p_{M_{x|y}}(x|y)\,p_{M_y}(y)\,dy, \quad p_{M^*_{y|x}}(y|x) = \frac{p_{M_{x|y}}(x|y)\,p_{M_y}(y)}{p_{M^*_x}(x)}. \qquad (15)$$
After parameter learning, we can use eq.(10) for model selection. However, the learning eq.(12) is usually difficult to implement, since it is not easy to get $F(\theta) = 0$.

Backward architecture
We let $p_{M_{x|y}}$ be parametric as in eq.(13), while $p_{M_{y|x}}$ is either free of model with no structural constraint, i.e., $p_{M_{y|x}} = p(y|x)$, $p(y|x) \in \mathcal{P}_{y|x}$, or a parametric model of large freedom with $\mathcal{P}_{M_{y|x}} = \{p_{M^*_{y|x}}(y|x) : \text{all } p_{M_{x|y}} \in \mathcal{P}_{M_{x|y}},\ p_{M_y} \in \mathcal{P}_{M_y}\}$ as a subset of $\mathcal{P}_{y|x}$. On this architecture, the minimization of $D_f(M)$ with respect to $p_{M_{y|x}}$ results in $p_{M_{y|x}} = p_{M^*_{y|x}}$ by eq.(15), and thus
$$D_f(M) = D_f(p_{M_x}, p_{M^*_x}), \qquad (16)$$
which can be implemented either directly by a gradient descent approach or by iterating two steps:
Step 1: with $\theta_{x|y}, \theta_y$ fixed, update $h_x$ to reduce $D_f(p_{M_x}, p_{M^*_x})$, and get $p_{M^*_{y|x}}$ by eq.(15); (17)
Step 2: with $h_x$ fixed and letting $p_{M_{y|x}} = p_{M^*_{y|x}}$ be fixed, update $\theta_{x|y}, \theta_y$ to reduce $-H(M)$.
With $p_{M_x} = p_0(x)$, this iteration is exactly the Generalized Expectation Maximization (GEM) algorithm for ML learning (Dempster, et al, 1977). The smoothed learning eq.(6) with $\theta^- = \{\theta_y, \theta_{x|y}\}$ and $D_f^-(M)$ given by eq.(18) is a regularized extension of the ML learning, which can be seen explicitly in the case of a gaussian kernel from eq.(9) with $R_H$ derived as shown in Xu (1999a).
Also, putting $p_{M_{y|x}} = p_{M^*_{y|x}}$ into eq.(3), we can get
$$H(M) = -D_f(p_{M^*_x}, p_{M_x}) + \int p_{M_x}(x)\,Q(p_{M^*_{y|x}}, p_{M_{y|x}})\,dx, \quad \text{or}$$
$$H(M) = \int p_{M^*_x}(x)\,Q(p_{M^*_{y|x}}, p_{M^*_{y|x}})\,dx + Q(p_{M^*_x}, p_{M^*_x}) - D_f(p_{M^*_x}, p_{M_x}). \qquad (24)$$
When $p_{M_x} = p_{h_x}$ by eq.(2) is used in $p_{M^*_x}$ by eq.(21), the smoothed learning eq.(6) with $\theta^- = \{\theta_y, \theta_{x|y}\}$ is a regularized extension of the empirical learning with $p_{M^*_x}$ by eq.(23). Again, it can be seen clearly in the case of a gaussian kernel from eq.(9) with $R_H$ derived in Xu (1999a).

Obviously, $D_I(p) \le 0$ and it reaches its largest value 0 when $p(x) = U(x)$. The smaller $D_I(p)$ is, the less diversified $p(x)$ is. The two definitions become the same for a symmetrical $D_f(p,q) = D_f(q,p)$ and differ when $D_f(p,q) \ne D_f(q,p)$.
When $f(z) = z\ln z$ and $X$ is a finite support, with $u_s$ given by the first choice in eq.(27), we have
$$D_I(p) = \begin{cases} -Q(p,p) - \ln V(X), & \text{for Definition 1},\\ Q(u_s, p) + \ln V(X), & \text{for Definition 2}, \end{cases} \qquad (29)$$
where $V(X)$ is the constant volume of $X$ and usually can be ignored. Thus, $D_I(p)$ by Definition 1 returns to the entropy, which is consistent with eq.(4), while $D_I(p)$ by Definition 2 is equivalent to $-D_f(u_s, p)$.
On the joint support $Z = X \times Y$, the most diversified standard $u_{s,y}$ can be one of two choices:
(a) let $z = [x, y]$ with $u(z)$ given by eq.(27); (b) $u_{s,y} = u_y\,p_{M_x}(x)$, with $u_y$ given by eq.(27). (30)
In summary, we extend the definition of the harmony degree by eq.(4) into a general one.
Up to now we have discussed only the case that $p_{M_1}$ is in the first position, i.e., it acts as the measure that dominates the integrals. So we call all these cases BYY Yang-dominated learning. Similarly, we have BYY Ying-dominated learning by switching the positions of $p_{M_1}, p_{M_2}$ in $H(M)$ and $D_f(M)$ by eq.(3), eq.(4) and eq.(31), and replacing $p_{M_1}$ with $p_{M_2}$ in $D_I(M)$ in eq.(4) and eq.(31). Specifically, for the backward architecture, we switch $p_{M_x}, p_{M^*_x}$ in $D_f(M)$ by eq.(16) and eq.(32), and use $D_f(p_{M^*_x}, p_{M_x})$ to replace both $D_f(p_{M_x}, p_{M^*_x})$ and $-H(M)$ in eq.(17). Moreover, in eq.(19), we get
$$H(M) = \int p_{M^*_x}(x)\,Q(p_{M^*_{y|x}}, p_{M^*_{y|x}})\,dx + Q(p_{M^*_x}, p_{M^*_x}) - D_f(p_{M^*_x}, p_{M_x}). \qquad (34)$$
For the forward architecture, we switch $p_{M_y}, p_{M^*_y}$ in $D_f(M)$ by eq.(22) and eq.(33). Moreover, we get
$$H(M) = \int p_{M_y}(y)\,Q(p_{M^*_{x|y}}, p_{M_{x|y}})\,dy + Q(p_{M_y}, p_{M_y}) - D_f(p_{M_y}, p_{M^*_y}). \qquad (35)$$
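The Yang-dominated versus Ying-dominated distinction ultimately rests on the asymmetry of the Kullback divergence. A short check (ours; the probability vectors are arbitrary) makes the asymmetry explicit:

```python
import numpy as np

def D_f(p, q):
    """Kullback divergence D_f(p, q) = sum_u p(u) ln(p(u)/q(u))."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # distribution placed in the dominating position
q = np.array([0.4, 0.4, 0.2])
print(D_f(p, q), D_f(q, p))     # generally two different values
```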
BYY learning thus naturally relates to these previous studies. Being different from these studies, there are a number of new features in the BYY framework.

(a) In addition to the matching of two distributions, BYY learning also contains another important ingredient, namely, minimizing the complexity or diversification of the system formed by the pair, for structural regularization and model selection. This issue has not been considered by the above previous studies.

(b) In addition to using the Kullback divergence, the general f-divergence $D_f(p,q)$ is also considered as a development of the two-distribution-matching idea, which has also not been studied in the above previous studies. Actually, for a general $D_f(p,q)$, whether the above-mentioned nice geometry structures still hold may be an interesting open topic.

(c) Even when pinning down to matching the Ying and Yang distributions by the Kullback divergence, the BYY learning theory distinguishes itself from the above previous studies by focusing on the particular complementary decomposition structure given in eq.(1) and by systematically studying the relationship among the four specifically designed components as well as their specifications for various statistical learning purposes. In some special cases, e.g., the backward architecture with $p_{M_x} = p_0$, from eq.(18) we obtain maximum likelihood learning and the EM algorithm, which is the same result obtained from the incomplete data theory (Dempster, et al, 1977), alternating minimization (Csiszar & Tusnady, 1984), and the projections along the e-geodesic and m-geodesic (Amari, 1985, 1995). However, the case $p_{M_x} = p_{h_x}$ in eq.(18) and also many other cases of the three architectures in Sec.3 have not been systematically explored by the above previous studies, to our best knowledge.

In other words, though all three involve the basic information-theoretic idea of matching two distributions by the Kullback divergence, the studies are made from different perspectives. Csiszar & Tusnady (1984) and Amari (1985, 1995) both consider the general properties of $KL(p,q)$ for general $p, q$, one focusing on the general geometry structure and the other on the differential geometry structure. However, the matching of the Ying and Yang distributions in BYY learning considers the statistical-learning-aimed structure eq.(1) and the inner relationship among its components. It may be interesting to explore how to use the elegant general tools of Csiszar & Tusnady (1984) and Amari (1985, 1995) to study the specific structure of BYY learning.

Relations to Helmholtz machine learning

The spirit of considering both the bottom-up and top-down connections in learning can be traced back for decades, and has been used in many learning models. Among them, the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995) is most closely related to BYY learning, and the two differ in the following three aspects:

(i) Only a special design in the above-discussed case (c) relates to the Helmholtz machine, while the above cases (a)&(b) are not considered in the Helmholtz machine.

(ii) Pinning down to case (c), BYY learning is a general framework that considers the complementary structure eq.(1) by systematically studying various designs, relationships and specifications of the four components for different learning purposes, while in the Helmholtz machine the matching of the two joint distributions in the complementary structure eq.(1) has not been explicitly and systematically considered. Instead, the Helmholtz machine learning principle considers maximum likelihood learning on the marginal distribution of $x$ by the top-down pass, together with matching the two distributions on $x \to y$, with one directly implemented by the bottom-up pass and the other indirectly induced from the top-down pass, respectively. The principle is only equivalent to the empirical learning eq.(8) on the bi-directional architecture with $D_f^0(M)$ given by eq.(14), but equivalent to neither the smoothed learning on the bi-directional architecture nor various other cases on the forward and backward architectures.

(iii) Even when further pinning down to the above-mentioned special case in which the learning principles of the Helmholtz machine and BYY learning coincide, BYY learning considers several designs on $p_{M_{y|x}}, p_{M_y}, p_{M_{x|y}}$ for different learning purposes, while the Helmholtz machine (Hinton, et al, 1995; Dayan, et al, 1995) considers mainly one special design of layered structures with binary representation in each layer, focusing on developing efficient learning algorithms such as the wake-sleep algorithm and the mean field algorithm.

REFERENCES
Amari, S. (1985), Differential Geometry Methods in Statistics, Lecture Notes in Statistics 28, Springer.
Amari, S. (1995), "Information geometry of the EM and em algorithms for neural networks", Neural Networks 8, No.9, 1379-1408.
Csiszar, I., & Tusnady, G. (1984), "Information geometry and alternating minimization procedures", Statistics and Decisions, Supplementary Issue, No.1, 205-237.
Dempster, A.P., et al (1977), "Maximum likelihood from incomplete data via the EM algorithm", J. of Royal Statistical Society, B39, 1-38.
Devroye, L., et al (1996), A Probabilistic Theory of Pattern Recognition, Springer.
Xu, L. (1999a), "Further results on BYY data smoothing learning", in the same Proc. IJCNN99.
Xu, L. (1999b), "Data mining, unsupervised learning and Bayesian Ying-Yang theory", in the same Proc. IJCNN99.
Xu, L. (1999c), "Temporal Bayesian Ying-Yang dependence reduction, blind source separation, and principal independent components", in the same Proc. IJCNN99.
Xu, L. (1999d), "Temporal Bayesian Ying-Yang learning and its applications to extended Kalman filtering, hidden ..."