
Learning the Structure of Dynamic Probabilistic Networks

Nir Friedman Kevin Murphy Stuart Russell


Computer Science Division, U. of California, Berkeley, CA 94720

{nir,murphyk,russell}@cs.berkeley.edu

Abstract

Dynamic probabilistic networks are a compact representation of complex stochastic processes. In this paper we examine how to learn the structure of a DPN from data. We extend structure scoring rules for standard probabilistic networks to the dynamic case, and show how to search for structure when some of the variables are hidden. Finally, we examine two applications where such a technology might be useful: predicting and classifying dynamic behaviors, and learning causal orderings in biological processes. We provide empirical results that demonstrate the applicability of our methods in both domains.

1 INTRODUCTION

Probabilistic networks (PNs), also known as Bayesian networks or belief networks, are already well-established as representations of domains involving uncertain relations among several random variables. Somewhat less well-established, but perhaps of equal importance, are dynamic probabilistic networks (DPNs), which model the stochastic evolution of a set of random variables over time [5]. DPNs have significant advantages over competing representations such as Kalman filters, which handle only unimodal posterior distributions and linear models, and hidden Markov models (HMMs), whose parameterization grows exponentially with the number of state variables. For example, [31] show that DPNs can outperform HMMs on standard speech recognition tasks.

PNs and DPNs are defined by a graphical structure and a set of parameters, which together specify a joint distribution over the random variables. Algorithms for learning the parameters of PNs [1, 21] and DPNs [1, 14] are becoming widely used. These algorithms typically use either gradient methods or EM, and can handle hidden variables and missing values.

Algorithms for learning the graphical structure, on the other hand, have until recently been restricted to networks with complete data, i.e., where the values of all variables are specified in each training case [4, 15]. Friedman [10, 11] has developed the Structural EM (SEM) algorithm for learning PN structure from data with hidden variables and missing values. SEM combines structural and parametric modification within a single EM process, and appears to be substantially more effective than previous approaches based on parametric EM operating within an outer-loop structural search. SEM can be shown to find local optima defined by a scoring function that combines the likelihood of the data with a structural penalty that discourages overly complex networks. This property holds for both the Bayesian Information Criterion (BIC) score [28], a variant of Minimum Description Length (MDL) scoring, and the BDe score [15], a Bayesian metric that uses an explicit prior over networks.

In this paper, we extend the BIC and BDe scores to handle the problem of learning DPN structure from complete data. More importantly, we extend the SEM algorithm to learn DPNs from incomplete data with both scores. Given partial observations of a set of random variables over time, the algorithm constructs a DPN (possibly including additional hidden variables) that fits the observations as well as possible. The addition of hidden variables in DPNs is particularly important, as many processes (human decision making, speech generation, and disease processes, for example) are only partially observable.

We begin with formal definitions of PNs and DPNs. Section 3 discusses the complete-data case, developing scores for DPNs so that existing algorithms for learning PN structures can be applied directly. In Section 4, we handle the case of incomplete data and show how to extend SEM to DPNs. Computation of the necessary sufficient statistics is highlighted as a principal bottleneck. Section 5 describes two applications: learning simple models of human driving behavior from videotapes, and learning models of biological processes from very sparse observations.

2 PRELIMINARIES

We start with a short review of probabilistic networks and dynamic probabilistic networks.

We will be concerned with distributions over sets of discrete random variables, where each variable $X_i$ may take on values from a finite set, denoted by $\mathrm{Val}(X_i)$. We denote the size of $\mathrm{Val}(X_i)$ by $\|X_i\|$. We use capital letters, such as $X, Y, Z$, for variable names and lowercase letters $x, y, z$ to denote specific values taken by those variables. Sets of variables are denoted by boldface capital letters $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, with sets of values denoted by boldface lowercase letters $\mathbf{x}, \mathbf{y}, \mathbf{z}$. For a given network, we will use $X_1, \ldots, X_n$ to denote the variables of the network in topological order, and $P$ to denote a joint probability distribution over the variables in $\mathbf{X}$.
A probabilistic network (PN) is an annotated directed acyclic graph that encodes a joint probability distribution over $\mathbf{X}$. Formally, a PN for $\mathbf{X}$ is a pair $B = (G, \Theta)$. The first component, $G$, is a directed acyclic graph whose vertices correspond to the random variables $X_1, \ldots, X_n$ and that encodes the following set of conditional independence assumptions: each variable $X_i$ is independent of its non-descendants given its parents $\mathrm{Pa}(X_i)$ in $G$. The second component, $\Theta$, represents the set of parameters that quantifies the network. In the simplest case, it contains a parameter $\theta_{x_i \mid \mathrm{pa}_i} = \Pr(x_i \mid \mathrm{pa}_i)$ for each possible value $x_i$ of $X_i$ and each possible set of values $\mathrm{pa}_i$ of $\mathrm{Pa}(X_i)$. Each conditional probability distribution can be represented as a table, called a CPT (conditional probability table). Representations which require fewer parameters, such as noisy-ORs [27] or trees [12], are also possible (indeed, we use them in the experimental results section), but, for simplicity of notation, we shall stick to the CPT case.

Given $G$ and $\Theta$, a PN $B$ defines a unique joint probability distribution over $\mathbf{X}$ given by:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$

A PN describes a probability distribution over a fixed set of variables. DPNs extend this representation to model temporal processes. For simplicity, we assume that changes occur between discrete time points that are indexed by the non-negative integers. We assume that $\mathbf{X} = \{X_1, \ldots, X_n\}$ is the set of attributes that the process changes. $X_i[t]$ is the random variable that denotes the value of the attribute $X_i$ at time $t$, and $\mathbf{X}[t]$ is the set of random variables $X_i[t]$.

To represent beliefs about the possible trajectories of the process, we need a probability distribution over the random variables $\mathbf{X}[0] \cup \mathbf{X}[1] \cup \mathbf{X}[2] \cup \cdots$. Of course, such a distribution can be extremely complex. In this paper, we assume the process is Markovian in $\mathbf{X}$, i.e., $P(\mathbf{X}[t+1] \mid \mathbf{X}[0], \ldots, \mathbf{X}[t]) = P(\mathbf{X}[t+1] \mid \mathbf{X}[t])$. We also assume that the process is stationary, i.e., that the transition probability $P(\mathbf{X}[t+1] \mid \mathbf{X}[t])$ is independent of $t$.

Given these assumptions, a DPN that represents the joint distribution over all possible trajectories of a process consists of two parts:

- a prior network $B_0$ that specifies a distribution over initial states $\mathbf{X}[0]$; and
- a transition network $B_\to$ over the variables $\mathbf{X}[0] \cup \mathbf{X}[1]$ that is taken to specify the transition probability $P(\mathbf{X}[t+1] \mid \mathbf{X}[t])$ for all $t$.

[Figure 1: (a) A prior network and transition network defining a DPN for the attributes $X_1$, $X_2$, $X_3$. (b) The corresponding "unrolled" network.]

Figure 1(a) shows a simple example.^1 In the transition network (but not the prior network), the variables in $\mathbf{X}[0]$ have no parents. The transition probability implied by such a network is:

$$P_\to(\mathbf{x}[1] \mid \mathbf{x}[0]) = \prod_{i=1}^{n} P_\to(x_i[1] \mid \mathrm{pa}_i[1])$$

Footnote 1: Some authors define DPNs using just the transition network, assuming that all "slices," including slice 0, have the same structure. In general, however, the prior distribution on $\mathbf{X}[0]$ can have a quite different independence structure from other slices, since it may represent either the way in which the process is initialized or the dependencies induced among the variables in $\mathbf{X}[0]$ by the unobserved portion of the process before $t = 0$. For example, if $t = 0$ represents an arbitrary starting point taken from an infinite process, then it is reasonable to use a dependency structure for $\mathbf{X}[0]$ that reflects the stationary distribution of the Markov chain defined by the transition network. One can combine the prior and transition networks into a single two-slice network, but, because the learning processes for the prior and transition models are distinct, we have chosen to represent them separately.

A DPN defined by a pair $(B_0, B_\to)$ corresponds to a semi-infinite network over the variables $\mathbf{X}[0], \mathbf{X}[1], \ldots$. In practice, we reason only about a finite interval $[0, T]$. To do this, we can notionally "unroll" the DPN structure into a PN over $\mathbf{X}[0], \ldots, \mathbf{X}[T]$. In slice 0, the parents of $X_i[0]$ are those specified in the prior network $B_0$; in slice $t+1$, the parents of $X_i[t+1]$ are those nodes in slices $t$ and $t+1$ corresponding to the parents of $X_i[1]$ in $B_\to$. We copy the conditional distributions for these variables in a similar manner. Figure 1(b) shows the result of unrolling the network in Figure 1(a) for 3 time slices. Given a DPN model, the joint distribution over $\mathbf{X}[0], \ldots, \mathbf{X}[T]$ is

$$P(\mathbf{x}[0], \ldots, \mathbf{x}[T]) = P_0(\mathbf{x}[0]) \prod_{t=0}^{T-1} P_\to(\mathbf{x}[t+1] \mid \mathbf{x}[t]) \qquad (1)$$

where $P_\to(\mathbf{x}[t+1] \mid \mathbf{x}[t])$ is obtained in the obvious way from the transition model.
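To make Equation (1) concrete, here is a minimal sketch (ours, not from the paper) of how an unrolled DPN assigns log-probability to a complete trajectory. It uses a toy DPN over a single binary attribute, with the prior and transition distributions hard-coded as dictionaries; all names and numbers are illustrative.

import math

# Toy DPN over one binary attribute X.
# Prior network B_0: distribution over X[0].
prior = {0: 0.8, 1: 0.2}

# Transition network B_->: P(X[t+1] | X[t]), reused for every slice.
transition = {0: {0: 0.9, 1: 0.1},   # row for X[t] = 0
              1: {0: 0.3, 1: 0.7}}   # row for X[t] = 1

def log_likelihood(sequence):
    """Log of Equation (1): log P_0(x[0]) + sum_t log P_->(x[t+1] | x[t])."""
    ll = math.log(prior[sequence[0]])
    for t in range(len(sequence) - 1):
        ll += math.log(transition[sequence[t]][sequence[t + 1]])
    return ll

print(log_likelihood([0, 0, 1, 1]))  # one complete trajectory x[0..3]

With several attributes, each factor would itself factor over the families of the prior or transition network, exactly as in the product above.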
3 LEARNING FROM COMPLETE DATA

In this section we develop the theory for learning DPNs from complete data. We begin with a brief review of how one learns standard PNs. Then we derive the details of the BIC and BDe scores for DPNs, and finally we discuss how these are optimized in the search for good structures. The upshot of this section is that learning DPNs from complete data uses the same techniques as learning PNs from complete data. Learning DPNs is not quite the same as applying PN methods to the unrolled network, however, due to the constraint of repeated structure and repeated parameters.

3.1 LEARNING PNS

The problem of learning a probabilistic network is stated as follows. Given a training set $D$ of instances of $\mathbf{X}$, find a network $B = (G, \Theta)$ that best matches $D$. The notion of "best match" is defined using a scoring function. Several different scoring functions have been proposed in the literature. The most frequently used are the Bayesian Information Criterion [28] and the BDe score [15]. Both of these scores combine the likelihood of the data according to the network, $L(B : D) = \log \Pr(D \mid B)$, with some penalty relating to the complexity of the network. When learning the structure of PNs, a complexity penalty is essential, since the maximum-likelihood network is usually the completely connected network.

The BIC and BDe scores are derived from the posterior probability of the network structure. Let the random variable $G$ range over the possible network structures that might in fact obtain in the real world. Then, using Bayes' rule, the posterior distribution over $G$ is

$$\Pr(G \mid D) \propto \Pr(D \mid G) \Pr(G)$$

where the likelihood of the data given a network structure can be computed by conditioning on the associated network parameters:

$$\Pr(D \mid G) = \int \Pr(D \mid G, \Theta) \Pr(\Theta \mid G) \, d\Theta \qquad (2)$$

Obviously, specifying the parameter priors and evaluating this integral can be difficult.

One approach to avoiding full computation of the integral in (2) is to examine the asymptotic behavior of this term. Given a large number of data points, the posterior probability is insensitive to the choice of prior (assuming that the prior does not give probability zero to any event). Schwarz [28] derives the following asymptotic estimate for well-behaved priors:

$$\log \Pr(D \mid G) = \log \Pr(D \mid G, \hat{\Theta}_G) - \frac{\log N}{2} \, \#G + O(1) \qquad (3)$$

where $\hat{\Theta}_G$ are the parameter settings for $G$ that maximize the likelihood of the data, $N$ is the number of training instances, $\#G$ is the dimension of $G$ (which in the case of complete data is just the number of parameters), and $O(1)$ is a constant term which is independent of $N$ and $G$. The BIC score uses Equation (3) to rank candidate network structures. Notice that it obviates the need for parameter priors, and the prior on structures is reduced to counting parameters.^2

An alternative approach is to evaluate (2) in closed form given a restricted family of priors [3, 4, 15]. Roughly speaking, the prior on the parameters for a given structure $G$ is assumed to factor into independent priors over the parameters for each conditional distribution $P(X_i \mid \mathrm{Pa}(X_i))$. When the data is complete, this implies that the posterior factors in the same way. Thus, each conditional distribution can be updated and scored separately. The score can be computed in closed form if we assume in addition that the prior for each conditional distribution is from a conjugate family, such as the Dirichlet distribution. The details appear below.

Since we might examine a large number of possible network structures, we would like to avoid having to assign a prior distribution over parameters for each possible structure. [15] provide a set of assumptions that allow the parameter prior for all structures to be specified using a single network with Dirichlet priors, together with a single "virtual data count" that describes the confidence in that prior. This approach has the desirable property that the scores of two networks that are equivalent (i.e., describe the same set of independence assumptions) are the same. Finally, to compute the full BDe score, a simple description-length penalty corresponding to $\Pr(G)$ is added in.

Footnote 2: A similar formula arises from the Minimum Description Length (MDL) principle [20].

3.2 BIC SCORE FOR DPNS

We now describe the BIC score for DPNs. Unsurprisingly, the results here mirror the results for PNs.

Throughout this section, we assume that we are given a training set $D$ consisting of $N_{\mathrm{seq}}$ complete observation sequences. The $l$th such sequence has length $N_l$ and specifies values for the variables $\mathbf{x}[0], \ldots, \mathbf{x}[N_l]$. Such a dataset gives us $N_{\mathrm{seq}}$ instances of initial slices, from which we can train $B_0$, and $N = \sum_l N_l$ instances of transitions, from which we can train $B_\to$.

We start by introducing some notation. Let us define

$$\theta^0_{x_i \mid \mathrm{pa}_i} = \Pr(X_i[0] = x_i \mid \mathrm{Pa}(X_i[0]) = \mathrm{pa}_i)$$

and similarly

$$\theta^\to_{x_i \mid \mathrm{pa}_i} = \Pr(X_i[t] = x_i \mid \mathrm{Pa}(X_i[t]) = \mathrm{pa}_i)$$

for $t \geq 1$. We will also need some notation for the sufficient statistics for each family,

$$N^0_{x_i, \mathrm{pa}_i} = \sum_l I(X_i[0] = x_i, \mathrm{Pa}(X_i[0]) = \mathrm{pa}_i \;;\; \mathbf{x}^l)$$

and

$$N^\to_{x_i, \mathrm{pa}_i} = \sum_l \sum_t I(X_i[t] = x_i, \mathrm{Pa}(X_i[t]) = \mathrm{pa}_i \;;\; \mathbf{x}^l)$$

where $I(E \;;\; \mathbf{x}^l)$ is an indicator function which takes on value 1 if the event $E$ occurs in sequence $\mathbf{x}^l$, and 0 otherwise.

Using (1), and rearranging terms, we find that the likelihood function decomposes according to the structure of the DPN, just as with PNs:

$$\Pr(D \mid G, \Theta) = \left[ \prod_i \prod_{\mathrm{pa}_i} \prod_{x_i} \left( \theta^0_{x_i \mid \mathrm{pa}_i} \right)^{N^0_{x_i, \mathrm{pa}_i}} \right] \cdot \left[ \prod_i \prod_{\mathrm{pa}_i} \prod_{x_i} \left( \theta^\to_{x_i \mid \mathrm{pa}_i} \right)^{N^\to_{x_i, \mathrm{pa}_i}} \right] \qquad (4)$$

and hence the log-likelihood is given by

$$L(B : D) = \sum_i \sum_{\mathrm{pa}_i} \sum_{x_i} N^0_{x_i, \mathrm{pa}_i} \log \theta^0_{x_i \mid \mathrm{pa}_i} + \sum_i \sum_{\mathrm{pa}_i} \sum_{x_i} N^\to_{x_i, \mathrm{pa}_i} \log \theta^\to_{x_i \mid \mathrm{pa}_i} \qquad (5)$$

This decomposition facilitates the computation of the BIC and BDe scores in several ways. First, note that the likelihood is expressed as a sum of terms, where each term depends only on the conditional probability of a variable given a particular assignment to its parents. Thus, if we want to find the maximum-likelihood parameters, we can maximize within each family independently. Second, the decomposition implies that we can learn $B_0$ independently of $B_\to$. Finally, we can learn $B_\to$ in exactly the same manner as learning a PN for a set of samples of transitions.

We now make these arguments more precise. Using the standard maximum-likelihood estimate for multinomial distributions, we immediately get the following expression for $\hat{\Theta}$:

$$\hat{\theta}^0_{x_i \mid \mathrm{pa}_i} = \frac{N^0_{x_i, \mathrm{pa}_i}}{\sum_{x_i'} N^0_{x_i', \mathrm{pa}_i}}$$

and similarly for the transition case.
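As an illustration of these definitions (our own construction, not the paper's), the counts $N^0$ and $N^\to$ and the resulting maximum-likelihood estimates can be tallied in a few lines for a toy model in which each variable's only transition parent is its own previous value:

from collections import Counter

# Complete training sequences for one binary attribute.
sequences = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]

N0 = Counter()   # N^0: counts over initial-slice values
Nt = Counter()   # N^->: counts over (parent value, child value) transitions

for seq in sequences:
    N0[seq[0]] += 1
    for t in range(len(seq) - 1):
        Nt[(seq[t], seq[t + 1])] += 1

# Maximum-likelihood estimates: normalize each family's counts.
theta0 = {x: N0[x] / sum(N0.values()) for x in N0}
thetaT = {(pa, x): Nt[(pa, x)] / sum(n for (p, _), n in Nt.items() if p == pa)
          for (pa, x) in Nt}

print(theta0)  # estimate of P_0
print(thetaT)  # estimate of P_->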
In the case of CPTs, the number of parameters in the network is given by $\#G = \#G_0 + \#G_\to$, where

$$\#G_0 = \sum_i \left( \prod_{X_j \in \mathrm{Pa}(X_i[0])} \|X_j\| \right) \left( \|X_i\| - 1 \right)$$

and similarly for the transition case.

Finally, substituting into (3), we find that the BIC score is given by

$$\mathrm{BIC}(G : D) = \mathrm{BIC}_0 + \mathrm{BIC}_\to \qquad (6)$$

where

$$\mathrm{BIC}_0 = \sum_i \sum_{\mathrm{pa}_i} \sum_{x_i} N^0_{x_i, \mathrm{pa}_i} \log \hat{\theta}^0_{x_i \mid \mathrm{pa}_i} - \frac{\log N_{\mathrm{seq}}}{2} \, \#G_0$$

and

$$\mathrm{BIC}_\to = \sum_i \sum_{\mathrm{pa}_i} \sum_{x_i} N^\to_{x_i, \mathrm{pa}_i} \log \hat{\theta}^\to_{x_i \mid \mathrm{pa}_i} - \frac{\log N}{2} \, \#G_\to$$

Note that the penalty for families in the original time slice is a function of the number of sequences $N_{\mathrm{seq}}$, since we only observe that many examples for this part of the model; whereas the penalty for the transition model is a function of $N$, the total number of transitions observed.
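Given the counts, each family's term in Equation (6) is simple arithmetic. A sketch (ours; the counts and the caller-supplied parameter count are illustrative):

import math

def bic_component(counts, num_params, num_instances):
    """Sum of N * log(theta_hat) over one family table, minus the
    (log num_instances / 2) * #params structural penalty."""
    loglik = 0.0
    # Group counts by parent configuration and use ML estimates N / N_pa.
    parents = {}
    for (pa, x), n in counts.items():
        parents.setdefault(pa, {})[x] = n
    for pa, row in parents.items():
        total = sum(row.values())
        for x, n in row.items():
            loglik += n * math.log(n / total)
    return loglik - 0.5 * math.log(num_instances) * num_params

# Transition family from the previous sketch: parent = X[t], child = X[t+1].
Nt = {(0, 0): 3, (0, 1): 3, (1, 1): 3}
print(bic_component(Nt, num_params=2, num_instances=9))  # BIC_-> term, N = 9

BIC_0 would be computed identically from the $N^0$ counts, with $N_{\mathrm{seq}}$ in place of $N$.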
3.3 BDe SCORE FOR DPNS

Recall that to compute the BDe score, we need to evaluate the following integral:

$$\Pr(D \mid G) = \int \Pr(D \mid G, \Theta) \Pr(\Theta \mid G) \, d\Theta$$

The first term inside the integral decomposes as in Equation (4). If we assume that the prior over each conditional probability is independent of the rest, then the prior term also decomposes as

$$\Pr(\Theta \mid G) = \prod_i \prod_{\mathrm{pa}_i} \Pr(\theta^0_{X_i \mid \mathrm{pa}_i}) \cdot \prod_i \prod_{\mathrm{pa}_i} \Pr(\theta^\to_{X_i \mid \mathrm{pa}_i})$$

Inserting this into the preceding equation, we see that the entire expression consists of an integral over a product of independent terms. Hence $\Pr(D \mid G)$ can be written as a product of two integrals,

$$\prod_i \prod_{\mathrm{pa}_i} \int \left[ \prod_{x_i} \left( \theta^0_{x_i \mid \mathrm{pa}_i} \right)^{N^0_{x_i, \mathrm{pa}_i}} \right] \Pr(\theta^0_{X_i \mid \mathrm{pa}_i}) \, d\theta^0_{X_i \mid \mathrm{pa}_i}$$

and similarly for the transition case.

In order to obtain a closed-form solution, we assume Dirichlet priors [6]. A Dirichlet prior for a multinomial distribution of a variable $X$ is specified by a set of hyperparameters $\{N'_x : x \in \mathrm{Val}(X)\}$ as follows:

$$\Pr(\Theta_X) = \mathrm{Dirichlet}(\{N'_x\}) \propto \prod_{x \in \mathrm{Val}(X)} \theta_x^{N'_x - 1}$$

Intuitively, the hyperparameters can be thought of as "pseudo counts", since they play a similar role to the actual counts we derive from the data. Under a Dirichlet prior, the probability of observing a sequence of values of $X$ with counts $N_x$ is

$$\int \prod_x \theta_x^{N_x} \Pr(\Theta_X) \, d\Theta_X = \frac{\Gamma\!\left(\sum_x N'_x\right)}{\Gamma\!\left(\sum_x (N'_x + N_x)\right)} \prod_x \frac{\Gamma(N'_x + N_x)}{\Gamma(N'_x)}$$

where $\Gamma(m) = \int_0^\infty t^{m-1} e^{-t} \, dt$ is the Gamma function, which satisfies the properties $\Gamma(1) = 1$ and $\Gamma(m+1) = m\,\Gamma(m)$.

Returning to the BDe score, let us assume that for each structure $G$, we have chosen the hyperparameters $N'^0_{x_i, \mathrm{pa}_i}$ and $N'^\to_{x_i, \mathrm{pa}_i}$. Then we can rewrite $\Pr(D \mid G)$ as a product of two terms,

$$\prod_i \prod_{\mathrm{pa}_i} \frac{\Gamma\!\left(\sum_{x_i} N'^0_{x_i, \mathrm{pa}_i}\right)}{\Gamma\!\left(\sum_{x_i} \left(N'^0_{x_i, \mathrm{pa}_i} + N^0_{x_i, \mathrm{pa}_i}\right)\right)} \prod_{x_i} \frac{\Gamma\!\left(N'^0_{x_i, \mathrm{pa}_i} + N^0_{x_i, \mathrm{pa}_i}\right)}{\Gamma\!\left(N'^0_{x_i, \mathrm{pa}_i}\right)}$$

and similarly for the transition case.

This still requires us to supply the Dirichlet hyperparameters for each candidate DPN structure. Since the number of possible DPN structures is large, these prior estimates might be hard to assess in practice. Following [15], we can assign all of these given a prior DPN $B' = (B'_0, B'_\to)$ and two equivalent sample sizes $\hat{N}^0$ and $\hat{N}^\to$. Given these components, we assign the Dirichlet weights for $G = (G_0, G_\to)$ as follows:

$$N'^0_{x_i, \mathrm{pa}_i} = \hat{N}^0 \cdot P(X_i[0] = x_i, \mathrm{Pa}(X_i[0]) = \mathrm{pa}_i)$$

$$N'^\to_{x_i, \mathrm{pa}_i} = \hat{N}^\to \cdot P_\to(X_i[1] = x_i, \mathrm{Pa}(X_i[1]) = \mathrm{pa}_i)$$

(Note that the choice of parents here is based on $G$, and might differ from the structure of $B'$.) Intuitively, we can consider the belief in $B'$ as equivalent to having previously experienced $\hat{N}^0$ sequences with $\hat{N}^\to$ transitions.

It is easy to verify that choosing priors in this manner preserves the main property we require: two DPN structures that imply identical independence assumptions will receive the same score. Thus, we claim that our definition is the natural extension of the BDe score to the case of dynamic probabilistic networks.
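In log space, each family's contribution to this product is a difference of log-Gamma terms. The following sketch (ours; it uses SciPy's gammaln, and the hyperparameters and counts are made up) scores a single family:

from scipy.special import gammaln

def log_bde_family(prior_counts, data_counts):
    """log of Gamma(sum N') / Gamma(sum (N' + N)) * prod_x Gamma(N'_x + N_x) / Gamma(N'_x)
    for one (variable, parent-configuration) family."""
    score = gammaln(sum(prior_counts)) - gammaln(sum(prior_counts) + sum(data_counts))
    for n_prime, n in zip(prior_counts, data_counts):
        score += gammaln(n_prime + n) - gammaln(n_prime)
    return score

# Dirichlet hyperparameters N' and observed counts N for a 3-valued child.
print(log_bde_family([1.0, 1.0, 1.0], [5, 2, 0]))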
3.4 LEARNING IN PRACTICE

Both scores we have considered so far have two important properties. First, the score of a DPN can be written as a sum of terms (for the BDe case we look at the logarithm of $\Pr(G \mid D)$), where each term determines the score of a particular choice of parents for a particular variable. Thus, a local change to one family (e.g., arc addition or removal) affects only one of these terms. Second, the term that evaluates $X_i$ given its parents is a function of the appropriate counts (either $N^0$ or $N^\to$) for that family. Thus, by caching these counts we can efficiently evaluate many families.

These two properties can be exploited by hill-climbing search procedures [3, 15] that gradually improve a candidate structure by applying the best arc addition, deletion, or reversal. In the case of DPNs, as opposed to static PNs, we have the additional constraint that the network structure must repeat over time. This reduces the number of options at each point in the search. Moreover, we can search for the best structure for $B_0$ independently of the search for the best structure for $B_\to$.
4 LEARNING FROM INCOMPLETE DATA

We now examine how to learn DPNs from incomplete data. Handling incomplete data is crucial, since in most real-life applications we do not have complete observability of the process we want to model. This means that even if the process is a stationary Markov process, the partial observations we have are not Markovian. As an example, suppose we are tracking a car moving down the highway and we are observing the car's speed, lane, relative distance to other cars, and so on. This example is clearly not Markovian, given the observations. On the other hand, a Markovian model might be reasonable provided we include as part of the state information the driver's goals, e.g., get to the left lane, exit at the next off-ramp, etc. By learning hidden variables, we can capture state information about the process. This allows the model to "remember" additional information about the past and to make better predictions of the future.
The main difficulty with learning from partial observations is that we no longer have the score decomposition properties of (4). This means that the optimal parameter choice in one part of the network depends on the parameter choices in other parts of the network. This problem can be understood better if we notice that once we have partial observability we can no longer talk about counts from the training data. For most events of interest, the counts are not defined, since we do not know the exact values of the variables in question.

The most commonly used method to alleviate this problem is the Expectation-Maximization (EM) algorithm [7, 21]. The E-step of EM uses the currently estimated parameters to complete the data by computing the expected counts. The M-step then re-estimates the maximum-likelihood parameter values as if the expected counts were true observed counts. The central theorem underlying EM's behavior is that each EM cycle is guaranteed to improve the likelihood of the data given the model until it reaches a local maximum.

EM has traditionally been viewed as a method for adjusting the parameters of a fixed model structure. However, the underlying theorem can be generalized to apply to structural as well as parametric modifications. Friedman's Structural EM (SEM) algorithm [10] has the same E-step as EM, completing the data by computing expected counts based on the current structure and parameters. In addition to re-estimating parameters, the M-step of SEM can use the expected counts according to the current structure to evaluate any other candidate structure, essentially performing a complete-data structural search in the inner loop. Friedman shows that for a large family of scoring rules, including the BIC score and BDe score [11], the resulting network must have a higher score than the original. This is true even though the expected counts used in evaluating the new structure are computed using the old structure.
We can extend Friedman's results to the DPN case in the following way. Let $E[\mathrm{BIC}(B_0, B_\to : D^+) : D, \tilde{B}_0, \tilde{B}_\to]$ be the expected BIC score of a DPN $(B_0, B_\to)$, based on all possible completions $D^+$ of the data, i.e., assignments of values to the hidden variables. The expectation is taken with respect to $\Pr(D^+ \mid D, \tilde{B}_0, \tilde{B}_\to)$, i.e., the probability assigned to each completion based on the old DPN $(\tilde{B}_0, \tilde{B}_\to)$. Using Theorem 3.1 of [10], we can prove the following:

Theorem 4.1:

$$\mathrm{BIC}(B_0, B_\to : D) - \mathrm{BIC}(\tilde{B}_0, \tilde{B}_\to : D) \geq E[\mathrm{BIC}(B_0, B_\to : D^+) : D, \tilde{B}_0, \tilde{B}_\to] - E[\mathrm{BIC}(\tilde{B}_0, \tilde{B}_\to : D^+) : D, \tilde{B}_0, \tilde{B}_\to]$$

That is, if we choose a new DPN that has a higher expected score than the expected score of the old DPN, then the true score of the new DPN will also be higher than the true score of the old DPN. Moreover, the difference in the expected scores is a lower bound on the improvement in terms of the actual score we are trying to optimize.

The crucial property of the expected BIC score is that it, too, decomposes into a sum of local terms, as follows. By the linearity of expectation, we can "push" the $E$ operator through the summation signs implicit in (6) to get an equation which involves terms of the following form:

$$E[N^\to_{x_i, \mathrm{pa}_i}] = \sum_l \sum_t E[I(X_i[t] = x_i, \mathrm{Pa}(X_i[t]) = \mathrm{pa}_i \;;\; \mathbf{x}^l)] = \sum_l \sum_t \Pr(X_i[t] = x_i, \mathrm{Pa}(X_i[t]) = \mathrm{pa}_i \mid \mathbf{x}^l)$$

and similarly for $E[N^0_{x_i, \mathrm{pa}_i}]$. These are called the expected sufficient statistics. Hence the key requirement for applying the SEM algorithm in the incomplete-data case is the ability to compute the family probabilities $\Pr(X_i[t] = x_i, \mathrm{Pa}(X_i[t]) = \mathrm{pa}_i \mid \mathbf{x}^l)$ for all the networks which we wish to evaluate, i.e., the current network and the "neighboring" networks in the search space.

To efficiently compute the probabilities of the families, we can convert the DPN to a join tree and use a two-pass dynamic programming algorithm [19], similar to the forwards-backwards algorithm used in HMMs [29]. To efficiently compute the probability of a set of nodes that is not contained in any of the join tree nodes, we need to use more sophisticated techniques [30]. A simpler approach, which is the one we currently use, is to connect together all the variables in a single time slice into a single node (i.e., to convert the DPN to a Markov chain), compute the expected number of transitions between every pair of consecutive states in this chain, and then marginalize these counts to get the counts for each family. We also maintain a cache of the expected counts computed so far based on the current completion model, and use this cache to avoid recomputing expected counts. Whatever technique we use, computing exact expected sufficient statistics for all models of interest is likely to remain computationally challenging. In the discussion section we mention some approximation techniques that may allow us to learn larger models.

In summary, the SEM procedure is as follows.

procedure DPN-SEM:
    Choose (B_0^(0), B_->^(0)) (possibly randomly).
    Loop for m = 0, 1, ... until convergence:
        Improve the parameters of (B_0^(m), B_->^(m)) using EM.
        Search over DPN structures, with expected counts computed using (B_0^(m), B_->^(m)).
        Let the best-scoring DPN seen be (B_0^(m+1), B_->^(m+1)).
        If (B_0^(m+1), B_->^(m+1)) = (B_0^(m), B_->^(m)), return (B_0^(m+1), B_->^(m+1)).
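The E-step computation can be sketched as follows (our illustration, not the paper's code). It assumes a family_posterior routine, e.g., backed by the join-tree or Markov-chain methods just described, that returns Pr(X_i[t] = x_i, Pa_i[t] = pa_i | x^l); the dummy posterior here exists only so the sketch runs.

from collections import defaultdict

def expected_transition_counts(sequences, family_posterior, values, parent_values):
    """E[N^->_{x,pa}] = sum over sequences l and slices t >= 1 of
    Pr(X[t] = x, Pa[t] = pa | observed part of x^l)."""
    expected = defaultdict(float)
    for seq in sequences:
        for t in range(1, len(seq)):
            for pa in parent_values:
                for x in values:
                    expected[(pa, x)] += family_posterior(seq, t, x, pa)
    return expected

# Placeholder posterior for illustration only: pretends every family
# configuration is equally likely given the evidence.
def uniform_posterior(seq, t, x, pa):
    return 0.25

counts = expected_transition_counts([[None, None, None]], uniform_posterior,
                                    values=[0, 1], parent_values=[0, 1])
print(dict(counts))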
  Data Size            BIC                        BDe
  N_seq    N_->     0        1        2        0        1        2
   250    11387  -1.4323  -1.4196  -1.0812  -1.5286  -1.5027  -1.3347
   500    28043  -1.0737  -1.0248  -1.0168  -1.2503  -1.0160  -1.0339
  1000    61864  -0.9979  -0.9961  -0.9783  -0.9638  -0.9647  -0.9676
  1500   102906  -0.9820  -0.9773  -0.9787  -0.9625  -0.9511  -0.9888

Figure 2: Logloss (bits) per slice on test data for the networks learned in the driving domain using the BIC and BDe scores. Columns are labeled by the number of additional binary variables introduced.

[Figure 3: Transition models learned from the driving domain, panels (a), (b), and (c). Shaded nodes correspond to hidden variables.]

The above discussion was for the application of SEM with the BIC score. In [11], Friedman shows how to extend the SEM procedure to learn with the BDe score. The details are more involved, since the BDe score is not linear. Nonetheless, Friedman shows that it is a reasonable approximation to use the BDe score on the expected counts inside the SEM loop.

5 APPLICATIONS

In this section we describe two preliminary investigations that attempt to evaluate the usefulness of the DPN technology we develop here for real-life applications.

5.1 INFERRING DRIVER BEHAVIOR

In many tracking domains, we would like to learn a predictive model of the behavior of the object being tracked. The more accurate the model, the more robust the tracking system and the more useful its predictions will be in decision making. For objects with hidden state, it is our hope that DPNs can be learned that correctly reflect the unobserved process governing the behavior. The learned model may also provide insight into how the behavior is generated.

In this section, we describe some experiments we carried out using a simulated driving domain [9]. The data is an idealization of what cameras mounted on the side of the road can collect. In particular, at each time step of the simulation, we get a report on cars that are within the camera's range. The report for each car has the following attributes: position and velocity (relative to the camera's reference frame), relative speed and distance to the car in front, and whether there is a car immediately to the left or right. From this data, we want to learn models of typical classes of driving behavior. Such models can be useful for several tasks. A prime example arises in the BATmobile autonomous car project [8]. The BATmobile's autonomous controller attempts to predict the behavior of neighboring cars. For example, it would be useful to know that someone who has just driven across two lanes might be attempting to leave the freeway, and consequently is likely to cut in front of you. Since tracking information from real cameras is readily available [24], it is reasonable to hope that realistic models of human drivers can be obtained. In addition to their use in autonomous vehicles, such models are of paramount importance in so-called "microscopic" traffic models used in freeway design and construction planning, and also in safety studies.

In our experiments, we generated a variety of simulated traffic patterns consisting of populations of vehicles with a mixture of different driving tendencies (e.g., trucks, sports cars, Sunday drivers, etc.). We generated 3500 cars and tracked their behavior over sequences of roughly 40-70 time steps. The observed data were discretized using fixed-size bins. We then trained networks using datasets consisting of the first 250, 500, 1000, and 1500 sequences, and tested these networks on the last 2000 sequences.
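Fixed-size binning of the continuous reports is the kind of one-liner sketched below (ours; the bin width and value ranges are made up):

def discretize(value, low, high, num_bins):
    """Map a continuous reading into one of num_bins fixed-size bins."""
    if value <= low:
        return 0
    if value >= high:
        return num_bins - 1
    return int((value - low) / (high - low) * num_bins)

print(discretize(3.7, low=0.0, high=10.0, num_bins=5))  # -> bin 1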
We learned networks with 0, 1, and 2 hidden variables, using decision-tree CPTs [12], with the BIC and BDe scores. The initial network structure in each case had each hidden variable (if any) dependent on its value in the previous time slice, and each observable variable dependent only on the hidden variables in the same time slice. (Thus, we did not put any persistence arcs in the initial model.) This structure connects each hidden variable to all the other variables, allowing that variable to carry forward information about previous time slices.

Figure 2 summarizes the logloss, per time slice, on the test data for the different networks we learned. We can roughly interpret the negative of the logloss as the number of bits needed to encode a time slice using each network. These numbers indicate that the addition of hidden variables improves the predictive ability of the models. They also indicate that we might benefit from using many more training sequences; this is because many of the important events, such as lane changes, are quite rare in typical freeway traffic patterns. Some of the learned networks are shown in Figure 3. The networks seem to include at least some of the essential relationships that we expect to see. For example, in Figure 3(b), the single hidden node has parent "relative distance" and children "lane-position" and "longitudinal velocity" (but not "lateral velocity"). This suggests that it encodes "need to take avoiding action."

5.2 LEARNING CAUSAL PATHWAYS IN BIOLOGICAL PROCESSES

DPNs are a powerful representation language for describing causal models of stochastic processes. Such models are useful in many areas of science, including molecular biology, where one is often interested in inferring the structure of regulatory pathways. McAdams and Shapiro [26] model part of the genetic circuit of the lambda bacteriophage in terms of a sequential logic circuit, and attempts have even been made to automatically infer the form of such Boolean circuits from data [22]. However, it is well known that the abstraction of binary-valued signals together with deterministic switches often breaks down, and one must model the continuous nature and inherent noise in the underlying system to get accurate predictions [25]. Also, one must be able to deal with noisy and incomplete observations of the system.

We believe that DPNs provide a good tool for modeling such noisy, causal systems, and furthermore that SEM provides a good way to learn these models automatically from noisy, incomplete data. Here we describe some initial experiments using DPNs to learn small artificial examples typical of the causal processes involved in genetic regulation. We generate data from models of known structure, learn DPN models from the data in a variety of settings, and compare these with the original models. The main purpose of these experiments is to understand how well DPNs can represent such processes, how many observations are required, and what sorts of observations are most useful. We refrain from describing any particular biological process, since we do not yet have sufficient real data on the processes we are studying to learn a scientifically useful model.

Simple genetic systems are commonly described by a pathway model: a graph in which vertices represent genes (or larger chromosomal regions) and arcs represent causal pathways (Figure 4(a)). A vertex can either be "off/normal" (state 0) or "on/abnormal" (state 1). The system starts in a state which is all 0s, and vertices can "spontaneously" turn on (due to unmodelled external causes) with some probability per unit time. Once a vertex is turned on, it stays on, but may trigger neighboring vertices to turn on as well, again with a certain probability per unit time. The arcs of the graph are usually annotated with the "half-life" parameter of the triggering process. Note that pathway models, unlike PNs, can contain directed cycles. For many important biological processes, the structure and parameters of this graph are completely unknown; their discovery would constitute a major advance in scientific understanding.

[Figure 4: (a) A simple pathway model with five vertices. Each vertex represents a site in the genome, and each arc is a possible triggering pathway. (b) A DPN model that is equivalent to the pathway model shown in (a).]
Pathway models have a very natural representation as DPNs: each vertex becomes a state variable, and the triggering arcs are represented as links in the transition network of the DPN. The tendency of a vertex to stay "on" once triggered is represented by persistence links in the DPN. Figure 4(b) shows a DPN representation of the five-vertex pathway model in Figure 4(a). The nature of the problem suggests that noisy-ORs (or noisy-ANDs) should provide a parsimonious representation of the conditional distribution at each node. To specify a noisy-OR for a node with $p$ parents, we use parameters $q_1, \ldots, q_p$, where $q_i$ is the probability that the child node will be in state 0 if the $i$th parent is in state 1. (Thus we need only $p$ parameters instead of $2^p$ for a full CPT.) In the five-vertex DPN model that we used in the experiments reported below, all the $q$ parameters (except for the persistence arcs) have value 0.2. For a strict persistence model (vertices stay on once triggered), the $q$ parameters for persistence arcs are fixed at 0. To learn such noisy-OR distributions, we follow the technique suggested in [27]; this entails introducing a new hidden node for each arc in the network, which is a noisy version of its parent, and replacing each noisy-OR gate with a deterministic-OR gate. We also tried using gradient descent, following [1], but encountered difficulties with convergence in cases where the optimal parameter values were close to the boundaries (0 or 1).
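The noisy-OR model is straightforward to state in code. Below is a small sketch (ours; the 0.2 value follows the experimental setting above, and the optional leak term is an extra assumption not used in the paper's strict model):

def noisy_or_on(parent_states, q, leak=0.0):
    """P(child = 1 | parents): the child stays 0 only if every active parent
    independently fails to trigger it; q[i] is that failure probability."""
    p_off = 1.0 - leak
    for state, q_i in zip(parent_states, q):
        if state == 1:
            p_off *= q_i
    return 1.0 - p_off

# Two parents, both with q = 0.2 as in the five-vertex model.
print(noisy_or_on([1, 0], q=[0.2, 0.2]))  # one trigger active -> 0.8
print(noisy_or_on([1, 1], q=[0.2, 0.2]))  # both active -> 0.96

A persistence arc is just a parent with q = 0: once the vertex itself is on, the child can never revert to state 0.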
[Figure 5: Hamming distance (left) of the learned models compared to the generating model for the five-vertex process, and relative logloss (right) compared with the generating model on an independent sample of 100 sequences. Tests are for regimes in which 0%, 20%, 40%, and 60% of the slices are hidden at random. Top: tabulated CPTs; bottom: noisy-OR network.]

In all our experiments, we enforced the presence of the persistence arcs in the network structure. We used two alternative initial topologies: one that has only persistence arcs (so the system must learn to add arcs) and one that is fully interconnected (so the system must learn to delete arcs). Performance in the two cases was very similar.

We experimented with three observation regimes that correspond to realistic settings:

- The complete state of the system is observed at every time step.
- Entire time slices are hidden uniformly at random with some fixed probability, corresponding to intermittent observation.
- Only two observations are made, one before the process begins and another at some unknown time $t_{\mathrm{obs}}$ after the process is initiated by some external or spontaneous event. This might be the case with some disease processes, where the DNA of a diseased cell can be observed but the elapsed time since the disease process began is not known. (The "initial" observation is of the DNA of some other, healthy cell from the same individual.)

This last case, which obtains in many realistic situations, raises a new challenge for machine learning. We resolve it as follows: we supply the network with the "diseased" observation at time slice $T$, where $T$ is with high probability larger than the actual elapsed time $t_{\mathrm{obs}}$ since the process began.^3 We also augment the DPN model with a hidden "switch" variable $S$ that is initially off, but can come on spontaneously. When the switch is off, the system evolves according to its normal transition model $P(\mathbf{X}[t+1] \mid \mathbf{X}[t], S = 0)$, which is to be determined from the data. Once the switch turns on, however, the state of the system is frozen; that is, the conditional probability distribution $P(\mathbf{X}[t+1] \mid \mathbf{X}[t], S = 1)$ is fixed so that $\mathbf{X}[t+1] = \mathbf{X}[t]$ with probability 1. The persistence parameter for $S$ determines a probability distribution over $t_{\mathrm{obs}}$; by fixing this parameter such that (a priori) $t_{\mathrm{obs}} \leq T$ with high probability, we effectively fix a scale for time, which would otherwise be arbitrary. The learned network will, nonetheless, imply a more constrained distribution for $t_{\mathrm{obs}}$ given a pair of observations.

Footnote 3: With the $q$ parameters set to 0.2 in the true network, the actual system state is all-1s with high probability after about 20 steps, so this is the length used in training and testing.
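As a sketch of the switch augmentation just described (ours, not code from the paper; p_switch and the toy normal_step stand in for parameters the model would actually learn or fix):

import random

def step(state, switch_on, normal_step, p_switch=0.05):
    """One slice of the augmented DPN: the switch turns on spontaneously
    with probability p_switch and, once on, freezes the state."""
    if not switch_on and random.random() < p_switch:
        switch_on = True
    if switch_on:
        return state, True            # frozen: X[t+1] = X[t] with probability 1
    return normal_step(state), False  # normal dynamics P(X[t+1] | X[t], S = 0)

# Illustrative run with a toy dynamics that turns the state on w.p. 0.2.
state, s = 0, False
for _ in range(25):
    state, s = step(state, s, lambda x: 1 if random.random() < 0.2 else x)
print(state, s)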
We consider two measures of performance. One is the number of different edges in the learned network compared to the generating network, i.e., the Hamming distance between their adjacency matrices. The second is the difference in the logloss of the learned model compared to the generating model, measured on an independent test set. Our results for the five-vertex model of Figure 4(a) are shown in Figure 5. We see that noisy-ORs perform much better than tabular CPTs when the amount of missing data is high. Even with 40% missing slices, the exact structure is learned from only 30 examples by the noisy-OR network. However, when all but two slices are hidden, the system cannot learn effectively with a reasonable amount of data (results not shown). The case in which we do not even know the time at which the second observation is made (which we modeled with the switching variable) is even harder to learn (results not shown). Obviously, the application of prior knowledge will be very important in reducing the data requirements.

6 DISCUSSION

In this paper we addressed the question of learning the structure and parameters of DPNs from complete and incomplete data. To the best of our knowledge, we are the first to examine this problem. As our experiments show, we can learn non-trivial structures from synthetic data and "realistic" data from a nontrivial simulator.
The main bottleneck in the application of our procedures is inference, which is necessary to compute the expected sufficient statistics. Unlike the case of PNs, a sparse DPN structure does not necessarily ensure fast inference: the minimum size of the posterior distribution for a slice is generally exponential in the number of variables that have parents in the previous slice. There are various approximations that we could use to speed up inference: (1) the method proposed by Boyen and Koller [2], which approximates posterior probabilities in the DPN in a factored form; this should be particularly appropriate for the biological models we are investigating; (2) stochastic simulation, for example, the ER/SOF algorithm of Kanazawa et al. [18]; (3) variational approximations, e.g., Jaakkola and Jordan [17] and Ghahramani and Jordan [14]; (4) methods based on a multilevel abstraction hierarchy to detect which variables are related to each other.

We are currently extending SEM to learn the structure of linear Gaussian DPNs; we hope this will prove competitive with traditional techniques of system identification [23]. One advantage of the Gaussian case over the discrete case is that marginalizing the posterior over two slices is an efficient operation. Ultimately we wish to tackle the case of hybrid DPNs, with both discrete and continuous variables. The advantage of hybrid DPNs over switching state-space models [16] is that the state variables can be represented in factored form. For example, in the driving domain, we can have separate variables for the continuous observations (such as speed and position) and for the discrete hidden states (such as "want to change lane" or "want to overtake"). The question of how to know when to add hidden variables is a very interesting one which we are also currently investigating.

Acknowledgments

We thank Jeff Forbes and Nikunj Oza for their help in getting the training data for the driving domain. This work was supported in part by ARO under the MURI program "Integrated Approach to Intelligent Systems", grant number DAAH04-96-1-0341.

References

[1] J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Mach. Learning, 29:213–244, 1997.
[2] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI-98, 1998.
[3] W. Buntine. Theory refinement on Bayesian networks. In UAI-91, pp. 52–60, 1991.
[4] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Mach. Learning, 9:309–347, 1992.
[5] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. In AAAI-88, pp. 524–528, 1988.
[6] M. DeGroot. Optimal Statistical Decisions. 1970.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., B 39:1–39, 1977.
[8] J. Forbes, T. Huang, K. Kanazawa, and S. Russell. The BATmobile: Towards a Bayesian automated taxi. In IJCAI-95, 1995.
[9] J. Forbes, N. Oza, R. Parr, and S. Russell. Feasibility study of fully automated traffic using decision-theoretic control. Cal. PATH Research Report UCB-ITS-PRR-97-18, Inst. of Transportation Studies, U.C. Berkeley, 1997.
[10] N. Friedman. Learning belief networks in the presence of missing values and hidden variables. In ICML-97, 1997.
[11] N. Friedman. The Bayesian structural EM algorithm. In UAI-98, 1998.
[12] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In M. I. Jordan, ed., Learning in Graphical Models, 1998. A preliminary version appeared in UAI-96.
[13] D. Geiger and D. Heckerman. Learning Gaussian networks. In UAI-94, 1994.
[14] Z. Ghahramani and M. I. Jordan. Factorial hidden Markov models. Mach. Learning, 29:245–274, 1997.
[15] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learning, 20:197–243, 1995.
[16] Z. Ghahramani and G. Hinton. Switching state-space models. Submitted for publication, 1998.
[17] T. S. Jaakkola and M. I. Jordan. Recursive algorithms for approximating probabilities in graphical models. In NIPS 9, pp. 487–493, 1997.
[18] K. Kanazawa, D. Koller, and S. Russell. Stochastic simulation algorithms for dynamic probabilistic networks. In UAI-95, pp. 346–351, 1995.
[19] U. Kjaerulff. A computational scheme for reasoning in dynamic probabilistic networks. In UAI-92, 1992.
[20] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Comp. Intel., 10:269–293, 1994.
[21] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Comp. Stat. and Data Anal., 19:191–201, 1995.
[22] S. Liang, S. Fuhrman, and R. Somogyi. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symp. on Biocomputing, vol. 3, pp. 18–29, 1998.
[23] L. Ljung. System Identification: Theory for the User. Prentice Hall, 1987.
[24] J. Malik and S. Russell. Traffic surveillance and detection technology development: New sensor technology final report. Research Report UCB-ITS-PRR-97-6, Cal. PATH Program, 1997.
[25] H. H. McAdams and A. Arkin. Stochastic mechanisms in gene expression. Proc. of the Nat. Acad. of Sci., 94:814–819, 1997.
[26] H. H. McAdams and L. Shapiro. Circuit simulation of genetic networks. Science, 269:650–656, 1995.
[27] C. Meek and D. Heckerman. Structure and parameter learning for causal independence and causal interaction models. In UAI-97, pp. 366–375, 1997.
[28] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461–464, 1978.
[29] P. Smyth, D. Heckerman, and M. Jordan. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2):227–269, 1997.
[30] H. Xu. Computing marginals for arbitrary subsets from marginal representations in Markov trees. Art. Intel., 74:177–189, 1995.
[31] G. Zweig and S. Russell. Speech recognition with dynamic Bayesian networks. In AAAI-98, 1998.