You are on page 1of 14

Applied Intelligence 11, 31–44 (1999)

°
c 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

Massively Parallel Probabilistic Reasoning with Boltzmann Machines

PETRI MYLLYMÄKI
Complex Systems Computation Group (CoSCo), Department of Computer Science, P.O. Box 26, FIN-00014,
University of Helsinki, Finland
Petri.Myllymaki@cs.Helsinki.FI, http://www.cs.Helsinki.FI/∼myllymak/

Abstract. We present a method for mapping a given Bayesian network to a Boltzmann machine architecture,
in the sense that the the updating process of the resulting Boltzmann machine model probably converges to a
state which can be mapped back to a maximum a posteriori (MAP) probability state in the probability distribution
represented by the Bayesian network. The Boltzmann machine model can be implemented efficiently on massively
parallel hardware, since the resulting structure can be divided into two separate clusters where all the nodes in one
cluster can be updated simultaneously. This means that the proposed mapping can be used for providing Bayesian
network models with a massively parallel probabilistic reasoning module, capable of finding the MAP states in a
computationally efficient manner. From the neural network point of view, the mapping from a Bayesian network
to a Boltzmann machine can be seen as a method for automatically determining the structure and the connection
weights of a Boltzmann machine by incorporating high-level, probabilistic information directly into the neural
network architecture, without recourse to a time-consuming and unreliable learning process.

Keywords: Boltzmann machines, probabilistic reasoning, Bayesian networks, simulated annealing

1. Introduction the objective function corresponding to a given net-


work structure, provided that suitable massively para-
Neural networks are massively parallel computational llel hardware is available. In order to apply these mod-
models consisting of a large number of very simple pro- els for solving combinatorial optimization problems,
cessing units (for a survey of neural models, see e.g., we need also an efficient, theoretically justifiable
the classic collections [1–4]). These models can per- method for constructing a neural network structure
form certain computational tasks extremely fast when where the maximum point of the objective function
run on customized parallel hardware, and hence they corresponds to the solution to the optimization prob-
have been suggested as a computationally efficient tool lem to be solved. However, Boltzmann machines, and
for solving NP-hard optimization problems approxi- neural network models in general, are typically con-
matively [5, 6]. Especially suitable for this type of structed by inefficient, partly quite adhoc methods.
tasks are stochastic neural network architectures, as In particular, the neural network architecture is usu-
these models are based on a stochastic updating pro- ally selected in a more or less arbitrary manner by
cess which converges to a state maximizing an objec- fixing the number of nodes and the connections be-
tive function determined by the variable states of the tween the nodes manually. The weights are then de-
binary nodes of the network, and by the (constant) pa- termined by a learning algorithm which first assigns
rameters (“weights”) attached to the connecting arcs. A (randomly chosen) initial weights to the connections,
well-known example of a stochastic neural network ar- and uses then some gradient-based greedy algorithm
chitecture is the Boltzmann machine (BM) model [7, 8]. for changing the weights until the behavior of the net-
Boltzmann machines can be used as a computa- work seems to be consistent with a sample of training
tionally efficient tool for finding maximum points of data [8–10].
32 Myllymäki

There are, however, several serious pitfalls with this resulting in relatively small conditional probability ta-
approach, which relate to the “black box” nature of the bles. In these cases, a Bayesian network representation
functioning of the resulting models: normally there is of the problem domain probability distribution can be
no way of finding out what kind of knowledge a neu- constructed efficiently and reliably, assuming that ap-
ral network contains after the learning, nor is it possi- propriate high-level expert domain knowledge is avail-
ble to explain the behavior of the model. In particular, able. There exists also several interesting approaches
as the learning algorithms start with a randomly cho- for constructing Bayesian networks from sample data,
sen initial state, “tabula rasa”, they are unable to use and moreover, theoretically solid techniques for com-
any prior knowledge of the problem environment, al- bining domain expert knowledge with the machine
though in many cases this kind of information would learning approach (see, e.g., [14]).
be readily available. This results in a very slow and The Bayesian network theory offers a framework for
unreliable learning process. Besides, as most of the constructing algorithms for different probabilistic rea-
learning algorithms are “steepest-descent” type greedy soning tasks (for a survey of probabilistic reasoning
algorithms, they are very likely to get stuck in local algorithms, see [15]). In this paper, we focus on the
maximum points. problem of finding the MAP state in the domain mod-
It follows that although Boltzmann machines pro- eled by a given Bayesian network. Unfortunately, for
vide us with an efficient computational mechanism for a general network structure, the MAP problem can be
finding maximum points of the objective function, find- shown to be NP-hard [16, 17], which means that very
ing a Boltzmann machine structure with a suitable ob- probably it is not possible for any algorithm to solve
jective function can be very difficult in practice. In this this task (in the worst case) in polynomial time with
paper we present one possible solution to this prob- respect to the size of the network. Consequently, re-
lem, focusing on a combinatorial optimization problem cently there has been a growing interest in developing
of great practical importance, the problem of finding stochastic algorithms for solving the MAP problem,
the maximum a priori (MAP) probability value assign- where instead of aiming at an accurate, deterministic
ment of a probability distribution on a set of discrete solution, the goal is to find a good approximation for a
variables. More precisely, we show how a Bayesian given problem with high probability.
(belief) network model [11–13], a graphical high-level A prominent example of such stochastic methods is
representation of a probability distribution over a set the simulated annealing algorithm [18–21]. This global
of discrete variables, can be mapped to a Boltzmann optimization method can be used as the computational
machine architecture so that the global maximum of the procedure for performing probabilistic reasoning on a
Boltzmann machine objective function (the consensus Bayesian network, but the method can be excruciat-
function) has the same maximum points as the proba- ingly slow in practice. The efficiency of simulated an-
bility distribution represented by the Bayesian network nealing depends crucially on a set of parameters which
model. control the behavior of the algorithm. Elaborate tech-
Intuitively speaking, a Bayesian network model of a niques for optimizing these parameters can be found
probability distribution is constructed by explicitly de- in [22–29]. An alternative approach for speeding up
termining all the direct (causal) dependencies between the simulated annealing method is to develop parallel
the random variables of the problem domain: each node variants of the algorithm. Techniques for implement-
in a Bayesian network represents one of the random ing parallel forms of simulated annealing on conven-
variables of the problem domain, and the arcs between tional hardware can be found in [21, 30, 31]. On the
the nodes represent the direct dependencies between other hand, as suggested in [21], since the Boltzmann
the corresponding variables. In addition, each node machine updating process produces a stochastic pro-
has to be provided with a table of conditional probabil- cess similar to simulated annealing, this neural network
ities, where the variable in question is conditioned by model offers an interesting possibility for an implemen-
its predecessors in the network. tation on a massively parallel architecture.
The importance of Bayesian network representations In this paper we argue that we can achieve “the best
lies in the way such a structure can be used as a com- of both worlds” by building a hybrid Bayesian-neural
pact representation for many naturally occurring dis- system, where the model construction module uses
tributions, where the dependencies between variables Bayesian network techniques, while the probabilis-
arise from a relatively sparse network of connections, tic reasoning module is implemented as a massively
Massively Parallel Probabilistic Reasoning 33

parallel Boltzmann machine. For constructing such of accuracy in sampling the probability distribution,
a hybrid Bayesian-neural system, we need a way to the BM process is only an approximation of a simu-
map a given Bayesian network to a Boltzmann ma- lated annealing process on the Bayesian network. It is
chine architecture, so that the consensus function of therefore an interesting question whether the speedup
the resulting Boltzmann machine has the same maxi- gained from parallelization compensates for the loss of
mum points as the probability measure corresponding accuracy in the stochastic process. In the empirical part
to the original Bayesian network. Although in some re- of the paper, we study this question by using artificially
stricted domains this kind of a transformation is fairly generated test cases.
straightforward to construct [7, 32–35], the methods The paper is organized as follows. First in Section 2
presented do not apply to general Bayesian network we give the formal definition of Bayesian network mo-
structures (see the discussion in [36]). In [37, 38] we dels. The probabilistic reasoning task studied in this pa-
presented a mapping from a binary-variable Bayesian per, the MAP problem, is discussed in Section 3. In this
network to a harmony network [39], which can be re- section we also describe how simulated annealing can
garded as a special case of the Boltzmann machine be used for solving this problem. The basic concepts
architecture. Following the work presented in [36, 40, of Boltzmann machine models are reviewed briefly in
41], in this paper we extend this framework to cases Section 4, and in Section 5 we describe the suggested
where the Bayesian network variables can have an arbi- mapping from a Bayesian network to a Boltzmann ma-
trary number of values, and the neural network module chine. In Section 6 we show results of simulation
uses the standard Boltzmann machine structure. runs which demonstrate that the massively parallel SA
Compared to other neural-symbolic hybrid systems scheme provided by the Boltzmann machine model
(see e.g. [42–45]), the Bayesian-neural hybrid system clearly outperforms the corresponding traditional se-
suggested here has two clear advantages. First of all, quential SA scheme, provided that suitable massively
the mathematical model behind the system is the theo- parallel hardware is available.
retically sound framework of Bayesian reasoning, com-
pared to the more or less heuristic models of most other 2. Bayesian Networks
hybrid systems (for our earlier, heuristic attempts to-
wards a hybrid system, see [46–48]). Secondly, al- Let U denote a set of N discrete random variables,
though some hybrid models provide theoretical justifi- U = {U1 , . . . , U N }. We call this set our variable base.
cations for the computations (see e.g., Shastri’s system In the sequel, we use capital letters for denoting the
for evidential reasoning [49]), they may require fairly actual variables, and small letters u 1 , . . . , u N for de-
complicated and heterogeneous computing elements noting their values. The values of all the variables in
and control regimes, whereas the neural network model the variable base form a configuration vector or a state
behind our Bayesian-neural system is structurally very vector uE = (u 1 , . . . , u N ), and all the M possible con-
simple and uniform, and confirms to an already exist- figuration vectors (E u 1 , . . . , uE M ) form our configuration
ing family of neural architectures, the Boltzmann ma- space Ω. Hence our variable base U can also be re-
chines. In addition, the mapping presented here creates garded as a random variable UE , the values of which
a two-way bridge between the symbolic and neural are the configuration vectors. Generally, if X ⊆ U is
representations, which can be used to create a “real” a set of variables, X = {X 1 , . . . , X n }, by XE = xE we
modular hybrid system where two (or more) separate mean that xE is a vector (x1 , . . . , xn ), and X i = xi , for
(neural or symbolic) inference modules work together. all i = 1, . . . , n.
The mapping described in this paper produces a Let F denote the set of all the possible subsets of
Boltzmann machine updating process which corre- Ω the set of events, and let P be a probability mea-
sponds in a sense to a simulated annealing process sure defined on Ω. The triple (Ω, F, P) now defines
where all the Bayesian network variables are updated a joint probability distribution on our variable base U.
simultaneously. Consequently, with suitable massively Having fixed the configuration space Ω (and the set
parallel hardware, processing is quite efficient and be- of events F ), any probability distribution can be fully
comes independent of the size or the structure of the determined by giving its probability measure P, and
Bayesian network. On the other hand, the BM updating hence we will in the sequel refer to a probability distri-
process works in a state space much larger than the state bution by simply saying “the probability distribution is
space of the original Bayesian network, and in terms P”.
34 Myllymäki

Bayesian networks are a formalism for storing and conditional probabilities:


retrieving the configuration probabilities P{E u } in a
( ¯ )
compact and efficient manner. The theoretical foun- Y
N ¯ ^
¯
dations of such models were set up in [11, 12, 13, 50], P{E
u} = P Ui = u i ¯ Uj = u j , (1)
i=1
¯ U ∈F
where it was shown how probability distributions can j Ui

be defined efficiently by explicitly identifying and ex-


ploiting conditional independencies between the vari- where FUi denotes the predecessors of variable V Ui ,
ables of U: and the conditional probabilities P{Ui = u i | U j ∈FU
i
U j = u j } can be found in the set PG . Consequently,
having defined a set of conditional independencies in
Definition 1. Let X, Y and Z be sets of variables.
a graphical form as a Bayesian network structure G,
Then X is conditionally independent of Y, given Z, if
we can use the conditional probabilities PG to fully de-
termine the underlying joint probability distribution.
P{ XE = xE | YE = yE, ZE = Ez } = P{ XE = xE | ZE = Ez } The number of parameters needed, m = |PG |, de-
pends Pon the density
Q of the Bayesian network structure:
m = i (|Ui | U j ∈FU |U j |). In many natural situa-
holds for all vectors xE, yE, Ez such that P{YE = yE, ZE i

= Ez } > 0. tions, this number can be several magnitudes smaller


than the size of the full configuration space. An exam-
ple of a simple Bayesian network is given in Fig. 1.
Consequently, the variables in Z intercept any de- Let Ci denote a set consisting of a variable Ui and
pendencies between the variables in X and the vari- all its predecessors in a Bayesian network G, Ci =
ables in Y: knowing the values of Z renders informa- {Ui } + FUi . We call these N sets the cliques of G. For
tion about the values of Y irrelevant to determining the each clique Ci , we define a potential function Vi which
distribution of X. Using the concept of conditional in- maps a given state vector to a real number,
dependence we can now give the definition of Bayesian
( ¯ )
network models: ¯ ^
¯
Vi (E
u ) = ln P Ui = u i ¯ Uj = u j . (2)
¯ U ∈F
j Ui
Definition 2. A Bayesian (belief) network (BN) rep-
resentation for a probability distribution P on a set of
discrete variables U = {U1 , . . . , U N } is a pair {G, PG }, The value of a potential function Vi depends only on
where G is a directed acyclic graph whose nodes corre- the values of the variables in set Ci . As has been noted
spond to the variables U = {U1 , . . . , U N }, and whose in several occasions [33, 37, 38, 54–56], these cliques
topology satisfies the following: each variable X ∈ U can be used to construct an undirected graphical
is conditionally independent of all its non-descendants Markov random field model of the probability dis-
in G, given its set of parents FX , and no proper subset tribution P, which means that the joint probability
of FX satisfies this condition. The second component distribution (1) for a Bayesian network representation
PG is a set consisting of the corresponding conditional can also be expressed as a Gibbs distribution
probabilities P{X | FX }.
1 − Pi −Vi (Eu ) 1 P
P{E
u} = e = e i Vi (Eu ) , (3)
Z Z
As the parents of a node X can often be interpreted as
direct causes of X , Bayesian networks are also some- where the clique potential functions Vi are given in (2),
times referred to as causal networks, or as the purpose and Z = 1.
is Bayesian reasoning, they are also called inference
networks. In the field of decision theory, a model sim-
ilar to Bayesian networks is known as influence dia- 3. The MAP Problem and Simulated Annealing
grams [51]. Rigorous introductions to Bayesian net-
work modeling can be found in [11, 13, 14, 52, 53]. Let X ⊆ U be a set of variables, X = {X 1 , . . . , X n }. By
The importance of Bayesian network structures lies an event { XE = xE} we mean a subset of F which inclu-
in the way such networks facilitate computing the joint des all the configurations uE ∈ Ω that are consistent with
configuration probabilities P{Eu } as a product of simple the assignment hX 1 = x1 , . . . , X n = xn i. Now assume
Massively Parallel Probabilistic Reasoning 35

Figure 1. A simple Bayesian network structure with 7 binary variables U1 , . . . , U7 , and all the conditional probabilities required for determining
the corresponding probability distribution exactly. By “u i ” we mean here the value assignment hUi = 1i, and by “ū i ” the value assignment
hUi = 0i.

we are given a partial value assignment h EE = eEi on a and we set uE(t + 1) = uE(t). For generating a new can-
set E ⊂ U of variables as an input, and let Pmax denote didate state uE j , most versions of simulated annealing
the maximal marginal probability in the set { EE = eE} use a simple scheme called Gibbs sampling where only
consisting of all the configurations consistent with the one randomly chosen variable is allowed to change its
given value assignment, Pmax = maxuE∈{ E=E E e} P{Eu }. A value to a new randomly chosen value.
state uEi ∈ { EE = eE} with the property P{Eu i } = Pmax is In the Metropolis-Hastings version [18, 57] of sim-
now called a maximum a posteriori probability (MAP) ulated annealing, the acceptance probabilities are de-
state. In this study, we are interested in the task of fined by
finding a MAP state of the configuration space Ω, given  µ ¶
some evidence h EE = eEi, and we call this problem the 
 V (E
u j ) − V (E
ui )

 1, if exp ≥ 1,
MAP problem. T
Ai j (T ) = µ ¶
Let us consider a Bayesian network with a joint prob- 
 u j ) − V (E
V (E ui )
ability distribution of the form (3), and let us define the 
 exp , otherwise.
T
potential of a state uE as the sum of clique potentials,
X where T (t), the computational temperature, is a mono-
V (E
u) = Vi (E
u ). (4) tonely decreasing function converging to zero as t ap-
i
proaches infinity. Barker [58] has introduced an al-
We can now clearly find a MAP state by maximizing ternative method with acceptance probabilities of the
the potential function V . form
In simulated annealing (SA) [18–21] this task is
1
solved by producing a stochastic process {E u (t) | t = Ai j (t) = ¡ V (Eu i )−V (Eu j ) ¢ . (5)
0, 1, . . .}, which converges to a MAP solution with 1 + exp T (t)
high probability. The stochastic process used in SA is
a Markov chain, which means that, at time t, the next In both cases it can be shown that the resulting stochas-
state uE(t + 1) depends only on the current state uE(t). tic process will converge to a MAP solution with prob-
Assuming that the current state uE(t) is uEi , the new state ability p, where p approaches one as the number of
uE(t + 1) is produced by first generating a candidate uE j iterations approaches infinity [21]. Moreover, it can
for the next state, and the candidate is then accepted by be shown that if the temperature T (t) decreases to-
an acceptance probability Ai j , otherwise it is rejected, wards zero slowly enough, the convergence is almost
36 Myllymäki

certain even with a finite time process [20]. Unfortu- input in (7):
nately, for a theoretically guaranteed convergence, a X
computationally infeasible exponential number of iter- r ) − C(Es ) =
C(E wi j s j = Ii .
ations is needed. Although in practice good results are j
sometimes obtained with a relatively small number of
iterations, the method can be excruciatingly slow. In It follows that the BM updating process can be regarded
the following, we show how to create a massively par- as a Gibbs sampling process with acceptance probabil-
allel implementation of simulated annealing by using ities identical to those given in (5), with respect to a
the Boltzmann machine architecture. Gibbs distribution

P{Es } = exp(C(Es )/T (t)).


4. Boltzmann Machines
Consequently, in principle the BM model can be used
A Boltzmann machine (BM) [8] is a neural network
as a massively parallel tool for finding the maximum
consisting of a set of binary nodes {S1 , . . . , Sn }, where
of the consensus function. However, the acceptance
the state si of a node Si is either 1 (“on”), or 0 (“off”).
probability (7) sets here implicitly some additional re-
Each arc from a unit Si to unit S j is provided with a
quirements to the generation probabilities, since the
real-valued parameter w ji , the weight of the arc. The
difference in consensus is calculated by keeping all the
arcs are symmetric, so wi j = w ji . Each unit Si is also
nodes except one constant. For this reason, if two or
provided with a real-valued parameter θi , the bias of
more adjacent nodes of the BM network are to be up-
the unit.
dated at the same time, the corresponding transition
Let sE = (s1 , . . . , sn ) ∈ {0, 1}n denote a global state
probability matrix of the resulting process is no more
vector of the nodes in the network. We define the con-
stochastic, and hence convergence of the algorithm can
sensus of a state sE as
not be guaranteed. On the other hand, if we allow only
X
n X
n one node to be updated at a time, convergence can be
C(Es ) = wi j si s j , (6) guaranteed, but then the parallel nature of the algorithm
j=1 i= j is lost.
To solve this dilemma, it has been suggested [59]
where wii denotes the bias θi of node Si , and the weight
that the nodes of a BM network should be divided into
wi j is defined to be 0 if Si and S j are not connected
clusters, where no two nodes inside of a cluster are con-
(disregarding the sign reversal, the consensus function
nected to each other. Using this kind of clustered BM
is equal to the the energy of a BM model used in most
models we can maintain some parallelism and update
definitions).
all the nodes in one cluster at the same time, while a
The nodes of a BM network are updated stochasti-
convergence theorem similar to the convergence theo-
cally according to the following probabilistic rule:
rem of simulated annealing can be proved [21, p. 139].
1 Obviously, the degree of parallelism depends on the
P(si = 1) = ¡ −Ii ¢ , (7) number of clusters in the network. Unfortunately, the
1 + exp T (t) problem of finding a minimal set of clusters in a given
P network is NP-complete [21, p. 141]. In the next section
where Ii = j wi j s j is the net input to node Si , we deal with a special class of BM architectures which
and T (t) is a real-valued parameter called tempera- has by definition only two clusters, being in this sense
ture, which decreases with increasing time t towards an optimal BM architecture.
zero. Consequently, each node is able to update its state
locally, using the information arriving from the con-
necting neighbors as the input for a sigmoid function, 5. Solving the MAP Problem
thus offering a possibility for a massively parallel im- by Boltzmann Machines
plementation of this algorithm.
Let us consider a state sE, and let rE be a new state Let us consider a Bayesian network {G, PG } for a
which is produced by changing the state of node Si . It probability distribution P of the form (3), and let
is now relatively easy too see [36] that the difference m denote the number of conditional probabilities in
in consensus between the two states is exactly the net PG , PG = { p1 , . . . , pm }. Furthermore, let λ1 , . . . , λm
Massively Parallel Probabilistic Reasoning 37

P
denote the natural logarithms of the conditional proba- where I j = i w ji ai is the net input to pattern node
bilities pi , λi = ln pi . We can now equally well forget Bj.
about the cliques altogether, and express the potential corresponding
Let us now consider a pattern node B j V
function (4) as to a parameter λ j , and let P{Ui = u i | Uk ∈FU Uk =
i
u k } be the conditional probability used for computing
X
N X
m
λj,
V (E
u) = Vi (E
u) = χ j (E
u )λ j , (8)
i=1 j=1 ( ¯ )
¯ ^
¯
where the function χ j (E u ) has value one, if uE is consis- λ j = ln P Ui = u i ¯ Uk = u k + λ∗ . (10)
¯ U ∈F
k Ui
tent with the value assignment corresponding to prob-
ability p j , otherwise it is zero. It is clear that if we
find the maximum of the potential function V in (8), We now set the weights of the arcs between pattern node
we have found the maximum of the probability distri- B j and feature nodes A1 , . . . , An in the following way:
bution P.
For our purposes, we need the function to be maxi- 1. The weights of all the arcs connecting node B j to
mized to be unbounded above, so we scale the strictly feature nodes representing values of variables not
negative potential function (8) by adding a constant in {Ui , FUi } are set to zero.
λ∗ to each of the parameters λ j . If we set λ∗ ≥ 2. The weight from pattern node B j to a feature node
− min j ln p j , the new potential function becomes non- corresponding to the value u i is set to λ j , and the
negative. In our empirical tests described in Section 6, weights of the arcs to feature nodes corresponding
we set λ∗ to be equal to − min j ln p j . to other values of the variable Ui are set to −λ j .
As the number of non-zero parameters χ j is always 3. The weight from node B j to feature nodes corre-
exactly N , the number of cliques, the resulting new sponding to the values u j appearing on the right
potential function is equal to the original function plus hand side in (10) are set to λ j , and arcs to feature
a constant N λ∗ , and hence it has the same maximum nodes representing other values of the predecessors
points as the original function. In the following, we of the variable Ui are set to −λ j .
consider maximizing a potential function of the form 4. The bias θ j of pattern node B j is set to (n +j − 1)λ j ,
(8), with the parameters λ j scaled to positive numbers where n +j is the number of positive arcs leaving from
as described above. If we manage to map this function node B j . However, if n +j = 0, we set θ j = 0.
to a consensus function of a Boltzmann machine, we
have accomplished our goal: massively parallel solu- Following this construction for each of the pattern
tion to the MAP problem. nodes B j , we get a structure which we call a two-
Let us now consider a two-layer Boltzmann
P machine, layer Boltzmann machine (BM2 ). An example of a BM2
where the first layer consists of n = i |Ui | feature structure in a case of a simple Bayesian network is
nodes A1 , . . . , An , one for each value for each of the shown in Fig. 2. As arcs with a zero-valued weight are
variables of the problem domain. The second layer has irrelevant for the computations, they are excluded from
altogether m pattern nodes B1 , . . . , Bm , one node for the network.
each of the parameters λ j in the sum in (8). Initially, Let us now consider a Bayesian network GBN , and
let us assume that each pattern node is connected to all the corresponding BM2 network GBM . In the sequel,
the feature nodes in the first layer, but no two nodes we use the term feature vector to denote the vec-
in the same layer can be connected to each other. The tor aE = (a1 , . . . , an ), consisting of the states of the
consensus function of such a network can be expressed feature nodes of GBM . Correspondingly, the vector
as bE = (b1 , . . . , bm ), consisting of the states of the pattern
nodes is called a pattern vector.
C(Es ) = C(a1 , . . . , an , b1 , . . . , bm )
Xm X n
= bj w ji ai Definition 3. The feature vector of a two-layer
j=1 i=1 Boltzmann machine GBM is consistent (with respect to
X
m the corresponding Bayesian network GBN ), if there is
= bj Ij, (9) exactly one feature node active for each of variables in
j=1 GBN , representing one possible value of that variable.
38 Myllymäki

Figure 2. A simple Bayesian network (on the left), and the corresponding BM2 structure with the parameters λ j . Solid lines represent arcs
with positive weight, negative arcs are printed with dashed lines.

From Definition 3 it follows directly that a consistent It is now easy to proof the following simple lemma:
feature vector can be mapped to an instantiation of GBN .
As before, let B j be a pattern node corresponding
V to a Lemma 1. A BM2 network converges to a consistent
parameter λ j , and let P{Ui = u i | Uk ∈FU Uk = u k } state.
i
be the conditional probability used for computing λ j .
We now say that the pattern node B j is consistent with Proof: Let sE = (E E be the final state of a BM2 up-
a , b)
V aE if aE is consistent with the assignment
a feature node dating process. We know thatPsE is a state which maxi-
hUi = u i , Uk ∈FU Uk = u k i. Moreover, a pattern acti- mizes the consensus C(Es ) = mj=1 b j I j . It is now easy
vation vector bE is said to be consistent, if all the con- too see that if aE is a consistent vector, bE must also
i

sistent pattern nodes are on, and all the inconsistent be consistent, since if there were inconsistent pattern
pattern nodes are off. Finally, a state sE = (E E of GBM
a , b) nodes active on the second layer in this case, switching
E
is called consistent if both aE and b are consistent. them off would increase the consensus of the network,
Massively Parallel Probabilistic Reasoning 39

and correspondingly, switching any inactive consistent Bayesian networks, and attached to each network a
pattern node on would increase the consensus. On the random MAP instantiation problem by clamping half
other hand, aE can not be inconsistent, since in this case of the variables to randomly chosen values. This same
all the pattern nodes connected to inconsistent feature test was performed with three different algorithms. As
nodes would be inactive, which means that removing a benchmark, we used a brute force search algorithm
inconsistencies would increase the number of active (denoted by BF), which is guaranteed to find the MAP
pattern nodes, thus increasing the consensus. It now solution in |Ä E | ∗ m time steps, where |Ä E | is the num-
follows that sE = (E E must be a consistent state. 2
a , b) ber of possible solutions, and m is the number of condi-
tional probabilities attached to the Bayesian network in
Using this lemma, we can show that the BM2 struc- question. Secondly, we used a sequential simulated an-
ture can be used for solving the MAP problem: nealing algorithm on the Bayesian network (denoted by
SSA), and the corresponding massively parallel BM2
Proposition 1. The updating process of the BM2 net- updating process (denoted here by BMSA). As we did
work converges to a state which maximizes the poten- not have access to real neural hardware, the BMSA
tial function V of the corresponding Bayesian network model was tested by running simulations on a conven-
GBN , provided that all the parameters λ j are nonnega- tional Unix workstation. The time requirement for one
tive. iteration of the SSA was assumed to be O(m), whereas
the time requirement for one iteration for the BMSA
Proof: According to Lemma 1, the network con- algorithm was assumed to be O(1) (all the nodes on
verges to a maximum consensus state where all the one layer were updated at the same time, so updating
active pattern nodes are consistent with the consistent the whole network took 2 time steps).
feature vector. This means that each final feature vec- When considering the performance of the SSA or the
tor aE corresponds to an instantiation uE. It is easy to see BMSA algorithm, it is clear that the most critical issue
that the net input I j to a consistent pattern node B j is is finding a suitable cooling scheme. Unfortunately,
λ j , and hence it follows that the theoretically correct logarithmic cooling scheme
X
m described in [20] can not be used in practice, since the
max C(Es ) = max max bj Ij number of iterations required grows too high even with
sE aE bE j=1 relatively low starting temperatures: for instance, start-
X
m ing with the initial temperature of 2, annealing down
= max max (b j I j ) to 0.1 would require more than 485 million iteration
aE b j ∈{0,1}
j=1 steps. The problem with heuristic annealing schemes
X
m is that if the annealing is done too cautiously, an unnec-
= max χ j (E
u )λ j essarily large amount of computing time may be spent.
uE
j=1
On the other hand, if the annealing is done too quickly,
= max V (E
u ). 2 the theoretical convergence results do not apply, and
uE
the results are unreliable. Recent theoretical modifica-
Please note that if the parameters λ j were not scaled tions applied in fast annealing (FA) [24] and adaptive
to nonnegative numbers as suggested earlier, the poten- simulated annealing (ASA) [22, 23] offer interesting
tial function would be strictly negative, and the BM2 alternatives to the exponential Metropolis form of sim-
construction would be useless, since the network would ulated annealing used in this study, but it is currently
then always have a trivial maximum point at zero, cor- not clear whether the framework presented here can
responding to a state where all the nodes are off. be extended to these forms of simulated annealing as
well. This poses an interesting research problem that
6. Empirical Results deserves further study.
In the experiments performed, our primary objec-
6.1. The Setup tive was not to study the efficiency of the sequen-
tial and massively parallel algorithms per se, but to
To empirically test the effectiveness of the BM2 con- see whether the speedup gained from parallelization
struction, we generated several MAP instantiation would be enough to compensate for the loss of accuracy
problems, with N , the number of variables, ranging in sampling—in other words, whether the BM2 con-
from 2 to 24. For each N , we generated 10 random struction would be computationally practical to use in
40 Myllymäki

principle, if suitable hardware was available. It should


be noted that as SSA and BMSA use the same parame-
ters for determining the cooling schedule of the anneal-
ing algorithm, it seems reasonable to assume that use of
the more sophisticated variants of simulated annealing
[22–29] would demonstrate roughly equal speedup for
both SSA and BMSA.
In our empirical tests, the temperature was low-
ered according to a simple cooling schedule, where
the temperature was multiplied after each iteration by
a constant annealing factor F, F < 1.0. A more detailed
study of the behavior of SSA and BMSA with different
annealing factors can be found in [36]. Based on this Figure 3. The behavior of the BMSA and SSA algorithms as a
study, in the tests reported here, the annealing factor function of the density of Bayesian network.
was set to 0.66. Although simple, this type of cooling
schedule is very common, and has proven successful
in many applications [21]. It is also empirically ob- significant effect on the results with either SSA or
served that more sophisticated annealing methods do BMSA. It would be an interesting research problem to
not necessarily produce any better results than this sim- study (analytically or empirically) how the Bayesian
ple method [60]. network probability distribution P changes with the
The main goal of the experiments was to study the parameter ², but this question was not addressed here.
behavior of the SSA and BMSA algorithms with in- Increasing the density of Bayesian networks not only
creasingly complex MAP problems. The complexity changes the shape of the probability distribution on the
of these problems can be increased in two ways: by configuration space, but it also imposes a computa-
changing the properties of the probability distribution tional problem as it increases m, the number of the
on the configuration space in question, or by chang- conditional probabilities that need to determined for
ing the size of the configuration space. We experi- the model. Since the BMSA algorithm (or actually its
mented with three methods for changing the configu- simulated massively parallel implementation) is inde-
ration space probability distribution: by restricting the pendent of m, increasing the density does not affect
conditional probabilities of the Bayesian networks to the BMSA solution time very much, whereas the so-
small regions near zero or one, by changing the density lution time of SSA increases significantly (see Fig. 3).
of the Bayesian network structure, and by changing the Nevertheless, it should be noted that we have here ex-
size of the evidence set E, i.e., by changing the num- tended our experiments to very dense, and even fully
ber of clamped variables. The size of the configuration connected networks. Naturally, this does not make any
space was increased in two ways: by allowing more sense in practice, since the whole concept of Bayesian
variables in the Bayesian networks, and by allowing networks relies on the networks being relatively sparse.
the variables to have more values. For this reason, in the sequel we use in our experiments
relatively sparse networks only (which does not, how-
ever, mean that the networks were singly-connected or
6.2. The Results otherwise structurally simple).
The test set corresponding to Fig. 3 consisted of 100
It has been noted [61] that solving certain types of prob- MAP problems on 10 Bayesian networks with only 8
abilistic reasoning tasks can become very difficult if binary nodes. When considering the results, it should
the Bayesian network contains a lot of extreme prob- be kept in mind that although the BF algorithm seems to
abilities (probabilities with values near zero or one). work relatively well with these small networks, it does
However, as already noted in [38], in our MAP prob- not scale up with increasing size of the networks (as we
lem framework this does not seem to be true. We ex- shall see in Fig. 6). For the same reason, the exhaus-
perimented by restricting the randomly generated con- tive BF algorithm performs well with a small number
ditional probabilities of the Bayesian network model of unclamped variables (in which case the search space
in the regions [0.0, δ], [1.0 − δ, 1.0], and varied the is small), but from Fig. 4 we see that as the number of
value of δ between 0.5 and 0.0001, but observed no unclamped variables increases, the time required for
Massively Parallel Probabilistic Reasoning 41

Figure 4. The behavior of the algorithms as a function of the num-


Figure 6. The behavior of the algorithms as a function of the num-
ber of the unclamped variables.
ber of the variables.

on 10 Bayesian networks with binary nodes, and as


before, half of the variables were clamped in advance.

7. Conclusion and Future Work

We presented a method for solving probabilistic rea-


soning tasks, formulated as finding the MAP state of a
given Bayesian network, by a massively parallel neu-
ral network computer. Our experimental simulations
strongly suggest that the massively parallel BMSA al-
gorithm outperforms the sequential SSA algorithm,
Figure 5. The behavior of the algorithms as a function of the num- provided that suitable hardware is available. As can be
ber of the values of the variables. expected, the proportional speedup gained from par-
allelization seems to increase with increasing problem
complexity.
running BF grows rapidly. Both SSA and BMSA ap- It should be noted that our SSA realization of the
pear to be quite insensitive to the number of instantiated Gibbs sampling/annealing method uses a heuristic
variables. In these tests, we used 100 MAP problems cooling schedule which does not fulfill the theoreti-
on 10 Bayesian networks with 16 binary variables. cal requirements of the convergence theorem of SA.
In Fig. 5, we plot the behavior of the algorithms as What is more, even as an approximation of the theo-
a function of the increasing configuration space, when retically correct algorithm, the SSA method is not a
the maximum number of variable values is increased. very sophisticated alternative, and there exist several
The test set consisted of 100 MAP problems on 10 elaborate techniques for making the algorithm much
10-node Bayesian networks with half of the variables more efficient than in the experiments here. Neverthe-
clamped in advance. With networks of this size, the less, it can also be argued that the same techniques
SSA algorithm seems to perform only comparably to would probably yield a similar speedup for the BMSA
the BF algorithm. However, when the size of the net- algorithm as well. Consequently, we believe that these
works is increased, the general tendency is clear: the results show that if suitable neural hardware is avail-
exhaustive BF algorithm starts to suffer from combina- able, the BMSA algorithm offers a promising basis for
torial explosion, and fails to provide a computationally building an extremely efficient tool for solving MAP
feasible solution to the MAP problem (see Fig 6). The problems.
SSA and BMSA algorithms, on the other hand, seem From the Bayesian network point of view, the map-
to scale up very well. In Fig. 6, each data point corre- ping presented here provides an efficient implementa-
sponds to a test set consisting of 100 MAP problems tional platform for performing probabilistic reasoning
42 Myllymäki

with the stochastic simulated annealing algorithm. 6. E. Baum, “Towards practical ‘neural’ computation for combi-
From the neural network point of view, the mapping natorial optimization problems,” in: Proceedings of the AIP
offers an interesting opportunity for constructing neu- Conference 151: Neural Networks for Computing, edited by
J. Denker, Snowbird, UT, pp. 53–58, 1986.
ral models from expert knowledge, instead of learn- 7. G. Hinton and T. Sejnowski, “Optimal perceptual inference,” in
ing them from raw data. The resulting neural network Proceedings of the IEEE Computer Society Conference on Com-
could also be used as a (cleverly chosen) initial starting puter Vision and Pattern Recognition, Washington, DC, 1983,
point to some of the existing learning algorithms [8–10, pp. 448–453.
62] for Boltzmann machines, in which case the learn- 8. G. Hinton and T. Sejnowski, Learning and Relearning in
Boltzmann Machines, in [3], pp. 282–317, 1986.
ing problem should become much easier than with a 9. Y. Freund and D. Haussler, “Unsupervised learning of distribu-
randomly chosen initial state. Moreover, the resulting tions on binary vectors using two layer networks,” in Neural In-
“fine-tuned” neural network could also be mapped back formation Processing Systems 4, edited by J. Moody, S. Hanson,
to a Bayesian network representation after the learn- and R. Lippmann, Morgan Kaufmann Publishers: San Mateo,
CA, pp. 912–919, 1992.
ing, which means that the mapping can also be seen as
10. C. Galland, “Learning in deterministic Boltzmann machine net-
a tool for extracting high-level knowledge from neural works,” Ph.D. Thesis, Department of Physics, University of
networks. From the Bayesian network point of view, Toronto, 1992.
this kind of a “fine-tuning” learning process could also 11. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Net-
be useful in detecting mutually inconsistent probabili- works of Plausible Inference, Morgan Kaufmann Publishers:
San Mateo, CA, 1988.
ties, or other inconsistencies in the underlying Bayesian
12. R. Shachter, “Probabilistic inference and influence diagrams,”
network representation. Operations Research, vol. 36, no. 4, pp. 589–604, 1988.
It should also be noted that the results presented here 13. R. Neapolitan, Probabilistic Reasoning in Expert Systems, John
can in principle be used as a computationally efficient, Wiley & Sons: New York, NY, 1990.
massively parallel tool for solving optimization prob- 14. D. Heckerman, D. Geiger, and D. Chickering, “Learning
Bayesian networks: The combination of knowledge and sta-
lems in general, and not only for solving MAP prob-
tistical data,” Machine Learning, vol. 20, no. 3, pp. 197–243,
lems as formulated here. Naturally, the efficiency of 1995.
such an approach would largely depend on the degree 15. M. Henrion, “An introduction to algorithms for inference in be-
to which the function to be maximized can be decom- lief nets,” in Uncertainty in Artificial Intelligence 5, edited by
posed as a linear sum of functions, each depending M. Henrion, R. Shachter, L. Kanal, and J. Lemmer, Elsevier Sci-
ence Publishers B.V.: North-Holland, Amsterdam, pp. 129–138,
only on a small subset of variables (corresponding to
1990.
the cliques in the Bayesian network formalism). Fur- 16. G. Cooper, “The computational complexity of probabilistic in-
ther studies on this subject are left as a goal for future ference using Bayesian belief networks,” Artificial Intelligence,
research. vol. 42, no. 2/3, pp. 393–405, 1990.
17. S. Shimony, “Finding MAPs for belief networks is NP-hard,”
Artificial Intelligence, vol. 68, pp. 399–410, 1994.
Acknowledgments 18. N. Metropolis, A. Rosenbluth, M. Rosenbluth, M. Teller, and
E. Teller, “Equations of state calculations by fast computing
This research has been supported by the Academy of machines,” Journal of Chem. Phys., vol. 21, pp. 1087–1092,
1953.
Finland, Jenny and Antti Wihuri Foundation, Leo and 19. S. Kirkpatrick, D. Gelatt, and M. Vecchi, “Optimization by sim-
Regina Wainstein Foundation, and the Technology De- ulated annealing,” Science, vol. 220, no. 4598, pp. 671–680,
velopment Center (TEKES). 1983.
20. S. Geman and D. Geman, “Stochastic relaxation, Gibbs distri-
butions, and the Bayesian restoration of images,” IEEE Trans-
References actions on Pattern Analysis and Machine Intelligence, vol. 6,
pp. 721–741, 1984.
1. J. Anderson and E. Rosenfeld (Eds.), Neurocomputing: Foun- 21. E. Aarts and J. Korst, Simulated Annealing and Boltzmann Ma-
dations of Research, MIT Press: Cambridge, MA, 1988. chines: A Stochastic Approach to Combinatorial Optimization
2. J. Anderson and E. Rosenfeld (Eds.), Neurocomputing 2: Direc- and Neural Computing, John Wiley & Sons: Chichester, 1989
tions for Research, MIT Press: Cambridge, MA, 1991. 22. L. Ingber, “Very fast simulated re-annealing,” Mathematical
3. D. Rumelhart and J. McClelland (Eds.), Parallel Distributed Computer Modelling, vol. 8, no. 12, pp. 967–973, 1989.
Processing, vol. 1, MIT Press: Cambridge, MA, 1986. 23. L. Ingber, “Adaptive simulated annealing (ASA): Lessons
4. J. McClelland and D. Rumelhart (Eds.), Parallel Distributed learned,” Control and Cybernetics, vol. 25, no. 1, pp. 33–54,
Processing, MIT Press: Cambridge, MA, vol. 2, 1986. 1996.
5. J. Hopfield and D. Tank, “Neural computation of decisions in op- 24. H. Szu and R. Hartley, “Nonconvex Optimization by Fast Sim-
timization problems,” Biological Cybernetics, vol. 52, pp. 141– ulated Annealing,” Proceedings of the IEEE, vol. 75, no. 11,
152, 1985. pp. 1538–1540, 1987.
Massively Parallel Probabilistic Reasoning 43

25. G. Bilbro, R. Mann, and T. Miller, “Optimization by mean field 44. S. Goonatilake and S. Khebbal (Eds.), Intelligent Hybrid Sys-
annealing,” in Advances in Neural Information Processing Sys- tems, John Wiley & Sons: Chichester, 1995.
tems I, edited by D. Touretzky, Morgan Kaufmann Publishers: 45. R. Sun, Integrating Rules and Connectionism for Robust Com-
San Mateo, CA, pp. 91–98, 1989. monsense Reasoning, John Wiley & Sons: Chichester, 1994.
26. C. Peterson and J. Anderson, “A mean field theory learning al- 46. P. Floréen, P. Myllymäki, P. Orponen, and H. Tirri, “Neural
gorithm for neural networks,” Complex Systems, vol. 1, pp. 995– representation of concepts for robust inference,” in Proceedings
1019, 1987. of the International Symposium Computational Intelligence II,
27. J. Alspector, T. Zeppenfeld, and S. Luna, “A volatility measure edited by F. Gardin and G. Mauri, Milano, Italy, pp. 89–98,
for annealing in feedback neural networks,” Neural Computation 1989.
vol. 4, pp. 191–195, 1992. 47. P. Myllymäki, H. Tirri, P. Floréen, and P. Orponen, “Compiling
28. N.R.S. Ansari and G. Wang, “An efficient annealing algorithm high-level specifications into neural networks,” in Proceedings
for global optimization in Boltzmann machines,” Journal of Ap- of the International Joint Conference on Neural Networks, vol. 2,
plied Intelligence, vol. 3, no. 3, pp. 177–192, 1993. Washington, D.C., 1990, pp. 475–478.
29. S. Rajasekaran and J.H. Reif, “Nested annealing: A provable 48. P. Floréen, P. Myllymäki, P. Orponen, and H. Tirri, “Compiling
improvement to simulated annealing,” Theoretical Computer object declarations into connectionist networks,” AI Communi-
Science, vol. 99, pp. 157–176, 1992. cations, vol. 3, no. 4, pp. 172–183, 1990.
30. D. Greening, “Parallel simulated annealing techniques,” Physica 49. L. Shastri, Semantic Networks: An Evidential Formalization and
D, vol. 42, pp. 293–306, 1990. Its Connectionist Realization. Pitman: London, 1988.
31. L. Ingber, “Simulated annealing: Practice versus theory,” Math- 50. J. Pearl, “Fusion, propagation and structuring in belief net-
ematical Computer Modelling, vol. 18, no. 11, pp. 29–57, works,” Artificial Intelligence, vol. 29, pp. 241–288, 1986.
1993. 51. R. Howard and J. Matheson, “Influence diagrams,” in Readings
32. H. Geffner and J. Pearl, “On the probabilistic semantics of con- in Decision Analysis, edited by R.A. Howard and J.E. Matheson,
nectionist networks, Technical Report R-84, UCLA Computer Strategic Decisions Group: Menlo Park, CA, pp. 763–771, 1984.
Science Department, Los Angeles, CA, 1987. 52. F. Jensen, An Introduction to Bayesian Networks. UCL Press:
33. K. Laskey, “Adapting connectionist learning to Bayesian net- London, 1996.
works,” International Journal of Approximate Reasoning, vol. 4, 53. E. Castillo, J. Gutiérrez, and A. Hadi, “Expert systems and prob-
pp. 261–282, 1990. abilistic network models,” Monographs in Computer Science,
34. T. Hrycej, “Common features of neural-network models of high Springer-Verlag: New York, NY, 1997.
and low level human information processing,” in Proceedings 54. T. Hrycej, “Gibbs sampling in Bayesian networks,” Artificial
of the International Conference on Artificial Neural Networks Intelligence, vol. 46, pp. 351–363, 1990.
(ICANN-91), edited by T. Kohonen, K. Mäkisara, O. Simula, 55. S. Lauritzen and D. Spiegelhalter, “Local computations with
and J. Kangas, Espoo, Finland, 1991, pp. 861–866. probabilities on graphical structures and their application to ex-
35. R. Neal, “Connectionist learning of belief networks,” Artificial pert systems,” J. Royal Stat. Soc., Ser. B, vol. 50, no. 2, pp. 157–
Intelligence, vol. 56, pp. 71–113, 1992. 224, 1988. Reprinted as pp. 415–448 in [63].
36. P. Myllymäki, “Mapping Bayesian networks to stochastic neural 56. D. Spiegelhalter, “Probabilistic reasoning in predictive expert
networks: A foundation for hybrid Bayesian-neural systems,” systems,” in Uncertainty in Artificial Intelligence 1, edited by
Ph.D. Thesis, Report A-1995-1, Department of Computer Sci- L. Kanal and J. Lemmer, Elsevier Science Publishers B.V.:
ence, University of Helsinki, 1995. North-Holland, Amsterdam, pp. 47–67, 1986.
37. P. Myllymäki and P. Orponen, “Programming the harmonium,” 57. W. Hastings, “Monte Carlo sampling methods using Markov
in Proceedings of the International Joint Conference on Neural chains and their applications,” Biometrika, vol. 57, pp. 97–109,
Networks, vol. 1, Singapore, 1991, pp. 671–677. 1970.
38. P. Myllymäki, “Bayesian reasoning by stochastic neural net- 58. A. Barker, “Monte Carlo calculations of the radial distribution
works,” Ph.Lic. Thesis, Tech. Rep. C-1993-67, Department of functions for a proton-electron plasma,” Aust. J. Phys., vol. 18,
Computer Science, University of Helsinki, 1993. pp. 119–133, 1965.
39. P. Smolensky, Information Processing in Dynamical Sys- 59. A. de Gloria, P. Faraboschi, and M. Olivieri, “Clustered
tems: Foundations of Harmony Theory, in [3], pp. 194–281, Boltzmann Machines: Massively parallel architectures for con-
1986. strained optimization problems,” Parallel Computing, pp. 163–
40. P. Myllymäki, “Using Bayesian networks for incorporating 175, 1993.
probabilistic a priori knowledge into Boltzmann machines,” 60. D. Johnson, C. Aragon, L. McGeoch, and C. Schevon, “Opti-
in Proceedings of SOUTHCON’94, Orlando, pp. 97–102, mization by simulated annealing: An experimental evaluation;
1994. Part I, graph partitioning,” Operations Research, vol. 37, no. 6,
41. P. Myllymäki, “Mapping Bayesian networks to Boltzmann ma- pp. 865–892, 1989.
chines,” in Proceedings of Applied Decision Technologies 1995, 61. H. Chin and G. Cooper, “Bayesian belief network inference
edited by A. Gammerman, NeuroCOLT Technical Report NC- using simulation,” in Uncertainty in Artificial Intelligence 3,
TR-95-034, London, 1995, pp. 269–280. edited by L. Kanal and J. Lemmer, Elsevier Science Publishers
42. J. Barnden and J. Pollack (Eds.), Advances in Connectionist and B.V.: North-Holland, Amsterdam, pp. 129–147, 1989.
Neural Computation Theory, Vol. I: High Level Connectionist 62. G. Hinton, “Connectionist learning procedures,” Artificial Intel-
Models, Ablex Publishing Company: Norwood, NJ, 1991. ligence, vol. 40, no. 1–3, 1989.
43. G. Hinton, “Special issue on connectionist symbol processing,” 63. G. Shafer and J. Pearl (Eds.), Readings in Uncertain Reasoning,
Artificial Intelligence, vol. 46, no. 1–2, 1990. Morgan Kaufmann Publishers: San Mateo, CA, 1990.
44 Myllymäki

Petri Myllymäki is one of the founders of the Complex Sys-


tems Computation (CoSCo) research group at the Department of
Computer Science of University of Helsinki. For the past ten years,
the CoSCo group has been studying complex systems focusing on
issues related to uncertain reasoning, machine learning and data
mining. Research areas include Bayesian networks and other proba-
bilistic models, information theory, neural networks, case-based rea-
soning, and stochastic optimization methods. Dr. Myllymäki is cur-
rently working at the University of Helsinki as a research scientist
for the Academy of Finland.