

ON LEARNING MIXED BAYESIAN NETWORKS
AAKANKSHA BAPNA, G. SRINIVASARAGHAVAN
Department of Computer Science, IIIT Bangalore, Electronics City, Bangalore, India
E-MAIL: aakanksha.bapna@iiitb.org, gsr@iiitb.ac.in

Abstract:
We propose a novel method for Bayesian learning of the parameters of a mixed belief network. Given the structure of a network, the parameters of the conditional distribution of a node, based on its type (discrete or continuous) and the types of its parents, are learnt from the data. This node-wise updating scheme puts no restriction on the number and type of parents any node can have. We also extended the traditional algorithm for learning pure Gaussian networks to (i) deal with conditional Gaussian nodes, (ii) allow continuous nodes to be multivariate Gaussian and (iii) be able to converge to the actual mean, covariance and weights of the network with which we generated the data.

Keywords:
Mixed Bayesian Networks, Bayesian parameter learning

FIGURE 1. An example of a mixed Bayesian network (nodes A-I). Circles represent continuous nodes and squares represent discrete nodes.

1. Introduction

Bayesian networks are ubiquitous in machine learning applications as they capture the relationships between different variables very well. However, Bayesian network algorithms have traditionally been designed for handling categorical variables, and Bayesian networks have not typically been suitable for dealing with continuous valued variables. There has been reasonable success in handling networks where all the variables are continuous valued, in particular Gaussian. Most real data, however, in many domains like health care, image processing and e-commerce, consists of a mix of categorical (discrete valued) and continuous (Gaussian) variables. Because of the lack of good algorithms to learn Bayesian networks with mixed variables, their usability has remained limited to modeling discrete variables. Works that have attempted to deal with mixed variable types have tried to keep the two kinds of variables separate. For instance, [1] models occurrences of objects as discrete variables and learns mixtures of Gaussians using expectation maximization for the locations of objects (continuous variables). A single Bayesian network that encompasses all the variables and is able to model the inter-dependencies between the continuous and categorical variables would have been ideal in this case. Existing algorithms for parameter learning in mixed Bayesian networks rely on Maximum Likelihood Estimation (MLE).

We propose an algorithm for Bayesian learning of the parameters of a mixed belief network. The algorithm
1. works for any topology of discrete and continuous nodes.
2. handles multivariate continuous nodes.
The rest of the paper is organised as follows. Some previous attempts similar to our work are discussed in Section 2. Details of our learning procedure are given in Section 3. Our results and conclusions are shared in Section 4.

2. Related Works

There have been numerous attempts to learn the parameters of a Bayesian network. Table 1 summarizes some open source libraries for Bayesian networks, compared on the basis of some basic functionalities (C and D, used anywhere in the paper, stand for continuous and discrete nodes respectively). These were the only open source libraries which dealt with mixed Bayesian networks. However, most of them have one or the other drawback. The Bayes Net Toolbox (BNT) by Kevin Murphy [2] proved to be the most powerful one. It deals with pure discrete, pure Gaussian, multivariate Gaussian and even hybrid cases, but carries out only MLE for parameter learning.
TABLE 1. Features supported by a few open source Bayesian network libraries.

| Allows ↓ \ Library → | libpgm (python) | deal (R) | bnlearn (R) | BayesPy (python) | CGBayesNets (matlab) | Bayes Net Toolbox (matlab) |
|---|---|---|---|---|---|---|
| Mixed network | yes | yes (except C to D) | yes (except C to D) | yes | yes (except C to D) | yes |
| Specifying network structure | JSON file | GUI | DAG | directed factor graphs | no | DAG |
| Multivariate continuous nodes | no | no | no | no | no | yes |
| Parameter learning for mixed network | MLE | yes | MLE for mixed, Bayesian only for D | no | obscure (trained from conditional probabilities) | MLE for mixed, Bayesian only for D |
| Inferencing from mixed network | only for D | no | yes | yes | yes | yes |
| Structure learning | yes | no | yes | no | yes | yes |

Bayesian parameter learning is supported only for discrete nodes. Leveraging the fact that BNT already supports inferencing using numerous inference engines and structure learning using all prevalent algorithms, we chose to extend BNT to include generic Bayesian parameter learning for mixed networks. Hence our algorithm is just an add-on over BNT.

A similar attempt in this area is the algorithm by S. G. Bottcher [3], implemented as the R library deal [4]. Bottcher proposed a new master prior procedure for parameter learning for a Bayesian network. Apart from the little ambiguity implicit in this work regarding the distinction between conditional Gaussian and plain Gaussian nodes (since Bottcher's algorithm needs the parameter learning for learning the structure of the network), Bottcher's algorithm in fact does not handle the following cases: (i) fully continuous networks, (ii) multivariate continuous nodes and (iii) discrete nodes with continuous parents. Another one in R is bnlearn [5], which provides Bayesian parameter estimation only for discrete data.

Very recently a matlab package, CGBayesNets [6], was released, which deals with conditional Gaussian networks. The authors claim to have included everything from parameter learning to structure learning and inference, but no clear idea about the parameter learning algorithm they used can be found in their paper.

Two popular Bayesian network libraries in python are libpgm [7] (developed by students under Daphne Koller) and BayesPy [8] (which provides tools for Bayesian inference). However, both of them lack the most essential feature of Bayesian parameter estimation.

Some other works in this field, whose implementations have not been open sourced yet, are described below.

In an attempt proposed in [9], the continuous nodes are discretized using quantized intervals of attribute values to yield generalized techniques for learning mixed Bayesian networks. However, the accuracy of inference using networks learnt this way is directly affected by the widths of the quantization intervals. Acceptable accuracy invariably requires quantization at very fine intervals, and that in turn makes the learning and inference rather slow.

Davis and Moore [10] propose a different interpretation of a Bayesian network. They model a Bayesian network as low-dimensional mixtures of Gaussians. These mixtures of Gaussians over different subsets of the domain variables are combined into a coherent joint probability model over the entire domain. But it stores a lot of redundant information per node and also puts a restriction on the dimensionality of the data. It does not allow the discrete variables to take many distinct values. In a recent paper [11], Krauthausen and Hanebeck proposed an MLE algorithm for learning hybrid Bayesian networks with Gaussian mixture and Dirac mixture conditional densities from data, given the network structure.

3. Detailed Learning Procedure

Our algorithm sequentially updates the parameters of the nodes in the network, one node at a time. The update step however depends on the configuration of the parents of the node whose parameters are being computed. The conditional distribution of any node based on any possible parent configuration is shown in Table 2.
TABLE 2. The probability distributions of a node based on its type (discrete or continuous) and the different types of parents it can have.

| Nodes ↓ \ Parents → | None | Only discrete | Only continuous | Mixed |
|---|---|---|---|---|
| Discrete | Dirichlet | Dirichlet | Softmax | Conditional Softmax |
| Continuous | Gaussian | Conditional Gaussian | Linear Gaussian | Conditional Linear Gaussian |

Our algorithm successfully learns the hyperparameters for each of these distributions. The Markov property of the Bayesian network ensures that the parameter learning for a node depends only on itself and its parents. Let F_i = \{i\} \cup cpa(i) \cup dpa(i), where cpa(i) and dpa(i) are the continuous and discrete parents of node i respectively.

3.1 Discrete Case

There are two methods for learning the parameters of a discrete node i, based on whether F_i has continuous nodes or not.

3.1.1 Pure Discrete case

This method is used when a discrete node has discrete or no parents, so that Pa(i) = dpa(i). According to the theory given by Heckerman in [12], for any discrete node i, if j is one of the q_i configurations the parents of i can be in, its conditional distribution given the structure S is

  Pr(X_i = x_k \mid pa(i)^j, \theta_i, S) = \frac{\alpha_{ijk} + N_{ijk}}{\sum_{k=1}^{s_i} (\alpha_{ijk} + N_{ijk})} = \theta_{ijk}

where \alpha_{ijk} is the initial count (representing our prior belief) and N_{ijk} is the number of cases in the data D with X_i = x_k and Pa(i) = pa(i)^j. Updating \theta_{ijk} for all parent configurations j = 1..q_i and all values k = 1..s_i that this node can take yields an updated \theta_i. This \theta_i vector directly corresponds to the conditional probability table stored for a discrete node in BNT.
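As an illustration (a generic Python sketch, not the authors' BNT/Matlab code; all names are ours), the update amounts to adding the data counts N_ijk to the prior counts alpha_ijk and renormalising each row of the table:

```python
import numpy as np

def update_discrete_cpt(alpha, parent_cfg, child_val):
    """Posterior CPT theta_ijk for one discrete node.

    alpha      : (q_i, s_i) prior Dirichlet counts alpha_ijk
    parent_cfg : length-M array; index j of the parent configuration in each case
    child_val  : length-M array; index k of the node's value in each case
    """
    counts = np.zeros_like(alpha, dtype=float)          # N_ijk
    np.add.at(counts, (parent_cfg, child_val), 1)
    posterior = alpha + counts                           # alpha_ijk + N_ijk
    return posterior / posterior.sum(axis=1, keepdims=True)
```

With alpha = np.ones((q_i, s_i)) (prior ignorance) this reduces to Laplace-smoothed relative frequencies.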

3.1.2 Mixed case

When a discrete node has continuous parents we need to find a softmax function in order to derive the conditional probabilities. Such nodes are called softmax nodes in BNT. A softmax function for a node i, given that its continuous parents take the value x, is given as

  Pr(X_i = k \mid cpa(i) = x) = \frac{\exp(w(:,k)^T x + b(k))}{\sum_j \exp(w(:,j)^T x + b(j))}    (1)

The parameters of a softmax node, w(:,k) and b(k), k = 1..s_i, have the following interpretation: w(:,k) - w(:,j) is the normal vector to the decision boundary between classes k and j, and b(k) - b(j) is its offset (bias).

If a softmax node also has discrete parents (e.g. parent nodes E, F and child node I in Figure 1), we need to find a different set of w/b parameters for every configuration of these discrete parents.
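For concreteness, a softmax CPD as in eq. (1) can be evaluated as below; this is a hedged, generic sketch (illustrative names, not BNT's internals), with the per-configuration storage for conditional softmax nodes indicated only in a comment:

```python
import numpy as np

def softmax_cpd(w, b, x):
    """Pr(X_i = k | cpa(i) = x) for a softmax node, as in eq. (1).

    w : (p, K) matrix; column w[:, k] is the weight vector for class k
    b : (K,) bias vector
    x : (p,) values of the continuous parents
    """
    z = w.T @ x + b            # z_k = w(:,k)^T x + b(k)
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# A conditional softmax node (one that also has discrete parents) simply keeps
# one (w, b) pair per discrete-parent configuration, e.g. params[j] = (w_j, b_j).
```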
3.2 Continuous Case

First we will discuss the conventional procedure for learning Gaussian Bayesian networks and how we extended it for the multivariate case.

3.2.1 Pure Gaussian Case

The algorithm for learning a Gaussian Bayesian network by converting it to a multivariate Gaussian distribution was first given by Geiger and Heckerman [13]. The detailed procedure for this is also described in [14], Section 7.2.3. We extended this approach to handle multivariate Gaussian nodes as well. It was easy to extend because a network with multivariate Gaussian nodes also leads to a multivariate Gaussian distribution. A few changes were made to the way the joint mean, joint precision and weights are calculated, since every node in the network now is of size s_i (i.e. the dimensionality of each node). In our procedure, we learn the parameters for one node at a time. Hence the subset of nodes of the network considered for the algorithm described below is F_i. Suppose the number of nodes in this family is n; the total size of this family will be S_{F_i} = \sum_{j=1}^{n} s_j. The joint mean vector \mu of this family will be

  \mu = \begin{pmatrix} \mu_{11} & \ldots & \mu_{1 s_1} & \mu_{21} & \ldots & \mu_{2 s_2} & \ldots & \mu_{n1} & \ldots & \mu_{n s_n} \end{pmatrix}^T    (2)

In the theory given for univariate nodes in [14], the variance is now replaced by covariance matrices \Sigma_j (and hence precision matrices t_j = \Sigma_j^{-1}), and the weight matrices b_j are scaled by s_j for every node j in F_i.

To form the joint precision matrix T, we used Kevin Murphy's way as given in [15], as it goes well with multivariate nodes and the BNT package. Hence the joint precision matrix is calculated as

  T = (I - B) D^{-1} (I - B)^T    (3)

where I is an S_{F_i} \times S_{F_i} identity matrix, B is the global weight matrix and D is the combined covariance matrix, block diagonal with the individual covariance matrices on the diagonal.
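To illustrate how eqs. (2) and (3) fit together, here is a minimal Python sketch (not the authors' Matlab/BNT code). It assumes the convention that the (parent, child) block of B holds the transpose of the child-on-parent regression weights, which makes eq. (3) reproduce the usual Gaussian-network precision; all names are ours.

```python
import numpy as np
from scipy.linalg import block_diag

def joint_gaussian_prior(means, covs, weights, parents):
    """Joint mean and precision of a (multivariate) Gaussian family, eqs. (2)-(3).

    means   : list of per-node mean vectors mu_j (length s_j), in topological order
    covs    : list of per-node covariance matrices Sigma_j (s_j x s_j)
    weights : dict (child, parent) -> (s_child x s_parent) regression weight block
    parents : parents[j] = list of indices of the continuous parents of node j
    """
    sizes = [len(m) for m in means]
    offsets = np.cumsum([0] + sizes)          # start of each node's block
    S = offsets[-1]

    mu = np.concatenate(means)                # eq. (2)

    B = np.zeros((S, S))                      # global weight matrix
    for child, pa in enumerate(parents):
        for p in pa:
            B[offsets[p]:offsets[p] + sizes[p],
              offsets[child]:offsets[child] + sizes[child]] = weights[(child, p)].T

    D = block_diag(*covs)                     # block-diagonal covariance matrix
    I = np.eye(S)
    T = (I - B) @ np.linalg.inv(D) @ (I - B).T    # eq. (3)
    return mu, T
```

For univariate nodes the block_diag construction collapses to a diagonal matrix of variances, recovering the scalar construction of [14].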
After having found the initial parameters of this joint distribution, the next step is to update these parameters from the data. As mentioned in Theorem 7.27 of [14], a multivariate normal distribution with unknown mean and variance has, as the prior joint distribution of \mu and T, a multivariate Normal-Wishart distribution, and the distribution of a sample drawn from such a distribution is a multivariate t-distribution. The variables for the prior t-distribution are initialised as:

  \mu = mean of the hypothetical sample upon which we base our prior belief; this is the joint mean of the prior network found in eq. (2)
  \nu = size of the hypothetical sample
  \alpha = usually \nu - 1
  \beta = joint covariance of the hypothetical sample (it is not the inverse of the precision of the t-distribution) = \frac{\nu(\alpha - n + 1)}{\nu + 1} T^{-1}

The data D is sampled from the Bayesian network B_{real} which is to be learnt. D_{F_i} \subseteq D contains the data for the family of node i. Next, we find the mean (\bar{x}) and covariance (s) of D_{F_i}. Then, with M denoting the number of cases in D_{F_i}, the parameters of the posterior t-distribution are found as:

  \mu^* = \frac{\nu \mu + M \bar{x}}{\nu + M}    (4)
  \nu^* = \nu + M    (5)
  \alpha^* = \alpha + M    (6)
  \beta^* = \beta + s + \frac{\nu M}{\nu + M} (\bar{x} - \mu)(\bar{x} - \mu)^T    (7)

The \mu^* readily gives us the individual means using equality (2). For the other parameters, we find the joint precision matrix from the updated joint covariance matrix of the t-distribution as

  T^* = \frac{\nu^*(\alpha^* - n + 1)}{\nu^* + 1} (\beta^*)^{-1}    (8)

After this we just need to factorise this precision matrix into global weight and covariance matrices using equation (3), from which we can easily get the weights and covariance for every node in the network. But the factorization is not very simple; hence we used the recursive method shown in [14] for the same.
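Eqs. (4)-(8) form a standard conjugate-style update; the sketch below is our reading of them, assuming M is the number of cases in D_{F_i} and s is the scatter matrix of those cases about their mean (as in the analogous theorem in [14]). It is not the authors' implementation.

```python
import numpy as np

def posterior_t_params(mu, nu, alpha, beta, data):
    """Posterior t-distribution parameters for one family F_i, eqs. (4)-(8).

    mu, nu, alpha, beta : prior parameters (mu is the joint mean of eq. (2))
    data                : (M, S) array, one row per case for the family F_i
    """
    M, S = data.shape
    xbar = data.mean(axis=0)
    diffs = data - xbar
    s = diffs.T @ diffs                                         # scatter matrix

    mu_star = (nu * mu + M * xbar) / (nu + M)                   # eq. (4)
    nu_star = nu + M                                            # eq. (5)
    alpha_star = alpha + M                                      # eq. (6)
    d = (xbar - mu).reshape(-1, 1)
    beta_star = beta + s + (nu * M / (nu + M)) * (d @ d.T)      # eq. (7)

    n = S  # dimension of the family, as used in eq. (8)
    T_star = (nu_star * (alpha_star - n + 1) / (nu_star + 1)) * np.linalg.inv(beta_star)  # eq. (8)
    return mu_star, nu_star, alpha_star, beta_star, T_star
```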
3.2.2 Mixed case

The mixed case for continuous nodes is the conditional Gaussian and conditional linear Gaussian case. For such a node, given that the value of its continuous parents is X, the value it takes is calculated as

  Y = \mathcal{N}(\mu_{i,k} + b_{i,k} X, \Sigma_{i,k})    (9)

where the discrete parents are in their k-th configuration. The learning process remains the same as above; the only change is that multiple means, covariances and weights have to be learnt now. The number of partitions is equal to the total number of configurations of the discrete parents (d). We split the data for the continuous nodes in this family into d partitions, each partition corresponding to one discrete parent configuration, and run our algorithm for learning a GBN on every part independently. But multiple parameters are stored only for the child node, not for the parent nodes.

3.3 Ambiguities in continuous case

We encountered a few ambiguities in the procedure mentioned in the last section. Hence we devised some changes in certain specifications or sub-procedures in order to resolve these ambiguities. These are discussed here.

3.3.1 Conditional Gaussian parents

When a continuous parent has a conditional Gaussian distribution it will have multiple means, but in our procedure we just use one mean for a parent. For example, in Figure 1, node G has only continuous parents D and E. For node D we would have learnt a single mean, but for node E we would have learnt two means. Therefore, for a node y with a conditional Gaussian parent i whose discrete parents can be in j possible configurations, the weighted mean of i which will be used to update the node y is given as

  \mu_i = \sum_j p_j \mu_j    (10)

where p_j is the probability of the parents of i being in their j-th configuration, calculated from the data, and \mu_j is the mean learnt for that node in that case.
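Eq. (10) is a simple probability-weighted average over the parent's discrete-parent configurations; a minimal sketch, with p_j estimated as empirical frequencies (illustrative names, not the authors' code):

```python
import numpy as np

def collapsed_parent_mean(config_counts, config_means):
    """Weighted mean of a conditional Gaussian parent, eq. (10).

    config_counts : (d,) counts of each discrete-parent configuration in the data
    config_means  : (d, s) learnt mean of the parent under each configuration
    """
    p = np.asarray(config_counts, dtype=float)
    p /= p.sum()                            # empirical p_j
    return p @ np.asarray(config_means)     # sum_j p_j * mu_j
```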
3.3.2 Base mean for Inference

While inferencing from a network, the value of a continuous node with a linear Gaussian distribution, given that the value of its continuous parents is X, is calculated as

  Y = \mathcal{N}(\mu_i^{base} + b_i X, \Sigma_i)    (11)

Hence this \mu_i^{base} should be stored for a node i, but from the above algorithm we get the full mean \mu_i^{full}. Therefore, not only the mean and covariance but also the weights should be learnt as accurately as possible in order to get an accurate \mu_i^{base} (= \mu_i^{full} - b_i X). Learning accurate weights had not been paid much attention in any of the previous works. (We explain how we did it in the next section.) This X while inferencing is the value known for that parent, but while learning we can use the mean of the parent (\mu_{cpa(i)}^{full}). Hence we store both base and full means for continuous nodes.

3.3.3 Accurate mean, covariance and weights

It was seen that the actual means, weights and covariances of the network B_{real} with which the data was generated cannot be found in one run of the above algorithm. Some additions to the prior information brought the posterior a bit closer to the real values, but that is not always possible. Only in the prior ignorance case were the mean and covariance found to be good in one run; however, the weights were not very accurate. We ran the posterior-finding sub-routine, which executes equations (4) to (7), iteratively with the updated \mu, \nu, \alpha and \beta. In our experiments it was seen that the accuracy of the parameters learnt (as measured by the accuracy of the predictions made using the learnt parameters) improves significantly, and the parameters do converge to stable values, when the posterior-finding sub-routine is run iteratively with the parameters learnt in one iteration used as priors for the subsequent iteration. In the experiments described in Section 4 below, we stopped the iterations when the absolute difference \mu^* - \mu between the posterior joint mean and the prior joint mean goes below a threshold (0.009 for our experiments).
periments). R was computed as (#data-points having value in R) / (total
no. of data-points). This was compared against the predicted
4 Results and Conclusion cumulative density for the range
Z
Our system is very fast in learning parameters for small net- N (µ, σ)
R
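The evaluation described above can be expressed compactly; the following is our illustrative reading of it (0.05 tolerance for CPT entries, five bins across \mu \pm 2.5\sigma for continuous nodes), not the authors' test script:

```python
import numpy as np
from scipy.stats import norm

def cpt_accuracy(cpt_train, cpt_test, tol=0.05):
    """Fraction of CPT entries whose train/test probabilities differ by less than tol."""
    diff = np.abs(np.asarray(cpt_test) - np.asarray(cpt_train))
    return np.mean(diff < tol)

def continuous_node_check(samples, mu, sigma, n_bins=5, half_width=2.5):
    """Empirical bin probabilities vs. predicted Gaussian mass over mu +/- 2.5*sigma."""
    edges = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n_bins + 1)
    empirical, _ = np.histogram(samples, bins=edges)
    empirical = empirical / len(samples)
    predicted = np.diff(norm.cdf(edges, loc=mu, scale=sigma))
    return empirical, predicted
```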
Our experimental results are shown in Table 3, in which d, sf and cg represent pure discrete, softmax and conditional Gaussian nodes respectively. More details about the attributes of the data can be found in [4].

TABLE 3. ksl test results.

| Node | Type | Accuracy (%) |
|---|---|---|
| fev | cg | 100 |
| kol | cg | 100 |
| hyp | sf | 64 |
| logBMI | cg | 100 |
| smok | d | 87.5 |
| alc | d | 37.5 |
| work | d | 75 |
| sex | d | 100 |
| year | d | 100 |

It can be seen in the table that, except for the softmax type, all the rest were learnt efficiently. Even where the accuracy for alc has dropped to 37.5%, the difference between test and train probabilities was not more than 0.15. Our tests on other complex mixed networks with multivariate Gaussian nodes also showed close to 80% accuracy.

We also compared our results with the output produced by deal on the same dataset, with the same initial network structure specified. For discrete nodes, the counts learnt by our system were exactly the same as those learnt by deal. For continuous nodes, the means and covariances were almost the same as those learnt by deal, but the weights for the network were difficult to decipher from the parameters output by deal; it would be tough to use the parameters output as is for inferencing. For softmax nodes (discrete nodes with continuous parents), deal returns means and other parameters as if they were continuous nodes. There seems to be no reason for it (however, the manual mentions that deal does not allow edges from continuous to discrete nodes). If the parameter 'cholesterol' were split into HDL and LDL cholesterol, this would require creating a new node in deal and copying all the connections of the node cholesterol for it, whereas in our system it would just require increasing the size of the node 'cholesterol' to 2. We did not compare our results with any other library because those did not have a Bayesian parameter learning algorithm for mixed networks.

We plan to extend our work to deal with more complex dependencies between continuous variables. Our existing system is flexible enough to allow for such changes, which will make it more useful in current day problems.

References

[1] M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan, "Example-based synthesis of 3d object arrangements," ACM Transactions on Graphics (TOG), vol. 31, no. 6, p. 135, 2012.
[2] K. P. Murphy, Bayes Net Toolbox for Matlab, 1997-2002.
[3] S. Bottcher, "Learning bayesian networks with mixed variables," Proceedings of the Eighth International Workshop in Artificial Intelligence and Statistics, 2001.
[4] S. G. Bottcher and C. Dethlefsen, Learning Bayesian Networks with R, 2003.
[5] M. Scutari, "Learning bayesian networks with the bnlearn R package," 2010-2015.
[6] M. J. McGeachie, H.-H. Chang, and S. T. Weiss, "CGBayesNets: Conditional gaussian bayesian network learning and inference with mixed discrete and continuous data," PLoS Computational Biology, 2014.
[7] C. Cabot, libpgm, 2012.
[8] J. Luttinen, "BayesPy: variational Bayesian inference in Python," JMLR, 2015.
[9] S. Monti and G. F. Cooper, "Learning hybrid bayesian networks from data," in Learning in Graphical Models (M. Jordan, ed.), vol. 89 of NATO ASI Series, pp. 521-540, Springer Netherlands, 1998.
[10] S. Davies and A. Moore, "Mix-nets: Factored mixtures of gaussians in bayesian networks with mixed continuous and discrete variables," 2000.
[11] P. Krauthausen and U. Hanebeck, "Parameter learning for hybrid bayesian networks with gaussian mixture and dirac mixture conditional densities," in American Control Conference (ACC), 2010, pp. 480-485, June 2010.
[12] D. Heckerman, "A tutorial on learning with bayesian networks," tech. rep., Microsoft Research, March 1995.
[13] D. Geiger and D. Heckerman, "Learning gaussian networks," in Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, UAI'94, 1994.
[14] R. E. Neapolitan, Learning Bayesian Networks. Prentice Hall, 2004.
[15] K. P. Murphy, "Inference and learning in hybrid bayesian networks," Tech. Rep. 990, University of California Berkeley, Dept. of Comp. Sci., 1998.
