Stefan Conrady, stefan.conrady@bayesia.us
September 3, 2013
Introduction to Bayesian Networks & BayesiaLab
Table of Contents
Introduction
References
Contact Information
Bayesia USA
Bayesia Singapore Pte. Ltd.
Bayesia S.A.S.
Copyright
Introduction
With Professor Judea Pearl receiving the prestigious 2011 A.M. Turing Award, Bayesian networks have pre-
sumably received more public recognition than ever before. Judea Pearl’s achievement of establishing Bayes-
ian networks as a new paradigm is fittingly summarized by Stuart Russell:
“[Judea Pearl] is credited with the invention of Bayesian networks, a mathematical formalism for
defining complex probability models, as well as the principal algorithms used for inference in these
models. This work not only revolutionized the field of artificial intelligence but also became an im-
portant tool for many other branches of engineering and the natural sciences. He later created a
mathematical framework for causal inference that has had significant impact in the social sciences.”
While their theoretical properties made Bayesian networks immediately attractive for academic research,
especially with regard to the study of causality, the arrival of practically feasible machine learning
algorithms has allowed Bayesian networks to grow beyond their origin in the field of computer science. Since the
first release of the BayesiaLab software package in 2001, Bayesian networks have finally become accessible
to a wide range of scientists and analysts for use in many other disciplines.
In this introductory paper, we present Bayesian networks (the paradigm) and BayesiaLab (the software
tool), from the perspective of the applied researcher.
In Chapter 1 we begin with the role of Bayesian networks in today’s world of analytics, juxtaposing them
with traditional statistics and more recent innovations in data mining.
Once we establish how Bayesian networks fit into the proverbial big picture, we present in Chapter 2 the
mathematical formalism that underpins this paradigm. While employing Bayesian networks for research has
become remarkably easy with BayesiaLab, we need to emphasize the importance of their theory. Only a deep
understanding of this theory will allow researchers to fully appreciate the wide-ranging benefits of Bayesian
networks.
Finally, in Chapter 3, we provide an overview of the BayesiaLab software platform, which leverages the
Bayesian network paradigm to a far greater extent than any other tool that has ever been available in this
field. We show how the theoretical properties of Bayesian networks translate into an extremely powerful
and universal research tool for many fields of study, ranging from bioinformatics to marketing science and
beyond.
Such context is particularly important given the attention that Big Data and related technologies receive
these days. Their dominance in terms of publicity perhaps drowns out many other important
methods of scientific inquiry.
Equally important is positioning Bayesian networks vis-à-vis traditional parametric statistical methods,
which supported a myriad of scientific advances in the 20th century and continue to serve as valid
and valuable tools for researchers today.
The x-axis reflects the Modeling Purpose, ranging from Association/Correlation to Causation. Tags on the
x-axis furthermore indicate a conceptual progression that includes description, prediction, explanation,
simulation, and optimization.
The y-axis shows the Model Source, or, more precisely, the source of the model specification: on the one
end, we have Theory as the source; on the other end, we have Data as the source. Theory is furthermore
tagged with Parametric as the prevalent modeling approach, and Human Intelligence, indicating the origin
of Theory.
On the opposite end of the y-axis, Data is associated with Machine Learning and Artificial Intelligence. It is
also tagged with Algorithmic, to highlight the contrast with the mostly parametric modeling generated from
theory.
[Figure 1: Map of analytic modeling. Y-axis: Model Source, with Data (Machine Learning, Artificial
Intelligence, Algorithmic) at one end. X-axis: Modeling Purpose, from Association/Correlation to Causation.
Quadrants Q1 to Q4. Labeled regions: Predictive Modeling (Scoring, Forecasting, Classification),
Explanatory Modeling, Operations Research.]
Needless to say, this is a highly simplified view of the world, and readers can rightfully point out the limita-
tions of this presentation.1 Despite this caveat, we will now use the proposed map and its coordinate system
to position different modeling approaches.
$$\underbrace{Y}_{\text{of interest}} = f(X).$$
Neural networks are a typical example of implementing machine learning techniques in this context.
Such models are often devoid of any theory; however, they can be excellent “statistical devices”
for producing predictions.
1 For instance, one could easily expand this overview by adding a third dimension, perhaps including the type of pa-
rameter estimation. With such an additional axis, one could differentiate “frequentist” and “Bayesian” estimation
methods.
$$Y = \underbrace{f}_{\text{of interest}}(X).$$
Traditional statistical techniques, which have an explanatory purpose and that are used in epidemi-
ology and the social sciences, would mostly belong in Quadrant 4. Regressions are the best-known
models in this context. Extending further into the causal direction, we would progress into the field
of operations research, including simulation and optimization.
Despite the diverging objectives of Predictive Modeling and Explanatory Modeling, i.e. predicting Y versus
learning f, the respective methods are not necessarily incompatible. In Figure 1, this is suggested by the blue
boxes that gradually fade out as they cross the boundaries and extend beyond their “home” quadrant.
However, the best-performing modeling approaches rarely serve predictive and explanatory purposes
equally well. In many situations, the optimal fit-for-purpose models remain very distinct from each other. In
fact, Shmueli (2010) has shown that structurally “less true” models can yield better predictive performance
than the “true” explanatory model.
We should also point out that recent advances in machine learning and data mining have mostly occurred in
Quadrant 2, and thus disproportionately benefitted predictive modeling. Unfortunately, most machine-
learned models are also remarkably difficult to interpret in terms of their structural meaning, so new theo-
ries are rarely generated this way. For instance, the well-known Netflix Prize competition generated
phenomenally-performing predictive models, but they yielded little explanatory insight into consumers’
choices.
Conversely, in Quadrant 4, deliberately machine learning explanatory models remains rather difficult. As
opposed to Quadrant 2, the availability of ever-increasing amounts of data is not even a major advantage in
the discovery of theory through machine learning.
Also, due to their graphical structure, machine-learned Bayesian networks are intuitively interpretable, thus
facilitating human learning and theory building. As emphasized by the bidirectional arc in Figure 2, Bayes-
ian networks allow human learning and machine learning to interact efficiently. This way, Bayesian net-
works can be developed from a combination of human and artificial intelligence.
[Figure 2: Bayesian Networks on the Model Source axis, connecting Machine Learning (Data, Artificial
Intelligence, Algorithmic) and Human Learning (Theory, Human Intelligence, Parametric).]
[Figure 3: Bayesian Networks on the full map. Axes: Model Source (Theory to Data) and Model Purpose
(Association/Correlation to Causation). Quadrants Q1 to Q4. Labels: Machine Learning, Human Learning,
Causal Assumptions, "Reasoning".]
As a result, Bayesian networks are a highly versatile modeling framework, making them suitable for many
problem domains. The mathematical formalism underpinning the Bayesian network paradigm will be pre-
sented in the next chapter, Bayesian Network Theory.
Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated
case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and
marginal probabilities of events A and B, provided that the probability of B does not equal zero:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense
that it does not take into account any information about B; however, the event B need not occur after
event A. In the nineteenth century, the unconditional probability P(A) in Bayes’ rule was called the
“antecedent” probability; in deductive logic, the antecedent set of propositions and the inference rule imply
consequences. The unconditional probability P(A) was called “a priori” by Ronald A. Fisher.
• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is
derived from or depends upon the specified value of B.
Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of event
A given B is related to the converse conditional probability of B given A.
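As a small numerical illustration of the theorem, the following sketch computes P(A|B) for a binary event A, obtaining P(B) from the law of total probability. The function name, the scenario, and all numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) via Bayes' theorem for a binary event A."""
    # P(B) by the law of total probability; the theorem requires P(B) != 0.
    p_b = likelihood_b_given_a * prior_a + likelihood_b_given_not_a * (1.0 - prior_a)
    if p_b == 0.0:
        raise ValueError("P(B) must not be zero")
    return likelihood_b_given_a * prior_a / p_b


# Hypothetical numbers: A = "device is faulty", B = "alarm sounds".
p_a_given_b = posterior(prior_a=0.01,
                        likelihood_b_given_a=0.95,
                        likelihood_b_given_not_a=0.05)
print(round(p_a_given_b, 3))
```

Even with a fairly reliable alarm, the posterior stays well below one half here because the prior P(A) is small.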
The initial development of Bayesian networks in the late 1970s was motivated by the need to model the
top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for
bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian
networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc
rule-based schemes.
2 For the technical portion of this introduction, we defer to the words of Judea Pearl, who originally coined the term
“Bayesian network.” We are grateful to him for allowing us to use and adapt large sections from one of his technical
reports for our purposes (Pearl and Russell, 2000).
The nodes in a Bayesian network represent variables of interest (e.g. the temperature of a device, the gender
of a patient, a feature of an object, the occurrence of an event) and the links represent statistical (informa-
tional)3 or causal dependencies among the variables. The dependencies are quantified by conditional prob-
abilities for each node given its parents in the network. The network supports the computation of the poste-
rior probabilities of any subset of variables given evidence about any other subset.
Figure 4: A Bayesian network representing the statistical relationship between two variables.
Figure 5 illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in
Figure 4, the diagram in Figure 5 describes the causal relationships among the season of the year (X1),
whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether
the pavement is slippery (X5). Here the absence of a direct link between X1 and X5, for example, captures
our understanding that there is no direct influence of season on slipperiness — the influence is mediated by
the wetness of the pavement (if freezing is a possibility then a direct link could be added).
3 “informational” and “statistical” are treated here as equivalent concepts and can be used interchangeably.
Perhaps the most important aspect of Bayesian networks is that they are direct representations of the
world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the
flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes
can operate on Bayesian networks by propagating information in any direction. For example, if the sprin-
kler is on, then the pavement is probably wet (deduction, prediction, simulation); if someone slips on the
pavement, that also provides evidence that it is wet (abduction, reasoning to a probable cause or diagnosis).
On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or
that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that
it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to
model in rule-based systems and neural networks in any natural way, because it seems to require the propa-
gation of information in two directions.
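The explaining-away pattern described above can be checked numerically. The sketch below is a deliberate simplification: it uses only three variables (rain, sprinkler, wet pavement), treats rain and sprinkler as independent causes without the season node, and uses made-up probabilities, since the source gives no numbers.

```python
from itertools import product

# Hypothetical parameters (not from the source).
P_RAIN = 0.2
P_SPRINKLER = 0.3
# P(wet = True | rain, sprinkler)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.01}


def joint(rain, sprinkler, wet):
    """Joint probability of one complete assignment."""
    p = (P_RAIN if rain else 1 - P_RAIN) * (P_SPRINKLER if sprinkler else 1 - P_SPRINKLER)
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)


def prob_rain(**evidence):
    """P(rain = True | evidence), summing the joint over all consistent worlds."""
    num = den = 0.0
    for rain, sprinkler, wet in product([True, False], repeat=3):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": wet}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rain, sprinkler, wet)
        den += p
        if rain:
            num += p
    return num / den


prior = prob_rain()
after_wet = prob_rain(wet=True)                               # abduction
after_wet_and_sprinkler = prob_rain(wet=True, sprinkler=True)  # explaining away
print(prior, after_wet, after_wet_and_sprinkler)
```

Observing the wet pavement raises the probability of rain; additionally observing that the sprinkler is on lowers it again, even though rain and sprinkler are independent a priori in this toy model.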
strictly more expressive than other temporal probability models such as hidden Markov models and Kalman
filters.
Probabilistic Semantics
Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint prob-
ability distribution — the probability of every possible event as defined by the combination of the values of
all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by
factoring the joint distribution into local, conditional distributions for each variable given its parents. If xi
denotes some value of the variable Xi and pai denotes some set of values for the parents of Xi, then P(xi|pai)
denotes this conditional distribution. For example, P(x4|x2,x3) is the probability of wetness given the values
of rain and sprinkler. The global semantics of Bayesian networks specifies that the full joint distribution is
given by the product
$$P(x_1, \ldots, x_n) = \prod_{i} P(x_i \mid pa_i). \qquad (1)$$
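As a worked illustration, the product of local conditional distributions can be written out for the five-variable sprinkler network of Figure 5 (season X1, rain X2, sprinkler X3, wet pavement X4, slippery pavement X5). The parameter count in the comments assumes, purely for illustration, that every variable is binary; the source does not state the number of states per variable.

```latex
P(x_1, x_2, x_3, x_4, x_5)
  = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1)\, P(x_4 \mid x_2, x_3)\, P(x_5 \mid x_4)
% Free parameters if all five variables are binary (illustrative assumption):
%   full joint table:  2^5 - 1 = 31
%   factored form:     1 + 2 + 2 + 4 + 2 = 11
```

The saving grows quickly with the number of variables, as long as each node has only a few parents.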
There is also an entirely equivalent local semantics, which asserts that each variable is independent of its
nondescendants in the network given its parents. For example, the parents of X4 in Figure 7 are X2 and X3
and they render X4 independent of the remaining nondescendant, X1. That is,
$$P(x_4 \mid x_1, x_2, x_3) = P(x_4 \mid x_2, x_3).$$
Figure 7: Variable X4 is independent of its non-descendants, in this case X1, given its parents, X3 and X2
The collection of independence assertions formed in this way suffices to derive the global assertion in Equa-
tion 1, and vice versa. The local semantics is most useful in constructing Bayesian networks, because select-
ing as parents all the direct causes (or direct relationships) of a given variable invariably satisfies the local
conditional independence conditions. The global semantics leads directly to a variety of algorithms for rea-
soning.
Evidential Reasoning
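From the product specification in Equation 1, one can express the probability of any desired proposition in
terms of the conditional probabilities specified in the network. For example, the probability that the
sprinkler is on given that the pavement is slippery is
$$
\begin{aligned}
P(X_3 = \text{on} \mid X_5 = \text{true})
&= \frac{P(X_3 = \text{on}, X_5 = \text{true})}{P(X_5 = \text{true})} \\
&= \frac{\sum_{x_1, x_2, x_4} P(x_1, x_2, X_3 = \text{on}, x_4, X_5 = \text{true})}
        {\sum_{x_1, x_2, x_3, x_4} P(x_1, x_2, x_3, x_4, X_5 = \text{true})} \\
&= \frac{\sum_{x_1, x_2, x_4} P(x_1)\,P(x_2 \mid x_1)\,P(X_3 = \text{on} \mid x_1)\,P(x_4 \mid x_2, X_3 = \text{on})\,P(X_5 = \text{true} \mid x_4)}
        {\sum_{x_1, x_2, x_3, x_4} P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(X_5 = \text{true} \mid x_4)}
\end{aligned}
$$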
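The two sums can be evaluated directly by enumerating all assignments. The following is a minimal sketch of such brute-force enumeration for the sprinkler network; the conditional probability tables are hypothetical (the source specifies only the structure), and the season is reduced to two states for brevity.

```python
from itertools import product

# Hypothetical CPTs; X1 (season) takes values "dry"/"wet", all others are Boolean.
P_X1 = {"dry": 0.5, "wet": 0.5}
P_X2 = {"dry": 0.1, "wet": 0.6}            # P(rain = True | season)
P_X3 = {"dry": 0.7, "wet": 0.1}            # P(sprinkler = on | season)
P_X4 = {(True, True): 0.99, (True, False): 0.9,
        (False, True): 0.9, (False, False): 0.0}   # P(wet = True | rain, sprinkler)
P_X5 = {True: 0.8, False: 0.0}             # P(slippery = True | wet)


def bern(p_true, value):
    """Probability of a Boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3, x4, x5):
    """Product of the local conditional distributions for one full assignment."""
    return (P_X1[x1] * bern(P_X2[x1], x2) * bern(P_X3[x1], x3)
            * bern(P_X4[(x2, x3)], x4) * bern(P_X5[x4], x5))


def sprinkler_given_slippery():
    """P(X3 = on | X5 = true) as a ratio of two sums over the joint."""
    num = den = 0.0
    for x1, x2, x3, x4 in product(P_X1, [True, False], [True, False], [True, False]):
        p = joint(x1, x2, x3, x4, True)
        den += p          # denominator: sum over x1, x2, x3, x4
        if x3:
            num += p      # numerator: same terms restricted to X3 = on
    return num / den


p_sprinkler = sprinkler_given_slippery()
print(round(p_sprinkler, 3))
```

Enumeration is exponential in the number of variables, which is why the propagation algorithms discussed next matter in practice.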
The first algorithms proposed for probabilistic calculations in Bayesian networks used a local distributed
message-passing architecture, typical of many cognitive activities. Initially this approach was limited to tree-
structured networks, but was later extended to general networks in Lauritzen and Spiegelhalter’s (1988)
method of junction tree propagation. A number of other exact methods have been developed and can be
found in recent textbooks.
It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional
logic and hence is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl,
1988), giving gradually improving estimates as sampling proceeds. Unlike junction tree methods, these methods
use local message propagation on the original network structure. Alternatively, variational methods
provide bounds on the true probability.
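To give a feel for approximate inference by sampling, the sketch below uses rejection sampling, which is the simplest Monte Carlo scheme and not the message-propagation sampler the text refers to. It estimates the probability that the sprinkler is on given a slippery pavement. The network parameters are hypothetical, and the seed and sample size are arbitrary choices.

```python
import random


def sample_world(rng):
    """Draw one joint sample in topological order (parents before children)."""
    season = "dry" if rng.random() < 0.5 else "wet"
    rain = rng.random() < (0.1 if season == "dry" else 0.6)
    sprinkler = rng.random() < (0.7 if season == "dry" else 0.1)
    p_wet = {(True, True): 0.99, (True, False): 0.9,
             (False, True): 0.9, (False, False): 0.0}[(rain, sprinkler)]
    wet = rng.random() < p_wet
    slippery = rng.random() < (0.8 if wet else 0.0)
    return sprinkler, slippery


def estimate_sprinkler_given_slippery(n_samples, seed=0):
    """Fraction of evidence-consistent samples in which the sprinkler is on."""
    rng = random.Random(seed)
    accepted = hits = 0
    for _ in range(n_samples):
        sprinkler, slippery = sample_world(rng)
        if slippery:            # reject samples that contradict the evidence
            accepted += 1
            hits += sprinkler
    return hits / accepted


estimate = estimate_sprinkler_given_slippery(20_000)
print(round(estimate, 3))
```

The estimate tightens as the number of samples grows, but rejection sampling wastes every sample that contradicts the evidence, which becomes costly when the evidence is unlikely.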
Causal Reasoning
Most probabilistic models, including general Bayesian networks, describe a distribution over possible
observed events (as in Equation 1) but say nothing about what will happen if a certain intervention
occurs. For example, what if I turn the sprinkler on instead of just observing that it is turned on? What effect
does that have on the season, or on the connection between wetness and slipperiness? A causal network,
intuitively speaking, is a Bayesian network with the added property that the parents of each node are its
direct causes, as in Figure 5. In such a network, the result of an intervention is obvious: the sprinkler node
is set to X3 = on and the causal link between the season X1 and the sprinkler X3 is removed (see Figure 8).
All other causal links and conditional probabilities remain intact, so the new model is
$$P(x_1, x_2, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_4 \mid x_2, X_3 = \text{on})\,P(x_5 \mid x_4).$$
Notice that this differs from observing that X3=on, which would result in a new model that included the
term P(X3=on|x1). This mirrors the difference between seeing and doing: after observing that the sprinkler is
on, we wish to infer that the season is dry, that it probably did not rain, and so on; an arbitrary decision to
turn the sprinkler on should not result in any such beliefs.
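The difference between seeing and doing can be made concrete with a few lines of arithmetic. The sketch below compares the belief about the season (and about rain) after observing X3 = on with the belief after setting X3 = on by intervention. The probabilities are hypothetical, and the season is reduced to two states for brevity.

```python
# Hypothetical CPTs (the source gives no numbers).
P_SEASON = {"dry": 0.5, "wet": 0.5}
P_RAIN = {"dry": 0.1, "wet": 0.6}         # P(rain = True | season)
P_SPRINKLER = {"dry": 0.7, "wet": 0.1}    # P(sprinkler = on | season)


def season_given_sprinkler_on(intervene):
    """Distribution over season after seeing (False) or setting (True) X3 = on."""
    weights = {}
    for season, p_season in P_SEASON.items():
        # Observation keeps the factor P(X3 = on | x1); intervention deletes it.
        factor = 1.0 if intervene else P_SPRINKLER[season]
        weights[season] = p_season * factor
    total = sum(weights.values())
    return {season: w / total for season, w in weights.items()}


def rain_probability(season_dist):
    """P(rain = True) under a given belief about the season."""
    return sum(p * P_RAIN[season] for season, p in season_dist.items())


seen = season_given_sprinkler_on(intervene=False)
done = season_given_sprinkler_on(intervene=True)
print(seen, rain_probability(seen))
print(done, rain_probability(done))
```

With these numbers, seeing the sprinkler on shifts belief strongly toward the dry season and away from rain, while turning it on leaves both at their prior values.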
Causal networks are more properly defined, then, as Bayesian networks in which the correct probability
model after intervening to fix any node’s value is given simply by deleting links from the node’s parents. For
example, Fire → Smoke is a causal network whereas Smoke → Fire is not, even though both networks are
equally capable of representing any joint distribution on the two variables. Causal networks model the envi-
ronment as a collection of stable component mechanisms. These mechanisms may be reconfigured locally by
interventions, with correspondingly local changes in the model. This, in turn, allows causal networks to be
used very naturally for prediction by an agent that is considering various courses of action.
In machine learning approaches, the conditional probabilities P(xi|pai) are typically estimated with the
maximum likelihood approach (observed frequencies in the dataset). In pure Bayesian approaches, models
are designed by expertise and include hyperparameter nodes. Data (usually scarce) is used as pieces of evi-
dence set in the networks for incrementally updating the distributions of the hyperparameters (Bayesian up-
dating).
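For the maximum likelihood case, estimating a conditional probability table amounts to counting. The following sketch shows the idea on a hypothetical toy dataset; the function name and the record format are assumptions made for illustration and do not reflect how BayesiaLab stores data.

```python
from collections import Counter, defaultdict


def estimate_cpt(records, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from observed frequencies."""
    counts = defaultdict(Counter)
    for row in records:
        parent_values = tuple(row[p] for p in parents)
        counts[parent_values][row[child]] += 1
    return {pv: {value: n / sum(c.values()) for value, n in c.items()}
            for pv, c in counts.items()}


# Hypothetical toy dataset.
records = [
    {"rain": True, "sprinkler": False, "wet": True},
    {"rain": True, "sprinkler": False, "wet": True},
    {"rain": False, "sprinkler": True, "wet": True},
    {"rain": False, "sprinkler": True, "wet": False},
    {"rain": False, "sprinkler": False, "wet": False},
    {"rain": False, "sprinkler": False, "wet": False},
]
cpt = estimate_cpt(records, child="wet", parents=["rain", "sprinkler"])
print(cpt)
```

Parent combinations that never occur in the data get no estimate at all, and rare combinations get extreme ones. This is one motivation for the Bayesian updating approach mentioned above when data is scarce.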
It is also possible to machine learn the structure of a Bayesian network, and two families of methods are
available for that purpose. The first, constraint-based algorithms, is based on the probabilistic
semantics of Bayesian networks. Links are added or deleted according to the results of statistical tests, which
identify marginal and conditional independencies. The second, score-based algorithms, is
based on a metric measuring the quality of candidate networks with respect to the observed data. This
metric trades off network complexity against degree of fit to the data, typically expressed as the likelihood of
the data given the network.
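One common metric of this kind is the Bayesian Information Criterion (BIC), used here only as a stand-in; the source does not say which score any particular tool uses. The sketch below scores two candidate structures over two binary variables on hypothetical data. Cardinalities are taken from the observed states, which is a simplifying assumption.

```python
import math
from collections import Counter


def bic_score(records, structure):
    """Log-likelihood of the data minus a complexity penalty (higher is better)."""
    n = len(records)
    # Number of distinct observed states per variable.
    cardinality = {v: len({row[v] for row in records}) for v in structure}
    score = 0.0
    for child, parents in structure.items():
        joint_counts = Counter((tuple(row[p] for p in parents), row[child]) for row in records)
        parent_counts = Counter(tuple(row[p] for p in parents) for row in records)
        # Fit term: maximum-likelihood log-likelihood of this node given its parents.
        score += sum(c * math.log(c / parent_counts[pv]) for (pv, _), c in joint_counts.items())
        # Penalty term: free parameters in this node's conditional probability table.
        free_params = (cardinality[child] - 1) * math.prod(cardinality[p] for p in parents)
        score -= 0.5 * math.log(n) * free_params
    return score


def make_records(counts):
    """Expand {(a, b): count} into a list of records."""
    return [{"A": a, "B": b} for (a, b), k in counts.items() for _ in range(k)]


# Hypothetical datasets: one with a clear A-B dependency, one without.
dependent = make_records({(0, 0): 40, (1, 1): 40, (0, 1): 10, (1, 0): 10})
unrelated = make_records({(0, 0): 25, (0, 1): 25, (1, 0): 25, (1, 1): 25})

no_arc = {"A": [], "B": []}
with_arc = {"A": [], "B": ["A"]}
print(bic_score(dependent, no_arc), bic_score(dependent, with_arc))
print(bic_score(unrelated, no_arc), bic_score(unrelated, with_arc))
```

On the dependent data the arc pays for its extra parameters through a better fit; on the unrelated data the fit is unchanged, so the penalty favors the simpler structure.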
As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior
knowledge in network form, e.g. by fixing portions of the structure or defining forbidden arcs.
Causal Discovery
One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to
discover causal structures in raw statistical data, a task previously considered impossible without
controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three
events: A and B are dependent, B and C are dependent, yet A and C are independent. If you ask a person to
supply an example of three such events, the example would invariably portray A and C as two independent
causes and B as their common effect, namely, A → B ← C. (For instance, A and C could be the outcomes of
two fair coins and B could represent a bell that rings whenever either coin comes up heads.)
Figure 9: Causal model for variables A, C and B, representing two fair coins and a bell respectively
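The dependence pattern of the coin-and-bell example can be verified exactly by enumerating the four equally likely outcomes. The sketch below uses exact fractions; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Two fair coins A and C; bell B rings whenever either coin comes up heads.
# Each world is a tuple (a, b, c) with 1 = heads / rings.
worlds = {}
for a, c in product([0, 1], repeat=2):
    worlds[(a, int(a or c), c)] = Fraction(1, 4)


def always(a, b, c):
    return True


def heads_a(a, b, c):
    return a == 1


def rings(a, b, c):
    return b == 1


def heads_c(a, b, c):
    return c == 1


def prob(predicate):
    """Total probability of the worlds that satisfy the predicate."""
    return sum(p for world, p in worlds.items() if predicate(*world))


def independent(event_x, event_y, given=always):
    """Check P(x, y | given) == P(x | given) * P(y | given) exactly."""
    p_given = prob(given)
    p_xy = prob(lambda a, b, c: event_x(a, b, c) and event_y(a, b, c) and given(a, b, c)) / p_given
    p_x = prob(lambda a, b, c: event_x(a, b, c) and given(a, b, c)) / p_given
    p_y = prob(lambda a, b, c: event_y(a, b, c) and given(a, b, c)) / p_given
    return p_xy == p_x * p_y


print(independent(heads_a, rings))                 # A and B
print(independent(rings, heads_c))                 # B and C
print(independent(heads_a, heads_c))               # A and C
print(independent(heads_a, heads_c, given=rings))  # A and C once the bell is known to ring
```

The last check shows the signature of a common effect: the two coins are independent on their own but become dependent once the bell is observed.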
Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is
mathematically feasible but very unnatural, because it must entail fine tuning of the probabilities involved;
the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change.
Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal
information, are conceptually characteristic of certain causal directionalities and not others. When put to-
gether systematically, such patterns can be used to infer causal structures from raw data and to guarantee
that any alternative structure compatible with the data must be less stable than the one(s) inferred; namely
slight fluctuations in parameters will render that structure incompatible with the data.
BayesiaLab 5.2
BayesiaLab is a powerful desktop application (Windows/Mac/Unix) with a highly sophisticated graphical
user interface, which provides scientists with a comprehensive “lab” environment for machine learning,
knowledge modeling, diagnosis, analysis, simulation, and optimization. With BayesiaLab, Bayesian networks have
become a powerful and practical tool to gain deep understanding of high-dimensional domains. It leverages
the inherently graphical structure of Bayesian networks for exploring and explaining complex problems.
BayesiaLab is the result of nearly twenty years of software development by Dr. Lionel Jouffe and Dr. Paul
Munteanu. In 2001, their research efforts led to the formation of Bayesia S.A.S., headquartered in Laval in
northwestern France. Today, the company has grown to become the leading supplier of Bayesian network-
related technologies for hundreds of major corporations around the world.
[Figure: BayesiaLab application areas arranged around the Bayesian network: Knowledge Modeling, Knowledge
Discovery, Expert Knowledge, Data, Analytics, Simulation, Decision Support, Risk Management, Diagnosis,
Optimization.]
BayesiaLab in Context
In Chapter 1, we presented — at a conceptual level — that Bayesian networks can cover the entire map of
analytics. Figure 12 shows what this means in practice for the researcher. BayesiaLab’s functions, repre-
sented as blue boxes, are positioned across this map, and they demonstrate the universal applicability of the
Bayesian network paradigm.
[Figure 12: BayesiaLab functions positioned on the map of analytic modeling. Axes: Model Source and
Modeling Purpose (Association/Correlation to Causation). Quadrants Q1 to Q4. Functions shown: Supervised
Learning, Unsupervised Structural Learning, Data Clustering, Variable Clustering, Parameter Learning,
Probabilistic Structural Equation Models, Total & Direct Effects Analysis, Target Optimization, Bayesian
Updating.]
This mindset has a critical flaw, which is that causal relationships remain difficult to machine-learn
from data. Rather, causal reasoning typically requires some form of assumptions, i.e. assumptions coming
from human knowledge.
Users can directly encode their explicit knowledge by drawing a network in the graph panel. In addition,
the Bayesia Expert Knowledge Elicitation Environment (BEKEE) is available as an extension to
BayesiaLab. It allows the systematic elicitation of both explicit and tacit knowledge from a group of experts
during brainstorming sessions.
Maintaining uncertainty during inference automatically prevents presenting potentially misleading point
estimates.
tion theory (e.g. the Minimum Description Length). With that, no assumptions of linearity are made at any
point.
Unsupervised Structural Learning means that BayesiaLab can discover probabilistic relationships between a
large number of variables, without the need to define inputs or outputs. One might say that this is the
quintessential form of knowledge discovery, as no assumptions whatsoever are required to perform these
algorithms on unknown datasets.4
4 However, the analyst can still use any available domain knowledge to define structural constraints.
A third variation of this concept is of particular importance in BayesiaLab: the semi-automatic Multiple
Clustering workflow can be described as a kind of nonlinear, nonparametric and nonorthogonal factor
analysis.
In practice, Multiple Clustering often serves as the basis for developing Probabilistic Structural Equation
Models (Quadrant 3/4) with BayesiaLab.
BayesiaLab offers a range of sophisticated methods for missing values processing from which the analyst
can choose. During network learning, BayesiaLab performs missing values processing automatically “behind
the scenes”. More specifically, the Structural Expectation-Maximization algorithm and the Dynamic Com-
pletion algorithm are automatically applied after each modification of the network during learning, i.e. after
every single arc addition, suppression and inversion.
BayesiaLab offers a considerable number of functions relating to inference. For instance, inference can be
performed by setting evidence, i.e. clicking on any one of the Monitors, and results are returned instantly
for all the other Monitors.
Batch Inference is available when inference needs to be computed for a large number of records. For in-
stance, this can be used for applying a predictive score for all customers in a database.
The Adaptive Questionnaire function provides guidance in terms of the optimum sequence for seeking
evidence. With every piece of evidence set, BayesiaLab determines which is the next best piece of evidence to
obtain for a maximum information gain with respect to the target variable. In a medical context, this allows
diagnostic procedures to be optimally “escalated”, from “low-cost & small-gain” evidence (e.g. measuring the
patient’s blood pressure) to “high-cost & large-gain” evidence (e.g. performing an MRI scan).
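The notion of "next best piece of evidence" can be illustrated with expected information gain, i.e. the mutual information between the target and a candidate observation. The sketch below is a generic illustration under hypothetical numbers, not a description of BayesiaLab's implementation, and it ignores the cost of obtaining each piece of evidence.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def expected_information_gain(joint):
    """Mutual information I(target; test) from joint[target_state][test_outcome]."""
    target_marginal = [sum(row) for row in joint]
    outcome_marginal = [sum(col) for col in zip(*joint)]
    gain = entropy(target_marginal)
    for j, p_outcome in enumerate(outcome_marginal):
        if p_outcome > 0:
            # Posterior over the target after seeing outcome j.
            posterior = [row[j] / p_outcome for row in joint]
            gain -= p_outcome * entropy(posterior)
    return gain


# Hypothetical joint tables P(disease, test outcome).
# Rows: disease present / absent. Columns: test positive / negative.
candidates = {
    "blood_pressure": [[0.06, 0.04], [0.36, 0.54]],
    "mri_scan":       [[0.09, 0.01], [0.09, 0.81]],
}
gains = {name: expected_information_gain(table) for name, table in candidates.items()}
next_test = max(gains, key=gains.get)
print(gains, next_test)
```

A cost-aware variant would weigh each gain against the cost of the observation, which is what produces the escalation from cheap to expensive procedures described above.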
All of the above questions can be answered, if the domain is fully understood, which is a priori never the
case. However, if we are able to build an adequate model of the domain that captures all of its dynamics,
BayesiaLab will be able to extract the effects.
BayesiaLab employs simulation to derive effects, as parameters per se do not exist in this nonparametric
framework. As all the dynamics of the domain are encoded in discrete conditional probability tables, effect
sizes only manifest themselves when different conditions are simulated.
Total Effects Analysis, Target Mean Analysis and many more of BayesiaLab’s functions offer the analyst
ways to study effects, especially nonlinear and interactive effects.
Optimization (Quadrant 4)
The ability to perform inference across all possible states of all nodes of the network also facilitates
searching for optimum values. BayesiaLab’s Target Dynamic Profile and Target Optimization provide the
toolsets for this purpose.
Summary
BayesiaLab has consequently implemented (and expanded upon) the theory of Bayesian networks, which
was first introduced by Judea Pearl in the 1980s. In a remarkably short time, BayesiaLab has translated
cutting-edge theoretical research into highly relevant and practical tools for applied scientists. BayesiaLab
has opened up entirely new avenues for exploring, understanding and explaining complex problem domains.
References
Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
Barnard, G. A, and T. Bayes. “Studies in the History of Probability and Statistics: IX. Thomas Bayes’s Essay
Towards Solving a Problem in the Doctrine of Chances.” Biometrika 45, no. 3 (1958): 293–315.
Breiman, Leo. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).”
Statistical Science 16, no. 3 (2001): 199–231.
Darwiche, Adnan. “Bayesian Networks.” Communications of the ACM 53, no. 12 (December 2010): 80.
doi:10.1145/1859204.1859227.
———. Modeling and Reasoning with Bayesian Networks. 1st ed. Cambridge University Press, 2009.
Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. 1st ed. The
MIT Press, 2009.
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.
Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November
2000. http://bayes.cs.ucla.edu/csl_papers.html.
Russell, Stuart. “Judea Pearl - A.M. Turing Award Winner.” Accessed August 31, 2013.
http://amturing.acm.org/award_winners/pearl_2658896.cfm.
Shmueli, Galit. “To Explain or to Predict?” Statistical Science 25, no. 3 (August 2010): 289–310.
doi:10.1214/10-STS330.
Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. 2nd ed. The MIT
Press, 2001.
Contact Information
Bayesia USA
312 Hamlet’s End Way
Franklin, TN 37067
USA
Phone: +1 888-386-8383
info@bayesia.us
www.bayesia.us
Bayesia S.A.S.
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
Phone: +33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com
Copyright
© 2013 Bayesia USA, Bayesia S.A.S. and Bayesia Singapore. All rights reserved.