
Lecture 4: Bayesian Networks

© Daniel Frances 2016

Contents

1 Introduction

2 Very Simple Bayes Network

3 Dependence Explodes Elicitation Process

4 New concept of Relevance and Irrelevance

5 Bayesian Networks

5.1 As a particular order of random variables

5.2 As a record of relevancy relationships

5.3 Bayesian Networks represented by DAGs

5.3.1 Fraud Detection Example

5.3.2 Printer Self-Detection

5.4 Checking a BN with the d-separation theorem

5.4.1 Continuing with the Credit Card example

5.4.2 A Bidding example

5.5 Solving Bayesian Networks


1 Introduction

We now have all the conceptual tools to carry out a Decision Analysis for any problem
for which the uncertainty can be captured through a set of univariate discrete probability
distributions.

There are however two practical problems that prevent the application to problems with
many random variables and decisions.

• Decision Trees become too “bushy” to be manageable - we need better graphic tools
for expressing our problem more succinctly.

• Dependencies between variables cause exponential growth in the number of probabilities
that need to be elicited and processed.

There have been related developments in the Computer Science community, namely Bayesian
or Belief Networks, which deal with each of these problems.

Unfortunately, they do not allow for decision variables. Once decision variables are added
to the Bayes Net, it becomes known as an Influence Diagram. Influence Diagrams are
a much more succinct format for expressing Decision Analysis problems. While they can be
converted to and solved as Decision Trees, algorithms developed by Ross Shachter in the 1980s
can solve Influence Diagrams directly without recourse to Decision Trees.

But first some relevant foundation to Influence Diagrams via Bayesian Networks.


2 Very Simple Bayes Network

Recall the equivalent decision trees below

Figure 1: Decision Tree Format

The two Bayes Networks below are equivalent representations of these decision trees.

Historical Form (Disease → Test):

Dis  P(Dis)        Dis  Test  P(Test|Dis)
D    0.01          D    +     0.9
N    0.99          D    −     0.1
                   N    +     0.2
                   N    −     0.8

Rollback Form (Test → Disease):

Test  P(Test)      Test  Dis  P(Dis|Test)
+     0.207        +     D    0.04348
−     0.793        +     N    0.95652
                   −     D    0.00126
                   −     N    0.99874

Thus flipping the tree is performed simply by reversing the arrow connecting the two events.
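As a quick check of the flip, here is a minimal sketch (plain Python, written for this example only) that recomputes the Rollback Form from the Historical Form via Bayes' rule:

    # Historical Form: p(Dis) and p(Test|Dis)
    p_dis = {"D": 0.01, "N": 0.99}
    p_test_given_dis = {("D", "+"): 0.9, ("D", "-"): 0.1,
                        ("N", "+"): 0.2, ("N", "-"): 0.8}

    # Rollback Form: p(Test) by total probability, p(Dis|Test) by Bayes' rule
    p_test = {t: sum(p_dis[d] * p_test_given_dis[(d, t)] for d in p_dis)
              for t in ("+", "-")}
    p_dis_given_test = {(t, d): p_dis[d] * p_test_given_dis[(d, t)] / p_test[t]
                        for t in ("+", "-") for d in p_dis}

    print(p_test)            # {'+': 0.207, '-': 0.793}
    print(p_dis_given_test)  # ('+', 'D'): 0.04348..., ('-', 'D'): 0.00126...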


3 Dependence Explodes Elicitation Process

While the format is simple, a Bayesian Network with many events, and many links
between events, can become numerically very costly.

In the last class we already dealt with one situation where we could see an explosion in the
number of probabilities that need to be elicited. We were dealing with possible dependencies
between information variables. For example,

• what if the conditional probability of a second positive scan was dependent on the
outcome of the first scan, so that p(+|F, −) ≠ p(+|F, +)? In that case we can no longer
use 90% for the probability of a positive second scan if the plate is faulty; it would also
depend on the outcome of the first scan. We tacitly assumed a Naive Bayes model and
kept the problem simple.

• when it comes to medical tests/symptoms we usually work with the sensitivity and
specificity of a test/symptom, ignoring any dependence that may exist between the
conditional probabilities of the test outcomes. Thus given a set of 4 symptoms/tests
each with a +/- outcome, conditional to having a particular disease θ, we have

p(y1 , y2 , y3 , y4 |θ) = p(y1 |θ)p(y2 |θ, y1 )p(y3 |θ, y1 , y2 )p(y4 |θ, y1 , y2 , y3 )

and require 1+2+4+8 = 15 assessments whereas the Naive Bayes assumption allowed
us to simplify the expression to

p(y1 , y2 , y3 , y4 |θ) = p(y1 |θ)p(y2 |θ)p(y3 |θ)p(y4 |θ)

which only requires 1 + 1 + 1 + 1 = 4 assessments.

• Suppose we have two random variables x and y that affect some performance measure.
Each variable can take on 10 different values. If they are independent then p(x, y) =
p(x)p(y) and we will need to elicit a total of 20 probability values. However if they
are not independent then p(x, y) = p(x)p(y|x) and we will need to elicit 10 probability
values for p(x), and 10 values of p(y|x) for each of the 10 x values, for a total of 110 values,
an 11-fold increase. If we were to have 3 variables and p(x, y, z) = p(x)p(y|x)p(z|x, y)
then we would need to elicit 10 values for p(x), 100 for p(y|x), and 1000 values for
p(z|x, y), an increase from 30 to 1110, or a 37-fold increase (see the sketch after this list).
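A minimal sketch of this count, assuming every variable takes k values and, as in the text, k numbers are elicited per distribution (normalization ignored):

    # Probabilities to elicit for the fully dependent chain
    # p(x1)p(x2|x1)...p(xn|x1,...,x(n-1)) versus the fully independent model
    def chain_assessments(n, k):
        return sum(k ** i for i in range(1, n + 1))  # k + k^2 + ... + k^n

    def independent_assessments(n, k):
        return n * k

    print(independent_assessments(2, 10), chain_assessments(2, 10))  # 20 110
    print(independent_assessments(3, 10), chain_assessments(3, 10))  # 30 1110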

This phenomenon also exists when random variables are dependent on decision variables. We
then have to evaluate a distribution for each value of the decision variable. Thus any tools
that help identify key dependencies, and irrelevance between variables, would allow us to
deal with more complex problems.


4 New concept of Relevance and Irrelevance

To encourage the DM to make belief statements that will disentangle the problem, and keep
apart variables that really don't have much to do with each other, we introduce a binary
notion of irrelevance, which expresses that knowledge of one variable is irrelevant to knowledge
of another variable. Or more formally:

Definition 4.1. A DM believes that a measurement X is irrelevant for predicting Y,
given a measurement Z, if the DM believes that X will provide no further information
for predicting Y, over and above that provided by Z. This is denoted as Y ⊥ X | Z.

Definition 4.2. The symmetry property requires that

X ⊥ Y | Z ⇔ Y ⊥ X | Z

Relating back to the established concept of independence, consider p(X, Y|Z) =
p(X|Z)p(Y|X, Z) = p(Y|Z)p(X|Y, Z) by Bayes' Rule. Since X ⊥ Y | Z then p(X|Y, Z) = p(X|Z),
so that now p(X, Y|Z) = p(X|Z)p(Y|Z) and Y and X are independent given Z.
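A toy numeric check of this argument, with made-up numbers: construct p(X, Y, Z) so that X ⊥ Y | Z holds by construction, then verify from the joint that p(X|Y, Z) = p(X|Z).

    # Made-up conditional distributions with X ⊥ Y | Z built in
    p_z = {0: 0.3, 1: 0.7}
    p_x_z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # p(x|z)
    p_y_z = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # p(y|z)
    joint = {(x, y, z): p_z[z] * p_x_z[(x, z)] * p_y_z[(y, z)]
             for x in (0, 1) for y in (0, 1) for z in (0, 1)}

    # p(x|y,z) recovered from the joint equals p(x|z) for every (y, z)
    for y in (0, 1):
        for z in (0, 1):
            norm = sum(joint[(x, y, z)] for x in (0, 1))
            assert all(abs(joint[(x, y, z)] / norm - p_x_z[(x, z)]) < 1e-12
                       for x in (0, 1))
    print("p(x|y,z) = p(x|z) holds for all y, z")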

5 Bayesian Networks

5.1 As a particular order of random variables

The overall objective of Bayesian Inference is to develop the most accurate joint posterior
distribution p(x̂) of inter-related variables.

Suppose we have two variables A, B; we know from basic probability that

P(A, B) = P(A|B)P(B) = P(B|A)P(A)

The choice between the last two depends on whether we place B before A or vice-versa. The
importance of ordering also extends to an arbitrary number of random variables. If we take
random variables in the order we index them, then by definition

p(x1, . . . , xn) = p1(x1)p2(x2|x1)p3(x3|x1, x2) . . . pn(xn|x1, . . . , xn−1)
But it is important to realize this is only one possible ordering or factorization. A Bayesian
Network will document one particular ordering or factorization, but it supports transformations
to alternate orderings. This point is fundamental to the Arc Reversals we noted above
and is further covered under Influence Diagrams.
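A minimal sketch of this point, using an arbitrary made-up joint over three binary variables: every one of the 3! orderings factorizes the same joint via the chain rule.

    from itertools import permutations, product

    # An arbitrary joint over three binary variables (made-up weights)
    weights = [3, 1, 2, 2, 1, 4, 5, 2]
    outcomes = list(product((0, 1), repeat=3))
    joint = {xyz: w / sum(weights) for xyz, w in zip(outcomes, weights)}

    def marg(assign):
        # probability of a partial assignment {index: value}; marg({}) = 1
        return sum(p for xyz, p in joint.items()
                   if all(xyz[i] == v for i, v in assign.items()))

    for order in permutations(range(3)):
        for xyz in joint:  # chain rule p(x1)p(x2|x1)p(x3|x1,x2) in this order
            p, given = 1.0, {}
            for i in order:
                p *= marg({**given, i: xyz[i]}) / marg(given)
                given[i] = xyz[i]
            assert abs(p - joint[xyz]) < 1e-12
    print("all 6 orderings reproduce the same joint")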


5.2 As a record of relevancy relationships

As we mentioned before, if we insist that all random variables are relevant, then a full model
will be difficult to elicit or compute. We know that in the extreme we might be tempted to
use a Naive Bayes model and believe that all variables are unrelated, so that p(x̂) = ∏_{i=1}^n pi(xi),
but this position would rarely be defensible. So the question is to find a factorization which
is both credible and solvable. Hence the use of the synonym belief networks.

Let’s zoom into one of the factors pi(xi|x1, . . . , xi−1). Often many of the “given” xj, j =
1, . . . , i − 1 will be irrelevant to xi, i.e. given some key xj the other xj do not provide any
additional information which would be useful in predicting xi. Let’s separate, for each i, the
set of indexes of variables which are relevant - Qi - from those that are irrelevant - Ri - to xi.
For reasons that will become clear we call Qi the parent set of xi, and Ri the remainder
set of xi:

Xi ⊥ XRi | XQi,  1 ≤ i ≤ n

For three variables x1, x2, x3 with x2 ⊥ x3 | x1, then

p(x1, x2, x3) = p(x1)p(x2|x1)p(x3|x1, x2) = p(x1)p(x2|x1)p(x3|x1)

which then generalizes to


p(x̂) = ∏_{i=1}^n pi(xi | x̂Qi)
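A minimal sketch of this factorization as code, with made-up numbers following the three-variable example above (x2 ⊥ x3 | x1). A network is a dict mapping each variable to its parent tuple Qi and a CPT keyed by (value, parent values):

    bn = {
        "x1": ((), {(0, ()): 0.6, (1, ()): 0.4}),
        "x2": (("x1",), {(0, (0,)): 0.7, (1, (0,)): 0.3,
                         (0, (1,)): 0.2, (1, (1,)): 0.8}),
        "x3": (("x1",), {(0, (0,)): 0.5, (1, (0,)): 0.5,
                         (0, (1,)): 0.9, (1, (1,)): 0.1}),
    }

    def joint(bn, assignment):
        # multiply pi(xi | parents) over all vertices, per the formula above
        p = 1.0
        for var, (parents, cpt) in bn.items():
            p *= cpt[(assignment[var], tuple(assignment[q] for q in parents))]
        return p

    print(joint(bn, {"x1": 0, "x2": 1, "x3": 0}))  # 0.6 * 0.3 * 0.5 ≈ 0.09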

5.3 Bayesian Networks represented by DAGs

Definition 5.1. A directed acyclic graph (DAG) G is a directed graph - of vertices and
directed edges - with no cycles.

Definition 5.2. A Bayesian Network (BN) on a set of measurements {X1, . . . , Xn} can
be represented by a DAG G, where

• the vertices are the measurements {X1, . . . , Xn}

• there is a directed edge from Xi to Xj only if i ∈ Qj


5.3.1 Fraud Detection Example

Before we proceed any further we need another simple example, taken from a Microsoft
Tutorial on BNs. Suppose that a credit card company is trying to make inferences about
the likelihood that a particular transaction is fraudulent. They have isolated a number of
variables:

• Fraud ∈ {T, F } - whether the purchaser is fraudulent or not.


• Jewelry ∈ {T, F } - whether purchase of jewelry was involved.
• Gas ∈ {T, F } - whether the card was also used for gas purchase the same day
• Age ∈ {< 30, 30 − 50, > 50} of the card-holder
• Sex ∈ {M, F } of the card-holder

Next the DM would visit each of the nodes in any order to identify the parent set of each
node.

• Fraud. The DM did not believe that the age and sex of the card-holder were relevant,
nor whether it was a Gas or Jewelry transaction.
• Gas. The DM believed that Fraud may be a predictor of Gas, but not the card-holder's
Age or Sex, nor whether it was a Jewelry transaction.
• Age. No relevant factors.
• Sex. No relevant factors.
• Jewelry. The DM believed that the Age and Sex of the card-holder were relevant if
the transaction was not fraudulent. Also, Fraud itself would be a predictor of a Jewelry
transaction.

The DM then adds, into each node, a directed arc from each variable considered relevant -
the node's parents. The result is Figure 3.

Figure 3: Bayesian Network for Fraud Detection (vertices: Fraud, Age, Sex, Gas, Jewelry)

The mathematical strength of this depiction is that it provides an unambiguous set of
irrelevancy statements, as a starting point for Bayesian Inference.
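As a minimal sketch, Figure 3 can be written down as data: each vertex mapped to its parent set Qi, exactly as elicited above, with the edges then following Definition 5.2.

    parents = {
        "Fraud": set(),            # no relevant predictors
        "Age": set(),              # no relevant factors
        "Sex": set(),              # no relevant factors
        "Gas": {"Fraud"},          # Fraud may predict Gas
        "Jewelry": {"Fraud", "Age", "Sex"},
    }

    # directed edges of the DAG: Xi -> Xj only if i ∈ Qj
    edges = [(p, child) for child, qs in parents.items() for p in qs]
    print(sorted(edges))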


5.3.2 Printer Self-Detection

Figure 4 appeared in Heckerman (1995), a BN to troubleshoot printing problems.

Figure 4: Bayesian Network for Troubleshooting Printer Problems

5.4 Checking a BN with the d-separation theorem

To readily deduce all implied irrelevancy statements directly from the graph we can use the
d-separation theorem. But first some definitions:

Definition 5.3. An ancestor of vertex X is a vertex Y such that there is a directed
path from Y to X.

Definition 5.4. An ancestral set of a set of vertices X - denoted as A(X) - is the set
of X plus all of their ancestors.

Definition 5.5. An ancestral graph of an ancestral set - denoted by G(A(X)) - is a
graph consisting of all vertices in the ancestral set of X and directed edges between all
parents and children.


Definition 5.6. A moralized graph G^M is a graph that results from “marrying” all
parents of a common child by connecting them with an undirected edge.

Definition 5.7. The skeleton of a graph G is the same graph but with all directed edges
replaced by undirected edges.

Definition 5.8. Considering 3 sets of vertices XA, XB, XC: XB separates XA from XC
if all paths from a vertex in XA to a vertex in XC pass through a vertex in XB.

We can now state the d-separation theorem.

Theorem 5.1. If XB separates XA from XC in the skeleton of the moralized graph of
the ancestral graph of the union of disjoint sets XA, XB, XC, then XA ⊥ XC | XB.
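Theorem 5.1 can be checked mechanically. Below is a minimal sketch (not the lecture's code; a DAG is a dict mapping each vertex to the set of its parents) that follows the theorem step by step:

    from itertools import combinations

    def ancestors(dag, nodes):
        # the ancestral set A(nodes): the nodes plus all their ancestors
        result, frontier = set(nodes), set(nodes)
        while frontier:
            frontier = set().union(*(dag[v] for v in frontier)) - result
            result |= frontier
        return result

    def d_separated(dag, A, B, C):
        # Theorem 5.1: (1) ancestral graph of A ∪ B ∪ C, (2) moralize,
        # (3) take the skeleton, (4) does every A-to-C path hit B?
        anc = ancestors(dag, set(A) | set(B) | set(C))
        nbrs = {v: set() for v in anc}
        for child in anc:
            pars = dag[child] & anc
            for p in pars:                          # undirected parent-child edges
                nbrs[p].add(child)
                nbrs[child].add(p)
            for p, q in combinations(pars, 2):      # "marry" the co-parents
                nbrs[p].add(q)
                nbrs[q].add(p)
        seen, stack = set(A), list(A)               # walk from A, blocked at B
        while stack:
            for w in nbrs[stack.pop()] - seen:
                seen.add(w)
                if w not in B:
                    stack.append(w)
        return seen.isdisjoint(C)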

5.4.1 Continuing with the Credit Card example

Figure 5 shows the skeleton of the moralized graph for the Fraud example.

Figure 5: Skeleton of Moralized Graph of Ancestral Graph for Fraud Detection (vertices:
Fraud, Age, Sex, Gas, Jewelry)

Thus Fraud separates Gas from the Age, Sex and Jewelry vertices, so that once the DM
knows whether the purchaser is fraudulent or not,

• knowing the card was used to purchase Gas does not help to predict the Age, Sex of
the card-holder, nor whether the card was also used within 24 hrs to purchase Jewelry.

• knowing the Age, Sex of the card-holder, or whether the card was used to purchase
Jewelry, does not help to predict whether the card was also used within 24 hrs to purchase Gas.
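Using the sketch from Section 5.4, this separation statement can be checked directly (True means separated):

    fraud_dag = {"Fraud": set(), "Age": set(), "Sex": set(),
                 "Gas": {"Fraud"}, "Jewelry": {"Fraud", "Age", "Sex"}}
    print(d_separated(fraud_dag, {"Gas"}, {"Fraud"},
                      {"Age", "Sex", "Jewelry"}))  # True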


To test these hidden irrelevancies, the analyst may want to probe them by asking questions
such as

• Suppose you know the purchaser was not fraudulent

– Are there any scenarios for which knowing the card was used to purchase Gas
is going to help you predict whether the card will be used to purchase Jewelry
within the same 24-hr period?
– Are there any scenarios for which knowing the Age and Sex of the card-holder will
help to predict if the card will be used to buy gas?

• Suppose you know the purchaser was fraudulent

– Are there any scenarios for which knowing the card was used to purchase Gas
is going to help you predict whether the card will be used to purchase Jewelry
within the same 24-hr period?
– Are there any scenarios for which knowing the card was used to purchase Jewelry
is going to help you predict whether the card will be used to purchase Gas within
the same 24-hr period?

5.4.2 A Bidding example

A company needs to estimate now the cost of projects they may be tendering for in the
future, and may actually contract to complete. Early in the process they produce a ballpark
(B) estimate to help determine if they should proceed. If it looks more likely they
will proceed, they ask for an expert (E) estimate. If they are almost ready to proceed they
put the project out for tender, and generally the lowest tender (T) then becomes the best
estimate. Of course it is not until the project is completed that they will know how the cost
actually turned out (O).

Since the DM uses each cost to estimate the next, the BN is presented as Figure 6.

Figure 6: Bayesian Network for Project Cost

We can now use d-separation to challenge the DM with questions such as

• Are there any scenarios for which the Ballpark cost might be a better predictor of the
Tender cost than the Expert Cost?

• Are there any scenarios for which the Ballpark or the Expert cost might be a better
predictor of the actual O-cost, than the Tender cost?


And suppose that the second question made the DM realize that there are times when the
T-costs come in much below the E-cost, and the O-costs end up much closer to the Expert
cost than to the Tender: namely, when there is not enough work on the market to keep the
workers busy. Contractors will then tender low in the hope of getting the job, and then find
ways to increase the O-cost to make up the loss.

Thus there is another variable that enters the picture. Let's designate it as I - an index for
the amount of work available in the market. With this, the new BN is shown in Figure 7.
Figure 8 then shows the moralized BN and its skeleton. We could then again

Figure 7: Revised Bayesian Network for Project Cost

challenge the DM based on B ⊥ (T, I, O) | E, by asking for example: Are there any scenarios
for which the Ballpark estimate might be able to better predict the Tender or O-cost?

Figure 8: Skeleton of Moralized Graph of Ancestral Graph for Project Cost
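Reusing the Section 5.4 sketch on the revised network (assuming, since Figure 7 is not reproduced here, the arcs B→E, E→T, I→T, T→O, I→O):

    cost_dag = {"B": set(), "E": {"B"}, "I": set(),
                "T": {"E", "I"}, "O": {"T", "I"}}
    print(d_separated(cost_dag, {"B"}, {"E"}, {"T", "I", "O"}))  # True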


5.5 Solving Bayesian Networks

Once the DM, analyst and auditor agree that the BN in Figure 3 is valid, the next step
is to elicit the relevant prior probabilities for each of the vertices, as shown in Figure 9.

Figure 9: Fraud Bayesian Network with Prior Distributions

Fraud:   p(f=yes) = 0.00001
Age:     p(a=<30) = 0.25, p(a=30-50) = 0.40
Sex:     p(s=male) = 0.5

Gas:     p(g=yes|f=yes) = 0.2
         p(g=yes|f=no)  = 0.01

Jewelry: p(j=yes|f=yes,a=*,s=*) = 0.05
         p(j=yes|f=no,a=<30,s=male)     = 0.0001
         p(j=yes|f=no,a=30-50,s=male)   = 0.0004
         p(j=yes|f=no,a=>50,s=male)     = 0.0002
         p(j=yes|f=no,a=<30,s=female)   = 0.0005
         p(j=yes|f=no,a=30-50,s=female) = 0.002
         p(j=yes|f=no,a=>50,s=female)   = 0.001

With these prior distributions, we can then compute any desired posterior probability, for
example the posterior probability that a transaction involving gas (g), jewelry (j), and a young
(Y) male (M) is fraudulent (f). The notation has been changed slightly from the source (and
the diagram) to clarify the calculations. Note, it is common practice to split any grouped
probabilities equally among all possibilities, hence the (1/6) factor below. Note also that the
dependencies in the expressions follow the Figure 3 and 9 DAGs (writing ¬f for f = no).

p(f|g, j, Y, M) = p(g, j, Y, M|f)p(f) / [p(g, j, Y, M|f)p(f) + p(g, j, Y, M|¬f)p(¬f)]

= p(g|f)p(j|f, Y, M)p(Y)p(M)p(f) / [p(g|f)p(j|f, Y, M)p(Y)p(M)p(f) + p(g|¬f)p(j|¬f, Y, M)p(Y)p(M)p(¬f)]

= p(g|f)(1/6)p(j|f, ∗, ∗)p(f) / [p(g|f)(1/6)p(j|f, ∗, ∗)p(f) + p(g|¬f)p(j|¬f, Y, M)p(¬f)]

= 0.2 ∗ (1/6) ∗ 0.05 ∗ 0.00001 / [0.2 ∗ (1/6) ∗ 0.05 ∗ 0.00001 + 0.01 ∗ 0.00001 ∗ 0.99999]

= 1.67 × 10^-8 / (1.67 × 10^-8 + 10 × 10^-8) = 1.67/11.67 = 0.14
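A one-line check of the arithmetic above, mirroring the lecture's numbers exactly (a sketch of this single calculation, not general-purpose BN software):

    num = 0.2 * (1/6) * 0.05 * 0.00001   # p(g|f)(1/6)p(j|f,*,*)p(f)
    den = num + 0.01 * 0.00001 * 0.99999 # + p(g|¬f)p(j|¬f,Y,M)p(¬f)
    print(num / den)                     # ≈ 0.14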
For realistic problems exact computations are often NP-hard and approximations are required,
which are the subject of ongoing research. But software has been developed for large-scale
Bayesian Networks, which efficiently stores all the required prior distributions on the
network, and efficiently computes desired probabilities.
