Contents
1 Introduction 2
5 Bayesian Networks 5
1 Introduction
We now have all the conceptual tools to carry out a Decision Analysis for any problem
for which the uncertainty can be captured through a set of univariate discrete probability
distributions.
There are, however, two practical problems that prevent its application to problems with
many random variables and decisions.
• Decision Trees become too “bushy” to be manageable - we need better graphical tools
for expressing our problem more succinctly.
• Dependencies between variables cause exponential growth in the number of probabil-
ities that need to be elicited and processed.
There have been related developments in the Computer Science community, namely Bayesian
or Belief Networks, which deal with each of these problems.
Unfortunately, they do not allow for decision variables. Once decision variables are added
to a Bayes Net, it becomes known as an Influence Diagram. Influence Diagrams are
a much more succinct format for expressing Decision Analysis problems. While they can be
converted to and solved as a Decision Tree, algorithms developed by Ross Shachter in the
1980s can solve Influence Diagrams directly without recourse to Decision Trees.
But first, some relevant foundations for Influence Diagrams via Bayesian Networks.
An equivalent Bayes Network representation of these decision trees is given by the two
Bayes Networks below.
Dis   P(Dis)
D     0.01
N     0.99

Dis   Test   P(Test|Dis)
D     +      0.9
D     -      0.1
N     +      0.2
N     -      0.8

Test   Dis   P(Dis|Test)
+      D     0.04348
+      N     0.95652
-      D     0.00126
-      N     0.99874

Test   P(Test)
+      0.207
-      0.793
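The flipped tables can be reproduced with Bayes’ rule; a minimal Python sketch (the dictionaries and names are mine, the numbers come from the tables above):

```python
# Arrow reversal ("tree flipping") is just Bayes' rule applied to the
# prior p(Dis) and the likelihoods p(Test|Dis).
p_dis = {"D": 0.01, "N": 0.99}
p_test_given_dis = {("D", "+"): 0.9, ("D", "-"): 0.1,
                    ("N", "+"): 0.2, ("N", "-"): 0.8}

# Marginal p(Test) by the law of total probability
p_test = {t: sum(p_dis[d] * p_test_given_dis[(d, t)] for d in p_dis)
          for t in ("+", "-")}

# Posterior p(Dis|Test) by Bayes' rule
p_dis_given_test = {(t, d): p_dis[d] * p_test_given_dis[(d, t)] / p_test[t]
                    for t in ("+", "-") for d in p_dis}

print(round(p_test["+"], 3))                    # 0.207
print(round(p_dis_given_test[("+", "D")], 5))   # 0.04348
```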
Thus flipping the tree is performed simply by reversing the arrow connecting the two events.
While the format is simple, a Bayesian Network with many events, and many links between
events, can become numerically very costly to process.
In the last class we already dealt with one situation when we could see an explosion in the
number of probabilities that need to be elicited. We were dealing with possible dependencies
between information variables. For example,
• what if the conditional probability of a second positive scan was dependent on the
outcome of the first scan so that p(+|F, −) ≠ p(+|F, +). In that case we can no longer
use 90% for the probability of a faulty second scan if the plate is faulty; it would also
depend on the outcome of the first scan. We tacitly assumed a Naive Bayes model and
kept the problem simple.
• when it comes to medical tests/symptoms we usually work with the sensitivity and
specificity of a test/symptom, ignoring any dependence that may exist between the
conditional probabilities of the test outcomes. Thus given a set of 4 symptoms/tests
each with a +/− outcome, conditional on having a particular disease θ, we have

p(t1, t2, t3, t4 | θ) = p(t1|θ) p(t2|t1, θ) p(t3|t1, t2, θ) p(t4|t1, t2, t3, θ)

and require 1 + 2 + 4 + 8 = 15 assessments, whereas the Naive Bayes assumption allowed
us to simplify the expression to

p(t1, t2, t3, t4 | θ) = p(t1|θ) p(t2|θ) p(t3|θ) p(t4|θ),

requiring only 4 assessments.
• Suppose we have two random variables x and y that affect some performance measure.
Each variable can take on 10 different values. If they are independent then p(x, y) =
p(x)p(y) and we will need to elicit a total of 20 probability values. However if they
are not independent then p(x, y) = p(x)p(y|x) and we will need to elicit 10 probability
values for p(x), and 10 values of p(y|x) for each of the 10 x values, a total of 110 values,
a 5.5-fold increase. If we were to have 3 variables and p(x, y, z) = p(x)p(y|x)p(z|x, y)
then we would need to elicit 10 values for p(x), 100 for p(y|x), and 1000 values for
p(z|x, y), an increase from 30 to 1110, or a 37-fold increase.
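The elicitation counts in the last two bullets can be reproduced in a few lines of Python (an illustrative sketch; the function names are mine):

```python
# Assessments needed for n binary tests conditional on a disease state:
# the full chain rule p(t1|th) p(t2|t1,th) ... needs 1 + 2 + 4 + ... values,
# while Naive Bayes needs just one value per test.
def chain_rule_binary(n):
    return sum(2 ** i for i in range(n))   # = 2^n - 1

print(chain_rule_binary(4))   # 15  (= 1 + 2 + 4 + 8)

# Assessments for m fully dependent variables, each taking k values:
# p(x) needs k values, p(y|x) needs k^2, p(z|x,y) needs k^3, ...
def chain_rule_kvalued(m, k):
    return sum(k ** i for i in range(1, m + 1))

print(chain_rule_kvalued(2, 10), 2 * 10)   # 110 vs 20 if independent
print(chain_rule_kvalued(3, 10), 3 * 10)   # 1110 vs 30 if independent
```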
This phenomenon also exists when random variables depend on decision variables. We
then have to evaluate a distribution for each value of the decision variable. Thus any tools
that help identify key dependencies, and irrelevance between variables, would allow us to
deal with more complex problems.
To encourage the DM to make belief statements that will disentangle the problem, and keep
apart variables that really don’t have much to do with each other, we introduce a binary
notion of irrelevance, expressing that knowledge of one variable is irrelevant to knowledge
of another variable; more formally, X ⊥ Y if and only if p(x|y) = p(x) for all values x and y.
5 Bayesian Networks
The overall objective of Bayesian Inference is to develop the most accurate joint posterior
distribution p(x̂) of inter-related variables.
As we mentioned before, if we insist that all random variables are relevant, then a full model
will be difficult to elicit or compute. We know that in the extreme we might be tempted to
use a Naive Bayes model and believe that all variables are unrelated so that p(x̂) = ∏ⁿᵢ₌₁ pᵢ(xᵢ),
but this position would rarely be defensible. So the question is to find a factorization which
is both credible and solvable. Hence the use of the synonym belief networks.
Let’s zoom into one of the factors pᵢ(xᵢ | x₁, . . . , xᵢ₋₁). Often many of the “given” xⱼ, j =
1, . . . , i − 1 will be irrelevant to xᵢ, i.e. given some key xⱼ the other xⱼ do not provide any
additional information which would be useful in predicting xᵢ. Let’s separate for each i the
set of indexes of variables which are relevant - Qᵢ - from those that are irrelevant - Rᵢ - to xᵢ.
For reasons that will become clear we call Qᵢ the parent set of xᵢ, and Rᵢ the remainder
set of xᵢ
Xᵢ ⊥ X_Rᵢ | X_Qᵢ ,  1 ≤ i ≤ n

For three variables x₁, x₂, x₃ with x₂ ⊥ x₃ | x₁, then

p(x₁, x₂, x₃) = p(x₁) p(x₂|x₁) p(x₃|x₁, x₂) = p(x₁) p(x₂|x₁) p(x₃|x₁)
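A quick numerical check of this factorization (the probability values below are illustrative, chosen by me): build a joint as p(x1) p(x2|x1) p(x3|x1) and confirm that p(x2, x3 | x1) factorizes:

```python
from itertools import product

# Joint over three binary variables built as p(x1) p(x2|x1) p(x3|x1)
p1 = {0: 0.3, 1: 0.7}
p2_given_1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key: (x2, x1)
p3_given_1 = {(0, 0): 0.2, (1, 0): 0.8, (0, 1): 0.5, (1, 1): 0.5}  # key: (x3, x1)

joint = {(a, b, c): p1[a] * p2_given_1[(b, a)] * p3_given_1[(c, a)]
         for a, b, c in product((0, 1), repeat=3)}

# Check x2 is irrelevant to x3 given x1:
# p(x2, x3 | x1) should equal p(x2|x1) * p(x3|x1) in every cell.
ok = True
for a, b, c in product((0, 1), repeat=3):
    p_a = sum(joint[(a, j, k)] for j in (0, 1) for k in (0, 1))
    lhs = joint[(a, b, c)] / p_a
    rhs = (sum(joint[(a, b, k)] for k in (0, 1)) / p_a) * \
          (sum(joint[(a, j, c)] for j in (0, 1)) / p_a)
    ok = ok and abs(lhs - rhs) < 1e-12
print(ok)   # True
```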
Definition 5.1. A directed acyclic graph (DAG) G is a directed graph - of vertices and
directed edges - with no cycles.
Before we proceed any further we need another simple example, taken from a Microsoft
Tutorial on BN. Suppose that a credit card company is trying to make inferences about
the likelihood that a particular transaction is fraudulent. They have isolated a number of
variables:
Next the DM would visit each of the nodes in any order to identify the parent set of each
node.
• Fraud. The DM did not believe that the age and sex of the card-holder were relevant,
nor whether it was a Gas or Jewelry transaction.
• Gas. The DM believed that Fraud may be a predictor of Gas, but not the card-holder’s
Age or Sex, nor if it was a Jewelry transaction.
• Age. No relevant factors.
• Sex. No relevant factors.
• Jewelry. The DM believed that Age and Sex of the card-holder were relevant, if
the transaction was not fraudulent. Also Fraud would be a predictor of a Jewelry
transaction.
The DM then adds a directed arc into each node from each parent considered to be
relevant. The result is Figure 3. The mathematical strength of this depiction is that
the implied irrelevance statements can be deduced directly from the graph.

[Figure 3: the Fraud BN, with vertices Fraud, Gas, Age, Sex, Jewelry and arcs
Fraud → Gas and Fraud, Age, Sex → Jewelry]
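The elicited parent sets fully determine the factorization of the joint distribution. A small sketch (the encoding and helper name are mine) that stores Figure 3 as parent sets and reads the factorization off:

```python
# The Fraud DAG encoded as parent sets, as elicited from the DM above.
parents = {
    "Fraud":   [],
    "Gas":     ["Fraud"],
    "Age":     [],
    "Sex":     [],
    "Jewelry": ["Fraud", "Age", "Sex"],
}

def factorization(parents):
    """One conditional-probability factor per node, given its parents."""
    terms = []
    for node, pa in parents.items():
        terms.append(f"p({node}|{','.join(pa)})" if pa else f"p({node})")
    return " ".join(terms)

print(factorization(parents))
# p(Fraud) p(Gas|Fraud) p(Age) p(Sex) p(Jewelry|Fraud,Age,Sex)
```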
To readily deduce all implied irrelevancy statements directly from the graph we can use the
d-separation theorem. But first some definitions:
Definition 5.4. An ancestral set of a set of vertices X - denoted as A(X) - is the set
of X plus all of their ancestors.
Definition 5.6. A moralized graph G M is a graph that results from “marrying” all
parents of a common child by connecting them with an undirected edge.
Definition 5.7. The Skeleton of a graph G is the same graph but with all directed edges
replaced by undirected edges.
Figure 5 shows the skeleton of the moralized graph for the Fraud example.
Thus Fraud separates Gas from the Age, Sex and Jewelry vertices, so that once the DM
knows whether the transaction is fraudulent or not,
• knowing the card was used to purchase Gas does not help to predict the Age, Sex of
the card-holder, nor whether the card was also used within 24 hrs to purchase Jewelry.
• knowing the Age, Sex of the card-holder, or whether the card was used to purchase
Jewelry does not help to predict the card was also used within 24 hrs to purchase Gas.
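The separation test behind these statements (take the ancestral set, moralize, drop edge directions, then check graph separation) can be sketched in Python. This is an illustrative implementation of Definitions 5.4, 5.6 and 5.7 applied to the Fraud example, not code from the notes:

```python
from collections import deque
from itertools import combinations

def ancestral_set(dag, nodes):
    """Definition 5.4: the given vertices plus all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, X, Y, Z):
    """X is irrelevant to Y given Z iff Z separates X from Y in the
    moralized skeleton (Defs. 5.6, 5.7) of the ancestral graph of X, Y, Z."""
    A = ancestral_set(dag, set(X) | set(Y) | set(Z))
    und = {n: set() for n in A}
    for n in A:
        ps = dag[n] & A
        for p in ps:                      # skeleton: drop directions
            und[n].add(p)
            und[p].add(n)
        for a, b in combinations(ps, 2):  # moralize: "marry" co-parents
            und[a].add(b)
            und[b].add(a)
    # breadth-first search from X that never passes through Z
    frontier, seen = deque(set(X) - set(Z)), set(X)
    while frontier:
        for m in und[frontier.popleft()] - seen:
            if m not in Z:
                seen.add(m)
                frontier.append(m)
    return not (seen & set(Y))

fraud = {"Fraud": set(), "Age": set(), "Sex": set(),
         "Gas": {"Fraud"}, "Jewelry": {"Fraud", "Age", "Sex"}}

print(d_separated(fraud, {"Gas"}, {"Age", "Sex", "Jewelry"}, {"Fraud"}))  # True
print(d_separated(fraud, {"Age"}, {"Sex"}, {"Jewelry"}))                  # False
```

Note the second call: conditioning on the common child Jewelry connects its married parents, so Age and Sex stop being irrelevant to each other.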
To test these hidden irrelevancies, the analyst may want to probe them by asking questions
such as
– Are there any scenarios for which knowing the card was used to purchase Gas
is going to help you predict whether the card will be used to purchase Jewelry
within the same 24-hr period?
– Are there any scenarios for which knowing the Age and Sex of the card-holder will
help to predict if the card will be used to buy Gas?
– Are there any scenarios for which knowing the card was used to purchase Jewelry
is going to help you predict whether the card will be used to purchase Gas within
the same 24-hr period?
A company needs to determine now the cost of projects they may be tendering for in the
future, and may actually be contracting to complete. Early in the process they produce a
ballpark (B) estimate to help determine if they should proceed. If it looks more likely they
will proceed, they ask for an expert (E) estimate. If they are almost ready to proceed they
put the project out for tender, and generally the lowest tender (T) then becomes the best
estimate. Of course it is not until the project is completed that they will know how the cost
actually turned out (O).
Since the DM uses each cost to estimate the next step, the BN is presented as Figure 6.
• Are there any scenarios for which the Ballpark cost might be a better predictor of the
Tender cost than the Expert Cost?
• Are there any scenarios for which the Ballpark or the Expert cost might be a better
predictor of the actual O-cost, than the Tender cost?
And suppose that the second question made the DM realize that there are times when the
T-cost comes in much below the E-cost, and when the O-cost is much closer to the Expert
cost than to the Tender: that is when there is not enough work on the market to keep the
workers busy. Contractors will then tender low in the hope of getting the job, and then will
find ways to increase the O-cost to make up the loss.
Thus there is another variable that enters the picture. Let’s designate it as I - an index for
the amount of work available in the market. With this, the new BN is shown in Figure 7.
Figure 8 then shows the moralized BN, and then the new BN Skeleton. We could then again
challenge the DM based on B ⊥ (T, I, O) | E, by asking for example: Are there any scenarios
for which the Ballpark estimate might be able to better predict the Tendered or O-cost?
Once the DM, analyst and auditor agree that the BN in Figure 3 is valid, then the next step
is to elicit relevant prior probabilities for each of the vertices, as shown in Figure 9.
[Figure 9: the Figure 3 Fraud BN annotated with the elicited prior distributions. Source
caption: “A Bayesian network for detecting credit-card fraud. Arcs are drawn from cause
to effect. The local probability distribution(s) associated with a node are shown adjacent
to the node. An asterisk is a shorthand for ‘any state.’”]

With these prior distributions, we can then compute any desired posterior probability, for
example the posterior probability that a transaction involving gas (g), jewelry (j), a young
(Y) male (M) is fraudulent (f). The notation has been changed slightly from the source (and
the diagram) to clarify the calculations. Note that the dependencies in the expressions below
follow those in the Figure 3 and Figure 9 DAGs. Note, it is common practice to split any
grouped probabilities equally among all possibilities, hence the (1/6) factor.

p(f | g, j, Y, M) = p(g, j, Y, M | f) p(f) / [ p(g, j, Y, M | f) p(f) + p(g, j, Y, M | f̄) p(f̄) ]

= p(g|f) p(j|f, Y, M) p(Y) p(M) p(f) / [ p(g|f) p(j|f, Y, M) p(Y) p(M) p(f) + p(g|f̄) p(j|f̄, Y, M) p(Y) p(M) p(f̄) ]

= p(g|f) (1/6) p(j|f, ∗, ∗) p(f) / [ p(g|f) (1/6) p(j|f, ∗, ∗) p(f) + p(g|f̄) p(j|f̄, Y, M) p(f̄) ]

= 0.2 ∗ (1/6) ∗ 0.05 ∗ 0.00001 / [ 0.2 ∗ (1/6) ∗ 0.05 ∗ 0.00001 + 0.01 ∗ 0.00001 ∗ 0.99999 ]

= 1.67 ∗ 10⁻⁸ / ( 1.67 ∗ 10⁻⁸ + 10 ∗ 10⁻⁸ ) = 1.67 / 11.67 = 0.14

Page 12, compiled on 2016/09/22 at 21:26:50

5.5 Solving Bayesian Networks

For realistic problems exact computations are often NP-hard and approximations are
required, which are the subject of ongoing research. But software has been developed for
large-scale Bayesian Networks, which efficiently stores all the required prior distributions
on the network, and efficiently computes desired probabilities.
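Returning to the fraud example, the worked posterior can be checked numerically. A sketch using the values quoted in that example (the 1/6 factor is assumed to split the grouped p(j|f, ∗, ∗) over six equally likely age/sex possibilities):

```python
# Values from the fraud example (variable names are mine)
p_f = 0.00001        # prior probability of fraud
p_g_f = 0.2          # p(gas | fraud)
p_g_nf = 0.01        # p(gas | no fraud)
p_j_f_any = 0.05     # p(jewelry | fraud, *, *), split by the 1/6 factor
p_j_nf_YM = 0.00001  # p(jewelry | no fraud, young, male)

num = p_g_f * (1 / 6) * p_j_f_any * p_f
den = num + p_g_nf * p_j_nf_YM * (1 - p_f)
print(round(num / den, 2))   # 0.14
```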