
Bayesian Networks


Russell and Norvig: Chapter 14-15


WET PAINT ... VIDEO
BAYESIAN NETWORK
1st node, 2nd node, 3rd node: Chair
Highlight: Back_Cloth depends on Chair
Set CPT probability for CHAIR
Set CPT probability for BACK_CLOTH
When there is a label WET PAINT on the chair ...
When there is no label on the chair ...
When there is no label but we see paint on the back of a shirt ...
When there is a label WET PAINT on the chair and we see a clean shirt back ...
When there is a label WET PAINT on the chair and we see a painted shirt back ...
Probabilistic Agent

[Diagram: an agent connected to its environment through sensors and actuators,
with a "?" inside the agent.]
I believe that the sun will still exist tomorrow with probability 0.999999,
and that it will be sunny with probability 0.6.
Bayesianism is a controversial but increasingly popular approach to statistics
that offers many benefits, although not everyone is persuaded of its validity.
FYI
Bayesian networks are based on a statistical approach presented by the
mathematician Thomas Bayes in 1763.
Introduced by Pearl (1986)
Resembles human reasoning
Captures causal relationships
Used in decision support systems / expert systems


Other Names
Belief networks
Probabilistic networks
Causal networks
Common Sense Reasoning about uncertainty
Ben is waiting for Holmes and Watson, who are both late for an AI seminar.
Ben is worried that if the roads are icy, one or both of them may have crashed
his car.
Suddenly Ben learns that Watson has crashed.
Ben then learns that it is warm outside and the roads are salted (not icy).
Causal Relationships

State of Road (icy / not icy) -> Watson (crash / no crash), Holmes (crash / no crash)

Watson Crashed!
Information flow: the observation of Watson's crash flows through State of Road
(icy / not icy) and changes the belief about Holmes (crash / no crash).

But Roads are dry
With State of Road observed (not icy), Watson's crash no longer changes the
belief about Holmes (crash / no crash).
A simple Bayesian Network

Icy -> Watson (crash), Icy -> Holmes (crash)

P(X_Icy):              yes 0.7, no 0.3
P(X_Watson | X_Icy):   yes: (0.8, 0.2)   no: (0.1, 0.9)
P(X_Holmes | X_Icy):   yes: (0.8, 0.2)   no: (0.1, 0.9)

Observe that Watson has crashed, P(X_Watson = yes) = 1:
  P(X_Icy | X_Watson = yes) = (0.95, 0.05)        a priori: (0.70, 0.30)
  Joint probability + marginalization give
  P(X_Holmes | X_Watson = yes) = (0.76, 0.24)     a priori: (0.59, 0.41)

Observe P(X_Icy = no) = 1:
  When instantiating X_Icy, X_Holmes becomes independent of X_Watson:
  X_Holmes ⊥ X_Watson | X_Icy
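As a quick check of these numbers, here is a minimal Python sketch (variable
names are mine, not from the slides) that recovers P(X_Icy | X_Watson = yes) by
Bayes' rule and P(X_Holmes | X_Watson = yes) by marginalizing over Icy:

```python
# Prior and conditional probabilities from the slide (probability of "yes"/"crash").
p_icy = 0.7                       # P(Icy = yes)
p_crash_given_icy = {True: 0.8,   # P(Watson = crash | Icy) = P(Holmes = crash | Icy)
                     False: 0.1}

# P(Icy | Watson = crash) by Bayes' rule.
joint = {icy: (p_icy if icy else 1 - p_icy) * p_crash_given_icy[icy]
         for icy in (True, False)}
evidence = sum(joint.values())                        # P(Watson = crash) = 0.59
p_icy_given_crash = {icy: joint[icy] / evidence for icy in joint}
print(round(p_icy_given_crash[True], 2))              # 0.95 (a priori 0.70)

# P(Holmes = crash | Watson = crash) by marginalizing over Icy.
p_holmes_crash = sum(p_crash_given_icy[icy] * p_icy_given_crash[icy]
                     for icy in (True, False))
print(round(p_holmes_crash, 2))                       # 0.76 (a priori 0.59)
```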


Wet grass
To avoid icy roads, Watson moves to UCLA; Holmes moves to USC.
One morning as Watson leaves for work, he notices that his grass is wet. He
wonders whether he has left his sprinkler on or whether it has rained.
He then sees that Holmes's grass is also wet: "Rain explains why my lawn is
wet, so probably the sprinkler was not on."

Information flow:
Rain (yes/no) and Sprinkler (on/off) are parents of the grass nodes.
First only Watson's grass is observed wet; then Holmes's grass is observed wet
as well.
Bayesian vs. the Classical Approach
The Bayesian probability of an event x represents a person's degree of belief
in that event, based on prior and observed facts.
Classical probability refers to the true or actual probability of the event and
is not concerned with observed behavior.
The Bayesian approach restricts its prediction to the next (N+1) occurrence of
an event, given the observed previous (N) events.
The classical approach is to predict the likelihood of any given event
regardless of the number of occurrences.
Problem
At a certain time t, the KB of an agent is some collection of beliefs.
At time t the agent makes an observation that changes the strength of one of
its beliefs.
How should the agent update the strength of its other beliefs?
Purpose of Bayesian Networks
Facilitate the description of a collection
of beliefs by making explicit causality
relations and conditional independence
among beliefs
Provide a more efficient way (than by
using joint distribution tables) to update
belief strengths when new evidence is
observed
Bayesian Networks
A simple, graphical notation for conditional independence assertions, resulting
in a compact representation of the full joint distribution.

Syntax:
  a set of nodes, one per variable
  a directed acyclic graph of links between nodes
  a conditional distribution for each node given its parents:
  P(Xi | Parents(Xi))
Bayesian Networks
Data structure which represents the dependence between variables
Gives a concise specification of the joint probability distribution
A Bayesian belief network is a graph that holds:
  Nodes: a set of random variables
  Each node has a conditional probability table
  Edges denote conditional dependencies
  DAG: no directed cycles
  Markov condition: P(Xi | Parents(Xi))
Used to support decision making
Learning Bayes Nets
Given some data from the world, why would we want to learn a Bayes net?
1. Compact representation of the data
   There are fast algorithms for prediction/inference given observations of the
   environment
2. Causal knowledge
   There are fast algorithms for prediction/inference given interventions in
   the environment
Bayesian network Markov Assumption
Each random variable X is independent of its non-descendants given its parents
Pa(X).

[Diagram: parents Y1 and Y2 pointing to X]

Formally, I(X; NonDesc(X) | Pa(X)).
The current state depends on only a finite history of previous states.
First-order Markov process: P(x_t | x_0, ..., x_{t-1}) = P(x_t | x_{t-1})
Bayes' Rule

  P(h | e) = P(e | h) P(h) / P(e)

  posterior = likelihood × prior / probability of evidence

(h: hypothesis; e: observed evidence)

Probability of an hypothesis, h, can be updated when


evidence, e, has been obtained.
Note: it is usually not necessary to calculate P(e) directly as it can be
obtained by normalizing the posterior probabilities, P(hi | e).
A Simple Example
Consider two related variables:
1. Drug (D) with values y or n
2. Test (T) with values +ve or -ve
And suppose we have the following probabilities:
P(D = y) = 0.001
P(T = +ve | D = y) = 0.8
P(T = +ve | D = n) = 0.01
These probabilities are sufficient to define a joint probability distribution.
Suppose an athlete tests positive. What is the probability that he has taken
the drug?
P(D = y | T = +ve) = P(T = +ve | D = y) P(D = y) /
                     [ P(T = +ve | D = y) P(D = y) + P(T = +ve | D = n) P(D = n) ]
                   = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999)
                   ≈ 0.074
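The same Bayes-rule calculation as a short, runnable Python sketch (variable
names are illustrative, not from the source):

```python
# Probabilities from the slide.
p_d = 0.001                # P(D = y): the athlete has taken the drug
p_pos_given_d = 0.8        # P(T = +ve | D = y)
p_pos_given_not_d = 0.01   # P(T = +ve | D = n)

# Bayes' rule: P(D = y | T = +ve).
numerator = p_pos_given_d * p_d
evidence = numerator + p_pos_given_not_d * (1 - p_d)   # P(T = +ve)
posterior = numerator / evidence
print(round(posterior, 3))   # 0.074 -- a positive test is still weak evidence
```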

A More Complex Case
Suppose now that there is a similar link between Lung Cancer (L) and a chest
X-ray (X), and that we also have the following relationships:
  History of smoking (S) has a direct influence on bronchitis (B) and lung
  cancer (L);
  L and B have a direct influence on fatigue (F).
What is the probability that someone has bronchitis given that they smoke,
have fatigue and have received a positive X-ray result?

  P(b1 | s1, f1, x1) = P(b1, s1, f1, x1) / P(s1, f1, x1)
                     = Σ_l P(b1, s1, f1, x1, l) / Σ_{b,l} P(b, s1, f1, x1, l)

where, for example, the variable B takes on values b1 (has bronchitis) and b2
(does not have bronchitis).
R.E. Neapolitan, Learning Bayesian Networks (2004)
Problems with Large Instances
The joint probability distribution, P(b, s, f, x, l)
  For five binary variables there are 2^5 = 32 values in the joint distribution
  (for 100 variables there are over 10^30 values)
  How are these values to be obtained?
Inference
  To obtain posterior distributions once some evidence is available requires
  summation over an exponential number of terms, e.g. 2^2 terms in the
  calculation of

  P(s1, f1, x1) = Σ_{b,l} P(b, s1, f1, x1, l)

  which increases to 2^97 if there are 100 variables.


Bayesian Networks
A Bayesian network consists of:
A graph
  nodes represent the random variables
  directed edges (arrows) between pairs of nodes
  it must be a Directed Acyclic Graph (DAG): no directed cycles
  the graph represents independence relationships between variables
Conditional probability specifications
  the conditional probability of each variable given its parents in the DAG
An Example Bayesian Network

Structure: S -> B, S -> L; B, L -> F; L -> X

Smoking history (S):  P(s1) = 0.2
Bronchitis (B):       P(b1 | s1) = 0.25      P(b1 | s2) = 0.05
Lung Cancer (L):      P(l1 | s1) = 0.003     P(l1 | s2) = 0.00005
Fatigue (F):          P(f1 | b1, l1) = 0.75  P(f1 | b1, l2) = 0.10
                      P(f1 | b2, l1) = 0.5   P(f1 | b2, l2) = 0.05
X-ray (X):            P(x1 | l1) = 0.6       P(x1 | l2) = 0.02

R.E. Neapolitan, Learning Bayesian Networks (2004)
The Joint Probability Distribution
Note that our joint distribution with 5 variables can be represented as:

  P(s, b, l, f, x) = P(s) P(b | s) P(l | b, s) P(f | b, s, l) P(x | b, s, l, f)

But due to the Markov condition we have, for example,

  P(x | b, s, l, f) = P(x | l)

Consequently the joint probability distribution can now be expressed as

  P(s, b, l, f, x) = P(s) P(b | s) P(l | s) P(f | b, l) P(x | l)

For example, the probability that someone has a smoking history, lung cancer
but not bronchitis, suffers from fatigue and tests positive in an X-ray test is

  P(s1, b2, l1, f1, x1) = 0.2 × (1 − 0.25) × 0.003 × 0.5 × 0.6 = 0.000135
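The factored form can be evaluated by simply multiplying CPT entries. A minimal
Python sketch of the calculation above (the dictionary layout and names are
mine), using the CPTs of the example network:

```python
# CPTs of the example network (each entry is the probability of the "1" value).
p_s1 = 0.2
p_b1_given_s = {'s1': 0.25, 's2': 0.05}
p_l1_given_s = {'s1': 0.003, 's2': 0.00005}
p_f1_given_bl = {('b1', 'l1'): 0.75, ('b1', 'l2'): 0.10,
                 ('b2', 'l1'): 0.5,  ('b2', 'l2'): 0.05}
p_x1_given_l = {'l1': 0.6, 'l2': 0.02}

# P(s1, b2, l1, f1, x1) = P(s1) P(b2 | s1) P(l1 | s1) P(f1 | b2, l1) P(x1 | l1)
joint = (p_s1
         * (1 - p_b1_given_s['s1'])      # P(b2 | s1) = 1 - 0.25
         * p_l1_given_s['s1']
         * p_f1_given_bl[('b2', 'l1')]
         * p_x1_given_l['l1'])
print(round(joint, 6))   # 0.000135
```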
The Markov Condition
A Bayesian network (G, P) satisfies the Markov condition, according to which
for each variable X in G:
  X is conditionally independent of its non-descendants given its parents in G
  Denoted by X ⊥ nd(X) | pa(X), or I_P(X, nd(X) | pa(X))

E.g., in the network S -> B, S -> L; B, L -> F; L -> X
Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn:

  P(x1, x2, ..., xn) = Π_{i=1}^{n} P(xi | pa(xi))

An enormous saving can be made regarding the number of values required for the
joint distribution.
To determine the joint distribution directly for n binary variables, 2^n − 1
values are required.
For a BN with n binary variables in which each node has at most k parents,
fewer than 2^k · n values are required.
Causality and Bayesian Networks
Clearly not every BN describes causal relationships between the variables.
Consider the dependence between Lung Cancer, L, and the X-ray test, X. By
focusing on just these variables we might be tempted to represent them by the
following BN:

  L -> X     P(l1) = 0.001      P(x1 | l1) = 0.6      P(x1 | l2) = 0.02

However, the following BN represents the same distribution and independencies
(i.e. none):

  X -> L     P(x1) = 0.02058    P(l1 | x1) = 0.02915  P(l1 | x2) = 0.00041

Nevertheless, it is tempting to think that BNs can be created by creating a DAG
where the edges represent direct causal relationships between the variables.
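One way to see the equivalence is to derive the second network's numbers from
the first by Bayes' rule; a small Python sketch (names are mine):

```python
# Causal parameterization: L -> X.
p_l1 = 0.001
p_x1_given_l = {'l1': 0.6, 'l2': 0.02}

# Marginal of X and the reversed conditionals, by Bayes' rule.
p_x1 = p_x1_given_l['l1'] * p_l1 + p_x1_given_l['l2'] * (1 - p_l1)
p_l1_given_x1 = p_x1_given_l['l1'] * p_l1 / p_x1
p_l1_given_x2 = (1 - p_x1_given_l['l1']) * p_l1 / (1 - p_x1)

print(round(p_x1, 5))           # 0.02058
print(round(p_l1_given_x1, 5))  # 0.02915
print(round(p_l1_given_x2, 5))  # 0.00041
```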
Common Causes
Consider the following DAG:

  Smoking -> Bronchitis,  Smoking -> Lung Cancer

Markov condition: I_P(B, L | S), i.e. P(b | l, s) = P(b | s)
If we know the causal relationships S -> B and S -> L, and we know that Joe is
a smoker, then finding out that he has Bronchitis will not give us any more
information about the probability of him having Lung Cancer.
So the Markov condition would be satisfied.
Common Effects
Consider the following DAG:

  Burglary -> Alarm <- Earthquake

Markov condition: I_P(B, E), i.e. P(b | e) = P(b)
We would expect Burglary and Earthquake to be independent of each other, which
is in agreement with the Markov condition.
We would, however, expect them to be conditionally dependent given Alarm: if
the alarm has gone off, news that there had been an earthquake would make a
burglary less likely ("explaining away"). This conditional dependence does not
conflict with the Markov condition.

The Causal Markov Condition
The basic idea is that the Markov condition holds for a causal DAG.
Certain other conditions must be met for the Causal Markov condition to hold:
  there must be no hidden common causes
  there must not be selection bias
  there must be no feedback loops
Even with these provisos there is a lot of controversy as to its validity.
It seems to be false in quantum mechanical systems that have been studied.
Hidden Common Causes

  H -> X,  H -> Y   (H is hidden)

If a DAG is created on the basis of causal relationships between the variables
under consideration, then X and Y would be marginally independent according to
the Markov condition.
But since they have a hidden common cause, H, they will normally be dependent.
Inference in Bayesian Networks
The main point of BNs is to enable probabilistic inference to be performed.
There are two main types of inference to be carried out:
  Belief updating: obtain the posterior probability of one or more variables
  given evidence concerning the values of other variables
  Abductive inference (or belief revision): find the most probable
  configuration of a set of variables (hypothesis) given evidence
Consider the BN discussed earlier (S -> B, S -> L; B, L -> F; L -> X):
What is the probability that someone has bronchitis (B) given that they smoke
(S), have fatigue (F) and have received a positive X-ray (X) result?
Example
Topology of network encodes conditional
independence assertions:

Weather Cavity

Toothache Catch

Weather is independent of other variables


Toothache and Catch are independent given Cavity
Example
I am at work; neighbor John calls to say my alarm is ringing, but neighbor Mary
does not call. Sometimes the alarm is set off by a minor earthquake. Is there a
burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
A Simple Belief Network

  Burglary, Earthquake (causes) -> Alarm -> JohnCalls, MaryCalls (effects)

Intuitive meaning of an arrow: from cause to effect
Directed acyclic graph (DAG)
Nodes are random variables
Assigning Probabilities to Roots
P(B) P(E)
Burglary 0.001 Earthquake 0.002

Alarm

JohnCalls MaryCalls
Conditional Probability Tables

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

Size of the CPT for a node with k parents: ?
Conditional Probability Tables

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

JohnCalls:  A   P(J | A)      MaryCalls:  A   P(M | A)
            T   0.90                      T   0.70
            F   0.05                      F   0.01
What the BN Means

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

  P(x1, x2, ..., xn) = Π_i P(xi | Parents(Xi))

JohnCalls:  A   P(J | A)      MaryCalls:  A   P(M | A)
            T   0.90                      T   0.70
            F   0.05                      F   0.01
Calculation of Joint Probability

  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
    = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
    = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
    ≈ 0.00062

(using the CPTs from the previous slide: P(B) = 0.001, P(E) = 0.002,
P(A | ¬B, ¬E) = 0.001, P(J | A) = 0.90, P(M | A) = 0.70)
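Written out as a runnable check (a minimal Python sketch; the dictionary layout
and names are mine):

```python
# CPTs of the burglary network (probability of "true" in each case).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

# P(J, M, A, not B, not E) = P(J|A) P(M|A) P(A | not B, not E) P(not B) P(not E)
joint = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(joint)   # 0.000628..., i.e. roughly the 0.00062 shown on the slide
```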
What The BN Encodes

  Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls

For example, John does not observe any burglaries directly.
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and
Earthquake given Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm.
What The BN Encodes

  Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls

For instance, the reasons why John and Mary may not call when there is an alarm
are unrelated.
Note that these reasons could be other beliefs in the network; the
probabilities summarize these non-explicit beliefs.
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and
Earthquake given Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm.
Structure of BN
The relation

  P(x1, x2, ..., xn) = Π_i P(xi | Parents(Xi))

means that each belief is independent of its predecessors in the BN given its
parents.
E.g., JohnCalls is influenced by Burglary, but not directly: JohnCalls is
directly influenced by Alarm.
Said otherwise, the parents of a belief Xi are all the beliefs that directly
influence Xi.
Usually (but not always) the parents of Xi are its causes and Xi is the effect
of these causes.
Construction of BN
Choose the relevant sentences (random variables) that describe the domain.
Select an ordering X1, ..., Xn so that all the beliefs that directly influence
Xi come before Xi. (The ordering guarantees that the BN will have no cycles.)
For each Xj:
  Add a node in the network labeled by Xj
  Connect the nodes of its parents to Xj
  Define the CPT of Xj
Cond. Independence Relations
1. Each random variable X is conditionally independent of its non-descendants,
   given its parents Pa(X).
   Formally, I(X; NonDesc(X) | Pa(X))
2. Each random variable is conditionally independent of all the other nodes in
   the graph, given its neighbors (its Markov blanket).

[Diagram: ancestor, parents Y1 and Y2, X, descendants and non-descendants]
Inference In BN
Set E of evidence variables that are observed, e.g., {JohnCalls, MaryCalls}
Query variable X, e.g., Burglary, for which we would like to know the posterior
probability distribution P(X | E)

  J M   P(B | J, M)
  T T   ?             (the distribution conditional on the observations made)
Inference Patterns
Basic use of a BN: given new observations, compute the new strengths of some
(or all) beliefs.
Other use: given the strength of a belief, which observation should we gather
to make the greatest impact on this strength?

[Four copies of the burglary network illustrating Diagnostic, Causal,
Intercausal, and Mixed inference.]


Types Of Nodes On A Path

[Car network: Battery, Radio, SparkPlugs, Gas, Starts, Moves. On a path a node
can be linear, diverging, or converging.]
Independence Relations In BN

[Same car network: Battery, Radio, SparkPlugs, Gas, Starts, Moves.]

Given a set E of evidence nodes, two beliefs connected by an undirected path
are independent if one of the following three conditions holds:
1. A node on the path is linear and in E
2. A node on the path is diverging and in E
3. A node on the path is converging and neither this node, nor any descendant,
   is in E
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given evidence on SparkPlugs.
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given evidence on Battery.
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given no evidence, but they are
dependent given evidence on Starts or Moves.
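These three conditions can be checked mechanically for any single path. The
Python sketch below does so for the examples above; the edge set of the car
network (Battery -> Radio, Battery -> SparkPlugs, SparkPlugs -> Starts,
Gas -> Starts, Starts -> Moves) is my reading of the figure, and all helper
names are mine.

```python
# Assumed structure of the car network (my reading of the figure).
parents = {'Battery': [], 'Radio': ['Battery'], 'SparkPlugs': ['Battery'],
           'Gas': [], 'Starts': ['SparkPlugs', 'Gas'], 'Moves': ['Starts']}
children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}

def descendants(node):
    """All descendants of a node (children, grandchildren, ...)."""
    out, stack = set(), list(children[node])
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(children[c])
    return out

def path_blocked(path, evidence):
    """Apply the three blocking conditions from the slide to one undirected path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        converging = node in children[prev] and node in children[nxt]
        if converging:
            # Condition 3: a converging node blocks unless it or a descendant is in E.
            if node not in evidence and not (descendants(node) & evidence):
                return True
        elif node in evidence:
            # Conditions 1 and 2: a linear or diverging node in E blocks the path.
            return True
    return False

path = ['Radio', 'Battery', 'SparkPlugs', 'Starts', 'Gas']
print(path_blocked(path, set()))            # True:  Gas and Radio independent
print(path_blocked(path, {'SparkPlugs'}))   # True:  independent given SparkPlugs
print(path_blocked(path, {'Battery'}))      # True:  independent given Battery
print(path_blocked(path, {'Starts'}))       # False: dependent given Starts
```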
BN Inference
Simplest case:  A -> B

  P(B) = P(a) P(B | a) + P(¬a) P(B | ¬a) = Σ_A P(A) P(B | A)

Chain:  A -> B -> C

  P(C) = ???
BN Inference
Chain:  X1 -> X2 -> ... -> Xn

What is the time complexity to compute P(Xn)?
What is the time complexity if we computed the full joint?
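For a chain, P(Xn) can be computed by pushing the distribution forward one node
at a time, which is O(n) for binary variables; enumerating the full joint would
cost O(2^n). A minimal sketch with made-up numbers (not from the slides):

```python
# Chain X1 -> X2 -> ... -> Xn of binary variables.
p_x1 = [0.3, 0.7]            # P(X1 = 0), P(X1 = 1)
p_next = [[0.9, 0.1],        # P(X_{i+1} | X_i = 0)
          [0.2, 0.8]]        # P(X_{i+1} | X_i = 1)  (same CPT reused along the chain)

def marginal_xn(n):
    """P(Xn): one small matrix-vector step per node, i.e. O(n) work."""
    p = p_x1
    for _ in range(n - 1):
        p = [sum(p[i] * p_next[i][j] for i in range(2)) for j in range(2)]
    return p

print(marginal_xn(10))   # n - 1 small summations, instead of summing 2^n joint terms
```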


Example of a simple Bayesian network

  A -> C <- B        p(A, B, C) = p(C | A, B) p(A) p(B)

Probability model has a simple factored form
Directed edges => direct dependence
Absence of an edge => conditional independence
Also known as belief networks, graphical models, causal networks
Other formulations, e.g., undirected graphical models


Examples of 3-way Bayesian Networks

Marginal independence (no edges between A, B, C):
  p(A, B, C) = p(A) p(B) p(C)

Conditionally independent effects (A -> B, A -> C):
  p(A, B, C) = p(B | A) p(C | A) p(A)
  B and C are conditionally independent given A
  e.g., A is a disease, and we model B and C as conditionally independent
  symptoms given A

Independent causes (A -> C <- B):
  p(A, B, C) = p(C | A, B) p(A) p(B)
  A and B are (marginally) independent but become dependent once C is known
  Given C, observing A makes B less likely
  e.g., the earthquake/burglary/alarm example

Markov dependence (A -> B -> C):
  p(A, B, C) = p(C | B) p(B | A) p(A)
Inference Ex. 2

  Cloudy -> Sprinkler, Cloudy -> Rain; Sprinkler, Rain -> WetGrass

The algorithm computes not individual probabilities, but entire tables.
Two ideas are crucial to avoiding exponential blowup:
  Because of the structure of the BN, some subexpressions in the joint depend
  only on a small number of variables.
  By computing them once and caching the result, we can avoid generating them
  exponentially many times.

  P(W) = Σ_{R,S,C} P(w | r, s) P(r | c) P(s | c) P(c)
       = Σ_{R,S} P(w | r, s) Σ_C P(r | c) P(s | c) P(c)
       = Σ_{R,S} P(w | r, s) f_C(r, s)
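A runnable version of this caching idea for the query P(W = true). The CPT
numbers are the usual textbook values for this network, assumed here because
the slides do not list them; the factor name f_C follows the slide.

```python
from itertools import product

# Assumed CPTs for the Cloudy / Sprinkler / Rain / WetGrass network.
P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

# Step 1: sum out C once and cache the result as the factor f_C(R, S).
f_C = {}
for r, s in product([True, False], repeat=2):
    f_C[(r, s)] = sum(P_C[c]
                      * (P_R[c] if r else 1 - P_R[c])
                      * (P_S[c] if s else 1 - P_S[c])
                      for c in (True, False))

# Step 2: P(W = true) = sum over R, S of P(w | r, s) * f_C(r, s).
p_w = sum(P_W[(s, r)] * f_C[(r, s)] for r, s in product([True, False], repeat=2))
print(round(p_w, 4))   # ~0.647 with these CPTs
```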
Approaches to inference
Exact inference
Inference in Simple Chains
Variable elimination
Clustering / join tree algorithms
Approximate inference
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods
Stochastic simulation - direct
Suppose you are given values for some subset of the
variables, G, and want to infer values for unknown
variables, U
Randomly generate a very large number of
instantiations from the BN
Generate instantiations for all variables, starting at the roots

Rejection Sampling: keep those instantiations that are


consistent with the values for G
Use the frequency of values for U to get estimated
probabilities
Accuracy of the results depends on the size of the
sample (asymptotically approaches exact results)
Direct Stochastic Simulation

  Cloudy -> Sprinkler, Rain; Sprinkler, Rain -> WetGrass

P(WetGrass | Cloudy)?
P(WetGrass | Cloudy) = P(WetGrass ∧ Cloudy) / P(Cloudy)

1. Repeat N times:
   1.1. Guess Cloudy at random
   1.2. For each guess of Cloudy, guess Sprinkler and Rain, then WetGrass
2. Compute the ratio of the # of runs where WetGrass and Cloudy are true over
   the # of runs where Cloudy is true
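A minimal sketch of this procedure in Python. The CPTs are the usual textbook
values (an assumption, since the slide does not give them), and the function
names are mine:

```python
import random

# Assumed CPTs: Cloudy -> {Sprinkler, Rain} -> WetGrass.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

def sample_once():
    """Sample every variable in topological order from P(Xi | parents(Xi))."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

def estimate_p_wet_given_cloudy(n=100_000):
    wet_and_cloudy = cloudy = 0
    for _ in range(n):
        c, s, r, w = sample_once()
        if c:                          # keep only runs consistent with Cloudy = true
            cloudy += 1
            wet_and_cloudy += w
    return wet_and_cloudy / cloudy

print(estimate_p_wet_given_cloudy())   # ~0.745 with these CPTs
```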
Direct Sampling
Suppose we have no evidence, but we
want to determine P(Cloudy, Sprinkler,
Rain, WetGrass) for all Cloudy,
Sprinkler, Rain, WetGrass.

Direct sampling:
Sample each variable in topological order,
conditioned on values of parents.
I.e., always sample from P(Xi |
parents(Xi))
Example
1. Sample from P(Cloudy). Suppose returns true.

2. Sample from P(Sprinkler | Cloudy = true). Suppose


returns false.

3. Sample from P(Rain | Cloudy = true). Suppose


returns true.

4. Sample from P(WetGrass | Sprinkler = false, Rain =


true). Suppose returns true.

Here is the sampled event: [true, false, true, true]


Suppose there are N total samples, and let NS (x1, ...,
xn) be the observed frequency of the specific event
x1, ..., xn.
  lim_{N -> ∞} N_S(x1, ..., xn) / N = P(x1, ..., xn)

so we estimate  P(x1, ..., xn) ≈ N_S(x1, ..., xn) / N
Suppose N samples, n nodes. Complexity O(Nn).

Problem 1: Need lots of samples to get good


probability estimates.
Problem 2: Many samples are not realistic; low
likelihood.
Likelihood weighting
Avoid generating samples that are going to be rejected in the first place!
Sample only from the unknown variables Z
Weight each sample according to the likelihood that it would occur, given the
evidence E
Markov chain Monte Carlo algorithm
So called because
  Markov chain: each instance generated in the sample is dependent on the
  previous instance
  Monte Carlo: statistical sampling method
Perform a random walk through variable assignment
space, collecting statistics as you go
Start with a random instantiation, consistent with evidence
variables
At each step, for some nonevidence variable, randomly
sample its value, consistent with the other current
assignments
Given enough samples, MCMC gives an accurate
estimate of the true distribution of values
Markov Chain Monte Carlo Sampling Algorithm
Start with a random sample of the variables (x1, ..., xn); this is the current
state.
Next state: randomly sample a value for one non-evidence variable Xi,
conditioned on the current values of the variables in the Markov blanket of Xi.
Markov Chain Monte Carlo Sampling
One of the most common methods used in real applications.
Markov blanket of Xi: the parents of Xi, its children, and its children's other
parents.
Recall that, by construction of a Bayesian network, a node is conditionally
independent of its non-descendants, given its parents.
Proposition: a node Xi is conditionally independent of all other nodes in the
network, given its Markov blanket.
Example
Query: What is P(Rain | Sprinkler =
true, WetGrass = true)?
MCMC:
Random sample, with evidence variables fixed:
[true, true, false, true]

Repeat:
1. Sample Cloudy, given current values of its Markov blanket:
Sprinkler = true, Rain = false. Suppose result is false. New
state:
[false, true, false, true]

2. Sample Rain, given current values of its Markov blanket:


Cloudy = false, Sprinkler = true, WetGrass = true. Suppose
result is true. New state: [false, true, true, true].
Each sample contributes to estimate for query
P(Rain | Sprinkler = true, WetGrass = true)

Suppose we perform 100 such samples, 20 with Rain = true and


80 with Rain = false.

Then the answer to the query is
Normalize(⟨20, 80⟩) = ⟨0.20, 0.80⟩

Why it works: the sampling process settles into a dynamic equilibrium in which
the long-run fraction of time spent in each state is exactly proportional to
its posterior probability, given the evidence.
That is: for all variables Xi, the probability of the value xi of Xi appearing
in a sample is equal to P(xi | e).
Claim: MCMC settles into behavior in which each state is sampled exactly
according to its posterior probability, given the evidence.
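A compact sketch of this MCMC (Gibbs sampling) procedure for the query
P(Rain | Sprinkler = true, WetGrass = true). The CPT numbers are the usual
textbook values (an assumption); burn-in is omitted for brevity and all names
are mine.

```python
import random

# Assumed CPTs for the Cloudy / Sprinkler / Rain / WetGrass network.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

def bern(p):
    return random.random() < p

def resample(var, state):
    """Sample one non-evidence variable given its Markov blanket by normalizing
    the product of the CPT entries that mention it."""
    weights = {}
    for v in (True, False):
        if var == 'C':    # C appears in P(C), P(S | C), P(R | C)
            w = P_C if v else 1 - P_C
            w *= P_S[v] if state['S'] else 1 - P_S[v]
            w *= P_R[v] if state['R'] else 1 - P_R[v]
        else:             # var == 'R': R appears in P(R | C), P(W | S, R)
            w = P_R[state['C']] if v else 1 - P_R[state['C']]
            pw = P_W[(state['S'], v)]
            w *= pw if state['W'] else 1 - pw
        weights[v] = w
    return bern(weights[True] / (weights[True] + weights[False]))

def gibbs(n=50_000):
    # Evidence S and W are fixed to true; C and R start at random values.
    state = {'C': bern(0.5), 'S': True, 'R': bern(0.5), 'W': True}
    rain_true = 0
    for _ in range(n):
        state['C'] = resample('C', state)
        state['R'] = resample('R', state)
        rain_true += state['R']
    return rain_true / n

print(gibbs())   # ~0.32 with these CPTs
```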
Naive Bayes Model

  C -> Y1, Y2, Y3, ..., Yn

  P(C | Y1, ..., Yn) ∝ P(C) Π_i P(Yi | C)

Features Yi are conditionally independent given the class variable C
Widely used in machine learning
Conditional probabilities P(Yi | C) can easily be estimated from labeled data
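A minimal naive-Bayes sketch in Python; the class priors and per-feature
likelihoods are illustrative numbers, not from the slides:

```python
# P(C) and P(Yi = true | C) for two classes and three binary features.
prior = {'spam': 0.4, 'ham': 0.6}
likelihood = {'spam': [0.8, 0.6, 0.3],    # P(Yi = true | C = spam)
              'ham':  [0.1, 0.4, 0.5]}    # P(Yi = true | C = ham)

def classify(features):
    """P(C | Y1..Yn) proportional to P(C) * prod_i P(Yi | C), then normalized."""
    scores = {}
    for c in prior:
        p = prior[c]
        for p_true, y in zip(likelihood[c], features):
            p *= p_true if y else 1 - p_true
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(classify([True, True, False]))   # posterior over the two classes
```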


Hidden Markov Model (HMM)

  Observed:  Y1   Y2   Y3   ...   Yn
  Hidden:    S1 -> S2 -> S3 -> ... -> Sn   (each St emits Yt)

Two key assumptions:
1. the hidden state sequence is Markov
2. observation Yt is conditionally independent of all other variables given St
Widely used in speech recognition and protein sequence models
Since this is a Bayesian network polytree, inference is linear in n


Summary
Bayesian networks represent a joint distribution using a graph

The graph encodes a set of conditional independence


assumptions

Answering queries (or inference or reasoning) in a Bayesian


network amounts to efficient computation of appropriate
conditional probabilities

Probabilistic inference is intractable in the general case


But can be carried out in linear time for certain classes of Bayesian
networks
Learning Bayesian Networks

Learning Bayesian networks

  Data + Prior information --> Inducer --> a network over E, B, R, A, C
                                           with CPTs, e.g. P(A | E, B):
     e  b    .9    .1
     e ¬b    .7    .3
    ¬e  b    .8    .2
    ¬e ¬b    .99   .01
Known Structure -- Complete Data

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Input: network E, B -> A with unknown CPT entries (all "?")
Inducer output: the same structure with an estimated CPT P(A | E, B):
     e  b    .9    .1
     e ¬b    .7    .3
    ¬e  b    .8    .2
    ¬e ¬b    .99   .01

Network structure is specified; the inducer needs to estimate parameters.
Data does not contain missing values.
Unknown Structure -- Complete Data

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Input: variables E, B, A; neither the arcs nor the CPT entries are known
Inducer output: a selected structure (e.g. E, B -> A) with an estimated CPT
P(A | E, B)

Network structure is not specified; the inducer needs to select arcs and
estimate parameters.
Data does not contain missing values.
Known Structure -- Incomplete Data

Data (E, B, A): <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>
Input: network E, B -> A with unknown CPT entries
Inducer output: the same structure with an estimated CPT P(A | E, B)

Network structure is specified.
Data contains missing values; we consider assignments to missing values.
Known Structure / Complete Data
Given a network structure G
  and a choice of parametric family for P(Xi | Pai)
Learn parameters for the network

Goal: find the parameter values that maximize the probability of having
generated the observed data


Learning Parameters for a Bayesian Network

  Network: E, B -> A; A -> C

Training data has the form:
  E[1]  B[1]  A[1]  C[1]
  ...
  E[M]  B[M]  A[M]  C[M]
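With complete data and a fixed structure, the maximum-likelihood estimate of
each CPT entry is just a conditional frequency in the training set. A small
sketch on a tiny made-up dataset (names and data are mine):

```python
from collections import Counter

# Complete training cases for a network in which E and B are parents of A.
data = [          # (E, B, A)
    ('Y', 'N', 'N'), ('Y', 'Y', 'Y'), ('N', 'N', 'Y'),
    ('N', 'Y', 'Y'), ('N', 'Y', 'Y'), ('N', 'N', 'N'),
]

# MLE of P(A = Y | E, B) = count(E, B, A = Y) / count(E, B).
joint = Counter((e, b, a) for e, b, a in data)
parent_counts = Counter((e, b) for e, b, _ in data)

cpt = {eb: joint[(eb[0], eb[1], 'Y')] / n for eb, n in parent_counts.items()}
print(cpt)   # e.g. P(A = Y | E = 'N', B = 'N') = 0.5 on this toy data
```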
Unknown Structure -- Complete Data (as before)

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Network structure is not specified; the inducer needs to select arcs and
estimate parameters.
Data does not contain missing values.

Benefits of Learning Structure


Discover structural properties of the domain
Ordering of events
Relevance
Identifying independencies --> faster inference
Predict effect of actions
  Involves learning causal relationships among variables
Why Struggle for Accurate Structure?

  Earthquake -> Alarm Set <- Burglary;  Alarm Set -> Sound

Adding an arc:
  Increases the number of parameters to be fitted
  Wrong assumptions about causality and domain structure
Missing an arc:
  Cannot be compensated for by accurate fitting of parameters
  Also misses causality and domain structure
Approaches to Learning Structure
Constraint based
Perform tests of conditional independence
Search for a network that is consistent with the
observed dependencies and independencies

Pros & Cons


Intuitive, follows closely the construction of BNs
Separates structure learning from the form of the
independence tests
Sensitive to errors in individual tests
Approaches to Learning Structure
Score based
Define a score that evaluates how well the
(in)dependencies in a structure match the observations
Search for a structure that maximizes the score
Pros & Cons
Statistically motivated
Can make compromises
Takes the structure of conditional probabilities into
account
Computationally hard

Heuristic Search
Define a search space:
nodes are possible structures
edges denote adjacency of structures
Traverse this space looking for high-scoring
structures
Search techniques:
Greedy hill-climbing
Best first search
Simulated Annealing
...

Heuristic Search (cont.)

Typical operations on a candidate structure (over nodes S, C, E, D): add an
edge, delete an edge, or reverse an edge.
Exploiting Decomposability in Local Search

[Figure: neighboring structures over S, C, E, D that differ by a single edge
change.]

Caching: to update the score after a local change, we only need to re-score
the families that were changed in the last move.

Greedy Hill-Climbing
Simplest heuristic local search
Start with a given network
empty network
best tree
a random network
At each iteration
Evaluate all possible changes
Apply change that leads to best improvement in score
Reiterate
Stop when no modification improves score
Each step requires evaluating approximately
n new changes
Greedy Hill-Climbing: Possible Pitfalls
Greedy Hill-Climbing can get stuck in:
Local Maxima:
All one-edge changes reduce the score
Plateaus:
Some one-edge changes leave the score unchanged
Happens because equivalent networks received the same
score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
Random restarts
TABU search

Summary
Belief update
Role of conditional independence
Belief networks
Causality ordering
Inference in BN
Stochastic Simulation
Learning BNs

A Bayesian Network
37 variables, 509 parameters (instead of 2^37)

MINVOLSET

PULMEMBOLUS INTUBATION KINKEDTUBE VENTMACH DISCONNECT

PAP SHUNT VENTLUNG VENTTUBE

PRESS
MINOVL FIO2 VENTALV

ANAPHYLAXIS PVSAT ARTCO2

TPR SAO2 INSUFFANESTH EXPCO2

HYPOVOLEMIA LVFAILURE CATECHOL

LVEDVOLUME STROKEVOLUME HISTORY ERRLOWOUTPUT HR ERRCAUTER

CVP PCWP CO HREKG HRSAT


HRBP

BP

Population-Wide Approach
Anthrax Release Global nodes

Location of Release Time of Release Interface nodes

Person Model   Person Model   Person Model   (one for each person in the population)

Note the conditional independence assumptions


Anthrax is infectious but non-contagious

Population-Wide Approach
Anthrax Release Global nodes

Location of Release Time of Release Interface nodes

Person Model   Person Model   Person Model   (one for each person in the population)

Structure designed by expert judgment


Parameters obtained from census data, training data, and expert
assessments informed by literature and experience
Person Model (Initial Prototype)

[Per-person network, conditioned on Anthrax Release, Time of Release and
Location of Release: Age Decile, Gender, Home Zip, Anthrax Infection, Other ED
Disease, Respiratory from Anthrax, Respiratory CC From Other, Respiratory CC,
ED Admit from Anthrax, ED Admit from Other, When Admitted, ED Admission. The
figure shows two copies of this model, one per person.]
Person Model (Initial Prototype)

[The same per-person network, now with example values filled in, e.g. Gender =
Female / Male, Age Decile = 20-30 / 50-60, Home Zip = 15213 / 15146,
Respiratory CC = Unknown / False, ED Admit from Anthrax = False, ED Admission =
Yesterday / never.]
