
Bayesian Networks


Russell and Norvig: Chapter 14-15


WET PAINT ... VIDEO
BAYESIAN NETWORK
1st node, 2nd node, 3rd node: Chair
Highlight: Back_Cloth depends on Chair
Set CPT probability for CHAIR
Set CPT probability for BACK_CLOTH
When there is a label WET PAINT on the chair ...
When there is no label on the chair ...
When there is no label but we see paint on the back of a shirt ...
When there is a label WET PAINT on the chair and we see a clean shirt back ...
When there is a label WET PAINT on the chair and we see a painted shirt back ...
Probabilistic Agent

[Diagram: an agent connected to its environment through sensors and actuators,
with a "?" inside the agent.]
I believe that the sun will still exist tomorrow with probability 0.999999,
and that it will be sunny with probability 0.6.
Bayesianism is a controversial but increasingly popular approach to statistics
that offers many benefits, although not everyone is persuaded of its validity.
FYI
Bayesian networks are based on a statistical approach presented by the
mathematician Thomas Bayes in 1763.
Introduced by Pearl (1986)
Resembles human reasoning
Captures causal relationships
Used in decision support systems / expert systems


Other Names
Belief networks
Probabilistic networks
Causal networks
Common Sense Reasoning about uncertainty
Ben is waiting for Holmes and Watson, who are both late for an AI seminar.
Ben is worried that if the roads are icy, one or both of them may have crashed
his car.
Suddenly Ben learns that Watson has crashed.
Ben then learns that it is warm outside and the roads are salted (not icy).
Causal Relationships

State of Road (icy / not icy) -> Watson (crash / no crash), Holmes (crash / no crash)

Watson Crashed!
Information flow: the observation of Watson's crash flows through State of Road
(icy / not icy) and changes the belief about Holmes (crash / no crash).

But Roads are dry
With State of Road observed (not icy), Watson's crash no longer changes the
belief about Holmes (crash / no crash).
A simple Bayesian Network

Icy -> Watson (crash), Icy -> Holmes (crash)

P(X_Icy):              yes 0.7, no 0.3
P(X_Watson | X_Icy):   yes: (0.8, 0.2)   no: (0.1, 0.9)
P(X_Holmes | X_Icy):   yes: (0.8, 0.2)   no: (0.1, 0.9)

Observe that Watson has crashed, P(X_Watson = yes) = 1:
  P(X_Icy | X_Watson = yes) = (0.95, 0.05)        a priori: (0.70, 0.30)
  Joint probability + marginalization give
  P(X_Holmes | X_Watson = yes) = (0.76, 0.24)     a priori: (0.59, 0.41)

Observe P(X_Icy = no) = 1:
  When instantiating X_Icy, X_Holmes becomes independent of X_Watson:
  X_Holmes ⊥ X_Watson | X_Icy
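As a quick check of these numbers, here is a minimal Python sketch (variable
names are mine, not from the slides) that recovers P(X_Icy | X_Watson = yes) by
Bayes' rule and P(X_Holmes | X_Watson = yes) by marginalizing over Icy:

```python
# Prior and conditional probabilities from the slide (probability of "yes"/"crash").
p_icy = 0.7                       # P(Icy = yes)
p_crash_given_icy = {True: 0.8,   # P(Watson = crash | Icy) = P(Holmes = crash | Icy)
                     False: 0.1}

# P(Icy | Watson = crash) by Bayes' rule.
joint = {icy: (p_icy if icy else 1 - p_icy) * p_crash_given_icy[icy]
         for icy in (True, False)}
evidence = sum(joint.values())                        # P(Watson = crash) = 0.59
p_icy_given_crash = {icy: joint[icy] / evidence for icy in joint}
print(round(p_icy_given_crash[True], 2))              # 0.95 (a priori 0.70)

# P(Holmes = crash | Watson = crash) by marginalizing over Icy.
p_holmes_crash = sum(p_crash_given_icy[icy] * p_icy_given_crash[icy]
                     for icy in (True, False))
print(round(p_holmes_crash, 2))                       # 0.76 (a priori 0.59)
```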


Wet grass
To avoid icy roads, Watson moves to UCLA; Holmes moves to USC.
One morning as Watson leaves for work, he notices that his grass is wet. He
wonders whether he has left his sprinkler on or whether it has rained.
He then sees that Holmes's grass is also wet: "Rain explains why my lawn is
wet, so probably the sprinkler was not on."

Information flow:
Rain (yes/no) and Sprinkler (on/off) are parents of the grass nodes.
First only Watson's grass is observed wet; then Holmes's grass is observed wet
as well.
Bayesian vs. the Classical Approach
The Bayesian probability of an event x represents a person's degree of belief
in that event, based on prior and observed facts.
Classical probability refers to the true or actual probability of the event and
is not concerned with observed behavior.
The Bayesian approach restricts its prediction to the next (N+1) occurrence of
an event, given the observed previous (N) events.
The classical approach is to predict the likelihood of any given event
regardless of the number of occurrences.
Problem
At a certain time t, the KB of an agent is some collection of beliefs.
At time t the agent makes an observation that changes the strength of one of
its beliefs.
How should the agent update the strength of its other beliefs?
Purpose of Bayesian Networks
Facilitate the description of a collection
of beliefs by making explicit causality
relations and conditional independence
among beliefs
Provide a more efficient way (than by
using joint distribution tables) to update
belief strengths when new evidence is
observed
Bayesian Networks
A simple, graphical notation for conditional independence assertions, resulting
in a compact representation of the full joint distribution.

Syntax:
  a set of nodes, one per variable
  a directed acyclic graph of links between nodes
  a conditional distribution for each node given its parents:
  P(Xi | Parents(Xi))
Bayesian Networks
Data structure which represents the dependence between variables
Gives a concise specification of the joint probability distribution
A Bayesian belief network is a graph that holds:
  Nodes: a set of random variables
  Each node has a conditional probability table
  Edges denote conditional dependencies
  DAG: no directed cycles
  Markov condition: P(Xi | Parents(Xi))
Used to support decision making
Learning Bayes Nets
Given some data from the world, why would we want to learn a Bayes net?
1. Compact representation of the data
   There are fast algorithms for prediction/inference given observations of the
   environment
2. Causal knowledge
   There are fast algorithms for prediction/inference given interventions in
   the environment
Bayesian network Markov Assumption
Each random variable X is independent of its non-descendants given its parents
Pa(X).

[Diagram: parents Y1 and Y2 pointing to X]

Formally, I(X; NonDesc(X) | Pa(X)).
The current state depends on only a finite history of previous states.
First-order Markov process: P(x_t | x_0, ..., x_{t-1}) = P(x_t | x_{t-1})
Bayes' Rule

  P(h | e) = P(e | h) P(h) / P(e)

  posterior = likelihood × prior / probability of evidence

(h: hypothesis; e: observed evidence)

Probability of an hypothesis, h, can be updated when


evidence, e, has been obtained.
Note: it is usually not necessary to calculate P(e) directly as it can be
obtained by normalizing the posterior probabilities, P(hi | e).
A Simple Example
Consider two related variables:
1. Drug (D) with values y or n
2. Test (T) with values +ve or -ve
And suppose we have the following probabilities:
P(D = y) = 0.001
P(T = +ve | D = y) = 0.8
P(T = +ve | D = n) = 0.01
These probabilities are sufficient to define a joint probability distribution.
Suppose an athlete tests positive. What is the probability that he has taken
the drug?
P(D = y | T = +ve) = P(T = +ve | D = y) P(D = y) /
                     [ P(T = +ve | D = y) P(D = y) + P(T = +ve | D = n) P(D = n) ]
                   = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999)
                   ≈ 0.074
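The same Bayes-rule calculation as a short, runnable Python sketch (variable
names are illustrative, not from the source):

```python
# Probabilities from the slide.
p_d = 0.001                # P(D = y): the athlete has taken the drug
p_pos_given_d = 0.8        # P(T = +ve | D = y)
p_pos_given_not_d = 0.01   # P(T = +ve | D = n)

# Bayes' rule: P(D = y | T = +ve).
numerator = p_pos_given_d * p_d
evidence = numerator + p_pos_given_not_d * (1 - p_d)   # P(T = +ve)
posterior = numerator / evidence
print(round(posterior, 3))   # 0.074 -- a positive test is still weak evidence
```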

A More Complex Case
Suppose now that there is a similar link between Lung Cancer (L) and a chest
X-ray (X), and that we also have the following relationships:
  History of smoking (S) has a direct influence on bronchitis (B) and lung
  cancer (L);
  L and B have a direct influence on fatigue (F).
What is the probability that someone has bronchitis given that they smoke,
have fatigue and have received a positive X-ray result?

  P(b1 | s1, f1, x1) = P(b1, s1, f1, x1) / P(s1, f1, x1)
                     = Σ_l P(b1, s1, f1, x1, l) / Σ_{b,l} P(b, s1, f1, x1, l)

where, for example, the variable B takes on values b1 (has bronchitis) and b2
(does not have bronchitis).
R.E. Neapolitan, Learning Bayesian Networks (2004)
Problems with Large Instances
The joint probability distribution, P(b, s, f, x, l)
  For five binary variables there are 2^5 = 32 values in the joint distribution
  (for 100 variables there are over 10^30 values)
  How are these values to be obtained?
Inference
  To obtain posterior distributions once some evidence is available requires
  summation over an exponential number of terms, e.g. 2^2 terms in the
  calculation of

  P(s1, f1, x1) = Σ_{b,l} P(b, s1, f1, x1, l)

  which increases to 2^97 if there are 100 variables.


Bayesian Networks
A Bayesian network consists of:
A graph
  nodes represent the random variables
  directed edges (arrows) between pairs of nodes
  it must be a Directed Acyclic Graph (DAG): no directed cycles
  the graph represents independence relationships between variables
Conditional probability specifications
  the conditional probability of each variable given its parents in the DAG
An Example Bayesian Network

Structure: S -> B, S -> L; B, L -> F; L -> X

Smoking history (S):  P(s1) = 0.2
Bronchitis (B):       P(b1 | s1) = 0.25      P(b1 | s2) = 0.05
Lung Cancer (L):      P(l1 | s1) = 0.003     P(l1 | s2) = 0.00005
Fatigue (F):          P(f1 | b1, l1) = 0.75  P(f1 | b1, l2) = 0.10
                      P(f1 | b2, l1) = 0.5   P(f1 | b2, l2) = 0.05
X-ray (X):            P(x1 | l1) = 0.6       P(x1 | l2) = 0.02

R.E. Neapolitan, Learning Bayesian Networks (2004)
The Joint Probability Distribution
Note that our joint distribution with 5 variables can be represented as:

  P(s, b, l, f, x) = P(s) P(b | s) P(l | b, s) P(f | b, s, l) P(x | b, s, l, f)

But due to the Markov condition we have, for example,

  P(x | b, s, l, f) = P(x | l)

Consequently the joint probability distribution can now be expressed as

  P(s, b, l, f, x) = P(s) P(b | s) P(l | s) P(f | b, l) P(x | l)

For example, the probability that someone has a smoking history, lung cancer
but not bronchitis, suffers from fatigue and tests positive in an X-ray test is

  P(s1, b2, l1, f1, x1) = 0.2 × (1 − 0.25) × 0.003 × 0.5 × 0.6 = 0.000135
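The factored form can be evaluated by simply multiplying CPT entries. A minimal
Python sketch of the calculation above (the dictionary layout and names are
mine), using the CPTs of the example network:

```python
# CPTs of the example network (each entry is the probability of the "1" value).
p_s1 = 0.2
p_b1_given_s = {'s1': 0.25, 's2': 0.05}
p_l1_given_s = {'s1': 0.003, 's2': 0.00005}
p_f1_given_bl = {('b1', 'l1'): 0.75, ('b1', 'l2'): 0.10,
                 ('b2', 'l1'): 0.5,  ('b2', 'l2'): 0.05}
p_x1_given_l = {'l1': 0.6, 'l2': 0.02}

# P(s1, b2, l1, f1, x1) = P(s1) P(b2 | s1) P(l1 | s1) P(f1 | b2, l1) P(x1 | l1)
joint = (p_s1
         * (1 - p_b1_given_s['s1'])      # P(b2 | s1) = 1 - 0.25
         * p_l1_given_s['s1']
         * p_f1_given_bl[('b2', 'l1')]
         * p_x1_given_l['l1'])
print(round(joint, 6))   # 0.000135
```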
The Markov Condition
A Bayesian network (G, P) satisfies the Markov condition, according to which
for each variable X in G:
  X is conditionally independent of its non-descendants given its parents in G
  Denoted by X ⊥ nd(X) | pa(X), or I_P(X, nd(X) | pa(X))

E.g., in the network S -> B, S -> L; B, L -> F; L -> X
Representing the Joint Distribution
In general, for a network with nodes X1, X2, ..., Xn:

  P(x1, x2, ..., xn) = Π_{i=1}^{n} P(xi | pa(xi))

An enormous saving can be made regarding the number of values required for the
joint distribution.
To determine the joint distribution directly for n binary variables, 2^n − 1
values are required.
For a BN with n binary variables in which each node has at most k parents,
fewer than 2^k · n values are required.
Causality and Bayesian Networks
Clearly not every BN describes causal relationships between the variables.
Consider the dependence between Lung Cancer, L, and the X-ray test, X. By
focusing on just these variables we might be tempted to represent them by the
following BN:

  L -> X     P(l1) = 0.001      P(x1 | l1) = 0.6      P(x1 | l2) = 0.02

However, the following BN represents the same distribution and independencies
(i.e. none):

  X -> L     P(x1) = 0.02058    P(l1 | x1) = 0.02915  P(l1 | x2) = 0.00041

Nevertheless, it is tempting to think that BNs can be created by creating a DAG
where the edges represent direct causal relationships between the variables.
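One way to see the equivalence is to derive the second network's numbers from
the first by Bayes' rule; a small Python sketch (names are mine):

```python
# Causal parameterization: L -> X.
p_l1 = 0.001
p_x1_given_l = {'l1': 0.6, 'l2': 0.02}

# Marginal of X and the reversed conditionals, by Bayes' rule.
p_x1 = p_x1_given_l['l1'] * p_l1 + p_x1_given_l['l2'] * (1 - p_l1)
p_l1_given_x1 = p_x1_given_l['l1'] * p_l1 / p_x1
p_l1_given_x2 = (1 - p_x1_given_l['l1']) * p_l1 / (1 - p_x1)

print(round(p_x1, 5))           # 0.02058
print(round(p_l1_given_x1, 5))  # 0.02915
print(round(p_l1_given_x2, 5))  # 0.00041
```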
Common Causes
Consider the following DAG:

  Smoking -> Bronchitis,  Smoking -> Lung Cancer

Markov condition: I_P(B, L | S), i.e. P(b | l, s) = P(b | s)
If we know the causal relationships S -> B and S -> L, and we know that Joe is
a smoker, then finding out that he has Bronchitis will not give us any more
information about the probability of him having Lung Cancer.
So the Markov condition would be satisfied.
Common Effects
Consider the following DAG:

  Burglary -> Alarm <- Earthquake

Markov condition: I_P(B, E), i.e. P(b | e) = P(b)
We would expect Burglary and Earthquake to be independent of each other, which
is in agreement with the Markov condition.
We would, however, expect them to be conditionally dependent given Alarm: if
the alarm has gone off, news that there had been an earthquake would make a
burglary less likely ("explaining away"). This conditional dependence does not
conflict with the Markov condition.

The Causal Markov Condition
The basic idea is that the Markov condition holds for a causal DAG.
Certain other conditions must be met for the Causal Markov condition to hold:
  there must be no hidden common causes
  there must not be selection bias
  there must be no feedback loops
Even with these provisos there is a lot of controversy as to its validity.
It seems to be false in quantum mechanical systems that have been studied.
Hidden Common Causes

  H -> X,  H -> Y   (H is hidden)

If a DAG is created on the basis of causal relationships between the variables
under consideration, then X and Y would be marginally independent according to
the Markov condition.
But since they have a hidden common cause, H, they will normally be dependent.
Inference in Bayesian Networks
The main point of BNs is to enable probabilistic inference to be performed.
There are two main types of inference to be carried out:
  Belief updating: obtain the posterior probability of one or more variables
  given evidence concerning the values of other variables
  Abductive inference (or belief revision): find the most probable
  configuration of a set of variables (hypothesis) given evidence
Consider the BN discussed earlier (S -> B, S -> L; B, L -> F; L -> X):
What is the probability that someone has bronchitis (B) given that they smoke
(S), have fatigue (F) and have received a positive X-ray (X) result?
Example
Topology of network encodes conditional
independence assertions:

Weather Cavity

Toothache Catch

Weather is independent of other variables


Toothache and Catch are independent given Cavity
Example
I am at work; neighbor John calls to say my alarm is ringing, but neighbor Mary
does not call. Sometimes the alarm is set off by a minor earthquake. Is there a
burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
A Simple Belief Network

  Burglary, Earthquake (causes) -> Alarm -> JohnCalls, MaryCalls (effects)

Intuitive meaning of an arrow: from cause to effect
Directed acyclic graph (DAG)
Nodes are random variables
Assigning Probabilities to Roots
P(B) P(E)
Burglary 0.001 Earthquake 0.002

Alarm

JohnCalls MaryCalls
Conditional Probability Tables

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

Size of the CPT for a node with k parents: ?
Conditional Probability Tables

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

JohnCalls:  A   P(J | A)      MaryCalls:  A   P(M | A)
            T   0.90                      T   0.70
            F   0.05                      F   0.01
What the BN Means

Burglary:  P(B) = 0.001       Earthquake:  P(E) = 0.002

Alarm:   B  E   P(A | B, E)
         T  T   0.95
         T  F   0.94
         F  T   0.29
         F  F   0.001

  P(x1, x2, ..., xn) = Π_i P(xi | Parents(Xi))

JohnCalls:  A   P(J | A)      MaryCalls:  A   P(M | A)
            T   0.90                      T   0.70
            F   0.05                      F   0.01
Calculation of Joint Probability

  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
    = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
    = 0.9 × 0.7 × 0.001 × 0.999 × 0.998
    ≈ 0.00062

(using the CPTs from the previous slide: P(B) = 0.001, P(E) = 0.002,
P(A | ¬B, ¬E) = 0.001, P(J | A) = 0.90, P(M | A) = 0.70)
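Written out as a runnable check (a minimal Python sketch; the dictionary layout
and names are mine):

```python
# CPTs of the burglary network (probability of "true" in each case).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

# P(J, M, A, not B, not E) = P(J|A) P(M|A) P(A | not B, not E) P(not B) P(not E)
joint = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(joint)   # 0.000628..., i.e. roughly the 0.00062 shown on the slide
```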
What The BN Encodes

  Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls

For example, John does not observe any burglaries directly.
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and
Earthquake given Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm.
What The BN Encodes

  Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls

For instance, the reasons why John and Mary may not call when there is an alarm
are unrelated.
Note that these reasons could be other beliefs in the network; the
probabilities summarize these non-explicit beliefs.
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and
Earthquake given Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm.
Structure of BN
The relation

  P(x1, x2, ..., xn) = Π_i P(xi | Parents(Xi))

means that each belief is independent of its predecessors in the BN given its
parents.
E.g., JohnCalls is influenced by Burglary, but not directly: JohnCalls is
directly influenced by Alarm.
Said otherwise, the parents of a belief Xi are all the beliefs that directly
influence Xi.
Usually (but not always) the parents of Xi are its causes and Xi is the effect
of these causes.
Construction of BN
Choose the relevant sentences (random variables) that describe the domain.
Select an ordering X1, ..., Xn so that all the beliefs that directly influence
Xi come before Xi. (The ordering guarantees that the BN will have no cycles.)
For each Xj:
  Add a node in the network labeled by Xj
  Connect the nodes of its parents to Xj
  Define the CPT of Xj
Cond. Independence Relations
1. Each random variable X is conditionally independent of its non-descendants,
   given its parents Pa(X).
   Formally, I(X; NonDesc(X) | Pa(X))
2. Each random variable is conditionally independent of all the other nodes in
   the graph, given its neighbors (its Markov blanket).

[Diagram: ancestor, parents Y1 and Y2, X, descendants and non-descendants]
Inference In BN
Set E of evidence variables that are observed, e.g., {JohnCalls, MaryCalls}
Query variable X, e.g., Burglary, for which we would like to know the posterior
probability distribution P(X | E)

  J M   P(B | J, M)
  T T   ?             (the distribution conditional on the observations made)
Inference Patterns
Basic use of a BN: given new observations, compute the new strengths of some
(or all) beliefs.
Other use: given the strength of a belief, which observation should we gather
to make the greatest impact on this strength?

[Four copies of the burglary network illustrating Diagnostic, Causal,
Intercausal, and Mixed inference.]


Types Of Nodes On A Path

[Car network: Battery, Radio, SparkPlugs, Gas, Starts, Moves. On a path a node
can be linear, diverging, or converging.]
Independence Relations In BN

[Same car network: Battery, Radio, SparkPlugs, Gas, Starts, Moves.]

Given a set E of evidence nodes, two beliefs connected by an undirected path
are independent if one of the following three conditions holds:
1. A node on the path is linear and in E
2. A node on the path is diverging and in E
3. A node on the path is converging and neither this node, nor any descendant,
   is in E
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given evidence on SparkPlugs.
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given evidence on Battery.
Independence Relations In BN

[Same car network and the same three conditions as above.]

Example: Gas and Radio are independent given no evidence, but they are
dependent given evidence on Starts or Moves.
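These three conditions can be checked mechanically for any single path. The
Python sketch below does so for the examples above; the edge set of the car
network (Battery -> Radio, Battery -> SparkPlugs, SparkPlugs -> Starts,
Gas -> Starts, Starts -> Moves) is my reading of the figure, and all helper
names are mine.

```python
# Assumed structure of the car network (my reading of the figure).
parents = {'Battery': [], 'Radio': ['Battery'], 'SparkPlugs': ['Battery'],
           'Gas': [], 'Starts': ['SparkPlugs', 'Gas'], 'Moves': ['Starts']}
children = {n: [c for c, ps in parents.items() if n in ps] for n in parents}

def descendants(node):
    """All descendants of a node (children, grandchildren, ...)."""
    out, stack = set(), list(children[node])
    while stack:
        c = stack.pop()
        if c not in out:
            out.add(c)
            stack.extend(children[c])
    return out

def path_blocked(path, evidence):
    """Apply the three blocking conditions from the slide to one undirected path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        converging = node in children[prev] and node in children[nxt]
        if converging:
            # Condition 3: a converging node blocks unless it or a descendant is in E.
            if node not in evidence and not (descendants(node) & evidence):
                return True
        elif node in evidence:
            # Conditions 1 and 2: a linear or diverging node in E blocks the path.
            return True
    return False

path = ['Radio', 'Battery', 'SparkPlugs', 'Starts', 'Gas']
print(path_blocked(path, set()))            # True:  Gas and Radio independent
print(path_blocked(path, {'SparkPlugs'}))   # True:  independent given SparkPlugs
print(path_blocked(path, {'Battery'}))      # True:  independent given Battery
print(path_blocked(path, {'Starts'}))       # False: dependent given Starts
```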
BN Inference
Simplest case:  A -> B

  P(B) = P(a) P(B | a) + P(¬a) P(B | ¬a) = Σ_A P(A) P(B | A)

Chain:  A -> B -> C

  P(C) = ???
BN Inference
Chain:  X1 -> X2 -> ... -> Xn

What is the time complexity to compute P(Xn)?
What is the time complexity if we computed the full joint?
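For a chain, P(Xn) can be computed by pushing the distribution forward one node
at a time, which is O(n) for binary variables; enumerating the full joint would
cost O(2^n). A minimal sketch with made-up numbers (not from the slides):

```python
# Chain X1 -> X2 -> ... -> Xn of binary variables.
p_x1 = [0.3, 0.7]            # P(X1 = 0), P(X1 = 1)
p_next = [[0.9, 0.1],        # P(X_{i+1} | X_i = 0)
          [0.2, 0.8]]        # P(X_{i+1} | X_i = 1)  (same CPT reused along the chain)

def marginal_xn(n):
    """P(Xn): one small matrix-vector step per node, i.e. O(n) work."""
    p = p_x1
    for _ in range(n - 1):
        p = [sum(p[i] * p_next[i][j] for i in range(2)) for j in range(2)]
    return p

print(marginal_xn(10))   # n - 1 small summations, instead of summing 2^n joint terms
```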


Example of a simple Bayesian network

  A -> C <- B        p(A, B, C) = p(C | A, B) p(A) p(B)

Probability model has a simple factored form
Directed edges => direct dependence
Absence of an edge => conditional independence
Also known as belief networks, graphical models, causal networks
Other formulations, e.g., undirected graphical models


Examples of 3-way Bayesian Networks

Marginal independence (no edges between A, B, C):
  p(A, B, C) = p(A) p(B) p(C)

Conditionally independent effects (A -> B, A -> C):
  p(A, B, C) = p(B | A) p(C | A) p(A)
  B and C are conditionally independent given A
  e.g., A is a disease, and we model B and C as conditionally independent
  symptoms given A

Independent causes (A -> C <- B):
  p(A, B, C) = p(C | A, B) p(A) p(B)
  A and B are (marginally) independent but become dependent once C is known
  Given C, observing A makes B less likely
  e.g., the earthquake/burglary/alarm example

Markov dependence (A -> B -> C):
  p(A, B, C) = p(C | B) p(B | A) p(A)
Inference Ex. 2

  Cloudy -> Sprinkler, Cloudy -> Rain; Sprinkler, Rain -> WetGrass

The algorithm computes not individual probabilities, but entire tables.
Two ideas are crucial to avoiding exponential blowup:
  Because of the structure of the BN, some subexpressions in the joint depend
  only on a small number of variables.
  By computing them once and caching the result, we can avoid generating them
  exponentially many times.

  P(W) = Σ_{R,S,C} P(w | r, s) P(r | c) P(s | c) P(c)
       = Σ_{R,S} P(w | r, s) Σ_C P(r | c) P(s | c) P(c)
       = Σ_{R,S} P(w | r, s) f_C(r, s)
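A runnable version of this caching idea for the query P(W = true). The CPT
numbers are the usual textbook values for this network, assumed here because
the slides do not list them; the factor name f_C follows the slide.

```python
from itertools import product

# Assumed CPTs for the Cloudy / Sprinkler / Rain / WetGrass network.
P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

# Step 1: sum out C once and cache the result as the factor f_C(R, S).
f_C = {}
for r, s in product([True, False], repeat=2):
    f_C[(r, s)] = sum(P_C[c]
                      * (P_R[c] if r else 1 - P_R[c])
                      * (P_S[c] if s else 1 - P_S[c])
                      for c in (True, False))

# Step 2: P(W = true) = sum over R, S of P(w | r, s) * f_C(r, s).
p_w = sum(P_W[(s, r)] * f_C[(r, s)] for r, s in product([True, False], repeat=2))
print(round(p_w, 4))   # ~0.647 with these CPTs
```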
Approaches to inference
Exact inference
Inference in Simple Chains
Variable elimination
Clustering / join tree algorithms
Approximate inference
Stochastic simulation / sampling methods
Markov chain Monte Carlo methods
Stochastic simulation - direct
Suppose you are given values for some subset of the
variables, G, and want to infer values for unknown
variables, U
Randomly generate a very large number of
instantiations from the BN
Generate instantiations for all variables, starting at the roots

Rejection Sampling: keep those instantiations that are


consistent with the values for G
Use the frequency of values for U to get estimated
probabilities
Accuracy of the results depends on the size of the
sample (asymptotically approaches exact results)
Direct Stochastic Simulation

  Cloudy -> Sprinkler, Rain; Sprinkler, Rain -> WetGrass

P(WetGrass | Cloudy)?
P(WetGrass | Cloudy) = P(WetGrass ∧ Cloudy) / P(Cloudy)

1. Repeat N times:
   1.1. Guess Cloudy at random
   1.2. For each guess of Cloudy, guess Sprinkler and Rain, then WetGrass
2. Compute the ratio of the # of runs where WetGrass and Cloudy are true over
   the # of runs where Cloudy is true
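A minimal sketch of this procedure in Python. The CPTs are the usual textbook
values (an assumption, since the slide does not give them), and the function
names are mine:

```python
import random

# Assumed CPTs: Cloudy -> {Sprinkler, Rain} -> WetGrass.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

def sample_once():
    """Sample every variable in topological order from P(Xi | parents(Xi))."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

def estimate_p_wet_given_cloudy(n=100_000):
    wet_and_cloudy = cloudy = 0
    for _ in range(n):
        c, s, r, w = sample_once()
        if c:                          # keep only runs consistent with Cloudy = true
            cloudy += 1
            wet_and_cloudy += w
    return wet_and_cloudy / cloudy

print(estimate_p_wet_given_cloudy())   # ~0.745 with these CPTs
```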
Direct Sampling
Suppose we have no evidence, but we
want to determine P(Cloudy, Sprinkler,
Rain, WetGrass) for all Cloudy,
Sprinkler, Rain, WetGrass.

Direct sampling:
Sample each variable in topological order,
conditioned on values of parents.
I.e., always sample from P(Xi |
parents(Xi))
Example
1. Sample from P(Cloudy). Suppose returns true.

2. Sample from P(Sprinkler | Cloudy = true). Suppose


returns false.

3. Sample from P(Rain | Cloudy = true). Suppose


returns true.

4. Sample from P(WetGrass | Sprinkler = false, Rain =


true). Suppose returns true.

Here is the sampled event: [true, false, true, true]


Suppose there are N total samples, and let NS (x1, ...,
xn) be the observed frequency of the specific event
x1, ..., xn.
  lim_{N -> ∞} N_S(x1, ..., xn) / N = P(x1, ..., xn)

so we estimate  P(x1, ..., xn) ≈ N_S(x1, ..., xn) / N
Suppose N samples, n nodes. Complexity O(Nn).

Problem 1: Need lots of samples to get good


probability estimates.
Problem 2: Many samples are not realistic; low
likelihood.
Likelihood weighting
Avoid generating samples that are going to be rejected in the first place!
Sample only from the unknown variables Z
Weight each sample according to the likelihood that it would occur, given the
evidence E
Markov chain Monte Carlo algorithm
So called because
  Markov chain: each instance generated in the sample is dependent on the
  previous instance
  Monte Carlo: statistical sampling method
Perform a random walk through variable assignment
space, collecting statistics as you go
Start with a random instantiation, consistent with evidence
variables
At each step, for some nonevidence variable, randomly
sample its value, consistent with the other current
assignments
Given enough samples, MCMC gives an accurate
estimate of the true distribution of values
Markov Chain Monte Carlo Sampling Algorithm
Start with a random sample of the variables (x1, ..., xn); this is the current
state.
Next state: randomly sample a value for one non-evidence variable Xi,
conditioned on the current values of the variables in the Markov blanket of Xi.
Markov Chain Monte Carlo Sampling
One of the most common methods used in real applications.
Markov blanket of Xi: the parents of Xi, its children, and its children's other
parents.
Recall that, by construction of a Bayesian network, a node is conditionally
independent of its non-descendants, given its parents.
Proposition: a node Xi is conditionally independent of all other nodes in the
network, given its Markov blanket.
Example
Query: What is P(Rain | Sprinkler =
true, WetGrass = true)?
MCMC:
Random sample, with evidence variables fixed:
[true, true, false, true]

Repeat:
1. Sample Cloudy, given current values of its Markov blanket:
Sprinkler = true, Rain = false. Suppose result is false. New
state:
[false, true, false, true]

2. Sample Rain, given current values of its Markov blanket:


Cloudy = false, Sprinkler = true, WetGrass = true. Suppose
result is true. New state: [false, true, true, true].
Each sample contributes to estimate for query
P(Rain | Sprinkler = true, WetGrass = true)

Suppose we perform 100 such samples, 20 with Rain = true and


80 with Rain = false.

Then the answer to the query is
Normalize(⟨20, 80⟩) = ⟨0.20, 0.80⟩

Why it works: the sampling process settles into a dynamic equilibrium in which
the long-run fraction of time spent in each state is exactly proportional to
its posterior probability, given the evidence.
That is: for all variables Xi, the probability of the value xi of Xi appearing
in a sample is equal to P(xi | e).
Claim: MCMC settles into behavior in which each state is sampled exactly
according to its posterior probability, given the evidence.
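A compact sketch of this MCMC (Gibbs sampling) procedure for the query
P(Rain | Sprinkler = true, WetGrass = true). The CPT numbers are the usual
textbook values (an assumption); burn-in is omitted for brevity and all names
are mine.

```python
import random

# Assumed CPTs for the Cloudy / Sprinkler / Rain / WetGrass network.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                        # P(S = true | C)
P_R = {True: 0.8, False: 0.2}                        # P(R = true | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}     # P(W = true | S, R)

def bern(p):
    return random.random() < p

def resample(var, state):
    """Sample one non-evidence variable given its Markov blanket by normalizing
    the product of the CPT entries that mention it."""
    weights = {}
    for v in (True, False):
        if var == 'C':    # C appears in P(C), P(S | C), P(R | C)
            w = P_C if v else 1 - P_C
            w *= P_S[v] if state['S'] else 1 - P_S[v]
            w *= P_R[v] if state['R'] else 1 - P_R[v]
        else:             # var == 'R': R appears in P(R | C), P(W | S, R)
            w = P_R[state['C']] if v else 1 - P_R[state['C']]
            pw = P_W[(state['S'], v)]
            w *= pw if state['W'] else 1 - pw
        weights[v] = w
    return bern(weights[True] / (weights[True] + weights[False]))

def gibbs(n=50_000):
    # Evidence S and W are fixed to true; C and R start at random values.
    state = {'C': bern(0.5), 'S': True, 'R': bern(0.5), 'W': True}
    rain_true = 0
    for _ in range(n):
        state['C'] = resample('C', state)
        state['R'] = resample('R', state)
        rain_true += state['R']
    return rain_true / n

print(gibbs())   # ~0.32 with these CPTs
```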
Naive Bayes Model

  C -> Y1, Y2, Y3, ..., Yn

  P(C | Y1, ..., Yn) ∝ P(C) Π_i P(Yi | C)

Features Yi are conditionally independent given the class variable C
Widely used in machine learning
Conditional probabilities P(Yi | C) can easily be estimated from labeled data
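A minimal naive-Bayes sketch in Python; the class priors and per-feature
likelihoods are illustrative numbers, not from the slides:

```python
# P(C) and P(Yi = true | C) for two classes and three binary features.
prior = {'spam': 0.4, 'ham': 0.6}
likelihood = {'spam': [0.8, 0.6, 0.3],    # P(Yi = true | C = spam)
              'ham':  [0.1, 0.4, 0.5]}    # P(Yi = true | C = ham)

def classify(features):
    """P(C | Y1..Yn) proportional to P(C) * prod_i P(Yi | C), then normalized."""
    scores = {}
    for c in prior:
        p = prior[c]
        for p_true, y in zip(likelihood[c], features):
            p *= p_true if y else 1 - p_true
        scores[c] = p
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(classify([True, True, False]))   # posterior over the two classes
```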


Hidden Markov Model (HMM)

  Observed:  Y1   Y2   Y3   ...   Yn
  Hidden:    S1 -> S2 -> S3 -> ... -> Sn   (each St emits Yt)

Two key assumptions:
1. the hidden state sequence is Markov
2. observation Yt is conditionally independent of all other variables given St
Widely used in speech recognition and protein sequence models
Since this is a Bayesian network polytree, inference is linear in n


Summary
Bayesian networks represent a joint distribution using a graph

The graph encodes a set of conditional independence


assumptions

Answering queries (or inference or reasoning) in a Bayesian


network amounts to efficient computation of appropriate
conditional probabilities

Probabilistic inference is intractable in the general case


But can be carried out in linear time for certain classes of Bayesian
networks
Learning Bayesian Networks

Learning Bayesian networks

  Data + Prior information --> Inducer --> a network over E, B, R, A, C
                                           with CPTs, e.g. P(A | E, B):
     e  b    .9    .1
     e ¬b    .7    .3
    ¬e  b    .8    .2
    ¬e ¬b    .99   .01
Known Structure -- Complete Data

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Input: network E, B -> A with unknown CPT entries (all "?")
Inducer output: the same structure with an estimated CPT P(A | E, B):
     e  b    .9    .1
     e ¬b    .7    .3
    ¬e  b    .8    .2
    ¬e ¬b    .99   .01

Network structure is specified; the inducer needs to estimate parameters.
Data does not contain missing values.
Unknown Structure -- Complete Data

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Input: variables E, B, A; neither the arcs nor the CPT entries are known
Inducer output: a selected structure (e.g. E, B -> A) with an estimated CPT
P(A | E, B)

Network structure is not specified; the inducer needs to select arcs and
estimate parameters.
Data does not contain missing values.
Known Structure -- Incomplete Data

Data (E, B, A): <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>
Input: network E, B -> A with unknown CPT entries
Inducer output: the same structure with an estimated CPT P(A | E, B)

Network structure is specified.
Data contains missing values; we consider assignments to missing values.
Known Structure / Complete Data
Given a network structure G
  and a choice of parametric family for P(Xi | Pai)
Learn parameters for the network

Goal: find the parameter values that maximize the probability of having
generated the observed data


Learning Parameters for a Bayesian Network

  Network: E, B -> A; A -> C

Training data has the form:
  E[1]  B[1]  A[1]  C[1]
  ...
  E[M]  B[M]  A[M]  C[M]
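With complete data and a fixed structure, the maximum-likelihood estimate of
each CPT entry is just a conditional frequency in the training set. A small
sketch on a tiny made-up dataset (names and data are mine):

```python
from collections import Counter

# Complete training cases for a network in which E and B are parents of A.
data = [          # (E, B, A)
    ('Y', 'N', 'N'), ('Y', 'Y', 'Y'), ('N', 'N', 'Y'),
    ('N', 'Y', 'Y'), ('N', 'Y', 'Y'), ('N', 'N', 'N'),
]

# MLE of P(A = Y | E, B) = count(E, B, A = Y) / count(E, B).
joint = Counter((e, b, a) for e, b, a in data)
parent_counts = Counter((e, b) for e, b, _ in data)

cpt = {eb: joint[(eb[0], eb[1], 'Y')] / n for eb, n in parent_counts.items()}
print(cpt)   # e.g. P(A = Y | E = 'N', B = 'N') = 0.5 on this toy data
```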
Unknown Structure -- Complete Data (as before)

Data (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
Network structure is not specified; the inducer needs to select arcs and
estimate parameters.
Data does not contain missing values.

Benefits of Learning Structure


Discover structural properties of the domain
Ordering of events
Relevance
Identifying independencies --> faster inference
Predict effect of actions
  Involves learning causal relationships among variables
Why Struggle for Accurate Structure?

  Earthquake -> Alarm Set <- Burglary;  Alarm Set -> Sound

Adding an arc:
  Increases the number of parameters to be fitted
  Wrong assumptions about causality and domain structure
Missing an arc:
  Cannot be compensated for by accurate fitting of parameters
  Also misses causality and domain structure
Approaches to Learning Structure
Constraint based
Perform tests of conditional independence
Search for a network that is consistent with the
observed dependencies and independencies

Pros & Cons


Intuitive, follows closely the construction of BNs
Separates structure learning from the form of the
independence tests
Sensitive to errors in individual tests
Approaches to Learning Structure
Score based
Define a score that evaluates how well the
(in)dependencies in a structure match the observations
Search for a structure that maximizes the score
Pros & Cons
Statistically motivated
Can make compromises
Takes the structure of conditional probabilities into
account
Computationally hard

Heuristic Search
Define a search space:
nodes are possible structures
edges denote adjacency of structures
Traverse this space looking for high-scoring
structures
Search techniques:
Greedy hill-climbing
Best first search
Simulated Annealing
...

Heuristic Search (cont.)

Typical operations on a candidate structure (over nodes S, C, E, D): add an
edge, delete an edge, or reverse an edge.
Exploiting Decomposability in Local Search

[Figure: neighboring structures over S, C, E, D that differ by a single edge
change.]

Caching: to update the score after a local change, we only need to re-score
the families that were changed in the last move.

Greedy Hill-Climbing
Simplest heuristic local search
Start with a given network
empty network
best tree
a random network
At each iteration
Evaluate all possible changes
Apply change that leads to best improvement in score
Reiterate
Stop when no modification improves score
Each step requires evaluating approximately
n new changes
Greedy Hill-Climbing: Possible Pitfalls
Greedy Hill-Climbing can get stuck in:
Local Maxima:
All one-edge changes reduce the score
Plateaus:
Some one-edge changes leave the score unchanged
Happens because equivalent networks received the same
score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both
Random restarts
TABU search

Summary
Belief update
Role of conditional independence
Belief networks
Causality ordering
Inference in BN
Stochastic Simulation
Learning BNs

A Bayesian Network
37 variables, 509 parameters (instead of 2^37)

MINVOLSET

PULMEMBOLUS INTUBATION KINKEDTUBE VENTMACH DISCONNECT

PAP SHUNT VENTLUNG VENTTUBE

PRESS
MINOVL FIO2 VENTALV

ANAPHYLAXIS PVSAT ARTCO2

TPR SAO2 INSUFFANESTH EXPCO2

HYPOVOLEMIA LVFAILURE CATECHOL

LVEDVOLUME STROKEVOLUME HISTORY ERRLOWOUTPUT HR ERRCAUTER

CVP PCWP CO HREKG HRSAT


HRBP

BP

Population-Wide Approach
Anthrax Release Global nodes

Location of Release Time of Release Interface nodes

Person Model   Person Model   Person Model   (one for each person in the population)

Note the conditional independence assumptions


Anthrax is infectious but non-contagious

Population-Wide Approach
Anthrax Release Global nodes

Location of Release Time of Release Interface nodes

Person Model   Person Model   Person Model   (one for each person in the population)

Structure designed by expert judgment


Parameters obtained from census data, training data, and expert
assessments informed by literature and experience
Person Model (Initial Prototype)

[Per-person network, conditioned on Anthrax Release, Time of Release and
Location of Release: Age Decile, Gender, Home Zip, Anthrax Infection, Other ED
Disease, Respiratory from Anthrax, Respiratory CC From Other, Respiratory CC,
ED Admit from Anthrax, ED Admit from Other, When Admitted, ED Admission. The
figure shows two copies of this model, one per person.]
Person Model (Initial Prototype)

[The same per-person network, now with example values filled in, e.g. Gender =
Female / Male, Age Decile = 20-30 / 50-60, Home Zip = 15213 / 15146,
Respiratory CC = Unknown / False, ED Admit from Anthrax = False, ED Admission =
Yesterday / never.]
