
Probabilistic Graphical Models

David Sontag

New York University

Lecture 1, January 31, 2013



One of the most exciting advances in machine learning (AI, signal processing, coding, control, . . .) in recent decades



How can we gain global insight based on local observations?



Key idea

1. Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)
2. Learn the distribution from data
3. Perform "inference" (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))



Reasoning under uncertainty

As humans, we are continuously making predictions under uncertainty


Classical AI and ML research ignored this phenomenon
Many of the most recent advances in technology are possible because of this new, probabilistic approach



Applications: Deep question answering



Applications: Machine translation



Applications: Speech recognition



Applications: Stereo vision

Input: two images. Output: disparity.



Key challenges

1. Represent the world as a collection of random variables X1, . . . , Xn with joint distribution p(X1, . . . , Xn)
   How does one compactly describe this joint distribution?
     Directed graphical models (Bayesian networks)
     Undirected graphical models (Markov random fields, factor graphs)
2. Learn the distribution from data
   Maximum likelihood estimation. Other estimation methods?
   How much data do we need?
   How much computation does it take?
3. Perform "inference" (compute conditional distributions p(Xi | X1 = x1, . . . , Xm = xm))



Syllabus overview

We will study Representation, Inference & Learning
  First in the simplest case:
    Only discrete variables
    Fully observed models
    Exact inference & learning
  Then generalize:
    Continuous variables
    Partially observed data during learning (hidden variables)
    Approximate inference & learning
Learn about algorithms, theory & applications



Logistics: class

Class webpage:
http://cs.nyu.edu/~dsontag/courses/pgm13/
Sign up for mailing list!
Draft slides posted before each lecture
Book: Probabilistic Graphical Models: Principles and Techniques by
Daphne Koller and Nir Friedman, MIT Press (2009)
Required readings for each lecture posted to course website.
Many additional reference materials available!
Office hours: Wednesday 5-6pm and by appointment. 715
Broadway, 12th floor, Room 1204
Teaching Assistant: Li Wan (wanli@cs.nyu.edu)
Li’s Office hours: Monday 5-6pm. 715 Broadway, Room 1231



Logistics: prerequisites & grading

Prerequisites:
Previous class on machine learning
Basic concepts from probability and statistics
Algorithms (e.g., dynamic programming, graphs, complexity)
Calculus
Grading: problem sets (65%) + in-class final exam (30%) + participation (5%)
Class attendance is required.
7-8 assignments (every 1–2 weeks). Both theory and programming.
First homework out today, due next Thursday (Feb. 7) at 5pm
Important: See collaboration policy on class webpage
Solutions to the theoretical questions require formal proofs.
For the programming assignments, I recommend Python, Java, or
Matlab. Do not use C++.



Review of probability: outcomes
Reference: Chapter 2 and Appendix A

An outcome space Ω specifies the possible outcomes that we would like to reason about, e.g.

  Ω = { heads, tails }            Coin toss
  Ω = { 1, 2, 3, 4, 5, 6 }        Die toss

We specify a probability p(ω) for each outcome ω such that

  p(ω) ≥ 0,     Σ_{ω∈Ω} p(ω) = 1

E.g., for a biased coin, p(heads) = 0.6, p(tails) = 0.4



Review of probability: events

An event is a subset of the outcome space, e.g.

  E = { 2, 4, 6 }      Even die tosses
  O = { 1, 3, 5 }      Odd die tosses

The probability of an event is given by the sum of the probabilities of the outcomes it contains,

  p(E) = Σ_{ω∈E} p(ω)

E.g., p(E) = p(2) + p(4) + p(6) = 1/2, if fair die
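As an aside (my own sketch, not from the slides), these definitions translate directly into a few lines of Python for a fair die:

```python
# Outcome space for a fair die: each outcome gets probability 1/6.
p = {omega: 1/6 for omega in range(1, 7)}
assert abs(sum(p.values()) - 1.0) < 1e-12   # probabilities sum to 1

def prob(event):
    """Probability of an event (a subset of the outcome space)."""
    return sum(p[omega] for omega in event)

E = {2, 4, 6}   # even die tosses
O = {1, 3, 5}   # odd die tosses
print(prob(E))  # 0.5
print(prob(O))  # 0.5
```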



Independence of events

Two events A and B are independent if

  p(A ∩ B) = p(A)p(B)

Are these two events independent?

  [Figure: two disjoint events A and B, each containing a single die outcome]

  No! p(A ∩ B) = 0, but p(A)p(B) = (1/6)² ≠ 0

Now suppose our outcome space has two different dice:

  Ω = { (1,1), (1,2), . . . , (6,6) }      2 die tosses, 6² = 36 outcomes

and the probability distribution is such that each die is (defined to be) independent, i.e.

  p(ω1, ω2) = p(ω1) p(ω2)   for every pair of faces (ω1, ω2)
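Continuing the sketch (mine, not from the slides), both cases can be checked numerically on the two-dice outcome space:

```python
from itertools import product

# Two fair dice, defined to be independent: p(w1, w2) = p(w1) * p(w2).
p = {(w1, w2): (1/6) * (1/6) for w1, w2 in product(range(1, 7), repeat=2)}

def prob(event):
    return sum(p[w] for w in event)

# A = "first die shows 2", B = "second die shows 5": independent by construction.
A = {w for w in p if w[0] == 2}
B = {w for w in p if w[1] == 5}
print(prob(A & B), prob(A) * prob(B))   # both 1/36

# C = "first die is even", D = "first die is odd": disjoint, hence NOT independent.
C = {w for w in p if w[0] % 2 == 0}
D = {w for w in p if w[0] % 2 == 1}
print(prob(C & D), prob(C) * prob(D))   # 0.0 vs 0.25
```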



Independence of events

Two events A and B are independent if

  p(A ∩ B) = p(A)p(B)

Are these two events independent?

  [Figure: in the two-dice outcome space, A = "first die shows a given face" and B = "second die shows a given face"]

  Yes! By construction, p(A ∩ B) = p(face1) p(face2) = (1/6)², and p(A) = p(face1), p(B) = p(face2), so p(A)p(B) = (1/6)² as well
Conditional probability

A simple relation between joint and conditional probabilities
  In fact, this is taken as the definition of a conditional probability

Let A, B be events, p(B) > 0.

  p(A | B) = p(A ∩ B) / p(B)

Example joint distribution p(T, W):

  T      W      P
  hot    sun    0.4
  hot    rain   0.1
  cold   sun    0.2
  cold   rain   0.3

Claim 1: Σ_{ω∈S} p(ω | S) = 1
Claim 2: If A and B are independent, then p(A | B) = p(A)
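To make the definition concrete, here is a small Python sketch (mine, not from the slides) using the joint table above:

```python
# Joint distribution p(T, W) from the table above.
p = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def prob(event):
    """p(event) for an event given as a set of outcomes (t, w)."""
    return sum(p[o] for o in event)

A = {o for o in p if o[0] == "hot"}   # event: T = hot
B = {o for o in p if o[1] == "sun"}   # event: W = sun

# Definition of conditional probability: p(A | B) = p(A ∩ B) / p(B)
print(prob(A & B) / prob(B))          # 0.4 / 0.6 ≈ 0.667
```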



Two important rules

1. Chain rule   Let S1, . . . , Sn be events, p(Si) > 0.

   p(S1 ∩ S2 ∩ · · · ∩ Sn) = p(S1) p(S2 | S1) · · · p(Sn | S1, . . . , Sn−1)

2. Bayes' rule   Let S1, S2 be events, p(S1) > 0 and p(S2) > 0.

   p(S1 | S2) = p(S1 ∩ S2) / p(S2) = p(S2 | S1) p(S1) / p(S2)



Discrete random variables

Often each outcome corresponds to a setting of various attributes
(e.g., "age", "gender", "hasPneumonia", "hasDiabetes")

A random variable X is a mapping X : Ω → D
  D is some set (e.g., the integers)
  Induces a partition of all outcomes Ω

For some x ∈ D, we say

  p(X = x) = p({ω ∈ Ω : X(ω) = x})

  "probability that variable X assumes state x"

Notation: Val(X) = set D of all values assumed by X
(will interchangeably call these the "values" or "states" of variable X)

p(X) is a distribution: Σ_{x∈Val(X)} p(X = x) = 1



Multivariate distributions

Instead of one random variable, have random vector

X(ω) = [X1 (ω), . . . , Xn (ω)]

Xi = xi is an event. The joint distribution

p(X1 = x1 , . . . , Xn = xn )

is simply defined as p(X1 = x1 ∩ · · · ∩ Xn = xn )


We will often write p(x1 , . . . , xn ) instead of p(X1 = x1 , . . . , Xn = xn )
Conditioning, chain rule, Bayes’ rule, etc. all apply



Working with random variables

For example, the conditional distribution

  p(X1 | X2 = x2) = p(X1, X2 = x2) / p(X2 = x2).

This notation means

  p(X1 = x1 | X2 = x2) = p(X1 = x1, X2 = x2) / p(X2 = x2)    ∀ x1 ∈ Val(X1)
Two random variables are independent, X1 ⊥ X2 , if

p(X1 = x1 , X2 = x2 ) = p(X1 = x1 )p(X2 = x2 )

for all values x1 ∈ Val(X1 ) and x2 ∈ Val(X2 ).



Example
Consider three binary-valued random variables

  X1, X2, X3        Val(Xi) = {0, 1}

Let outcome space Ω be the cross-product of their states:

  Ω = Val(X1) × Val(X2) × Val(X3)

Xi(ω) is the value for Xi in the assignment ω ∈ Ω
Specify p(ω) for each outcome ω ∈ Ω by a big table:

  x1  x2  x3    p(x1, x2, x3)
  0   0   0     .11
  0   0   1     .02
  ...
  1   1   1     .05

How many parameters do we need to specify? 2³ − 1
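As a sketch (my own illustration), the "big table" can be stored as a dictionary keyed by assignments; with 2³ outcomes, only 2³ − 1 = 7 numbers are free, since the entries must sum to 1. The three values shown on the slide are kept; the remaining entries below are made up so the table normalizes.

```python
from itertools import product

# Joint distribution over three binary variables as a table keyed by (x1, x2, x3).
# Only the first, second, and last values come from the slide; the rest are made up.
joint = {
    (0, 0, 0): 0.11, (0, 0, 1): 0.02, (0, 1, 0): 0.15, (0, 1, 1): 0.12,
    (1, 0, 0): 0.20, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.05,
}
assert set(joint) == set(product([0, 1], repeat=3))
assert abs(sum(joint.values()) - 1.0) < 1e-12

# 2^3 entries, but the sum-to-one constraint leaves 2^3 - 1 = 7 free parameters.
print(len(joint) - 1)   # 7
```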
Marginalization

Suppose X and Y are random variables with distribution p(X, Y)
  X : Intelligence, Val(X) = {"Very High", "High"}
  Y : Grade, Val(Y) = {"a", "b"}

Joint distribution specified by:

             X = vh    X = h
  Y = a      0.7       0.15
  Y = b      0.1       0.05

p(Y = a) = ?   It is 0.7 + 0.15 = 0.85

More generally, suppose we have a joint distribution p(X1, . . . , Xn). Then,

  p(Xi = xi) = Σ_{x1} Σ_{x2} · · · Σ_{xi−1} Σ_{xi+1} · · · Σ_{xn} p(x1, . . . , xn)
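A tiny sketch (mine, not from the slides) of marginalizing the table above in Python:

```python
# Joint distribution p(X, Y) from the table above (X: Intelligence, Y: Grade).
p = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

# Marginalize out X: p(Y = y) = sum over x of p(X = x, Y = y).
p_Y = {}
for (x, y), value in p.items():
    p_Y[y] = p_Y.get(y, 0.0) + value

print(p_Y["a"])   # 0.85
print(p_Y["b"])   # 0.15 (up to floating-point rounding)
```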



Conditioning

Suppose X and Y are random variables with distribution p(X, Y)
  X : Intelligence, Val(X) = {"Very High", "High"}
  Y : Grade, Val(Y) = {"a", "b"}

             X = vh    X = h
  Y = a      0.7       0.15
  Y = b      0.1       0.05

Can compute the conditional probability

  p(Y = a | X = vh) = p(Y = a, X = vh) / p(X = vh)
                    = p(Y = a, X = vh) / (p(Y = a, X = vh) + p(Y = b, X = vh))
                    = 0.7 / (0.7 + 0.1) = 0.875.
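And conditioning on X = vh, using the same table (again my own sketch):

```python
# Same joint p(X, Y) as above.
p = {
    ("vh", "a"): 0.7, ("h", "a"): 0.15,
    ("vh", "b"): 0.1, ("h", "b"): 0.05,
}

# p(Y = a | X = vh) = p(X = vh, Y = a) / p(X = vh),
# where p(X = vh) is obtained by marginalizing over Y.
p_x_vh = sum(value for (x, y), value in p.items() if x == "vh")
print(p[("vh", "a")] / p_x_vh)   # 0.875
```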
Example: Medical diagnosis

Variable for each symptom (e.g. "fever", "cough", "fast breathing", "shaking", "nausea", "vomiting")
Variable for each disease (e.g. "pneumonia", "flu", "common cold", "bronchitis", "tuberculosis")
Diagnosis is performed by inference in the model:

  p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings



Representing the distribution

Naively, could represent multivariate distributions with a table of probabilities for each outcome (assignment)
How many outcomes are there in QMR-DT? 2⁴⁶⁰⁰
Estimation of the joint distribution would require a huge amount of data
Inference of conditional probabilities, e.g.

  p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)

would require summing over exponentially many variables' values
Moreover, this defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations
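As an aside (mine), Python's arbitrary-precision integers make it easy to see just how large this outcome count is:

```python
# Number of joint assignments to 600 disease + 4000 finding binary variables.
num_outcomes = 2 ** 4600
print(len(str(num_outcomes)))   # 1385 -- the count has about 1385 decimal digits
```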



Structure through independence
If X1, . . . , Xn are independent, then

  p(x1, . . . , xn) = p(x1) p(x2) · · · p(xn)

2ⁿ entries can be described by just n numbers (if |Val(Xi)| = 2)!
However, this is not a very useful model – observing a variable Xi cannot influence our predictions of Xj

If X1, . . . , Xn are conditionally independent given Y, denoted as Xi ⊥ X−i | Y, then

  p(y, x1, . . . , xn) = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | x1, . . . , xi−1, y)
                      = p(y) p(x1 | y) ∏_{i=2}^{n} p(xi | y).

This is a simple, yet powerful, model
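As a quick check of the parameter savings (my own aside; counting one free parameter per binary distribution is a convention I am assuming), compare the full joint table with the conditionally independent model:

```python
n = 10  # number of binary variables x1..xn, plus a binary y

# Full joint p(y, x1, ..., xn): a table with 2^(n+1) entries, minus 1 for normalization.
full_joint = 2 ** (n + 1) - 1

# Conditionally independent model p(y) * prod_i p(xi | y):
#   1 free parameter for p(y), and 2 per variable for p(xi | y = 0) and p(xi | y = 1).
cond_indep = 1 + 2 * n

print(full_joint, cond_indep)   # 2047 vs 21
```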


Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
Let 1 : n index the words in our vocabulary (e.g., English)
Xi = 1 if word i appears in an e-mail, and 0 otherwise
E-mails are drawn according to some distribution p(Y , X1 , . . . , Xn )
Suppose that the words are conditionally independent given Y . Then,
  p(y, x1, . . . , xn) = p(y) ∏_{i=1}^{n} p(xi | y)

Estimate the model with maximum likelihood. Predict with:

  p(Y = 1 | x1, . . . , xn) = p(Y = 1) ∏_{i=1}^{n} p(xi | Y = 1)  /  [ Σ_{y∈{0,1}} p(Y = y) ∏_{i=1}^{n} p(xi | Y = y) ]

Are the independence assumptions made here reasonable?


Philosophy: Nearly all probabilistic models are “wrong”, but many are
nonetheless useful
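A minimal end-to-end sketch (my own, with a made-up toy dataset) of maximum likelihood estimation and prediction for this naive Bayes model:

```python
import numpy as np

# Toy data (made up): rows are e-mails, columns are word-presence features; y = 1 means spam.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Maximum likelihood estimates.
p_y1 = y.mean()                      # p(Y = 1)
theta1 = X[y == 1].mean(axis=0)      # p(X_i = 1 | Y = 1) for each word i
theta0 = X[y == 0].mean(axis=0)      # p(X_i = 1 | Y = 0)

def predict_spam_prob(x):
    """p(Y = 1 | x) via the naive Bayes factorization."""
    lik1 = p_y1 * np.prod(np.where(x == 1, theta1, 1 - theta1))
    lik0 = (1 - p_y1) * np.prod(np.where(x == 1, theta0, 1 - theta0))
    return lik1 / (lik1 + lik0)

# 1.0 on this toy data: word 0 never appears when Y = 0, so the MLE is pushed to an
# extreme -- one reason smoothing is used in practice.
print(predict_spam_prob(np.array([1, 1, 0])))
```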
Bayesian networks
Reference: Chapter 3

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
  1. One node i ∈ V for each random variable Xi
  2. One conditional probability distribution (CPD) per node, p(xi | x_Pa(i)), specifying the variable's probability conditioned on its parents' values

Corresponds 1-1 with a particular factorization of the joint distribution:

  p(x1, . . . , xn) = ∏_{i∈V} p(xi | x_Pa(i))

Powerful framework for designing algorithms to perform probability computations



Example
Consider the following Bayesian network (the "student" network: Difficulty and Intelligence are the parents of Grade, Intelligence is the parent of SAT, and Grade is the parent of Letter):

  p(d):   d⁰ 0.6,  d¹ 0.4
  p(i):   i⁰ 0.7,  i¹ 0.3

  p(g | i, d):        g¹      g²      g³
      i⁰, d⁰          0.3     0.4     0.3
      i⁰, d¹          0.05    0.25    0.7
      i¹, d⁰          0.9     0.08    0.02
      i¹, d¹          0.5     0.3     0.2

  p(s | i):           s⁰      s¹
      i⁰              0.95    0.05
      i¹              0.2     0.8

  p(l | g):           l⁰      l¹
      g¹              0.1     0.9
      g²              0.4     0.6
      g³              0.99    0.01

What is its joint distribution?

  p(x1, . . . , xn) = ∏_{i∈V} p(xi | x_Pa(i))
  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
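As a sketch (my own; the slides give no code), the factorization turns the CPD tables above into a joint probability by simple lookups and multiplication. Values are encoded as integers: d, i, s, l ∈ {0, 1} and g ∈ {1, 2, 3}.

```python
# CPDs of the student network, keyed by (child value, parent value(s)).
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g_given_id = {  # (g, i, d) -> p(g | i, d)
    (1, 0, 0): 0.3,  (2, 0, 0): 0.4,  (3, 0, 0): 0.3,
    (1, 0, 1): 0.05, (2, 0, 1): 0.25, (3, 0, 1): 0.7,
    (1, 1, 0): 0.9,  (2, 1, 0): 0.08, (3, 1, 0): 0.02,
    (1, 1, 1): 0.5,  (2, 1, 1): 0.3,  (3, 1, 1): 0.2,
}
p_s_given_i = {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.2, (1, 1): 0.8}   # (s, i)
p_l_given_g = {(0, 1): 0.1, (1, 1): 0.9,
               (0, 2): 0.4, (1, 2): 0.6,
               (0, 3): 0.99, (1, 3): 0.01}                             # (l, g)

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return p_d[d] * p_i[i] * p_g_given_id[(g, i, d)] * p_s_given_i[(s, i)] * p_l_given_g[(l, g)]

# e.g. easy class, high intelligence, grade g1, high SAT, strong letter:
print(joint(d=0, i=1, g=1, s=1, l=1))   # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 = 0.11664
```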
More examples

  p(x1, . . . , xn) = ∏_{i∈V} p(xi | x_Pa(i))

Will my car start this morning?

Heckerman et al., Decision-Theoretic Troubleshooting, 1995


More examples
  p(x1, . . . , xn) = ∏_{i∈V} p(xi | x_Pa(i))

What is the differential diagnosis?

Beinlich et al., The ALARM Monitoring System, 1989


Bayesian networks are generative models

naive Bayes

  [Figure: the naive Bayes network — label Y at the top, with feature children X1, X2, X3, . . . , Xn]

Evidence is denoted by shading in a node

Can interpret a Bayesian network as a generative process. For example, to generate an e-mail, we
  1. Decide whether it is spam or not spam, by sampling y ∼ p(Y)
  2. For each word i = 1 to n, sample xi ∼ p(Xi | Y = y)
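A minimal sketch of this generative (ancestral-sampling) process for the naive Bayes model, with made-up parameters (mine, not from the slides):

```python
import random

random.seed(0)

p_spam = 0.3                         # p(Y = 1); made-up parameter
p_word_given_y = {                   # p(X_i = 1 | Y = y) for three words; made up
    0: [0.05, 0.20, 0.10],
    1: [0.40, 0.15, 0.60],
}

def generate_email():
    """Ancestral sampling: sample the label first, then each word given the label."""
    y = 1 if random.random() < p_spam else 0
    x = [1 if random.random() < theta else 0 for theta in p_word_given_y[y]]
    return y, x

print([generate_email() for _ in range(3)])
```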



Bayesian network structure implies conditional
independencies!
  [Figure: the student network — Difficulty and Intelligence point to Grade, Intelligence points to SAT, Grade points to Letter]

The joint distribution corresponding to the above BN factors as

  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)

However, by the chain rule, any distribution can be written as

  p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | d, i, g, s)

Thus, we are assuming the following additional independencies:
D ⊥ I,   S ⊥ {D, G} | I,   L ⊥ {I, D, S} | G.   What else?
Bayesian network structure implies conditional
independencies!
Generalizing the above arguments, we obtain that a variable is independent from its non-descendants given its parents

Local structures & independencies (embedded slide credit: Eric Xing, CMU):

  Common parent (A ← B → C): fixing B decouples A and C
    ("given the level of gene B, the levels of A and C are independent")

  Cascade (A → B → C): knowing B decouples A and C
    ("given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C")

  V-structure (A → C ← B): knowing C couples A and B,
    because A can "explain away" B w.r.t. C
    ("if A correlates to C, then the chance for B to also correlate to C will decrease")

  The language is compact, the concepts are rich!

This important phenomenon is called explaining away and is what makes Bayesian networks so powerful
A simple justification (for common parent)

  [Figure: common parent structure A ← B → C]

We'll show that p(A, C | B) = p(A | B) p(C | B) for any distribution p(A, B, C) that factors according to this graph structure, i.e.

  p(A, B, C) = p(B) p(A | B) p(C | B)

Proof.

  p(A, C | B) = p(A, B, C) / p(B) = p(A | B) p(C | B)
D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when variables Y are observed:

  [Figure: common parent Y with children X and Z — (a) Y unobserved, (b) Y observed (shaded)]



D-separation (“directed separated”) in Bayesian networks
Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when variables Y are observed:

  [Figure: v-structure with X and Z as parents of Y — (a) Y unobserved, (b) Y observed (shaded)]



D-separation (“directed separated”) in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph


separation
Look to see if there is active path between X and Z when variables
Y are observed:
  [Figure: a chain through Y connecting X and Z — (a) Y unobserved, (b) Y observed (shaded)]

If no such path, then X and Z are d-separated with respect to Y


d-separation reduces statistical independencies (hard) to connectivity
in graphs (easy)
Important because it allows us to quickly prune the Bayesian network,
finding just the relevant variables for answering a query
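The slides check d-separation by inspecting paths by hand. As a supplementary sketch (mine, not from the lecture), here is one standard way to automate the test — separation in the moralized ancestral graph; all function and variable names are my own.

```python
from collections import deque

def ancestral_set(parents, nodes):
    """`nodes` together with all of their ancestors (parents maps node -> list of parents)."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, []))
    return seen

def d_separated(parents, X, Z, Y):
    """True iff X and Z are d-separated given Y in the DAG described by `parents`."""
    keep = ancestral_set(parents, set(X) | set(Z) | set(Y))
    # Moralize the ancestral subgraph: connect each node to its parents,
    # "marry" co-parents, and drop edge directions.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = list(parents.get(v, []))
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for a in range(len(ps)):
            for b in range(a + 1, len(ps)):
                adj[ps[a]].add(ps[b])
                adj[ps[b]].add(ps[a])
    # Remove the observed nodes Y and test whether X can reach Z in what is left.
    frontier = deque(v for v in X if v not in Y)
    visited = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in Z:
            return False            # an active path exists
        for u in adj[v]:
            if u not in visited and u not in Y:
                visited.add(u)
                frontier.append(u)
    return True

# The student network: D -> G <- I, I -> S, G -> L.
parents = {"D": [], "I": [], "G": ["D", "I"], "S": ["I"], "L": ["G"]}
print(d_separated(parents, {"D"}, {"I"}, set()))   # True:  D and I are marginally independent
print(d_separated(parents, {"D"}, {"I"}, {"G"}))   # False: observing G couples D and I (explaining away)
print(d_separated(parents, {"L"}, {"I"}, {"G"}))   # True:  L is independent of I given G
```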



D-separation example 1

  [Figure: a Bayesian network over X1, X2, X3, X4, X5, X6]


D-separation example 2

  [Figure: a Bayesian network over X1, X2, X3, X4, X5, X6]



The 2011 Turing Award (to Judea Pearl) was for Bayesian networks



Summary

Bayesian networks given by (G, P) where P is specified as a set of local conditional probability distributions associated with G's nodes
One interpretation of a BN is as a generative model, where variables are sampled in topological order
Local and global independence properties identifiable via d-separation criteria
Computing the probability of any assignment is obtained by multiplying CPDs
Bayes' rule is used to compute conditional probabilities
Marginalization or inference is often computationally difficult

