
CSE546: Point Estimation

Winter 2012
Luke Zettlemoyer


Slides adapted from Carlos Guestrin and Dan Klein
Your first consulting job
A billionaire from the suburbs of Seattle asks you a question:
He says: I have a thumbtack, if I flip it, what's the probability it will fall with the nail up?
You say: Please flip it a few times:
You say: The probability is:
P(H) = 3/5
He says: Why???
You say: Because...
Random Variables
A random variable is some aspect of the world about
which we (may) have uncertainty
R = Is it raining?
D = How long will it take to drive to work?
L = Where am I?
We denote random variables with capital letters
Random variables have domains
R in {true, false} (sometimes write as {+r, ¬r})
D in [0, ∞)
L in possible locations, maybe {(0,0), (0,1), …}
Probability Distributions
Discrete random variables have distributions
A discrete distribution is a TABLE of probabilities of values
A probability (lower case value) is a single number
Must have: P(x) ≥ 0 for all x, and Σ_x P(x) = 1
T P
warm 0.5
cold 0.5
W P
sun 0.6
rain 0.1
fog 0.3
meteor 0.0
Joint Distributions
A joint distribution over a set of random variables X_1, …, X_n
specifies a real number for each assignment (or outcome): P(X_1 = x_1, …, X_n = x_n)
Size of distribution if n variables with domain sizes d? d^n
Must obey: P(x_1, …, x_n) ≥ 0 and Σ P(x_1, …, x_n) = 1 (sum over all assignments)
For all but the smallest distributions, impractical to write out
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
Marginal Distributions
Marginal distributions are sub-tables which eliminate variables
Marginalization (summing out): combine collapsed rows by adding, e.g. P(t) = Σ_w P(t, w)
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
T P
hot 0.5
cold 0.5
W P
sun 0.6
rain 0.4
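For example, summing out W in the joint table above recovers the temperature marginal:
\[
P(T = \text{hot}) = P(\text{hot}, \text{sun}) + P(\text{hot}, \text{rain}) = 0.4 + 0.1 = 0.5
\]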
Conditional Probabilities
A simple relation between joint and conditional probabilities: P(a | b) = P(a, b) / P(b)
In fact, this is taken as the definition of a conditional probability
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
Conditional Distributions
Conditional distributions are probability distributions over
some variables given fixed values of others
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
P(W | T = hot):
W P
sun 0.8
rain 0.2
P(W | T = cold):
W P
sun 0.4
rain 0.6
(Conditional distributions are read off the joint distribution by fixing T and renormalizing.)
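For instance, the cold rows of the joint yield the second table:
\[
P(\text{sun} \mid T = \text{cold}) = \frac{P(\text{cold}, \text{sun})}{P(\text{cold})} = \frac{0.2}{0.2 + 0.3} = 0.4
\]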
The Product Rule
Sometimes have conditional distributions but want the joint: P(x, y) = P(x | y) P(y)
Example:
P(W):
W P
sun 0.8
rain 0.2
P(D | W):
D W P
wet sun 0.1
dry sun 0.9
wet rain 0.7
dry rain 0.3
P(D, W):
D W P
wet sun 0.08
dry sun 0.72
wet rain 0.14
dry rain 0.06
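Each entry of the joint is the matching conditional entry times the marginal, e.g.:
\[
P(\text{wet}, \text{sun}) = P(\text{wet} \mid \text{sun})\, P(\text{sun}) = 0.1 \times 0.8 = 0.08
\]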
Bayes Rule
Two ways to factor a joint distribution over two variables: P(x, y) = P(x | y) P(y) = P(y | x) P(x)
Dividing, we get: P(x | y) = P(y | x) P(x) / P(y)
Why is this at all helpful?
Lets us build one conditional from its reverse
Often one conditional is tricky but the other one is simple
Foundation of many practical systems (e.g. ASR, MT)
In the running for most important ML equation!
That's my rule!
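Applied to the product-rule example above: marginalizing the joint gives P(wet) = 0.08 + 0.14 = 0.22, so Bayes rule inverts P(D | W) into
\[
P(\text{sun} \mid \text{wet}) = \frac{P(\text{wet} \mid \text{sun})\, P(\text{sun})}{P(\text{wet})} = \frac{0.1 \times 0.8}{0.22} \approx 0.36
\]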
Thumbtack – Binomial Distribution
P(Heads) = θ, P(Tails) = 1 − θ

Flips are i.i.d.:
Independent events
Identically distributed according to Binomial distribution
Sequence D of α_H Heads and α_T Tails:
\[
D = \{x_i \mid i = 1 \ldots n\}, \qquad
P(D \mid \theta) = \prod_i P(x_i \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}
\]

Maximum Likelihood Estimation
Data: Observed set D of α_H Heads and α_T Tails
Hypothesis: Binomial distribution
Learning: finding θ is an optimization problem
What's the objective function?
MLE: Choose θ to maximize probability of D
Your first parameter learning algorithm
Set derivative to zero, and solve!
\[
\hat\theta = \arg\max_\theta \ln P(D \mid \theta)
\]
\[
\frac{d}{d\theta} \ln P(D \mid \theta)
= \frac{d}{d\theta} \ln \left[ \theta^{\alpha_H} (1-\theta)^{\alpha_T} \right]
= \frac{d}{d\theta} \left[ \alpha_H \ln \theta + \alpha_T \ln(1-\theta) \right]
\]
\[
= \alpha_H \frac{d}{d\theta} \ln \theta + \alpha_T \frac{d}{d\theta} \ln(1-\theta)
= \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0
\quad\Rightarrow\quad
\hat\theta_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
\]
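As a sanity check, here is a minimal sketch of this estimator (the variable names are mine, not from the slides): it treats MLE as the optimization problem above, brute-forcing the log-likelihood over a grid of θ and comparing against the closed form α_H / (α_H + α_T).

```python
import numpy as np

# Observed flips: 1 = heads, 0 = tails (3 heads, 2 tails, as in the slides)
flips = np.array([1, 1, 1, 0, 0])
n_heads = flips.sum()
n_tails = len(flips) - n_heads

# Log-likelihood of a Binomial sequence: ln P(D|theta) = a_H ln(theta) + a_T ln(1-theta)
def log_likelihood(theta):
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Brute-force the optimization problem on a grid of theta values ...
thetas = np.linspace(0.001, 0.999, 999)
theta_grid = thetas[np.argmax(log_likelihood(thetas))]

# ... and compare with the closed form from setting the derivative to zero
theta_mle = n_heads / (n_heads + n_tails)
print(theta_grid, theta_mle)  # both ~0.6 = 3/5
```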
But, how many flips do I need?
Billionaire says: I flipped 3 heads and 2 tails.
You say: θ = 3/5, I can prove it!
He says: What if I flipped 30 heads and 20 tails?
You say: Same answer, I can prove it!
He says: What's better?
You say: Umm... The more the merrier???
He says: Is this why I am paying you the big bucks???
[Plot: Prob. of Mistake vs. N — exponential decay!]
A bound (from Hoeffding's inequality)
For N = α_H + α_T flips, let θ* be the true parameter; for any ε > 0:
\[
P\left(|\hat\theta - \theta^*| \geq \epsilon\right) \leq 2e^{-2N\epsilon^2}
\]
PAC Learning
PAC: Probably Approximately Correct
Billionaire says: I want to know the thumbtack θ, within ε = 0.1, with probability at least 1 − δ = 0.95.
How many flips? Or, how big do I set N?
\[
P(\text{mistake}) = P\left(|\theta^* - \hat\theta| \geq \epsilon\right) \leq 2e^{-2N\epsilon^2} \leq \delta
\]
\[
\ln\left(2e^{-2N\epsilon^2}\right) \leq \ln \delta
\quad\Rightarrow\quad
\ln 2 - 2N\epsilon^2 \leq \ln \delta
\quad\Rightarrow\quad
N \geq \frac{\ln(2/\delta)}{2\epsilon^2}
\]
Interesting! Let's look at some numbers!
ε = 0.1, δ = 0.05
\[
N \geq \frac{\ln(2/\delta)}{2\epsilon^2}
= \frac{\ln(2/0.05)}{2 \times 0.1^2}
\approx \frac{3.8}{0.02} = 190
\]
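A quick sketch of this calculation (the helper below is my own, not from the slides); note the slide rounds ln 40 ≈ 3.69 up to 3.8, so the exact bound is slightly smaller than 190:

```python
import math

def flips_needed(eps, delta):
    """Smallest N with 2*exp(-2*N*eps^2) <= delta (Hoeffding/PAC bound)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(flips_needed(0.1, 0.05))  # 185 (the slide's rounding of ln 40 gives 190)
```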
What if I have prior beliefs?
Billionaire says: Wait, I know that the thumbtack is "close" to 50-50. What can you do for me now?
You say: I can learn it the Bayesian way...
Rather than estimating a single θ, we obtain a distribution over possible values of θ
[Plots: P(θ) in the beginning → P(θ | D) after observations]
Observe flips, e.g.: {tails, tails}
Bayesian Learning
Use Bayes rule:
\[
\underbrace{P(\theta \mid D)}_{\text{posterior}}
= \frac{\overbrace{P(D \mid \theta)}^{\text{data likelihood}}\;\overbrace{P(\theta)}^{\text{prior}}}{\underbrace{P(D)}_{\text{normalization}}}
\]
Or equivalently:
\[
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)
\]
Also, for uniform priors:
\[
P(\theta) \propto 1
\quad\Rightarrow\quad
P(\theta \mid D) \propto P(D \mid \theta)
\]
→ reduces to MLE objective
Bayesian Learning for Thumbtacks
Likelihood function is Binomial:
\[
P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}
\]
What about prior?
Represent expert knowledge
Simple posterior form
Conjugate priors:
Closed-form representation of posterior
For Binomial, conjugate prior is Beta distribution

Beta prior distribution – P(θ):
\[
P(\theta) = \frac{\theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)
\]
Likelihood function:
\[
P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}
\]
Posterior:
\[
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)
\propto \theta^{\alpha_H} (1-\theta)^{\alpha_T} \cdot \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}
\]
\[
= \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1}
\quad\Rightarrow\quad
P(\theta \mid D) = \text{Beta}(\alpha_H + \beta_H,\ \alpha_T + \beta_T)
\]
Posterior distribution
Prior: Beta(β_H, β_T)
Data: α_H heads and α_T tails
Posterior distribution:
\[
P(\theta \mid D) = \text{Beta}(\beta_H + \alpha_H,\ \beta_T + \alpha_T)
\]
Bayesian Posterior Inference
Posterior distribution:
\[
P(\theta \mid D) = \text{Beta}(\beta_H + \alpha_H,\ \beta_T + \alpha_T)
\]
Bayesian inference:
No longer single parameter
For any specific f, the function of interest, compute the expected value of f:
\[
E[f(\theta)] = \int_0^1 f(\theta)\, P(\theta \mid D)\, d\theta
\]
Integral is often hard to compute

MAP: Maximum a posteriori approximation
As more data is observed, Beta is more certain
MAP: use most likely parameter to approximate the expectation:
\[
\hat\theta = \arg\max_\theta P(\theta \mid D),
\qquad
E[f(\theta)] \approx f(\hat\theta)
\]
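A tiny numerical illustration of this approximation (my own example: f(θ) = θ², with illustrative posterior counts, none of which come from the slides): compare the exact posterior expectation, a one-dimensional integral, with f evaluated at the MAP point.

```python
from scipy.stats import beta

# Posterior Beta(a, b) with a = alpha_H + beta_H, b = alpha_T + beta_T (illustrative values)
a, b = 8, 7
posterior = beta(a, b)

# Function of interest: f(theta) = theta**2
f = lambda t: t ** 2

# Exact posterior expectation E[f(theta)] (numeric integral over theta)
exact = posterior.expect(f)

# MAP approximation: evaluate f at the posterior mode (a-1)/(a+b-2)
theta_map = (a - 1) / (a + b - 2)
print(exact, f(theta_map))  # ~0.300 vs ~0.290
```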
MAP for Beta distribution
MAP: use most likely parameter:
\[
\hat\theta = \arg\max_\theta P(\theta \mid D)
= \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}
\]
Beta prior equivalent to extra thumbtack flips
As N → ∞, prior is forgotten
But, for small sample size, prior is important!
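A small sketch of the pseudo-count view (the prior counts below are illustrative, not from the slides): a Beta(5, 5) prior acts like 5 extra heads and 5 extra tails, pulling the MAP estimate from the MLE toward 50-50.

```python
from scipy.stats import beta

# Prior Beta(beta_H, beta_T): illustrative pseudo-counts encoding "close to 50-50"
beta_H, beta_T = 5, 5
# Observed data: alpha_H heads, alpha_T tails (3 heads, 2 tails, as in the slides)
alpha_H, alpha_T = 3, 2

# Posterior is Beta(alpha_H + beta_H, alpha_T + beta_T) by conjugacy
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)

# MAP = (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)
theta_map = (alpha_H + beta_H - 1) / (alpha_H + beta_H + alpha_T + beta_T - 2)
theta_mle = alpha_H / (alpha_H + alpha_T)
print(theta_map, theta_mle)  # ~0.538 vs 0.6: the prior pulls toward 0.5
print(posterior.mean())      # posterior mean 8/15 ~ 0.533
```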
What about continuous variables?
Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians!
Some properties of Gaussians
Affine transformation (multiplying by a scalar and adding a constant) is Gaussian:
X ~ N(μ, σ²)
Y = aX + b  ⇒  Y ~ N(aμ + b, a²σ²)
Sum of (independent) Gaussians is Gaussian:
X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y)
Z = X + Y  ⇒  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
Easy to differentiate, as we will see soon!
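A quick Monte Carlo sanity check of both properties (my own sketch, not part of the slides): sample X and Y, apply the transformations, and compare empirical moments with the formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, s_x, mu_y, s_y = 2.0, 1.5, -1.0, 0.5
x = rng.normal(mu_x, s_x, 1_000_000)
y = rng.normal(mu_y, s_y, 1_000_000)  # independent of x

# Affine: Y = aX + b ~ N(a*mu + b, a^2 * sigma^2)
a, b = 3.0, 4.0
t = a * x + b
print(t.mean(), t.var())  # ~10.0 and ~20.25 = 3^2 * 1.5^2

# Sum of independent Gaussians: Z = X + Y ~ N(mu_x + mu_y, s_x^2 + s_y^2)
z = x + y
print(z.mean(), z.var())  # ~1.0 and ~2.5 = 1.5^2 + 0.5^2
```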
Learning a Gaussian
Collect a bunch of data
Hopefully, i.i.d. samples
e.g., exam scores
Learn parameters
Mean: μ
Variance: σ²

Data x_i, i = 0, …, 99:
Exam Score
0 85
1 95
2 100
3 12
… …
99 89
MLE for Gaussian:
Prob. of i.i.d. samples D = {x_1, …, x_N}:
\[
P(D \mid \mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{N} \prod_{i=1}^{N} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}
\]
Log-likelihood of data:
\[
\ln P(D \mid \mu, \sigma) = -N \ln\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2}
\]
\[
\hat\mu_{MLE}, \hat\sigma_{MLE} = \arg\max_{\mu,\sigma} P(D \mid \mu, \sigma)
\]
Your second learning algorithm:
MLE for mean of a Gaussian
What's MLE for mean?
\[
\frac{d}{d\mu} \ln P(D \mid \mu, \sigma)
= \sum_{i=1}^{N} \frac{x_i - \mu}{\sigma^2} = 0
\quad\Rightarrow\quad
\sum_{i=1}^{N} x_i - N\mu = 0
\quad\Rightarrow\quad
\hat\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x_i
\]
MLE for variance
Again, set derivative to zero:
\[
\frac{d}{d\sigma} \ln P(D \mid \mu, \sigma)
= -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x_i - \mu)^2}{\sigma^3} = 0
\quad\Rightarrow\quad
\hat\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat\mu)^2
\]
Learning Gaussian parameters
MLE:
\[
\hat\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\hat\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat\mu)^2
\]
BTW. MLE for the variance of a Gaussian is biased
Expected result of estimation is not the true parameter!
Unbiased variance estimator:
\[
\hat\sigma^2_{unbiased} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat\mu)^2
\]
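A short sketch of both estimators (using only the handful of scores visible in the table above; the full 100-score dataset is not reproduced here): numpy's ddof argument switches between the biased MLE and the unbiased estimator.

```python
import numpy as np

# The five exam scores visible in the table above (stand-in for the full data)
scores = np.array([85.0, 95.0, 100.0, 12.0, 89.0])

mu_mle = scores.mean()             # (1/N) * sum(x_i)
var_mle = scores.var(ddof=0)       # biased MLE: divides by N
var_unbiased = scores.var(ddof=1)  # unbiased: divides by N - 1

print(mu_mle, var_mle, var_unbiased)
```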
Bayesian learning of Gaussian parameters
Conjugate priors
Mean: Gaussian prior
Variance: Wishart distribution
Prior for mean: μ ~ N(η, λ²)
