Lecture 04
Exponential Families
Philipp Hennig
27 April 2023
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
Example — inferring probability of wearing glasses
Step 1: Construct σ-algebra (exposition by Stefan Harmeling)
…
Probability of wearing glasses after five observations
$$p(\pi \mid x_1, x_2, x_3, x_4, x_5) = Z_5^{-1} \prod_{i=1}^{5} p(x_i \mid \pi)\, p(\pi) = Z_5^{-1}\, \pi^{n} (1-\pi)^{m}\, p(\pi) = p(\pi \mid n, m)$$
Pierre-Simon, marquis de Laplace (1749–1827), Théorie Analytique des Probabilités, 1814, p. 364
Translated by a Deep Network, assisted by a human
Conjugate Priors
More in the next lecture
E. Pitman. Sufficient statistics and intrinsic accuracy. Math. Proc. Cambr. Phil. Soc. 32(4), 1936.
P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. Annals of Statistics 7(2), 1979.
$$p(x \mid f) = \prod_{i=1}^{n} f^{x_i} (1-f)^{1-x_i} = f^{n_1} (1-f)^{n_0}, \qquad x_i \in \{0,1\}, \quad n_0 := n - n_1$$

$$p(f \mid \alpha, \beta) = \mathcal{B}(f; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, f^{\alpha-1} (1-f)^{\beta-1}$$

$$p(f \mid x) = \mathcal{B}(f; \alpha + n_1, \beta + n_0) = \frac{1}{B(\alpha + n_1, \beta + n_0)}\, f^{\alpha + n_1 - 1} (1-f)^{\beta + n_0 - 1}$$

with

$$B(x, y) := \int_0^1 t^{x-1} (1-t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)} = \frac{(x-1)!\,(y-1)!}{(x+y-1)!}, \qquad \Gamma(z) = \int_0^\infty e^{-t} t^{z-1}\, dt = \int_0^1 (-\log u)^{z-1}\, du = (z-1)!$$

Pierre-Simon, marquis de Laplace (1749–1827)
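A minimal numerical check of this conjugate update (a Python sketch assuming numpy and scipy; the data and hyperparameters are arbitrary illustration values): the analytic Beta(α + n₁, β + n₀) posterior should match the grid-normalized product of likelihood and prior.

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(0)
alpha, beta = 2.0, 3.0                       # Beta prior hyperparameters (arbitrary)
x = rng.integers(0, 2, size=20)              # 20 Bernoulli observations
n1 = int(x.sum())
n0 = len(x) - n1

f = np.linspace(1e-6, 1 - 1e-6, 2001)
unnorm = f**n1 * (1 - f)**n0 * stats.beta.pdf(f, alpha, beta)   # likelihood times prior
numeric = unnorm / trapezoid(unnorm, f)                         # normalize on the grid
analytic = stats.beta.pdf(f, alpha + n1, beta + n0)             # conjugate posterior
print(np.max(np.abs(numeric - analytic)))                       # small (grid error only)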
Can we predict observations?
marginalizing over a Beta posterior – the Beta binomial distribution
$$p(x \mid f) = f^{n_1} (1-f)^{n_0}, \qquad n_0 := n - n_1$$

$$p(f) = \mathcal{B}(f; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, f^{\alpha-1} (1-f)^{\beta-1}$$

$$p(x) = \int p(x \mid f)\, p(f)\, df = \int f^{n_1} (1-f)^{n_0} \cdot \frac{1}{B(\alpha, \beta)}\, f^{\alpha-1} (1-f)^{\beta-1}\, df = \frac{B(\alpha + n_1, \beta + n_0)}{B(\alpha, \beta)}$$

Pierre-Simon, marquis de Laplace (1749–1827)
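The closed-form evidence B(α + n₁, β + n₀)/B(α, β) can be compared against direct numerical integration. A Python sketch with scipy (illustrative values only):

from scipy import integrate
from scipy.special import beta as beta_fn

a, b, n1, n0 = 2.0, 3.0, 7, 5                # arbitrary illustration values
closed_form = beta_fn(a + n1, b + n0) / beta_fn(a, b)
numeric, _ = integrate.quad(
    lambda f: f**n1 * (1 - f)**n0 * f**(a - 1) * (1 - f)**(b - 1) / beta_fn(a, b),
    0.0, 1.0)
print(closed_form, numeric)                  # agree up to quadrature error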
Demo
$$p(x \mid f) = \prod_{i=1}^{n} f_{x_i} = \prod_{k=1}^{K} f_k^{n_k}, \qquad x_i \in \{1, \dots, K\}, \quad n_k := |\{x_i \mid x_i = k\}|$$

$$p(f \mid \alpha) = \mathcal{D}(f; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} f_k^{\alpha_k - 1}$$

$$p(f \mid x) = \mathcal{D}(f; \alpha + n)$$

where

$$B(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left(\sum_k \alpha_k\right)}$$

Peter Gustav Lejeune Dirichlet (1805–1859)
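A Python sketch of the Dirichlet update (assuming numpy; categories are coded 0, …, K−1 here rather than 1, …, K, and the symmetric prior is an arbitrary choice):

import numpy as np

K = 4
alpha = np.ones(K)                           # symmetric Dirichlet prior (arbitrary)
rng = np.random.default_rng(1)
x = rng.integers(0, K, size=100)             # categorical observations, coded 0..K-1
n = np.bincount(x, minlength=K)              # n_k: number of observations in class k
alpha_post = alpha + n                       # Dirichlet posterior parameters
print(alpha_post, alpha_post / alpha_post.sum())   # parameters and posterior mean of f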
Analytic Bayesian Inference
Inferring the variance of a Gaussian
$$p(x \mid \sigma) = \prod_{i=1}^{n} \mathcal{N}(x_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right), \qquad p(\sigma) = \,?$$

$$\log p(x \mid \sigma) = -\frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \cdot \frac{1}{\sigma^2} - \frac{n}{2} \log 2\pi$$

$$\log p(\sigma \mid \alpha, \beta) = (\alpha - 1) \log \sigma^{-2} - \beta \cdot \frac{1}{\sigma^2} - \log Z(\alpha, \beta)$$

$$p(\sigma \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^{-2})^{\alpha - 1} e^{-\beta \sigma^{-2}} =: \mathcal{G}(\sigma^{-2}; \alpha, \beta)$$

$$p(\sigma \mid \alpha, \beta, x) = \mathcal{G}\!\left(\sigma^{-2};\, \alpha + \frac{n}{2},\, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2\right)$$

Daniel Bernoulli (1700–1782)
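A Python sketch (numpy/scipy; arbitrary data, with μ treated as known) that checks this Gamma update on the precision σ⁻² against a brute-force grid normalization:

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(2)
mu = 0.0                                     # mean assumed known
x = rng.normal(mu, 1.5, size=50)
alpha, beta = 2.0, 1.0                       # Gamma prior on the precision (arbitrary)

alpha_post = alpha + len(x) / 2
beta_post = beta + 0.5 * np.sum((x - mu) ** 2)

# brute-force check: posterior over the precision tau = sigma^{-2} on a grid
tau = np.linspace(1e-4, 3.0, 4000)
loglik = 0.5 * len(x) * np.log(tau) - 0.5 * tau * np.sum((x - mu) ** 2)
logpost = loglik + stats.gamma.logpdf(tau, alpha, scale=1 / beta)
post = np.exp(logpost - logpost.max())
post /= trapezoid(post, tau)
print(np.max(np.abs(post - stats.gamma.pdf(tau, alpha_post, scale=1 / beta_post))))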
For $m, n \in \mathbb{N}$ and $x, y, z \in \mathbb{C}$:

$$B(x, y) = \int_0^1 t^{x-1} (1-t)^{y-1}\, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)} \quad \text{if } \operatorname{Re}(x), \operatorname{Re}(y) > 0, \qquad B(m, n) = \frac{(m-1)!\,(n-1)!}{(m+n-1)!}$$

$$\Gamma(z) = \int_0^\infty e^{-t} t^{z-1}\, dt = \int_0^1 (-\log u)^{z-1}\, du, \qquad \Gamma(n) = (n-1)!$$

Hadamard:

$$H(x) = \frac{1}{\Gamma(1-x)} \frac{d}{dx} \log\!\left(\frac{\Gamma\!\left(\frac{1-x}{2}\right)}{\Gamma\!\left(1 - \frac{x}{2}\right)}\right)$$

Leonhard Euler (1707–1783)
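These identities can be spot-checked with scipy.special (a sketch with arbitrary test values, not part of the derivation):

import math
from scipy import integrate, special

x, y = 2.5, 4.0                              # arbitrary test values with positive real part
lhs, _ = integrate.quad(lambda t: t**(x - 1) * (1 - t)**(y - 1), 0.0, 1.0)
print(lhs, special.beta(x, y), special.gamma(x) * special.gamma(y) / special.gamma(x + y))
print(special.gamma(5), math.factorial(4))   # Gamma(n) = (n-1)!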
Can we predict observations?
marginalizing over a Gamma posterior – the t-distribution
$$p(x \mid \sigma) = \mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

$$p(\sigma) = \mathcal{G}(\sigma^{-2}; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^{-2})^{\alpha - 1} e^{-\beta \sigma^{-2}}$$

$$p(x) = \int p(x \mid \sigma)\, p(\sigma)\, d\sigma = \int \frac{(\sigma^{-2})^{1/2}}{\sqrt{2\pi}} \exp\!\left(-\sigma^{-2}\, \frac{(x - \mu)^2}{2}\right) \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^{-2})^{\alpha - 1} e^{-\beta \sigma^{-2}}\, d\sigma$$

$$= \frac{1}{\sqrt{2\pi}} \frac{\beta^\alpha}{\Gamma(\alpha)} \int (\sigma^{-2})^{\alpha + \frac{1}{2} - 1} \exp\!\left(-\sigma^{-2}\left(\beta + \frac{(x - \mu)^2}{2}\right)\right) d\sigma$$

$$= \frac{1}{\sqrt{2\pi}} \frac{\Gamma(\alpha + \frac{1}{2})}{\Gamma(\alpha)} \frac{\beta^\alpha}{\left(\beta + \frac{(x - \mu)^2}{2}\right)^{\alpha + \frac{1}{2}}} = \frac{\Gamma(\alpha + \frac{1}{2})}{\sqrt{2\pi\beta}\, \Gamma(\alpha)} \left(1 + \frac{(x - \mu)^2}{2\beta}\right)^{-\left(\alpha + \frac{1}{2}\right)}$$

William Sealy Gosset (1876–1937)
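A Python sketch (scipy; arbitrary α, β, μ) comparing the closed form above with direct numerical marginalization over the precision. The same density can also be written as a Student-t with 2α degrees of freedom and scale √(β/α), which scipy provides as stats.t:

import numpy as np
from scipy import stats, integrate, special

mu, alpha, beta = 0.0, 3.0, 2.0              # arbitrary illustration values
x = 1.3

closed = (special.gamma(alpha + 0.5) / (np.sqrt(2 * np.pi * beta) * special.gamma(alpha))
          * (1 + (x - mu) ** 2 / (2 * beta)) ** (-(alpha + 0.5)))

# numeric marginal: integrate N(x; mu, 1/tau) against G(tau; alpha, beta), tau = sigma^{-2}
numeric, _ = integrate.quad(
    lambda tau: stats.norm.pdf(x, mu, 1 / np.sqrt(tau))
    * stats.gamma.pdf(tau, alpha, scale=1 / beta),
    0.0, np.inf)

# the same density, written as a Student-t with 2*alpha degrees of freedom
student = stats.t.pdf(x, df=2 * alpha, loc=mu, scale=np.sqrt(beta / alpha))
print(closed, numeric, student)              # all three agree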
Analytic Bayesian Inference
Inferring the Mean of a Gaussian
$$p(y \mid x) = \prod_{i=1}^{n} \mathcal{N}(y_i; x, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x)^2}{2\sigma^2}\right), \qquad p(x) = \,?$$

$$\log p(y \mid x) = -\frac{n}{2} \log \sigma^2 - \frac{1}{2} \sum_{i=1}^{n} (y_i - x)^2 \cdot \frac{1}{\sigma^2} - \frac{n}{2} \log 2\pi = -\frac{n}{2}\left(\log \sigma^2 + \log 2\pi\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i^2 - 2 y_i x + x^2\right)$$

$$\log p(x \mid m, v^2) = -\frac{1}{2v^2} x^2 + \frac{m}{v^2} x - \frac{m^2}{2v^2} - \frac{1}{2} \log 2\pi v^2$$

$$\log p(x \mid y, m, v^2) = -\frac{v^{-2} + n\sigma^{-2}}{2} \left( x - \frac{1}{v^{-2} + n\sigma^{-2}} \left( \frac{m}{v^2} + \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i \right) \right)^{2} + \text{const.}$$

$$p(x \mid y, m, v^2) = \mathcal{N}\!\left(x;\; \Psi \left( \frac{m}{v^2} + \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i \right),\; \Psi \right), \qquad \Psi := \left(v^{-2} + n\sigma^{-2}\right)^{-1}$$

Carl Friedrich Gauss (1777–1855)
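A Python sketch (numpy/scipy; arbitrary data and prior) that checks the posterior N(x; Ψ(m/v² + Σᵢ yᵢ/σ²), Ψ) against a brute-force grid computation:

import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

rng = np.random.default_rng(3)
sigma, m, v = 1.0, 0.0, 2.0                  # likelihood std, prior mean and std (arbitrary)
y = rng.normal(1.5, sigma, size=10)          # observations
n = len(y)

Psi = 1.0 / (1 / v**2 + n / sigma**2)        # posterior variance
mean = Psi * (m / v**2 + y.sum() / sigma**2) # posterior mean
print(mean, Psi)

# brute-force check on a grid over x
x = np.linspace(-5.0, 5.0, 4001)
logpost = stats.norm.logpdf(x, m, v) + stats.norm.logpdf(y[:, None], x[None, :], sigma).sum(axis=0)
post = np.exp(logpost - logpost.max())
post /= trapezoid(post, x)
print(np.max(np.abs(post - stats.norm.pdf(x, mean, np.sqrt(Psi)))))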
Can we predict observations?
marginalizing over a Gaussian posterior
$$p(y \mid x) = \mathcal{N}(y; x, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - x)^2}{2\sigma^2}\right), \qquad p(x) = \mathcal{N}(x; m, v^2)$$

$$p(y) = \int p(y \mid x)\, p(x)\, dx = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - x)^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{2\pi v^2}} \exp\!\left(-\frac{(x - m)^2}{2v^2}\right) dx$$

$$= \frac{1}{2\pi\, \sigma v} \int \exp\!\left(-\frac{y^2 - 2xy + x^2}{2\sigma^2} - \frac{x^2 - 2mx + m^2}{2v^2}\right) dx$$

$$= \frac{1}{2\pi\, \sigma v} \exp\!\left(-\frac{y^2 - 2ym + m^2}{2(\sigma^2 + v^2)}\right) \int \exp\!\left(-\frac{\sigma^2 + v^2}{2\sigma^2 v^2} \left(x - \frac{y v^2 + m \sigma^2}{\sigma^2 + v^2}\right)^{2}\right) dx$$

$$= \frac{1}{\sqrt{2\pi(\sigma^2 + v^2)}} \exp\!\left(-\frac{(y - m)^2}{2(\sigma^2 + v^2)}\right)$$

Carl Friedrich Gauss (1777–1855)
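A short numerical check of this marginal (a Python/scipy sketch with arbitrary values): integrating the product of the two Gaussians should reproduce N(y; m, σ² + v²).

import numpy as np
from scipy import stats, integrate

m, v, sigma, y = 0.5, 2.0, 1.0, 1.7          # arbitrary illustration values
numeric, _ = integrate.quad(
    lambda x: stats.norm.pdf(y, x, sigma) * stats.norm.pdf(x, m, v), -np.inf, np.inf)
closed = stats.norm.pdf(y, m, np.sqrt(sigma**2 + v**2))
print(numeric, closed)                       # agree up to quadrature error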
A family of probability distributions with densities of the form
$$p_w(x) = \frac{h(x)\, \exp\!\left(w^\intercal \phi(x)\right)}{Z(w)} = h(x)\, \exp\!\left(w^\intercal \phi(x) - \log Z(w)\right)$$
is called an exponential family of probability measures. The function $\phi: X \to \mathbb{R}^d$ is called the sufficient statistics. The parameters $w \in \mathbb{R}^d$ are the natural parameters of $p_w$. The normalization constant $Z(w): \mathbb{R}^d \to \mathbb{R}$ is the partition function. The function $h(x): X \to \mathbb{R}_+$ is the base measure. For notational convenience, it can be useful to re-parametrize the natural parameters $w$ as $w := \eta(\theta)$ in terms of canonical parameters $\theta$.
$$p(k \mid q) = \binom{n}{k} q^k (1-q)^{n-k} \qquad \text{(nb: treating } n \text{ as fixed)}$$

$$= \binom{n}{k} \exp\!\big(k \log q + (n-k) \log(1-q)\big) = \underbrace{\binom{n}{k}}_{=:h(k)} \exp\Big(\underbrace{k}_{\phi(k)}\, \underbrace{\log \tfrac{q}{1-q}}_{w = \eta(q)} + \underbrace{n \log(1-q)}_{-\log Z(w)}\Big)$$

$$\log Z(w) = n \log(1 + e^w)$$
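A Python sketch (numpy/scipy; arbitrary n, q, k) confirming that the natural-parameter form h(k) exp(w φ(k) − log Z(w)) reproduces the binomial pmf:

import numpy as np
from scipy import stats
from scipy.special import comb

n, q, k = 10, 0.3, 4                         # arbitrary illustration values
w = np.log(q / (1 - q))                      # natural parameter
log_Z = n * np.log1p(np.exp(w))              # log partition function
ef_form = comb(n, k) * np.exp(k * w - log_Z) # h(k) exp(phi(k) * w - log Z(w))
print(ef_form, stats.binom.pmf(k, n, q))     # identical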
$$p(q \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, q^{\alpha-1} (1-q)^{\beta-1}$$

$$= \underbrace{1}_{h(q)} \exp\Bigg( \underbrace{\begin{pmatrix} \log q \\ \log(1-q) \end{pmatrix}^{\!\intercal}}_{=:\phi^\intercal(q)} \underbrace{\begin{pmatrix} \alpha - 1 \\ \beta - 1 \end{pmatrix}}_{w} - \log B(\alpha, \beta) \Bigg) = \underbrace{\frac{1}{q(1-q)}}_{\tilde h(q)} \exp\Bigg( \begin{pmatrix} \log q \\ \log(1-q) \end{pmatrix}^{\!\intercal} \underbrace{\begin{pmatrix} \alpha \\ \beta \end{pmatrix}}_{\tilde w} - \log B(\alpha, \beta) \Bigg)$$

The sufficient statistics $\phi$, natural parameters $w$, and base measure $h$ are not uniquely defined.
$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) = \underbrace{\frac{1}{\sqrt{2\pi}}}_{=:h(x)} \cdot \exp\Bigg( \underbrace{x \cdot \frac{\mu}{\sigma^2} + \frac{x^2}{2} \cdot \frac{-1}{\sigma^2}}_{=:\phi(x)^\intercal w} - \underbrace{\left(\frac{\mu^2}{2\sigma^2} + \log \sigma\right)}_{=:\log Z(w)} \Bigg)$$

Thus we identify the precision and precision-adjusted mean as the natural parameters, and the first two sample moments as the sufficient statistics:

$$w := \begin{pmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{\sigma^2} \end{pmatrix}, \qquad \phi(x) := \begin{pmatrix} x \\ \frac{1}{2} x^2 \end{pmatrix}, \qquad \log Z(w) := -\frac{w_1^2}{2 w_2} - \frac{1}{2} \log(-w_2)$$
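A Python sketch (numpy/scipy; arbitrary μ, σ, x) confirming that h(x) exp(w⊺φ(x) − log Z(w)) with these choices reproduces the Gaussian density:

import numpy as np
from scipy import stats

mu, sigma, x = 1.2, 0.7, 0.4                 # arbitrary illustration values
w = np.array([mu / sigma**2, -1 / sigma**2]) # natural parameters
phi = np.array([x, 0.5 * x**2])              # sufficient statistics
log_Z = -w[0] ** 2 / (2 * w[1]) - 0.5 * np.log(-w[1])
ef_form = 1 / np.sqrt(2 * np.pi) * np.exp(w @ phi - log_Z)
print(ef_form, stats.norm.pdf(x, mu, sigma)) # identical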
$$p(x) = h(x)\, \frac{F(\phi(x) + \alpha,\, \nu + 1)}{F(\alpha, \nu)}$$

Computing $F(\alpha, \nu)$, the normalization constant of the conjugate prior, can be tricky. In general, this is the challenge when constructing an EF.
… an inference algorithm!
For a long time, exponential families were the only way to do tractable Bayesian inference. In a way, the
essence of machine learning is to use computers to break free from exponential families.