
Probabilistic Machine Learning

Lecture 04
Exponential Families

Philipp Hennig
27 April 2023

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
Example — inferring probability of wearing glasses
Step 1: Construct σ-algebra (exposition by Stefan Harmeling)

Represent all unknowns as random variables (RVs)


▶ probability to wear glasses is represented by RV Y
▶ five observations are represented by RVs X1 , X2 , X3 , X4 , X5
Possible values of the RVs
▶ Y takes values π ∈ [0, 1]
▶ X1 , X2 , X3 , X4 , X5 are binary, i.e. values 0 and 1
Graphical representation
[figure: graphical model with node Y and child nodes X1, X2, X3, X4, X5]

Generative model and joint probability
▶ we abbreviate Y = π as π, Xi = xi as xi
▶ p(π) is the prior of Y, written fully p(Y = π)
▶ p(xi | π) is the likelihood of observation xi
▶ note that the likelihood is a function of π
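As a small illustration of this generative model, here is a sketch that samples π from a uniform prior and then draws the five binary observations (all numbers are hypothetical, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
pi = rng.uniform(0.0, 1.0)            # draw pi from the prior p(pi) (here: uniform on [0, 1])
x = rng.binomial(1, pi, size=5)       # draw the five binary observations X1, ..., X5 given pi
print(pi, x)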



Example — inferring probability of wearing glasses
Step 2: Define probability space, taking care of conditional independence (exposition by Stefan Harmeling)

Probability of wearing glasses without observations


p(π|“nothing”) = p(π)
Probability of wearing glasses after one observation

p(π | x1) = p(x1 | π) p(π) / ∫ p(x1 | π) p(π) dπ = Z_1^{-1} p(x1 | π) p(π)

Probability of wearing glasses after two observations

p(π | x1, x2) = Z_2^{-1} p(x2 | x1, π) p(x1 | π) p(π) = Z_2^{-1} p(x2 | π) p(x1 | π) p(π)

Probability of wearing glasses after five observations

p(π | x1, x2, x3, x4, x5) = Z_5^{-1} ( ∏_{i=1}^{5} p(xi | π) ) p(π)
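A minimal numerical sketch of this sequential update on a grid over π (the five observations are hypothetical; the repeated division plays the role of the normalizers Z_n):

import numpy as np

pi = np.linspace(0.0, 1.0, 1001)      # grid over pi in [0, 1]
posterior = np.ones_like(pi)          # start from a (here: uniform) prior p(pi)
posterior /= posterior.sum()

for x in [1, 0, 1, 1, 0]:             # five hypothetical binary observations
    likelihood = pi if x == 1 else 1.0 - pi
    posterior = likelihood * posterior
    posterior /= posterior.sum()      # renormalize after each observation: Z_n^{-1}

print((pi * posterior).sum())         # posterior mean of pi after the five observations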



Example — inferring probability of wearing glasses
Step 3: Define analytic forms of generative model (exposition by Stefan Harmeling)

What is the likelihood?

p(x1 | π) = π        for x1 = 1
          = 1 − π    for x1 = 0

More helpful RVs:


▶ RV N for the number of observations being 1 (with values n)
▶ RV M for the number of observations being 0 (with values m)
Probability of wearing glasses after five observations

p(π | x1, x2, x3, x4, x5) = Z_5^{-1} ( ∏_{i=1}^{5} p(xi | π) ) p(π)
                          = Z_5^{-1} π^n (1 − π)^m p(π)
                          = p(π | n, m)



Example — inferring probability of wearing glasses
Step 4: make computationally convenient choices. Here: a conjugate prior (exposition by Stefan Harmeling)

Posterior after seeing five observations:

p(π | n, m) = Z_5^{-1} π^n (1 − π)^m p(π)

What prior p(π) would make the calculations easy?

p(π) = Z^{-1} π^{a−1} (1 − π)^{b−1}    with parameters a > 0, b > 0

the Beta distribution with parameters a and b

Let's give the normalization factor Z of the Beta distribution a name!

B(a, b) = ∫_0^1 π^{a−1} (1 − π)^{b−1} dπ

the Beta function with parameters a and b
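A minimal sketch of this conjugate update using scipy (the prior parameters and counts are hypothetical):

from scipy import stats

a, b = 1.0, 1.0                      # hypothetical Beta(1, 1) prior, i.e. uniform on pi
n, m = 3, 2                          # hypothetical counts: n ones and m zeros among the observations
posterior = stats.beta(a + n, b + m) # conjugacy: the posterior is again a Beta distribution
print(posterior.mean(), posterior.interval(0.95))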



Conjugate Priors
More in the next lecture

Quand les valeurs de x, considérées indépendamment du résultat observé, ne sont pas également possibles; en nommant z la fonction de x qui exprime leur probabilité; il est facile de voir, par ce qui a été dit dans le premier chapitre de ce Livre, qu'en changeant dans la formule (1), y dans y · z, on aura la probabilité que la valeur de x est comprise dans les limites x = θ et x = θ′. Cela revient à supposer toutes les valeurs de x également possibles à priori, et à considérer le résultat observé, comme étant formé de deux résultats indépendans, dont les probabilités sont y et z. On peut donc ramener ainsi tous les cas à celui où l'on suppose à priori, avant l'événement, une égale possibilité aux différentes valeurs de x, et par cette raison, nous adopterons cette hypothèse dans ce qui va suivre.

Pierre-Simon, marquis de Laplace (1749–1827) Theorie Analytique des Probabilités, 1814, p. 364
Translated by a Deep Network, assisted by a human
Conjugate Priors
More in the next lecture

When the values of x, considered independently of the observed result, are not equally possible; if we name z the function of x which expresses their probability; it is easy to see, by what has been said in the first chapter of this Book, that by changing, in formula (1), y into y · z, we will have the probability that the value of x is within the limits x = θ and x = θ′. This amounts to assuming all the values of x equally possible a priori, and to considering the observed result as being formed by two independent results, whose probabilities are y and z. We can thus reduce all the cases to the one where we assume a priori, before the event, an equal possibility to the different values of x, and for this reason, we will adopt this hypothesis in what follows.

Theorie Analytique des Probabilités, 1814, p. 364


Pierre-Simon, marquis de Laplace (1749–1827) Translated by a Deep Network, assisted by a human



Conjugate Priors
An observation from last lecture

Definition (Conjugate Prior)


Let D and x be a data-set and a variable to be inferred, respectively, connected by the likelihood
p(D | x) = ℓ(D; x). A conjugate prior to ℓ for x is a probability measure with pdf p(x) = g(x; θ), such
that
p(x | D) = ℓ(D; x) g(x; θ) / ∫ ℓ(D; x) g(x; θ) dx = g(x; θ + ϕ(D)).
That is, such that the posterior arising from ℓ is of the same functional form as the prior, with updated
parameters arising by adding some sufficient statistics of the observation D to the prior’s parameters.

E. Pitman. Sufficient statistics and intrinsic accuracy. Math. Proc. Cambr. Phil. Soc. 32(4), 1936.
P. Diaconis and D. Ylvisaker, Conjugate priors for exponential families. Annals of Statistics 7(2), 1979.



Analytic Bayesian Inference
Inferring a Binary Distribution [cf. Lecture 2]

p(x | f) = ∏_{i=1}^{n} f^{x_i} · (1 − f)^{1−x_i},    x_i ∈ {0; 1}
         = f^{n1} · (1 − f)^{n0},    n0 := n − n1

p(f | α, β) = B(α, β) = 1/B(α, β) · f^{α−1} (1 − f)^{β−1}

p(f | x) = B(α + n1, β + n0) = 1/B(α + n1, β + n0) · f^{α+n1−1} (1 − f)^{β+n0−1}

with

B(x, y) := ∫_0^1 t^{x−1} (1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y) = (x − 1)! (y − 1)! / (x + y − 1)!
Γ(z) = ∫_0^∞ e^{−t} t^{z−1} dt = ∫_0^1 (−log x)^{z−1} dx = (z − 1)!

[portrait: Pierre-Simon, marquis de Laplace (1749–1827)]
Can we predict observations?
marginalizing over a Beta posterior – the Beta binomial distribution

p(x | f) = f^{n1} · (1 − f)^{n0},    n0 := n − n1

p(f) = B(α, β) = 1/B(α, β) · f^{α−1} (1 − f)^{β−1}

with

B(x, y) := ∫_0^1 t^{x−1} (1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y) = (x − 1)! (y − 1)! / (x + y − 1)!

p(x) = ?

[portrait: Pierre-Simon, marquis de Laplace (1749–1827)]



Can we predict observations?
marginalizing over a Beta posterior – the Beta binomial distribution

p(x | f) = f^{n1} · (1 − f)^{n0},    n0 := n − n1

p(f) = B(α, β) = 1/B(α, β) · f^{α−1} (1 − f)^{β−1}

with

B(x, y) := ∫_0^1 t^{x−1} (1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y) = (x − 1)! (y − 1)! / (x + y − 1)!

p(x) = ∫ p(x | f) p(f) df
     = ∫ f^{n1} (1 − f)^{n0} · 1/B(α, β) · f^{α−1} (1 − f)^{β−1} df

[portrait: Pierre-Simon, marquis de Laplace (1749–1827)]



Can we predict observations?
marginalizing over a Beta posterior – the Beta binomial distribution

p(x | f) = f^{n1} · (1 − f)^{n0},    n0 := n − n1

p(f) = B(α, β) = 1/B(α, β) · f^{α−1} (1 − f)^{β−1}

with

B(x, y) := ∫_0^1 t^{x−1} (1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y) = (x − 1)! (y − 1)! / (x + y − 1)!

p(x) = ∫ p(x | f) p(f) df
     = ∫ f^{n1} (1 − f)^{n0} · 1/B(α, β) · f^{α−1} (1 − f)^{β−1} df
     = B(α + n1, β + n0) / B(α, β)

[portrait: Pierre-Simon, marquis de Laplace (1749–1827)]
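A quick numerical check of this identity, using scipy's Beta function (the prior parameters and counts below are hypothetical):

from scipy.integrate import quad
from scipy.special import beta as beta_fn

a, b, n1, n0 = 2.0, 3.0, 4, 1        # hypothetical prior parameters and counts

# closed form: p(x) = B(a + n1, b + n0) / B(a, b)
closed_form = beta_fn(a + n1, b + n0) / beta_fn(a, b)

# the same quantity by numerically marginalizing over f
integrand = lambda f: f**n1 * (1 - f)**n0 * f**(a - 1) * (1 - f)**(b - 1) / beta_fn(a, b)
numerical, _ = quad(integrand, 0.0, 1.0)

print(closed_form, numerical)        # the two numbers should agree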
Demo

live: streamlit cloud


local:
▶ git clone https://github.com/philipphennig/ProbML_Apps.git
▶ cd ProbML_Apps/04
▶ pip install -r requirements.txt
▶ streamlit run 04_Conjugate_Prior_Inference.py



Analytic Bayesian Inference
Inferring a Categorical Distribution (image: Deutsches Museum München)

p(x | f) = ∏_{i=1}^{n} f_{x_i},    x_i ∈ {0, . . . , K}
         = ∏_{k=1}^{K} f_k^{n_k},    n_k := |{x_i | x_i = k}|

p(f | α) = D(α) = 1/B(α) · ∏_{k=1}^{K} f_k^{α_k − 1}

p(f | x) = D(α + n)

where

B(α) = ∏_{k=1}^{K} Γ(α_k) / Γ( Σ_k α_k )

[portrait: Peter Gustav Lejeune Dirichlet (1805–1859)]
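A minimal sketch of this Dirichlet update with hypothetical counts:

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])          # hypothetical symmetric prior over K = 3 classes
x = np.array([0, 2, 2, 1, 2])              # hypothetical observed class labels
n = np.bincount(x, minlength=len(alpha))   # counts n_k

print(dirichlet.mean(alpha + n))           # posterior mean of the class probabilities f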
Analytic Bayesian Inference
Inferring the variance of a Gaussian

p(x | σ) = ∏_{i=1}^{n} N(xi; μ, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(xi − μ)²/(2σ²) )

p(σ) = ?

[portrait: Daniel Bernoulli (1700–1782)]



Analytic Bayesian Inference
Inferring the variance of a Gaussian

p(x | σ) = ∏_{i=1}^{n} N(xi; μ, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(xi − μ)²/(2σ²) )

p(σ) = ?

log p(x | σ) = −(n/2) log σ² − (1/2) Σ_i (xi − μ)² · 1/σ² − (n/2) log 2π

[portrait: Daniel Bernoulli (1700–1782)]



Analytic Bayesian Inference
Inferring the variance of a Gaussian

p(x | σ) = ∏_{i=1}^{n} N(xi; μ, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(xi − μ)²/(2σ²) )

p(σ) = ?

log p(x | σ) = −(n/2) log σ² − (1/2) Σ_i (xi − μ)² · 1/σ² − (n/2) log 2π

log p(σ | α, β) = (α − 1) log σ⁻² − β · 1/σ² − Z(α, β)

p(σ | α, β) = β^α / Γ(α) · (σ⁻²)^{α−1} e^{−β σ⁻²} =: G(σ⁻²; α, β)

p(σ | α, β, x) = G( σ⁻²; α + n/2, β + (1/2) Σ_i (xi − μ)² )

[portrait: Daniel Bernoulli (1700–1782)]
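A small sketch of this update (hypothetical data, known mean μ), using scipy's Gamma distribution; note scipy parametrizes by shape and scale = 1/rate:

import numpy as np
from scipy.stats import gamma

mu = 0.0
x = np.array([0.3, -1.2, 0.8, 2.1, -0.5])        # hypothetical observations
alpha, beta = 2.0, 1.0                           # hypothetical Gamma prior on the precision sigma^{-2}

alpha_post = alpha + 0.5 * len(x)
beta_post = beta + 0.5 * np.sum((x - mu) ** 2)

precision_posterior = gamma(a=alpha_post, scale=1.0 / beta_post)
print(precision_posterior.mean())                # posterior mean of sigma^{-2}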



Aside: The Eulerian Integrals
For an evening at the fireplace: [Philip J. Davis. Leonhard Euler’s Integral: A Historical Profile of the Gamma Function, 1959]

For m, n ∈ N and x, y, z ∈ C:

B(x, y) = ∫_0^1 t^{x−1} (1 − t)^{y−1} dt
        = Γ(x)Γ(y)/Γ(x + y)    if x + x̄, y + ȳ > 0 (i.e. Re(x), Re(y) > 0)
B(m, n) = (m − 1)! (n − 1)! / (m + n − 1)!

Γ(z) = ∫_0^∞ e^{−t} t^{z−1} dt = ∫_0^1 (−log x)^{z−1} dx
Γ(n) = (n − 1)!

Hadamard:
H(x) = 1/Γ(1 − x) · d/dx log( Γ((1 − x)/2) / Γ(1 − x/2) )

[figure: plot of Euler's Γ(x) and Hadamard's H(x) over x]
[portrait: Leonhard Euler (1707–1783)]
Can we predict observations?
marginalizing over a Gamma posterior – the t-distribution

 
p(x | σ) = N(x; μ, σ²) = 1/√(2πσ²) exp( −(x − μ)²/(2σ²) )

p(σ) = G(σ⁻²; α, β) = β^α / Γ(α) · (σ⁻²)^{α−1} e^{−β σ⁻²}

p(x) = ∫ p(x | σ) p(σ) dσ
     = ∫ (σ⁻²)^{1/2} / √(2π) · exp( −σ⁻² (x − μ)²/2 ) · β^α / Γ(α) · (σ⁻²)^{α−1} e^{−β σ⁻²} dσ
     = 1/√(2π) · β^α / Γ(α) · ∫ (σ⁻²)^{α + 1/2 − 1} exp( −σ⁻² ( β + (x − μ)²/2 ) ) dσ
     = 1/√(2π) · Γ(α + 1/2) β^α / ( Γ(α) ( β + (x − μ)²/2 )^{α + 1/2} )
     = Γ(α + 1/2) / ( √(2πβ) Γ(α) ) · ( 1 + (x − μ)²/(2β) )^{−(α + 1/2)}

[portrait: William Sealy Gosset (1876–1937)]
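A numerical sanity check of this marginalization (with hypothetical α, β, μ and test point x), integrating over the precision τ = σ⁻² numerically:

import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gammaln

a, b, mu, x = 3.0, 2.0, 0.0, 1.5           # hypothetical alpha, beta, mu and a test point x

# closed form: Gamma(a + 1/2) / (sqrt(2 pi b) Gamma(a)) * (1 + (x - mu)^2 / (2 b))^(-(a + 1/2))
log_closed = (gammaln(a + 0.5) - gammaln(a) - 0.5 * np.log(2 * np.pi * b)
              - (a + 0.5) * np.log1p((x - mu) ** 2 / (2 * b)))

# the same by numerically integrating out the precision tau
integrand = lambda tau: stats.norm.pdf(x, loc=mu, scale=tau ** -0.5) * stats.gamma.pdf(tau, a=a, scale=1.0 / b)
numerical, _ = quad(integrand, 0.0, np.inf)

print(np.exp(log_closed), numerical)       # should agree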
Analytic Bayesian Inference
Inferring the Mean of a Gaussian

p(y | x) = ∏_{i=1}^{n} N(yi; x, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(yi − x)²/(2σ²) )

p(x) = ?

[portrait: Carl Friedrich Gauss (1777–1855)]



Analytic Bayesian Inference
Inferring the Mean of a Gaussian

p(y | x) = ∏_{i=1}^{n} N(yi; x, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(yi − x)²/(2σ²) )

p(x) = ?

log p(y | x) = −(n/2) log σ² − (1/2) Σ_i (yi − x)² · 1/σ² − (n/2) log 2π
             = −(n/2)(log σ² + log 2π) − 1/(2σ²) Σ_i ( yi² − 2 yi x + x² )

[portrait: Carl Friedrich Gauss (1777–1855)]



Analytic Bayesian Inference
Inferring the Mean of a Gaussian

p(y | x) = ∏_{i=1}^{n} N(yi; x, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(yi − x)²/(2σ²) )

p(x) = ?

log p(y | x) = −(n/2) log σ² − (1/2) Σ_i (yi − x)² · 1/σ² − (n/2) log 2π
             = −(n/2)(log σ² + log 2π) − 1/(2σ²) Σ_i ( yi² − 2 yi x + x² )

log p(x | m, v²) = −1/(2v²) x² + (m/v²) x − m²/(2v²) − (1/2) log 2πv²

log p(x | y, m, v²) = −(1/2)(v⁻² + nσ⁻²) ( x − (v⁻² + nσ⁻²)⁻¹ ( m/v² + (1/σ²) Σ_i yi ) )² + const.

[portrait: Carl Friedrich Gauss (1777–1855)]
Analytic Bayesian Inference
Inferring the Mean of a Gaussian

p(y | x) = ∏_{i=1}^{n} N(yi; x, σ²) = ∏_{i=1}^{n} 1/√(2πσ²) exp( −(yi − x)²/(2σ²) )

p(x) = ?

log p(y | x) = −(n/2) log σ² − (1/2) Σ_i (yi − x)² · 1/σ² − (n/2) log 2π
             = −(n/2)(log σ² + log 2π) − 1/(2σ²) Σ_i ( yi² − 2 yi x + x² )

log p(x | m, v²) = −1/(2v²) x² + (m/v²) x − m²/(2v²) − (1/2) log 2πv²

p(x | y, m, v²) = N( x; Ψ ( m/v² + (1/σ²) Σ_{i=1}^{n} yi ), Ψ ),    Ψ := (v⁻² + nσ⁻²)⁻¹

[portrait: Carl Friedrich Gauss (1777–1855)]
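A small sketch of this posterior with hypothetical numbers:

import numpy as np

sigma = 1.0                                  # known observation noise
m, v = 0.0, 2.0                              # hypothetical prior N(m, v^2) on the mean x
y = np.array([1.2, 0.8, 1.5, 0.9])           # hypothetical observations

Psi = 1.0 / (v**-2 + len(y) * sigma**-2)     # posterior variance
mean = Psi * (m / v**2 + y.sum() / sigma**2) # posterior mean
print(mean, np.sqrt(Psi))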
Can we predict observations?
marginalizing over a Gaussian posterior

 
p(y | x) = N(y; x, σ²) = 1/√(2πσ²) exp( −(y − x)²/(2σ²) )

p(x) = N(x; m, v²)

p(y) = ∫ p(y | x) p(x) dx

[portrait: Carl Friedrich Gauss (1777–1855)]



Can we predict observations?
marginalizing over a Gaussian posterior

p(y) = ∫ 1/√(2πσ²) exp( −(y − x)²/(2σ²) ) · 1/√(2πv²) exp( −(x − m)²/(2v²) ) dx
     = 1/(2π √(σ² v²)) ∫ exp( −(y − x)²/(2σ²) − (x − m)²/(2v²) ) dx
     = 1/(2π √(σ² v²)) ∫ exp( −(y² − 2xy + x²)/(2σ²) − (x² − 2mx + m²)/(2v²) ) dx
     = 1/(2π √(σ² v²)) exp( −(y² − 2ym + m²)/(2(σ² + v²)) ) ∫ exp( −( x² − 2x (y v² + m σ²)/(σ² + v²) + ((y v² + m σ²)/(σ² + v²))² ) / ( 2 σ² v²/(σ² + v²) ) ) dx
     = 1/√(2π(σ² + v²)) exp( −(y − m)²/(2(σ² + v²)) )

[portrait: Carl Friedrich Gauss (1777–1855)]
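A quick Monte Carlo check of this marginal distribution (hypothetical numbers):

import numpy as np
from scipy.stats import norm

m, v, sigma = 1.0, 0.5, 0.3                       # hypothetical prior mean/std and noise std
rng = np.random.default_rng(0)

x = rng.normal(m, v, size=1_000_000)              # x ~ N(m, v^2)
y = rng.normal(x, sigma)                          # y | x ~ N(x, sigma^2)

print(y.mean(), y.std())                          # approximately m and sqrt(sigma^2 + v^2)
print(norm(m, np.sqrt(sigma**2 + v**2)).std())    # the closed-form standard deviation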



▶ Conjugate priors allow analytic Bayesian inference
▶ How can we construct them in general?



Exponential Families
Exponentials of a Linear Form

Definition (Exponential Family, simplified form)


Consider a random variable X taking values x ∈ X ⊂ R^n. A probability distribution for X with pdf of the functional form

p_w(x) = h(x) exp[ ϕ(x)⊺ w − log Z(w) ] = h(x)/Z(w) · e^{ϕ(x)⊺ w} = p(x | w)

is called an exponential family of probability measures. The function ϕ : X → R^d is called the sufficient statistics. The parameters w ∈ R^d are the natural parameters of p_w. The normalization constant Z(w) : R^d → R is the partition function. The function h(x) : X → R_+ is the base measure. For notational convenience, it can be useful to re-parametrize the natural parameters w as w := η(θ) in terms of canonical parameters θ.
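As a concrete sketch, the Bernoulli distribution written in this form, with h(x) = 1, ϕ(x) = x, w = log(q/(1 − q)) and log Z(w) = log(1 + e^w) (the value q = 0.3 below is hypothetical):

import numpy as np

def bernoulli_ef_pdf(x, w):
    # p_w(x) = h(x) exp(phi(x) * w - log Z(w)) with h = 1, phi(x) = x
    log_Z = np.log1p(np.exp(w))      # partition function: log Z(w) = log(1 + e^w)
    return np.exp(x * w - log_Z)

q = 0.3
w = np.log(q / (1 - q))              # natural parameter
print(bernoulli_ef_pdf(1, w), bernoulli_ef_pdf(0, w))   # should print 0.3 and 0.7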



The Binomial Distribution
a quick tour of exponential families

 
p(k | q) = (n choose k) q^k (1 − q)^{n−k}    (nb: treating n as fixed)
         = (n choose k) exp( k log q + (n − k) log(1 − q) )
         = (n choose k) exp( k · log( q/(1 − q) ) + n log(1 − q) )

with base measure h(k) := (n choose k), sufficient statistic ϕ(k) = k, natural parameter w = η(q) = log( q/(1 − q) ), and −log Z(w) = n log(1 − q), i.e.

log Z(w) = n log(1 + e^w)



The Beta Distribution
a quick tour of exponential families

p(q | α, β) = 1/B(α, β) · q^{α−1} (1 − q)^{β−1}

            = 1 · exp( [log q, log(1 − q)] [α − 1, β − 1]⊺ − log B(α, β) )
              with h(q) = 1, sufficient statistics ϕ(q)⊺ = [log q, log(1 − q)], natural parameters w = [α − 1, β − 1]⊺

            = 1/(q(1 − q)) · exp( [log q, log(1 − q)] [α, β]⊺ − log B(α, β) )
              with base measure h̃(q) = 1/(q(1 − q)) and natural parameters w̃ = [α, β]⊺

sufficient statistics ϕ, natural parameters w and base measure h are not uniquely defined.



The Gaussian
a quick tour of exponential families

 
N(x; μ, σ²) = 1/√(2πσ²) exp( −(x − μ)²/(2σ²) )
            = 1/√(2π) · exp( x · (μ/σ²) + (x²/2) · (−1/σ²) − ( μ²/(2σ²) + log σ ) )
              with h(x) = 1/√(2π), the exponent of the form ϕ(x)⊺ w, and log Z(w) = μ²/(2σ²) + log σ

Thus we identify the precision and precision-adjusted mean as the natural parameters, and the first two sample moments as the sufficient statistics:

w := [ μ/σ², −1/σ² ]⊺    ϕ(x) := [ x, (1/2) x² ]⊺    log Z(w) := −w₁²/(2 w₂) − (1/2) log(−w₂)
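A tiny sketch of this reparametrization, converting between (μ, σ²) and the natural parameters w (the numbers in the last two lines are hypothetical):

import numpy as np

def to_natural(mu, sigma2):
    return np.array([mu / sigma2, -1.0 / sigma2])    # w = [mu/sigma^2, -1/sigma^2]

def from_natural(w):
    sigma2 = -1.0 / w[1]
    return w[0] * sigma2, sigma2                     # recover (mu, sigma^2)

def log_Z(w):
    return -w[0] ** 2 / (2 * w[1]) - 0.5 * np.log(-w[1])

w = to_natural(mu=1.5, sigma2=4.0)
print(from_natural(w), log_Z(w))                     # recovers (1.5, 4.0)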



A Family Meeting
Exponential families provide the probabilistic analogue to data types

Name sufficient stats domain use case


Bernoulli ϕ(x) = [x] X = {0; 1} coin toss
Poisson ϕ(x) = [x] X = R+ emails per day
Laplace ϕ(x) = [1, x]⊺ X=R floods
Helmert (χ2 ) ϕ(x) = [x, − log x] X=R variances
Dirichlet ϕ(x) = [log x] X = R+ class probabilities
Euler (Γ) ϕ(x) = [x, log x] X = R+ variances
Wishart ϕ(X) = [X, log |X|] X = {X ∈ RN×N | v⊺ Xv ≥ 0∀v ∈ RN } covariances
Gauss ϕ(X) = [X, XX⊺ ] X = RN functions
Boltzmann ϕ(X) = [X, triag(XX⊺ )] X = {0; 1}N thermodynamics



Exponential Families have Conjugate Priors
but the prior’s normalization constant can be tricky

▶ Consider the exponential family p_w(x | w) = h(x) exp[ ϕ(x)⊺ w − log Z(w) ]
▶ its conjugate prior is the exponential family with partition function F(α, ν) = ∫ exp( α⊺ w − ν log Z(w) ) dw,

  p_α(w | α, ν) = exp( [w, −log Z(w)] [α, ν]⊺ − log F(α, ν) )

  because p_α(w | α, ν) ∏_{i=1}^{n} p_w(xi | w) ∝ p_α( w | α + Σ_i ϕ(xi), ν + n )

▶ and the predictive is

  p(x) = ∫ p_w(x | w) p_α(w | α, ν) dw = h(x) ∫ e^{ (ϕ(x) + α)⊺ w − (ν + 1) log Z(w) − log F(α, ν) } dw
       = h(x) F(ϕ(x) + α, ν + 1) / F(α, ν)

Computing F(α, ν) can be tricky. In general, this is the challenge when constructing an EF.
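A sketch of this update rule for the Bernoulli family, where ϕ(x) = x and log Z(w) = log(1 + e^w); the prior parameters and observations are hypothetical, and F is approximated on a grid rather than computed in closed form:

import numpy as np

log_Z = lambda w: np.log1p(np.exp(w))         # Bernoulli family: phi(x) = x, log Z(w) = log(1 + e^w)

alpha, nu = 1.0, 2.0                          # hypothetical prior parameters (this choice gives a proper prior over w)
x = np.array([1, 0, 1, 1])                    # hypothetical Bernoulli observations

# conjugate update: alpha <- alpha + sum_i phi(x_i), nu <- nu + n
alpha_post, nu_post = alpha + x.sum(), nu + len(x)

# evaluate the unnormalized posterior exp(alpha' w - nu' log Z(w)) on a grid over w
w = np.linspace(-10, 10, 2001)
post = np.exp(alpha_post * w - nu_post * log_Z(w))
post /= post.sum()                            # grid approximation of the normalizer F(alpha', nu')
print((w * post).sum())                       # posterior mean of w = log(q / (1 - q))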



Summary:
▶ Conjugate Priors allow analytic inference in probabilistic models
▶ Exponential Families guarantee the existence of conjugate priors, although not always tractable ones
▶ The hardest part is finding the normalization constant. In fact, finding the normalization constant is the only hard part.
▶ Exponential families are a way to turn someone else's integral into an inference algorithm!

Please cite this course as

@techreport{Tuebingen_ProbML23,
  title       = {Probabilistic Machine Learning},
  author      = {Hennig, Philipp},
  series      = {Lecture Notes in Machine Learning},
  year        = {2023},
  institution = {Tübingen AI Center}
}

For a long time, exponential families were the only way to do tractable Bayesian inference. In a way, the essence of machine learning is to use computers to break free from exponential families.

