
Machine Learning

Section 7: More on distributions, models, MAP, ML

Stefan Harmeling

27. October 2021


Gaussian distribution



Univariate Gaussian distribution
see MLPP 2.4.1 (Murphy: Machine Learning: a Probabilistic Perspective)

▸ random variable X is real-valued


▸ parameters µ called mean, σ 2 > 0 called variance
▸ X has univariate Gaussian distribution, written

X ∼ N (µ, σ 2 )

▸ probability density function

N(x ∣ µ, σ²) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
▸ one can show: E X = µ and Var X = σ 2
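Not part of the original slides: a minimal Python sketch (assuming NumPy and SciPy are available) that evaluates the density above directly and cross-checks it against scipy.stats.norm.

```python
import numpy as np
from scipy.stats import norm

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x, mu, sigma2 = 1.3, 0.5, 2.0                       # hypothetical values
print(gauss_pdf(x, mu, sigma2))                     # formula from the slide
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # same value via SciPy
```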



Multivariate Gaussian distribution
see MLPP 2.5.2

▸ random vector X has real-valued components


▸ parameters µ called mean vector, pos-def symmetric matrix Σ
called covariance matrix
▸ X has multivariate Gaussian distribution, written

X ∼ N (µ, Σ)

▸ probability density function

N(x ∣ µ, Σ) = (1/((2π)^(n/2) ∣Σ∣^(1/2))) e^(−½ (x−µ)ᵀ Σ⁻¹ (x−µ))

▸ special case (n = 1): N (µ, σ 2 )


▸ one can show: E X = µ and Var X = Σ
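Again not from the slides, just a sketch: the same cross-check for the multivariate density, using np.linalg.solve for Σ⁻¹(x − µ) instead of forming the inverse explicitly.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

mu = np.array([0.0, 1.0])                          # hypothetical parameters
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should agree
```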



Closed under the sum and product rules:

A Gaussian joint distribution

p(x, y) = N([x; y], [µ; ν], [A, B; Bᵀ, C])

where [x; y] stacks x on top of y and [A, B; Bᵀ, C] is the corresponding 2×2 block matrix.

has Gaussian marginals

p(x) = ∫ p(x, y ) dy = N (x, µ, A)

p(y ) = ∫ p(x, y ) dx = N (y , ν, C)

and Gaussian conditionals

p(x ∣ y) = p(x, y)/p(y) = N(x, µ + B C⁻¹ (y − ν), A − B C⁻¹ Bᵀ)

p(y ∣ x) = p(x, y)/p(x) = N(y, ν + Bᵀ A⁻¹ (x − µ), C − Bᵀ A⁻¹ B)
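A minimal sketch (not from the slides) of the conditioning formula above, with hypothetical one-dimensional blocks; the same code works unchanged for matrix-valued blocks.

```python
import numpy as np

# joint Gaussian over (x, y): mean blocks (mu, nu), covariance blocks [[A, B], [B^T, C]]
mu, nu = np.array([0.0]), np.array([1.0])      # hypothetical numbers
A = np.array([[2.0]])
B = np.array([[0.8]])
C = np.array([[1.5]])

y = np.array([2.0])                            # observed value of y
cond_mean = mu + B @ np.linalg.solve(C, y - nu)   # mu + B C^{-1} (y - nu)
cond_cov = A - B @ np.linalg.solve(C, B.T)        # A - B C^{-1} B^T
print(cond_mean, cond_cov)                     # parameters of p(x | y)
```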



[Figure 1: Univariate distribution relationships.]
previous graphics from: “Univariate Distribution Relationships”, Lawrence M. Leemis and
Jacquelyn T. McQueston, The American Statistician, February 2008, Vol. 62, No. 1,
page 47



Distribution for waiting times



Poisson distribution
see MLPP 2.3.3

▸ counts of rare events


▸ let random variable X ∈ {0, 1, . . .} be the number of events in
some time interval
▸ let λ > 0 be the parameter (the rate)
▸ X has Poisson distribution, written

X ∼ Poi(λ)

▸ probability mass function

Poi(x ∣ λ) = e^(−λ) λ^x / x!
▸ E X = Var X = λ
▸ e.g. the number of emails you receive every day is Poisson distributed
▸ e.g. the waiting time between events is exponentially distributed (a related, but different distribution)
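Not on the slide: a short sketch (NumPy/SciPy assumed, with a hypothetical rate) checking the PMF and the fact that mean and variance both equal λ.

```python
import math
import numpy as np
from scipy.stats import poisson

lam, k = 3.5, 2                                        # hypothetical rate and count
print(poisson.pmf(k, lam))                             # P(X = k) via SciPy
print(math.exp(-lam) * lam ** k / math.factorial(k))   # same, via the formula above

rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=100_000)
print(samples.mean(), samples.var())                   # both close to lambda
```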
Distributions for tossing dice



Binomial distribution
see MLPP 2.3.1

▸ toss a coin n times


▸ let random variable X ∈ {0, . . . , n} be number of heads
▸ let θ be the probability of heads
▸ X has binomial distribution, written

X ∼ Bin(n, θ)

▸ probability mass function

Bin(k ∣ n, θ) = (n choose k) θ^k (1 − θ)^(n−k)
▸ E X = n θ, Var X = n θ(1 − θ)
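A quick sketch (not from the slides) evaluating the PMF and the moments with SciPy for hypothetical n and θ.

```python
from scipy.stats import binom

n, theta, k = 10, 0.3, 4                                 # hypothetical values
print(binom.pmf(k, n, theta))                            # P(X = 4)
print(binom.mean(n, theta), n * theta)                   # both give E X
print(binom.var(n, theta), n * theta * (1 - theta))      # both give Var X
```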



Bernoulli distribution
see MLPP 2.3.1

▸ toss a coin once


▸ let random variable X ∈ {0, 1} be a binary variable
▸ let θ be the probability of heads
▸ X has Bernoulli distribution, written

X ∼ Ber(θ)

▸ probability mass function

Ber(x ∣ θ) = θ^[x=1] (1 − θ)^[x=0] = { θ if x = 1;  1 − θ if x = 0 }

using Iverson brackets [A] = 1 if A is true, [A] = 0 if A is false


▸ E X = θ, Var X = θ(1 − θ)
▸ special case: Ber(θ) = Bin(1, θ)



Multinomial distribution
see MLPP 2.3.2

▸ toss a K -sided dice n times


▸ let X = [x1 , . . . , xK ]T be a random (column) vector, with xj being
the number of times side j occurs, ∑j xj = n
▸ let θ = [θ1 , . . . , θK ]T be the parameter (column) vector, with
∑j θj = 1 and θj ≥ 0
▸ let θj be the probability of side j of the dice
▸ X has multinomial distribution, written

X ∼ Mu(n, θ)

▸ probability mass function

Mu(x ∣ n, θ) = (n choose x1, …, xK) ∏_{j=1}^K θj^(xj)

with multinomial coefficient (n choose x1, …, xK) = n! / (x1! x2! ⋯ xK!)
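Not part of the slide: a sketch comparing scipy.stats.multinomial with the explicit formula for a hypothetical three-sided die.

```python
import math
import numpy as np
from scipy.stats import multinomial

n = 6
theta = np.array([0.5, 0.3, 0.2])            # hypothetical three-sided die
x = np.array([3, 2, 1])                      # counts, summing to n
print(multinomial.pmf(x, n, theta))

coef = math.factorial(n) // (math.factorial(3) * math.factorial(2) * math.factorial(1))
print(coef * 0.5**3 * 0.3**2 * 0.2**1)       # same value via the formula above
```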



Mean and variance of multinomial distribution
▸ the mean is a (column) vector:

E X = [n θ1 , . . . , n θK ]T

▸ the variance is a matrix (aka covariance matrix):


Var X = ⎡ n θ1(1 − θ1)   −n θ1 θ2       ⋯   −n θ1 θK      ⎤
        ⎢ −n θ2 θ1       n θ2(1 − θ2)   ⋯   −n θ2 θK      ⎥
        ⎢ ⋮              ⋮              ⋱   ⋮             ⎥
        ⎣ −n θK θ1       −n θK θ2       ⋯   n θK(1 − θK)  ⎦

▸ on the diagonal we have the variances of each entry,

Var Xi = n θi (1 − θi )

▸ off the diagonal we have the covariances of two distinct entries,

Cov(Xi , Xj ) = −n θi θj for i ≠ j

which is negative, since increasing one entry requires decreasing another one
Multinoulli distribution
see MLPP 2.3.2

▸ toss a K -sided dice once


▸ let X = (x1 , . . . , xK ) be a random vector, with xj being binary, such
that only one is non-zero (aka one-hot encoding)
▸ let θ = (θ1 , . . . , θK ) be the parameter vector, with ∑j θj = 1 and
θj ≥ 0
▸ let θj be the probability of side j of the dice
▸ X has multinoulli distribution, written

X ∼ Cat(θ) = Mu(1, θ)

▸ probability mass function

Cat(x ∣ θ) = ∏_{j=1}^K θj^(xj)

▸ aka categorical distribution or discrete distribution



Tossing dice (1)
▸ tossing n times a K sided dice
▸ let X be the random vector whose j-th entry counts how often side j appeared
▸ distribution of X : Multinomial

X ∼ Mu(n, θ)

with parameter vector θ


▸ assume n = 1: Multinoulli

Cat(θ) = Mu(1, θ)

▸ assume K = 2: Binomial

Bin(n, θ) = Mu(n, (θ, 1 − θ))

with θ ∈ [0, 1]
▸ assume n = 1 and K = 2: Bernoulli

Ber(θ) = Bin(1, θ) = Mu(1, (θ, 1 − θ)) = Cat((θ, 1 − θ))

with θ ∈ [0, 1]
Tossing dice (2)

▸ tossing n times a K sided dice

          n = 1          n > 1
K = 2     Bernoulli      Binomial
K > 2     Multinoulli    Multinomial



What distribution should we choose for the
parameters?



Beta-binomial model
MLPP 3.3
Data
▸ flip repeatedly a coin with unknown heads probability θ
▸ k number of heads, n total number of throws
▸ k is the data D
▸ same as wearing glasses example (Section 05)
Specify

θ ∼ Beta(a, b) p(θ) = Beta(θ ∣ a, b) prior


k ∣ θ ∼ Bin(n, θ) p(k ∣ θ) = Bin(k ∣ n, θ) likelihood

Infer

θ ∣ k ∼ Beta(a + k , b + n − k ) posterior
p(θ ∣ k ) = Beta(θ ∣ a + k , b + n − k ) posterior

▸ both notations are fine: θ ∼ Beta(a, b) and p(θ) = Beta(θ ∣ a, b)


Beta distribution
see MLPP 2.4.6

▸ random variable θ ∈ [0, 1] (interval between zero and one)


▸ parameters a > 0 and b > 0
▸ θ has beta distribution, written

θ ∼ Beta(a, b)

▸ probability density function

Beta(θ ∣ a, b) = (1/B(a, b)) θ^(a−1) (1 − θ)^(b−1)

with B(a, b) being the beta function

B(a, b) = Γ(a)Γ(b) / Γ(a + b)

▸ E θ = a/(a + b),  Var θ = a b/((a + b)²(a + b + 1)),  mode = (a − 1)/(a + b − 2)  (the max of the PDF, for a, b > 1)
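A small numerical check (not from the slides) of the three formulas, for hypothetical a and b.

```python
from scipy.stats import beta

a, b = 3.0, 7.0                                              # hypothetical parameters
print(beta.mean(a, b), a / (a + b))                          # mean
print(beta.var(a, b), a * b / ((a + b)**2 * (a + b + 1)))    # variance
print((a - 1) / (a + b - 2))                                 # mode (for a, b > 1)
```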



Gamma function, Beta function, and all that
from http://en.wikipedia.org/wiki/Gamma_function
and http://en.wikipedia.org/wiki/Beta_function
Gamma function (extension of factorial function)

Γ(z) = ∫₀^∞ e^(−t) t^(z−1) dt   for z ∈ C
Γ(n) = (n − 1)! = n!/n   for n ∈ N

Beta function (extension of . . . ?)


B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt
        = Γ(x)Γ(y) / Γ(x + y)   for x, y ∈ C with x + x̄, y + ȳ > 0

B(m, n) = (m − 1)! (n − 1)! / (m + n − 1)!   for m, n ∈ N
        = ((m + n) choose n)⁻¹ · (m + n)/(m n)   (via the binomial coefficient)



Dirichlet distribution
see MLPP 2.5.4
▸ random vector θ = [θ1 , . . . , θK ]T with values in probability simplex,
i.e. ∑j θj = 1, θj ≥ 0.
▸ parameter vector α = [α1 , . . . , αK ]T , with αj > 0
▸ θ has Dirichlet distribution, written

θ ∼ Dir(α)

▸ probability density function

Dir(θ ∣ α) = (1/B(α)) ∏_{k=1}^K θk^(αk − 1)

with B(α) generalizing the beta function


B(α) = ∏_{k=1}^K Γ(αk) / Γ(∑_{k=1}^K αk)

▸ special case: Beta(a, b) = Dir([a, b]T )


Beta-binomial model
MLPP 3.3
Data
▸ flip repeatedly a coin with unknown heads probability θ
▸ k number of heads, n total number of throws
▸ k is the data D
▸ same as wearing glasses example (Section 05)
Specify

p(θ) = Beta(θ ∣ a, b) prior


p(D ∣ θ) = Bin(k ∣ n, θ) likelihood

Infer

p(θ ∣ D) = Beta(θ ∣ a + k , b + n − k ) posterior

Since the prior and the posterior belong to the same family of distributions, we say that
the Beta distribution is the conjugate prior for the binomial likelihood.
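A minimal sketch (not from the slides) of the conjugate update, with hypothetical pseudo-counts and data.

```python
from scipy.stats import beta

a, b = 2.0, 2.0                              # hypothetical prior pseudo-counts
n, k = 20, 14                                # observed: 14 heads in 20 tosses
post_a, post_b = a + k, b + n - k            # conjugate update from the slide
print(beta.mean(post_a, post_b))             # posterior mean of theta
print((post_a - 1) / (post_a + post_b - 2))  # mode of the posterior
```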
Dirichlet-multinomial model
MLPP 3.4

Data
▸ throw n times a dice with unknown probabilities θ = [θ1 , . . . , θK ]T
▸ data D = [x1 , . . . , xK ]T , with xj being the number of times side j occurred
Specify

p(θ) = Dir(θ ∣ α) prior


p(D ∣ θ) = Mu(x ∣ n, θ) likelihood

Infer

p(θ ∣ D) = Dir(θ ∣ α + x) posterior

Since the prior and the posterior belong to the same family of distributions, we say that
the Dirichlet distribution is the conjugate prior for the multinomial likelihood.
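The analogous sketch (not from the slides) for the Dirichlet-multinomial update, again with hypothetical numbers.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])      # hypothetical uniform prior for a 3-sided die
x = np.array([5, 3, 2])                # observed counts for each side
alpha_post = alpha + x                 # conjugate update from the slide
print(dirichlet.mean(alpha_post))      # posterior mean of theta
```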



Digression: Gaussian-Gaussian model
Data
▸ sample n times from a univariate Gaussian distribution with
unknown mean µ and fixed variance σ 2
▸ data are n samples x1 , . . . , xn
Specify
p(µ) = N(µ ∣ 0, τ²)   prior

p(x1, …, xn ∣ µ) = ∏_{i=1}^n N(xi ∣ µ, σ²)   likelihood

Infer

p(µ ∣ x1, …, xn) = N(µ ∣ ν, ξ²)   posterior

with

ν = (σ⁻² ∑_{i=1}^n xi) / (τ⁻² + n σ⁻²),    ξ² = 1 / (τ⁻² + n σ⁻²)

Since the prior and the posterior belong to the same family of distributions, we say that
the Gaussian distribution is the conjugate prior for the Gaussian likelihood.
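A quick numerical sketch (not from the slides) of the posterior formulas, with hypothetical variances and simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, true_mu = 4.0, 1.0, 1.5            # hypothetical prior/noise variances
x = rng.normal(true_mu, np.sqrt(sigma2), size=50)

precision = 1 / tau2 + len(x) / sigma2           # tau^{-2} + n sigma^{-2}
xi2 = 1 / precision                              # posterior variance
nu = (x.sum() / sigma2) / precision              # posterior mean
print(nu, xi2)
```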
For a long list of conjugate priors and their likelihoods, see
https://en.wikipedia.org/wiki/Conjugate_prior.



Summary: distributions for tossing coins and dice

Throw a coin (K = 2) or a dice (K > 2).


Distributions for the outcome
▸ coin (K = 2): X ∼ Ber(θ) with θ being a scalar
▸ dice (K > 2): X ∼ Cat(θ) = Mu(1, θ) with θ being a vector (length K)
Distributions for the parameter (conjugate priors!)
▸ coin (K = 2): θ ∼ Beta(a, b) with a and b being scalar
▸ dice (K > 2): θ ∼ Dir(α) with α being vector (length K )



How can I get a point estimate?



MAP estimator and ML estimator
▸ let’s denote the data as D (was k in the beta-binomial model)
▸ summarize the posterior by a point estimate
▸ maximum a posteriori estimator (MAP)

θ_MAP = arg max_θ p(θ ∣ D) = arg max_θ p(D ∣ θ) p(θ)

(aka mode of the posterior)


▸ somewhat similar to maximum likelihood (ML) estimator

θ_ML = arg max_θ p(D ∣ θ)

▸ the likelihood term dominates for lots of data, thus the data overwhelms the prior and MAP converges to ML
▸ MAP and ML ignore the variance of the posterior
▸ nonetheless, MAP is useful if the posterior is peaked, ML is useful if we have lots of data



Famous ML estimator for Gaussian likelihoods
Setup
▸ consider Gaussian distributed data points X1 , . . . , Xn ∼ N (x ∣ µ, I)
▸ goal: estimate mean µ
Maximize the likelihood:

µ_ML = arg max_µ p(X1, …, Xn ∣ µ)
     = arg max_µ log p(X1, …, Xn ∣ µ)
     = arg max_µ log ∏_{i=1}^n (1/(2π)^(n/2)) e^(−½ (xi − µ)ᵀ (xi − µ))
     = arg max_µ ∑_{i=1}^n log e^(−½ (xi − µ)ᵀ (xi − µ))
     = arg min_µ ∑_{i=1}^n ∥xi − µ∥²

Thus we derived the method of least-squares!
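Setting the gradient of the last objective to zero gives µ_ML = (1/n) ∑_{i=1}^n xi, the sample mean. A tiny sketch (not from the slides) with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
x = rng.normal(mu_true, 1.0, size=(1000, 2))     # samples from N(mu, I)

mu_ml = x.mean(axis=0)      # minimizer of sum_i ||x_i - mu||^2 is the sample mean
print(mu_ml)                # close to (1, -2)
```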


Naming conventions for MAP and MLE

▸ MAP is “maximum a-posteriori”.


▸ The MAP estimator for a parameter θ is a function of the observed data that calculates the value of θ that maximizes the posterior distribution.
▸ ML is “maximum likelihood”.
▸ The ML estimator (sometimes called MLE) for a parameter θ is a function of the observed data that calculates the value of θ that maximizes the likelihood.



ML vs MAP: insights
ML is minimizing the negative log-likelihood:

θ_ML = arg max_θ p(D ∣ θ)
     = arg max_θ log p(D ∣ θ)
     = arg min_θ − log p(D ∣ θ)        ← the negative log-likelihood

MAP is a regularized ML:

θ_MAP = arg max_θ p(θ ∣ D)
      = arg max_θ p(D ∣ θ) p(θ) / p(D)          "Bayes rule"
      = arg max_θ p(D ∣ θ) p(θ)                 "p(D) is const wrt θ"
      = arg max_θ log p(D ∣ θ) + log p(θ)       "log is monotone"
      = arg min_θ − log p(D ∣ θ) − log p(θ)     ← the second term acts as regularization



ML vs MAP: comparing the estimators
Example: Estimate the mean of a Gaussian distribution after seeing
data x1 , x2 , . . . , xn (just real numbers, univariate) for the model:

p(µ) = N(µ ∣ 0, τ²)   prior for the mean

p(x1, …, xn ∣ µ) = ∏_{i=1}^n N(xi ∣ µ, σ²)   likelihood of the data

With λ = σ²/τ² we can derive:

µ_MAP = arg min_µ − log p(x1, …, xn ∣ µ) − log p(µ) = …
      = arg min_µ ∑_{i=1}^n (xi − µ)²  +  λ µ²   =   (1/(n + λ)) ∑_{i=1}^n xi
                  (least squares)      (regularization)

µ_ML = arg min_µ − log p(D ∣ µ)
     = arg min_µ ∑_{i=1}^n (xi − µ)²   =   (1/n) ∑_{i=1}^n xi
                 (negative log-likelihood)
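Not on the slide: a tiny sketch contrasting the two estimators; the MAP estimate is shrunk towards the prior mean 0 by the factor n/(n + λ).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2                          # lambda = sigma^2 / tau^2 = 4
x = rng.normal(2.0, np.sqrt(sigma2), size=10)

mu_ml = x.sum() / len(x)                     # sample mean
mu_map = x.sum() / (len(x) + lam)            # shrunk towards the prior mean 0
print(mu_ml, mu_map)
```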
Nice interpretation of MAP
Example: Estimate the mean of a Gaussian distribution after seeing
data x1 , x2 , . . . , xn (just real numbers, univariate):

µ_MAP = (1/(n + λ)) ∑_{i=1}^n xi

▸ E.g. λ = 1 (i.e. σ 2 = τ 2 ) is like adding another (older) observation


x0 = 0 and doing ML.
▸ E.g. λ = 2 (i.e. σ 2 = 2τ 2 ) is like adding two (older) observations
with value zero and doing ML.
▸ E.g. λ = 100 (i.e. σ 2 = 100τ 2 ) is like adding 100 (older)
observations with value zero and doing ML.
Notes:
▸ The MLE is like MAP with λ = 0 (i.e. τ 2 = ∞, thus having an
infinitely wide Gaussian prior), i.e. without previous observations.
▸ For any integer λ we can interpret the MAP estimator as an MLE
with λ many additional zero measurements.
▸ Parameter λ is similar to parameters a and b of the Beta
distribution which also count previous observations.
Which estimator should I choose?



Which estimator should I choose? (1)
MLPP 5.7

Bayesian decision theory


▸ turn priors into posteriors to update your beliefs
▸ how to convert beliefs into actions?
▸ define a loss function which tells us how expensive it is to be
wrong
▸ i.e. what is the loss L(θ̂, θ) if we pick parameter θ̂ while θ is the
true one
▸ given the posterior p(θ ∣ D) pick the θ̂ that minimizes the posterior
expected loss

ρ(θ̂) = ∫ L(θ̂, θ)p(θ ∣ D)dθ

▸ Bayes estimator, aka Bayes decision rule

θ̂ = arg min_θ̂ ρ(θ̂)



Which estimator should I choose? (2)
MLPP 5.7

Some common loss functions


▸ for the 0-1 loss

L(θ̂, θ) = { 0 if θ̂ = θ;  1 if θ̂ ≠ θ }

the Bayes estimator is the MAP estimator


▸ for the quadratic loss, aka l2 loss, aka squared error

L(θ̂, θ) = (θ̂ − θ)2

the Bayes estimator is the posterior mean


▸ for the robust loss, aka absolute error, aka l1 loss

L(θ̂, θ) = ∣θ̂ − θ∣

the Bayes estimator is the posterior median
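A minimal sketch (not from the slides): for a hypothetical Beta posterior the three Bayes estimators are just the mode, mean, and median of that Beta distribution.

```python
from scipy.stats import beta

a_post, b_post = 3.0, 9.0                          # hypothetical Beta posterior
print((a_post - 1) / (a_post + b_post - 2))        # mode   -> MAP (0-1 loss)
print(beta.mean(a_post, b_post))                   # mean   -> quadratic loss
print(beta.median(a_post, b_post))                 # median -> l1 loss
```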



Which estimator should I choose? (3)

Story:
You are at the NeurIPS conference in a big hotel, standing in
front of five elevators. Where should you stand to minimize
the length of the way to the next open elevator?
What loss function should you use? What is the resulting estimator?
(Here you should use l1 loss to minimize the distance to the elevator...)



Summary of point estimators
▸ Maximum Likelihood estimator (MLE):

θ_ML = arg max_θ p(D ∣ θ)

▸ Bayes estimators:
▸ Maximum a posteriori (MAP) estimator (minimizes the 0-1 loss):

θ_MAP = arg max_θ p(θ ∣ D)          "the mode of the posterior"
      = arg max_θ p(D ∣ θ) p(θ)

▸ Posterior mean (the estimator minimizing quadratic loss):

θ_posterior mean = E[θ ∣ D] = ∫ θ p(θ ∣ D) dθ


▸ Posterior median (the estimator minimizing l1 loss):

θ_posterior median = …

i.e. ∫_{θ < θ_posterior median} p(θ ∣ D) dθ = ∫_{θ > θ_posterior median} p(θ ∣ D) dθ



What else can we do with the posteriors?
Don’t we usually just want point estimates?



Posterior predictive distribution

Alternative to point estimates such as ML and MAP:


▸ posterior expresses our belief state about the world, e.g.

p(θ ∣ D) = Beta(θ ∣ a + k , b + n − k )

▸ use it to make predictions! (scientific method)


▸ define posterior predictive distribution
p(x = 1 ∣ D) = ∫₀¹ p(x = 1, θ ∣ D) dθ = ∫₀¹ p(x = 1 ∣ θ) p(θ ∣ D) dθ

where x is e.g. a random variable for the outcome of a future coin


toss, note that x ⊥⊥ D ∣ θ, look at the graphical model. . .
▸ posterior predictive distribution integrates out the unknown
parameter using the posterior



E.g. for the beta-binomial model
▸ MAP and ML

θ_MAP = arg max_θ p(θ ∣ D) = arg max_θ Beta(θ ∣ a + k, b + n − k) = (a + k − 1)/(a + b + n − 2)

θ_ML = arg max_θ p(D ∣ θ) = arg max_θ Bin(k ∣ n, θ) = k/n
▸ ML equals the MAP estimate for uniform prior on θ, i.e. for a = 1,
b = 1.
▸ posterior predictive distribution
p(x = 1 ∣ D) = ∫₀¹ p(x = 1 ∣ θ) p(θ ∣ D) dθ
             = ∫₀¹ θ Beta(θ ∣ a + k, b + n − k) dθ
             = (a + k)/(a + b + n)  =  posterior mean
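A quick check (not from the slides) of the closed form against a Monte Carlo estimate of the integral, for hypothetical a, b, n, k.

```python
import numpy as np

a, b, n, k = 1.0, 1.0, 10, 7                 # uniform prior, 7 heads in 10 tosses
print((a + k) / (a + b + n))                 # closed form from the slide

rng = np.random.default_rng(0)
thetas = rng.beta(a + k, b + n - k, size=200_000)   # samples from the posterior
print(thetas.mean())                         # Monte Carlo estimate of the same integral
```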



Inference for a difference in proportions
MLPP 5.2.3, see link in MLPP for the source

Story
Two sellers at Amazon have the same price. One has 90
positive, 10 negative reviews. The other one 2 positive, 0
negative. Who should you buy from?
Apply two beta-binomial models (assuming uniform priors)

p(θ1 ∣ D1 ) = Beta(θ1 ∣ 91, 11) posterior about reliability


p(θ2 ∣ D2 ) = Beta(θ2 ∣ 3, 1) posterior about reliability

Compute probability that seller 1 is more reliable than seller 2:

p(θ1 > θ2 ∣ D1, D2) = ∫₀¹ ∫₀¹ [θ1 > θ2] Beta(θ1 ∣ 91, 11) Beta(θ2 ∣ 3, 1) dθ1 dθ2 ≈ 0.710

using numerical integration (your exercise...).
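One way to do the exercise is plain Monte Carlo instead of grid integration; a sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.beta(91, 11, size=1_000_000)    # posterior samples for seller 1
theta2 = rng.beta(3, 1, size=1_000_000)      # posterior samples for seller 2
print((theta1 > theta2).mean())              # estimate of p(theta1 > theta2 | D1, D2), ~0.71
```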



Probabilistic inference: general recipe

Story
Learn something ...
Specify
▸ Prior
▸ Likelihood
Infer
▸ Posterior
▸ MAP (mode), mean, median, posterior predictive distribution
▸ (maybe MLE)



Transformation of random variables



Transformation of variables (1)
Theorem 7.1 (transformation of variable)
Suppose y (x) is an increasing monotonic function of x. Assume we
are given some random variable X with PDF pX (x).
1. Since y (x) is a monotonic function, it is invertible, i.e. its inverse
function x(y ) exists.
2. Let Y = y (X ) be a random variable defined via y (x).
3. Then the PDF pY (y ) of Y can be calculated from the PDF pX (x)
of X :
pY(y) = pX(x(y)) · dx(y)/dy
Informal proof: preserve probability mass pX (x)dx = pY (y )dy .
Note: usually there are absolute values around dx/dy in the
transformation rule, however, we omitted that since we assume that the
transformation is increasing.
Example: X with PDF pX (x), Y = log X . Then
pY (y ) = pX (exp(y )) exp(y ).
Transformation of variables (2)
Informal formula to remember:

p(x)dx = p(y )dy

Or, choose p(y ) such that “Integration by Substitution” works, i.e.:

Theorem 7.2 (rule of the unconscious statistician)


Given a random variable X with PDF p(x) and some function y (x).
Then the expected value of Y = y (X ) is

E Y = ∫ y p(y ) dy = ∫ y (x) p(x) dx = E y (X )

Proof: 1st equality (Def. 5.4), 2nd equality (Integration by substitution, and
transformation of RVs), 3rd equality (Def. 5.5, just a nice notation).
▸ Larry Wasserman calls this rule lazy (see All of Statistics,
Theorem 3.6), because there is no need to find p(y ).
▸ See also: https://en.wikipedia.org/wiki/Law_of_the_
unconscious_statistician
Transformation example (1)
Beta distribution:
p(π) = Beta(π ∣ a, b) = (1/B(a, b)) π^(a−1) (1 − π)^(b−1)   for π ∈ [0, 1]

Change the parameterization by transforming π:

x(π) = log(π / (1 − π))   and its (well-known) inverse   π(x) = 1 / (1 + e^(−x))
What is p(x)?
Answer:

p(x) = Beta(π(x) ∣ a, b) · dπ(x)/dx        (note: π′(x) = π(x)(1 − π(x)))
     = (1/B(a, b)) π(x)^(a−1) (1 − π(x))^(b−1) · π(x)(1 − π(x))
     = (1/B(a, b)) π(x)^a (1 − π(x))^b
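A numerical sanity check (not from the slides): sample π ∼ Beta(a, b), push the samples through the logit, and compare a histogram of x with the derived density.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                                   # hypothetical parameters
pi = rng.beta(a, b, size=500_000)
x = np.log(pi / (1 - pi))                         # x = logit(pi)

def p_x(x):
    s = 1 / (1 + np.exp(-x))                      # pi(x), the logistic sigmoid
    return np.exp(a * np.log(s) + b * np.log(1 - s) - betaln(a, b))

hist, edges = np.histogram(x, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_x(centers))))        # small -> histogram matches the density
```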



FILLING THE HOLE!



Transformation example (2)
Mean w/o and with transformation differ:
E_π(π) = a/(a + b)
E_x(x) = E_π x(π) ≠ log(a/b) = x(E_π(π))
Mode with and w/o transformation differ (maximum of PDF):

arg max_π p(π) = (a − 1)/(a + b − 2)   for a, b > 1
arg max_x p(x) ≠ x((a − 1)/(a + b − 2))
DANGER:
▸ Mean changes under transformation.
▸ Mode/maximum might change after transformation.
▸ So be careful with these point estimates...
▸ Median should be fine...
Transformation example (3)
Question: So what is the mean of X ?
Answer:

EX (x) = ψ(a) − ψ(b)

where ψ is the digamma function,

ψ(x) = d/dx log Γ(x)
the derivative of the logarithm of the Gamma function. Note that for
Beta distributed π ∼ Beta(a, b), we have:

E log π = ψ(a) − ψ(a + b)

thus
E log(π/(1 − π)) = E log π − E log(1 − π) = ψ(a) − ψ(b)
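The identity is easy to check numerically (not from the slides), using scipy.special.digamma:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                         # hypothetical parameters
pi = rng.beta(a, b, size=500_000)
x = np.log(pi / (1 - pi))

print(x.mean())                         # Monte Carlo estimate of E X
print(digamma(a) - digamma(b))          # psi(a) - psi(b) from the slide
```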



Transformation example (4)

For a random variable X ∼ pX(x) and a monotonic transformation y(x), we get a new random variable Y = y(X) with some PDF pY(y).
Mean and mode are in general not equivariant:

E Y = E y(X) ≠ y(E X)
arg max_y pY(y) ≠ y(arg max_x pX(x))

Median is in general equivariant:


Let x1/2 be the median of pX (x) and y1/2 be the median of pY (y ), then
we have:

y1/2 = y (x1/2 )

Proof: the monotonicity ensures that half of the probability mass will be
left of y (x1/2 ) and half of it will be right.
(By the way, could you figure out from this slide what equivariance is? If not, please look
it up and know the difference to invariance.)



End of Section 07

