
Machine Learning

Section 7: More on distributions, models, MAP, ML

Stefan Harmeling

27. October 2021


Gaussian distribution



Univariate Gaussian distribution
see MLPP 2.4.1 (Murphy: Machine Learning: a Probabilistic Perspective)

▸ random variable X is real-valued


▸ parameters µ called mean, σ 2 > 0 called variance
▸ X has univariate Gaussian distribution, written

X ∼ N (µ, σ 2 )

▸ probability density function

N(x ∣ µ, σ²) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
▸ one can show: E X = µ and Var X = σ 2
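Not part of the original slides: a minimal Python sketch (assuming NumPy and SciPy are available) that evaluates the density above directly and cross-checks it against scipy.stats.norm.

```python
import numpy as np
from scipy.stats import norm

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x, mu, sigma2 = 1.3, 0.5, 2.0                       # hypothetical values
print(gauss_pdf(x, mu, sigma2))                     # formula from the slide
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))   # same value via SciPy
```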



Multivariate Gaussian distribution
see MLPP 2.5.2

▸ random vector X has real-valued components


▸ parameters µ called mean vector, pos-def symmetric matrix Σ
called covariance matrix
▸ X has multivariate Gaussian distribution, written

X ∼ N (µ, Σ)

▸ probability density function

N(x ∣ µ, Σ) = (1/((2π)^(n/2) ∣Σ∣^(1/2))) e^(−½ (x−µ)ᵀ Σ⁻¹ (x−µ))

▸ special case (n = 1): N (µ, σ 2 )


▸ one can show: E X = µ and Var X = Σ
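Again not from the slides, just a sketch: the same cross-check for the multivariate density, using np.linalg.solve for Σ⁻¹(x − µ) instead of forming the inverse explicitly.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

mu = np.array([0.0, 1.0])                          # hypothetical parameters
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should agree
```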



Closed under the sum and product rules:

A Gaussian joint distribution

p(x, y) = N([x; y], [µ; ν], [A, B; Bᵀ, C])

where [x; y] stacks x on top of y and [A, B; Bᵀ, C] is the corresponding 2×2 block matrix.

has Gaussian marginals

p(x) = ∫ p(x, y ) dy = N (x, µ, A)

p(y ) = ∫ p(x, y ) dx = N (y , ν, C)

and Gaussian conditionals

p(x ∣ y) = p(x, y)/p(y) = N(x, µ + B C⁻¹ (y − ν), A − B C⁻¹ Bᵀ)

p(y ∣ x) = p(x, y)/p(x) = N(y, ν + Bᵀ A⁻¹ (x − µ), C − Bᵀ A⁻¹ B)
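A minimal sketch (not from the slides) of the conditioning formula above, with hypothetical one-dimensional blocks; the same code works unchanged for matrix-valued blocks.

```python
import numpy as np

# joint Gaussian over (x, y): mean blocks (mu, nu), covariance blocks [[A, B], [B^T, C]]
mu, nu = np.array([0.0]), np.array([1.0])      # hypothetical numbers
A = np.array([[2.0]])
B = np.array([[0.8]])
C = np.array([[1.5]])

y = np.array([2.0])                            # observed value of y
cond_mean = mu + B @ np.linalg.solve(C, y - nu)   # mu + B C^{-1} (y - nu)
cond_cov = A - B @ np.linalg.solve(C, B.T)        # A - B C^{-1} B^T
print(cond_mean, cond_cov)                     # parameters of p(x | y)
```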



[Figure 1: Univariate distribution relationships.]
previous graphics from: “Univariate Distribution Relationships”, Lawrence M. Leemis and
Jacquelyn T. McQueston, The American Statistician, February 2008, Vol. 62, No. 1,
page 47



Distribution for waiting times



Poisson distribution
see MLPP 2.3.3

▸ counts of rare events


▸ let random variable X ∈ {0, 1, . . .} be the number of events in
some time interval
▸ let λ > 0 be the parameter (the rate)
▸ X has Poisson distribution, written

X ∼ Poi(λ)

▸ probability mass function

Poi(x ∣ λ) = e^(−λ) λ^x / x!
▸ E X = Var X = λ
▸ e.g. the number of emails you receive every day is Poisson distributed
▸ e.g. the waiting time between events is exponentially distributed (a related, but different distribution)
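Not on the slide: a short sketch (NumPy/SciPy assumed, with a hypothetical rate) checking the PMF and the fact that mean and variance both equal λ.

```python
import math
import numpy as np
from scipy.stats import poisson

lam, k = 3.5, 2                                        # hypothetical rate and count
print(poisson.pmf(k, lam))                             # P(X = k) via SciPy
print(math.exp(-lam) * lam ** k / math.factorial(k))   # same, via the formula above

rng = np.random.default_rng(0)
samples = rng.poisson(lam, size=100_000)
print(samples.mean(), samples.var())                   # both close to lambda
```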
Distributions for tossing dice



Binomial distribution
see MLPP 2.3.1

▸ toss a coin n times


▸ let random variable X ∈ {0, . . . , n} be number of heads
▸ let θ be the probability of heads
▸ X has binomial distribution, written

X ∼ Bin(n, θ)

▸ probability mass function

Bin(k ∣ n, θ) = (n choose k) θ^k (1 − θ)^(n−k)
▸ E X = n θ, Var X = n θ(1 − θ)
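A quick sketch (not from the slides) evaluating the PMF and the moments with SciPy for hypothetical n and θ.

```python
from scipy.stats import binom

n, theta, k = 10, 0.3, 4                                 # hypothetical values
print(binom.pmf(k, n, theta))                            # P(X = 4)
print(binom.mean(n, theta), n * theta)                   # both give E X
print(binom.var(n, theta), n * theta * (1 - theta))      # both give Var X
```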



Bernoulli distribution
see MLPP 2.3.1

▸ toss a coin once


▸ let random variable X ∈ {0, 1} be a binary variable
▸ let θ be the probability of heads
▸ X has Bernoulli distribution, written

X ∼ Ber(θ)

▸ probability mass function

Ber(x ∣ θ) = θ^[x=1] (1 − θ)^[x=0] = { θ if x = 1;  1 − θ if x = 0 }

using Iverson brackets [A] = 1 if A is true, [A] = 0 if A is false


▸ E X = θ, Var X = θ(1 − θ)
▸ special case: Ber(θ) = Bin(1, θ)



Multinomial distribution
see MLPP 2.3.2

▸ toss a K -sided dice n times


▸ let X = [x1 , . . . , xK ]T be a random (column) vector, with xj being
the number of times side j occurs, ∑j xj = n
▸ let θ = [θ1 , . . . , θK ]T be the parameter (column) vector, with
∑j θj = 1 and θj ≥ 0
▸ let θj be the probability of side j of the dice
▸ X has multinomial distribution, written

X ∼ Mu(n, θ)

▸ probability mass function

Mu(x ∣ n, θ) = (n choose x1, …, xK) ∏_{j=1}^K θj^(xj)

with multinomial coefficient (n choose x1, …, xK) = n! / (x1! x2! ⋯ xK!)
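Not part of the slide: a sketch comparing scipy.stats.multinomial with the explicit formula for a hypothetical three-sided die.

```python
import math
import numpy as np
from scipy.stats import multinomial

n = 6
theta = np.array([0.5, 0.3, 0.2])            # hypothetical three-sided die
x = np.array([3, 2, 1])                      # counts, summing to n
print(multinomial.pmf(x, n, theta))

coef = math.factorial(n) // (math.factorial(3) * math.factorial(2) * math.factorial(1))
print(coef * 0.5**3 * 0.3**2 * 0.2**1)       # same value via the formula above
```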



Mean and variance of multinomial distribution
▸ the mean is a (column) vector:

E X = [n θ1 , . . . , n θK ]T

▸ the variance is a matrix (aka covariance matrix):


Var X = ⎡ n θ1(1 − θ1)   −n θ1 θ2       ⋯   −n θ1 θK      ⎤
        ⎢ −n θ2 θ1       n θ2(1 − θ2)   ⋯   −n θ2 θK      ⎥
        ⎢ ⋮              ⋮              ⋱   ⋮             ⎥
        ⎣ −n θK θ1       −n θK θ2       ⋯   n θK(1 − θK)  ⎦

▸ on the diagonal we have the variances of each entry,

Var Xi = n θi (1 − θi )

▸ off the diagonal we have the covariances of two distinct entries,

Cov(Xi , Xj ) = −n θi θj for i ≠ j

which is negative, since increasing one entry requires decreasing another one
Multinoulli distribution
see MLPP 2.3.2

▸ toss a K -sided dice once


▸ let X = (x1 , . . . , xK ) be a random vector, with xj being binary, such
that only one is non-zero (aka one-hot encoding)
▸ let θ = (θ1 , . . . , θK ) be the parameter vector, with ∑j θj = 1 and
θj ≥ 0
▸ let θj be the probability of side j of the dice
▸ X has multinoulli distribution, written

X ∼ Cat(θ) = Mu(1, θ)

▸ probability mass function

Cat(x ∣ θ) = ∏_{j=1}^K θj^(xj)

▸ aka categorical distribution or discrete distribution



Tossing dice (1)
▸ tossing n times a K sided dice
▸ let X be the random vector whose j-th entry counts how often side j appeared
▸ distribution of X : Multinomial

X ∼ Mu(n, θ)

with parameter vector θ


▸ assume n = 1: Multinoulli

Cat(θ) = Mu(1, θ)

▸ assume K = 2: Binomial

Bin(n, θ) = Mu(n, (θ, 1 − θ))

with θ ∈ [0, 1]
▸ assume n = 1 and K = 2: Bernoulli

Ber(θ) = Bin(1, θ) = Mu(1, (θ, 1 − θ)) = Cat((θ, 1 − θ))

with θ ∈ [0, 1]
Tossing dice (2)

▸ tossing n times a K sided dice

          n = 1          n > 1
K = 2     Bernoulli      Binomial
K > 2     Multinoulli    Multinomial



What distribution should we choose for the
parameters?



Beta-binomial model
MLPP 3.3
Data
▸ flip repeatedly a coin with unknown heads probability θ
▸ k number of heads, n total number of throws
▸ k is the data D
▸ same as wearing glasses example (Section 05)
Specify

θ ∼ Beta(a, b) p(θ) = Beta(θ ∣ a, b) prior


k ∣ θ ∼ Bin(n, θ) p(k ∣ θ) = Bin(k ∣ n, θ) likelihood

Infer

θ ∣ k ∼ Beta(a + k , b + n − k ) posterior
p(θ ∣ k ) = Beta(θ ∣ a + k , b + n − k ) posterior

▸ both notations are fine: θ ∼ Beta(a, b) and p(θ) = Beta(θ ∣ a, b)


Beta distribution
see MLPP 2.4.6

▸ random variable θ ∈ [0, 1] (interval between zero and one)


▸ parameters a > 0 and b > 0
▸ θ has beta distribution, written

θ ∼ Beta(a, b)

▸ probability density function

Beta(θ ∣ a, b) = (1/B(a, b)) θ^(a−1) (1 − θ)^(b−1)

with B(a, b) being the beta function

B(a, b) = Γ(a)Γ(b) / Γ(a + b)

▸ E θ = a/(a + b),  Var θ = a b/((a + b)²(a + b + 1)),  mode = (a − 1)/(a + b − 2)  (the max of the PDF, for a, b > 1)
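A small numerical check (not from the slides) of the three formulas, for hypothetical a and b.

```python
from scipy.stats import beta

a, b = 3.0, 7.0                                              # hypothetical parameters
print(beta.mean(a, b), a / (a + b))                          # mean
print(beta.var(a, b), a * b / ((a + b)**2 * (a + b + 1)))    # variance
print((a - 1) / (a + b - 2))                                 # mode (for a, b > 1)
```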



Gamma function, Beta function, and all that
from http://en.wikipedia.org/wiki/Gamma_function
and http://en.wikipedia.org/wiki/Beta_function
Gamma function (extension of factorial function)

Γ(z) = ∫₀^∞ e^(−t) t^(z−1) dt   for z ∈ C
Γ(n) = (n − 1)! = n!/n   for n ∈ N

Beta function (extension of . . . ?)


B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt
        = Γ(x)Γ(y) / Γ(x + y)   for x, y ∈ C with x + x̄, y + ȳ > 0

B(m, n) = (m − 1)! (n − 1)! / (m + n − 1)!   for m, n ∈ N
        = ((m + n) choose n)⁻¹ · (m + n)/(m n)   (via the binomial coefficient)



Dirichlet distribution
see MLPP 2.5.4
▸ random vector θ = [θ1 , . . . , θK ]T with values in probability simplex,
i.e. ∑j θj = 1, θj ≥ 0.
▸ parameter vector α = [α1 , . . . , αK ]T , with αj > 0
▸ θ has Dirichlet distribution, written

θ ∼ Dir(α)

▸ probability density function

Dir(θ ∣ α) = (1/B(α)) ∏_{k=1}^K θk^(αk − 1)

with B(α) generalizing the beta function


B(α) = ∏_{k=1}^K Γ(αk) / Γ(∑_{k=1}^K αk)

▸ special case: Beta(a, b) = Dir([a, b]T )


Beta-binomial model
MLPP 3.3
Data
▸ flip repeatedly a coin with unknown heads probability θ
▸ k number of heads, n total number of throws
▸ k is the data D
▸ same as wearing glasses example (Section 05)
Specify

p(θ) = Beta(θ ∣ a, b) prior


p(D ∣ θ) = Bin(k ∣ n, θ) likelihood

Infer

p(θ ∣ D) = Beta(θ ∣ a + k , b + n − k ) posterior

Since the prior and the posterior belong to the same family of distributions, we say that
the Beta distribution is the conjugate prior for the binomial likelihood.
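A minimal sketch (not from the slides) of the conjugate update, with hypothetical pseudo-counts and data.

```python
from scipy.stats import beta

a, b = 2.0, 2.0                              # hypothetical prior pseudo-counts
n, k = 20, 14                                # observed: 14 heads in 20 tosses
post_a, post_b = a + k, b + n - k            # conjugate update from the slide
print(beta.mean(post_a, post_b))             # posterior mean of theta
print((post_a - 1) / (post_a + post_b - 2))  # mode of the posterior
```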
Dirichlet-multinomial model
MLPP 3.4

Data
▸ throw n times a dice with unknown probabilities θ = [θ1 , . . . , θK ]T
▸ data D = [x1 , . . . , xK ]T , with xj being the number of times side j occurred
Specify

p(θ) = Dir(θ ∣ α) prior


p(D ∣ θ) = Mu(x ∣ n, θ) likelihood

Infer

p(θ ∣ D) = Dir(θ ∣ α + x) posterior

Since the prior and the posterior belong to the same family of distributions, we say that
the Dirichlet distribution is the conjugate prior for the multinomial likelihood.
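The analogous sketch (not from the slides) for the Dirichlet-multinomial update, again with hypothetical numbers.

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])      # hypothetical uniform prior for a 3-sided die
x = np.array([5, 3, 2])                # observed counts for each side
alpha_post = alpha + x                 # conjugate update from the slide
print(dirichlet.mean(alpha_post))      # posterior mean of theta
```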



Digression: Gaussian-Gaussian model
Data
▸ sample n times from a univariate Gaussian distribution with
unknown mean µ and fixed variance σ 2
▸ data are n samples x1 , . . . , xn
Specify
p(µ) = N(µ ∣ 0, τ²)   prior

p(x1, …, xn ∣ µ) = ∏_{i=1}^n N(xi ∣ µ, σ²)   likelihood

Infer

p(µ ∣ x1, …, xn) = N(µ ∣ ν, ξ²)   posterior

with

ν = (σ⁻² ∑_{i=1}^n xi) / (τ⁻² + n σ⁻²),    ξ² = 1 / (τ⁻² + n σ⁻²)

Since the prior and the posterior belong to the same family of distributions, we say that
the Gaussian distribution is the conjugate prior for the Gaussian likelihood.
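A quick numerical sketch (not from the slides) of the posterior formulas, with hypothetical variances and simulated data.

```python
import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, true_mu = 4.0, 1.0, 1.5            # hypothetical prior/noise variances
x = rng.normal(true_mu, np.sqrt(sigma2), size=50)

precision = 1 / tau2 + len(x) / sigma2           # tau^{-2} + n sigma^{-2}
xi2 = 1 / precision                              # posterior variance
nu = (x.sum() / sigma2) / precision              # posterior mean
print(nu, xi2)
```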
For a long list of conjugate priors and their likelihoods, see
https://en.wikipedia.org/wiki/Conjugate_prior.



Summary: distributions for tossing coins and dice

Throw a coin (K = 2) or a dice (K > 2).


Distributions for the outcome
▸ coin (K = 2): X ∼ Ber(θ) with θ being a scalar
▸ dice (K > 2): X ∼ Cat(θ) = Mu(1, θ) with θ being a vector (length K)
Distributions for the parameter (conjugate priors!)
▸ coin (K = 2): θ ∼ Beta(a, b) with a and b being scalar
▸ dice (K > 2): θ ∼ Dir(α) with α being vector (length K )



How can I get a point estimate?



MAP estimator and ML estimator
▸ let’s denote the data as D (was k in the beta-binomial model)
▸ summarize the posterior by a point estimate
▸ maximum a posteriori estimator (MAP)

θ_MAP = arg max_θ p(θ ∣ D) = arg max_θ p(D ∣ θ) p(θ)

(aka mode of the posterior)


▸ somewhat similar to maximum likelihood (ML) estimator

θ_ML = arg max_θ p(D ∣ θ)

▸ the likelihood term dominates for lots of data, thus the data overwhelms the prior and MAP converges to ML
▸ MAP and ML ignore the variance of the posterior
▸ nonetheless, MAP is useful if the posterior is peaked, ML is useful if we have lots of data



Famous ML estimator for Gaussian likelihoods
Setup
▸ consider Gaussian distributed data points X1 , . . . , Xn ∼ N (x ∣ µ, I)
▸ goal: estimate mean µ
Maximize the likelihood:

µ_ML = arg max_µ p(X1, …, Xn ∣ µ)
     = arg max_µ log p(X1, …, Xn ∣ µ)
     = arg max_µ log ∏_{i=1}^n (1/(2π)^(n/2)) e^(−½ (xi − µ)ᵀ (xi − µ))
     = arg max_µ ∑_{i=1}^n log e^(−½ (xi − µ)ᵀ (xi − µ))
     = arg min_µ ∑_{i=1}^n ∥xi − µ∥²

Thus we derived the method of least-squares!
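Setting the gradient of the last objective to zero gives µ_ML = (1/n) ∑_{i=1}^n xi, the sample mean. A tiny sketch (not from the slides) with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
x = rng.normal(mu_true, 1.0, size=(1000, 2))     # samples from N(mu, I)

mu_ml = x.mean(axis=0)      # minimizer of sum_i ||x_i - mu||^2 is the sample mean
print(mu_ml)                # close to (1, -2)
```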


Naming conventions for MAP and MLE

▸ MAP is “maximum a-posteriori”.


▸ The MAP estimator for a parameter θ is a function of the observed data that calculates the value of θ that maximizes the posterior distribution.
▸ ML is “maximum likelihood”.
▸ The ML estimator (sometimes called MLE) for a parameter θ is a function of the observed data that calculates the value of θ that maximizes the likelihood.



ML vs MAP: insights
ML is minimizing the negative log-likelihood:

θ_ML = arg max_θ p(D ∣ θ)
     = arg max_θ log p(D ∣ θ)
     = arg min_θ − log p(D ∣ θ)        ← the negative log-likelihood

MAP is a regularized ML:

θ_MAP = arg max_θ p(θ ∣ D)
      = arg max_θ p(D ∣ θ) p(θ) / p(D)          "Bayes rule"
      = arg max_θ p(D ∣ θ) p(θ)                 "p(D) is const wrt θ"
      = arg max_θ log p(D ∣ θ) + log p(θ)       "log is monotone"
      = arg min_θ − log p(D ∣ θ) − log p(θ)     ← the second term acts as regularization



ML vs MAP: comparing the estimators
Example: Estimate the mean of a Gaussian distribution after seeing
data x1 , x2 , . . . , xn (just real numbers, univariate) for the model:

p(µ) = N(µ ∣ 0, τ²)   prior for the mean

p(x1, …, xn ∣ µ) = ∏_{i=1}^n N(xi ∣ µ, σ²)   likelihood of the data

With λ = σ²/τ² we can derive:

µ_MAP = arg min_µ − log p(x1, …, xn ∣ µ) − log p(µ) = …
      = arg min_µ ∑_{i=1}^n (xi − µ)²  +  λ µ²   =   (1/(n + λ)) ∑_{i=1}^n xi
                  (least squares)      (regularization)

µ_ML = arg min_µ − log p(D ∣ µ)
     = arg min_µ ∑_{i=1}^n (xi − µ)²   =   (1/n) ∑_{i=1}^n xi
                 (negative log-likelihood)
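Not on the slide: a tiny sketch contrasting the two estimators; the MAP estimate is shrunk towards the prior mean 0 by the factor n/(n + λ).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2                          # lambda = sigma^2 / tau^2 = 4
x = rng.normal(2.0, np.sqrt(sigma2), size=10)

mu_ml = x.sum() / len(x)                     # sample mean
mu_map = x.sum() / (len(x) + lam)            # shrunk towards the prior mean 0
print(mu_ml, mu_map)
```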
Nice interpretation of MAP
Example: Estimate the mean of a Gaussian distribution after seeing
data x1 , x2 , . . . , xn (just real numbers, univariate):

µ_MAP = (1/(n + λ)) ∑_{i=1}^n xi

▸ E.g. λ = 1 (i.e. σ 2 = τ 2 ) is like adding another (older) observation


x0 = 0 and doing ML.
▸ E.g. λ = 2 (i.e. σ 2 = 2τ 2 ) is like adding two (older) observations
with value zero and doing ML.
▸ E.g. λ = 100 (i.e. σ 2 = 100τ 2 ) is like adding 100 (older)
observations with value zero and doing ML.
Notes:
▸ The MLE is like MAP with λ = 0 (i.e. τ 2 = ∞, thus having an
infinitely wide Gaussian prior), i.e. without previous observations.
▸ For any integer λ we can interpret the MAP estimator as an MLE
with λ many additional zero measurements.
▸ Parameter λ is similar to parameters a and b of the Beta
distribution which also count previous observations.
Which estimator should I choose?



Which estimator should I choose? (1)
MLPP 5.7

Bayesian decision theory


▸ turn priors into posteriors to update your beliefs
▸ how to convert beliefs into actions?
▸ define a loss function which tells us how expensive it is to be
wrong
▸ i.e. what is the loss L(θ̂, θ) if we pick parameter θ̂ while θ is the
true one
▸ given the posterior p(θ ∣ D) pick the θ̂ that minimizes the posterior
expected loss

ρ(θ̂) = ∫ L(θ̂, θ)p(θ ∣ D)dθ

▸ Bayes estimator, aka Bayes decision rule

θ̂ = arg min_θ̂ ρ(θ̂)



Which estimator should I choose? (2)
MLPP 5.7

Some common loss functions


▸ for the 0-1 loss

L(θ̂, θ) = { 0 if θ̂ = θ;  1 if θ̂ ≠ θ }

the Bayes estimator is the MAP estimator


▸ for the quadratic loss, aka l2 loss, aka squared error

L(θ̂, θ) = (θ̂ − θ)2

the Bayes estimator is the posterior mean


▸ for the robust loss, aka absolute error, aka l1 loss

L(θ̂, θ) = ∣θ̂ − θ∣

the Bayes estimator is the posterior median
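A minimal sketch (not from the slides): for a hypothetical Beta posterior the three Bayes estimators are just the mode, mean, and median of that Beta distribution.

```python
from scipy.stats import beta

a_post, b_post = 3.0, 9.0                          # hypothetical Beta posterior
print((a_post - 1) / (a_post + b_post - 2))        # mode   -> MAP (0-1 loss)
print(beta.mean(a_post, b_post))                   # mean   -> quadratic loss
print(beta.median(a_post, b_post))                 # median -> l1 loss
```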



Which estimator should I choose? (3)

Story:
You are at the NeurIPS conference in a big hotel, standing in
front of five elevators. Where should you stand to minimize
the length of the way to the next open elevator?
What loss function should you use? What is the resulting estimator?
(Here you should use l1 loss to minimize the distance to the elevator...)



Summary of point estimators
▸ Maximum Likelihood estimator (MLE):

θ_ML = arg max_θ p(D ∣ θ)

▸ Bayes estimators:
▸ Maximum a posteriori (MAP) estimator (minimizes the 0-1 loss):

θ_MAP = arg max_θ p(θ ∣ D)          "the mode of the posterior"
      = arg max_θ p(D ∣ θ) p(θ)

▸ Posterior mean (the estimator minimizing quadratic loss):

θ_posterior mean = E[θ ∣ D] = ∫ θ p(θ ∣ D) dθ


▸ Posterior median (the estimator minimizing l1 loss):

θ_posterior median = …

i.e. ∫_{θ < θ_posterior median} p(θ ∣ D) dθ = ∫_{θ > θ_posterior median} p(θ ∣ D) dθ



What else can we do with the posteriors?
Don’t we usually just want point estimates?



Posterior predictive distribution

Alternative to point estimates such as ML and MAP:


▸ posterior expresses our belief state about the world, e.g.

p(θ ∣ D) = Beta(θ ∣ a + k , b + n − k )

▸ use it to make predictions! (scientific method)


▸ define posterior predictive distribution
p(x = 1 ∣ D) = ∫₀¹ p(x = 1, θ ∣ D) dθ = ∫₀¹ p(x = 1 ∣ θ) p(θ ∣ D) dθ

where x is e.g. a random variable for the outcome of a future coin


toss, note that x ⊥⊥ D ∣ θ, look at the graphical model. . .
▸ posterior predictive distribution integrates out the unknown
parameter using the posterior



E.g. for the beta-binomial model
▸ MAP and ML

θ_MAP = arg max_θ p(θ ∣ D) = arg max_θ Beta(θ ∣ a + k, b + n − k) = (a + k − 1)/(a + b + n − 2)

θ_ML = arg max_θ p(D ∣ θ) = arg max_θ Bin(k ∣ n, θ) = k/n
▸ ML equals the MAP estimate for uniform prior on θ, i.e. for a = 1,
b = 1.
▸ posterior predictive distribution
p(x = 1 ∣ D) = ∫₀¹ p(x = 1 ∣ θ) p(θ ∣ D) dθ
             = ∫₀¹ θ Beta(θ ∣ a + k, b + n − k) dθ
             = (a + k)/(a + b + n)  =  posterior mean
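A quick check (not from the slides) of the closed form against a Monte Carlo estimate of the integral, for hypothetical a, b, n, k.

```python
import numpy as np

a, b, n, k = 1.0, 1.0, 10, 7                 # uniform prior, 7 heads in 10 tosses
print((a + k) / (a + b + n))                 # closed form from the slide

rng = np.random.default_rng(0)
thetas = rng.beta(a + k, b + n - k, size=200_000)   # samples from the posterior
print(thetas.mean())                         # Monte Carlo estimate of the same integral
```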



Inference for a difference in proportions
MLPP 5.2.3, see link in MLPP for the source

Story
Two sellers at Amazon have the same price. One has 90
positive, 10 negative reviews. The other one 2 positive, 0
negative. Who should you buy from?
Apply two beta-binomial models (assuming uniform priors)

p(θ1 ∣ D1 ) = Beta(θ1 ∣ 91, 11) posterior about reliability


p(θ2 ∣ D2 ) = Beta(θ2 ∣ 3, 1) posterior about reliability

Compute probability that seller 1 is more reliable than seller 2:

p(θ1 > θ2 ∣ D1, D2) = ∫₀¹ ∫₀¹ [θ1 > θ2] Beta(θ1 ∣ 91, 11) Beta(θ2 ∣ 3, 1) dθ1 dθ2 ≈ 0.710

using numerical integration (your exercise...).
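One way to do the exercise is plain Monte Carlo instead of grid integration; a sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.beta(91, 11, size=1_000_000)    # posterior samples for seller 1
theta2 = rng.beta(3, 1, size=1_000_000)      # posterior samples for seller 2
print((theta1 > theta2).mean())              # estimate of p(theta1 > theta2 | D1, D2), ~0.71
```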



Probabilistic inference: general recipe

Story
Learn something ...
Specify
▸ Prior
▸ Likelihood
Infer
▸ Posterior
▸ MAP (mode), mean, median, posterior predictive distribution
▸ (maybe MLE)



Transformation of random variables



Transformation of variables (1)
Theorem 7.1 (transformation of variable)
Suppose y (x) is an increasing monotonic function of x. Assume we
are given some random variable X with PDF pX (x).
1. Since y (x) is a monotonic function, it is invertible, i.e. its inverse
function x(y ) exists.
2. Let Y = y (X ) be a random variable defined via y (x).
3. Then the PDF pY (y ) of Y can be calculated from the PDF pX (x)
of X :
pY(y) = pX(x(y)) · dx(y)/dy
Informal proof: preserve probability mass pX (x)dx = pY (y )dy .
Note: usually there are absolute values around dx/dy in the
transformation rule, however, we omitted that since we assume that the
transformation is increasing.
Example: X with PDF pX (x), Y = log X . Then
pY (y ) = pX (exp(y )) exp(y ).
Transformation of variables (2)
Informal formula to remember:

p(x)dx = p(y )dy

Or, choose p(y ) such that “Integration by Substitution” works, i.e.:

Theorem 7.2 (rule of the unconscious statistician)


Given a random variable X with PDF p(x) and some function y (x).
Then the expected value of Y = y (X ) is

E Y = ∫ y p(y ) dy = ∫ y (x) p(x) dx = E y (X )

Proof: 1st equality (Def. 5.4), 2nd equality (Integration by substitution, and
transformation of RVs), 3rd equality (Def. 5.5, just a nice notation).
▸ Larry Wasserman calls this rule lazy (see All of Statistics,
Theorem 3.6), because there is no need to find p(y ).
▸ See also: https://en.wikipedia.org/wiki/Law_of_the_
unconscious_statistician
Transformation example (1)
Beta distribution:
p(π) = Beta(π ∣ a, b) = (1/B(a, b)) π^(a−1) (1 − π)^(b−1)   for π ∈ [0, 1]

Change the parameterization by transforming π:

x(π) = log(π / (1 − π))   and its (well-known) inverse   π(x) = 1 / (1 + e^(−x))
What is p(x)?
Answer:

p(x) = Beta(π(x) ∣ a, b) · dπ(x)/dx        (note: π′(x) = π(x)(1 − π(x)))
     = (1/B(a, b)) π(x)^(a−1) (1 − π(x))^(b−1) · π(x)(1 − π(x))
     = (1/B(a, b)) π(x)^a (1 − π(x))^b
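A numerical sanity check (not from the slides): sample π ∼ Beta(a, b), push the samples through the logit, and compare a histogram of x with the derived density.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                                   # hypothetical parameters
pi = rng.beta(a, b, size=500_000)
x = np.log(pi / (1 - pi))                         # x = logit(pi)

def p_x(x):
    s = 1 / (1 + np.exp(-x))                      # pi(x), the logistic sigmoid
    return np.exp(a * np.log(s) + b * np.log(1 - s) - betaln(a, b))

hist, edges = np.histogram(x, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_x(centers))))        # small -> histogram matches the density
```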



FILLING THE HOLE!



Transformation example (2)
Mean w/o and with transformation differ:
E_π(π) = a/(a + b)
E_x(x) = E_π x(π) ≠ log(a/b) = x(E_π(π))
Mode with and w/o transformation differ (maximum of PDF):

arg max_π p(π) = (a − 1)/(a + b − 2)   for a, b > 1
arg max_x p(x) ≠ x((a − 1)/(a + b − 2))
DANGER:
▸ Mean changes under transformation.
▸ Mode/maximum might change after transformation.
▸ So be careful with these point estimates...
▸ Median should be fine...
Transformation example (3)
Question: So what is the mean of X ?
Answer:

EX (x) = ψ(a) − ψ(b)

where ψ is the digamma function,

ψ(x) = d/dx log Γ(x)
the derivative of the logarithm of the Gamma function. Note that for
Beta distributed π ∼ Beta(a, b), we have:

E log π = ψ(a) − ψ(a + b)

thus
E log(π/(1 − π)) = E log π − E log(1 − π) = ψ(a) − ψ(b)
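The identity is easy to check numerically (not from the slides), using scipy.special.digamma:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
a, b = 2.0, 5.0                         # hypothetical parameters
pi = rng.beta(a, b, size=500_000)
x = np.log(pi / (1 - pi))

print(x.mean())                         # Monte Carlo estimate of E X
print(digamma(a) - digamma(b))          # psi(a) - psi(b) from the slide
```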



Transformation example (4)

For a random variable X ∼ pX(x) and a monotonic transformation y(x), we get a new random variable Y = y(X) with some PDF pY(y).
Mean and mode are in general not equivariant:

E Y = E y(X) ≠ y(E X)
arg max_y pY(y) ≠ y(arg max_x pX(x))

Median is in general equivariant:


Let x1/2 be the median of pX (x) and y1/2 be the median of pY (y ), then
we have:

y1/2 = y (x1/2 )

Proof: the monotonicity ensures that half of the probability mass will be
left of y (x1/2 ) and half of it will be right.
(By the way, could you figure out from this slide what equivariance is? If not, please look
it up and know the difference to invariance.)



End of Section 07

