X ∼ N(µ, σ²)

N(x ∣ µ, σ²) = 1/(σ√(2π)) · exp(−(x − µ)²/(2σ²))

▸ one can show: E X = µ and Var X = σ²

X ∼ N(µ, Σ)

N(x ∣ µ, Σ) = 1/((2π)^{n/2} ∣Σ∣^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))

p(x, y) = N([x; y] ∣ [µ; ν], [A, B; Bᵀ, C])

p(y) = ∫ p(x, y) dx = N(y ∣ ν, C)
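A minimal numpy sketch of this marginalization property (the block values below are illustrative assumptions, not from the slides): sample from the joint Gaussian over (x, y) and check that the marginal of y is N(ν, C).

```python
import numpy as np

# Illustrative joint Gaussian over (x, y); integrating x out should leave N(nu, C).
rng = np.random.default_rng(0)
mu, nu = np.array([1.0]), np.array([-2.0])   # means of x and y
A = np.array([[2.0]])                        # Cov(x, x)
B = np.array([[0.8]])                        # Cov(x, y)
C = np.array([[1.5]])                        # Cov(y, y)

mean = np.concatenate([mu, nu])
cov = np.block([[A, B], [B.T, C]])

samples = rng.multivariate_normal(mean, cov, size=200_000)
y = samples[:, 1]
print(y.mean(), y.var())   # close to nu = -2.0 and C = 1.5
```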
X ∼ Poi(λ)

Poi(x ∣ λ) = (λ^x / x!) · e^{−λ}
▸ E X = Var X = λ
▸ e.g. the number of emails you receive every day is Poisson distributed
▸ the waiting time between such events follows an exponential distribution
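A quick numpy check of E X = Var X = λ (λ = 4.0 is an assumed illustrative value):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
x = rng.poisson(lam, size=200_000)   # many Poisson draws
print(x.mean(), x.var())             # both close to lam = 4.0
```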
Distributions for tossing dice
X ∼ Bin(n, θ)

Bin(k ∣ n, θ) = (n choose k) θ^k (1 − θ)^{n−k}

▸ E X = nθ, Var X = nθ(1 − θ)

X ∼ Ber(θ)

Ber(x ∣ θ) = θ^{[x=1]} (1 − θ)^{[x=0]} = { θ if x = 1;  1 − θ if x = 0 }
X ∼ Mu(n, θ)

Mu(x ∣ n, θ) = (n choose x_1 … x_K) ∏_{j=1}^{K} θ_j^{x_j}

with multinomial coefficient (n choose x_1 … x_K) = n! / (x_1! x_2! ⋯ x_K!)

E X = [nθ_1, …, nθ_K]ᵀ
Var X_i = nθ_i(1 − θ_i)
Cov(X_i, X_j) = −nθ_i θ_j for i ≠ j
X ∼ Cat(θ) = Mu(1, θ)

Cat(x ∣ θ) = ∏_{j=1}^{K} θ_j^{x_j}
Special cases of X ∼ Mu(n, θ):
▸ assume n = 1: Categorical (Multinoulli), Cat(θ) = Mu(1, θ)
▸ assume n = 1 and K = 2: Bernoulli with θ ∈ [0, 1]
Tossing dice (2)
        n = 1          n > 1
K = 2   Bernoulli      Binomial
K > 2   Multinoulli    Multinomial
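A minimal numpy sketch of the four cases in the table (the parameters θ, p and n are assumed illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.3                        # success probability (K = 2)
print(rng.binomial(1, theta))      # Bernoulli: a single coin flip (n = 1)
print(rng.binomial(10, theta))     # Binomial: number of successes in n = 10 flips

p = np.full(6, 1 / 6)              # fair six-sided die (K = 6)
print(rng.multinomial(1, p))       # Multinoulli/Categorical: one-hot outcome of one throw
print(rng.multinomial(10, p))      # Multinomial: counts per side over n = 10 throws
```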
Infer
θ ∣ k ∼ Beta(a + k, b + n − k)          posterior
p(θ ∣ k) = Beta(θ ∣ a + k, b + n − k)   posterior
θ ∼ Beta(a, b)

Beta(θ ∣ a, b) = 1/B(a, b) · θ^{a−1} (1 − θ)^{b−1}

B(a, b) = Γ(a)Γ(b) / Γ(a + b)

▸ E X = a/(a + b),  Var X = ab/((a + b)²(a + b + 1)),  mode = (a − 1)/(a + b − 2)  (max of the PDF)
θ ∼ Dir(α)

Dir(θ ∣ α) = 1/B(α) · ∏_{k=1}^{K} θ_k^{α_k − 1}
Infer
Since the prior and the posterior belong to the same family of distributions, we say that the
Beta distribution is the conjugate prior for the binomial likelihood.
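A minimal scipy sketch of this update (a, b, n, k are assumed illustrative values): start from a Beta(a, b) prior, observe k successes in n trials, and read off the Beta(a + k, b + n − k) posterior.

```python
from scipy import stats

a, b = 2.0, 2.0                          # prior pseudo-counts
n, k = 10, 7                             # observed: 7 successes in 10 trials

posterior = stats.beta(a + k, b + n - k)
print(posterior.mean())                  # posterior mean: (a + k) / (a + b + n)
print((a + k - 1) / (a + b + n - 2))     # posterior mode (MAP estimate)
```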
Dirichlet-multinomial model
MLPP 3.4
Data
▸ throw a die n times with unknown probabilities θ = [θ_1, …, θ_K]ᵀ
▸ data D = [x_1, …, x_K]ᵀ, with x_j the number of times side j came up
Specify
▸ prior: θ ∼ Dir(α)
▸ likelihood: x ∣ θ ∼ Mu(n, θ) (as defined above)
Infer
p(θ ∣ D) = Dir(θ ∣ α_1 + x_1, …, α_K + x_K)   posterior
Since the prior and the posterior belong to the same family of distributions, we say that the
Dirichlet distribution is the conjugate prior for the multinomial likelihood.
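A minimal scipy sketch of the Dirichlet-multinomial update (the prior α and the counts are assumed illustrative values):

```python
import numpy as np
from scipy import stats

alpha = np.ones(6)                       # uniform Dirichlet prior over a six-sided die
x = np.array([3, 1, 4, 1, 5, 6])         # observed counts per side (n = 20 throws)

posterior = stats.dirichlet(alpha + x)   # posterior is Dir(alpha + x)
print(posterior.mean())                  # posterior mean: (alpha + x) / sum(alpha + x)
```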
Infer
p(µ ∣ x_1, …, x_n) = N(µ ∣ ν, ξ²)   posterior
with
ν = (σ⁻² ∑_{i=1}^{n} x_i) / (τ⁻² + nσ⁻²),    ξ² = 1 / (τ⁻² + nσ⁻²)
Since the prior and the posterior belong to the same family of distributions, we say that the
Gaussian distribution is the conjugate prior for the Gaussian likelihood.
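A minimal numpy sketch of this update with known σ and a zero-mean prior µ ∼ N(0, τ²), matching the formula above (σ, τ and the data are assumed illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0
x = rng.normal(1.5, sigma, size=20)        # data drawn with true mean 1.5

precision = tau**-2 + len(x) * sigma**-2   # posterior precision
nu = sigma**-2 * x.sum() / precision       # posterior mean
xi2 = 1.0 / precision                      # posterior variance
print(nu, xi2)
```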
For a long list of conjugate priors and their likelihoods, see
https://en.wikipedia.org/wiki/Conjugate_prior.
µ_MAP = 1/(n + λ) · ∑_{i=1}^{n} x_i     (for the Gaussian model above, λ = σ²/τ²)
L(θ̂, θ) = { 0 if θ̂ = θ;  1 if θ̂ ≠ θ }      (0-1 loss)

L(θ̂, θ) = ∣θ̂ − θ∣                          (l1 loss)
Story:
You are at the NeurIPS conference in a big hotel, standing in
front of five elevators. Where should you stand to minimize
the expected distance to the next elevator that opens?
What loss function should you use? What is the resulting estimator?
(Here you should use the l1 loss to minimize the expected distance to the elevator;
the resulting estimator is the median, as sketched below.)
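A minimal numpy sketch of the elevator story (positions and opening probabilities are assumed illustrative values): under the l1 loss the optimal position is the (weighted) median of the elevator positions.

```python
import numpy as np

positions = np.array([0.0, 2.0, 4.0, 6.0, 8.0])   # where the five elevators are
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # probability that each one opens first

def expected_l1(s):
    # expected walking distance when standing at position s
    return np.sum(probs * np.abs(positions - s))

candidates = np.linspace(0.0, 8.0, 801)
best = candidates[np.argmin([expected_l1(s) for s in candidates])]
print(best)   # 4.0, the weighted median of the positions
```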
▸ Bayes estimator:
▸ Maximum A Posteriori (MAP) estimator (minimizes the 0-1 loss):
▸ posterior median (minimizes the l1 loss): θ̂_{posterior median} = . . .
p(θ ∣ D) = Beta(θ ∣ a + k, b + n − k)
Story
Two sellers at Amazon have the same price. One has 90
positive and 10 negative reviews. The other one has 2 positive
and 0 negative reviews. Who should you buy from?
Apply two beta-binomial models (assuming uniform priors):

p(θ_1 > θ_2 ∣ D_1, D_2) = ∫₀¹ ∫₀¹ [θ_1 > θ_2] Beta(θ_1 ∣ 91, 11) Beta(θ_2 ∣ 3, 1) dθ_1 dθ_2 ≈ 0.710
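A minimal Monte Carlo check of this probability (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.beta(91, 11, size=1_000_000)   # seller 1: 90 positive, 10 negative, uniform prior
theta2 = rng.beta(3, 1, size=1_000_000)     # seller 2:  2 positive,  0 negative, uniform prior
print(np.mean(theta1 > theta2))             # approximately 0.71
```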
Story
Learn something ...
Specify
▸ Prior
▸ Likelihood
Infer
▸ Posterior
▸ MAP, Mode, Median, Posterior predictive distribution
▸ (maybe MLE)
Proof: 1st equality (Def. 5.4), 2nd equality (integration by substitution and
transformation of RVs), 3rd equality (Def. 5.5, just a nice notation).
▸ Larry Wasserman calls this the rule of the lazy statistician (see All of Statistics,
Theorem 3.6), because there is no need to find p(y).
▸ See also: https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician
Transformation example (1)
Beta distribution:
p(π) = Beta(π ∣ a, b) = 1/B(a, b) · π^{a−1} (1 − π)^{b−1}   for π ∈ [0, 1]

x(π) = log(π / (1 − π))   and its (well-known) inverse   π(x) = 1 / (1 + e^{−x})

What is p(x)?
Answer:

p(x) = Beta(π(x) ∣ a, b) · ∣dπ/dx∣                             note π′(x) = π(x)(1 − π(x))
     = 1/B(a, b) · π(x)^{a−1} (1 − π(x))^{b−1} · π(x)(1 − π(x))
     = 1/B(a, b) · π(x)^a (1 − π(x))^b
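A minimal numpy/scipy sketch checking this density numerically (a, b are assumed illustrative values): a histogram of transformed Beta samples should match the formula for p(x).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.0, 5.0

pi_samples = rng.beta(a, b, size=1_000_000)
x_samples = np.log(pi_samples / (1 - pi_samples))   # forward transform x(pi)

hist, edges = np.histogram(x_samples, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

pi = 1.0 / (1.0 + np.exp(-centers))                 # inverse transform pi(x)
p_x = stats.beta.pdf(pi, a, b) * pi * (1 - pi)      # formula derived above

print(np.max(np.abs(hist - p_x)))                   # small (only histogram noise)
```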
arg max_π p(π) = (a − 1)/(a + b − 2)   for a, b > 1

arg max_x p(x) ≠ x((a − 1)/(a + b − 2))
DANGER:
▸ Mean changes under transformation.
▸ Mode/maximum might change after transformation.
▸ So be careful with these point estimates...
▸ Median should be fine...
Transformation example (3)
Question: So what is the mean of X?
Answer: Let

ψ(x) = d/dx log Γ(x)

be the derivative of the logarithm of the Gamma function (the digamma function). Note that for
Beta distributed π ∼ Beta(a, b), we have:

E log π = ψ(a) − ψ(a + b)     and     E log(1 − π) = ψ(b) − ψ(a + b)

thus

E log(π / (1 − π)) = E log π − E log(1 − π) = ψ(a) − ψ(b)
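A minimal Monte Carlo check of E log(π/(1 − π)) = ψ(a) − ψ(b) (a, b are assumed illustrative values):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
a, b = 2.0, 5.0

pi = rng.beta(a, b, size=1_000_000)
print(np.mean(np.log(pi / (1 - pi))))   # Monte Carlo estimate
print(digamma(a) - digamma(b))          # closed form psi(a) - psi(b)
```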
E Y = E y(X) ≠ y(E X)   (in general)

arg max_y p_Y(y) ≠ y(arg max_x p_X(x))   (in general)

y_{1/2} = y(x_{1/2})   for monotonically increasing y

Proof: the monotonicity ensures that half of the probability mass lies to the
left of y(x_{1/2}) and half of it to the right.
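A minimal numpy sketch of the median statement (Beta samples with assumed illustrative parameters): transforming the sample median gives the median of the transformed samples.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = rng.beta(2.0, 5.0, size=1_000_001)              # odd sample size: exact sample median
x = np.log(pi / (1 - pi))                            # monotone transform

print(np.log(np.median(pi) / (1 - np.median(pi))))   # y(x_{1/2})
print(np.median(x))                                  # median of Y: the same value
```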
(By the way, could you figure out from this slide what equivariance is? If not, please look
it up and learn how it differs from invariance.)