
V: Discrete and continuous distributions

A modern crash course in intermediate Statistics and Probability

Paul Rognon

Barcelona School of Economics

Universitat Pompeu Fabra
Universitat Politècnica de Catalunya

Discrete distributions
Bernoulli distribution

The Bernoulli distribution models an experiment with a single binary outcome (e.g. success or failure, heads or tails). We say that X has a Bernoulli distribution with parameter p ∈ [0, 1], and write X ∼ Bern(p), if

$$f_X(x) = p^x (1 - p)^{1 - x} \quad \text{for } x \in \{0, 1\}.$$

Related distribution and model

• Logistic regression for a binary outcome is a generalized linear model for a response with a Bernoulli distribution.
• If X ∼ Bern(1/2), then 2X − 1 has a Rademacher distribution, a distribution that occurs in machine learning theory.

Binomial distribution
The binomial distribution models an experiment where we count the number of successes in n independent Bernoulli experiments, all having the same success probability p. We say that Y has a binomial distribution, and write Y ∼ Bin(n, p), if it takes values y = 0, 1, . . . , n with probability

$$f_Y(y) = \binom{n}{y} p^y (1 - p)^{n - y}.$$

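• If Y1 ∼ Bin(n1, p) and Y2 ∼ Bin(n2, p) are independent, then Y1 + Y2 ∼ Bin(n1 + n2, p).
• If $X_1, \dots, X_n$ are i.i.d. Bern(p), then $\sum_i X_i \sim \mathrm{Bin}(n, p)$.

Indeed:

$$P(X_1 = x_1, \dots, X_n = x_n) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_i x_i} (1 - p)^{n - \sum_i x_i},$$

and there are $\binom{n}{k}$ binary sequences $(x_1, \dots, x_n)$ with $\sum_i x_i = k$, so

$$P\Big(\sum_i X_i = k\Big) = \binom{n}{k} p^k (1 - p)^{n - k}.$$

The number of ways to choose k out of n objects is $\binom{n}{k} := \frac{n!}{(n - k)!\, k!}$.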
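As a quick numerical check of the sum-of-Bernoullis property, the following sketch convolves n Bernoulli pmfs and compares the result with the binomial formula. The helper names (`bernoulli_pmf`, `convolve`, `binomial_pmf`) and the values n = 6, p = 0.3 are illustrative assumptions, not part of the course material.

```python
from math import comb

def bernoulli_pmf(p):
    """pmf of Bern(p) as a list indexed by the value x = 0, 1."""
    return [1 - p, p]

def convolve(f, g):
    """pmf of the sum of two independent variables with pmfs f and g on 0, 1, 2, ..."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def binomial_pmf(n, p):
    """pmf of Bin(n, p) as a list indexed by y = 0, ..., n."""
    return [comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(n + 1)]

n, p = 6, 0.3  # illustrative values
sum_pmf = [1.0]  # pmf of the empty sum (always 0)
for _ in range(n):
    sum_pmf = convolve(sum_pmf, bernoulli_pmf(p))

# Largest absolute difference between the convolved pmf and the closed form
max_gap = max(abs(a - b) for a, b in zip(sum_pmf, binomial_pmf(n, p)))
print(max_gap)
```

The same `convolve` helper can be used to check the additivity property, for example by convolving `binomial_pmf(2, p)` with `binomial_pmf(4, p)`.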

Poisson distribution
The Poisson distribution is used to model counts of rare random events: shark attacks, big meteors hitting the Earth, etc. We say that X has a Poisson distribution with parameter λ > 0, and write X ∼ Pois(λ), if

$$f_X(x) = e^{-\lambda} \frac{\lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \dots$$

Here the set of values X(Ω) = {0, 1, 2, . . .} is discrete but infinite.

Related model and properties

• The mean and the variance are both equal to λ. This is a strong limitation in modelling; alternatives are the negative binomial distribution or adjustments for overdispersion.
• If X1 ∼ Pois(λ1) and X2 ∼ Pois(λ2) are independent, then X1 + X2 ∼ Pois(λ1 + λ2).
• The Poisson process is a stochastic process that counts the number of occurrences of an event up to time t.
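The properties above can be checked numerically. This is a minimal sketch that truncates the infinite support at 100 (an assumption that is harmless for the small rates used here) and verifies that the mean and variance both equal λ, and that the pmf of a sum of two independent Poisson variables matches Pois(λ1 + λ2) at one point. The function name `poisson_pmf` and all numeric values are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """pmf of Pois(lam) at the non-negative integer x."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.5
xs = range(100)  # truncated support; the tail beyond 100 is negligible for this lam
total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

# Additivity: P(X1 + X2 = k) by direct convolution of the two pmfs
lam1, lam2, k = 1.2, 2.3, 4
conv = sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))
direct = poisson_pmf(k, lam1 + lam2)
print(total, mean, var, conv, direct)
```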

Multinomial distribution

The multinomial distribution is a multivariate extension of the binomial distribution. It models an experiment where n independent trials, each with a finite number k of possible outcomes (k > 2), are run. We say that $X = (X_1, \dots, X_k)$, where $X_j$ counts the trials with outcome j, has a multinomial distribution with parameter (n, p), where $p = (p_1, \dots, p_k)$ and $\sum_{j=1}^{k} p_j = 1$, if:

$$f_X(x_1, \dots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k} \quad \text{where } \sum_{j=1}^{k} x_j = n.$$

Related model and properties

• When n = 1, it is called the categorical distribution.
• It frequently appears in clustering and dimension reduction models.
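To make the formula concrete, here is a small sketch of the multinomial pmf using only the standard library. The function name `multinomial_pmf` and the probability vectors are illustrative assumptions. The checks compare the two-outcome case with the binomial pmf and the n = 1 case with the categorical probabilities.

```python
from itertools import product
from math import comb, factorial, prod

def multinomial_pmf(x, p):
    """pmf of the multinomial with n = sum(x) trials and outcome probabilities p."""
    coef = factorial(sum(x))
    for xj in x:
        coef //= factorial(xj)  # exact: every partial quotient is an integer
    return coef * prod(pj**xj for pj, xj in zip(p, x))

# Two outcomes: reduces to the binomial pmf with n = 5, y = 2, p = 0.4
two_outcome = multinomial_pmf((2, 3), (0.4, 0.6))
binomial = comb(5, 2) * 0.4**2 * 0.6**3

# One trial: the categorical distribution, pmf equals the probability of the outcome
categorical = multinomial_pmf((0, 1, 0), (0.2, 0.5, 0.3))

# The pmf sums to one over all count vectors with x1 + x2 + x3 = 3
probs = (0.2, 0.5, 0.3)
total = sum(
    multinomial_pmf(x, probs)
    for x in product(range(4), repeat=3)
    if sum(x) == 3
)
print(two_outcome, binomial, categorical, total)
```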

Continuous distributions
Normal (Gaussian) distribution
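We say that X has a Gaussian distribution with mean µ ∈ R and variance σ² > 0, denoted by N(µ, σ²), if its density function is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right) \quad \text{for } x \in \mathbb{R}.$$

The Gaussian distribution approximates many real phenomena: see Galton's board, the central limit theorem, etc.
Basic properties
• If X ∼ N(µ, σ²), then $\frac{X - \mu}{\sigma} \sim N(0, 1)$ with CDF Φ, so, in particular,

$$P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right) = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right).$$

• If $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum_i X_i \sim N\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$.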
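The standardization identity translates directly into code. The sketch below writes Φ with the error function, using the identity Φ(z) = (1 + erf(z/√2))/2, and computes an interval probability. The function names and the example values (µ = 10, σ = 2, interval [8, 12]) are illustrative assumptions.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """CDF of N(0, 1), via Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing both endpoints."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

# Probability of falling within one standard deviation of the mean
prob_one_sigma = normal_interval_prob(8.0, 12.0, mu=10.0, sigma=2.0)
print(prob_one_sigma)
```

Because only the standardized endpoints enter the computation, the result is the same for any µ and σ as long as the interval is [µ − σ, µ + σ].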


Useful exercise: compute the integral $\int_{-\infty}^{\infty} e^{-x^2}\,dx$ (use polar coordinates).
Gamma and Beta distributions
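They are based on the gamma function:

$$\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\,dx, \quad z > 0, \qquad \text{and} \qquad \Gamma(n) = (n - 1)! \quad \text{for } n \in \mathbb{N}^{\star}.$$

We say that X has a Gamma distribution with parameters α and β, denoted by X ∼ Gamma(α, β), if

$$f_X(x) = \frac{1}{\beta^{\alpha}\,\Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta}, \quad x > 0, \quad \text{where } \alpha, \beta > 0.$$

We say that X has a Beta distribution with parameters α and β, denoted by X ∼ Beta(α, β), if

$$f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad x \in (0, 1), \quad \text{where } \alpha, \beta > 0.$$

These and related distributions frequently appear in Bayesian statistics.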
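A short numerical sanity check of these definitions, assuming the scale parametrization of the Gamma density used above. The helper names, the parameter values and the crude midpoint rule are illustrative choices, not a recommended integration method.

```python
from math import exp, factorial, gamma

def gamma_pdf(x, alpha, beta):
    """Gamma(alpha, beta) density, with beta acting as a scale parameter."""
    return x ** (alpha - 1) * exp(-x / beta) / (beta**alpha * gamma(alpha))

def beta_pdf(x, alpha, beta):
    """Beta(alpha, beta) density on (0, 1)."""
    norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
    return norm * x ** (alpha - 1) * (1 - x) ** (beta - 1)

def midpoint_integral(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# Gamma(n) = (n - 1)! for small positive integers
factorial_gaps = [abs(gamma(n) - factorial(n - 1)) for n in range(1, 11)]

# Both densities should integrate to (approximately) one;
# the Gamma integral is truncated at 60, where the tail is negligible
gamma_mass = midpoint_integral(lambda x: gamma_pdf(x, 2.5, 1.5), 0.0, 60.0)
beta_mass = midpoint_integral(lambda x: beta_pdf(x, 2.0, 3.0), 0.0, 1.0)
print(max(factorial_gaps), gamma_mass, beta_mass)
```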

Other continuous distributions
The uniform distribution over [a, b], U(a, b)
The density function is $f_X(x) = \frac{1}{b - a}\, \mathbb{1}_{[a,b]}(x)$.

Exponential distribution, Exp(λ)
The density function is $f_X(x) = \lambda e^{-\lambda x}$ for λ > 0, x > 0.
The exponential distribution is used to model waiting times between occurrences in a Poisson process. It is a special case of the gamma distribution: it is a $\mathrm{Gamma}(1, \tfrac{1}{\lambda})$.

Chi-square distribution, $\chi^2_p$
If $Z_1, \dots, Z_p$ are independent N(0, 1) then

$$X = Z_1^2 + Z_2^2 + \cdots + Z_p^2 \sim \chi^2_p.$$

The natural number p is called the degrees of freedom. It is also a special case of the gamma distribution: it is a $\mathrm{Gamma}(\tfrac{p}{2}, 2)$.
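Both special cases can be illustrated in a few lines. The sketch below checks pointwise that the Gamma(1, 1/λ) density equals the exponential density, then simulates a sum of p squared standard normals and compares its sample mean with p, the mean αβ of a Gamma(p/2, 2) in the scale parametrization. The seed, sample size and parameter values are arbitrary assumptions.

```python
import random
from math import exp, gamma

def gamma_pdf(x, alpha, beta):
    """Gamma(alpha, beta) density, with beta acting as a scale parameter."""
    return x ** (alpha - 1) * exp(-x / beta) / (beta**alpha * gamma(alpha))

# Exponential as Gamma(1, 1/lam): compare the two densities at one point
lam, x = 0.8, 1.7
exp_gap = abs(gamma_pdf(x, 1.0, 1.0 / lam) - lam * exp(-lam * x))

# Chi-square with p degrees of freedom, simulated as a sum of squared N(0, 1) draws
rng = random.Random(0)
p, draws = 4, 50_000
samples = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)) for _ in range(draws)]
sample_mean = sum(samples) / draws  # should be close to alpha * beta = (p / 2) * 2 = p
print(exp_gap, sample_mean)
```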
Multivariate Gaussian: definition
Standard multivariate normal distribution
Let $Z_1, \dots, Z_p$ be independent identically distributed (iid) N(0, 1) variables. Their joint density is:

$$f(z_1, \dots, z_p) = \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right) = \frac{1}{(2\pi)^{p/2}} \exp\left(-\frac{1}{2} z^T z\right).$$

Let $Z = (Z_1, \dots, Z_p)$. Z is a random vector of p standard normal random variables with mean vector $\mu = 0_p$ and covariance matrix $\Sigma = I_p$. We say Z has a standard multivariate normal distribution and write:

$$Z \sim N_p(0_p, I_p)$$

[Figure: standard multivariate normal distribution]
Bivariate normal distribution

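We now define a case without independence for two variables. Let $\mu \in \mathbb{R}^2$ and

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

be positive definite. We say the vector X = (X1, X2) has a bivariate normal distribution, and write X ∼ N2(µ, Σ), if:

$$f(x_1, x_2) = \frac{1}{2\pi\, \sigma_1 \sigma_2 (1 - \rho^2)^{1/2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho\, \frac{x_1 - \mu_1}{\sigma_1}\, \frac{x_2 - \mu_2}{\sigma_2} \right] \right\}$$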
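The explicit bivariate formula can be compared numerically with the matrix form of the Gaussian density, $(2\pi)^{-p/2} (\det\Sigma)^{-1/2} \exp\{-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\}$ with p = 2. This is only a numerical illustration at one point, not a proof. The function names and parameter values are assumptions made for the sketch, and NumPy is assumed to be available.

```python
import numpy as np

def bivariate_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written with sigma_1, sigma_2 and rho."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    quad = (z1**2 + z2**2 - 2.0 * rho * z1 * z2) / (1.0 - rho**2)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * s1 * s2 * np.sqrt(1.0 - rho**2))

def matrix_form_pdf(x, mu, Sigma):
    """Gaussian density written with the covariance matrix Sigma."""
    d = x - mu
    p = len(mu)
    quad = d @ np.linalg.solve(Sigma, d)  # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / ((2.0 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.7, -0.4  # illustrative parameters
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])
explicit = bivariate_pdf(0.3, -1.1, mu1, mu2, s1, s2, rho)
matrix = matrix_form_pdf(np.array([0.3, -1.1]), np.array([mu1, mu2]), Sigma)
print(explicit, matrix)
```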

[Figure: contours of the bivariate normal distribution when ρ = 0]
[Figure: contours of the bivariate normal distribution when ρ ≈ 0.5]
[Figure: contours of the bivariate normal distribution when ρ → 1]

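1. Show that $\prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z_i^2\right) = \frac{1}{(2\pi)^{p/2}} \exp\left(-\frac{1}{2} z^T z\right)$.

2. Let X ∼ N2(µ, Σ) with µ = (µ1, µ2) and $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$. Show that

$$f(x_1, x_2) = \frac{1}{(2\pi)^{p/2}} (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right) \quad \text{with } p = 2.$$

3. Find a necessary and sufficient condition on ρ for $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$ to be positive definite.

4. Why are the contours of the bivariate normal density ellipses? What are the principal axes of these ellipses?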

Multivariate normal distribution (general case)

Let $\mu \in \mathbb{R}^p$ and Σ a symmetric positive definite p × p matrix. We say the vector $X = (X_1, X_2, \dots, X_p)$ has a (non-degenerate) multivariate normal distribution, and write X ∼ Np(µ, Σ), when it has density:

$$f(x) = \frac{1}{(2\pi)^{p/2}} (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$

Its characteristic function is:

$$\varphi_X(t) = \exp\left(i\, t^T \mu\right) \exp\left(-\frac{1}{2}\, t^T \Sigma\, t\right).$$

Its moment generating function is:

$$M_X(t) = \exp\left(t^T \mu\right) \exp\left(\frac{1}{2}\, t^T \Sigma\, t\right).$$

Multivariate Gaussian: linear transformations
Linear transformations
The multivariate normal distribution is closed under linear transformations. This is a defining property of the multivariate normal distribution:

If X ∼ Np(µ, Σ), then for any $A \in \mathbb{R}^{m \times p}$ of rank m (m ≤ p), $AX \sim N_m(A\mu, A\Sigma A^T)$.

Since Σ is symmetric positive definite, there exist an orthogonal matrix V and a diagonal matrix Λ with positive diagonal entries such that $\Sigma = V \Lambda V^T$. We define $\Sigma^{1/2} = V \Lambda^{1/2} V^T$ and $\Sigma^{-1/2} = V \Lambda^{-1/2} V^T$.

• If Z ∼ Np(0p, Ip) and $X = \mu + \Sigma^{1/2} Z$ then X ∼ Np(µ, Σ).

• If X ∼ Np(µ, Σ) then $\Sigma^{-1/2}(X - \mu) \sim N_p(0_p, I_p)$.
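The construction of $\Sigma^{1/2}$ and the first bullet can be sketched with NumPy. The function name `sqrt_and_inverse_sqrt`, the example µ and Σ, the seed and the sample size are all assumptions made for illustration; `numpy.linalg.eigh` is used because Σ is symmetric.

```python
import numpy as np

def sqrt_and_inverse_sqrt(Sigma):
    """Return Sigma^{1/2} and Sigma^{-1/2} from the decomposition Sigma = V Lambda V^T."""
    eigvals, V = np.linalg.eigh(Sigma)
    root = V @ np.diag(np.sqrt(eigvals)) @ V.T
    inv_root = V @ np.diag(1.0 / np.sqrt(eigvals)) @ V.T
    return root, inv_root

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, -0.3],
                  [0.2, -0.3, 1.5]])  # symmetric positive definite example
root, inv_root = sqrt_and_inverse_sqrt(Sigma)

# Draw Z ~ N_p(0, I) row by row and map each draw to X = mu + Sigma^{1/2} Z
rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 3))
X = mu + Z @ root  # root is symmetric, so each row equals mu + root @ z
sample_mean = X.mean(axis=0)
sample_cov = np.cov(X, rowvar=False)
print(sample_mean, sample_cov, sep="\n")
```

The sample mean and covariance should be close to µ and Σ, up to Monte Carlo error.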

Exercise: let X ∼ Np(µ, Σ). What is the distribution of $(X - \mu)^T \Sigma^{-1} (X - \mu)$?

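Example: Change of variables, Cholesky decomposition and multivariate normal distribution
Suppose that X has a standard multivariate normal distribution. Let Σ be a symmetric positive definite matrix and let $\Sigma = L^T L$ be its Cholesky decomposition. What is the distribution of $Y = L^T X$?

Since $x = L^{-T} y$, the Jacobian is $|J(y)| = |\det(L^{-T})| = 1/\det(L)$ and

$$f_X(L^{-T} y) = \frac{1}{(2\pi)^{p/2}} \exp\left\{-\frac{1}{2} (L^{-T} y)^T (L^{-T} y)\right\} = \frac{1}{(2\pi)^{p/2}} \exp\left\{-\frac{1}{2}\, y^T L^{-1} L^{-T} y\right\} = \frac{1}{(2\pi)^{p/2}} \exp\left\{-\frac{1}{2}\, y^T \Sigma^{-1} y\right\},$$

because $L^{-1} L^{-T} = (L^T L)^{-1} = \Sigma^{-1}$.

Noting that $\det(L) = (\det \Sigma)^{1/2}$ we get

$$f_Y(y) = f_X(L^{-T} y)\, |J(y)| = \frac{1}{(2\pi)^{p/2}} (\det \Sigma)^{-1/2} \exp\left(-\frac{1}{2}\, y^T \Sigma^{-1} y\right).$$

That is, Y ∼ Np(0p, Σ).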
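A simulation sketch of this example, assuming NumPy. Note the convention: `numpy.linalg.cholesky` returns a lower-triangular factor C with Σ = C Cᵀ, so the matrix L of the text (with Σ = LᵀL) corresponds to Cᵀ. The example Σ, the seed and the sample size are illustrative assumptions.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # symmetric positive definite example

C = np.linalg.cholesky(Sigma)  # lower triangular, Sigma = C @ C.T
L = C.T                        # the text's convention: Sigma = L^T L

# det(L) = (det Sigma)^{1/2}, the Jacobian factor used in the derivation
det_L = np.linalg.det(L)
sqrt_det_Sigma = np.sqrt(np.linalg.det(Sigma))

# Rows of X are draws of a standard normal vector; Y = L^T X becomes X @ L in row form
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 2))
Y = X @ L
sample_cov = np.cov(Y, rowvar=False)
print(det_L, sqrt_det_Sigma)
print(sample_cov)
```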
Multivariate Gaussian: marginal and conditional distributions
Marginal and conditional distributions
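The multivariate normal distribution is closed under marginalization and conditioning. Split X into two blocks $X = (X_A, X_B)$. Denote:

$$\mu = (\mu_A, \mu_B) \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}.$$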

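Marginal distribution

$$X_A \sim N_{|A|}(\mu_A, \Sigma_{AA}), \qquad X_B \sim N_{|B|}(\mu_B, \Sigma_{BB}),$$

where |A| and |B| are the dimensions of the vectors $X_A$ and $X_B$.

Conditional distribution

$$X_A \mid X_B = x_B \sim N_{|A|}\left(\mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (x_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}\right)$$

$$X_B \mid X_A = x_A \sim N_{|B|}\left(\mu_B + \Sigma_{BA} \Sigma_{AA}^{-1} (x_A - \mu_A),\ \Sigma_{BB} - \Sigma_{BA} \Sigma_{AA}^{-1} \Sigma_{AB}\right)$$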
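The conditional formulas can be wrapped in a small helper. This is a sketch assuming NumPy; the function name `conditional_gaussian`, its index-list arguments and the numeric example are hypothetical. The example uses σ1 = 2, σ2 = 1, ρ = 0.5, for which the conditional mean is µ1 + ρ(σ1/σ2)(x2 − µ2) and the conditional variance is σ1²(1 − ρ²).

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of X_A given X_B = x_b for X ~ N(mu, Sigma)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # gain = Sigma_AB Sigma_BB^{-1}, computed with a linear solve instead of an explicit inverse
    gain = np.linalg.solve(S_bb, S_ab.T).T
    cond_mean = mu[idx_a] + gain @ (np.asarray(x_b, dtype=float) - mu[idx_b])
    cond_cov = S_aa - gain @ S_ab.T  # S_ab.T equals Sigma_BA by symmetry
    return cond_mean, cond_cov

mu = [1.0, 3.0]
Sigma = [[4.0, 1.0],
         [1.0, 1.0]]  # sigma_1 = 2, sigma_2 = 1, rho = 0.5
cond_mean, cond_cov = conditional_gaussian(mu, Sigma, [0], [1], [5.0])
print(cond_mean, cond_cov)
```

For these values the helper returns a conditional mean of 3 and a conditional variance of 3, in agreement with the closed-form bivariate expressions.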

The covariance matrix Σ of a multivariate normal vector and its inverse $K = \Sigma^{-1}$ encode independence relations. K is called the precision or concentration matrix.
Pairwise independence

$$X_i \perp\!\!\!\perp X_j \iff \Sigma_{ij} = 0$$

Conditional independence

$$X_i \perp\!\!\!\perp X_j \mid X_{\text{rest}} \iff \Sigma_{ij} = \Sigma_{iR}\, \Sigma_{RR}^{-1}\, \Sigma_{Rj} \iff (\Sigma^{-1})_{ij} = 0,$$

where R denotes the indices of the remaining components.
The conditional independence properties of the precision matrix give rise to an entire family of models called Gaussian graphical models.
Block matrix inversion:

$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \implies M^{-1} = \begin{pmatrix} (M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\ -D^{-1} C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1} \end{pmatrix}$$

where $M/D := A - B D^{-1} C$ and $M/A := D - C A^{-1} B$ are called, respectively, the Schur complements of block D and of block A.
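Both statements can be illustrated numerically. In this sketch, which assumes NumPy, the precision matrix K is a made-up example with a zero in position (0, 2). The code checks that the corresponding covariance entry is nonzero while the conditional covariance given the remaining component vanishes, and that the block inversion formula agrees with a direct inverse. The helper name `block_inverse` is hypothetical.

```python
import numpy as np

# Illustrative precision matrix with K[0, 2] = 0 (an assumption for this sketch)
K = np.array([[2.0, -0.8, 0.0],
              [-0.8, 2.0, -0.5],
              [0.0, -0.5, 1.5]])
Sigma = np.linalg.inv(K)

# Conditional covariance of (X_0, X_2) given X_1, as a Schur complement of Sigma
idx, rest = [0, 2], [1]
S_II = Sigma[np.ix_(idx, idx)]
S_IR = Sigma[np.ix_(idx, rest)]
S_RR = Sigma[np.ix_(rest, rest)]
cond_cov = S_II - S_IR @ np.linalg.solve(S_RR, S_IR.T)

def block_inverse(A, B, C, D):
    """Inverse of [[A, B], [C, D]] via the Schur complement M/D = A - B D^{-1} C."""
    D_inv = np.linalg.inv(D)
    schur_inv = np.linalg.inv(A - B @ D_inv @ C)  # (M/D)^{-1}
    top = np.hstack([schur_inv, -schur_inv @ B @ D_inv])
    bottom = np.hstack([-D_inv @ C @ schur_inv,
                        D_inv + D_inv @ C @ schur_inv @ B @ D_inv])
    return np.vstack([top, bottom])

# Split Sigma into a 1x1 block A and a 2x2 block D and invert it blockwise
M_inv = block_inverse(Sigma[:1, :1], Sigma[:1, 1:], Sigma[1:, :1], Sigma[1:, 1:])
print(Sigma[0, 2], cond_cov[0, 1])
```

Here X_0 and X_2 are marginally dependent but conditionally independent given X_1, which is the pattern that Gaussian graphical models encode with a missing edge.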

1. Consider a bivariate normal with µ = (0, 2) and $\Sigma = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$. Find $E[X_1 \mid X_2]$ and $\operatorname{var}(X_1 \mid X_2)$.

2. Consider the covariance matrix $\Sigma = \begin{pmatrix} 1.98 & -1.4 & -0.14 \\ -1.40 & 2.0 & 0.20 \\ -0.14 & 0.2 & 1.02 \end{pmatrix}$ of a Gaussian vector X. Are there components of X that are independent? Are there components of X that are conditionally independent?

Wishart distribution
Can we define a distribution over the set of all p × p symmetric positive definite matrices? Yes, in the Gaussian case.
Let $X_1, \dots, X_n$ be i.i.d. $N_p(0_p, \Sigma)$. Then

$$Y := n S_n = \sum_{i=1}^{n} X_i X_i^T \quad \text{has a Wishart distribution } W_p(\Sigma, n).$$

Denote $K = \Sigma^{-1}$. Then the density of the Wishart distribution is

$$f(Y) = \frac{(\det K)^{n/2}}{2^{np/2}\, \Gamma_p(n/2)}\, (\det Y)^{\frac{n - p - 1}{2}}\, e^{-\frac{1}{2} \operatorname{trace}(KY)},$$

where $\Gamma_p$ is the multivariate gamma function. This density is well defined for any real n > p − 1.

We have E(Y) = nΣ.
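The constructive definition lends itself to a short simulation. The sketch below, which assumes NumPy, builds Wishart draws as sums of outer products of Gaussian vectors and compares their average with nΣ. The example Σ, the values n = 5 and p = 2, the seed and the number of replications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
n, p, reps = 5, 2, 20_000

C = np.linalg.cholesky(Sigma)  # lower triangular, Sigma = C @ C.T

# X[r] is an n x p matrix whose rows are iid N_p(0, Sigma) draws
X = rng.standard_normal((reps, n, p)) @ C.T

# Y[r] = sum_i X_i X_i^T = X[r]^T X[r], one Wishart W_p(Sigma, n) draw per replication
Y = X.transpose(0, 2, 1) @ X
mean_Y = Y.mean(axis=0)
print(mean_Y)
print(n * Sigma)
```

Each draw is symmetric, and positive definite with probability one since n ≥ p here; the average over replications should approach nΣ up to Monte Carlo error.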
