
Basic Facts for Quals Preparation

Kenneth Tay

Contents

1 General Probability/Statistics Definitions and Facts 2

2 Distributions 4
2.1 Arcsine Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.1 Standard Cauchy Distribtion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.2 General Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Chi-Squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6.1 Non-Central Chi-Squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8.1 Shifted Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.1 Non-Central F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.2 Doubly Non-Central F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.10 Gamma Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.11 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.12 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.13 Gumbel Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.14 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.15 Inverse Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.16 Laplace/Double Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.17 Logistic Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.18 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.19 Negative Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.20 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.20.1 Standard Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.20.2 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.20.3 Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.21 Pareto Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.22 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.23 T Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.23.1 Non-Central T Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.24 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.25 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Analysis Facts 22
3.1 Basic Topology and Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Measure Theory, Integration and Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Linear Algebra Facts 25


4.1 Properties of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Matrix Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 General Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Useful Inequalities 28

6 Useful Integrals 29

7 Other Basic Facts 30

1 General Probability/Statistics Definitions and Facts


• If X ∼ F (cdf), then F (X) ∼ Unif(0, 1).

• For any X, X′ i.i.d., Var X = (1/2) E[(X − X′)^2].

• (300B HW5) Var X = inf_t E[(X − t)^2].
• Conditional covariance decomposition: Cov X = Cov(E[X | Y ]) + E[Cov(X | Y )]. Since
E[Cov(X | Y )] ⪰ 0, thus Cov(E[X | Y ]) ⪯ Cov X.
• E[X^n] is the nth raw moment, sometimes denoted µ′_n. E[(X − µ)^n] is the nth central moment, sometimes denoted µ_n. E[(X − µ)^n]/σ^n is the normalized nth central moment.

• Skewness γ := E[(X − µ)^3]/σ^3, kurtosis κ := E[(X − µ)^4]/σ^4, excess kurtosis = κ − 3.

• If M_X and ϕ_X are the MGF and characteristic function of X respectively, then E[X^k] = M_X^(k)(0) = (1/i^k) ϕ_X^(k)(0) (if it exists).
• Fisher information: Let f_θ(x) be the probability density function of X conditional on the value of θ.

  – For all θ, E_θ[(∂/∂θ) log f_θ(X)] = 0.

  – Fisher information I(θ) := E_θ[((∂/∂θ) log f_θ(X))^2] = −E_θ[(∂^2/∂θ^2) log f_θ(X)].

  – (300B Lec 4) For a normal location family (i.e. only mean unknown), Fisher information is 1/σ^2.

  – (300B Lec 5) In an exponential family with density p_θ(x) = h(x) exp[θ^T T(x) − A(θ)], Fisher information I(θ) = ∇^2 A(θ).

• Information Inequality:

  – (TPE Thm 2.5.10 p120) Suppose {p_θ} is a family of densities w.r.t. a dominating measure µ and I(θ) > 0. Let δ be any statistic with E_θ(δ^2) < ∞ and such that the derivative of E_θ(δ) w.r.t. θ exists and can be differentiated under the integral sign. Then

        Var_θ(δ) ≥ [(∂/∂θ) E_θ(δ)]^2 / I(θ),

    with equality iff δ = a (∂/∂θ) log p_θ(x) + b, where a and b are constants (which may depend on θ).

  – (Stephen's version) Let δ be an estimator for g(θ), and let E_θ(δ) = g(θ) + b(θ). Then

        Var_θ(δ(X)) ≥ (b′(θ) + g′(θ))^2 / I(θ).
• For i.i.d. samples X_1, . . . , X_n, X̄ is an unbiased estimator of the mean and (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)^2 is an unbiased estimator of the variance.

• (TPE Prob 2.2.15 p133, 310A HW3 Qn 2) For i.i.d. bivariate samples, (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) is an unbiased estimator of Cov(X, Y).
• (310A HW8) Total variation distance for 2 probability measures is ∥µ − ν∥_TV = sup_{A∈F} |µ(A) − ν(A)|.

  – ∥·∥_TV is a metric.

  – ∥µ − ν∥_TV = (1/2) sup_{∥f∥_∞ ≤ 1} |E_µ(f) − E_ν(f)|.

  – If µ and ν have densities f_µ and f_ν w.r.t. some base measure λ, then ∥µ − ν∥_TV = (1/2) ∫ |f_µ(ω) − f_ν(ω)| λ(dω).

• (300B HW8) For 2 distributions P and Q with densities p and q w.r.t. µ (a small numerical sketch follows this list),

  – 2∥P − Q∥_TV = ∫ (p − q)_+ dµ + ∫ (q − p)_+ dµ.

  – ∥P − Q∥_TV = ∫ (p ∨ q) dµ − 1.

  – ∥P − Q∥_TV = 1 − ∫ (p ∧ q) dµ.
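
A minimal numerical sketch of the identities above for two discrete distributions, assuming numpy; the distributions p and q are illustrative, not from the notes:

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])   # density of P w.r.t. counting measure
    q = np.array([0.4, 0.4, 0.2])   # density of Q

    tv_half_l1 = 0.5 * np.abs(p - q).sum()                       # (1/2) ∫ |p − q| dµ
    tv_plus = 0.5 * (np.clip(p - q, 0, None).sum() + np.clip(q - p, 0, None).sum())
    tv_max = np.maximum(p, q).sum() - 1                           # ∫ (p ∨ q) dµ − 1
    tv_min = 1 - np.minimum(p, q).sum()                           # 1 − ∫ (p ∧ q) dµ

    print(tv_half_l1, tv_plus, tv_max, tv_min)                    # all four agree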

2 Distributions

2.1 Arcsine Distribution

Let X ∼ Arcsine(a, w). a ∈ R location parameter, w > 0 scale parameter.

• PDF p(x) = 1/(π √((x − a)(a + w − x))), x ∈ (a, a + w).

• CDF P(X ≤ x) = (2/π) arcsin(√((x − a)/w)).

• EX = Median = a + w/2, Var X = w^2/8.

• MGF E[e^{tX}] = e^{at} Σ_{n=0}^∞ [Π_{j=0}^{n−1} (2j + 1)/(2j + 2)] w^n t^n/n!.

• Moments (for a = 0): E[X^n] = w^n Π_{j=0}^{n−1} (2j + 1)/(2j + 2).

• The standard arcsine distribution (a = 0, w = 1) is Beta(1/2, 1/2).

• If X ∼ Arcsine(a, w), then c + dX ∼ Arcsine(c + ad, dw).

• If U ∼ Unif(0, 1), then a + w sin2 (πU/2) ∼ Arcsine(a, w).

2.2 Bernoulli Distribution

If X ∼ Ber(p), then P(X = 1) = p, P(X = 0) = 1 − p = q.

• EX = p, Var X = p(1 − p).

• MGF E[etX ] = q + pet .

• Characteristic function E[e^{itX}] = q + pe^{it}.

• Fisher information: 1/(p(1 − p)).

2.3 Beta Distribution


• Beta function: For a, b > 0, B(a, b) = ∫_0^1 u^{a−1} (1 − u)^{b−1} du.

• B(a, b) = B(b, a), B(a, 1) = 1/a.

• B(a, b) = Γ(a)Γ(b)/Γ(a + b).

• B(x, y) · B(x + y, 1 − y) = π/(x sin(πy)).

• For large x and large y, B(x, y) ∼ √(2π) x^{x−1/2} y^{y−1/2}/(x + y)^{x+y−1/2} (Stirling's formula). For large x and fixed y, B(x, y) ∼ Γ(y) x^{−y}.

Let X ∼ Beta(α, β), with α, β > 0. Support of X is [0, 1] or (0, 1).

• PDF p(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}.

• EX = α/(α + β), Var X = αβ/[(α + β)^2 (α + β + 1)], Mode = (α − 1)/(α + β − 2).

• MGF E[e^{tX}] = 1 + Σ_{k=1}^∞ [Π_{r=0}^{k−1} (α + r)/(α + β + r)] t^k/k!.

• Moments: If k ≥ 0, then E[X^k] = B(α + k, β)/B(α, β).

• E[1/X] = (α + β − 1)/(α − 1), E[1/(1 − X)] = (α + β − 1)/(β − 1), E[X/(1 − X)] = α/(β − 1).

• Beta(1, 1) =d U(0, 1).

• Beta(1/2, 1/2) is called the arcsine distribution, with PDF p(x) = 1/(π √(x(1 − x))).

• If X ∼ Beta(α, β), then 1 − X ∼ Beta(β, α).

• If X ∼ Beta(α, 1), then − log X ∼ Exp(α).

• Suppose X and Y are independent gamma RVs with X ∼ Γ(a, r) and Y ∼ Γ(b, r) (shape and rate). Let U = X + Y and V = X/(X + Y). Then U and V are independent, with U ∼ Γ(a + b, r) and V ∼ Beta(a, b). (Proof: Look at joint PDF of X and Y, change of variables to U and V.)

• If X ∼ F_{n,d}, then (n/d)X/(1 + (n/d)X) ∼ Beta(n/2, d/2). Conversely, if X ∼ Beta(n/2, d/2), then dX/(n(1 − X)) ∼ F_{n,d}.

• Let X_1, . . . , X_n be independent U(0, 1) variables. Then the kth order statistic X_(k) ∼ Beta(k, n − k + 1).

• As n → ∞, if X_n ∼ Beta(k, n), then nX_n →d Gam(k, 1).

2.4 Binomial Distribution

Let X ∼ Bin(n, p).

• PMF P(X = x) = C(n, x) p^x (1 − p)^{n−x}, x ∈ {0, 1, . . . , n}.

• CDF can be written in the form F(k) = P(X ≤ k) = [n!/((n − k − 1)! k!)] ∫_0^{1−p} x^{n−k−1} (1 − x)^k dx, k ∈ {0, 1, . . . , n}. (Proof: Integration by parts and induction.)

• EX = np, Var X = npq. Median is ⌊np⌋ or ⌈np⌉, mode is ⌊(n + 1)p⌋ or ⌊(n + 1)p⌋ − 1.

• MGF E[e^{tX}] = (1 − p + pe^t)^n.

• Characteristic function E[e^{itX}] = (1 − p + pe^{it})^n.

• Fisher information: n/(p(1 − p)). (Proof: Consider binomial as sum of n independent Bernoulli RVs.)
• Poisson Approximation: If npn → r ∈ (0, ∞) as n → ∞, then Bin(n, pn ) converges to Pois(r).

• Normal Approximation: General rule of thumb is np ≥ 5 and n(1 − p) ≥ 5.

• If X ∼ Bin(n, p) and Y | X ∼ Bin(X, q), then Y ∼ Bin(n, pq).


• If X ∼ Bin(a, p) and Y ∼ Bin(b, p) are independent, then P(X = k | X + Y = m) = C(a, k) C(b, m − k)/C(a + b, m), i.e. the conditional distribution is hypergeometric.

• (Theory Add Ex 12) If Y ∼ Bin(n, p), then E[Y/(1 + n − Y)] ≤ p/(1 − p).

2.5 Cauchy Distribution

2.5.1 Standard Cauchy Distribution


• If X has standard Cauchy distribution, PDF p(x) = 1/(π(1 + x^2)), CDF P(X ≤ x) = 1/2 + (1/π) arctan x.
• EX does not exist.

• The standard Cauchy distribution is the same as t1 .


• If Z, W are independent standard normal RVs, then Z/W has standard Cauchy distribution.

• If X has standard Cauchy distribution, so does 1/X.

2.5.2 General Cauchy Distribution

• If X has standard Cauchy distribution, then Y = a + bX has Cauchy distribution with location
parameter a and scale parameter b.
 
• PDF p(y) = b/(π[b^2 + (y − a)^2]), CDF P(Y ≤ y) = (1/π) arctan((y − a)/b) + 1/2.

• MGF does not exist. Characteristic function E[e^{itY}] = exp(ait − b|t|).

• If X_1, . . . , X_n are independent Cauchy variables with location and scale parameters a_i and b_i, then X_1 + · · · + X_n has Cauchy distribution with location and scale parameters Σ a_i and Σ b_i. In particular, if a_i = a and b_i = b, X̄ has the same distribution as the X_i's.

• (300B HW3) When b = 1 and a = θ, Fisher information is I(θ) = 1/2.

2.6 Chi-Squared Distribution

Let X ∼ χ2k , for k positive integer.

• PDF p(x) = [1/(2^{k/2} Γ(k/2))] x^{k/2−1} e^{−x/2}, for x ≥ 0. The PDF satisfies the differential equation 2x f′(x) + f(x)(−k + x + 2) = 0.

• EX = k, Median ≈ k(1 − 2/(9k))^3, Mode = max(k − 2, 0), Var X = 2k.

• MGF E[e^{tX}] = (1 − 2t)^{−k/2} for t < 1/2. Characteristic function E[e^{itX}] = (1 − 2it)^{−k/2}.

• Moments: If X ∼ χ²_n, then for k > −n/2, E[X^k] = 2^k Γ(n/2 + k)/Γ(n/2). For k ≤ −n/2, E[X^k] = ∞.

• In particular, if X ∼ χ²_n for n ≥ 3, then E[1/X] = 1/(n − 2).

• If Z1 , . . . , Zk are independent N (0, 1) RVs, then Z12 + · · · + Zk2 ∼ χ2k .

• If X_1, . . . , X_n are independent RVs with X_i ∼ χ²_{k_i}, then X_1 + · · · + X_n is χ² with k_1 + · · · + k_n degrees of freedom.

• If Y ∼ N(µ, Σ) ∈ R^p, where Σ is non-singular, then (Y − µ)^T Σ^{−1} (Y − µ) ∼ χ²_p.

• χ²_ν =d Gam(ν/2, θ = 2).

• If X ∼ Gam(k, b) (shape, scale), then Y = (2/b) X ∼ χ²_{2k}.

• χ²_2 =d Exp(1/2).

• Let f_n denote the density of χ²_n. Then f_{n+2}(x) = (x/n) f_n(x).

• If X ∼ F_{ν1,ν2}, then as ν2 → ∞, ν1 X →d χ²_{ν1}.

2.6.1 Non-Central Chi-Squared Distribution

Suppose X1 , . . . , Xn are independent RVs, where Xk ∼ N (µk , 1). Then Y = X12 + · · · + Xn2 is the non-central
chi-squared distribution with n degrees of freedom and non-centrality parameter λ = µ21 + · · · + µ2n .


• PDF of χ²_n(λ) is p(x; λ) = e^{−λ/2} Σ_{k=0}^∞ [(λ/2)^k/k!] x^{n/2−1+k} e^{−x/2}/(2^{k+n/2} Γ(k + n/2)).

• EY = n + λ, Var Y = 2(n + 2λ).

• MGF E[e^{tY}] = (1 − 2t)^{−n/2} exp(λt/(1 − 2t)), for t < 1/2. Characteristic function E[e^{itY}] = (1 − 2it)^{−n/2} exp(λit/(1 − 2it)).

• If J ∼ Pois(λ/2), then χ²_{k+2J} ∼ χ′²_k(λ).

2.7 Dirichlet Distribution

Let K ≥ 2 be the number of categories. Let X ∼ Dirichlet(α1 , . . . , αK ), αi > 0 for all i. Let α0 =
α1 + · · · + αK .

• Support of X is (x_1, . . . , x_K), where x_i ∈ (0, 1) and Σ_{i=1}^K x_i = 1.

• PDF p(x) = [1/B(α)] Π_{i=1}^K x_i^{α_i − 1}, where B(α) = Γ(α_1) · · · Γ(α_K)/Γ(α_1 + · · · + α_K).

• EX_i = α_i/α_0, mode for X_i is (α_i − 1)/(α_0 − K) (α_i > 1).

• Var X_i = α_i(α_0 − α_i)/[α_0²(α_0 + 1)], Cov(X_i, X_j) = −α_i α_j/[α_0²(α_0 + 1)] for i ≠ j.

• Moments E[Π_{i=1}^K X_i^{β_i}] = B(α + β)/B(α) = [Γ(α_0)/Γ(α_0 + β_0)] Π_{i=1}^K Γ(α_i + β_i)/Γ(α_i).

• The marginal distributions are beta distributions: X_i ∼ Beta(α_i, α_0 − α_i).

• If Y_i ∼ind Gam(α_i, θ), then V = Σ Y_i ∼ Gam(Σ α_i, θ), and (Y_1/V, . . . , Y_K/V) ∼ Dirichlet(α_1, . . . , α_K).

2.8 Exponential Distribution

Let X ∼ Exp(λ), λ > 0 (rate).

• PDF p(x) = λe−λx for x ≥ 0, CDF P(X ≤ x) = 1 − e−λx .


• EX = 1/λ, Median = (log 2)/λ, Mode = 0. Var X = 1/λ².

• Skewness is 2, excess kurtosis is 6.

• MGF E[e^{tX}] = λ/(λ − t) for t < λ. Characteristic function E[e^{itX}] = λ/(λ − it).

• Moments E[X^k] = k!/λ^k. (Proof: Integration by parts.)

• Fisher information: 1/λ².
• Memoryless property: For exponentially distributed X, P(X > s + t | X > s) = P(X > t) for all
s, t ≥ 0.

• If X1 , . . . , Xn are independent exponential RVs with rates λ1 , . . . , λn , then min{X1 , . . . , Xn } ∼ Exp(λ1 +


· · · + λn ). The maximum is NOT exponentially distributed.

• If X_1, . . . , X_n are independent Exp(1) random variables, then the kth order statistic X_(k) =d Σ_{i=1}^k E_i/(n − i + 1), where E_1, . . . , E_k are i.i.d. Exp(1). (Proof uses memoryless property, see Prob Qual 2013-1.)

• If X ∼ Exp(λ), then kX ∼ Exp(λ/k).

• If X ∼ Exp(1/2), then X ∼ χ22 .

• Exp(λ) = Gam(1, λ) (shape-rate parametrization).

• If U ∼ Unif(0, 1), then − log U ∼ Exp(1).

• If X ∼ Exp(λ), then e−X ∼ Beta(λ, 1).


• If X ∼ Exp(a) and Y ∼ Exp(b) are independent, then P(X < Y) = a/(a + b). Extending the set-up to n independent RVs with rates λ_1, . . . , λ_n: P(X_i < X_j for all j ≠ i) = λ_i/(Σ_{j=1}^n λ_j), and P(X_1 < · · · < X_n) = Π_{i=1}^n [λ_i/(Σ_{j=i}^n λ_j)].

2.8.1 Shifted Exponential Distribution

(Notation from TPE p18) Let X ∼ E(a, b) (−∞ < a < ∞, b > 0). a is shift parameter, b is scale parameter.

• PDF p(x) = (1/b) e^{−(x−a)/b} if x ≥ a, 0 otherwise. CDF P(X ≤ x) = 1 − exp[−(x − a)/b] for x ≥ a, 0 otherwise.

• EX = a + b, Var X = b².

• If X_1, . . . , X_n ∼iid E(a, b), then the smallest order statistic X_(1) ∼ E(a, b/n).

• (TPE Eg 1.6.24 p43) If X_1, . . . , X_n ∼iid E(a, b), let T_1 = X_(1) and T_2 = Σ[X_i − X_(1)]. Then T_1 and T_2 are independent (Basu's Theorem), with T_1 ∼ E(a, b/n) and (2/b) T_2 ∼ χ²_{2n−2}.

2.9 F Distribution

Let F ∼ Fn,m , n, m > 0.

• PDF p(x) = [1/(x B(n/2, m/2))] √((nx)^n m^m/(nx + m)^{n+m}) for x > 0, where B is the beta function. (PDF is also defined at x = 0 for n ≥ 2.)

• EX = m/(m − 2) for m > 2 (∞ if m ≤ 2), mode = [(n − 2)/n]·[m/(m + 2)] for n > 2, Var X = 2m²(n + m − 2)/[n(m − 2)²(m − 4)] for m > 4 (undefined for m ≤ 2, ∞ for 2 < m ≤ 4).

• MGF does not exist.

• kth moment only exists when 2k < m. In that case, E[X^k] = (m/n)^k Γ(n/2 + k)Γ(m/2 − k)/[Γ(n/2)Γ(m/2)].

• If X_1, . . . , X_n and Y_1, . . . , Y_m are independent N(0, 1) random variables, then F = [(X_1² + · · · + X_n²)/n]/[(Y_1² + · · · + Y_m²)/m] ∼ F_{n,m}, i.e. F_{n,m} =d (χ²_n/n)/(χ²_m/m).

• If X ∼ Beta(n/2, m/2), then mX/(n(1 − X)) ∼ F_{n,m}. Conversely, if X ∼ F_{n,m}, then (nX/m)/(1 + nX/m) ∼ Beta(n/2, m/2).

• If X ∼ F_{n,m}, then 1/X ∼ F_{m,n}.

• If X ∼ t_n, then X² ∼ F_{1,n}.

• If X, Y ∼ Exp(λ) independent, then X/Y ∼ F_{2,2}.

• As m → ∞, F_{n,m} →d χ²_n/n.

2.9.1 Non-Central F Distribution

This is defined as a non-central χ² divided by an independent central χ², i.e. [χ′²_{n_1}(λ)/n_1]/[χ²_{n_2}/n_2].

• EF = n_2(n_1 + λ)/[n_1(n_2 − 2)] if n_2 > 2, does not exist if n_2 ≤ 2. Var F = 2(n_2/n_1)² [(n_1 + λ)² + (n_1 + 2λ)(n_2 − 2)]/[(n_2 − 2)²(n_2 − 4)] if n_2 > 4, does not exist if n_2 ≤ 4.

2.9.2 Doubly Non-Central F Distribution

This is defined as the ratio of 2 independent non-central χ² distributions, i.e. [χ′²_{n_1}(λ_1)/n_1]/[χ′²_{n_2}(λ_2)/n_2].

2.10 Gamma Function
For z ∈ C with Re(z) > 0, the gamma function is defined by Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx.

• Γ(z + 1) = zΓ(z) for all z.


• Γ(z)Γ(1 − z) = π/sin(πz) for all z ∉ Z.

• If n is a positive integer, Γ(n) = (n − 1)!.


• Γ(1/2) = √π.

• If n is an even positive integer, Γ(n/2) = (n/2 − 1)!. If n is an odd positive integer, Γ(n/2) = [(n − 1)!/(2^{n−1} ((n − 1)/2)!)] √π.

• (Rudin Thm 8.18) log Γ is convex on (0, ∞).

• For α ∈ R, lim_{n→∞} Γ(n + α)/[Γ(n) n^α] = 1.

2.11 Gamma Distribution

The Gamma distribution is often parametrized in 2 different ways: X ∼ Gam(k, θ) (shape, scale) or X ∼ Gam(α, β) (shape, rate). All parameters are positive, and k = α, θ = 1/β represent the same distribution.

Rate interpretation: if X ∼ Gam(α, 1) (rate 1), then X/β ∼ Gam(α, β) (rate β). (BDA3 uses the shape-rate parametrization.)

• PDF p(x) = [1/(Γ(k)θ^k)] x^{k−1} e^{−x/θ} for x > 0.

• EX = kθ, Mode = (k − 1)θ for k ≥ 1, Var X = kθ².

• MGF E[e^{tX}] = (1 − θt)^{−k} for t < 1/θ. Characteristic function E[e^{itX}] = (1 − θit)^{−k}.

• Moments E[X^a] = θ^a Γ(a + k)/Γ(k) for a > −k.

• If X ∼ Gam(k, θ), then for any c > 0, cX ∼ Gam(k, cθ).

• Gam(1, λ) = Exp(λ).

• Gam(ν/2, θ = 2) = χ2ν . Conversely, if Q ∼ χ2ν and c > 0, then cQ ∼ Gam(ν/2, 2c).


• If X ∼ Gam(α, θ) is independent of Y ∼ Gam(β, θ), then X + Y ∼ Gam(α + β, θ) and X/(X + Y) ∼ Beta(α, β).

• If X_i ∼ Gam(α_i, 1) independent and S = X_1 + · · · + X_n, then (X_1/S, . . . , X_n/S) ∼ Dirichlet(α_1, . . . , α_n). (Proof: Compute joint density of (S, X_1/S, . . . , X_{n−1}/S) via change of variables. A sampling sketch follows.)
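
A minimal simulation sketch of this Gamma-to-Dirichlet construction, assuming numpy; the shape values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([2.0, 3.0, 5.0])              # illustrative shape parameters
    Y = rng.gamma(shape=alpha, scale=1.0, size=(100_000, 3))
    X = Y / Y.sum(axis=1, keepdims=True)           # (X_1/S, ..., X_K/S)

    print(X.mean(axis=0))                          # ≈ alpha / alpha.sum() = [0.2, 0.3, 0.5]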

2.12 Geometric Distribution

Let X ∼ Geom(p) (p is probability of success).

• P(X = k) is the probability that the 1st success occurs on the kth trial. P(X = k) = p(1 − p)k−1 ,
k ∈ {1, 2, . . . }.

• CDF P(X ≤ k) = 1 − (1 − p)k .


• EX = 1/p, Var X = q/p², Mode = 1.

• MGF E[e^{tX}] = pe^t/(1 − (1 − p)e^t) for t < −log(1 − p). Characteristic function E[e^{itX}] = pe^{it}/(1 − (1 − p)e^{it}).
• Memoryless property: For m, n ∈ N, P(X > n + m | X > m) = P(X > n).

• If X1 , . . . , Xr are independent Geom(p) RVs, then their sum has distribution NegBin(r, p).

• If X_1, . . . , X_r are independent Geom(p_i) RVs (possibly different parameters), then min X_i is Geometric with parameter p = 1 − Π_i (1 − p_i).

• Exponential approximation (Dembo Eg 3.2.5): Let Z_p ∼ Geom(p). Then as p → 0, pZ_p →d Exp(1).

2.13 Gumbel Distribution

Location parameter µ ∈ R, scale parameter β > 0. Standard Gumbel distribution has µ = 0, β = 1.

• PDF p(x) = (1/β) exp[−(z + e^{−z})], where z = (x − µ)/β.

• CDF P(X ≤ x) = exp(−e^{−(x−µ)/β}).

• EX = µ + βγ, where γ is the Euler–Mascheroni constant, Median = µ − β log log 2, Mode = µ, Var X = π²β²/6.
• MGF E[etX ] = Γ(1 − βt)eµt . Characteristic function E[eitX ] = Γ(1 − iβt)eiµt .

• The standard Gumbel distribution is the limit of the (suitably centered and scaled) maximum of n i.i.d. RVs whose distribution falls in a certain class; this class includes the exponential and normal distributions.

• (Gumbel Max Trick) Let ε_1, . . . , ε_k be i.i.d. standard Gumbel RVs. Then for any α_1, . . . , α_k, P(argmax_{1≤i≤k} (α_i + ε_i) = r) = e^{α_r}/Σ_{i=1}^k e^{α_i}. (A small simulation sketch follows.)
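
A minimal simulation sketch of the Gumbel max trick, assuming numpy; the α values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = np.array([1.0, 2.0, 0.5])                       # illustrative logits
    target = np.exp(alpha) / np.exp(alpha).sum()            # softmax probabilities

    eps = rng.gumbel(size=(200_000, alpha.size))            # i.i.d. standard Gumbel noise
    draws = np.argmax(alpha + eps, axis=1)                  # Gumbel-max sampling
    freq = np.bincount(draws, minlength=alpha.size) / draws.size

    print(freq, target)                                     # frequencies ≈ softmax(alpha)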

2.14 Hypergeometric Distribution

The hypergeometric distribution describes the probability of k successes in n draws, without replacement,
from a finite population of size N that contains exactly K successes (each draw is either a success or a
failure).

• Support is k ∈ N such that max(0, n + K − N) ≤ k ≤ min(K, n).

• PMF P(Y = k) = C(K, k) C(N − K, n − k)/C(N, n).

• EY = nK/N, Var Y = nK(N − K)(N − n)/[N²(N − 1)].

• Let X_i = 1 if draw i is a success, 0 otherwise. Then Y = Σ_{i=1}^n X_i.

• Let Y ∼ Hypergeom(n, N, K). E[Y^k] = (nK/N) E[(Z + 1)^{k−1}], where Z ∼ Hypergeom(n − 1, N − 1, K − 1). (Proof by combinatorial identities.)

• Let Y ∼ Hypergeom(n, N, K), and let p = K/N. If N and K are large compared to n, and p is not close to 0 or 1, then Y ≈ Binom(n, p) in distribution.

2.15 Inverse Gaussian Distribution

Let X ∼ IG(µ, λ) with µ, λ > 0.

• Support is x ∈ (0, ∞).


r "r  #   " r  #
−λ(x − µ)2
 
λ λ x 2λ λ x
• PDF p(x) = exp . CDF P(X ≤ x) = Φ − 1 +exp Φ − +1 .
2πx3 2µ2 x x µ µ x µ
"r #
9µ2 3µ µ3
• EX = µ, Mode = µ 1+ 2 − . Var X = .
2λ 2λ λ

1 2
• E[1/X] = 1/µ + 1/λ, Var (1/X) = + .
µλ λ2
" r !# " r !#
λ 2µ2 t λ 2µ2 it
• MGF E[etX ] = exp 1− 1− . Characteristic function E[eitX ] = exp 1− 1− .
µ λ µ λ

• If {Xt } is the Brownian motion with drift ν, i.e. Xt = νt + σWt , then


 for a2 fixed
 level α > 0, the first
α α
passage time is an inverse-Gaussian: Tα = inf{t > 0 : Xt = α} ∼ IG , .
ν σ2

• If X ∼ IG(µ, λ), then for k > 0, kX ∼ IG(kµ, kλ).


n  X X 2 
ind
X
• If Xi ∼ IG(µ0 wi , λ0 wi2 ), then Xi ∼ IG µ0 wi , λ0 wi .
i=1

2.16 Laplace/Double Exponential Distribution

Let X ∼ Laplace(µ, b), where b > 0 (scale).

• PDF p(x) = (1/(2b)) exp(−|x − µ|/b). CDF P(X ≤ x) = (1/2) exp((x − µ)/b) if x < µ, P(X ≤ x) = 1 − (1/2) exp(−(x − µ)/b) if x ≥ µ.
• Mean, median and mode are all µ. Var X = 2b2 . Skewness is 0, excess kurtosis is 3.
• MGF E[e^{tX}] = exp(µt)/(1 − b²t²) for |t| < 1/b. Characteristic function E[e^{itX}] = exp(iµt)/(1 + b²t²).
• Central moments E[(X − µ)n ] = 0 if n is odd, = bn n! if n is even.
• If X ∼ Laplace(µ, b), then kX + c ∼ Laplace(kµ + c, kb).
• If X ∼ Laplace(µ, b), then |X − µ|∼ Exp(1/b) (rate).
• If X, Y ∼ Exp(λ), then X − Y ∼ Laplace(0, 1/λ).
• If X_1, . . . , X_4 ∼iid N(0, 1), then X_1 X_2 − X_3 X_4 ∼ Laplace(0, 1).

• If X_1, . . . , X_n ∼iid Laplace(µ, b), then (2/b) Σ_{i=1}^n |X_i − µ| ∼ χ²_{2n}.

• If X, Y ∼iid Laplace(µ, b), then |X − µ|/|Y − µ| ∼ F_{2,2}.

• If X, Y ∼iid Unif(0, 1), then log(X/Y) ∼ Laplace(0, 1).

• If X ∼ Exp(λ) and Y ∼ Bernoulli(1/2) are independent, then X(2Y − 1) ∼ Laplace(0, 1/λ).

• If X ∼ Exp(λ) and Y ∼ Exp(ν) are independent, then λX − νY ∼ Laplace(0, 1).

• If V ∼ Exp(1) and Z ∼ N(0, 1) are independent, then X = µ + b√(2V) Z ∼ Laplace(µ, b).

2.17 Logistic Distribution

Let X ∼ Logistic(µ, s). µ location parameter, s > 0 scale parameter.

• PDF p(x) = exp(−(x − µ)/s)/[s(1 + exp(−(x − µ)/s))²], CDF P(X ≤ x) = 1/(1 + exp(−(x − µ)/s)).

• Mean, median and mode are all µ. Var X = s²π²/3.

• MGF E[e^{tX}] = e^{µt} B(1 − st, 1 + st) for st ∈ (−1, 1), where B is the beta function. Characteristic function E[e^{itX}] = e^{itµ} πst/sinh(πst).

• Central moments E[(X − µ)^n] = s^n π^n (2^n − 2)·|B_n|, where B_n is the nth Bernoulli number.

• If X ∼ Logistic(µ, s), then kX + c ∼ Logistic(kµ + c, ks).

• If U ∼ Unif(0, 1), then µ + s[log U − log(1 − U)] ∼ Logistic(µ, s).

• If X, Y ∼ Gumbel(α, β) independent, then X − Y ∼ Logistic(0, β).

• If X ∼ Exp(1), then µ + β log(e^X − 1) ∼ Logistic(µ, β).

• If X, Y ∼ Exp(1) independent, then µ − β log(X/Y) ∼ Logistic(µ, β).

2.18 Multinomial Distribution

Let X ∼ Multinom(n; p1 , . . . , ps ). n objects belonging to s classes, p1 + · · · + ps = 1.

• PMF P(X_1 = x_1, . . . , X_s = x_s) = [n!/(x_1! · · · x_s!)] p_1^{x_1} · · · p_s^{x_s} (for x_i's that sum to n).

• EX_i = np_i, Var X_i = np_i(1 − p_i), Cov(X_i, X_j) = −np_i p_j for i ≠ j. In matrix notation, Var X = n[diag(p) − pp^T]. (Proof: 310A HW1 Qn3.)

• MGF E[e^{t·X}] = (Σ_{i=1}^s p_i e^{t_i})^n. Characteristic function E[e^{it·X}] = (Σ_{j=1}^s p_j e^{it_j})^n.

• (TPE Eg 5.3 p24) Multinom(n; p1 , . . . , ps ) is an (s − 1)-dimensional exponential family.

• The Xi ’s have marginal distribution Binom(n, pi ).

• Poisson-Multinomial connection: Suppose the X_i's are independent RVs with X_i ∼ Pois(λ_i). Let S = X_1 + · · · + X_n, λ = λ_1 + · · · + λ_n. Then

      (X_1, . . . , X_n) | S ∼ Multinom(S, (λ_1/λ, . . . , λ_n/λ)).

  Conversely, suppose that N ∼ Pois(λ) and conditional on N = n, X = (X_1, . . . , X_k) ∼ Multinom(n, (p_1, . . . , p_k)). Then the X_i's are marginally independent and Poisson-distributed with parameters λp_1, . . . , λp_k. (A small simulation sketch follows.)
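
A minimal simulation sketch of the converse direction (Poisson thinning), assuming numpy; the rate and class probabilities are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    lam, p = 10.0, np.array([0.2, 0.3, 0.5])           # illustrative total rate and class probabilities

    N = rng.poisson(lam, size=100_000)                  # N ~ Pois(lam)
    X = np.array([rng.multinomial(n, p) for n in N])    # X | N ~ Multinom(N, p)

    print(X.mean(axis=0), lam * p)                      # component means ≈ lam * p
    print(np.corrcoef(X.T).round(3))                    # off-diagonal correlations ≈ 0 (independence)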

2.19 Negative Binomial Distribution

The negative binomial is parametrized in a number of ways.

BDA3/305C Notes

Let Y ∼ NegBin(α, β), where α > 0 (shape), β > 0 (rate).

• PMF is p(y) = C(α + y − 1, y) (β/(β + 1))^α (1/(β + 1))^y, y = 0, 1, . . . .

• EY = α/β, Var Y = α(β + 1)/β².

• The negative binomial is a mixture of Poisson distributions whose rates follow the gamma distribution (shape-rate): NegBin(y | α, β) = ∫ Pois(y | θ) Gamma(θ | α, β) dθ.

TSH

Let Y ∼ NegBin(p, m), where p is the probability of success, and m is the number of successes to be obtained.

• If we let m = α and p = β/(β + 1), we get BDA's parametrization.

• Interpretation: If Y + m independent trials are needed to obtain m successes (and each trial has success probability p), then Y ∼ NegBin(p, m).

• PMF is p(y) = C(m + y − 1, y) p^m (1 − p)^y, y = 0, 1, . . . .

• EY = m(1 − p)/p, Var Y = m(1 − p)/p².

• MGF E[e^{tY}] = (p/(1 − (1 − p)e^t))^m for t < −log(1 − p). Characteristic function E[e^{itY}] = (p/(1 − (1 − p)e^{it}))^m.

• Fisher information: m/(p²(1 − p)).
• Y is the sum of m independent Geom(p) random variables.

Agresti

Let Y ∼ NegBin(k, µ) or Y ∼ NegBin(γ, µ), where γ = 1/k (dispersion parameter). k > 0, µ > 0.

• If we let µ = α/β and k = α, we get BDA's parametrization.

• PMF p(y) = [Γ(y + k)/(Γ(k)Γ(y + 1))] (k/(µ + k))^k (µ/(µ + k))^y, y = 0, 1, . . . .
• EY = µ, Var Y = µ + γµ2 .
• As γ → 0, the negative binomial converges to the Poisson.

2.20 Normal Distribution

Let Z ∼ N (µ, σ 2 ).

• PDF p(z) = (1/√(2πσ²)) e^{−(z−µ)²/(2σ²)}.

• MGF E[e^{tZ}] = exp(µt + σ²t²/2). Characteristic function E[e^{itZ}] = exp(iµt − σ²t²/2).

• Fisher information (for (µ, σ²)): the diagonal matrix diag(1/σ², 1/(2σ⁴)).
• Central moments (i.e. µ = 0): for non-negative integer p,

      E[X^p] = 0 if p is odd, and E[X^p] = σ^p (p − 1)!! if p is even.

      E[|X|^p] = σ^p (p − 1)!! · (√(2/π) if p odd, 1 if p even) = σ^p 2^{p/2} Γ((p + 1)/2)/√π.

• E[1/X] does not exist.
• If X and Y are jointly normal, then uncorrelatedness is the same as independence.
• (TPE Eg 5.16 p31) Stein’s identity for the normal: If X ∼ N (µ, σ 2 ) and g a differentiable function
with E|g ′ (X)|< ∞, then E[g(X)(X − µ)] = σ 2 Eg ′ (X).
• Variance stabilizing transformation: If X ≈ N(θ, αθ(1 − θ)), then taking Y = (1/√α) arcsin(2X − 1), we have Y ≈ N((1/√α) arcsin(2θ − 1), 1).
• Cramér’s decomposition theorem: If X1 and X2 are independent and X1 + X2 is normally dis-
tributed, then both X1 and X2 must be normally distributed.
• Marcinkiewicz theorem: If a random variable X has characteristic function of the form φX (t) =
eQ(t) where Q is a polynomial, then Q can be at most a quadratic polynomial.
• If X and Y are independent N (µ, σ 2 ) RVs, then X + Y and X − Y are independent and identically
distributed. Bernstein’s theorem asserts the converse: If X and Y are independent s.t. X + Y and
X − Y are also independent, then X and Y must have normal distributions.
• KL divergence: If X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²), then

      D_KL(X_1 ∥ X_2) = (µ_1 − µ_2)²/(2σ_2²) + (1/2)[σ_1²/σ_2² − 1 − log(σ_1²/σ_2²)].

  (A small numerical check follows this list.)

• Hellinger distance: If X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²), then

      d²_hel(X_1, X_2) = 1 − √(2σ_1σ_2/(σ_1² + σ_2²)) exp(−(µ_1 − µ_2)²/(4(σ_1² + σ_2²))).
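
A minimal Monte Carlo check of the Gaussian KL formula above, assuming numpy and scipy; the parameter values are illustrative:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0          # illustrative parameters

    kl_closed = (mu1 - mu2)**2 / (2 * s2**2) + 0.5 * (s1**2/s2**2 - 1 - np.log(s1**2/s2**2))

    x = rng.normal(mu1, s1, size=1_000_000)         # samples from X_1
    kl_mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))

    print(kl_closed, kl_mc)                         # the two values should roughly agree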

2.20.1 Standard Normal Distribution

• For Z ∼ N (0, 1), ϕ′ (x) = −xϕ(x) for all x ∈ R.


iid
• If Z1 , . . . , Zn ∼ N (0, 1), then Z12 + · · · + Zn2 ∼ χ2n .
• E|Z| = √(2/π).

• Box-Muller method: If U, V ∼iid U[0, 1], then X = √(−2 log U) cos(2πV), Y = √(−2 log U) sin(2πV) are independent N(0, 1) random variables. (A short sketch appears after this list.)
iid
• (Owen Section 3.2.4) If Z1 , . . . , Zn ∼ N (µ, σ 2 ), then Z̄ is independent of Z1 − Z̄, . . . , Zn − Z̄.
• (Dembo Ex 2.2.24, 310A Lec 9, 300C Lec 2) Approximating tail of a Gaussian: Let Z ∼ N (0, 1).
Then for any x > 0,
      (1/x − 1/x³) e^{−x²/2}/√(2π) ≤ P{Z > x} ≤ (1/x) e^{−x²/2}/√(2π).

  For large x, 1 − Φ(x) ∼ e^{−x²/2}/(x√(2π)) = φ(x)/x.

• (300C Lec 2) Holding α fixed, for large n,

      |z(α/n)| ≈ √(2 log n) (1 − (log log n)/(4 log n)) ≈ √(2 log n).

• (Dembo Ex 2.2.24, 300C Lec 2) Let Z1 , Z2 , . . . be independent N (0, 1) random variables. Then with
probability 1,
      lim_{n→∞} (max_{i≤n} Z_i)/√(2 log n) = 1.
• (300C, Lec 24) For large λ,
E[Z 2 ; |Z|> λ] ≈ 2λϕ(λ).
• If Z_1 and Z_2 are independent standard normal variables, then Z_1/Z_2 has the standard Cauchy distribution.

• Φ(x) is log-concave (i.e. log Φ(x) is concave), so (d/dx) log Φ(x) = ϕ(x)/Φ(x) is decreasing in x.
• If Gkθ is the CDF of the truncated normal Z ∼ N (θ, 1) | Z ≤ k, then Gkθ (t) is a decreasing function of
θ.
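
A minimal sketch of the Box-Muller method referenced above, assuming numpy:

    import numpy as np

    rng = np.random.default_rng(0)
    U, V = rng.uniform(size=(2, 500_000))

    R = np.sqrt(-2 * np.log(U))              # radius
    X = R * np.cos(2 * np.pi * V)            # first N(0, 1) variate
    Y = R * np.sin(2 * np.pi * V)            # second, independent N(0, 1) variate

    print(X.mean(), X.std(), np.corrcoef(X, Y)[0, 1])   # ≈ 0, 1, 0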

2.20.2 Multivariate Normal Distribution

Let Z ∼ N (µ, Σ), with µ ∈ Rd , Σ ∈ Rd×d , and Σ positive semi-definite.

• Support Z ∈ µ + span(Σ) ⊆ Rd .
 
• PDF exists only when Σ is positive definite: p(z) = [1/√((2π)^d |Σ|)] exp(−(1/2)(z − µ)^T Σ^{−1} (z − µ)).

• MGF E[e^{t·Z}] = exp(µ^T t + (1/2) t^T Σt). Characteristic function E[e^{it·Z}] = exp(iµ^T t − (1/2) t^T Σt).

• (Dembo Ex 3.5.20) A random vector X has the multivariate normal distribution iff (Σ_{i=1}^d a_{ji} X_i, j = 1, . . . , m) is a Gaussian random vector for any non-random coefficients a_{11}, . . . , a_{md} ∈ R. (Also holds when m = 1.)
     
• Let (X_1, X_2) ∼ N((µ_1, µ_2), [[Σ_11, Σ_12], [Σ_21, Σ_22]]) (X_1 and X_2 can be vectors). Then the distribution of X_1 given X_2 = a is multivariate normal:

      X_1 | X_2 = a ∼ N(µ_1 + Σ_12 Σ_22^{−1}(a − µ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_21).

  The covariance matrix above is called the Schur complement of Σ_22 in Σ. Note that it does not depend on a. (To prove this result, use the fact that X_2 and X_1 − Σ_12 Σ_22^{−1} X_2 are independent. A numerical sketch follows this list.)

• Using the notation above, X1 and X2 are independent iff Σ12 = 0. (Proof: Factor the characteristic
function.)
• Suppose Y ∼ N (µ, Σ) and Σ−1 exists. Then (Y − µ)T Σ−1 (Y − µ) ∼ χ2n . (Proof: Σ = P T ΛP ,
Z ∼ Λ−1/2 P (Y − µ). Then Z ∼ N (0, In ).)
• Assume that Z ∼ N (0, Σ) is d-dimensional. Then Z T Σ−1 Z ∼ χ2d . More generally, (Z −µ)T Σ† (Z −µ) ∼
χ2rank(Σ) , where Σ† is the pseudo-inverse of Σ.

• If Z ∼ N (0, I) and Q orthogonal (i.e. QQT = I), then QZ ∼ N (0, I).


• If φ is the density for N (0, I), then ∂i φ(x − µ) = −(xi − µi )φ(x − µ).
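
A minimal numerical sketch of the conditional-distribution formula above, assuming numpy; the mean and covariance are illustrative:

    import numpy as np

    mu = np.array([1.0, 2.0, 0.0])                    # (mu_1, mu_2) with X_1 = first coordinate
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])               # illustrative positive definite matrix

    i1, i2 = [0], [1, 2]                               # split into X_1 and X_2
    S11, S12 = Sigma[np.ix_(i1, i1)], Sigma[np.ix_(i1, i2)]
    S21, S22 = Sigma[np.ix_(i2, i1)], Sigma[np.ix_(i2, i2)]

    a = np.array([1.5, -0.5])                          # conditioning value X_2 = a
    cond_mean = mu[i1] + S12 @ np.linalg.solve(S22, a - mu[i2])
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)   # Schur complement of S22

    print(cond_mean, cond_cov)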

2.20.3 Bivariate Normal Distribution
 
• In the bivariate normal case Z = (X, Y), letting ρ be the correlation between X and Y, we can write the PDF as

      p(x, y) = [1/(2πσ_X σ_Y √(1 − ρ²))] exp{−[1/(2(1 − ρ²))] [(x − µ_X)²/σ_X² + (y − µ_Y)²/σ_Y² − 2ρ(x − µ_X)(y − µ_Y)/(σ_X σ_Y)]}.

• (300A Lec 16) Under ρ = 0 (i.e. independence), the sample correlation satisfies ρ̂√(n − 2)/√(1 − ρ̂²) ∼ t_{n−2}.

• Let (X, Y) ∼ N((0, 0), [[1, ρ], [ρ, 1]]). Then Y = ρX + √(1 − ρ²) ε for ε ∼ N(0, 1) independent of X.

2.21 Pareto Distribution

Let X ∼ Pareto(a, b). a > 0 shape, b > 0 scale.


• PDF p(x) = ab^a/x^{a+1} for x ≥ b. CDF P(X ≤ x) = 1 − (b/x)^a.

• EX = ∞ if a ≤ 1, EX = ab/(a − 1) if a > 1. Median is b·2^{1/a}, Mode is b.

• Var X = ∞ if a ≤ 2, Var X = ab²/[(a − 1)²(a − 2)] if a > 2.

• MGF and characteristic function use the incomplete gamma function.

• Moments E[X^n] = ab^n/(a − n) if 0 < n < a, = ∞ if n ≥ a.

• Fisher information: the 2×2 matrix with diagonal entries a/b² and 1/a², and off-diagonal entries −1/b.

• If X ∼ Pareto(a, b) and c > 0, then cX ∼ Pareto(a, bc).

• If X ∼ Pareto(a, b) and n > 0, then X n ∼ Pareto(a/n, bn ).

2.22 Poisson Distribution

Let X ∼ Pois(λ).

• PMF P(X = k) = λ^k e^{−λ}/k!.
• EX = λ, Var X = λ.

• MGF E[e^{tX}] = exp[λ(e^t − 1)]. Characteristic function E[e^{itX}] = exp[λ(e^{it} − 1)].

• For k = 1, 2, . . . , E[X(X − 1) · · · (X − k + 1)] = λ^k.

• Fisher information: 1/λ.

• Let X_1 ∼ Pois(λ_1) and X_2 ∼ Pois(λ_2) be independent. Then X_1 + X_2 ∼ Pois(λ_1 + λ_2), and X_1 | X_1 + X_2 = k ∼ Binom(k, λ_1/(λ_1 + λ_2)).

• Poisson-Multinomial connection: Suppose the X_i's are independent RVs with X_i ∼ Pois(λ_i). Let S = X_1 + · · · + X_n, λ = λ_1 + · · · + λ_n. Then

      (X_1, . . . , X_n) | S ∼ Multinom(S, (λ_1/λ, . . . , λ_n/λ)).

  Conversely, suppose that N ∼ Pois(λ) and conditional on N = n, X = (X_1, . . . , X_k) ∼ Multinom(n, (p_1, . . . , p_k)). Then the X_i's are marginally independent and Poisson-distributed with parameters λp_1, . . . , λp_k.

2.23 T Distribution

Let X ∼ tν , ν > 0. (ν can be any positive real number.)

• PDF p(x) = [Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))] (1 + x²/ν)^{−(ν+1)/2}.

• EX = 0 (if ν > 1, otherwise undefined); median and mode are 0. Var X = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, undefined otherwise.

• MGF is undefined; the characteristic function involves the modified Bessel function of the second kind.

• When ν > 1,

      E[X^k] = 0 if k is odd, 0 < k < ν,
      E[X^k] = ν^{k/2} Γ((k + 1)/2) Γ((ν − k)/2)/(√π Γ(ν/2)) if k is even, 0 < k < ν.

  Moments of order ν or higher don't exist.

• As n → ∞, tn → N (0, 1).

• If Z ∼ N(0, 1) and X ∼ χ²_n with Z and X independent, then Z/√(X/n) ∼ t_n.

• Let X_1, . . . , X_n ∼iid N(µ, σ²). Let X̄ be the sample mean and S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² be the sample variance. Then (X̄ − µ)/(S/√n) ∼ t_{n−1}.

• t_1 =d standard Cauchy distribution.

• If T ∼ tν , then T 2 ∼ F1,ν .

2.23.1 Non-Central T Distribution

• This is obtained as t′_n(λ) = N(λ, 1)/√(χ²_n/n), with the numerator and denominator independent.

• The non-central t-distribution is asymmetric unless λ = 0. The right tail is heavier than the left when λ > 0, and vice versa.

• If T ∼ t′_ν(λ), then T² ∼ F′_{1,ν}(λ²) (non-central F).

• As n → ∞, t′_n(λ) →d N(λ, 1).

2.24 Uniform Distribution

Let U ∼ Unif(a, b).

• PDF p(x) = [1/(b − a)] 1_{x∈[a,b]}. CDF P(X ≤ x) = (x − a)/(b − a) for a ≤ x ≤ b.

• EX = Median = (a + b)/2, Var X = (b − a)²/12.

• MGF E[e^{tU}] = (e^{tb} − e^{ta})/(t(b − a)) for t ≠ 0, E[e^{tU}] = 1 if t = 0. Characteristic function E[e^{itU}] = (e^{itb} − e^{ita})/(it(b − a)).

• If X_1, . . . , X_n ∼iid Unif(0, 1), then the kth order statistic X_(k) ∼ Beta(k, n + 1 − k).

• Unif(0, 1) = Beta(1, 1).

• If U ∼ Unif(0, 1), then U n ∼ Beta(1/n, 1).

• Suppose F is a distribution function for a probability distribution on R, and F^{−1} is the corresponding quantile function, i.e. F^{−1}(u) = inf{x : F(x) ≥ u}. Then X = F^{−1}(U) has distribution function F (inverse-transform sampling; a short sketch follows).
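
A minimal sketch of inverse-transform sampling, assuming numpy and using the exponential quantile F^{−1}(u) = −log(1 − u)/λ as an illustrative example:

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0                                 # illustrative rate
    U = rng.uniform(size=500_000)
    X = -np.log(1 - U) / lam                  # F^{-1}(U) for the Exp(lam) CDF

    print(X.mean(), X.var())                  # ≈ 1/lam and 1/lam**2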

2.25 Weibull Distribution

Let X ∼ Weibull(λ, k), with λ > 0 (scale) and k > 0 (shape).

• Support is x ≥ 0.
• PDF p(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}. CDF P(X ≤ x) = 1 − e^{−(x/λ)^k}.

• EX = λΓ(1 + 1/k), median = λ(log 2)^{1/k}, mode = λ((k − 1)/k)^{1/k} if k > 1, 0 otherwise.

• Var X = λ²[Γ(1 + 2/k) − (Γ(1 + 1/k))²].

• MGF E[e^{tX}] = Σ_{n=0}^∞ (t^n λ^n/n!) Γ(1 + n/k) for k ≥ 1. Characteristic function E[e^{itX}] = Σ_{n=0}^∞ ((it)^n λ^n/n!) Γ(1 + n/k).

• The Weibull is the distribution of a random variable W such that (W/λ)^k ∼ Exp(1). Going in the other direction, if X ∼ Exp(1), then λX^{1/k} ∼ Weibull(λ, k).

• If U ∼ Unif(0, 1), λ[− log U ]1/k ∼ Weibull(λ, k).

• As k → ∞, Weibull(λ, k) converges to a point mass at λ.

3 Analysis Facts

3.1 Basic Topology and Spaces

• Banach space: A complete vector space with a norm.

• Hilbert space: A real or complex inner product space which is complete w.r.t. distance induced by
the inner product.

• Compact metric space: A metric space X is compact if every open cover of X has a finite subcover.

• Sequentially compact: A metric space X is sequentially compact if every sequence of points in X


has a convergent subsequence converging to a point in X.

• Totally bounded: A metric space (M, d) is totally bounded iff for every ε > 0, there exists a finite
collection of open balls in M of radius ε whose union covers M .

• Heine-Borel Theorem for arbitrary metric space: A subset of a metric space is compact iff it is
complete and totally bounded.

• Borel-Lebesgue Theorem: For a metric space (X, d), the following are equivalent:

1. X is compact.
2. Every collection of closed subsets of X with the finite intersection property (i.e. every finite
subcollection has non-empty intersection) has non-empty intersection.
3. X is sequentially compact.
4. X is complete and totally bounded.

• (Stein Thm 2.4) Every Hilbert space has an orthonormal basis.

• (Rudin Thm 2.35) Closed subsets of compact sets are compact. If F is closed and K is compact, then
F ∩ K is compact.

• (Rudin Thm 4.19) If f is a continuous mapping from compact metric space X to metric space Y , then
f is uniformly continuous on X.

• If T is a compact set and f : T 7→ R is continuous, then f is bounded.

• Over a compact subset of the real line, continuously differentiable =⇒ Lipschitz-continuous =⇒


α-Hölder continuous (α > 0) =⇒ uniformly continuous =⇒ continuous =⇒ RCLL =⇒ separable.

3.2 Measure Theory, Integration and Differentiation
• (Stein Cor 3.5) Gδ sets are countable intersections of open sets, while Fσ sets are countable unions of
closed sets. A subset E ⊂ Rd is (Lebesgue)-measurable (i) iff E differs from a Gδ by a set of measure
zero, (ii) iff E differs from an Fσ by a set of measure zero.

• (Stein Thm 4.4) Egorov’s Theorem: Suppose {fk } is a sequence of measurable functions defined on
a measurable set E with m(E) < ∞, and assume fk → f a.e. on E. Given ε > 0, we can find closed
set Aε ⊂ E such that m(E − Aε ) < ε and fk → f uniformly on Aε .
• Suppose f : R^n → R^m. The Jacobian is an m × n matrix with entries J_ij = ∂f_i/∂x_j.
• Change of variables: See Ross p279.

• Lebesgue Differentiation Theorem: If f is an integrable function, then for almost every x, f(x) = lim_{r→0} (1/r) ∫_x^{x+r} f(y) dy.

• Mean Value Theorem: If f is continuous on [a, b] and differentiable on (a, b), then there exists
c ∈ (a, b) such that f′(c) = (f(b) − f(a))/(b − a) (i.e. f(b) = f(a) + (b − a)f′(c)).
• Differentiation under the integral sign (Leibniz rule): Let f(x, t) be such that f(x, t) and its partial derivative f_x(x, t) are continuous in t and x in some region of the (x, t) plane including a(x) ≤ t ≤ b(x), x_0 ≤ x ≤ x_1. Also suppose that a(x) and b(x) are both continuous and both have continuous derivatives for x_0 ≤ x ≤ x_1. Then, for x_0 ≤ x ≤ x_1,

      (d/dx) ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) · b′(x) − f(x, a(x)) · a′(x) + ∫_{a(x)}^{b(x)} (∂/∂x) f(x, t) dt.

• Inverse Function Theorem: For functions of a single variable, if f is a continuously differentiable


function with non-zero derivative at point a, then f is invertible in a neighborhood of a, the inverse is
continuously differentiable, and
      (f^{−1})′(f(a)) = 1/f′(a).

For functions of more than one variable: Let f : open set of Rn → Rn . If F is continuously differen-
tiable and its total derivative of F is invertible at a point p, then an inverse function to F exists in
some neighborhood of F (p). F −1 is also continuously differentiable, with

JF −1 (F (p)) = [JF (p)]−1 .

• If f : Rn 7→ R is L-Lipschitz, i.e. |f (x) − f (y)|≤ L|x − y|, and is differentiable, then for all x ∈ Rn ,
∥∇f (x)∥≤ L.
• (Durrett Ex 1.4.4) Riemann-Lebesgue Lemma: If g is integrable, then lim_{n→∞} ∫ g(x) cos(nx) dx = 0.

3.3 Approximations
• Stirling's Approximation for factorial: n! ∼ √(2πn) (n/e)^n (i.e. the ratio goes to 1 as n → ∞).

• Stirling's Approximation for gamma function: Γ(z) = √(2π/z) (z/e)^z [1 + O(1/z)] for large z, i.e. Γ(z + 1) ∼ √(2πz) (z/e)^z.

• Volume of ball in n-dimensional space: The volume of a ball with radius r in n-dimensional space is ∼ (1/√(nπ)) (2πe/n)^{n/2} r^n.
• Weierstrass Approximation Theorem: If h is continuous on [0, 1], then there exist polynomials p_n such that sup_{x∈[0,1]} |h(x) − p_n(x)| → 0 as n → ∞.

• (Rudin Thm 5.15) Taylor’s Theorem: Suppose f is a real function on [a, b], n a positive integer,
f (n−1) continuous on [a, b], f (n) (t) exists for every t ∈ (a, b). Let α, β be distinct points of [a, b]. Then,
there exists a point x between α and β such that
      f(β) = Σ_{k=0}^{n−1} [f^{(k)}(α)/k!] (β − α)^k + [f^{(n)}(x)/n!] (β − α)^n.

• Newton's method: Say we are trying to find the solution to f(x) = 0. If our current guess is x_k, one step of Newton's method gives us our next guess: x_{k+1} = x_k − f(x_k)/f′(x_k). (A short sketch appears after this list.)
• (310A Lec 9) For small x, log(1 − x) ∼ −x.
• As n → ∞, Σ_{k=1}^n 2 log(√k log k) ∼ n log n. (For proof, see 310A HW9.)

• (310B Lec 8) For large N, Σ_{k=j+1}^N 1/(k − 1) ≈ log(N/j).
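
A minimal sketch of the Newton iteration described above (plain Python; the equation x² − 2 = 0 is an illustrative example):

    def newton(f, fprime, x0, n_steps=10):
        """Run a fixed number of Newton steps x_{k+1} = x_k - f(x_k)/f'(x_k)."""
        x = x0
        for _ in range(n_steps):
            x = x - f(x) / fprime(x)
        return x

    root = newton(f=lambda x: x**2 - 2, fprime=lambda x: 2*x, x0=1.0)
    print(root)   # ≈ 1.41421356 (sqrt(2))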

3.4 Convergence
• (Rudin Thm 3.33) Root test: Given Σ a_n, let α = lim sup_{n→∞} |a_n|^{1/n}. If α < 1, Σ a_n converges. If α > 1, Σ a_n diverges. If α = 1, the test gives no information.

• (Rudin Thm 3.34) Ratio test: If lim sup_{n→∞} |a_{n+1}/a_n| < 1, the series Σ a_n converges. If |a_{n+1}/a_n| ≥ 1 for all large enough n, then Σ a_n diverges.

• Convergence of infinite product: Let {a_n} be a sequence of positive numbers. Then Π_{n=1}^∞ (1 + a_n) and Π_{n=1}^∞ (1 − a_n) converge iff Σ_{n=1}^∞ a_n converges. (Proof takes logs and uses the fact that log(1 + x) ∼ x for small x.)
• (Rudin Thm 7.11) Suppose f_n → f uniformly on a set E in a metric space. Let x be a limit point of E and suppose that lim_{t→x} f_n(t) = A_n. Then {A_n} converges, and lim_{t→x} f(t) = lim_{n→∞} A_n.

• (Rudin Thm 7.12) If {fn } is a sequence of continuous functions on E and if fn → f uniformly on E,


then f is continuous on E.

• (Rudin Thm 7.16) If fn → f uniformly on [a, b] and fn are integrable on [a, b], then f is integrable on
[a, b] and
      ∫_a^b f(x) dx = lim_{n→∞} ∫_a^b f_n(x) dx.

• (Rudin Thm 7.17) Suppose {fn } are differentiable on [a, b] and that {fn (x0 )} converges for some point
x0 ∈ [a, b]. If {fn′ } converges uniformly on [a, b], then {fn } converges uniformly on [a, b] to a function
f , and for all x ∈ [a, b],
      f′(x) = lim_{n→∞} f_n′(x).

• Lusin's Theorem: Let A be a measurable subset of R with finite measure, and let f : A → R be measurable. Then for any ε > 0, there is a compact set K ⊆ A with m(A \ K) < ε such that the restriction of f to K is continuous.

• (Dembo Lem 2.2.11) Let yn be a sequence in a topological space. If every subsequence has a further
subsequence which converges to y, then yn → y.

• (Dembo Lem 2.3.20) Kronecker's Lemma: Let {x_n} and {b_n} be 2 sequences of real numbers with b_n > 0 and b_n ↑ ∞. If Σ_n x_n/b_n converges, then s_n/b_n → 0 for s_n = x_1 + · · · + x_n.
• (310A Lec 13) If x_n → x, then (1/n) Σ_{i=1}^n x_i → x.


• (310B Lec 10) If {x_n} is a sequence of positive real numbers increasing to infinity, then Σ_{n=1}^∞ (x_{n+1} − x_n)/x_{n+1} = ∞, and Σ_{n=1}^∞ (x_{n+1} − x_n)/x²_{n+1} < ∞.

• (310B Lec 19) Subadditive Lemma: Let {x_n} be a sequence of real numbers such that x_{n+m} ≤ x_n + x_m for all n, m. Then lim_{n→∞} x_n/n exists and is equal to inf_{n≥1} x_n/n.

• (Durrett Lem 3.1.1) If c_j → 0, a_j → ∞ and a_j c_j → λ, then (1 + c_j)^{a_j} → e^λ. (Generalization in Ex 3.1.1.)

4 Linear Algebra Facts

4.1 Properties of Matrices


• Matrix multiplication as a sum of outer products: If A ∈ R^{m×n}, B ∈ R^{n×p}, A_1, . . . , A_n ∈ R^{m×1} are the columns of A and B_1, . . . , B_n ∈ R^{1×p} are the rows of B, then AB = Σ_{i=1}^n A_i B_i.

  In particular, for the regression setting: Let Z_1, . . . , Z_n be the column vectors corresponding to subjects 1, . . . , n. Let Z be the usual design matrix (i.e. row i belongs to subject i). Then we can write Z^T Z = Σ_{i=1}^n Z_i Z_i^T.

• For matrices, Cov(AX, BY ) = ACov(X, Y )B T , Var (AX + b) = AVar (X)AT .

• x^T Ax = x^T [(A + A^T)/2] x. Thus, when considering quadratic forms, we may assume A is symmetric.

• If EY = µ ∈ Rn and Var Y = Σ, then for non-random matrix A, E[Y T AY ] = µT Aµ + tr(AΣ).

• For any matrix A, the row space of A and the column space of A have the same rank. In addition,
Rank(AT A) = Rank(A).

• rank(AB) ≤ min(rank(A), rank(B)).

• Determinant:

  – det(A) = Π_i λ_i.
  – For square matrices A and B of equal size, det(AB) = det(A) det(B).
  – If A is an n × n matrix, det(cA) = c^n det(A).
  – Considering a matrix in block form (with A and C square), det [[A, B], [0, C]] = (det A)(det C).

– Sylvester’s theorem: If A ∈ Rm×n and B ∈ Rn×m , then det(Im + AB) = det(In + BA).
• Trace: tr(A) = Σ_{i=1}^n a_ii = sum of eigenvalues.

  – More generally, tr(A^k) = Σ_i λ_i^k.
  – Trace is a linear operator, and tr(A) = tr(A^T).
  – tr(AB) = tr(BA) (if the matrices AB and BA make sense). However, tr(AB) ≠ tr(A) tr(B) in general.
  – Trace is invariant under cyclic permutations but not arbitrary permutations. (However, for 3 symmetric matrices, any permutation is ok.)
  – If P_X = X(X^T X)^{−1} X^T (projection matrix), then tr(P_X) = rank(X). (Proof: Use cyclic permutations.)
  – If λ is an eigenvalue of A, then 1/λ is an eigenvalue of A^{−1}. So if the eigenvalues of A are λ_i's, then tr(A^{−1}) = Σ_i 1/λ_i.

• Inverses:

  – Schur complement: Writing a matrix in block form,

        [[A, B], [C, D]]^{−1} = [[(A − BD^{−1}C)^{−1}, −(A − BD^{−1}C)^{−1} BD^{−1}],
                                 [−D^{−1}C(A − BD^{−1}C)^{−1}, D^{−1} + D^{−1}C(A − BD^{−1}C)^{−1} BD^{−1}]].

  – (305A Sec 14.3.2) Sherman-Morrison Formula: If A ∈ R^{n×n} is invertible, u, v ∈ R^n and 1 + v^T A^{−1} u ≠ 0, then (A + uv^T)^{−1} = A^{−1} − A^{−1} uv^T A^{−1}/(1 + v^T A^{−1} u). (A numerical check appears at the end of this section.)

  – Woodbury Formula: (A + UCV)^{−1} = A^{−1} − A^{−1}U(C^{−1} + VA^{−1}U)^{−1}VA^{−1}.

• Positive definite matrices: A ∈ Rn×n is PD (A ≻ 0) if ⟨Ax, x⟩ > 0 for all x ∈ Rn \ {0}.

– If A, B ≻ 0 and t > 0, then A + B ≻ 0 and tA ≻ 0.


– A has positive eigenvalues. This also implies that it will have positive trace and determinant.

– A has positive diagonal entries.
– A is invertible.
– A has a unique positive definite square root.
  – (300B Lec 4) If A is positive definite, then sup_v (v^T u)²/(v^T Av) = u^T A^{−1} u (with the sup attained at v = A^{−1} u).
• Positive semi-definite matrices: A ∈ Rn×n is PSD (A ⪰ 0) if ⟨Ax, x⟩ ≥ 0 for all x ∈ Rn \ {0}. PSD
matrices have corresponding properties of PD matrices as above. (PSD will have unique PSD square
root.)
– If A’s smallest eigenvalue is λ > 0, then A ⪰ λI.
– If the smallest eigenvalue of A is ≥ the largest eigenvalue of B, then A ⪰ B.
• Perron-Frobenius Theorem: Let A be a positive matrix (i.e. all entries positive). Then the
following hold:
– There is an r > 0 such that r is an eigenvalue of A and any other eigenvalue λ must satisfy |λ|< r.
– r is a simple root of the characteristic polynomial of A, and hence its eigenspace has dimension 1.
– There exists an eigenvector v with all components positive such that Av = rv. (Respectively,
there exists a positive left eigenvector w with wT A = rwT .)
– There are no other non-negative eigenvectors except positive multiples of v. (Same for left eigen-
vectors.)
  – lim_{k→∞} A^k/r^k = vw^T, where v and w are normalized so that w^T v = 1. Moreover, vw^T is the projection onto the eigenspace corresponding to r. (This convergence is uniform.)
  – min_i Σ_j a_ij ≤ r ≤ max_i Σ_j a_ij.

• (305C p11) Perron-Frobenius Theorem v2: Let P ∈ [0, ∞)N ×N be a matrix with (possibly com-
plex) right eigenvalues λ1 , . . . , λN . Let ρ = max |λj |. Then P has an eigenvalue equal to ρ with a
1≤j≤N
corresponding eigenvector with all non-negative entries.
• Computational cost:
– Multiplying n × m matrix by m × p matrix: O(nmp). Multiplying two n × n matrices: O(n2.373 ).
– Inverting an n × n matrix: O(n2.373 ).
– QR decomposition for m × n matrix: O(mn2 ).
– SVD decomposition for m × n matrix: O(min(mn2 , m2 n)).
– Determinant of an n × n matrix: O(n2.373 ).
– Back substitution for an n × n triangular matrix: O(n2 ).
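
A minimal numerical check of the Sherman-Morrison formula referenced above, assuming numpy; the matrices are random and illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned invertible matrix
    u, v = rng.normal(size=n), rng.normal(size=n)

    Ainv = np.linalg.inv(A)
    lhs = np.linalg.inv(A + np.outer(u, v))
    rhs = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1 + v @ Ainv @ u)

    print(np.allclose(lhs, rhs))   # True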

4.2 Matrix Decompositions


• Every real symmetric matrix A can be decomposed as A = QΛQT , where Q is a real orthogonal matrix
(whose columns are eigenvectors of A), and Λ is a real diagonal matrix (whose diagonal entries are the
eigenvalues of A).
• Singular Value Decomposition: For any M ∈ R^{n×p}, we have the decomposition M = U_{n×n} Σ_{n×p} V^T_{p×p}, where U and V are orthogonal, Σ = diag(σ_1, . . . , σ_k) with k = min(n, p) and σ_1 ≥ . . . ≥ σ_k ≥ 0.

• QR Decomposition: Any real square matrix A may be decomposed as A = QR, where Q is an
orthonormal matrix and R is an upper triangular matrix. (If A is invertible, then the factorization is
unique if we require the diagonal elements of R to be positive.)

• Cholesky Decomposition: For any real symmetric positive semi-definite matrix A, there is a lower
triangular L such that A = LLT .
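
A minimal numpy sketch of the decompositions listed above (eigendecomposition, SVD, QR, Cholesky), on a random symmetric positive definite matrix; all names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))
    A = B @ B.T + 4 * np.eye(4)          # symmetric positive definite

    lam, Q = np.linalg.eigh(A)           # A = Q diag(lam) Q^T
    U, s, Vt = np.linalg.svd(B)          # B = U diag(s) Vt
    Qr, R = np.linalg.qr(B)              # B = Qr R, R upper triangular
    L = np.linalg.cholesky(A)            # A = L L^T, L lower triangular

    print(np.allclose(Q @ np.diag(lam) @ Q.T, A),
          np.allclose(U @ np.diag(s) @ Vt, B),
          np.allclose(Qr @ R, B),
          np.allclose(L @ L.T, A))       # all True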

4.3 General Vector Spaces


• Operator norm:

∥A∥op = inf{c ≥ 0 : ∥Av∥≤ c∥v∥ for all v} = sup{∥Av∥: ∥v∥= 1}.

The spectral radius of A (i.e. largest absolute value of its eigenvalues) is always bounded above by
∥A∥op .

• (Dembo Ex 4.3.6) Parallelogram Law: Let H be a linear vector space with an inner product. Then
for any u, v ∈ H, ∥u + v∥2 +∥u − v∥2 = 2∥u∥2 +2∥v∥2 .

• (Hoffman & Kunze Eqn 8-3) Polarization Identity: For an inner product space, ⟨u, v⟩ = 1
4 ∥u +
v∥2 − 14 ∥u − v∥2 .

• (Hoffman & Kunze Thm 3.2) Rank-Nullity Theorem: Let T be a linear transformation from V into
W . Suppose that V is finite-dimensional. Then rank(T ) + nullity(T ) = dim V .

• (Hoffman & Kunze Eqn 8-8): In an inner product space, if v is a linear combintion of an orthogonal
sequence of non-zero vectors u1 , . . . , um , then
m
X ⟨v, uk ⟩
v= uk .
∥uk ∥2
k=1

• (Hoffman & Kunze Cor on p287) Bessel’s Inequality: Let {a1 , . . . , an } be an orthogonal set of
non-zero vectors in an inner product space V . If b is any vector in V , then
      Σ_{k=1}^n |⟨b, a_k⟩|²/∥a_k∥² ≤ ∥b∥².

5 Useful Inequalities
• ex ≥ 1 + x and 1 − e−x ≤ x ∧ 1 for all x ∈ R.

• For any a > 0, x 7→ eax + e−ax is an increasing function.

• e|x| ≤ ex + e−x .
• For all x ∈ R, cosh x = (e^x + e^{−x})/2 ≤ e^{x²/2}. (Proof in 310A HW2 Q5.)

• For positive x, log[(1 + e^{−x})/2] ≥ −x.

• For any k ≥ 2, Γ(k/2) ≤ (k/2)k/2 and k 1/k ≤ e1/e .

• e(n/e)^n ≤ n! ≤ e((n + 1)/e)^{n+1}. Also √(2πn) (n/e)^n < n! < √(2πn) (n/e)^n e^{1/(12n)}.

• ∫_0^δ √(log(1 + δ/ε)) dε ≤ 2√2 δ. (Proof: Use log(1 + x) ≤ x.)

• For any integer ℓ ≥ 1, |xℓ − y ℓ |≤ ℓ|x − y|max(|x|, |y|)ℓ−1 .


• There is some constant c > 0 such that |cos x|≤ 1 − cx2 for all x ∈ [−π/2, π/2].

• Reverse triangle inequality: For all x, y ∈ R, |x − y|≥ |x|−|y| .

• (Durrett Ex 1.6.6) Let Y ≥ 0 with EY² < ∞. Then P(Y > 0) ≥ (EY)²/EY². (Proof uses Cauchy-Schwarz on Y·1_{Y>0}.)
• Cauchy-Schwarz inequality: For any 2 random variables X and Y , (E[XY ])2 ≤ E[X 2 ]E[Y 2 ], with
equality iff X = aY for some constant a ∈ R.
• Jensen’s inequality: Let X be a random variable and g a convex function such that E[g(X)] and
g(EX) are finite. Then E[g(X)] ≥ g(EX), with equality holding iff X is constant, or g is linear, or
there is a set A s.t. P(X ∈ A) = 1 and g is linear over A (i.e. there are a and b such that g(x) = ax + b
for all x ∈ A).
• Correlation inequality: For any real-valued random variable Y and increasing functions g and h, Cor(g(Y), h(Y)) ≥ 0.
• (300B HW5) Marcinkiewicz-Zygmund inequality: Let X_i be independent mean-zero random variables with E[|X_i|^k] < ∞ for some k ≥ 1. Then there are constants A_k and B_k which depend only on k such that

      A_k E[(Σ_{i=1}^n X_i²)^{k/2}] ≤ E[|Σ_{i=1}^n X_i|^k] ≤ B_k E[(Σ_{i=1}^n X_i²)^{k/2}].

  (When k = 2, we may take A_2 = B_2 = 1.)
• (300B HW8) Let V ∈ R be a random variable such that |V |≤ D w.p. 1. Let E[V 2 ] = σ 2 . Then for all
c ∈ [0, σ], P(|V| ≥ c) ≥ (σ² − c²)/(D² − c²).
• (Sub-G p18) Bounding of sub-Gaussian moments: If X is such that P(|X| > t) ≤ 2 exp(−t²/(2σ²)), then for any positive integer k, E[|X|^k] ≤ (2σ²)^{k/2} k Γ(k/2). In particular, for k ≥ 2 we have (E[|X|^k])^{1/k} ≤ σ e^{1/e} √k.

6 Useful Integrals
• ∫_0^∞ e^{−x²} dx = √π/2.

• ∫_0^∞ (sin x)/x dx = π/2.

• ∫_{−π/2}^{3π/2} e^{itx} dt = 2π if x = 0, and 0 for non-zero integer x.

• ∫ xe^x dx = e^x(x − 1) + C.

• ∫ x²e^x dx = e^x(x² − 2x + 2) + C.

• ∫ xe^{−x} dx = −e^{−x}(x + 1) + C.

• ∫ x²e^{−x} dx = −e^{−x}(x² + 2x + 2) + C.

• ∫ e^{−x}(1 − e^{−nx}) dx = e^{−(n+1)x}[1 − (n + 1)e^{nx}]/(n + 1) + C.

7 Other Basic Facts


• sin x = x − x³/3! + x⁵/5! − x⁷/7! + · · ·

• cos x = 1 − x²/2! + x⁴/4! − x⁶/6! + · · ·

• tan x = x + x³/3 + 2x⁵/15 + 17x⁷/315 + · · ·

• log(1 + x) = x − x²/2 + x³/3 − x⁴/4 + · · ·
• |x − y|= x + y − 2 min(x, y).
• The function f (x) = xe−x attains its maximum value 1/e uniquely at x = 1.
• (Agresti p52) Conditional independence does not imply marginal independence.

• (Agresti Prob 9.19, 305B HW3) Assume X, Y and Z are categorical variables.
– If Y is jointly independent of X and Z, then X and Y are conditionally independent given Z.
– Mutual independence of X, Y and Z implies that X and Y are both marginally and conditionally
independent.
– If X ⊥ Y and Y ⊥ Z, it is not necessarily the case that X ⊥ Z.
