Kenneth Tay
Contents
2 Distributions 4
2.1 Arcsine Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.1 Standard Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5.2 General Cauchy Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6 Chi-Squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6.1 Non-Central Chi-Squared Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8.1 Shifted Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.1 Non-Central F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9.2 Doubly Non-Central F Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.10 Gamma Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.11 Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.12 Geometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.13 Gumbel Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.14 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.15 Inverse Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.16 Laplace/Double Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.17 Logistic Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.18 Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.19 Negative Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.20 Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.20.1 Standard Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.20.2 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.20.3 Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.21 Pareto Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.22 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.23 T Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.23.1 Non-Central T Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.24 Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.25 Weibull Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Analysis Facts 22
3.1 Basic Topology and Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Measure Theory, Integration and Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Useful Inequalities 28
6 Useful Integrals 29
• (300B HW5) Var Y = inf_t E[(Y − t)²] (the infimum is attained at t = EY).
• Conditional covariance decomposition: Cov X = Cov(E[X | Y]) + E[Cov(X | Y)]. Since E[Cov(X | Y)] ⪰ 0, it follows that Cov(E[X | Y]) ⪯ Cov X.
• E[Xⁿ] is the nth raw moment, sometimes denoted µ′_n. E[(X − µ)ⁿ] is the nth central moment, sometimes denoted µ_n. E[(X − µ)ⁿ]/σⁿ is the normalized nth central moment.
• Skewness γ := E[(X − µ)³]/σ³, kurtosis κ := E[(X − µ)⁴]/σ⁴, excess kurtosis = κ − 3.
• If M_X and ϕ_X are the MGF and characteristic function of X respectively, then E[Xᵏ] = M_X^{(k)}(0) = (1/iᵏ) ϕ_X^{(k)}(0) (if it exists).
• Fisher information: Let f_θ(x) be the probability density function of X conditional on the value of θ.
  – For all θ, E_θ[(∂/∂θ) log f_θ(X)] = 0.
  – Fisher information I(θ) := E_θ[((∂/∂θ) log f_θ(X))²] = −E_θ[(∂²/∂θ²) log f_θ(X)].
  – (300B Lec 4) For a normal location family (i.e. only mean unknown), Fisher information is 1/σ².
  – (300B Lec 5) In an exponential family with density p_θ(x) = h(x) exp[θᵀT(x) − A(θ)], Fisher information I(θ) = ∇²A(θ).
• Information Inequality:
  – (TPE Thm 2.5.10 p120) Information Inequality: Suppose p_θ is a family of densities w.r.t. dominating measure µ and I(θ) > 0. Let δ be any statistic with E_θ(δ²) < ∞ and such that the derivative of E_θ(δ) w.r.t. θ exists and can be differentiated under the integral sign. Then
    Var_θ(δ) ≥ [(∂/∂θ) E_θ(δ)]² / I(θ),
    with equality iff δ = a (∂/∂θ) log p_θ(x) + b, where a and b are constants (which may depend on θ).
  – (Stephen's version) Let δ be an estimator for g(θ), and let E_θ(δ) = g(θ) + b(θ). Then Var_θ(δ) ≥ [g′(θ) + b′(θ)]² / I(θ).
• Total variation distance ∥µ − ν∥_TV := sup_A |µ(A) − ν(A)|:
  – ∥·∥_TV is a metric.
  – ∥µ − ν∥_TV = (1/2) sup_{∥f∥_∞ ≤ 1} |E_µ(f) − E_ν(f)|.
  – If µ and ν have densities f_µ and f_ν w.r.t. some base measure λ, then ∥µ − ν∥_TV = (1/2) ∫ |f_µ(ω) − f_ν(ω)| λ(dω).
2 Distributions
2.1 Arcsine Distribution
• PDF p(x) = 1/[π √((x − a)(a + w − x))], x ∈ (a, a + w).
• CDF P(X ≤ x) = (2/π) arcsin(√((x − a)/w)).
• EX = Median = a + w/2, Var X = w²/8.
• MGF E[e^{tX}] = e^{at} Σ_{n=0}^∞ [Π_{j=0}^{n−1} (2j + 1)/(2j + 2)] wⁿtⁿ/n!.
• Moments (when a = 0): E[Xⁿ] = wⁿ Π_{j=0}^{n−1} (2j + 1)/(2j + 2).
2.2 Bernoulli Distribution
• Characteristic function E[e^{itX}] = q + pe^{it}.
• Fisher information: 1/[p(1 − p)].
2.3 Beta Distribution
Facts about the beta function B(a, b):
• B(a, b) = B(b, a), B(a, 1) = 1/a.
• B(a, b) = Γ(a)Γ(b)/Γ(a + b).
• B(x, y) · B(x + y, 1 − y) = π/[x sin(πy)].
• For large x and large y, B(x, y) ∼ √(2π) x^{x−1/2} y^{y−1/2} / (x + y)^{x+y−1/2} (Stirling's formula). For large x and fixed y, B(x, y) ∼ Γ(y) x^{−y}.
Now let X ∼ Beta(α, β).
• PDF p(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1}(1 − x)^{β−1}.
• EX = α/(α + β), Var X = αβ/[(α + β)²(α + β + 1)], Mode = (α − 1)/(α + β − 2).
• MGF E[e^{tX}] = 1 + Σ_{k=1}^∞ [Π_{r=0}^{k−1} (α + r)/(α + β + r)] tᵏ/k!.
• Moments: If k ≥ 0, then E[Xᵏ] = B(α + k, β)/B(α, β).
• E[1/X] = (α + β − 1)/(α − 1), E[1/(1 − X)] = (α + β − 1)/(β − 1), E[X/(1 − X)] = α/(β − 1).
• Beta(1, 1) =_d U(0, 1).
• Beta(1/2, 1/2) is called the arcsine distribution, with PDF p(x) = 1/[π√(x(1 − x))].
• Suppose X and Y are independent gamma RVs with X ∼ Γ(a, r) and Y ∼ Γ(b, r) (shape and rate). Let U = X + Y and V = X/(X + Y). Then U and V are independent, with U ∼ Γ(a + b, r) and V ∼ Beta(a, b). (Proof: Look at joint PDF of X and Y, change of variables to U and V.)
• If X ∼ F_{n,d}, then (n/d)X/[1 + (n/d)X] ∼ Beta(n/2, d/2). Conversely, if X ∼ Beta(n/2, d/2), then dX/[n(1 − X)] ∼ F_{n,d}.
• Let X1 , . . . , Xn be independent U (0, 1) variables. Then the kth order statistic X(k) ∼ Beta(k, n−k+1).
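Several of the beta facts above can be checked numerically. A minimal stdlib-only sketch, checking the identity B(a, b) = Γ(a)Γ(b)/Γ(a + b) against the defining integral, and the moment formula E[Xᵏ] = B(α + k, β)/B(α, β); the parameter values are arbitrary illustrative choices.

```python
import math

def beta_fn(a, b):
    # B(a, b) via the gamma-function identity.
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_fn_integral(a, b, n=200000):
    # Midpoint-rule evaluation of the defining integral
    # B(a, b) = ∫_0^1 x^{a-1} (1-x)^{b-1} dx.
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
               for i in range(n)) * h

def beta_moment(alpha, beta, k):
    # E[X^k] for X ~ Beta(alpha, beta), via B(alpha + k, beta) / B(alpha, beta).
    return beta_fn(alpha + k, beta) / beta_fn(alpha, beta)

# The two evaluations of B(2.5, 3) should agree closely.
print(abs(beta_fn(2.5, 3) - beta_fn_integral(2.5, 3)))
# The first moment should match EX = alpha / (alpha + beta).
print(abs(beta_moment(2.5, 3, 1) - 2.5 / 5.5))
```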
2.4 Binomial Distribution
• PMF P(X = x) = (n choose x) pˣ(1 − p)^{n−x}, x ∈ {0, 1, . . . , n}.
• CDF can be written in the form F(k) = P(X ≤ k) = [n!/((n − k − 1)! k!)] ∫_0^{1−p} x^{n−k−1}(1 − x)ᵏ dx, k ∈ {0, 1, . . . , n}. (Proof: Integration by parts and induction.)
• EX = np, Var X = npq. Median is ⌊np⌋ or ⌈np⌉, mode is ⌊(n + 1)p⌋ or ⌊(n + 1)p⌋ − 1.
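The incomplete-beta form of the CDF above can be verified directly. A stdlib-only sketch comparing the PMF sum with the integral (midpoint rule), for the arbitrary illustrative values n = 10, p = 0.3, k = 4:

```python
import math

def binom_cdf_sum(n, p, k):
    # Direct CDF: sum of binomial PMF terms up to k.
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

def binom_cdf_integral(n, p, k, m=100000):
    # Identity: P(X <= k) = n!/((n-k-1)! k!) * ∫_0^{1-p} x^{n-k-1} (1-x)^k dx.
    c = math.factorial(n) / (math.factorial(n - k - 1) * math.factorial(k))
    h = (1 - p) / m
    integral = sum(((i + 0.5) * h) ** (n - k - 1) * (1 - (i + 0.5) * h) ** k
                   for i in range(m)) * h
    return c * integral

print(binom_cdf_sum(10, 0.3, 4))  # ≈ 0.8497
print(abs(binom_cdf_sum(10, 0.3, 4) - binom_cdf_integral(10, 0.3, 4)))
```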
2.5.2 General Cauchy Distribution
• If X has standard Cauchy distribution, then Y = a + bX has Cauchy distribution with location
parameter a and scale parameter b.
• PDF p(y) = b/{π[b² + (y − a)²]}, CDF P(Y ≤ y) = (1/π) arctan((y − a)/b) + 1/2.
2.6 Chi-Squared Distribution
• PDF p(x) = [1/(2^{k/2} Γ(k/2))] x^{k/2−1} e^{−x/2}, for x ≥ 0. The PDF satisfies the differential equation 2xf′(x) + f(x)(−k + x + 2) = 0.
• EX = k, Median ≈ k(1 − 2/(9k))³, Mode = max(k − 2, 0), Var X = 2k.
• Moments: If X ∼ χ²_n, then for k > −n/2, E[Xᵏ] = 2ᵏ Γ(n/2 + k)/Γ(n/2). For k ≤ −n/2, E[Xᵏ] = ∞.
• In particular, if X ∼ χ²_n for n ≥ 3, then E[1/X] = 1/(n − 2).
• χ²_ν =_d Gam(ν/2, θ = 2) (shape, scale).
• χ²_2 =_d Exp(1/2).
• Let f_n denote the density of χ²_n. Then f_{n+2}(x) = (x/n) f_n(x).
• If X ∼ F_{ν1,ν2}, then as ν2 → ∞, ν1X →_d χ²_{ν1}.
2.6.1 Non-Central Chi-Squared Distribution
Suppose X1, . . . , Xn are independent RVs, where Xk ∼ N(µk, 1). Then Y = X1² + · · · + Xn² has the non-central chi-squared distribution with n degrees of freedom and non-centrality parameter λ = µ1² + · · · + µn².
• PDF of χ²_n(λ) is p(x; λ) = e^{−λ/2} Σ_{k=0}^∞ [(λ/2)ᵏ/k!] · x^{n/2−1+k} e^{−x/2}/[2^{k+n/2} Γ(k + n/2)].
2.7 Dirichlet Distribution
Let K ≥ 2 be the number of categories. Let X ∼ Dirichlet(α1, . . . , αK), αi > 0 for all i. Let α0 = α1 + · · · + αK.
• Support of X is {(x1, . . . , xK) : xi ∈ (0, 1), Σ_{i=1}^K xi = 1}.
• PDF p(x) = [1/B(α)] Π_{i=1}^K xi^{αi−1}, where B(α) = Γ(α1) · · · Γ(αK)/Γ(α1 + · · · + αK).
• EXi = αi/α0; mode for Xi is (αi − 1)/(α0 − K) (valid when αi > 1).
• Var Xi = αi(α0 − αi)/[α0²(α0 + 1)], Cov(Xi, Xj) = −αiαj/[α0²(α0 + 1)] for i ≠ j.
• Moments E[Π_{i=1}^K Xi^{βi}] = B(α + β)/B(α) = [Γ(α0)/Γ(α0 + β0)] Π_{i=1}^K Γ(αi + βi)/Γ(αi).
2.8 Exponential Distribution
Let X ∼ Exp(λ) (rate parametrization): PDF p(x) = λe^{−λx} for x ≥ 0, EX = 1/λ, Var X = 1/λ².
• Skewness is 2, excess kurtosis is 6.
• MGF E[e^{tX}] = λ/(λ − t) for t < λ. Characteristic function E[e^{itX}] = λ/(λ − it).
• Moments E[Xᵏ] = k!/λᵏ. (Proof: Integration by parts.)
• Fisher information: 1/λ².
• Memoryless property: For exponentially distributed X, P(X > s + t | X > s) = P(X > t) for all s, t ≥ 0.
• If X1, . . . , Xn are independent Exp(1) random variables, then the kth order statistic satisfies T_(k) =_d Σ_{i=1}^k Ei/(n − i + 1), where the Ei are independent Exp(1); in particular E[T_(k)] = Σ_{i=1}^k 1/(n − i + 1). (Proof uses memoryless property, see Prob Qual 2013-1.)
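A seeded Monte Carlo sketch of the order-statistic fact above, checking that the mean of the kth order statistic of n Exp(1) variables is close to Σ_{i=1}^k 1/(n − i + 1); the sample size and tolerance below are arbitrary choices.

```python
import random

random.seed(0)
n, k, reps = 5, 3, 200000

# Mean of the k-th order statistic of n iid Exp(1) draws, by simulation.
total = 0.0
for _ in range(reps):
    draws = sorted(random.expovariate(1.0) for _ in range(n))
    total += draws[k - 1]
mc_mean = total / reps

# Theoretical mean from the representation: sum_{i=1}^k 1/(n - i + 1).
theory = sum(1.0 / (n - i + 1) for i in range(1, k + 1))  # 1/5 + 1/4 + 1/3
print(mc_mean, theory)
```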
2.8.1 Shifted Exponential Distribution
(Notation from TPE p18) Let X ∼ E(a, b) (−∞ < a < ∞, b > 0). a is the shift parameter, b is the scale parameter.
• PDF p(x) = (1/b) e^{−(x−a)/b} if x ≥ a, 0 otherwise. CDF P(X ≤ x) = 1 − exp[−(x − a)/b] for x ≥ a, 0 otherwise.
• EX = a + b, Var X = b².
• If X1, . . . , Xn are i.i.d. E(a, b), then the smallest order statistic X_(1) ∼ E(a, b/n).
• (TPE Eg 1.6.24 p43) If X1, . . . , Xn are i.i.d. E(a, b), let T1 = X_(1), T2 = Σ[Xi − X_(1)]. Then T1 and T2 are independent (Basu's Theorem), with T1 ∼ E(a, b/n) and (2/b)T2 ∼ χ²_{2n−2}.
2.9 F Distribution
• PDF p(x) = [1/(x B(n/2, m/2))] √[(nx)ⁿ mᵐ/(nx + m)^{n+m}] for x > 0, where B is the beta function. (PDF is also defined at x = 0 for n ≥ 2.)
• EX = m/(m − 2) for m > 2 (∞ if m ≤ 2), mode = [(n − 2)/n][m/(m + 2)] for n > 2, Var X = 2m²(n + m − 2)/[n(m − 2)²(m − 4)] for m > 4 (undefined for m ≤ 2, ∞ for 2 < m ≤ 4).
• If X ∼ Beta(n/2, m/2), then mX/[n(1 − X)] ∼ F_{n,m}. Conversely, if X ∼ F_{n,m}, then (nX/m)/(1 + nX/m) ∼ Beta(n/2, m/2).
• If X ∼ F_{n,m}, then 1/X ∼ F_{m,n}.
• If X ∼ t_n, then X² ∼ F_{1,n}.
• If X, Y ∼ Exp(λ) independent, then X/Y ∼ F_{2,2}.
• As m → ∞, F_{n,m} →_d χ²_n/n.
2.9.1 Non-Central F Distribution
This is defined as a non-central χ² divided by an independent central χ², each scaled by its degrees of freedom: F = [χ′²_{n1}(λ)/n1]/[χ²_{n2}/n2].
• EF = n2(n1 + λ)/[n1(n2 − 2)] if n2 > 2, does not exist if n2 ≤ 2. Var F = 2(n2/n1)² [(n1 + λ)² + (n1 + 2λ)(n2 − 2)]/[(n2 − 2)²(n2 − 4)] if n2 > 4, does not exist if n2 ≤ 4.
2.9.2 Doubly Non-Central F Distribution
This is defined as the ratio of 2 independent non-central χ² distributions, each scaled by its degrees of freedom: [χ′²_{n1}(λ1)/n1]/[χ′²_{n2}(λ2)/n2].
2.10 Gamma Function
For z ∈ C with Re(z) > 0, the gamma function is defined by Γ(z) = ∫₀^∞ x^{z−1} e^{−x} dx.
• If n is an even positive integer, Γ(n/2) = (n/2 − 1)!. If n is an odd positive integer, Γ(n/2) = √π (n − 1)!/[2^{n−1} ((n − 1)/2)!].
• For α ∈ R, lim_{n→∞} Γ(n + α)/[Γ(n) n^α] = 1.
2.11 Gamma Distribution
The Gamma distribution is often parametrized in 2 different ways: X ∼ Gam(k, θ) (shape, scale) or X ∼ Gam(α, β) (shape, rate). All parameters are positive, and k = α, θ = 1/β represent the same distribution. In the rate parametrization the PDF is p(x) = [β^α/Γ(α)] x^{α−1} e^{−βx}. (BDA3 uses the shape-rate parametrization.)
• PDF p(x) = [1/(Γ(k)θᵏ)] x^{k−1} e^{−x/θ} for x > 0 (shape, scale).
• Gam(1, λ) = Exp(λ) (rate).
2.12 Geometric Distribution
• P(X = k) is the probability that the 1st success occurs on the kth trial. P(X = k) = p(1 − p)k−1 ,
k ∈ {1, 2, . . . }.
• If X1 , . . . , Xr are independent Geom(p) RVs, then their sum has distribution NegBin(r, p).
• If X1, . . . , Xr are independent Geom(pi) RVs (possibly different parameters), then min Xi is Geometric with parameter p = 1 − Π_i (1 − pi).
• Exponential approximation (Dembo Eg 3.2.5): Let Z_p ∼ Geom(p). Then as p → 0, pZ_p →_d Exp(1).
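The geometric-to-exponential limit can be seen directly from the tail: P(pZ_p > t) = (1 − p)^{⌊t/p⌋} → e^{−t}. A small deterministic sketch (the value t = 1.3 is arbitrary):

```python
import math

def geom_scaled_tail(p, t):
    # P(p * Z_p > t) for Z_p ~ Geom(p) on {1, 2, ...}: P(Z_p > k) = (1-p)^k.
    return (1 - p) ** math.floor(t / p)

t = 1.3
for p in [0.1, 0.01, 0.001]:
    # The scaled tail approaches the Exp(1) tail e^{-t} as p -> 0.
    print(p, geom_scaled_tail(p, t), math.exp(-t))
```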
2.13 Gumbel Distribution
• The standard Gumbel distribution is the limit of the (suitably normalized) maximum of n i.i.d. RVs whose distribution falls in a certain class; this class includes the exponential and normal distributions.
• (Gumbel Max Trick) Let ε1, . . . , εk be i.i.d. standard Gumbel RVs. Then for any α1, . . . , αk, P(argmax_{1≤i≤k} (αi + εi) = r) = e^{αr}/Σ_{i=1}^k e^{αi}.
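A seeded simulation sketch of the Gumbel max trick, sampling standard Gumbel noise via the inverse CDF −log(−log U) and comparing argmax frequencies against the softmax probabilities; the weights, sample size and tolerance are arbitrary choices.

```python
import math
import random

random.seed(1)
alphas = [0.5, 1.0, -0.3]
reps = 100000

counts = [0] * len(alphas)
for _ in range(reps):
    # Standard Gumbel via inverse CDF: F^{-1}(u) = -log(-log u).
    noisy = [a - math.log(-math.log(random.random())) for a in alphas]
    counts[max(range(len(alphas)), key=lambda i: noisy[i])] += 1

freqs = [c / reps for c in counts]
z = sum(math.exp(a) for a in alphas)
softmax = [math.exp(a) / z for a in alphas]
print(freqs)
print(softmax)
```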
2.14 Hypergeometric Distribution
The hypergeometric distribution describes the probability of k successes in n draws, without replacement, from a finite population of size N that contains exactly K successes (each draw is either a success or a failure).
• Support is k ∈ N such that max(0, n + K − N) ≤ k ≤ min(K, n).
• PMF P(Y = k) = (K choose k)(N − K choose n − k)/(N choose n).
• EY = nK/N, Var Y = nK(N − K)(N − n)/[N²(N − 1)].
• Let Xi = 1 if draw i is a success, 0 otherwise. Then Y = Σ_{i=1}^n Xi.
• Let Y ∼ Hypergeom(n, N, K). Then E[Yᵏ] = (nK/N) E[(Z + 1)^{k−1}], where Z ∼ Hypergeom(n − 1, N − 1, K − 1). (Proof by combinatorial identities.)
• Let Y ∼ Hypergeom(n, N, K), and let p = K/N. If N and K are large compared to n, and p is not close to 0 or 1, then Y is approximately Binom(n, p).
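A deterministic sketch of the binomial approximation above, comparing the two PMFs with math.comb for the arbitrary illustrative values N = 10000, K = 3000, n = 10:

```python
import math

def hypergeom_pmf(N, K, n, k):
    # P(Y = k) = C(K, k) C(N-K, n-k) / C(N, n).
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def binom_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

N, K, n = 10000, 3000, 10
p = K / N
# With N, K large relative to n, without-replacement draws behave like
# with-replacement draws, so the PMFs should nearly coincide.
max_gap = max(abs(hypergeom_pmf(N, K, n, k) - binom_pmf(n, p, k))
              for k in range(n + 1))
print(max_gap)
```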
2.15 Inverse Gaussian Distribution
Let X ∼ IG(µ, λ) (mean µ, shape λ); EX = µ, Var X = µ³/λ.
• E[1/X] = 1/µ + 1/λ, Var(1/X) = 1/(µλ) + 2/λ².
• MGF E[e^{tX}] = exp[(λ/µ)(1 − √(1 − 2µ²t/λ))]. Characteristic function E[e^{itX}] = exp[(λ/µ)(1 − √(1 − 2µ²it/λ))].
2.16 Laplace/Double Exponential Distribution
• PDF p(x) = (1/2b) exp(−|x − µ|/b). CDF P(X ≤ x) = (1/2) exp[(x − µ)/b] if x < µ, P(X ≤ x) = 1 − (1/2) exp[−(x − µ)/b] if x ≥ µ.
• Mean, median and mode are all µ. Var X = 2b². Skewness is 0, excess kurtosis is 3.
• MGF E[e^{tX}] = exp(µt)/(1 − b²t²) for |t| < 1/b. Characteristic function E[e^{itX}] = exp(iµt)/(1 + b²t²).
• Central moments E[(X − µ)ⁿ] = 0 if n is odd, = bⁿn! if n is even.
• If X ∼ Laplace(µ, b), then kX + c ∼ Laplace(kµ + c, |k|b).
• If X ∼ Laplace(µ, b), then |X − µ| ∼ Exp(1/b) (rate).
• If X, Y ∼ Exp(λ) are independent, then X − Y ∼ Laplace(0, 1/λ).
• If X1, . . . , X4 are i.i.d. N(0, 1), then X1X2 − X3X4 ∼ Laplace(0, 1).
• If X1, . . . , Xn are i.i.d. Laplace(µ, b), then (2/b) Σ_{i=1}^n |Xi − µ| ∼ χ²_{2n}.
• If X, Y are independent Laplace(µ, b), then |X − µ|/|Y − µ| ∼ F_{2,2}.
• If X, Y ∼ Unif(0, 1) are independent, then log(X/Y) ∼ Laplace(0, 1).
• If X ∼ Exp(λ) and Y ∼ Bernoulli(1/2) are independent, then X(2Y − 1) ∼ Laplace(0, 1/λ).
• If X ∼ Exp(λ) and Y ∼ Exp(ν) are independent, then λX − νY ∼ Laplace(0, 1).
• If V ∼ Exp(1) and Z ∼ N(0, 1) are independent, then X = µ + b√(2V) Z ∼ Laplace(µ, b).
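A seeded check of the last representation: simulate X = µ + b√(2V)·Z and compare the sample mean and variance with µ and 2b²; the parameter values, sample size and tolerances are arbitrary choices.

```python
import math
import random

random.seed(2)
mu, b, reps = 1.5, 0.7, 200000

# Draw X = mu + b * sqrt(2V) * Z with V ~ Exp(1), Z ~ N(0, 1) independent.
xs = [mu + b * math.sqrt(2 * random.expovariate(1.0)) * random.gauss(0, 1)
      for _ in range(reps)]

mean = sum(xs) / reps
var = sum((x - mean) ** 2 for x in xs) / reps
print(mean, var)  # should be near mu = 1.5 and 2*b^2 = 0.98
```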
2.17 Logistic Distribution
• PDF p(x) = exp[−(x − µ)/s]/{s[1 + exp(−(x − µ)/s)]²}, CDF P(X ≤ x) = 1/[1 + exp(−(x − µ)/s)].
• Mean, median and mode are all µ. Var X = s²π²/3.
• MGF E[e^{tX}] = e^{µt} B(1 − st, 1 + st) for st ∈ (−1, 1), where B is the beta function. Characteristic function E[e^{itX}] = e^{itµ} πst/sinh(πst).
• Central moments E[(X − µ)ⁿ] = sⁿπⁿ(2ⁿ − 2)|Bₙ|, where Bₙ is the nth Bernoulli number (these moments vanish for odd n).
• If X ∼ Logistic(µ, s), then kX + c ∼ Logistic(kµ + c, ks).
• If U ∼ Unif(0, 1), then µ + s[log U − log(1 − U)] ∼ Logistic(µ, s).
• If X, Y are independent Gumbel(α, β), then X − Y ∼ Logistic(0, β).
• If X ∼ Exp(1), then µ + β log(e^X − 1) ∼ Logistic(µ, β).
• If X, Y ∼ Exp(1) are independent, then µ − β log(X/Y) ∼ Logistic(µ, β).
2.18 Multinomial Distribution
• PMF P(X1 = x1, . . . , Xs = xs) = [n!/(x1! · · · xs!)] p1^{x1} · · · ps^{xs} (for xi's that sum up to n).
• EXi = npi, Var Xi = npi(1 − pi), Cov(Xi, Xj) = −npipj for i ≠ j. In matrix notation, Var X = n[diag(p) − ppᵀ]. (Proof: 310A HW1 Qn3.)
• MGF E[e^{t·X}] = (Σ_{i=1}^s pi e^{ti})ⁿ. Characteristic function E[e^{it·X}] = (Σ_{j=1}^s pj e^{itj})ⁿ.
• If X1, . . . , Xk are independent with Xi ∼ Pois(λi), then conditional on Σi Xi = n, (X1, . . . , Xk) ∼ Multinom(n, (p1, . . . , pk)) with pi = λi/Σj λj. Conversely, suppose that N ∼ Pois(λ) and conditional on N = n, X = (X1, . . . , Xk) ∼ Multinom(n, (p1, . . . , pk)). Then the Xi's are mutually independent and Poisson-distributed with parameters λp1, . . . , λpk.
2.19 Negative Binomial Distribution
BDA3/305C Notes
• PMF is p(y) = (α + y − 1 choose y) [β/(β + 1)]^α [1/(β + 1)]^y, y = 0, 1, . . . .
• EY = α/β, Var Y = α(β + 1)/β².
• The negative binomial is a mixture of Poisson distributions whose rates follow the gamma distribution (shape-rate): NegBin(y | α, β) = ∫ Pois(y | θ) Gamma(θ | α, β) dθ.
TSH
Let Y ∼ NegBin(p, m), where p is the probability of success, and m is the number of successes to be obtained.
• If we let m = α and p = β/(β + 1), we get BDA's parametrization.
• Interpretation: If Y + m independent trials are needed to obtain m successes (and each trial has success probability p), then Y ∼ NegBin(p, m).
• PMF is p(y) = (m + y − 1 choose y) pᵐ(1 − p)^y, y = 0, 1, . . . .
• EY = m(1 − p)/p, Var Y = m(1 − p)/p².
• MGF E[e^{tY}] = [p/(1 − (1 − p)eᵗ)]ᵐ for t < −log(1 − p). Characteristic function E[e^{itY}] = [p/(1 − (1 − p)e^{it})]ᵐ.
• Fisher information: m/[p²(1 − p)].
• Y is the sum of m independent Geom(p) random variables.
Agresti
Let Y ∼ NegBin(k, µ) or Y ∼ NegBin(γ, µ), where γ = 1/k (dispersion parameter). k > 0, µ > 0.
• If we let µ = α/β and k = α, we get BDA's parametrization.
• PMF p(y) = [Γ(y + k)/(Γ(k)Γ(y + 1))] [k/(µ + k)]ᵏ [1 − k/(µ + k)]^y, y = 0, 1, . . . .
• EY = µ, Var Y = µ + γµ².
• As γ → 0, the negative binomial converges to the Poisson.
2.20 Normal Distribution
Let Z ∼ N(µ, σ²).
• PDF p(z) = [1/√(2πσ²)] e^{−(z−µ)²/(2σ²)}.
• MGF E[e^{tZ}] = exp(µt + σ²t²/2). Characteristic function E[e^{itZ}] = exp(iµt − σ²t²/2).
• Fisher information for (µ, σ²): the diagonal matrix diag(1/σ², 1/(2σ⁴)).
• Central moments (i.e. µ = 0): for non-negative integer p, E[X^p] = 0 if p odd, σ^p (p − 1)!! if p even. Absolute moments:
  E[|X|^p] = σ^p (p − 1)!! · (√(2/π) if p odd, 1 if p even) = σ^p · 2^{p/2} Γ((p + 1)/2)/√π.
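A deterministic sketch checking the absolute-moment formula E[|X|^p] = σ^p 2^{p/2} Γ((p + 1)/2)/√π against numerical integration of the N(0, σ²) density; σ = 1.3 is an arbitrary choice.

```python
import math

def abs_moment_formula(sigma, p):
    return sigma**p * 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

def abs_moment_numeric(sigma, p, hi=12.0, n=200000):
    # E|X|^p = 2 ∫_0^∞ x^p φ_σ(x) dx, truncated at hi * sigma (negligible tail).
    h = hi * sigma / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x**p * math.exp(-x * x / (2 * sigma * sigma))
    return 2 * total * h / math.sqrt(2 * math.pi * sigma * sigma)

for p in [1, 2, 3, 4]:
    print(p, abs_moment_formula(1.3, p), abs_moment_numeric(1.3, p))
```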
• E[1/X] does not exist.
• If X and Y are jointly normal, then uncorrelatedness is the same as independence.
• (TPE Eg 5.16 p31) Stein's identity for the normal: If X ∼ N(µ, σ²) and g is a differentiable function with E|g′(X)| < ∞, then E[g(X)(X − µ)] = σ²Eg′(X).
• Variance stabilizing transformation: If X is approximately N(θ, αθ(1 − θ)), then taking Y = (1/√α) arcsin(2X − 1), we have Y approximately N((1/√α) arcsin(2θ − 1), 1).
• Cramér’s decomposition theorem: If X1 and X2 are independent and X1 + X2 is normally dis-
tributed, then both X1 and X2 must be normally distributed.
• Marcinkiewicz theorem: If a random variable X has characteristic function of the form φX (t) =
eQ(t) where Q is a polynomial, then Q can be at most a quadratic polynomial.
• If X and Y are independent N (µ, σ 2 ) RVs, then X + Y and X − Y are independent and identically
distributed. Bernstein’s theorem asserts the converse: If X and Y are independent s.t. X + Y and
X − Y are also independent, then X and Y must have normal distributions.
• KL divergence: If X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²), then
  D_KL(X1 ∥ X2) = (µ1 − µ2)²/(2σ2²) + (1/2)[σ1²/σ2² − 1 − log(σ1²/σ2²)].
2.20.1 Standard Normal Distribution
• (300C Lec 2) Holding α fixed, for large n, |z(α/n)| ≈ √(2 log n) [1 − (log log n)/(4 log n)] ≈ √(2 log n).
• (Dembo Ex 2.2.24, 300C Lec 2) Let Z1, Z2, . . . be independent N(0, 1) random variables. Then with probability 1, lim_{n→∞} max_{i≤n} Zi/√(2 log n) = 1.
• (300C, Lec 24) For large λ, E[Z²; |Z| > λ] ≈ 2λϕ(λ).
• If Z1 and Z2 are independent standard normal variables, then Z1/Z2 has the standard Cauchy distribution.
• Φ(x) is log-concave (i.e. log Φ(x) is concave), so (d/dx) log Φ(x) = ϕ(x)/Φ(x) is decreasing in x.
• If G^k_θ is the CDF of the truncated normal Z ∼ N(θ, 1) | Z ≤ k, then G^k_θ(t) is a decreasing function of θ.
2.20.2 Multivariate Normal Distribution
Let Z ∼ N(µ, Σ) be d-dimensional.
• Support Z ∈ µ + span(Σ) ⊆ R^d.
• PDF exists only when Σ is positive definite: p(z) = [1/√((2π)^d |Σ|)] exp[−(1/2)(z − µ)ᵀΣ⁻¹(z − µ)].
• Conditional distribution: Partition X ∼ N(µ, Σ) into blocks X = (X1, X2). Then X1 | X2 = a is normal with mean µ1 + Σ12Σ22⁻¹(a − µ2) and covariance Σ11 − Σ12Σ22⁻¹Σ21. The covariance matrix above is called the Schur complement of Σ22 in Σ. Note that it does not depend on a. (In order to prove this result, we use the fact that X2 and X1 − Σ12Σ22⁻¹X2 are independent.)
• Using the notation above, X1 and X2 are independent iff Σ12 = 0. (Proof: Factor the characteristic function.)
• Suppose Y ∼ N(µ, Σ) is n-dimensional and Σ⁻¹ exists. Then (Y − µ)ᵀΣ⁻¹(Y − µ) ∼ χ²_n. (Proof: Σ = PᵀΛP, Z = Λ^{−1/2}P(Y − µ). Then Z ∼ N(0, I_n).)
• Assume that Z ∼ N(0, Σ) is d-dimensional with Σ invertible. Then ZᵀΣ⁻¹Z ∼ χ²_d. More generally, if Z ∼ N(µ, Σ), then (Z − µ)ᵀΣ†(Z − µ) ∼ χ²_{rank(Σ)}, where Σ† is the pseudo-inverse of Σ.
2.20.3 Bivariate Normal Distribution
• In the bivariate normal case Z = (X, Y), letting ρ be the correlation between X and Y, we can write the PDF as
  p(x, y) = [1/(2πσ_X σ_Y √(1 − ρ²))] exp{−[1/(2(1 − ρ²))][(x − µ_X)²/σ_X² + (y − µ_Y)²/σ_Y² − 2ρ(x − µ_X)(y − µ_Y)/(σ_X σ_Y)]}.
• (300A Lec 16) Under ρ = 0 (i.e. independence), the sample correlation ρ̂ satisfies √(n − 2) ρ̂/√(1 − ρ̂²) ∼ t_{n−2}.
• Let (X, Y) ∼ N(0, [1 ρ; ρ 1]). Then Y = ρX + √(1 − ρ²) ε for some ε ∼ N(0, 1) independent of X.
2.22 Poisson Distribution
Let X ∼ Pois(λ).
• PMF P(X = k) = λᵏe^{−λ}/k!.
• EX = λ, Var X = λ.
• MGF E[e^{tX}] = exp[λ(eᵗ − 1)]. Characteristic function E[e^{itX}] = exp[λ(e^{it} − 1)].
• Fisher information: 1/λ.
• Let X1 ∼ Pois(λ1) and X2 ∼ Pois(λ2) be independent. Then X1 + X2 ∼ Pois(λ1 + λ2), and X1 | X1 + X2 = k ∼ Binom(k, λ1/(λ1 + λ2)).
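A deterministic sketch of the additivity fact, checking the convolution of two Poisson PMFs against the Pois(λ1 + λ2) PMF; the rates 1.2 and 2.5 are arbitrary choices.

```python
import math

def pois_pmf(lam, k):
    return lam**k * math.exp(-lam) / math.factorial(k)

lam1, lam2 = 1.2, 2.5
for n in range(8):
    # P(X1 + X2 = n) by convolution vs. the Pois(lam1 + lam2) PMF directly.
    conv = sum(pois_pmf(lam1, k) * pois_pmf(lam2, n - k) for k in range(n + 1))
    direct = pois_pmf(lam1 + lam2, n)
    print(n, conv, direct)
```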
• Suppose that N ∼ Pois(λ) and conditional on N = n, X = (X1, . . . , Xk) ∼ Multinom(n, (p1, . . . , pk)). Then the Xi's are mutually independent and Poisson-distributed with parameters λp1, . . . , λpk.
2.23 T Distribution
• PDF p(x) = [Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))] (1 + x²/ν)^{−(ν+1)/2}.
• EX = 0 (if ν > 1, otherwise undefined); median and mode are 0. Var X = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, undefined otherwise.
• MGF is undefined; the characteristic function involves the modified Bessel function of the second kind.
• When ν > 1, E[Xᵏ] = 0 if k odd, 0 < k < ν; E[Xᵏ] = [ν^{k/2}/√π] Γ((k + 1)/2) Γ((ν − k)/2)/Γ(ν/2) if k even, 0 < k < ν.
• As n → ∞, t_n →_d N(0, 1).
• If Z ∼ N(0, 1) and X ∼ χ²_n with Z and X independent, then Z/√(X/n) ∼ t_n.
• Let X1, . . . , Xn be i.i.d. N(µ, σ²). Let X̄ be the sample mean and S² = [1/(n − 1)] Σ_{i=1}^n (Xi − X̄)² be the sample variance. Then (X̄ − µ)/(S/√n) ∼ t_{n−1}.
• t1 =_d standard Cauchy distribution.
• If T ∼ t_ν, then T² ∼ F_{1,ν}.
2.23.1 Non-Central T Distribution
• This is obtained as t′_n(λ) = Z/√(χ²_n/n), where Z ∼ N(λ, 1) is independent of the χ²_n.
• The non-central t-distribution is asymmetric unless λ = 0. The right tail is heavier than the left when λ > 0, and vice versa.
• As n → ∞, t′_n(λ) →_d N(λ, 1).
2.24 Uniform Distribution
• PDF p(x) = [1/(b − a)] 1_{x∈[a,b]}. CDF P(X ≤ x) = (x − a)/(b − a) for a ≤ x ≤ b.
• EX = Median = (a + b)/2, Var X = (b − a)²/12.
• MGF E[e^{tU}] = (e^{tb} − e^{ta})/[t(b − a)] for t ≠ 0, E[e^{tU}] = 1 if t = 0. Characteristic function E[e^{itU}] = (e^{itb} − e^{ita})/[it(b − a)].
• If X1, . . . , Xn are i.i.d. Unif(0, 1), then the kth order statistic X_(k) ∼ Beta(k, n + 1 − k).
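A seeded sketch of the order-statistic fact: the mean of the kth order statistic of n Unif(0, 1) draws should be near the Beta(k, n + 1 − k) mean k/(n + 1); n = 7, k = 3 and the sample size are arbitrary choices.

```python
import random

random.seed(3)
n, k, reps = 7, 3, 100000

# Average the k-th smallest of n uniform draws over many repetitions.
total = 0.0
for _ in range(reps):
    total += sorted(random.random() for _ in range(n))[k - 1]
mc_mean = total / reps

print(mc_mean, k / (n + 1))  # Beta(3, 5) mean = 3/8
```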
2.25 Weibull Distribution
Let X ∼ Weibull(λ, k) (scale λ, shape k).
• Support is x ≥ 0.
• PDF p(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)ᵏ}. CDF P(X ≤ x) = 1 − e^{−(x/λ)ᵏ}.
• EX = λΓ(1 + 1/k), median = λ(log 2)^{1/k}, mode = λ[(k − 1)/k]^{1/k} if k > 1, 0 otherwise.
• MGF E[e^{tX}] = Σ_{n=0}^∞ (tⁿλⁿ/n!) Γ(1 + n/k) for k ≥ 1. Characteristic function E[e^{itX}] = Σ_{n=0}^∞ [(it)ⁿλⁿ/n!] Γ(1 + n/k).
• The Weibull is the distribution of a random variable W such that (W/λ)ᵏ ∼ Exp(1). Going in the other direction, if X ∼ Exp(1), then λX^{1/k} ∼ Weibull(λ, k).
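The exponential relationship can be checked deterministically through CDFs: P(λX^{1/k} ≤ x) = P(X ≤ (x/λ)ᵏ) = 1 − e^{−(x/λ)ᵏ}, which is the Weibull CDF. A short sketch with arbitrary parameter values:

```python
import math

def weibull_cdf(x, lam, k):
    return 1 - math.exp(-((x / lam) ** k))

def exp_cdf(x):
    # CDF of Exp(1).
    return 1 - math.exp(-x)

lam, k = 2.0, 1.5
for x in [0.5, 1.0, 3.0]:
    # CDF of lam * X^{1/k} at x equals the Exp(1) CDF at (x/lam)^k.
    print(x, weibull_cdf(x, lam, k), exp_cdf((x / lam) ** k))
```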
3 Analysis Facts
• Hilbert space: A real or complex inner product space which is complete w.r.t. distance induced by
the inner product.
• Compact metric space: A metric space X is compact if every open cover of X has a finite subcover.
• Totally bounded: A metric space (M, d) is totally bounded iff for every ε > 0, there exists a finite
collection of open balls in M of radius ε whose union covers M .
• Heine-Borel Theorem for arbitrary metric space: A subset of a metric space is compact iff it is
complete and totally bounded.
• Borel-Lebesgue Theorem: For a metric space (X, d), the following are equivalent:
1. X is compact.
2. Every collection of closed subsets of X with the finite intersection property (i.e. every finite
subcollection has non-empty intersection) has non-empty intersection.
3. X is sequentially compact.
4. X is complete and totally bounded.
• (Rudin Thm 2.35) Closed subsets of compact sets are compact. If F is closed and K is compact, then
F ∩ K is compact.
• (Rudin Thm 4.19) If f is a continuous mapping from compact metric space X to metric space Y , then
f is uniformly continuous on X.
3.2 Measure Theory, Integration and Differentiation
• (Stein Cor 3.5) Gδ sets are countable intersections of open sets, while Fσ sets are countable unions of
closed sets. A subset E ⊂ Rd is (Lebesgue)-measurable (i) iff E differs from a Gδ by a set of measure
zero, (ii) iff E differs from an Fσ by a set of measure zero.
• (Stein Thm 4.4) Egorov’s Theorem: Suppose {fk } is a sequence of measurable functions defined on
a measurable set E with m(E) < ∞, and assume fk → f a.e. on E. Given ε > 0, we can find closed
set Aε ⊂ E such that m(E − Aε ) < ε and fk → f uniformly on Aε .
• Suppose f : Rⁿ → Rᵐ. The Jacobian is an m × n matrix with entries J_{ij} = ∂f_i/∂x_j.
• Change of variables: See Ross p279.
• Lebesgue Differentiation Theorem: If f is a locally integrable function, then for almost every x, f(x) = lim_{r→0} (1/r) ∫_x^{x+r} f(y) dy.
• Mean Value Theorem: If f is continuous on [a, b] and differentiable on (a, b), then there exists c ∈ (a, b) such that f′(c) = [f(b) − f(a)]/(b − a) (i.e. f(b) = f(a) + (b − a)f′(c)).
• Differentiation under the integral sign: Let f(x, t) be such that f(x, t) and its partial derivative f_x(x, t) are continuous in t and x in some region of the (x, t) plane, including a(x) ≤ t ≤ b(x), x0 ≤ x ≤ x1. Also suppose that a(x) and b(x) are both continuous and both have continuous derivatives for x0 ≤ x ≤ x1. Then, for x0 ≤ x ≤ x1,
  (d/dx) ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x)) b′(x) − f(x, a(x)) a′(x) + ∫_{a(x)}^{b(x)} (∂/∂x) f(x, t) dt.
• Inverse Function Theorem (for functions of more than one variable): Let F : open set of Rⁿ → Rⁿ. If F is continuously differentiable and the total derivative of F is invertible at a point p, then an inverse function to F exists in some neighborhood of F(p). F⁻¹ is also continuously differentiable, with derivative (F⁻¹)′(F(p)) = [F′(p)]⁻¹.
• If f : Rⁿ → R is L-Lipschitz, i.e. |f(x) − f(y)| ≤ L|x − y|, and is differentiable, then for all x ∈ Rⁿ, ∥∇f(x)∥ ≤ L.
• (Durrett Ex 1.4.4) Riemann-Lebesgue Lemma: If g is integrable, then lim_{n→∞} ∫ g(x) cos(nx) dx = 0.
3.3 Approximations
• Stirling's Approximation for factorial: n! ∼ √(2πn) (n/e)ⁿ (i.e. the ratio goes to 1 as n → ∞).
• Stirling's Approximation for the gamma function: Γ(z) = √(2π/z) (z/e)^z [1 + O(1/z)] for large z, i.e. Γ(z + 1) ∼ √(2πz) (z/e)^z.
• Volume of ball in n-dimensional space: the volume of a ball with radius r in n-dimensional space is ∼ [1/√(nπ)] (2πe/n)^{n/2} rⁿ.
• Weierstrass Approximation Theorem: If h is continuous on [0, 1], then there exist polynomials p_n such that sup_{x∈[0,1]} |h(x) − p_n(x)| → 0 as n → ∞.
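The Weierstrass theorem has a constructive proof via Bernstein polynomials, B_n h(x) = Σ_{k=0}^n h(k/n) (n choose k) xᵏ(1 − x)^{n−k}. A small sketch showing the sup-gap shrinking for an assumed test function (continuous but not smooth):

```python
import math

def bernstein(h, n, x):
    # Bernstein polynomial of degree n for h, evaluated at x.
    return sum(h(k / n) * math.comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

h = lambda x: abs(x - 0.4)  # continuous on [0, 1], kink at 0.4
grid = [i / 200 for i in range(201)]
for n in [5, 20, 80]:
    gap = max(abs(h(x) - bernstein(h, n, x)) for x in grid)
    print(n, gap)  # the gap shrinks as n grows
```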
• (Rudin Thm 5.15) Taylor's Theorem: Suppose f is a real function on [a, b], n a positive integer, f^{(n−1)} continuous on [a, b], f^{(n)}(t) exists for every t ∈ (a, b). Let α, β be distinct points of [a, b]. Then there exists a point x between α and β such that
  f(β) = Σ_{k=0}^{n−1} [f^{(k)}(α)/k!] (β − α)ᵏ + [f^{(n)}(x)/n!] (β − α)ⁿ.
• Newton's method: Say we are trying to find the solution to f(x) = 0. If our current guess is x_k, one step of Newton's method gives us our next guess: x_{k+1} = x_k − f(x_k)/f′(x_k).
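A minimal sketch of the Newton iteration above, solving f(x) = x² − 2 = 0 (so the root is √2); the starting point and iteration count are arbitrary choices.

```python
def newton(f, fprime, x0, steps=20):
    # Iterate x_{k+1} = x_k - f(x_k) / f'(x_k).
    x = x0
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ≈ 1.41421356...
```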
• (310A Lec 9) For small x, log(1 − x) ∼ −x.
• As n → ∞, Σ_{k=1}^n 2 log(√k log k) ∼ n log n. (For proof, see 310A HW9.)
• (310B Lec 8) For large N, Σ_{k=j+1}^N 1/(k − 1) ≈ log(N/j).
3.4 Convergence
• (Rudin Thm 3.33) Root test: Given Σa_n, let α = lim sup_{n→∞} |a_n|^{1/n}. If α < 1, Σa_n converges. If α > 1, Σa_n diverges. If α = 1, the test gives no information.
• (Rudin Thm 3.34) Ratio test: If lim sup_{n→∞} |a_{n+1}/a_n| < 1, the series Σa_n converges. If |a_{n+1}/a_n| ≥ 1 for all large enough n, then Σa_n diverges.
• Convergence of infinite product: Let a_n be a sequence of positive numbers. Then Π_{n=1}^∞ (1 + a_n) converges iff Σ_{n=1}^∞ a_n converges, and likewise for Π_{n=1}^∞ (1 − a_n). (Proof takes logs and uses the fact that log(1 + x) ∼ x for small x.)
• (Rudin Thm 7.11) Suppose f_n → f uniformly on a set E in a metric space. Let x be a limit point of E and suppose that lim_{t→x} f_n(t) = A_n. Then {A_n} converges, and lim_{t→x} f(t) = lim_{n→∞} A_n.
• (Rudin Thm 7.16) If f_n → f uniformly on [a, b] and the f_n are integrable on [a, b], then f is integrable on [a, b] and ∫_a^b f(x) dx = lim_{n→∞} ∫_a^b f_n(x) dx.
• (Rudin Thm 7.17) Suppose {f_n} are differentiable on [a, b] and that {f_n(x0)} converges for some point x0 ∈ [a, b]. If {f_n′} converges uniformly on [a, b], then {f_n} converges uniformly on [a, b] to a function f, and for all x ∈ [a, b], f′(x) = lim_{n→∞} f_n′(x).
• Lusin's Theorem: Let A be a measurable subset of R with finite measure, and let f : A → R be measurable. Then for any ε > 0, there is a compact set K ⊆ A with m(A \ K) < ε such that the restriction of f to K is continuous.
• (Dembo Lem 2.2.11) Let yn be a sequence in a topological space. If every subsequence has a further
subsequence which converges to y, then yn → y.
• (310B Lec 10) If x_n is a sequence of positive real numbers increasing to infinity, then Σ_{n=1}^∞ (x_{n+1} − x_n)/x_{n+1} = ∞, and Σ_{n=1}^∞ (x_{n+1} − x_n)/x²_{n+1} < ∞.
• (310B Lec 19) Subadditive Lemma: Let {x_n} be a sequence of real numbers such that x_{n+m} ≤ x_n + x_m for all n, m. Then lim_{n→∞} x_n/n exists and is equal to inf_{n≥1} x_n/n.
• xᵀAx = xᵀ[(A + Aᵀ)/2]x. Thus, when considering quadratic forms, we may assume A is symmetric.
• For any matrix A, the row space and the column space of A have the same dimension (row rank equals column rank). In addition, Rank(AᵀA) = Rank(A).
• Determinant:
  – det(A) = Π_i λ_i (the product of the eigenvalues).
  – For square matrices A and B of equal size, det(AB) = det(A)det(B).
  – If A is an n × n matrix, det(cA) = cⁿ det(A).
  – Considering a matrix in block form, det [A B; 0 C] = (det A)(det C).
  – Sylvester's theorem: If A ∈ R^{m×n} and B ∈ R^{n×m}, then det(I_m + AB) = det(I_n + BA).
• Trace: tr(A) = Σ_{i=1}^n a_{ii} = sum of eigenvalues.
• Positive definite matrices: If A ∈ R^{n×n} is positive definite, then:
  – A has positive diagonal entries.
  – A is invertible.
  – A has a unique positive definite square root.
  – (300B Lec 4) sup_v (vᵀu)²/(vᵀAv) = uᵀA⁻¹u (with the sup attained at v = A⁻¹u).
• Positive semi-definite matrices: A ∈ R^{n×n} is PSD (A ⪰ 0) if ⟨Ax, x⟩ ≥ 0 for all x ∈ Rⁿ. PSD matrices have the corresponding properties of PD matrices as above. (A PSD matrix has a unique PSD square root.)
– If A’s smallest eigenvalue is λ > 0, then A ⪰ λI.
– If the smallest eigenvalue of A is ≥ the largest eigenvalue of B, then A ⪰ B.
• Perron-Frobenius Theorem: Let A be a positive matrix (i.e. all entries positive). Then the
following hold:
– There is an r > 0 such that r is an eigenvalue of A and any other eigenvalue λ must satisfy |λ|< r.
– r is a simple root of the characteristic polynomial of A, and hence its eigenspace has dimension 1.
– There exists an eigenvector v with all components positive such that Av = rv. (Respectively,
there exists a positive left eigenvector w with wT A = rwT .)
– There are no other non-negative eigenvectors except positive multiples of v. (Same for left eigen-
vectors.)
– lim Ak /rk = vwT , where v and w are normalized so that wT v = 1. Moreover, vwT is the
k→∞
projection onto the eigenspace corresponding to r. (This convergence is uniform.)
  – min_i Σ_j a_{ij} ≤ r ≤ max_i Σ_j a_{ij}.
• (305C p11) Perron-Frobenius Theorem v2: Let P ∈ [0, ∞)N ×N be a matrix with (possibly com-
plex) right eigenvalues λ1 , . . . , λN . Let ρ = max |λj |. Then P has an eigenvalue equal to ρ with a
1≤j≤N
corresponding eigenvector with all non-negative entries.
• Computational cost:
– Multiplying n × m matrix by m × p matrix: O(nmp). Multiplying two n × n matrices: O(n2.373 ).
– Inverting an n × n matrix: O(n2.373 ).
– QR decomposition for m × n matrix: O(mn2 ).
– SVD decomposition for m × n matrix: O(min(mn2 , m2 n)).
– Determinant of an n × n matrix: O(n2.373 ).
– Back substitution for an n × n triangular matrix: O(n2 ).
• QR Decomposition: Any real square matrix A may be decomposed as A = QR, where Q is an orthogonal matrix and R is an upper triangular matrix. (If A is invertible, then the factorization is unique if we require the diagonal elements of R to be positive.)
• Cholesky Decomposition: For any real symmetric positive semi-definite matrix A, there is a lower
triangular L such that A = LLT .
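A minimal sketch of the Cholesky factorization via the standard inner-product algorithm (no pivoting, so this assumes A is strictly positive definite; the example matrix is an arbitrary choice):

```python
import math

def cholesky(A):
    # Returns lower-triangular L with A = L L^T, for symmetric PD A.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky(A)
print(L)  # lower triangular; L L^T reconstructs A
```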
• The spectral radius of A (i.e. the largest absolute value of its eigenvalues) is always bounded above by ∥A∥_op.
• (Dembo Ex 4.3.6) Parallelogram Law: Let H be a linear vector space with an inner product. Then for any u, v ∈ H, ∥u + v∥² + ∥u − v∥² = 2∥u∥² + 2∥v∥².
• (Hoffman & Kunze Eqn 8-3) Polarization Identity: For a real inner product space, ⟨u, v⟩ = (1/4)∥u + v∥² − (1/4)∥u − v∥².
• (Hoffman & Kunze Thm 3.2) Rank-Nullity Theorem: Let T be a linear transformation from V into
W . Suppose that V is finite-dimensional. Then rank(T ) + nullity(T ) = dim V .
• (Hoffman & Kunze Eqn 8-8): In an inner product space, if v is a linear combination of an orthogonal sequence of non-zero vectors u1, . . . , um, then v = Σ_{k=1}^m [⟨v, uk⟩/∥uk∥²] uk.
• (Hoffman & Kunze Cor on p287) Bessel's Inequality: Let {a1, . . . , an} be an orthogonal set of non-zero vectors in an inner product space V. If b is any vector in V, then Σ_{k=1}^n |⟨b, ak⟩|²/∥ak∥² ≤ ∥b∥².
5 Useful Inequalities
• ex ≥ 1 + x and 1 − e−x ≤ x ∧ 1 for all x ∈ R.
• e|x| ≤ ex + e−x .
• For all x ∈ R, cosh x = (eˣ + e^{−x})/2 ≤ e^{x²/2}. (Proof in 310A HW2 Q5.)
• For positive x, log[(1 + e^{−x})/2] ≥ −x.
• e(n/e)ⁿ ≤ n! ≤ e((n + 1)/e)^{n+1}. Also √(2πn) (n/e)ⁿ < n! < √(2πn) (n/e)ⁿ e^{1/(12n)}.
• ∫₀^δ √(log(1 + 2δ/ε)) dε ≤ 2√2 δ. (Proof: Use log(1 + x) ≤ x.)
• (Durrett Ex 1.6.6) Let Y ≥ 0 with EY² < ∞. Then P(Y > 0) ≥ (EY)²/EY². (Proof uses Cauchy-Schwarz on Y·1_{Y>0}.)
• Cauchy-Schwarz inequality: For any 2 random variables X and Y , (E[XY ])2 ≤ E[X 2 ]E[Y 2 ], with
equality iff X = aY for some constant a ∈ R.
• Jensen’s inequality: Let X be a random variable and g a convex function such that E[g(X)] and
g(EX) are finite. Then E[g(X)] ≥ g(EX), with equality holding iff X is constant, or g is linear, or
there is a set A s.t. P(X ∈ A) = 1 and g is linear over A (i.e. there are a and b such that g(x) = ax + b
for all x ∈ A).
• Correlation inequality: For any real-valued random variable Y and increasing functions g and h, Cov(g(Y), h(Y)) ≥ 0.
• (300B HW5) Marcinkiewicz-Zygmund inequality: Let Xi be independent mean-zero random variables with E[|Xi|ᵏ] < ∞ for some k ≥ 1. Then there are constants A_k and B_k which depend only on k such that
  A_k E[(Σ_{i=1}^n Xi²)^{k/2}] ≤ E[|Σ_{i=1}^n Xi|ᵏ] ≤ B_k E[(Σ_{i=1}^n Xi²)^{k/2}].
  (When k = 2, we may take A2 = B2 = 1.)
• (300B HW8) Let V ∈ R be a random variable such that |V| ≤ D w.p. 1. Let E[V²] = σ². Then for all c ∈ [0, σ], P(|V| ≥ c) ≥ (σ² − c²)/(D² − c²).
• (Sub-G p18) Bounding of sub-Gaussian moments: If X is such that P(|X| > t) ≤ 2 exp(−t²/(2σ²)), then for any positive integer k, E[|X|ᵏ] ≤ (2σ²)^{k/2} k Γ(k/2). In particular, for k ≥ 2 we have (E[|X|ᵏ])^{1/k} ≤ σe^{1/e}√k.
6 Useful Integrals
• ∫₀^∞ e^{−x²} dx = √π/2.
• ∫₀^∞ (sin x)/x dx = π/2.
• For integer x, ∫_{−π/2}^{3π/2} e^{itx} dt = 2π if x = 0, and 0 otherwise.
• ∫ xeˣ dx = eˣ(x − 1) + C.
• ∫ x²eˣ dx = eˣ(x² − 2x + 2) + C.
• ∫ xe^{−x} dx = −e^{−x}(x + 1) + C.
• ∫ x²e^{−x} dx = −e^{−x}(x² + 2x + 2) + C.
• ∫ e^{−x}(1 − e^{−nx}) dx = e^{−(n+1)x}[1 − (n + 1)e^{nx}]/(n + 1) + C.
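A quick stdlib sketch checking two of the antiderivatives above on [0, 1] against midpoint-rule integration (the interval is an arbitrary choice):

```python
import math

def midpoint_integral(f, a, b, n=100000):
    # Composite midpoint rule for ∫_a^b f.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# ∫_0^1 x e^x dx = [e^x (x - 1)] from 0 to 1 = 0 - (-1) = 1.
print(midpoint_integral(lambda x: x * math.exp(x), 0, 1))
# ∫_0^1 x^2 e^x dx = [e^x (x^2 - 2x + 2)] from 0 to 1 = e - 2.
print(midpoint_integral(lambda x: x * x * math.exp(x), 0, 1), math.e - 2)
```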
• (Agresti Prob 9.19, 305B HW3) Assume X, Y and Z are categorical variables.
– If Y is jointly independent of X and Z, then X and Y are conditionally independent given Z.
– Mutual independence of X, Y and Z implies that X and Y are both marginally and conditionally
independent.
– If X ⊥ Y and Y ⊥ Z, it is not necessarily the case that X ⊥ Z.