Probability and Statistics Cheat Sheet

Copyright © Matthias Vallentin, 2011
vallentin@icir.org
6th March, 2011

This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-based Confidence Interval
  11.3 Empirical Distribution Function
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of Parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics

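1 Distribution Overview
1.1 Discrete Distributions
For each distribution: notation (see footnote 1), CDF $F_X(x)$, PMF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$.

Uniform, $\mathrm{Unif}\{a, \ldots, b\}$
  CDF: $0$ for $x < a$; $\frac{\lfloor x \rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$; $1$ for $x > b$
  PMF: $\frac{I(a \le x \le b)}{b - a + 1}$
  Mean: $\frac{a + b}{2}$
  Variance: $\frac{(b - a + 1)^2 - 1}{12}$
  MGF: $\frac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}$

Bernoulli, $\mathrm{Bern}(p)$
  CDF: $(1 - p)^{1 - x}$
  PMF: $p^x (1 - p)^{1 - x}$
  Mean: $p$
  Variance: $p(1 - p)$
  MGF: $1 - p + p e^{s}$

Binomial, $\mathrm{Bin}(n, p)$
  CDF: $I_{1-p}(n - x, x + 1)$
  PMF: $\binom{n}{x} p^x (1 - p)^{n - x}$
  Mean: $np$
  Variance: $np(1 - p)$
  MGF: $(1 - p + p e^{s})^n$

Multinomial, $\mathrm{Mult}(n, p)$
  PMF: $\frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$
  Mean: $n p_i$
  Variance: $n p_i (1 - p_i)$
  MGF: $\left( \sum_{i=1}^{k} p_i e^{s_i} \right)^n$

Hypergeometric, $\mathrm{Hyp}(N, m, n)$
  CDF: $\approx \Phi\left( \frac{x - np}{\sqrt{np(1 - p)}} \right)$ (with $p = m/N$)
  PMF: $\frac{\binom{m}{x} \binom{N - m}{n - x}}{\binom{N}{n}}$
  Mean: $\frac{nm}{N}$
  Variance: $\frac{nm(N - n)(N - m)}{N^2 (N - 1)}$
  MGF: N/A

Negative Binomial, $\mathrm{NBin}(r, p)$
  CDF: $I_p(r, x + 1)$
  PMF: $\binom{x + r - 1}{r - 1} p^r (1 - p)^x$
  Mean: $r \frac{1 - p}{p}$
  Variance: $r \frac{1 - p}{p^2}$
  MGF: $\left( \frac{p}{1 - (1 - p) e^{s}} \right)^r$

Geometric, $\mathrm{Geo}(p)$
  CDF: $1 - (1 - p)^x$, $x \in \mathbb{N}^+$
  PMF: $p (1 - p)^{x - 1}$, $x \in \mathbb{N}^+$
  Mean: $\frac{1}{p}$
  Variance: $\frac{1 - p}{p^2}$
  MGF: $\frac{p e^{s}}{1 - (1 - p) e^{s}}$

Poisson, $\mathrm{Po}(\lambda)$
  CDF: $e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$
  PMF: $\frac{\lambda^x e^{-\lambda}}{x!}$
  Mean: $\lambda$
  Variance: $\lambda$
  MGF: $e^{\lambda (e^{s} - 1)}$
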
[Figure: PMF plots (x on the horizontal axis, PMF on the vertical axis) for the discrete Uniform on {a, ..., b} with constant height 1/n; Binomial with (n = 40, p = 0.3), (n = 30, p = 0.6), (n = 25, p = 0.9); Geometric with p = 0.2, 0.5, 0.8; Poisson with λ = 1, 4, 10.]

1 We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and Ix to refer to the Beta functions (see §22.2).
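1.2 Continuous Distributions
For each distribution: notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$. Entries not listed in the original table are omitted.

Uniform, $\mathrm{Unif}(a, b)$
  CDF: $0$ for $x < a$; $\frac{x - a}{b - a}$ for $a < x < b$; $1$ for $x > b$
  PDF: $\frac{I(a < x < b)}{b - a}$
  Mean: $\frac{a + b}{2}$
  Variance: $\frac{(b - a)^2}{12}$
  MGF: $\frac{e^{sb} - e^{sa}}{s(b - a)}$

Normal, $\mathcal{N}(\mu, \sigma^2)$
  CDF: $\Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt$
  PDF: $\phi(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
  Mean: $\mu$
  Variance: $\sigma^2$
  MGF: $\exp\left( \mu s + \frac{\sigma^2 s^2}{2} \right)$

Log-Normal, $\ln\mathcal{N}(\mu, \sigma^2)$
  CDF: $\frac{1}{2} + \frac{1}{2} \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right)$
  PDF: $\frac{1}{x \sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)$
  Mean: $e^{\mu + \sigma^2/2}$
  Variance: $(e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$

Multivariate Normal, $\mathrm{MVN}(\mu, \Sigma)$
  PDF: $(2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
  Mean: $\mu$
  Variance: $\Sigma$
  MGF: $\exp\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)$

Student's t, $\mathrm{Student}(\nu)$
  CDF: $I_x\left( \frac{\nu}{2}, \frac{\nu}{2} \right)$
  PDF: $\frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu + 1)/2}$
  Mean: $0$ ($\nu > 1$)
  Variance: $\frac{\nu}{\nu - 2}$ ($\nu > 2$)

Chi-square, $\chi^2_k$
  CDF: $\frac{1}{\Gamma(k/2)} \gamma\left( \frac{k}{2}, \frac{x}{2} \right)$
  PDF: $\frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$
  Mean: $k$
  Variance: $2k$
  MGF: $(1 - 2s)^{-k/2}$, $s < 1/2$

F, $\mathrm{F}(d_1, d_2)$
  CDF: $I_{\frac{d_1 x}{d_1 x + d_2}}\left( \frac{d_1}{2}, \frac{d_2}{2} \right)$
  PDF: $\frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\, B\left( \frac{d_1}{2}, \frac{d_2}{2} \right)}$
  Mean: $\frac{d_2}{d_2 - 2}$
  Variance: $\frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$

Exponential, $\mathrm{Exp}(\beta)$
  CDF: $1 - e^{-x/\beta}$
  PDF: $\frac{1}{\beta} e^{-x/\beta}$
  Mean: $\beta$
  Variance: $\beta^2$
  MGF: $\frac{1}{1 - \beta s}$, $s < 1/\beta$

Gamma, $\mathrm{Gamma}(\alpha, \beta)$
  CDF: $\frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$
  PDF: $\frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha - 1} e^{-x/\beta}$
  Mean: $\alpha\beta$
  Variance: $\alpha\beta^2$
  MGF: $\left( \frac{1}{1 - \beta s} \right)^{\alpha}$, $s < 1/\beta$

Inverse Gamma, $\mathrm{InvGamma}(\alpha, \beta)$
  CDF: $\frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$
  PDF: $\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}$
  Mean: $\frac{\beta}{\alpha - 1}$, $\alpha > 1$
  Variance: $\frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$
  MGF: $\frac{2 (-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_{\alpha}\left( \sqrt{-4\beta s} \right)$

Dirichlet, $\mathrm{Dir}(\alpha)$
  PDF: $\frac{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$
  Mean: $\frac{\alpha_i}{\sum_{i=1}^{k} \alpha_i}$
  Variance: $\frac{E[X_i] (1 - E[X_i])}{\sum_{i=1}^{k} \alpha_i + 1}$

Beta, $\mathrm{Beta}(\alpha, \beta)$
  CDF: $I_x(\alpha, \beta)$
  PDF: $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
  Mean: $\frac{\alpha}{\alpha + \beta}$
  Variance: $\frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
  MGF: $1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^k}{k!}$

Weibull, $\mathrm{Weibull}(\lambda, k)$
  CDF: $1 - e^{-(x/\lambda)^k}$
  PDF: $\frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k - 1} e^{-(x/\lambda)^k}$
  Mean: $\lambda \Gamma\left( 1 + \frac{1}{k} \right)$
  Variance: $\lambda^2 \Gamma\left( 1 + \frac{2}{k} \right) - \mu^2$
  MGF: $\sum_{n=0}^{\infty} \frac{s^n \lambda^n}{n!} \Gamma\left( 1 + \frac{n}{k} \right)$

Pareto, $\mathrm{Pareto}(x_m, \alpha)$
  CDF: $1 - \left( \frac{x_m}{x} \right)^{\alpha}$, $x \ge x_m$
  PDF: $\alpha \frac{x_m^{\alpha}}{x^{\alpha + 1}}$, $x \ge x_m$
  Mean: $\frac{\alpha x_m}{\alpha - 1}$, $\alpha > 1$
  Variance: $\frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$
  MGF: $\alpha (-x_m s)^{\alpha} \Gamma(-\alpha, -x_m s)$, $s < 0$
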
[Figure: PDF plots (x on the horizontal axis, PDF on the vertical axis) for twelve continuous distributions.
 Uniform (continuous) on (a, b) with height 1/(b − a).
 Normal: (µ = 0, σ² = 0.2), (µ = 0, σ² = 1), (µ = 0, σ² = 5), (µ = −2, σ² = 0.5).
 Log-normal: (µ = 0, σ² = 3), (µ = 2, σ² = 2), (µ = 0, σ² = 1), (µ = 0.5, σ² = 1), (µ = 0.25, σ² = 1), (µ = 0.125, σ² = 1).
 Student's t: ν = 1, 2, 5, ∞.
 χ²: k = 1, 2, 3, 4, 5.
 F: (d1 = 1, d2 = 1), (d1 = 2, d2 = 1), (d1 = 5, d2 = 2), (d1 = 100, d2 = 1), (d1 = 100, d2 = 100).
 Exponential: β = 2, 1, 0.4.
 Gamma: (α = 1, β = 2), (α = 2, β = 2), (α = 3, β = 2), (α = 5, β = 1), (α = 9, β = 0.5).
 Inverse Gamma: (α = 1, β = 1), (α = 2, β = 1), (α = 3, β = 1), (α = 3, β = 0.5).
 Beta: (α = 0.5, β = 0.5), (α = 5, β = 1), (α = 1, β = 3), (α = 2, β = 2), (α = 2, β = 5).
 Weibull: λ = 1 with k = 0.5, 1, 1.5, 5.
 Pareto: x_m = 1 with α = 1, 2, 4.]

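2 Probability Theory

Definitions
• Sample space $\Omega$
• Outcome (point or element) $\omega \in \Omega$
• Event $A \subseteq \Omega$
• σ-algebra $\mathcal{A}$
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
• Probability distribution $P$
  1. $P[A] \ge 0$ for every $A$
  2. $P[\Omega] = 1$
  3. $P\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]$
• Probability space $(\Omega, \mathcal{A}, P)$

Properties
• $P[\emptyset] = 0$
• $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
• $P[\neg A] = 1 - P[A]$
• $P[B] = P[A \cap B] + P[\neg A \cap B]$
• $P[\Omega] = 1$, $P[\emptyset] = 0$
• DeMorgan: $\neg\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n$ and $\neg\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n$
• $P\left[ \bigcup_n A_n \right] = 1 - P\left[ \bigcap_n \neg A_n \right]$
• $P[A \cup B] = P[A] + P[B] - P[A \cap B] \implies P[A \cup B] \le P[A] + P[B]$
• $P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]$
• $P[A \cap \neg B] = P[A] - P[A \cap B]$

Continuity of Probabilities
• $A_1 \subset A_2 \subset \ldots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
• $A_1 \supset A_2 \supset \ldots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence
$A \perp\!\!\!\perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}$ if $P[B] > 0$

Law of Total Probability
$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Bayes' Theorem
$P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Inclusion-Exclusion Principle
$P\left[ \bigcup_{i=1}^{n} A_i \right] = \sum_{r=1}^{n} (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le n} P\left[ \bigcap_{j=1}^{r} A_{i_j} \right]$

3 Random Variables

Random Variable
$X : \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
$P[a \le X \le b] = \int_a^b f(x)\, dx$

Cumulative Distribution Function (CDF)
$F_X : \mathbb{R} \to [0, 1]$, $F_X(x) = P[X \le x]$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density
$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\, dy$, $a \le b$
$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence
1. $P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x) f_Y(y)$
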
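As a numeric illustration of the law of total probability and Bayes' theorem from Section 2, the following minimal sketch computes the posterior probabilities over a partition. The function name `posterior` and the numbers used (1% prevalence, 95% sensitivity, 10% false-positive rate) are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P[A_i | B] for a partition A_1, ..., A_n of the sample space.

    priors[i]      = P[A_i]      (assumed to sum to 1)
    likelihoods[i] = P[B | A_i]
    """
    # Law of total probability: P[B] = sum_i P[B | A_i] P[A_i]
    p_b = sum(l * p for l, p in zip(likelihoods, priors))
    # Bayes' theorem applied to every element of the partition
    return [l * p / p_b for l, p in zip(likelihoods, priors)]


# Hypothetical example: A_1 = "condition present", A_2 = "condition absent",
# B = "test positive".
post = posterior([0.01, 0.99], [0.95, 0.10])
print(post)
```

Even with a fairly accurate test, the posterior probability of the rare event stays small because the prior P[A_1] is small.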
3.1 Transformations

Transformation function
$Z = \varphi(X)$

Discrete
$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\, dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone
$f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x) \frac{1}{|J|}$

The Rule of the Lazy Statistician
$E[Z] = \int \varphi(x)\, dF_X(x)$
$E[I_A(X)] = \int I_A(x)\, dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution
• $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\, dx$
• $Z := |X - Y|$: $f_Z(z) = 2 \int_0^{\infty} f_{X,Y}(x, z + x)\, dx$
• $Z := \frac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(yz, y)\, dy$, which for $X \perp\!\!\!\perp Y$ equals $\int_{-\infty}^{\infty} |y|\, f_X(yz) f_Y(y)\, dy$

4 Expectation

Expectation
• $E[X] = \mu_X = \int x\, dF_X(x)$, which is $\sum_x x f_X(x)$ if $X$ is discrete and $\int x f_X(x)\, dx$ if $X$ is continuous
• $P[X = c] = 1 \implies E[X] = c$
• $E[cX] = c\, E[X]$
• $E[X + Y] = E[X] + E[Y]$
• $E[XY] = \int\!\!\int xy\, f_{X,Y}(x, y)\, dx\, dy$
• In general $E[\varphi(X)] \ne \varphi(E[X])$ (cf. Jensen inequality)
• $P[X \ge Y] = 1 \implies E[X] \ge E[Y]$, and $P[X = Y] = 1 \implies E[X] = E[Y]$
• $E[X] = \sum_{x=1}^{\infty} P[X \ge x]$ (for $X$ taking values in the non-negative integers)

Sample mean
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$

Conditional Expectation
• $E[Y \mid X = x] = \int y f(y \mid x)\, dy$
• $E[X] = E[E[X \mid Y]]$
• $E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y) f_{Y|X}(y \mid x)\, dy$
• $E[\varphi(Y, Z) \mid X = x] = \int\!\!\int_{-\infty}^{\infty} \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\, dy\, dz$
• $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
• $E[\varphi(X) Y \mid X] = \varphi(X)\, E[Y \mid X]$
• $E[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Variance
• $V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2$
• $V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j]$
• $V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]$ if $X_i \perp\!\!\!\perp X_j$ for all $i \ne j$

Standard deviation
$\mathrm{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance
• $\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\, E[Y]$
• $\mathrm{Cov}[X, a] = 0$
• $\mathrm{Cov}[X, X] = V[X]$
• $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
• $\mathrm{Cov}[aX, bY] = ab\, \mathrm{Cov}[X, Y]$
• $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
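• $\mathrm{Cov}\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j]$

Correlation
$\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{V[X]\, V[Y]}}$

Independence
$X \perp\!\!\!\perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff E[XY] = E[X]\, E[Y]$

Sample variance
$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

Conditional Variance
• $V[Y \mid X] = E[(Y - E[Y \mid X])^2 \mid X] = E[Y^2 \mid X] - E[Y \mid X]^2$
• $V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$

6 Inequalities

Cauchy-Schwarz
$E[XY]^2 \le E[X^2]\, E[Y^2]$

Markov
$P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$ (for $\varphi \ge 0$ and $t > 0$)

Chebyshev
$P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$

Chernoff
$P[X \ge (1 + \delta)\mu] \le \left( \frac{e^{\delta}}{(1 + \delta)^{1 + \delta}} \right)^{\mu}$, $\delta > -1$ (for $X$ a sum of independent Bernoulli variables with $E[X] = \mu$)

Jensen
$E[\varphi(X)] \ge \varphi(E[X])$ for $\varphi$ convex

7 Distribution Relationships

Binomial
• $X_i \sim \mathrm{Bern}(p)$ independent $\implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$
• $X \sim \mathrm{Bin}(n, p)$, $Y \sim \mathrm{Bin}(m, p)$, $X \perp\!\!\!\perp Y \implies X + Y \sim \mathrm{Bin}(n + m, p)$
• $\lim_{n \to \infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$ ($n$ large, $p$ small)
• $\lim_{n \to \infty} \mathrm{Bin}(n, p) = \mathcal{N}(np, np(1 - p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
• $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
• $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r} \mathrm{Geo}(p)$
• $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum_i X_i \sim \mathrm{NBin}\left( \sum_i r_i, p \right)$
• $X \sim \mathrm{NBin}(r, p)$, $Y \sim \mathrm{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
• $X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left( \sum_{i=1}^{n} \lambda_i \right)$
• $X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \right)$

Exponential
• $X_i \sim \mathrm{Exp}(\beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
• Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
• $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
• $X \sim \mathcal{N}(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
• $X \sim \mathcal{N}(\mu_1, \sigma_1^2) \wedge Y \sim \mathcal{N}(\mu_2, \sigma_2^2) \wedge X \perp\!\!\!\perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
• $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
• $P[a < X \le b] = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right)$
• $\Phi(-x) = 1 - \Phi(x)$, $\phi'(x) = -x\phi(x)$, $\phi''(x) = (x^2 - 1)\phi(x)$
• Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1 - \alpha)$

Gamma
• $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
• $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \mathrm{Exp}(\beta)$
• $X_i \sim \mathrm{Gamma}(\alpha_i, \beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left( \sum_i \alpha_i, \beta \right)$
• $\frac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\, dx$

Beta
• $\frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
• $E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]$
• $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$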
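The Binomial-to-Poisson relationship from Section 7 (n large, p small) can be checked numerically. The sketch below compares the two PMFs pointwise; the helper names and the values n = 1000, p = 0.003 are assumptions chosen for illustration only.

```python
import math


def binom_pmf(x, n, p):
    # PMF of Bin(n, p): C(n, x) p^x (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    # PMF of Po(lambda): lambda^x e^(-lambda) / x!
    return math.exp(-lam) * lam**x / math.factorial(x)


n, p = 1000, 0.003  # hypothetical "n large, p small" setting
# Largest pointwise gap between Bin(n, p) and Po(np) over x = 0, ..., 20
max_gap = max(
    abs(binom_pmf(x, n, p) - poisson_pmf(x, n * p)) for x in range(21)
)
print(max_gap)
```

Shrinking p while holding np fixed should make the gap smaller, which is one way to see the limit statement at work.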
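8 Probability and Moment Generating Functions

• $G_X(t) = E[t^X]$, $|t| < 1$
• $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\left[ \sum_{i=0}^{\infty} \frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty} \frac{E[X^i]}{i!} t^i$
• $P[X = 0] = G_X(0)$
• $P[X = 1] = G_X'(0)$
• $P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
• $E[X] = G_X'(1^-)$
• $E[X^k] = M_X^{(k)}(0)$
• $E\left[ \frac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)$
• $V[X] = G_X''(1^-) + G_X'(1^-) - (G_X'(1^-))^2$
• $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let $X, Z \sim \mathcal{N}(0, 1) \wedge X \perp\!\!\!\perp Z$ with $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
$f(x, y) = \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left( -\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)} \right)$

Conditionals
$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$

Independence
$X \perp\!\!\!\perp Y \iff \rho = 0$

9.2 Bivariate Normal
Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.
$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left( -\frac{z}{2(1 - \rho^2)} \right)$
$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right) \left( \frac{y - \mu_y}{\sigma_y} \right)$

Conditional mean and variance
$E[X \mid Y] = E[X] + \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y])$
$V[X \mid Y] = \sigma_X^2 (1 - \rho^2)$

9.3 Multivariate Normal
Covariance Matrix $\Sigma$ (Precision Matrix $\Sigma^{-1}$)
$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \mathrm{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
$f_X(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$

Properties
• $Z \sim \mathcal{N}(0, I) \wedge X = \mu + \Sigma^{1/2} Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
• $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2} (X - \mu) \sim \mathcal{N}(0, I)$
• $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A \Sigma A^T)$
• $X \sim \mathcal{N}(\mu, \Sigma) \wedge a$ is a vector of length $k \implies a^T X \sim \mathcal{N}(a^T \mu, a^T \Sigma a)$

10 Convergence
Let $\{X_1, X_2, \ldots\}$ be a sequence of rv's and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of Convergence
1. In distribution (weakly, in law): $X_n \stackrel{D}{\to} X$
   $\lim_{n \to \infty} F_n(t) = F(t)$ for all $t$ where $F$ is continuous
2. In probability: $X_n \stackrel{P}{\to} X$
   $(\forall \varepsilon > 0)\ \lim_{n \to \infty} P[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \stackrel{as}{\to} X$
   $P\left[ \lim_{n \to \infty} X_n = X \right] = P\left[ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right] = 1$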
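4. In quadratic mean ($L_2$): $X_n \stackrel{qm}{\to} X$
   $\lim_{n \to \infty} E[(X_n - X)^2] = 0$

Relationships
• $X_n \stackrel{qm}{\to} X \implies X_n \stackrel{P}{\to} X \implies X_n \stackrel{D}{\to} X$
• $X_n \stackrel{as}{\to} X \implies X_n \stackrel{P}{\to} X$
• $X_n \stackrel{D}{\to} X \wedge (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \stackrel{P}{\to} X$
• $X_n \stackrel{P}{\to} X \wedge Y_n \stackrel{P}{\to} Y \implies X_n + Y_n \stackrel{P}{\to} X + Y$
• $X_n \stackrel{qm}{\to} X \wedge Y_n \stackrel{qm}{\to} Y \implies X_n + Y_n \stackrel{qm}{\to} X + Y$
• $X_n \stackrel{P}{\to} X \wedge Y_n \stackrel{P}{\to} Y \implies X_n Y_n \stackrel{P}{\to} XY$
• $X_n \stackrel{P}{\to} X \implies \varphi(X_n) \stackrel{P}{\to} \varphi(X)$ ($\varphi$ continuous)
• $X_n \stackrel{D}{\to} X \implies \varphi(X_n) \stackrel{D}{\to} \varphi(X)$ ($\varphi$ continuous)
• $X_n \stackrel{qm}{\to} b \iff \lim_{n \to \infty} E[X_n] = b \wedge \lim_{n \to \infty} V[X_n] = 0$
• $X_1, \ldots, X_n$ iid $\wedge\ E[X] = \mu \wedge V[X] < \infty \implies \bar{X}_n \stackrel{qm}{\to} \mu$

Slutsky's Theorem
• $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{P}{\to} c \implies X_n + Y_n \stackrel{D}{\to} X + c$
• $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{P}{\to} c \implies X_n Y_n \stackrel{D}{\to} cX$
• In general: $X_n \stackrel{D}{\to} X$ and $Y_n \stackrel{D}{\to} Y$ does not imply $X_n + Y_n \stackrel{D}{\to} X + Y$

10.1 Law of Large Numbers (LLN)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] < \infty$.

Weak (WLLN)
$\bar{X}_n \stackrel{P}{\to} \mu$ as $n \to \infty$

Strong (SLLN)
$\bar{X}_n \stackrel{as}{\to} \mu$ as $n \to \infty$

10.2 Central Limit Theorem (CLT)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] = \sigma^2$.
$Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \stackrel{D}{\to} Z$ where $Z \sim \mathcal{N}(0, 1)$
$\lim_{n \to \infty} P[Z_n \le z] = \Phi(z)$, $z \in \mathbb{R}$

CLT Notations
$Z_n \approx \mathcal{N}(0, 1)$
$\bar{X}_n \approx \mathcal{N}\left( \mu, \frac{\sigma^2}{n} \right)$
$\bar{X}_n - \mu \approx \mathcal{N}\left( 0, \frac{\sigma^2}{n} \right)$
$\sqrt{n} (\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
$\frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity Correction
$P[\bar{X}_n \le x] \approx \Phi\left( \frac{x + \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$
$P[\bar{X}_n \ge x] \approx 1 - \Phi\left( \frac{x - \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$

Delta Method
$Y_n \approx \mathcal{N}\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx \mathcal{N}\left( \varphi(\mu), (\varphi'(\mu))^2 \frac{\sigma^2}{n} \right)$

11 Statistical Inference
Let $X_1, \ldots, X_n \stackrel{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation
• Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \ldots, X_n)$
• $\mathrm{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$
• Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
• Sampling distribution: $F(\hat{\theta}_n)$
• Standard error: $\mathrm{se}(\hat{\theta}_n) = \sqrt{V[\hat{\theta}_n]}$
• Mean squared error: $\mathrm{mse} = E[(\hat{\theta}_n - \theta)^2] = \mathrm{bias}(\hat{\theta}_n)^2 + V[\hat{\theta}_n]$
• $\lim_{n \to \infty} \mathrm{bias}(\hat{\theta}_n) = 0 \wedge \lim_{n \to \infty} \mathrm{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
• Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
• Slutsky's Theorem often lets us replace $\mathrm{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.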
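The CLT statement of Section 10.2 can be illustrated by simulation. The following minimal sketch standardizes sample means of Exp(1) draws and compares the empirical frequency of $Z_n \le 1.96$ with $\Phi(1.96)$. The sample size, replication count, seed, and helper name `standardized_mean` are assumptions made for this illustration.

```python
import random
import statistics


def standardized_mean(sample, mu, sigma):
    # Z_n = sqrt(n) (X_bar_n - mu) / sigma
    n = len(sample)
    return (n**0.5) * (statistics.fmean(sample) - mu) / sigma


rng = random.Random(0)  # fixed seed so the run is reproducible
n, reps = 50, 2000

# Exp(1) has mean 1 and standard deviation 1.
z_values = [
    standardized_mean([rng.expovariate(1.0) for _ in range(n)], 1.0, 1.0)
    for _ in range(reps)
]

empirical = sum(z <= 1.96 for z in z_values) / reps
theoretical = statistics.NormalDist().cdf(1.96)  # Phi(1.96)
print(empirical, theoretical)
```

Because the exponential distribution is skewed, the agreement at n = 50 is approximate; increasing n should tighten it.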
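11.2 Normal-based Confidence Interval
Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - (\alpha/2))$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
$C_n = \hat{\theta}_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$

11.3 Empirical Distribution Function
Empirical Distribution Function (ECDF)
$\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}$
$I(X_i \le x) = 1$ if $X_i \le x$, and $0$ if $X_i > x$

Properties (for any fixed $x$)
• $E[\hat{F}_n] = F(x)$
• $V[\hat{F}_n] = \frac{F(x)(1 - F(x))}{n}$
• $\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
• $\hat{F}_n \stackrel{P}{\to} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality ($X_1, \ldots, X_n \sim F$)
$P\left[ \sup_x \left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2 e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
$L(x) = \max\{\hat{F}_n - \epsilon_n, 0\}$
$U(x) = \min\{\hat{F}_n + \epsilon_n, 1\}$
$\epsilon_n = \sqrt{\frac{1}{2n} \log\left( \frac{2}{\alpha} \right)}$
$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$

11.4 Statistical Functionals
• Statistical functional: $T(F)$
• Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
• Linear functional: $T(F) = \int \varphi(x)\, dF_X(x)$
• Plug-in estimator for linear functional: $T(\hat{F}_n) = \int \varphi(x)\, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)$
• Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\mathrm{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$
• $p$th quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
• $\hat{\mu} = \bar{X}_n$
• $\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
• $\hat{\kappa} = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
• $\hat{\rho} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$

12 Parametric Inference
Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subset \mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.

12.1 Method of Moments
$j$th moment
$\alpha_j(\theta) = E[X^j] = \int x^j\, dF_X(x)$

$j$th sample moment
$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$

Method of Moments Estimator (MoM): $\hat{\theta}_n$ solves
$\alpha_1(\theta) = \hat{\alpha}_1$
$\alpha_2(\theta) = \hat{\alpha}_2$
$\vdots$
$\alpha_k(\theta) = \hat{\alpha}_k$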
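The ECDF and the DKW confidence band of Section 11.3 can be sketched as follows. The function names `ecdf` and `dkw_band` and the small data set are hypothetical; the band uses $\epsilon_n = \sqrt{\log(2/\alpha) / (2n)}$ as stated above.

```python
import math


def ecdf(sample):
    """Return the empirical CDF F_hat_n as a function of x."""
    xs = list(sample)
    n = len(xs)

    def f_hat(x):
        # fraction of observations with X_i <= x
        return sum(xi <= x for xi in xs) / n

    return f_hat


def dkw_band(sample, alpha=0.05):
    """Return (L, U, eps): the 1 - alpha DKW band functions and half-width."""
    n = len(sample)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    f_hat = ecdf(sample)

    def lower(x):
        return max(f_hat(x) - eps, 0.0)

    def upper(x):
        return min(f_hat(x) + eps, 1.0)

    return lower, upper, eps


data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.2, 4.0]  # hypothetical observations
f_hat = ecdf(data)
lower, upper, eps = dkw_band(data, alpha=0.05)
print(f_hat(1.7), lower(1.7), upper(1.7), eps)
```

With only eight observations the band is wide; its half-width shrinks at rate $1/\sqrt{n}$.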
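Properties of the MoM estimator
• $\hat{\theta}_n$ exists with probability tending to 1
• Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
• Asymptotic normality: $\sqrt{n} (\hat{\theta} - \theta) \stackrel{D}{\to} \mathcal{N}(0, \Sigma)$
  where $\Sigma = g\, E[Y Y^T]\, g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial}{\partial\theta} \alpha_j^{-1}(\theta)$

12.2 Maximum Likelihood
Likelihood: $L_n : \Theta \to [0, \infty)$
$L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood
$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$

Maximum Likelihood Estimator (mle)
$L_n(\hat{\theta}_n) = \sup_{\theta} L_n(\theta)$

Score Function
$s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta)$

Fisher Information
$I(\theta) = V_{\theta}[s(X; \theta)]$
$I_n(\theta) = n I(\theta)$

Fisher Information (exponential family)
$I(\theta) = E_{\theta}\left[ -\frac{\partial}{\partial\theta} s(X; \theta) \right]$

Observed Fisher Information
$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2} \sum_{i=1}^{n} \log f(X_i; \theta)$

Properties of the mle
• Consistency: $\hat{\theta}_n \stackrel{P}{\to} \theta$
• Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
• Asymptotic normality:
  1. $\mathrm{se} \approx \sqrt{1 / I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
  2. $\widehat{\mathrm{se}} \approx \sqrt{1 / I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\mathrm{se}}} \stackrel{D}{\to} \mathcal{N}(0, 1)$
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is
  $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{V[\hat{\theta}_n]}{V[\tilde{\theta}_n]} \le 1$
• Approximately the Bayes estimator

12.2.1 Delta Method
If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$\frac{\hat{\tau}_n - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\hat{\tau} = \varphi(\hat{\theta})$ is the mle of $\tau$ and
$\widehat{\mathrm{se}}(\hat{\tau}) = \left| \varphi'(\hat{\theta}) \right| \widehat{\mathrm{se}}(\hat{\theta}_n)$

12.3 Multiparameter Models
Let $\theta = (\theta_1, \ldots, \theta_k)$ and $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ be the mle.
$H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2}$, $H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\, \partial\theta_k}$

Fisher Information Matrix
$I_n(\theta) = -\begin{pmatrix} E_{\theta}[H_{11}] & \cdots & E_{\theta}[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_{\theta}[H_{k1}] & \cdots & E_{\theta}[H_{kk}] \end{pmatrix}$

Under appropriate regularity conditions
$(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$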
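with $J_n(\theta) = I_n^{-1}$. Further, if $\hat{\theta}_j$ is the $j$th component of $\hat{\theta}$, then
$\frac{\hat{\theta}_j - \theta_j}{\widehat{\mathrm{se}}_j} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$

12.3.1 Multiparameter Delta Method
Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ be a function and let the gradient of $\varphi$ be
$\nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T$
Suppose $\nabla\varphi \big|_{\theta = \hat{\theta}} \ne 0$ and $\hat{\tau} = \varphi(\hat{\theta})$. Then,
$\frac{\hat{\tau} - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \stackrel{D}{\to} \mathcal{N}(0, 1)$
where
$\widehat{\mathrm{se}}(\hat{\tau}) = \sqrt{ (\hat{\nabla}\varphi)^T \hat{J}_n (\hat{\nabla}\varphi) }$
and $\hat{J}_n = J_n(\hat{\theta})$ and $\hat{\nabla}\varphi = \nabla\varphi \big|_{\theta = \hat{\theta}}$.

12.4 Parametric Bootstrap
Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or method of moments estimator.

13 Hypothesis Testing
$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions
• Null hypothesis $H_0$
• Alternative hypothesis $H_1$
• Simple hypothesis $\theta = \theta_0$
• Composite hypothesis $\theta > \theta_0$ or $\theta < \theta_0$
• Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
• One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
• Critical value $c$
• Test statistic $T$
• Rejection Region $R = \{x : T(x) > c\}$
• Power function $\beta(\theta) = P[X \in R]$
• Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
• Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

           | Retain H0          | Reject H0
H0 true    | √                  | Type I error (α)
H1 true    | Type II error (β)  | √ (power)

p-value
• p-value $= \sup_{\theta \in \Theta_0} P_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
• p-value $= \sup_{\theta \in \Theta_0} P_{\theta}[T(X^{\star}) \ge T(X)] = \inf\{\alpha : T(X) \in R_{\alpha}\}$, where $P_{\theta}[T(X^{\star}) \ge T(X)] = 1 - F_{\theta}(T(X))$ since $T(X^{\star}) \sim F_{\theta}$

p-value      | evidence
< 0.01       | very strong evidence against H0
0.01 − 0.05  | strong evidence against H0
0.05 − 0.1   | weak evidence against H0
> 0.1        | little or no evidence against H0

Wald Test
• Two-sided test
• Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}$
• $P[|W| > z_{\alpha/2}] \to \alpha$
• p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood Ratio Test (LRT)
• $T(X) = \frac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \frac{L_n(\hat{\theta}_n)}{L_n(\hat{\theta}_{n,0})}$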
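• $\lambda(X) = 2 \log T(X) \stackrel{D}{\to} \chi^2_{r-q}$ where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ with $Z_1, \ldots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$
• p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
• Let $\hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)$ be the mle
• $T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^{k} \left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
• $\lambda(X) = 2 \sum_{j=1}^{k} X_j \log\left( \frac{\hat{p}_j}{p_{0j}} \right) \stackrel{D}{\to} \chi^2_{k-1}$
• The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1, \alpha}$

Pearson $\chi^2$ Test
• $T = \sum_{j=1}^{k} \frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = n p_{0j}$ under $H_0$
• $T \stackrel{D}{\to} \chi^2_{k-1}$
• p-value $= P[\chi^2_{k-1} > T(x)]$
• Converges in distribution to $\chi^2_{k-1}$ faster than the LRT, hence preferable for small $n$

Independence Testing
• $I$ rows, $J$ columns, $X$ multinomial sample of size $n$ over $I \cdot J$ cells
• mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
• mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n} \frac{X_{\cdot j}}{n}$
• LRT: $\lambda = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\left( \frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)$
• Pearson $\chi^2$: $T = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
• LRT and Pearson $\stackrel{D}{\to} \chi^2_{\nu}$, where $\nu = (I - 1)(J - 1)$

14 Bayesian Inference

Bayes' Theorem
$f(\theta \mid x^n) = \frac{f(x^n \mid \theta) f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta) f(\theta)}{\int f(x^n \mid \theta) f(\theta)\, d\theta} \propto L_n(\theta) f(\theta)$

Definitions
• $X^n = (X_1, \ldots, X_n)$
• $x^n = (x_1, \ldots, x_n)$
• Prior density $f(\theta)$
• Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta)$
• Posterior density $f(\theta \mid x^n)$
• Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta) f(\theta)\, d\theta$
• Kernel: part of a density that depends on $\theta$
• Posterior Mean $\bar{\theta}_n = \int \theta f(\theta \mid x^n)\, d\theta = \frac{\int \theta L_n(\theta) f(\theta)\, d\theta}{\int L_n(\theta) f(\theta)\, d\theta}$

14.1 Credible Intervals
$1 - \alpha$ Posterior Interval
$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$

$1 - \alpha$ Equal-tail Credible Interval
$\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_{b}^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$

$1 - \alpha$ Highest Posterior Density (HPD) region $R_n$
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
If the posterior $f(\theta \mid x^n)$ is unimodal, then $R_n$ is an interval.

14.2 Function of Parameters
Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\, d\theta$

Posterior Density
$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian Delta Method
$\tau \mid X^n \approx \mathcal{N}\left( \varphi(\hat{\theta}),\ \widehat{\mathrm{se}}\, \left| \varphi'(\hat{\theta}) \right| \right)$, where the second argument is the approximate posterior standard deviation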
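The posterior mean and the equal-tail credible interval of Section 14.1 can be approximated on a grid when no closed form is at hand. The sketch below assumes Bernoulli data with a flat prior on (0, 1), so the kernel is $\theta^s (1 - \theta)^{n - s}$; the function names, grid size, and the data (7 successes in 10 trials) are hypothetical.

```python
def grid_posterior(successes, n, grid_size=10_000):
    """Discretize the posterior f(theta | x^n) on midpoints of (0, 1)."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Kernel L_n(theta) * f(theta), with flat prior f(theta) = 1
    kernel = [t**successes * (1 - t) ** (n - successes) for t in thetas]
    c_n = sum(kernel)  # discrete stand-in for the normalizing constant
    weights = [k / c_n for k in kernel]
    return thetas, weights


def posterior_mean(thetas, weights):
    return sum(t * w for t, w in zip(thetas, weights))


def equal_tail_interval(thetas, weights, alpha=0.05):
    """Return (a, b) leaving roughly alpha/2 posterior mass in each tail."""
    cum, a, b = 0.0, None, None
    for t, w in zip(thetas, weights):
        cum += w
        if a is None and cum >= alpha / 2:
            a = t
        if b is None and cum >= 1 - alpha / 2:
            b = t
    return a, b


thetas, weights = grid_posterior(successes=7, n=10)
mean = posterior_mean(thetas, weights)
a, b = equal_tail_interval(thetas, weights)
print(mean, (a, b))
```

For this conjugate case the grid result can be compared against the exact Beta(8, 4) posterior, whose mean is 8/12.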
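14.3 Priors

Choice
• Subjective Bayesianism: prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
• Objective Bayesianism: prior should incorporate as little detail as possible (non-informative prior).
• Robust Bayesianism: consider various priors and determine sensitivity of our inferences to changes in the prior.

Types
• Flat: $f(\theta) \propto$ constant
• Proper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = 1$
• Improper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = \infty$
• Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$, and in the multiparameter case $f(\theta) \propto \sqrt{\det(I(\theta))}$
• Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood
Likelihood | Conjugate Prior | Posterior hyperparameters
Bernoulli($p$) | Beta($\alpha, \beta$) | $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n - \sum_{i=1}^{n} x_i$
Binomial($p$) | Beta($\alpha, \beta$) | $\alpha + \sum_{i=1}^{n} x_i$, $\beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$
Negative Binomial($p$) | Beta($\alpha, \beta$) | $\alpha + rn$, $\beta + \sum_{i=1}^{n} x_i$
Poisson($\lambda$) | Gamma($\alpha, \beta$) | $\alpha + \sum_{i=1}^{n} x_i$, $\beta + n$
Multinomial($p$) | Dirichlet($\alpha$) | $\alpha + \sum_{i=1}^{n} x^{(i)}$
Geometric($p$) | Beta($\alpha, \beta$) | $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$

Continuous likelihood (subscript c denotes constant)
Likelihood | Conjugate Prior | Posterior hyperparameters
Uniform($0, \theta$) | Pareto($x_m, k$) | $\max\{x_{(n)}, x_m\}$, $k + n$
Exponential($\lambda$) | Gamma($\alpha, \beta$) | $\alpha + n$, $\beta + \sum_{i=1}^{n} x_i$
Normal($\mu, \sigma_c^2$) | Normal($\mu_0, \sigma_0^2$) | $\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)$, $\left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}$
Normal($\mu_c, \sigma^2$) | Scaled Inverse Chi-square($\nu, \sigma_0^2$) | $\nu + n$, $\frac{\nu\sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu_c)^2}{\nu + n}$
Normal($\mu, \sigma^2$) | Normal-scaled Inverse Gamma($\lambda, \nu, \alpha, \beta$) | $\frac{\nu\lambda + n\bar{x}}{\nu + n}$, $\nu + n$, $\alpha + \frac{n}{2}$, $\beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{n\nu (\bar{x} - \lambda)^2}{2(n + \nu)}$
MVN($\mu, \Sigma_c$) | MVN($\mu_0, \Sigma_0$) | $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1} \left( \Sigma_0^{-1} \mu_0 + n\Sigma_c^{-1} \bar{x} \right)$, $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}$
MVN($\mu_c, \Sigma$) | Inverse-Wishart($\kappa, \Psi$) | $n + \kappa$, $\Psi + \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T$
Pareto($x_{m_c}, k$) | Gamma($\alpha, \beta$) | $\alpha + n$, $\beta + \sum_{i=1}^{n} \log \frac{x_i}{x_{m_c}}$
Pareto($x_m, k_c$) | Pareto($x_0, k_0$) | $x_0$, $k_0 - kn$ where $k_0 > kn$
Gamma($\alpha_c, \beta$) | Gamma($\alpha_0, \beta_0$) | $\alpha_0 + n\alpha_c$, $\beta_0 + \sum_{i=1}^{n} x_i$

14.4 Bayesian Testing
If $H_0 : \theta \in \Theta_0$:
Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\, d\theta$
Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\, d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$,
$P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\, P[H_k]}{\sum_{j=0}^{K-1} f(x^n \mid H_j)\, P[H_j]}$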
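Marginal Likelihood
$f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i) f(\theta \mid H_i)\, d\theta$

Posterior Odds (of $H_i$ relative to $H_j$)
$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \frac{f(x^n \mid H_i)}{f(x^n \mid H_j)} \times \frac{P[H_i]}{P[H_j]}$
where the first factor is the Bayes Factor $BF_{ij}$ and the second factor is the prior odds.

Bayes Factor
$\log_{10} BF_{10}$ | $BF_{10}$ | evidence
0 − 0.5 | 1 − 3.2 | Weak
0.5 − 1 | 3.2 − 10 | Moderate
1 − 2 | 10 − 100 | Strong
> 2 | > 100 | Decisive

$p^* = \frac{\frac{p}{1 - p} BF_{10}}{1 + \frac{p}{1 - p} BF_{10}}$ where $p = P[H_1]$ and $p^* = P[H_1 \mid x^n]$

15 Exponential Family

Scalar parameter
$f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}$

Vector parameter
$f_X(x \mid \theta) = h(x) \exp\left\{ \sum_{i=1}^{s} \eta_i(\theta) T_i(x) - A(\theta) \right\} = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$

Natural form
$f_X(x \mid \eta) = h(x) \exp\{\eta \cdot \mathbf{T}(x) - A(\eta)\} = h(x) g(\eta) \exp\{\eta \cdot \mathbf{T}(x)\} = h(x) g(\eta) \exp\{\eta^T \mathbf{T}(x)\}$

16 Sampling Methods

16.1 The Bootstrap
Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.
1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \ldots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$
       i. Sample uniformly $X_1^*, \ldots, X_n^* \sim \hat{F}_n$.
       ii. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^{B} \left( T^*_{n,b} - \frac{1}{B} \sum_{r=1}^{B} T^*_{n,r} \right)^2$

16.1.1 Bootstrap Confidence Intervals

Normal-based Interval
$T_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}_{boot}$

Pivotal Interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat{\theta}_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using bootstrap:
   $\hat{H}(r) = \frac{1}{B} \sum_{b=1}^{B} I(R^*_{n,b} \le r)$
5. Let $\theta^*_{\beta}$ denote the $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \ldots, \hat{\theta}^*_{n,B})$
6. Let $r^*_{\beta}$ denote the $\beta$ sample quantile of $(R^*_{n,1}, \ldots, R^*_{n,B})$, i.e., $r^*_{\beta} = \theta^*_{\beta} - \hat{\theta}_n$
7. Then, an approximate $1 - \alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$ with
   $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left( 1 - \frac{\alpha}{2} \right) = \hat{\theta}_n - r^*_{1 - \alpha/2} = 2\hat{\theta}_n - \theta^*_{1 - \alpha/2}$
   $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left( \frac{\alpha}{2} \right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$

Percentile Interval
$C_n = \left( \theta^*_{\alpha/2}, \theta^*_{1 - \alpha/2} \right)$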
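The bootstrap variance estimate and the three bootstrap confidence intervals above can be sketched in a few lines. The function names, the simple empirical-quantile rule, the seed, and the data are assumptions for illustration; other quantile conventions would give slightly different endpoints.

```python
import random
import statistics


def bootstrap(sample, stat, B=2000, seed=0):
    """Return B bootstrap replications T*_{n,1}, ..., T*_{n,B} of stat."""
    rng = random.Random(seed)
    n = len(sample)
    # Resampling with replacement is sampling iid from F_hat_n
    return [stat(rng.choices(sample, k=n)) for _ in range(B)]


def quantile(sorted_values, beta):
    # Simple empirical quantile (one of several common conventions)
    idx = min(int(beta * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def bootstrap_intervals(sample, stat, alpha=0.05, B=2000, seed=0):
    theta_hat = stat(sample)
    reps = sorted(bootstrap(sample, stat, B, seed))
    v_boot = statistics.pvariance(reps)  # (1/B) sum_b (T*_b - mean)^2
    se_boot = v_boot**0.5
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = quantile(reps, alpha / 2), quantile(reps, 1 - alpha / 2)
    return {
        "v_boot": v_boot,
        "normal": (theta_hat - z * se_boot, theta_hat + z * se_boot),
        "pivotal": (2 * theta_hat - hi, 2 * theta_hat - lo),
        "percentile": (lo, hi),
    }


data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.2, 4.0, 2.2, 1.5]  # hypothetical
result = bootstrap_intervals(data, statistics.fmean)
print(result)
```

For the sample mean, `v_boot` should be close to the plug-in variance divided by n, which gives a quick sanity check on the simulation.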
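16.2 Rejection Sampling

Setup
• We can easily sample from $g(\theta)$
• We want to sample from $h(\theta)$, but it is difficult
• We know $h(\theta)$ up to a proportionality constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\, d\theta}$
• Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)\ \forall\theta$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \mathrm{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
• We can easily sample from the prior $g(\theta) = f(\theta)$
• Target is the posterior with $h(\theta) \propto k(\theta) = f(x^n \mid \theta) f(\theta)$
• Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = L_n(\hat{\theta}_n) \equiv M$
• Algorithm
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \mathrm{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{L_n(\theta^{cand})}{L_n(\hat{\theta}_n)}$

16.3 Importance Sampling
Sample from an importance function $g$ rather than target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior $\theta_1, \ldots, \theta_B \stackrel{iid}{\sim} f(\theta)$
2. For each $i = 1, \ldots, B$, calculate $w_i = \frac{L_n(\theta_i)}{\sum_{j=1}^{B} L_n(\theta_j)}$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i) w_i$

17 Decision Theory

Definitions
• Unknown quantity affecting our decision: $\theta \in \Theta$
• Decision rule: synonymous for an estimator $\hat{\theta}$
• Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
• Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, or discrepancy between $\theta$ and $\hat{\theta}$, $L : \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions
• Squared error loss: $L(\theta, a) = (\theta - a)^2$
• Linear loss: $L(\theta, a) = K_1 (\theta - a)$ if $a - \theta < 0$, and $K_2 (a - \theta)$ if $a - \theta \ge 0$
• Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
• $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
• Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, and $1$ if $a \ne \theta$

17.1 Risk

Posterior Risk
$r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x)) f(\theta \mid x)\, d\theta = E_{\theta|X}\left[ L(\theta, \hat{\theta}(x)) \right]$

(Frequentist) Risk
$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x)) f(x \mid \theta)\, dx = E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right]$

Bayes Risk
$r(f, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x)) f(x, \theta)\, dx\, d\theta = E_{\theta, X}\left[ L(\theta, \hat{\theta}(X)) \right]$
$r(f, \hat{\theta}) = E_{\theta}\left[ E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{\theta}\left[ R(\theta, \hat{\theta}) \right]$
$r(f, \hat{\theta}) = E_{X}\left[ E_{\theta|X}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{X}\left[ r(\hat{\theta} \mid X) \right]$

17.2 Admissibility
• $\hat{\theta}'$ dominates $\hat{\theta}$ if
  $\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$
  $\exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$
• $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.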
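The rejection sampling example of Section 16.2 (prior as proposal, likelihood at the mle as envelope constant) can be sketched as follows. The sketch assumes Bernoulli data with a Unif(0, 1) prior; the function names, the seed, and the data (7 successes in 10 trials) are hypothetical.

```python
import random


def likelihood(theta, successes, n):
    # L_n(theta) for Bernoulli data, up to a constant factor
    return theta**successes * (1 - theta) ** (n - successes)


def rejection_sample_posterior(successes, n, B=2000, seed=0):
    """Draw B values from the posterior using the prior as proposal."""
    rng = random.Random(seed)
    theta_mle = successes / n
    M = likelihood(theta_mle, successes, n)  # envelope: L_n(theta) <= M
    accepted = []
    while len(accepted) < B:
        cand = rng.random()  # 1. draw theta_cand from the Unif(0, 1) prior
        u = rng.random()     # 2. generate u ~ Unif(0, 1)
        if u <= likelihood(cand, successes, n) / M:  # 3. accept/reject
            accepted.append(cand)
    return accepted


draws = rejection_sample_posterior(successes=7, n=10)
post_mean = sum(draws) / len(draws)
print(post_mean)
```

In this conjugate setting the accepted draws should behave like a Beta(8, 4) sample, so the sample mean can be compared with 8/12. The acceptance rate falls as the likelihood becomes more peaked relative to the prior.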
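17.3 Bayes Rule
Bayes Rule (or Bayes Estimator)
• $r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)$
• $\hat\theta(x) = \inf r(\hat\theta \mid x) \;\forall x \implies r(f, \hat\theta) = \int r(\hat\theta \mid x) f(x)\,dx$
Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum Risk
  $\bar R(\hat\theta) = \sup_\theta R(\theta, \hat\theta) \qquad \bar R(a) = \sup_\theta R(\theta, a)$
Minimax Rule
  $\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \bar R(\tilde\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta)$
  $\hat\theta = \text{Bayes rule} \;\wedge\; \exists c : R(\theta, \hat\theta) = c$
Least Favorable Prior
  $\hat\theta^f = \text{Bayes rule} \;\wedge\; R(\theta, \hat\theta^f) \le r(f, \hat\theta^f) \;\forall\theta$

18 Linear Regression
Definitions
• Response variable $Y$
• Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
  $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \qquad E[\epsilon_i \mid X_i] = 0,\; V[\epsilon_i \mid X_i] = \sigma^2$
Fitted Line
  $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$
Predicted (Fitted) Values
  $\hat Y_i = \hat r(X_i)$
Residuals
  $\hat\epsilon_i = Y_i - \hat Y_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)$
Residual Sums of Squares (rss)
  $\mathrm{rss}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n \hat\epsilon_i^2$
Least Square Estimates
  $\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T : \min_{\hat\beta_0, \hat\beta_1} \mathrm{rss}$
  $\hat\beta_0 = \bar Y_n - \hat\beta_1 \bar X_n$
  $\hat\beta_1 = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sum_{i=1}^n (X_i - \bar X_n)^2} = \dfrac{\sum_{i=1}^n X_i Y_i - n \bar X \bar Y}{\sum_{i=1}^n X_i^2 - n \bar X^2}$
  $E[\hat\beta \mid X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$
  $V[\hat\beta \mid X^n] = \dfrac{\sigma^2}{n s_X^2} \begin{pmatrix} n^{-1} \sum_{i=1}^n X_i^2 & -\bar X_n \\ -\bar X_n & 1 \end{pmatrix}$
  $\widehat{se}(\hat\beta_0) = \dfrac{\hat\sigma}{s_X \sqrt n} \sqrt{\dfrac{\sum_{i=1}^n X_i^2}{n}}$
  $\widehat{se}(\hat\beta_1) = \dfrac{\hat\sigma}{s_X \sqrt n}$
where $s_X^2 = n^{-1} \sum_{i=1}^n (X_i - \bar X_n)^2$ and $\hat\sigma^2 = \frac{1}{n-2} \sum_{i=1}^n \hat\epsilon_i^2$ is an (unbiased) estimate of $\sigma^2$. Further properties:
• Consistency: $\hat\beta_0 \overset{P}{\to} \beta_0$ and $\hat\beta_1 \overset{P}{\to} \beta_1$
• Asymptotic normality:
  $\dfrac{\hat\beta_0 - \beta_0}{\widehat{se}(\hat\beta_0)} \overset{D}{\to} N(0, 1)$ and $\dfrac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)} \overset{D}{\to} N(0, 1)$
• Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$ are
  $\hat\beta_0 \pm z_{\alpha/2}\, \widehat{se}(\hat\beta_0)$ and $\hat\beta_1 \pm z_{\alpha/2}\, \widehat{se}(\hat\beta_1)$
• The Wald test for testing $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$ is: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1 / \widehat{se}(\hat\beta_1)$.
$R^2$
  $R^2 = \dfrac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \dfrac{\sum_{i=1}^n \hat\epsilon_i^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \dfrac{\mathrm{rss}}{\mathrm{tss}}$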
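The simple linear regression formulas above (least squares estimates, plug-in standard errors, Wald statistic and R²) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `simple_linear_regression`, the dictionary keys and the toy data are assumptions made for the example, not part of the cheat sheet.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least squares fit of y = b0 + b1 * x with plug-in standard errors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()

    # Least squares estimates of slope and intercept.
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar

    # Residuals, rss, tss and the unbiased variance estimate rss / (n - 2).
    resid = y - (b0 + b1 * x)
    rss = np.sum(resid ** 2)
    tss = np.sum((y - y_bar) ** 2)
    sigma_hat = np.sqrt(rss / (n - 2))

    # s_X uses the 1/n convention, matching the definition in the text.
    s_x = np.sqrt(np.mean((x - x_bar) ** 2))
    se_b1 = sigma_hat / (s_x * np.sqrt(n))
    se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))

    return {
        "b0": b0, "b1": b1,
        "se_b0": se_b0, "se_b1": se_b1,
        "wald": b1 / se_b1,        # Wald statistic for H0: beta_1 = 0
        "r2": 1.0 - rss / tss,
    }

# Hypothetical toy data, roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
fit = simple_linear_regression(x, y)
print(fit)
```

Comparing `fit["wald"]` against $z_{\alpha/2}$ gives the Wald test described above; with only five points the normal approximation is rough, so the numbers are for illustration only.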
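Likelihood
  $L = \prod_{i=1}^n f(X_i, Y_i) = \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y\mid X}(Y_i \mid X_i) = L_1 \times L_2$
  $L_1 = \prod_{i=1}^n f_X(X_i)$
  $L_2 = \prod_{i=1}^n f_{Y\mid X}(Y_i \mid X_i) \propto \sigma^{-n} \exp\left\{ -\dfrac{1}{2\sigma^2} \sum_i \big(Y_i - (\beta_0 + \beta_1 X_i)\big)^2 \right\}$
Under the assumption of Normality, the least squares estimator is also the mle
  $\hat\sigma^2 = \dfrac{1}{n} \sum_{i=1}^n \hat\epsilon_i^2$

18.2 Prediction
Observe $X = x_*$ of the covariate and want to predict the outcome $Y_*$.
  $\hat Y_* = \hat\beta_0 + \hat\beta_1 x_*$
  $V[\hat Y_*] = V[\hat\beta_0] + x_*^2 V[\hat\beta_1] + 2 x_* \mathrm{Cov}[\hat\beta_0, \hat\beta_1]$
Prediction Interval
  $\hat\xi_n^2 = \hat\sigma^2 \left( \dfrac{\sum_{i=1}^n (X_i - X_*)^2}{n \sum_i (X_i - \bar X)^2} + 1 \right)$
  $\hat Y_* \pm z_{\alpha/2}\, \hat\xi_n$

18.3 Multiple Regression
  $Y = X\beta + \epsilon$
where
  $X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix} \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$
Likelihood
  $L(\mu, \Sigma) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\dfrac{1}{2\sigma^2}\, \mathrm{rss} \right\}$
  $\mathrm{rss} = (Y - X\beta)^T (Y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^n (Y_i - x_i^T \beta)^2$
If the $(k \times k)$ matrix $X^T X$ is invertible,
  $\hat\beta = (X^T X)^{-1} X^T Y$
  $V[\hat\beta \mid X^n] = \sigma^2 (X^T X)^{-1}$
  $\hat\beta \approx N\big(\beta, \sigma^2 (X^T X)^{-1}\big)$
Estimate regression function
  $\hat r(x) = \sum_{j=1}^k \hat\beta_j x_j$
Unbiased estimate for $\sigma^2$
  $\hat\sigma^2 = \dfrac{1}{n-k} \sum_{i=1}^n \hat\epsilon_i^2 \qquad \hat\epsilon = Y - X\hat\beta$
mle
  $\hat\mu = \bar X \qquad \hat\sigma^2_{\text{mle}} = \dfrac{n-k}{n}\, \hat\sigma^2$ (with $\hat\sigma^2$ the unbiased estimate above)
$1 - \alpha$ Confidence Interval
  $\hat\beta_j \pm z_{\alpha/2}\, \widehat{se}(\hat\beta_j)$

18.4 Model Selection
Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subset J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.
Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance
Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score
Hypothesis Testing
  $H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0 \quad \forall j \in J$
Mean Squared Prediction Error (mspe)
  $\mathrm{mspe} = E\big[(\hat Y(S) - Y^*)^2\big]$
Prediction Risk
  $R(S) = \sum_{i=1}^n \mathrm{mspe}_i = \sum_{i=1}^n E\big[(\hat Y_i(S) - Y_i^*)^2\big]$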
Training Error
  $\hat R_{tr}(S) = \sum_{i=1}^n (\hat Y_i(S) - Y_i)^2$
$R^2$
  $R^2(S) = 1 - \dfrac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \dfrac{\hat R_{tr}(S)}{\mathrm{tss}} = 1 - \dfrac{\sum_{i=1}^n (\hat Y_i(S) - Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}$
The training error is a downward-biased estimate of the prediction risk.
  $E[\hat R_{tr}(S)] < R(S)$
  $\mathrm{bias}(\hat R_{tr}(S)) = E[\hat R_{tr}(S)] - R(S) = -2 \sum_{i=1}^n \mathrm{Cov}[\hat Y_i, Y_i]$
Adjusted $R^2$
  $\bar R^2(S) = 1 - \dfrac{n-1}{n-k} \dfrac{\mathrm{rss}}{\mathrm{tss}}$
Mallow's $C_p$ statistic
  $\hat R(S) = \hat R_{tr}(S) + 2k\hat\sigma^2$ = lack of fit + complexity penalty
Akaike Information Criterion (AIC)
  $\mathrm{AIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k$
Bayesian Information Criterion (BIC)
  $\mathrm{BIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \dfrac{k}{2} \log n$
Validation and Training
  $\hat R_V(S) = \sum_{i=1}^m (\hat Y_i^*(S) - Y_i^*)^2 \qquad m = |\{\text{validation data}\}|$, often $\dfrac{n}{4}$ or $\dfrac{n}{2}$
Leave-one-out Cross-validation
  $\hat R_{CV}(S) = \sum_{i=1}^n (Y_i - \hat Y_{(i)})^2 = \sum_{i=1}^n \left( \dfrac{Y_i - \hat Y_i(S)}{1 - U_{ii}(S)} \right)^2$
  $U(S) = X_S (X_S^T X_S)^{-1} X_S^T$ ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation
Estimate $f(x)$, where $P[X \in A] = \int_A f(x)\,dx$.
Integrated Square Error (ise)
  $L(f, \hat f_n) = \int \big(f(x) - \hat f_n(x)\big)^2 dx = J(h) + \int f^2(x)\,dx$
Frequentist Risk
  $R(f, \hat f_n) = E[L(f, \hat f_n)] = \int b^2(x)\,dx + \int v(x)\,dx$
  $b(x) = E[\hat f_n(x)] - f(x)$
  $v(x) = V[\hat f_n(x)]$

19.1.1 Histograms
Definitions
• Number of bins $m$
• Binwidth $h = \frac{1}{m}$
• Bin $B_j$ has $\nu_j$ observations
• Define $\hat p_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\,du$
Histogram Estimator
  $\hat f_n(x) = \sum_{j=1}^m \dfrac{\hat p_j}{h} I(x \in B_j)$
  $E[\hat f_n(x)] = \dfrac{p_j}{h}$
  $V[\hat f_n(x)] = \dfrac{p_j (1 - p_j)}{n h^2}$
  $R(\hat f_n, f) \approx \dfrac{h^2}{12} \int (f'(u))^2 du + \dfrac{1}{nh}$
  $h^* = \dfrac{1}{n^{1/3}} \left( \dfrac{6}{\int (f'(u))^2 du} \right)^{1/3}$
  $R^*(\hat f_n, f) \approx \dfrac{C}{n^{2/3}} \qquad C = \left( \dfrac{3}{4} \right)^{2/3} \left( \int (f'(u))^2 du \right)^{1/3}$
Cross-validation estimate of $E[J(h)]$
  $\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \dfrac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i) = \dfrac{2}{(n-1)h} - \dfrac{n+1}{(n-1)h} \sum_{j=1}^m \hat p_j^2$
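The histogram estimator and its cross-validation score lend themselves to a short numerical sketch. The code below assumes, as the definition $h = 1/m$ implies, that the data lie in $[0, 1]$; the function names `histogram_density` and `histogram_cv_score` and the four-point sample are hypothetical and chosen only for illustration.

```python
import numpy as np

def histogram_density(data, m, x):
    """Evaluate the m-bin histogram estimator on [0, 1] at the points x."""
    data = np.asarray(data, dtype=float)
    h = 1.0 / m
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))
    p_hat = counts / data.size
    # Bin index of each evaluation point; x = 1.0 is assigned to the last bin.
    j = np.minimum((np.asarray(x, dtype=float) / h).astype(int), m - 1)
    return p_hat[j] / h

def histogram_cv_score(data, m):
    """Closed-form cross-validation score J_CV(h) with h = 1 / m."""
    data = np.asarray(data, dtype=float)
    n = data.size
    h = 1.0 / m
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))
    p_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

# Hypothetical sample in [0, 1]; two observations fall in each of two bins.
sample = np.array([0.1, 0.2, 0.6, 0.9])
density = histogram_density(sample, 2, [0.25, 0.75])
score = histogram_cv_score(sample, 2)
print(density, score)
```

In practice one would evaluate `histogram_cv_score` over a grid of bin counts $m$ and keep the minimizer, which is how the score is meant to be used for choosing $h$.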
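19.1.2 Kernel Density Estimator (KDE)
Kernel $K$
• $K(x) \ge 0$
• $\int K(x)\,dx = 1$
• $\int x K(x)\,dx = 0$
• $\int x^2 K(x)\,dx \equiv \sigma_K^2 > 0$
KDE
  $\hat f_n(x) = \dfrac{1}{n} \sum_{i=1}^n \dfrac{1}{h} K\left( \dfrac{x - X_i}{h} \right)$
  $R(f, \hat f_n) \approx \dfrac{1}{4} (h\sigma_K)^4 \int (f''(x))^2 dx + \dfrac{1}{nh} \int K^2(x)\,dx$
  $h^* = \dfrac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}} \qquad c_1 = \sigma_K^2,\; c_2 = \int K^2(x)\,dx,\; c_3 = \int (f''(x))^2 dx$
  $R^*(f, \hat f_n) = \dfrac{c_4}{n^{4/5}} \qquad c_4 = \underbrace{\dfrac{5}{4} (\sigma_K^2)^{2/5} \left( \int K^2(x)\,dx \right)^{4/5}}_{C(K)} \left( \int (f'')^2 dx \right)^{1/5}$
Epanechnikov Kernel
  $K(x) = \begin{cases} \dfrac{3}{4\sqrt 5} \left( 1 - x^2/5 \right) & |x| < \sqrt 5 \\ 0 & \text{otherwise} \end{cases}$
Cross-validation estimate of $E[J(h)]$
  $\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \dfrac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i) \approx \dfrac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*\left( \dfrac{X_i - X_j}{h} \right) + \dfrac{2}{nh} K(0)$
  $K^*(x) = K^{(2)}(x) - 2K(x) \qquad K^{(2)}(x) = \int K(x - y) K(y)\,dy$

19.2 Non-parametric Regression
Estimate $r(x)$, where $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \dots, (x_n, Y_n)$ related by
  $Y_i = r(x_i) + \epsilon_i$
  $E[\epsilon_i] = 0$
  $V[\epsilon_i] = \sigma^2$
k-nearest Neighbor Estimator
  $\hat r(x) = \dfrac{1}{k} \sum_{i : x_i \in N_k(x)} Y_i$ where $N_k(x) = \{k \text{ values of } x_1, \dots, x_n \text{ closest to } x\}$
Nadaraya-Watson Kernel Estimator
  $\hat r(x) = \sum_{i=1}^n w_i(x) Y_i$
  $w_i(x) = \dfrac{K\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^n K\left( \frac{x - x_j}{h} \right)} \in [0, 1]$
  $R(\hat r_n, r) \approx \dfrac{h^4}{4} \left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \dfrac{f'(x)}{f(x)} \right)^2 dx + \int \dfrac{\sigma^2 \int K^2(x)\,dx}{n h f(x)}\,dx$
  $h^* \approx \dfrac{c_1}{n^{1/5}}$
  $R^*(\hat r_n, r) \approx \dfrac{c_2}{n^{4/5}}$
Cross-validation estimate of $E[J(h)]$
  $\hat J_{CV}(h) = \sum_{i=1}^n (Y_i - \hat r_{(-i)}(x_i))^2 = \sum_{i=1}^n \dfrac{(Y_i - \hat r(x_i))^2}{\left( 1 - \dfrac{K(0)}{\sum_{j=1}^n K\left( \frac{x_i - x_j}{h} \right)} \right)^2}$

19.3 Smoothing Using Orthogonal Functions
Approximation
  $r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x) \approx \sum_{j=1}^J \beta_j \phi_j(x)$
Multivariate Regression
  $Y = \Phi\beta + \eta$
  where $\eta_i = \epsilon_i$ and $\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}$
Least Squares Estimator
  $\hat\beta = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \dfrac{1}{n} \Phi^T Y$ (for equally spaced observations only)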
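To make the least squares estimator for orthogonal-function smoothing concrete, the sketch below builds the design matrix $\Phi$ and compares the exact solution with the shortcut $\frac{1}{n}\Phi^T Y$. The cheat sheet leaves the basis $\phi_j$ generic; the cosine basis $\phi_0(x) = 1$, $\phi_j(x) = \sqrt 2 \cos(j\pi x)$ on $[0, 1]$, the midpoint grid, the helper name `cosine_design` and the coefficient values are all assumptions made for this example.

```python
import numpy as np

def cosine_design(x, J):
    """Design matrix with columns phi_0, ..., phi_J of the cosine basis on [0, 1]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    cols += [np.sqrt(2.0) * np.cos(j * np.pi * x) for j in range(1, J + 1)]
    return np.column_stack(cols)

n, J = 200, 3
x = (np.arange(n) + 0.5) / n                   # equally spaced midpoints of [0, 1]
beta_true = np.array([1.0, 0.5, 0.0, -0.25])   # hypothetical coefficients

Phi = cosine_design(x, J)
y = Phi @ beta_true                            # noiseless responses, for clarity

# Exact least squares solution of Y = Phi beta.
beta_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Shortcut that relies on Phi^T Phi = n I for this equally spaced grid.
beta_approx = Phi.T @ y / n

print(beta_ls, beta_approx)
```

On this particular grid the basis columns are exactly orthogonal, so both estimates coincide; with noisy or unevenly spaced observations only the `lstsq` solution is the least squares estimator and the shortcut becomes an approximation.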
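Cross-validation estimate of $E[J(h)]$
  $\hat R_{CV}(J) = \sum_{i=1}^n \left( Y_i - \sum_{j=1}^J \phi_j(x_i) \hat\beta_{j,(-i)} \right)^2$

20 Stochastic Processes
Stochastic Process
  $\{X_t : t \in T\} \qquad T = \begin{cases} \{0, \pm 1, \dots\} = \mathbb{Z} & \text{discrete} \\ [0, \infty) & \text{continuous} \end{cases}$
• Notations: $X_t$, $X(t)$
• State space $\mathcal{X}$
• Index set $T$

20.1 Markov Chains
Markov Chain
  $P[X_n = x \mid X_0, \dots, X_{n-1}] = P[X_n = x \mid X_{n-1}] \quad \forall n \in T, x \in \mathcal{X}$
Transition probabilities
  $p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$
  $p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]$ (n-step)
Transition matrix $\mathbf{P}$ (n-step: $\mathbf{P}_n$)
• $(i, j)$ element is $p_{ij}$
• $p_{ij} \ge 0$
• $\sum_j p_{ij} = 1$
Chapman-Kolmogorov
  $p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$
  $\mathbf{P}_{m+n} = \mathbf{P}_m \mathbf{P}_n$
  $\mathbf{P}_n = \mathbf{P} \times \cdots \times \mathbf{P} = \mathbf{P}^n$
Marginal probability
  $\mu_n = (\mu_n(1), \dots, \mu_n(N))$ where $\mu_n(i) = P[X_n = i]$
  $\mu_0$ initial distribution
  $\mu_n = \mu_0 \mathbf{P}^n$

20.2 Poisson Processes
Poisson Process
• $\{X_t : t \in [0, \infty)\}$: number of events up to and including time $t$
• $X_0 = 0$
• Independent increments:
  $\forall t_0 < \cdots < t_n : X_{t_1} - X_{t_0} \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_{t_n} - X_{t_{n-1}}$
• Intensity function $\lambda(t)$
  - $P[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)$
  - $P[X_{t+h} - X_t = 2] = o(h)$
• $X_{s+t} - X_s \sim \mathrm{Po}(m(s + t) - m(s))$ where $m(t) = \int_0^t \lambda(s)\,ds$
Homogeneous Poisson Process
  $\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t) \qquad \lambda > 0$
Waiting Times
  $W_t :=$ time at which $X_t$ occurs
  $W_t \sim \mathrm{Gamma}\left( t, \dfrac{1}{\lambda} \right)$
Interarrival Times
  $S_t = W_{t+1} - W_t$
  $S_t \sim \mathrm{Exp}\left( \dfrac{1}{\lambda} \right)$
[Figure: time axis $t$ showing the interarrival time $S_t$ between the waiting times $W_{t-1}$ and $W_t$]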
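The relationship between exponential interarrival times and Poisson counts can be checked by simulation. The sketch below assumes that $\mathrm{Exp}(1/\lambda)$ is parameterized by its mean $1/\lambda$ (NumPy's `scale` argument); the function name `simulate_poisson_process` and the chosen rate, horizon and number of replications are illustrative assumptions.

```python
import numpy as np

def simulate_poisson_process(rate, horizon, rng):
    """Arrival times of a homogeneous Poisson process on [0, horizon]."""
    arrivals = []
    # Interarrival times are exponential with mean 1 / rate.
    t = rng.exponential(1.0 / rate)
    while t <= horizon:
        arrivals.append(t)
        t += rng.exponential(1.0 / rate)
    return np.array(arrivals)

rng = np.random.default_rng(0)
rate, horizon = 2.0, 5.0

# Number of events in [0, horizon] over many independent replications.
counts = np.array([
    simulate_poisson_process(rate, horizon, rng).size
    for _ in range(5000)
])

# For X_t ~ Po(rate * t), mean and variance should both be close to rate * horizon.
print(counts.mean(), counts.var())
```

Mean and variance of the simulated counts should both be near $\lambda t = 10$, which is the signature of the Poisson distribution.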
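21 Time Series
Mean function
  $\mu_{x_t} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$
Autocovariance function
  $\gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s \mu_t$
  $\gamma_x(t, t) = E[(x_t - \mu_t)^2] = V[x_t]$
Autocorrelation function (ACF)
  $\rho(s, t) = \dfrac{\mathrm{Cov}[x_s, x_t]}{\sqrt{V[x_s] V[x_t]}} = \dfrac{\gamma(s, t)}{\sqrt{\gamma(s, s) \gamma(t, t)}}$
Cross-covariance function (CCV)
  $\gamma_{xy}(s, t) = E[(x_s - \mu_{x_s})(y_t - \mu_{y_t})]$
Cross-correlation function (CCF)
  $\rho_{xy}(s, t) = \dfrac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s) \gamma_y(t, t)}}$
Backshift operator
  $B^k(x_t) = x_{t-k}$
Difference operator
  $\nabla^d = (1 - B)^d$
White Noise
• $w_t \sim \mathrm{wn}(0, \sigma_w^2)$
• Gaussian: $w_t \overset{iid}{\sim} N(0, \sigma_w^2)$
• $E[w_t] = 0 \quad t \in T$
• $V[w_t] = \sigma_w^2 \quad t \in T$
• $\gamma_w(s, t) = 0 \quad s \ne t \wedge s, t \in T$
Random Walk
• Drift $\delta$
• $x_t = \delta t + \sum_{j=1}^t w_j$
• $E[x_t] = \delta t$
Symmetric Moving Average
  $m_t = \sum_{j=-k}^{k} a_j x_{t-j}$ where $a_j = a_{-j} \ge 0$ and $\sum_{j=-k}^{k} a_j = 1$

21.1 Stationary Time Series
Strictly stationary
  $P[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = P[x_{t_1 + h} \le c_1, \dots, x_{t_k + h} \le c_k] \quad \forall k \in \mathbb{N},\; t_k, c_k, h \in \mathbb{Z}$
Weakly stationary
• $E[x_t^2] < \infty \quad \forall t \in \mathbb{Z}$
• $E[x_t] = m \quad \forall t \in \mathbb{Z}$
• $\gamma_x(s, t) = \gamma_x(s + r, t + r) \quad \forall r, s, t \in \mathbb{Z}$
Autocovariance function
• $\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)] \quad \forall h \in \mathbb{Z}$
• $\gamma(0) = E[(x_t - \mu)^2]$
• $\gamma(0) \ge 0$
• $\gamma(0) \ge |\gamma(h)|$
• $\gamma(h) = \gamma(-h)$
Autocorrelation function (ACF)
  $\rho_x(h) = \dfrac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}] V[x_t]}} = \dfrac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h) \gamma(t, t)}} = \dfrac{\gamma(h)}{\gamma(0)}$
Jointly stationary time series
  $\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]$
  $\rho_{xy}(h) = \dfrac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0) \gamma_y(0)}}$
Linear Process
  $x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j}$ where $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$
  $\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j$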
21.2 Estimation of Correlation
Sample mean
  $\bar x = \dfrac{1}{n} \sum_{t=1}^n x_t$
Sample variance
  $V[\bar x] = \dfrac{1}{n} \sum_{h=-n}^{n} \left( 1 - \dfrac{|h|}{n} \right) \gamma_x(h)$
Sample autocovariance function
  $\hat\gamma(h) = \dfrac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(x_t - \bar x)$
Sample autocorrelation function
  $\hat\rho(h) = \dfrac{\hat\gamma(h)}{\hat\gamma(0)}$
Sample cross-covariance function
  $\hat\gamma_{xy}(h) = \dfrac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(y_t - \bar y)$
Sample cross-correlation function
  $\hat\rho_{xy}(h) = \dfrac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0) \hat\gamma_y(0)}}$
Properties
• $\sigma_{\hat\rho_x(h)} = \dfrac{1}{\sqrt n}$ if $x_t$ is white noise
• $\sigma_{\hat\rho_{xy}(h)} = \dfrac{1}{\sqrt n}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series
Classical decomposition model
  $x_t = \mu_t + s_t + w_t$
• $\mu_t$ = trend
• $s_t$ = seasonal component
• $w_t$ = random noise term

21.3.1 Detrending
Least Squares
1. Choose trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
2. Minimize rss to obtain trend estimate $\hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2$
3. Residuals $\triangleq$ noise $w_t$
Moving average
• The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
  $v_t = \dfrac{1}{2k + 1} \sum_{i=-k}^{k} x_{t-i}$
• If $\frac{1}{2k+1} \sum_{j=-k}^{k} w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion
Differencing
• $\mu_t = \beta_0 + \beta_1 t \implies \nabla \mu_t = \beta_1$

21.4 ARIMA models
Autoregressive polynomial
  $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C} \wedge \phi_p \ne 0$
Autoregressive operator
  $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$
Autoregressive model order p, AR(p)
  $x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$
AR(1)
• $x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1} \phi^j w_{t-j} \overset{k \to \infty,\, |\phi| < 1}{=} \sum_{j=0}^{\infty} \phi^j w_{t-j}$ (a linear process)
• $E[x_t] = \sum_{j=0}^{\infty} \phi^j E[w_{t-j}] = 0$
• $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \dfrac{\sigma_w^2 \phi^h}{1 - \phi^2}$
• $\rho(h) = \dfrac{\gamma(h)}{\gamma(0)} = \phi^h$
• $\rho(h) = \phi \rho(h - 1) \quad h = 1, 2, \dots$
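The AR(1) result $\rho(h) = \phi^h$ can be compared against the sample autocorrelation function from 21.2 on simulated data. The following is a minimal sketch: the helper names `sample_acf` and `simulate_ar1`, the Gaussian white noise, the burn-in length and the value $\phi = 0.6$ are assumptions made for the illustration.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation rho_hat(h) for h = 0, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    # Sample autocovariance with the 1/n normalization used in the text.
    gamma = np.array([np.sum(d[h:] * d[:n - h]) / n for h in range(max_lag + 1)])
    return gamma / gamma[0]

def simulate_ar1(phi, n, rng, burn_in=500):
    """Simulate x_t = phi * x_{t-1} + w_t with standard normal white noise."""
    w = rng.normal(size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        x[t] = phi * x[t - 1] + w[t]
    # Drop the burn-in so the retained series is close to stationary.
    return x[burn_in:]

rng = np.random.default_rng(1)
phi = 0.6
x = simulate_ar1(phi, 20000, rng)

rho_hat = sample_acf(x, 3)
rho_theory = phi ** np.arange(4)   # rho(h) = phi^h
print(rho_hat, rho_theory)
```

With a series this long the sample values should track $\phi^h$ closely; for short series the $1/\sqrt n$ standard error quoted above gives a rough sense of the expected deviation.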
Moving average polynomial
  $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C} \wedge \theta_q \ne 0$
Moving average operator
  $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$
MA(q) (moving average model order q)
  $x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$
  $E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0$
  $\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h} & 0 \le h \le q \\ 0 & h > q \end{cases}$
MA(1)
  $x_t = w_t + \theta w_{t-1}$
  $\gamma(h) = \begin{cases} (1 + \theta^2) \sigma_w^2 & h = 0 \\ \theta \sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases}$
  $\rho(h) = \begin{cases} \dfrac{\theta}{1 + \theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$
ARMA(p, q)
  $x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}$
  $\phi(B) x_t = \theta(B) w_t$
Partial autocorrelation function (PACF)
• $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
• $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1}) \quad h \ge 2$
• E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$
ARIMA(p, d, q)
  $\nabla^d x_t = (1 - B)^d x_t$ is ARMA(p, q)
  $\phi(B)(1 - B)^d x_t = \theta(B) w_t$
Exponentially Weighted Moving Average (EWMA)
  $x_t = x_{t-1} + w_t - \lambda w_{t-1}$
  $x_t = \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} x_{t-j} + w_t$ when $|\lambda| < 1$
  $\tilde x_{n+1} = (1 - \lambda) x_n + \lambda \tilde x_n$
Seasonal ARIMA
• Denoted by ARIMA$(p, d, q) \times (P, D, Q)_s$
• $\Phi_P(B^s) \phi(B) \nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s) \theta(B) w_t$

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) $\iff \exists \{\psi_j\} : \sum_{j=0}^{\infty} |\psi_j| < \infty$ such that
  $x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t$
ARMA(p, q) is invertible $\iff \exists \{\pi_j\} : \sum_{j=0}^{\infty} |\pi_j| < \infty$ such that
  $\pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t$
Properties
• ARMA(p, q) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
  $\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \dfrac{\theta(z)}{\phi(z)} \qquad |z| \le 1$
• ARMA(p, q) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
  $\pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \dfrac{\phi(z)}{\theta(z)} \qquad |z| \le 1$
Behavior of the ACF and PACF for causal and invertible ARMA models
        AR(p)                  MA(q)                  ARMA(p, q)
  ACF   tails off              cuts off after lag q   tails off
  PACF  cuts off after lag p   tails off              tails off

21.5 Spectral Analysis
Periodic process
  $x_t = A \cos(2\pi\omega t + \phi) = U_1 \cos(2\pi\omega t) + U_2 \sin(2\pi\omega t)$
• Frequency index $\omega$ (cycles per unit time), period $1/\omega$
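• Amplitude $A$
• Phase $\phi$
• $U_1 = A \cos\phi$ and $U_2 = -A \sin\phi$, often normally distributed rv's
Periodic mixture
  $x_t = \sum_{k=1}^{q} \big( U_{k1} \cos(2\pi\omega_k t) + U_{k2} \sin(2\pi\omega_k t) \big)$
• $U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rv's with variances $\sigma_k^2$
• $\gamma(h) = \sum_{k=1}^{q} \sigma_k^2 \cos(2\pi\omega_k h)$
• $\gamma(0) = E[x_t^2] = \sum_{k=1}^{q} \sigma_k^2$
Spectral representation of a periodic process
  $\gamma(h) = \sigma^2 \cos(2\pi\omega_0 h) = \dfrac{\sigma^2}{2} e^{-2\pi i \omega_0 h} + \dfrac{\sigma^2}{2} e^{2\pi i \omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i \omega h}\,dF(\omega)$
Spectral distribution function
  $F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0 \end{cases}$
• $F(-\infty) = F(-1/2) = 0$
• $F(\infty) = F(1/2) = \gamma(0)$
Spectral density
  $f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i \omega h} \qquad -\dfrac{1}{2} \le \omega \le \dfrac{1}{2}$
• Needs $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i \omega h} f(\omega)\,d\omega \quad h = 0, \pm 1, \dots$
• $f(\omega) \ge 0$
• $f(\omega) = f(-\omega)$
• $f(\omega) = f(1 - \omega)$
• $\gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega)\,d\omega$
• White noise: $f_w(\omega) = \sigma_w^2$
• ARMA(p, q), $\phi(B) x_t = \theta(B) w_t$:
  $f_x(\omega) = \sigma_w^2 \dfrac{|\theta(e^{-2\pi i \omega})|^2}{|\phi(e^{-2\pi i \omega})|^2}$
  where $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$ and $\theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k$
Discrete Fourier Transform (DFT)
  $d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t}$
Fourier/Fundamental frequencies
  $\omega_j = j/n$
Inverse DFT
  $x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i \omega_j t}$
Periodogram
  $I(j/n) = |d(j/n)|^2$
Scaled Periodogram
  $P(j/n) = \dfrac{4}{n} I(j/n) = \left( \dfrac{2}{n} \sum_{t=1}^{n} x_t \cos(2\pi t j/n) \right)^2 + \left( \dfrac{2}{n} \sum_{t=1}^{n} x_t \sin(2\pi t j/n) \right)^2$

22 Math

22.1 Gamma Function
• Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\,dt$
• Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t}\,dt$
• Lower incomplete: $\gamma(s, x) = \int_0^{x} t^{s-1} e^{-t}\,dt$
• $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha) \quad \alpha > 1$
• $\Gamma(n) = (n - 1)! \quad n \in \mathbb{N}$
• $\Gamma(1/2) = \sqrt\pi$

22.2 Beta Function
• Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1} (1 - t)^{y-1}\,dt = \dfrac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}$
• Incomplete: $B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1}\,dt$
• Regularized incomplete:
  $I_x(a, b) = \dfrac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \dfrac{(a + b - 1)!}{j!\,(a + b - 1 - j)!} x^j (1 - x)^{a+b-1-j}$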
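• $I_0(a, b) = 0 \qquad I_1(a, b) = 1$
• $I_x(a, b) = 1 - I_{1-x}(b, a)$

22.3 Series
Finite
• $\sum_{k=1}^{n} k = \dfrac{n(n + 1)}{2}$
• $\sum_{k=1}^{n} (2k - 1) = n^2$
• $\sum_{k=1}^{n} k^2 = \dfrac{n(n + 1)(2n + 1)}{6}$
• $\sum_{k=1}^{n} k^3 = \left( \dfrac{n(n + 1)}{2} \right)^2$
• $\sum_{k=0}^{n} c^k = \dfrac{c^{n+1} - 1}{c - 1} \quad c \ne 1$
Binomial
• $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
• $\sum_{k=0}^{n} \binom{r + k}{k} = \binom{r + n + 1}{n}$
• $\sum_{k=0}^{n} \binom{k}{m} = \binom{n + 1}{m + 1}$
• Vandermonde's Identity: $\sum_{k=0}^{r} \binom{m}{k} \binom{n}{r - k} = \binom{m + n}{r}$
• Binomial Theorem: $\sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = (a + b)^n$
Infinite
• $\sum_{k=0}^{\infty} p^k = \dfrac{1}{1 - p}, \quad \sum_{k=1}^{\infty} p^k = \dfrac{p}{1 - p} \quad |p| < 1$
• $\sum_{k=0}^{\infty} k p^{k-1} = \dfrac{d}{dp} \left( \sum_{k=0}^{\infty} p^k \right) = \dfrac{d}{dp} \left( \dfrac{1}{1 - p} \right) = \dfrac{1}{(1 - p)^2} \quad |p| < 1$
• $\sum_{k=0}^{\infty} \binom{r + k - 1}{k} x^k = (1 - x)^{-r} \quad r \in \mathbb{N}^+$
• $\sum_{k=0}^{\infty} \binom{\alpha}{k} p^k = (1 + p)^{\alpha} \quad |p| < 1,\; \alpha \in \mathbb{C}$

22.4 Combinatorics
Sampling
  k out of n | w/o replacement | w/ replacement
  ordered    | $n^{\underline{k}} = \prod_{i=0}^{k-1} (n - i) = \dfrac{n!}{(n - k)!}$ | $n^k$
  unordered  | $\binom{n}{k} = \dfrac{n^{\underline{k}}}{k!} = \dfrac{n!}{k!\,(n - k)!}$ | $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$
Stirling numbers, 2nd kind
  $\left\{{n \atop k}\right\} = k \left\{{n - 1 \atop k}\right\} + \left\{{n - 1 \atop k - 1}\right\} \quad 1 \le k \le n \qquad \left\{{n \atop 0}\right\} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$
Partitions
  $P_{n+k,k} = \sum_{i=1}^{k} P_{n,i} \qquad k > n : P_{n,k} = 0 \qquad n \ge 1 : P_{n,0} = 0,\; P_{0,0} = 1$
Balls and Urns
  $f : B \to U$, $|B| = n$, $|U| = m$; D = distinguishable, ¬D = indistinguishable.
                 | f arbitrary | f injective | f surjective | f bijective
  B: D, U: D     | $m^n$ | $m^{\underline{n}}$ if $m \ge n$, else 0 | $m! \left\{{n \atop m}\right\}$ | $n!$ if $m = n$, else 0
  B: ¬D, U: D    | $\binom{m + n - 1}{n}$ | $\binom{m}{n}$ | $\binom{n - 1}{m - 1}$ | 1 if $m = n$, else 0
  B: D, U: ¬D    | $\sum_{k=1}^{m} \left\{{n \atop k}\right\}$ | 1 if $m \ge n$, else 0 | $\left\{{n \atop m}\right\}$ | 1 if $m = n$, else 0
  B: ¬D, U: ¬D   | $\sum_{k=1}^{m} P_{n,k}$ | 1 if $m \ge n$, else 0 | $P_{n,m}$ | 1 if $m = n$, else 0

References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]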
