
Probability and Statistics

Cheat Sheet

Copyright © Matthias Vallentin, 2010
vallentin@icir.org

February 27, 2011


This cheat sheet integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://bit.ly/probstat. To reproduce, please contact me.

Contents

1 Distribution Overview
   1.1 Discrete Distributions
   1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
   3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
   9.1 Standard Bivariate Normal
   9.2 Bivariate Normal
   9.3 Multivariate Normal
10 Convergence
   10.1 Law of Large Numbers (LLN)
   10.2 Central Limit Theorem (CLT)
11 Statistical Inference
   11.1 Point Estimation
   11.2 Normal-based Confidence Interval
   11.3 Empirical Distribution Function
   11.4 Statistical Functionals
12 Parametric Inference
   12.1 Method of Moments
   12.2 Maximum Likelihood
      12.2.1 Delta Method
   12.3 Multiparameter Models
      12.3.1 Multiparameter Delta Method
   12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
   14.1 Credible Intervals
   14.2 Function of Parameters
   14.3 Priors
      14.3.1 Conjugate Priors
   14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
   16.1 The Bootstrap
      16.1.1 Bootstrap Confidence Intervals
   16.2 Rejection Sampling
   16.3 Importance Sampling
17 Decision Theory
   17.1 Risk
   17.2 Admissibility
   17.3 Bayes Rule
   17.4 Minimax Rules
18 Linear Regression
   18.1 Simple Linear Regression
   18.2 Prediction
   18.3 Multiple Regression
   18.4 Model Selection
19 Non-parametric Function Estimation
   19.1 Density Estimation
      19.1.1 Histograms
      19.1.2 Kernel Density Estimator (KDE)
   19.2 Non-parametric Regression
   19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
   20.1 Markov Chains
   20.2 Poisson Processes
21 Time Series
   21.1 Stationary Time Series
   21.2 Estimation of Correlation
   21.3 Non-Stationary Time Series
      21.3.1 Detrending
   21.4 ARIMA models
      21.4.1 Causality and Invertibility
   21.5 Spectral Analysis
22 Math
   22.1 Series
   22.2 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
Notation: F_X(x) = CDF, f_X(x) = PMF, E[X] = mean, V[X] = variance, M_X(s) = MGF.

Uniform{a, …, b}
• F_X(x) = 0 for x < a; (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b; 1 for x > b
• f_X(x) = I(a ≤ x ≤ b)/(b − a + 1)
• E[X] = (a + b)/2, V[X] = ((b − a + 1)² − 1)/12
• M_X(s) = (e^{as} − e^{−(b+1)s}) / (s(b − a))

Bernoulli(p)
• F_X(x) = (1 − p)^{1−x}
• f_X(x) = p^x (1 − p)^{1−x}
• E[X] = p, V[X] = p(1 − p)
• M_X(s) = 1 − p + pe^s

Binomial(n, p)
• F_X(x) = I_{1−p}(n − x, x + 1)
• f_X(x) = (n choose x) p^x (1 − p)^{n−x}
• E[X] = np, V[X] = np(1 − p)
• M_X(s) = (1 − p + pe^s)^n

Multinomial(n, p)
• f_X(x) = (n!/(x_1! ⋯ x_k!)) p_1^{x_1} ⋯ p_k^{x_k} with ∑_{i=1}^k x_i = n
• E[X_i] = np_i, V[X_i] = np_i(1 − p_i)
• M_X(s) = (∑_i p_i e^{s_i})^n

Hypergeometric(N, m, n)
• F_X(x) ≈ Φ((x − np)/√(np(1 − p))) with p = m/N
• f_X(x) = (m choose x)(N − m choose n − x)/(N choose n)
• E[X] = nm/N, V[X] = nm(N − n)(N − m)/(N²(N − 1))
• M_X(s): N/A

NegativeBinomial(r, p)
• F_X(x) = I_p(r, x + 1)
• f_X(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x
• E[X] = r(1 − p)/p, V[X] = r(1 − p)/p²
• M_X(s) = (p/(1 − (1 − p)e^s))^r

Geometric(p)
• F_X(x) = 1 − (1 − p)^x, x ∈ ℕ⁺
• f_X(x) = p(1 − p)^{x−1}, x ∈ ℕ⁺
• E[X] = 1/p, V[X] = (1 − p)/p²
• M_X(s) = pe^s/(1 − (1 − p)e^s)

Poisson(λ)
• F_X(x) = e^{−λ} ∑_{i=0}^{x} λ^i/i!
• f_X(x) = λ^x e^{−λ}/x!
• E[X] = λ, V[X] = λ
• M_X(s) = e^{λ(e^s − 1)}

[Figure: PMFs of the discrete uniform, binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10) distributions.]
1.2 Continuous Distributions
Notation: F_X(x) = CDF, f_X(x) = PDF, E[X] = mean, V[X] = variance, M_X(s) = MGF.

Uniform(a, b)
• F_X(x) = 0 for x < a; (x − a)/(b − a) for a < x < b; 1 for x > b
• f_X(x) = I(a < x < b)/(b − a)
• E[X] = (a + b)/2, V[X] = (b − a)²/12
• M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal(μ, σ²)
• F_X(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
• f_X(x) = φ(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)}
• E[X] = μ, V[X] = σ²
• M_X(s) = exp{μs + σ²s²/2}

Log-Normal(μ, σ²)
• F_X(x) = 1/2 + (1/2) erf[(ln x − μ)/(√2 σ)]
• f_X(x) = (1/(x√(2πσ²))) exp{−(ln x − μ)²/(2σ²)}
• E[X] = e^{μ + σ²/2}, V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal(μ, Σ)
• f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)}
• E[X] = μ, V[X] = Σ
• M_X(s) = exp{μᵀs + (1/2) sᵀΣs}

Chi-square(k)
• F_X(x) = γ(k/2, x/2)/Γ(k/2)
• f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}
• E[X] = k, V[X] = 2k
• M_X(s) = (1 − 2s)^{−k/2}, s < 1/2

Exponential(β)
• F_X(x) = 1 − e^{−x/β}
• f_X(x) = (1/β) e^{−x/β}
• E[X] = β, V[X] = β²
• M_X(s) = 1/(1 − βs), s < 1/β

Gamma(α, β)¹
• F_X(x) = γ(α, x/β)/Γ(α)
• f_X(x) = (1/(Γ(α) β^α)) x^{α − 1} e^{−x/β}
• E[X] = αβ, V[X] = αβ²
• M_X(s) = (1/(1 − βs))^α, s < 1/β

InverseGamma(α, β)
• F_X(x) = Γ(α, β/x)/Γ(α)
• f_X(x) = (β^α/Γ(α)) x^{−α − 1} e^{−β/x}
• E[X] = β/(α − 1) for α > 1, V[X] = β²/((α − 1)²(α − 2)) for α > 2
• M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet(α)
• f_X(x) = (Γ(∑_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
• E[X_i] = α_i/∑_{i=1}^k α_i, V[X_i] = E[X_i](1 − E[X_i])/(∑_{i=1}^k α_i + 1)

Beta(α, β)²
• F_X(x) = I_x(α, β)
• f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α − 1}(1 − x)^{β − 1}
• E[X] = α/(α + β), V[X] = αβ/((α + β)²(α + β + 1))
• M_X(s) = 1 + ∑_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull(λ, k)
• F_X(x) = 1 − e^{−(x/λ)^k}
• f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
• E[X] = λΓ(1 + 1/k), V[X] = λ²Γ(1 + 2/k) − μ²
• M_X(s) = ∑_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto(x_m, α)
• F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
• f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
• E[X] = αx_m/(α − 1) for α > 1, V[X] = x_m² α/((α − 1)²(α − 2)) for α > 2
• M_X(s) = α(−x_m s)^α Γ(−α, −x_m s), s < 0

¹ γ(s, x) in the CDF of the Gamma distribution denotes the lower incomplete gamma function, defined as γ(s, x) = ∫₀ˣ t^{s−1} e^{−t} dt.
² I_x(a, b) in the CDF of the Beta distribution denotes the regularized incomplete beta function, defined as I_x(a, b) = B(x; a, b)/B(a, b), where B(x; a, b) = ∫₀ˣ t^{a−1}(1 − t)^{b−1} dt is the incomplete beta function.

[Figure: PDFs of the continuous uniform, normal, log-normal, χ², exponential, gamma, inverse-gamma, beta, Weibull, and Pareto distributions for various parameter values.]
2 Probability Theory

Definitions
• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra 𝒜
   1. ∅ ∈ 𝒜
   2. A₁, A₂, … ∈ 𝒜 ⟹ ⋃_{i=1}^∞ A_i ∈ 𝒜
   3. A ∈ 𝒜 ⟹ ¬A ∈ 𝒜
• Probability distribution P
   1. P[A] ≥ 0 for every A
   2. P[Ω] = 1
   3. P[⨆_{i=1}^∞ A_i] = ∑_{i=1}^∞ P[A_i] for disjoint A_i
• Probability space (Ω, 𝒜, P)

Properties
• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[Ω] = 1, P[∅] = 0
• ¬(⋃_n A_n) = ⋂_n ¬A_n and ¬(⋂_n A_n) = ⋃_n ¬A_n (DeMorgan)
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B] ⟹ P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities
• A₁ ⊂ A₂ ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋃_{i=1}^∞ A_i
• A₁ ⊃ A₂ ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A] where A = ⋂_{i=1}^∞ A_i

Independence
A ⫫ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B] if P[B] > 0

Law of Total Probability
P[B] = ∑_{i=1}^n P[B | A_i] P[A_i] where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem
P[A_i | B] = P[B | A_i] P[A_i] / ∑_{j=1}^n P[B | A_j] P[A_j] where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle
|⋃_{i=1}^n A_i| = ∑_{r=1}^n (−1)^{r−1} ∑_{i_1 < ⋯ < i_r} |A_{i_1} ∩ ⋯ ∩ A_{i_r}|

3 Random Variables

Random Variable
X : Ω → ℝ

Probability Mass Function (PMF)
f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
F_X : ℝ → [0, 1], F_X(x) = P[X ≤ x]
1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy for a ≤ b
f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function
Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ⁻¹(z)] = ∑_{x ∈ ϕ⁻¹(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ strictly monotone
f_Z(z) = f_X(ϕ⁻¹(z)) |d/dz ϕ⁻¹(z)| = f_X(x) |dx/dz| = f_X(x)/|J|

The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y: f_Z(z) = ∫_{−∞}^{∞} f_{X,Y}(x, z − x) dx; if X, Y ≥ 0, f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|: f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y: f_Z(z) = ∫_{−∞}^{∞} |x| f_{X,Y}(x, xz) dx; if X ⫫ Y, f_Z(z) = ∫_{−∞}^{∞} |x| f_X(x) f_Y(xz) dx

4 Expectation

Expectation
E[X] = μ_X = ∫ x dF_X(x) = ∑_x x f_X(x) if X discrete; ∫ x f_X(x) dx if X continuous
• P[X = c] = 1 ⟹ E[c] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dF_X(x) dF_Y(y)
• E[ϕ(X)] ≠ ϕ(E[X]) in general (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y], and P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = ∑_{x=1}^∞ P[X ≥ x] for X taking values in the non-negative integers

Sample mean
X̄_n = (1/n) ∑_{i=1}^n X_i

Conditional Expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^{∞} ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Variance
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[∑_{i=1}^n X_i] = ∑_{i=1}^n V[X_i] + 2 ∑_{i≠j} Cov[X_i, X_j]
• V[∑_{i=1}^n X_i] = ∑_{i=1}^n V[X_i] iff X_i ⫫ X_j

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
 
• Cov[∑_{i=1}^n X_i, ∑_{j=1}^m Y_j] = ∑_{i=1}^n ∑_{j=1}^m Cov[X_i, Y_j]

Correlation
ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence
X ⫫ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄_n)²

Conditional Variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

Cauchy-Schwarz
E[XY]² ≤ E[X²] E[Y²]

Markov
P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t

Chebyshev
P[|X − E[X]| ≥ t] ≤ V[X]/t²

Chernoff
P[X ≥ (1 + δ)μ] ≤ (e^δ/(1 + δ)^{1+δ})^μ for δ > −1

Jensen
E[ϕ(X)] ≥ ϕ(E[X]) for ϕ convex

7 Distribution Relationships

Binomial
• X_i ∼ Bernoulli(p) ⟹ ∑_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⫫ Y ⟹ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Poisson(np) (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p)) (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geometric(p)
• X ∼ NBin(r, p) = ∑_{i=1}^r Geometric(p)
• X_i ∼ NBin(r_i, p) ⟹ ∑ X_i ∼ NBin(∑ r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Poisson(λ_i) ∧ X_i ⫫ X_j ⟹ ∑_{i=1}^n X_i ∼ Poisson(∑_{i=1}^n λ_i)
• X_i ∼ Poisson(λ_i) ∧ X_i ⫫ X_j ⟹ X_i | ∑_{j=1}^n X_j ∼ Bin(∑_{j=1}^n X_j, λ_i/∑_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β) ∧ X_i ⫫ X_j ⟹ ∑_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(μ, σ²) ⟹ (X − μ)/σ ∼ N(0, 1)
• X ∼ N(μ, σ²) ∧ Z = aX + b ⟹ Z ∼ N(aμ + b, a²σ²)
• X ∼ N(μ₁, σ₁²) ∧ Y ∼ N(μ₂, σ₂²), X ⫫ Y ⟹ X + Y ∼ N(μ₁ + μ₂, σ₁² + σ₂²)
• X_i ∼ N(μ_i, σ_i²) independent ⟹ ∑_i X_i ∼ N(∑_i μ_i, ∑_i σ_i²)
• P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
• Φ(−x) = 1 − Φ(x), φ′(x) = −xφ(x), φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma (distribution)
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ ∑_{i=1}^α Exp(β) for integer α
• X_i ∼ Gamma(α_i, β) ∧ X_i ⫫ X_j ⟹ ∑_i X_i ∼ Gamma(∑_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Gamma (function)
• Ordinary: Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt
• Lower incomplete: γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt
• Γ(α + 1) = αΓ(α), α > 1
• Γ(n) = (n − 1)! for n ∈ ℕ
• Γ(1/2) = √π

Beta (distribution)
• f(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)

Beta (function)
• Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete: I_x(a, b) = B(x; a, b)/B(a, b); for a, b ∈ ℕ, I_x(a, b) = ∑_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}
• I_0(a, b) = 0, I_1(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X], |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[∑_{i=0}^∞ (Xt)^i/i!] = ∑_{i=0}^∞ (E[X^i]/i!) t^i
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X and Y are equal in distribution

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⫫ Z and Y = ρX + √(1 − ρ²) Z.

Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}

Conditionals
(Y | X = x) ∼ N(ρx, 1 − ρ²) and (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
X ⫫ Y ⟺ ρ = 0

9.2 Bivariate Normal
Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²).
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp{−z/(2(1 − ρ²))}
z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)

Conditional mean and variance
E[X | Y] = E[X] + ρ (σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ⁻¹):
Σ = [ V[X₁] ⋯ Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ⋯ V[X_k] ]

If X ∼ N(μ, Σ),
f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)}
Properties
• Z ∼ N(0, 1) ∧ X = μ + Σ^{1/2} Z ⟹ X ∼ N(μ, Σ)
• X ∼ N(μ, Σ) ⟹ Σ^{−1/2}(X − μ) ∼ N(0, 1)
• X ∼ N(μ, Σ) ⟹ AX ∼ N(Aμ, AΣAᵀ)
• X ∼ N(μ, Σ) ∧ a a vector of length k ⟹ aᵀX ∼ N(aᵀμ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let F_n denote the cdf of X_n and F the cdf of X.

Types of Convergence
1. In distribution (weakly, in law): X_n →ᴰ X if lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
2. In probability: X_n →ᴾ X if (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →ᵃˢ X if P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →ᵠᵐ X if lim_{n→∞} E[(X_n − X)²] = 0

Relationships
• X_n →ᵠᵐ X ⟹ X_n →ᴾ X ⟹ X_n →ᴰ X
• X_n →ᵃˢ X ⟹ X_n →ᴾ X
• X_n →ᴰ X ∧ (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →ᴾ X
• X_n →ᴾ X ∧ Y_n →ᴾ Y ⟹ X_n + Y_n →ᴾ X + Y
• X_n →ᵠᵐ X ∧ Y_n →ᵠᵐ Y ⟹ X_n + Y_n →ᵠᵐ X + Y
• X_n →ᴾ X ∧ Y_n →ᴾ Y ⟹ X_n Y_n →ᴾ XY
• X_n →ᴾ X ⟹ ϕ(X_n) →ᴾ ϕ(X)
• X_n →ᴰ X ⟹ ϕ(X_n) →ᴰ ϕ(X)
• X_n →ᵠᵐ b ⟺ lim_{n→∞} E[X_n] = b ∧ lim_{n→∞} V[X_n] = 0
• X₁, …, X_n iid ∧ E[X] = μ ∧ V[X] < ∞ ⟹ X̄_n →ᵠᵐ μ

Slutzky's Theorem
• X_n →ᴰ X and Y_n →ᴾ c ⟹ X_n + Y_n →ᴰ X + c
• X_n →ᴰ X and Y_n →ᴾ c ⟹ X_n Y_n →ᴰ cX
• In general: X_n →ᴰ X and Y_n →ᴰ Y does not imply X_n + Y_n →ᴰ X + Y

10.1 Law of Large Numbers (LLN)
Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ and V[X₁] < ∞.
Weak (WLLN): X̄_n →ᴾ μ as n → ∞
Strong (SLLN): X̄_n →ᵃˢ μ as n → ∞

10.2 Central Limit Theorem (CLT)
Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = μ and V[X₁] = σ².
Z_n := (X̄_n − μ)/√(V[X̄_n]) = √n (X̄_n − μ)/σ →ᴰ Z where Z ∼ N(0, 1)
lim_{n→∞} P[Z_n ≤ z] = Φ(z), z ∈ ℝ

CLT notations
Z_n ≈ N(0, 1)
X̄_n ≈ N(μ, σ²/n)
X̄_n − μ ≈ N(0, σ²/n)
√n (X̄_n − μ) ≈ N(0, σ²)
√n (X̄_n − μ)/σ ≈ N(0, 1)

Continuity Correction
P[X̄_n ≤ x] ≈ Φ((x + 1/2 − μ)/(σ/√n))
P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − μ)/(σ/√n))

Delta Method
Y_n ≈ N(μ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(μ), (ϕ′(μ))² σ²/n)
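A minimal simulation sketch of the CLT notations above (NumPy assumed; the Exponential example and all names are illustrative, not part of the original sheet): standardized means of iid draws behave approximately like N(0, 1).

```python
import numpy as np

# CLT illustration (Section 10.2): Z_n = sqrt(n)(Xbar_n - mu)/sigma for iid
# Exponential(1) draws (mu = sigma = 1) is approximately standard normal.
rng = np.random.default_rng(0)
n, reps = 50, 10_000
mu, sigma = 1.0, 1.0

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma

# Compare the simulated two-sided tail probability with the nominal 5%.
print("P(|Z_n| > 1.96) ≈", np.mean(np.abs(z) > 1.96))
```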
11 Statistical Inference

Let X₁, …, X_n ~iid F if not otherwise noted.

11.1 Point Estimation
• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, …, X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →ᴾ θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 ∧ lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →ᴰ N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-based Confidence Interval
Suppose θ̂_n ≈ N(θ, ŝe²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then
C_n = θ̂_n ± z_{α/2} ŝe

11.3 Empirical Distribution Function
Empirical Distribution Function (ECDF)
F̂_n(x) = (1/n) ∑_{i=1}^n I(X_i ≤ x), where I(X_i ≤ x) = 1 if X_i ≤ x and 0 if X_i > x

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →ᴾ F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, …, X_n ∼ F)
P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F
L(x) = max{F̂_n(x) − ε_n, 0}
U(x) = min{F̂_n(x) + ε_n, 1}
ε_n = √((1/(2n)) log(2/α))
P[L(x) ≤ F(x) ≤ U(x) ∀x] ≥ 1 − α

11.4 Statistical Functionals
• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for a linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) ∑_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), ŝe²) ⟹ T(F̂_n) ± z_{α/2} ŝe
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• μ̂ = X̄_n
• σ̂² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄_n)²
• κ̂ = (1/n) ∑_{i=1}^n (X_i − μ̂)³ / σ̂³
• ρ̂ = ∑_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(∑_{i=1}^n (X_i − X̄_n)²) √(∑_{i=1}^n (Y_i − Ȳ_n)²))
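A small sketch of the ECDF and the DKW confidence band of Section 11.3 (NumPy assumed; the helper name and the normal example data are illustrative):

```python
import numpy as np

def ecdf_with_dkw_band(x, alpha=0.05):
    """ECDF F_hat evaluated at the sorted sample points, plus the
    nonparametric 1 - alpha DKW band L(x), U(x) from Section 11.3."""
    x = np.sort(np.asarray(x))
    n = x.size
    f_hat = np.arange(1, n + 1) / n                   # F_hat(x_(i)) = i/n
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))    # DKW half-width
    return x, f_hat, np.clip(f_hat - eps, 0, 1), np.clip(f_hat + eps, 0, 1)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)
xs, f_hat, lo, hi = ecdf_with_dkw_band(sample)
# Plug-in estimate of a linear functional (the mean, Section 11.4):
print("plug-in mean:", sample.mean(), "band half-width:", (hi - f_hat).max())
```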
12 Parametric Inference

Let 𝔉 = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝ^k and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments
jth moment
α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment
α̂_j = (1/n) ∑_{i=1}^n X_i^j

Method of Moments estimator (MoM): solve the system
α₁(θ) = α̂₁, α₂(θ) = α̂₂, …, α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →ᴾ θ
• Asymptotic normality: √n (θ̂ − θ) →ᴰ N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², …, X^k)ᵀ, g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ

12.2 Maximum Likelihood
Likelihood: L_n : Θ → [0, ∞)
L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood
ℓ_n(θ) = log L_n(θ) = ∑_{i=1}^n log f(X_i; θ)

Maximum Likelihood Estimator (mle)
L_n(θ̂_n) = sup_θ L_n(θ)

Score Function
s(X; θ) = ∂/∂θ log f(X; θ)

Fisher Information
I(θ) = V_θ[s(X; θ)]
I_n(θ) = nI(θ)

Fisher Information (exponential family)
I(θ) = E_θ[−∂s(X; θ)/∂θ]

Observed Fisher Information
I_n^obs(θ) = −∂²/∂θ² ∑_{i=1}^n log f(X_i; θ)

Properties of the mle
• Consistency: θ̂_n →ᴾ θ
• Equivariance: θ̂_n is the mle ⟹ ϕ(θ̂_n) is the mle of ϕ(θ)
• Asymptotic normality:
   1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →ᴰ N(0, 1)
   2. ŝe ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/ŝe →ᴰ N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1
• Approximately the Bayes estimator

12.2.1 Delta Method
If τ = ϕ(θ) where ϕ is differentiable and ϕ′(θ) ≠ 0:
(τ̂_n − τ)/ŝe(τ̂) →ᴰ N(0, 1)
where τ̂ = ϕ(θ̂) is the mle of τ and ŝe = |ϕ′(θ̂)| ŝe(θ̂_n)
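A small numerical sketch of 12.2 and 12.2.1 (NumPy assumed; the Exponential(β) example, with β the mean, is illustrative): the mle is β̂ = X̄_n, the Fisher information gives ŝe(β̂) ≈ β̂/√n, and the delta method gives a standard error for the rate τ = 1/β.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_true = 2.0
x = rng.exponential(scale=beta_true, size=500)
n = x.size

# mle for Exponential(beta): beta_hat = sample mean; I(beta) = 1/beta^2,
# so se(beta_hat) ≈ sqrt(1 / I_n(beta_hat)) = beta_hat / sqrt(n).
beta_hat = x.mean()
se_beta = beta_hat / np.sqrt(n)

# Delta method for tau = phi(beta) = 1/beta: se(tau_hat) = |phi'(beta_hat)| se(beta_hat).
tau_hat = 1.0 / beta_hat
se_tau = se_beta / beta_hat**2

print(f"beta_hat = {beta_hat:.3f} ± {1.96 * se_beta:.3f}")
print(f"tau_hat  = {tau_hat:.3f} ± {1.96 * se_tau:.3f}")
```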
12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.
H_jj = ∂²ℓ_n/∂θ_j², H_jk = ∂²ℓ_n/∂θ_j∂θ_k

Fisher Information Matrix
I_n(θ) = −[ E_θ[H₁₁] ⋯ E_θ[H₁_k] ; ⋮ ⋱ ⋮ ; E_θ[H_k1] ⋯ E_θ[H_kk] ]

Under appropriate regularity conditions
(θ̂ − θ) ≈ N(0, J_n) with J_n(θ) = I_n⁻¹
Further, if θ̂_j is the jth component of θ̂, then (θ̂_j − θ_j)/ŝe_j →ᴰ N(0, 1), where ŝe_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method
Let τ = ϕ(θ₁, …, θ_k) be a function and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, …, ∂ϕ/∂θ_k)ᵀ.
Suppose ∇ϕ evaluated at θ = θ̂ is nonzero and τ̂ = ϕ(θ̂). Then
(τ̂ − τ)/ŝe(τ̂) →ᴰ N(0, 1)
where ŝe(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)), Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂.

12.4 Parametric Bootstrap
Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the mle or method of moments estimator.

13 Hypothesis Testing

H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁

Definitions
• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

           | Retain H₀            | Reject H₀
H₀ true    | correct              | Type I error (α)
H₁ true    | Type II error (β)    | correct (power)

p-value
• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, where P_θ[T(X*) ≥ T(X)] = 1 − F_θ(T(X)) since T(X*) ∼ F_θ

p-value      | evidence
< 0.01       | very strong evidence against H₀
0.01 - 0.05  | strong evidence against H₀
0.05 - 0.1   | weak evidence against H₀
> 0.1        | little or no evidence against H₀

Wald Test
• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)
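A minimal sketch of the Wald test above (NumPy and the standard library assumed; the Bernoulli proportion example is illustrative):

```python
import numpy as np
from statistics import NormalDist

# Wald test (Section 13) for H0: p = 0.5 with a Bernoulli sample.
# W = (p_hat - p0)/se_hat with se_hat = sqrt(p_hat(1 - p_hat)/n),
# p-value ≈ 2 Phi(-|W|).
rng = np.random.default_rng(4)
x = rng.binomial(1, 0.58, size=200)
p0 = 0.5

p_hat = x.mean()
se_hat = np.sqrt(p_hat * (1 - p_hat) / x.size)
w = (p_hat - p0) / se_hat
p_value = 2 * NormalDist().cdf(-abs(w))
print(f"W = {w:.2f}, p-value = {p_value:.4f}")
```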
Likelihood Ratio Test (LRT)
• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0})
• λ(X) = 2 log T(X) →ᴰ χ²_{r−q}, where χ²_k is the distribution of ∑_{i=1}^k Z_i² with Z₁, …, Z_k ~iid N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• Let p̂_n = (X₁/n, …, X_k/n) be the mle
• T(X) = L_n(p̂_n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p_{0j})^{X_j}
• λ(X) = 2 ∑_{j=1}^k X_j log(p̂_j/p_{0j}) →ᴰ χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson χ² Test
• T = ∑_{j=1}^k (X_j − E[X_j])²/E[X_j] where E[X_j] = np_{0j} under H₀
• T →ᴰ χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence Testing
• I rows, J columns, X a multinomial sample of size n = I·J
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂_{0ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
• LRT: λ = 2 ∑_{i=1}^I ∑_{j=1}^J X_ij log(nX_ij/(X_{i·}X_{·j}))
• Pearson χ²: T = ∑_{i=1}^I ∑_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →ᴰ χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Bayes' Theorem
f(θ | xⁿ) = f(xⁿ | θ) f(θ)/f(xⁿ) = f(xⁿ | θ) f(θ)/∫ f(xⁿ | θ) f(θ) dθ ∝ L_n(θ) f(θ)

Definitions
• Xⁿ = (X₁, …, X_n)
• xⁿ = (x₁, …, x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data; in particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ) f(θ) dθ
• Kernel: the part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ) f(θ) dθ / ∫ L_n(θ) f(θ) dθ

14.1 Credible Intervals
1 − α posterior interval
P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

1 − α equal-tail credible interval
∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

1 − α highest posterior density (HPD) region R_n
1. P[θ ∈ R_n] = 1 − α
2. R_n = {θ : f(θ | xⁿ) > k} for some k
R_n is unimodal ⟹ R_n is an interval

14.2 Function of Parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.
Posterior CDF for τ: H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ
Posterior density: h(τ | xⁿ) = H′(τ | xⁿ)
Bayesian delta method: τ | Xⁿ ≈ N(ϕ(θ̂), ŝe² (ϕ′(θ̂))²)
14.3 Priors

Choice
• Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
• Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
• Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^{∞} f(θ) dθ = 1
• Improper: ∫_{−∞}^{∞} f(θ) dθ = ∞
• Jeffreys' prior (transformation-invariant): f(θ) ∝ √I(θ), f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

14.3.1 Conjugate Priors

Discrete likelihood (likelihood / conjugate prior / posterior hyperparameters):
• Bernoulli(p) / Beta(α, β) / α + ∑_{i=1}^n x_i, β + n − ∑_{i=1}^n x_i
• Binomial(p) / Beta(α, β) / α + ∑_{i=1}^n x_i, β + ∑_{i=1}^n N_i − ∑_{i=1}^n x_i
• Negative Binomial(p) / Beta(α, β) / α + rn, β + ∑_{i=1}^n x_i
• Poisson(λ) / Gamma(α, β) / α + ∑_{i=1}^n x_i, β + n
• Multinomial(p) / Dirichlet(α) / α + ∑_{i=1}^n x^{(i)}
• Geometric(p) / Beta(α, β) / α + n, β + ∑_{i=1}^n x_i

Continuous likelihood (subscript c denotes a constant):
• Uniform(0, θ) / Pareto(x_m, k) / max{x_{(n)}, x_m}, k + n
• Exponential(λ) / Gamma(α, β) / α + n, β + ∑_{i=1}^n x_i
• Normal(μ, σ_c²) / Normal(μ₀, σ₀²) / (μ₀/σ₀² + ∑_{i=1}^n x_i/σ_c²)/(1/σ₀² + n/σ_c²), (1/σ₀² + n/σ_c²)⁻¹
• Normal(μ_c, σ²) / Scaled Inverse Chi-square(ν, σ₀²) / ν + n, (νσ₀² + ∑_{i=1}^n (x_i − μ)²)/(ν + n)
• Normal(μ, σ²) / Normal-scaled Inverse Gamma(λ, ν, α, β) / (νλ + nx̄)/(ν + n), ν + n, α + n/2, β + (1/2) ∑_{i=1}^n (x_i − x̄)² + γ(x̄ − λ)²/(2(n + γ))
• MVN(μ, Σ_c) / MVN(μ₀, Σ₀) / (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹μ₀ + nΣ_c⁻¹x̄), (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹
• MVN(μ_c, Σ) / Inverse-Wishart(κ, Ψ) / n + κ, Ψ + ∑_{i=1}^n (x_i − μ_c)(x_i − μ_c)ᵀ
• Pareto(x_{m,c}, k) / Gamma(α, β) / α + n, β + ∑_{i=1}^n log(x_i/x_{m,c})
• Pareto(x_m, k_c) / Pareto(x₀, k₀) / x₀, k₀ − kn where k₀ > kn
• Gamma(α_c, β) / Gamma(α₀, β₀) / α₀ + nα_c, β₀ + ∑_{i=1}^n x_i
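A tiny sketch of the first conjugate pair above (NumPy assumed; the helper name and example data are illustrative): the Beta prior and Bernoulli likelihood give a closed-form posterior.

```python
import numpy as np

def beta_bernoulli_update(data, alpha=1.0, beta=1.0):
    """Conjugate update from Section 14.3.1: Bernoulli(p) likelihood with a
    Beta(alpha, beta) prior yields a Beta(alpha + sum x_i, beta + n - sum x_i)
    posterior; returns the posterior hyperparameters and posterior mean."""
    data = np.asarray(data)
    s, n = data.sum(), data.size
    a_post, b_post = alpha + s, beta + n - s
    return a_post, b_post, a_post / (a_post + b_post)

# 7 successes in 10 trials under a flat Beta(1, 1) prior.
a, b, post_mean = beta_bernoulli_update([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])
print(a, b, round(post_mean, 3))   # 8.0 4.0 0.667
```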
14.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:
Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
Posterior probability P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Let H₀, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k); then
P[H_k | xⁿ] = f(xⁿ | H_k) P[H_k] / ∑_{k=1}^K f(xⁿ | H_k) P[H_k]

Marginal Likelihood
f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i) f(θ | H_i) dθ

Posterior Odds (of H_i relative to H_j)
P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j]), i.e., Bayes factor BF_ij times prior odds

Bayes Factor
log₁₀ BF₁₀ | BF₁₀     | evidence
0 - 0.5    | 1 - 1.5  | Weak
0.5 - 1    | 1.5 - 10 | Moderate
1 - 2      | 10 - 100 | Strong
> 2        | > 100    | Decisive

p* = (p/(1 − p)) BF₁₀ / (1 + (p/(1 − p)) BF₁₀), where p = P[H₁] and p* = P[H₁ | xⁿ]

15 Exponential Family

Scalar parameter
f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x) g(θ) exp{η(θ)T(x)}

Vector parameter
f_X(x | θ) = h(x) exp{∑_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x) g(θ) exp{η(θ)·T(x)}

Natural form
f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x) g(η) exp{η·T(x)} = h(x) g(η) exp{ηᵀT(x)}

16 Sampling Methods

16.1 The Bootstrap
Let T_n = g(X₁, …, X_n) be a statistic.
1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
      i. Sample uniformly X₁*, …, X_n* ∼ F̂_n.
      ii. Compute T_n* = g(X₁*, …, X_n*).
   (b) Then
      v_boot = V̂_{F̂_n} = (1/B) ∑_{b=1}^B (T*_{n,b} − (1/B) ∑_{r=1}^B T*_{n,r})²

16.1.1 Bootstrap Confidence Intervals

Normal-based Interval
T_n ± z_{α/2} ŝe_boot

Pivotal Interval
1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap: Ĥ(r) = (1/B) ∑_{b=1}^B I(R*_{n,b} ≤ r)
5. Let θ*_β denote the β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
6. Let r*_β denote the β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Then, an approximate 1 − α confidence interval is C_n = (â, b̂) with
   â = θ̂_n − Ĥ⁻¹(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
   b̂ = θ̂_n − Ĥ⁻¹(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

Percentile Interval
C_n = (θ*_{α/2}, θ*_{1−α/2})
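A compact sketch of 16.1 and 16.1.1 (NumPy and the standard library assumed; the function name, the median statistic, and the lognormal example are illustrative):

```python
import numpy as np
from statistics import NormalDist

def bootstrap_intervals(x, stat=np.median, B=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap for T_n = stat(X_1,...,X_n): resample with
    replacement from F_hat_n, then form the normal-based, pivotal, and
    percentile 1 - alpha intervals of Section 16.1.1."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    t_hat = stat(x)
    t_star = np.array([stat(rng.choice(x, size=x.size, replace=True))
                       for _ in range(B)])
    se_boot = t_star.std(ddof=1)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo_q, hi_q = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    return {"normal":     (t_hat - z * se_boot, t_hat + z * se_boot),
            "pivotal":    (2 * t_hat - hi_q, 2 * t_hat - lo_q),
            "percentile": (lo_q, hi_q)}

rng = np.random.default_rng(3)
print(bootstrap_intervals(rng.lognormal(size=100)))
```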
16.2 Rejection Sampling

Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportionality constant: h(θ) = k(θ)/∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) ∀θ

Algorithm
1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior with h(θ) ∝ k(θ) = f(xⁿ | θ) f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm:
   1. Draw θ_cand ∼ f(θ)
   2. Generate u ∼ Unif(0, 1)
   3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)

16.3 Importance Sampling

Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior θ₁, …, θ_B ~iid f(θ)
2. For each i = 1, …, B, calculate w_i = L_n(θ_i)/∑_{i=1}^B L_n(θ_i)
3. E[q(θ) | xⁿ] ≈ ∑_{i=1}^B q(θ_i) w_i
17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: a possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: the consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂; L : Θ × A → [−k, ∞).

Loss functions
• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0; K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a| (linear loss with K₁ = K₂)
• L_p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ; 1 if a ≠ θ

17.1 Risk

Posterior Risk
r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) Risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes Risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if ∀θ : R(θ, θ̂′) ≤ R(θ, θ̂) and ∃θ : R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes Rule (or Bayes Estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) minimizes r(θ̂ | x) for every x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum Risk
R̄(θ̂) = sup_θ R(θ, θ̂), R̄(a) = sup_θ R(θ, a)

Minimax Rule
sup_θ R(θ, θ̃) = inf_{θ̂} R̄(θ̂) = inf_{θ̂} sup_θ R(θ, θ̂)

θ̂ = Bayes rule ∧ ∃c : R(θ, θ̂) = c ⟹ θ̂ is minimax

Least Favorable Prior
θ̂^f = Bayes rule ∧ R(θ, θ̂^f) ≤ r(f, θ̂^f) ∀θ ⟹ θ̂^f is minimax

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
Y_i = β₀ + β₁X_i + ε_i, E[ε_i | X_i] = 0, V[ε_i | X_i] = σ²

Fitted Line
r̂(x) = β̂₀ + β̂₁x

Predicted (Fitted) Values
Ŷ_i = r̂(X_i)

Residuals
ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual Sums of Squares (rss)
rss(β̂₀, β̂₁) = ∑_{i=1}^n ε̂_i²

Least Square Estimates
β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizes rss:
β̂₀ = Ȳ_n − β̂₁X̄_n
β̂₁ = ∑_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n)/∑_{i=1}^n (X_i − X̄_n)² = (∑_{i=1}^n X_iY_i − nX̄Ȳ)/(∑_{i=1}^n X_i² − nX̄²)

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ n⁻¹∑_{i=1}^n X_i²   −X̄_n ; −X̄_n   1 ]
ŝe(β̂₀) = (σ̂/(s_X√n)) √(∑_{i=1}^n X_i²/n)
ŝe(β̂₁) = σ̂/(s_X√n)
where s_X² = n⁻¹∑_{i=1}^n (X_i − X̄_n)² and σ̂² = (1/(n − 2)) ∑_{i=1}^n ε̂_i² (an unbiased estimate of σ²). Further properties:
• Consistency: β̂₀ →ᴾ β₀ and β̂₁ →ᴾ β₁
• Asymptotic normality: (β̂₀ − β₀)/ŝe(β̂₀) →ᴰ N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →ᴰ N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁: β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
• The Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/ŝe(β̂₁).

R²
R² = ∑_{i=1}^n (Ŷ_i − Ȳ)²/∑_{i=1}^n (Y_i − Ȳ)² = 1 − ∑_{i=1}^n ε̂_i²/∑_{i=1}^n (Y_i − Ȳ)² = 1 − rss/tss
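A short sketch of the closed-form least squares estimates, standard errors, and R² above (NumPy assumed; the function name and simulated data are illustrative):

```python
import numpy as np

def simple_ols(x, y):
    """Least squares estimates of Section 18.1 with standard errors and R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
    s_x = np.sqrt(np.mean((x - x.mean()) ** 2))
    se_b1 = np.sqrt(sigma2) / (s_x * np.sqrt(n))
    se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return b0, b1, se_b0, se_b1, r2

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=100)
print(simple_ols(x, y))
```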
Likelihood
L = ∏_{i=1}^n f(X_i, Y_i) = ∏_{i=1}^n f_X(X_i) × ∏_{i=1}^n f_{Y|X}(Y_i | X_i) = L₁ × L₂
L₁ = ∏_{i=1}^n f_X(X_i)
L₂ = ∏_{i=1}^n f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) ∑_i (Y_i − (β₀ + β₁X_i))²}
Under the assumption of normality, the least squares estimator is also the mle, with
σ̂² = (1/n) ∑_{i=1}^n ε̂_i²

18.2 Prediction

Observe X = x* of the covariate and predict the outcome Y*.
Ŷ* = β̂₀ + β̂₁x*
V[Ŷ*] = V[β̂₀] + x*² V[β̂₁] + 2x* Cov[β̂₀, β̂₁]

Prediction Interval
ξ̂_n² = σ̂² (∑_{i=1}^n (X_i − x*)²/(n ∑_i (X_i − X̄)²) + 1)
Ŷ* ± z_{α/2} ξ̂_n

18.3 Multiple Regression

Y = Xβ + ε
where X = [ X₁₁ ⋯ X₁_k ; ⋮ ⋱ ⋮ ; X_n1 ⋯ X_nk ], β = (β₁, …, β_k)ᵀ, ε = (ε₁, …, ε_n)ᵀ

Likelihood
L(μ, Σ) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) rss}
rss = (y − Xβ)ᵀ(y − Xβ) = ||Y − Xβ||² = ∑_{i=1}^n (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function
r̂(x) = ∑_{j=1}^k β̂_j x_j

Unbiased estimate for σ²
σ̂² = (1/(n − k)) ∑_{i=1}^n ε̂_i², ε̂ = Xβ̂ − Y

mle
μ̂ = X̄, σ̂²_mle = ((n − k)/n) σ̂²

1 − α Confidence Interval
β̂_j ± z_{α/2} ŝe(β̂_j)

18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis Testing
H₀ : β_j = 0 vs. H₁ : β_j ≠ 0 ∀j ∈ J

Mean Squared Prediction Error (mspe)
mspe = E[(Ŷ(S) − Y*)²]

Prediction Risk
R(S) = ∑_{i=1}^n mspe_i = ∑_{i=1}^n E[(Ŷ_i(S) − Y_i*)²]
Training Error
R̂_tr(S) = ∑_{i=1}^n (Ŷ_i(S) − Y_i)²

R²
R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss

The training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 ∑_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's C_p statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)
BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n

Validation and Training
R̂_V(S) = ∑_{i=1}^m (Ŷ_i*(S) − Y_i*)², where m = |{validation data}|, often n/4 or n/2

Leave-one-out Cross-validation
R̂_CV(S) = ∑_{i=1}^n (Y_i − Ŷ_{(i)})² = ∑_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_ii(S)))²
U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated Square Error (ise)
L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist Risk
R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
b(x) = E[f̂_n(x)] − f(x)
v(x) = V[f̂_n(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram Estimator
f̂_n(x) = ∑_{j=1}^m (p̂_j/h) I(x ∈ B_j)
E[f̂_n(x)] = p_j/h
V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h* = (1/n^{1/3}) (6/∫ (f′(u))² du)^{1/3}
R*(f̂_n, f) ≈ C/n^{2/3}, with C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) ∑_{i=1}^n f̂_{(−i)}(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) ∑_{j=1}^m p̂_j²
19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0

KDE
f̂_n(x) = (1/n) ∑_{i=1}^n (1/h) K((x − X_i)/h)
R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} / n^{1/5}, where c₁ = σ²_K, c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx
R*(f, f̂_n) = c₄/n^{4/5}, where c₄ = (5/4)(σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}; the kernel-dependent factor is often written C(K)

Epanechnikov Kernel
K(x) = (3/(4√5))(1 − x²/5) for |x| < √5; 0 otherwise

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) ∑_{i=1}^n f̂_{(−i)}(X_i) ≈ (1/(hn²)) ∑_{i=1}^n ∑_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
K*(x) = K^{(2)}(x) − 2K(x), K^{(2)}(x) = ∫ K(x − y)K(y) dy

19.2 Non-parametric Regression

Estimate r(x), where r(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n) related by
Y_i = r(x_i) + ε_i
E[ε_i] = 0
V[ε_i] = σ²

k-nearest Neighbor Estimator
r̂(x) = (1/k) ∑_{i : x_i ∈ N_k(x)} Y_i, where N_k(x) = {k values of x₁, …, x_n closest to x}

Nadaraya-Watson Kernel Estimator
r̂(x) = ∑_{i=1}^n w_i(x) Y_i
w_i(x) = K((x − x_i)/h) / ∑_{j=1}^n K((x − x_j)/h) ∈ [0, 1]
R(r̂_n, r) ≈ (h⁴/4)(∫ x²K²(x) dx)(∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx) + (σ² ∫ K²(x) dx) ∫ dx/(nh f(x))
h* ≈ c₁/n^{1/5}
R*(r̂_n, r) ≈ c₂/n^{4/5}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∑_{i=1}^n (Y_i − r̂_{(−i)}(x_i))² = ∑_{i=1}^n ((Y_i − r̂(x_i))/(1 − K(0)/∑_{j=1}^n K((x_i − x_j)/h)))²
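A compact sketch of the two kernel estimators above (NumPy assumed; the Gaussian kernel, the normal-reference rule-of-thumb bandwidth, and all names are illustrative, not the cross-validation choice described in the text):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    """KDE of Section 19.1.2: f_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h)."""
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

def nadaraya_watson(x_grid, x, y, h):
    """Nadaraya-Watson estimator of Section 19.2: r_hat(x) = sum_i w_i(x) Y_i."""
    w = gaussian_kernel((x_grid[:, None] - x[None, :]) / h)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 2 * np.pi, 100)

h = 1.06 * x.std(ddof=1) * x.size ** (-1 / 5)   # normal-reference bandwidth
print("density integrates to ≈", np.trapz(kde(grid, x, h), grid))
print("max |r_hat - sin| ≈", np.abs(nadaraya_watson(grid, x, y, 0.3) - np.sin(grid)).max())
```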
19.3 Smoothing Using Orthogonal Functions

Approximation
r(x) = ∑_{j=1}^∞ β_j φ_j(x) ≈ ∑_{j=1}^J β_j φ_j(x)

Multivariate Regression
Y = Φβ + η, where η_i = ε_i and Φ = [ φ₀(x₁) ⋯ φ_J(x₁) ; ⋮ ⋱ ⋮ ; φ₀(x_n) ⋯ φ_J(x_n) ]

Least Squares Estimator
β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n)ΦᵀY (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
R̂_CV(J) = ∑_{i=1}^n (Y_i − ∑_{j=1}^J φ_j(x_i) β̂_{j,(−i)})²

20 Stochastic Processes

Stochastic Process
{X_t : t ∈ T}, with T = {0, ±1, …} = ℤ (discrete) or T = [0, ∞) (continuous)
• Notations: X_t, X(t)
• State space 𝒳
• Index set T

20.1 Markov Chains

Markov Chain {X_n : n ∈ T}
P[X_n = x | X₀, …, X_{n−1}] = P[X_n = x | X_{n−1}] ∀n ∈ T, x ∈ 𝒳

Transition probabilities
p_ij ≡ P[X_{n+1} = j | X_n = i]
p_ij(n) ≡ P[X_{m+n} = j | X_m = i] (n-step)

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• ∑_j p_ij = 1

Chapman-Kolmogorov
p_ij(m + n) = ∑_k p_ik(m) p_kj(n)
P_{m+n} = P_m P_n
P_n = P × ⋯ × P = Pⁿ

Marginal probability
μ_n = (μ_n(1), …, μ_n(N)) where μ_n(i) = P[X_n = i]
μ₀ = initial distribution
μ_n = μ₀Pⁿ

20.2 Poisson Processes

Poisson Process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: ∀t₀ < ⋯ < t_n : X_{t₁} − X_{t₀} ⫫ ⋯ ⫫ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t):
   - P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
   - P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Poisson(m(s + t) − m(s)), where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson Process
λ(t) ≡ λ ⟹ X_t ∼ Poisson(λt), λ > 0

Waiting Times
W_t := time at which X_t occurs
W_t ∼ Gamma(t, 1/λ)

Interarrival Times
S_t = W_{t+1} − W_t
S_t ∼ Exp(1/λ)
The interarrival time S_t separates consecutive waiting times W_{t−1} and W_t.
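Two minimal simulation sketches for Section 20 (NumPy assumed; transition matrix, rates, and names are illustrative): a finite-state Markov chain stepped with its transition matrix, and a homogeneous Poisson process built from Exp(1/λ) interarrival times.

```python
import numpy as np

rng = np.random.default_rng(8)

def simulate_markov_chain(P, mu0, steps):
    """Markov chain (Section 20.1): X_0 ~ mu0, then X_{n+1} drawn from row X_n of P."""
    P, mu0 = np.asarray(P), np.asarray(mu0)
    states = [rng.choice(len(mu0), p=mu0)]
    for _ in range(steps):
        states.append(rng.choice(P.shape[1], p=P[states[-1]]))
    return np.array(states)

def simulate_poisson_process(lam, t_max):
    """Homogeneous Poisson process (Section 20.2): cumulative Exp(1/lam)
    interarrival times give the event times up to t_max."""
    times, t = [], rng.exponential(1.0 / lam)
    while t <= t_max:
        times.append(t)
        t += rng.exponential(1.0 / lam)
    return np.array(times)

chain = simulate_markov_chain([[0.9, 0.1], [0.4, 0.6]], mu0=[0.5, 0.5], steps=10_000)
print("empirical state distribution:", np.bincount(chain) / chain.size)

events = simulate_poisson_process(lam=3.0, t_max=100.0)
print("events per unit time ≈", events.size / 100.0)   # ≈ lambda
```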
21 Time Series

Mean function
μ_{xt} = E[x_t] = ∫_{−∞}^{∞} x f_t(x) dx

Autocovariance function
γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_s μ_t
γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]

Autocorrelation function (ACF)
ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)
γ_{xy}(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]

Cross-correlation function (CCF)
ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s)γ_y(t, t))

Backshift operator
B^k(x_t) = x_{t−k}

Difference operator
∇^d = (1 − B)^d

White Noise
• w_t ∼ wn(0, σ²_w)
• Gaussian: w_t ~iid N(0, σ²_w)
• E[w_t] = 0, t ∈ T
• V[w_t] = σ²_w, t ∈ T
• γ_w(s, t) = 0 for s ≠ t, s, t ∈ T

Random Walk
• Drift δ
• x_t = δt + ∑_{j=1}^t w_j
• E[x_t] = δt

Symmetric Moving Average
m_t = ∑_{j=−k}^{k} a_j x_{t−j}, where a_j = a_{−j} ≥ 0 and ∑_{j=−k}^{k} a_j = 1

21.1 Stationary Time Series

Strictly stationary
P[x_{t₁} ≤ c₁, …, x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, …, x_{t_k+h} ≤ c_k] ∀k ∈ ℕ, t_k, c_k, h ∈ ℤ

Weakly stationary
• E[x_t²] < ∞ ∀t ∈ ℤ
• E[x_t] = m ∀t ∈ ℤ
• γ_x(s, t) = γ_x(s + r, t + r) ∀r, s, t ∈ ℤ

Autocovariance function
• γ(h) = E[(x_{t+h} − μ)(x_t − μ)] ∀h ∈ ℤ
• γ(0) = E[(x_t − μ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)

Autocorrelation function (ACF)
ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}] V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series
γ_{xy}(h) = E[(x_{t+h} − μ_x)(y_t − μ_y)]
ρ_{xy}(h) = γ_{xy}(h)/√(γ_x(0)γ_y(0))

Linear Process
x_t = μ + ∑_{j=−∞}^{∞} ψ_j w_{t−j}, where ∑_{j=−∞}^{∞} |ψ_j| < ∞
γ(h) = σ²_w ∑_{j=−∞}^{∞} ψ_{j+h} ψ_j
21.2 Estimation of Correlation

Sample mean
x̄ = (1/n) ∑_{t=1}^n x_t

Sample variance
V[x̄] = (1/n) ∑_{h=−n}^{n} (1 − |h|/n) γ_x(h)

Sample autocovariance function
γ̂(h) = (1/n) ∑_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function
ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
γ̂_{xy}(h) = (1/n) ∑_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function
ρ̂_{xy}(h) = γ̂_{xy}(h)/√(γ̂_x(0) γ̂_y(0))

Properties
• σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
• σ_{ρ̂_{xy}(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
x_t = μ_t + s_t + w_t
• μ_t = trend
• s_t = seasonal component
• w_t = random noise term

21.3.1 Detrending

Least Squares
1. Choose a trend model, e.g., μ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate μ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Residuals correspond to the noise w_t

Moving average
• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1): v_t = (1/(2k + 1)) ∑_{i=−k}^{k} x_{t−i}
• If (1/(2k + 1)) ∑_{i=−k}^{k} w_{t−i} ≈ 0, a linear trend function μ_t = β₀ + β₁t passes without distortion

Differencing
• μ_t = β₀ + β₁t ⟹ ∇x_t = β₁

21.4 ARIMA models

Autoregressive polynomial
φ(z) = 1 − φ₁z − ⋯ − φ_p z^p, z ∈ ℂ ∧ φ_p ≠ 0

Autoregressive operator
φ(B) = 1 − φ₁B − ⋯ − φ_p B^p

Autoregressive model of order p, AR(p)
x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t ⟺ φ(B)x_t = w_t

AR(1)
• x_t = φ^k(x_{t−k}) + ∑_{j=0}^{k−1} φ^j(w_{t−j}) → ∑_{j=0}^{∞} φ^j(w_{t−j}) as k → ∞ if |φ| < 1 (a linear process)
• E[x_t] = ∑_{j=0}^{∞} φ^j E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ²_w φ^h/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φρ(h − 1), h = 1, 2, …
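A short sketch tying 21.2 and the AR(1) results together (NumPy assumed; the helper name and parameters are illustrative): simulate x_t = φ x_{t−1} + w_t and check that the sample ACF is close to ρ(h) = φ^h.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample ACF of Section 21.2: gamma_hat(h) = (1/n) sum_t (x_{t+h}-xbar)(x_t-xbar)."""
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    gamma = np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
                      for h in range(max_lag + 1)])
    return gamma / gamma[0]

rng = np.random.default_rng(9)
phi, n = 0.7, 5000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):                    # AR(1): x_t = phi x_{t-1} + w_t
    x[t] = phi * x[t - 1] + w[t]

print(np.round(sample_acf(x, 5), 3))     # ≈ [1, 0.7, 0.49, 0.34, 0.24, 0.17]
print(np.round(phi ** np.arange(6), 3))  # theoretical rho(h) = phi^h
```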
Moving average polynomial
θ(z) = 1 + θ₁z + ⋯ + θ_q z^q, z ∈ ℂ ∧ θ_q ≠ 0

Moving average operator
θ(B) = 1 + θ₁B + ⋯ + θ_q B^q

MA(q) (moving average model of order q)
x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ x_t = θ(B)w_t
E[x_t] = ∑_{j=0}^q θ_j E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, x_t] = σ²_w ∑_{j=0}^{q−h} θ_j θ_{j+h} for 0 ≤ h ≤ q; 0 for h > q

MA(1)
x_t = w_t + θw_{t−1}
γ(h) = (1 + θ²)σ²_w for h = 0; θσ²_w for h = 1; 0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1; 0 for h > 1

ARMA(p, q)
x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q}
φ(B)x_t = θ(B)w_t

Partial autocorrelation function (PACF)
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x_1}
• φ_hh = corr(x_h − x_h^{h−1}, x_0 − x_0^{h−1}), h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

ARIMA(p, d, q)
∇^d x_t = (1 − B)^d x_t is ARMA(p, q)
φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)
x_t = x_{t−1} + w_t − λw_{t−1}
x_t = ∑_{j=1}^∞ (1 − λ)λ^{j−1} x_{t−j} + w_t when |λ| < 1
x̃_{n+1} = (1 − λ)x_n + λx̃_n

Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(B^s)φ(B)∇_s^D∇^d x_t = δ + Θ_Q(B^s)θ(B)w_t

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) ⟺ ∃{ψ_j} with ∑_{j=0}^∞ |ψ_j| < ∞ such that
x_t = ∑_{j=0}^∞ ψ_j w_{t−j} = ψ(B)w_t

ARMA(p, q) is invertible ⟺ ∃{π_j} with ∑_{j=0}^∞ |π_j| < ∞ such that
π(B)x_t = ∑_{j=0}^∞ π_j x_{t−j} = w_t

Properties
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle; ψ(z) = ∑_{j=0}^∞ ψ_j z^j = θ(z)/φ(z), |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle; π(z) = ∑_{j=0}^∞ π_j z^j = φ(z)/θ(z), |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models
       | AR(p)                  | MA(q)                  | ARMA(p, q)
ACF    | tails off              | cuts off after lag q   | tails off
PACF   | cuts off after lag p   | tails off              | tails off

21.5 Spectral Analysis

Periodic process
x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ, often normally distributed rv's

Periodic mixture
x_t = ∑_{k=1}^q (U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t))
• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σ²_k
• γ(h) = ∑_{k=1}^q σ²_k cos(2πω_k h)
• γ(0) = E[x_t²] = ∑_{k=1}^q σ²_k

Spectral representation of a periodic process
γ(h) = σ² cos(2πω₀h) = (σ²/2)e^{−2πiω₀h} + (σ²/2)e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function
F(ω) = 0 for ω < −ω₀; σ²/2 for −ω₀ ≤ ω < ω₀; σ² for ω ≥ ω₀
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)

Spectral density
f(ω) = ∑_{h=−∞}^{∞} γ(h)e^{−2πiωh}, −1/2 ≤ ω ≤ 1/2
• Needs ∑_{h=−∞}^{∞} |γ(h)| < ∞ ⟹ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω, h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ²_w
• ARMA(p, q), φ(B)x_t = θ(B)w_t:
   f_x(ω) = σ²_w |θ(e^{−2πiω})|²/|φ(e^{−2πiω})|², where φ(z) = 1 − ∑_{k=1}^p φ_k z^k and θ(z) = 1 + ∑_{k=1}^q θ_k z^k

Discrete Fourier Transform (DFT)
d(ω_j) = n^{−1/2} ∑_{t=1}^n x_t e^{−2πiω_j t}

Fourier/Fundamental frequencies
ω_j = j/n

Inverse DFT
x_t = n^{−1/2} ∑_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}

Periodogram
I(j/n) = |d(j/n)|²

Scaled Periodogram
P(j/n) = (4/n) I(j/n) = ((2/n) ∑_{t=1}^n x_t cos(2πtj/n))² + ((2/n) ∑_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Series

Finite
• ∑_{k=1}^n k = n(n + 1)/2
• ∑_{k=1}^n (2k − 1) = n²
• ∑_{k=1}^n k² = n(n + 1)(2n + 1)/6
• ∑_{k=1}^n k³ = (n(n + 1)/2)²
• ∑_{k=0}^n c^k = (c^{n+1} − 1)/(c − 1), c ≠ 1

Binomial
• ∑_{k=0}^n (n choose k) = 2^n
• ∑_{k=0}^n (r + k choose k) = (r + n + 1 choose n)
• ∑_{k=0}^n (k choose m) = (n + 1 choose m + 1)
• Vandermonde's Identity: ∑_{k=0}^r (m choose k)(n choose r − k) = (m + n choose r)
• Binomial Theorem: ∑_{k=0}^n (n choose k) a^{n−k} b^k = (a + b)^n
Infinite
• ∑_{k=0}^∞ p^k = 1/(1 − p) and ∑_{k=1}^∞ p^k = p/(1 − p), |p| < 1
• ∑_{k=0}^∞ kp^{k−1} = d/dp (∑_{k=0}^∞ p^k) = d/dp (1/(1 − p)) = 1/(1 − p)², |p| < 1
• ∑_{k=0}^∞ (r + k − 1 choose k) x^k = (1 − x)^{−r}, r ∈ ℕ⁺
• ∑_{k=0}^∞ (α choose k) p^k = (1 + p)^α, |p| < 1, α ∈ ℂ

22.2 Combinatorics

Sampling (k out of n)
                | w/o replacement                                  | w/ replacement
ordered         | n^(k) = ∏_{i=0}^{k−1} (n − i) = n!/(n − k)!      | n^k
unordered       | (n choose k) = n^(k)/k! = n!/(k!(n − k)!)        | (n − 1 + r choose r) = (n − 1 + r choose n − 1)

Stirling numbers, 2nd kind
{n brace k} = k {n − 1 brace k} + {n − 1 brace k − 1}, 1 ≤ k ≤ n
{n brace 0} = 1 if n = 0; 0 otherwise

Partitions
P_{n+k,k} = ∑_{i=1}^n P_{n,i}; P_{n,k} = 0 for k > n; P_{n,0} = 0 for n ≥ 1, P_{0,0} = 1

Balls and Urns: f : B → U, with D = distinguishable, ¬D = indistinguishable, |B| = n, |U| = m
               | f arbitrary             | f injective         | f surjective      | f bijective
B: D,  U: D    | m^n                     | m!/(m − n)! if m ≥ n, else 0 | m! {n brace m}    | n! if m = n, else 0
B: ¬D, U: D    | (n + m − 1 choose n)    | (m choose n)        | (n − 1 choose m − 1) | 1 if m = n, else 0
B: D,  U: ¬D   | ∑_{k=1}^m {n brace k}   | 1 if m ≥ n, else 0  | {n brace m}       | 1 if m = n, else 0
B: ¬D, U: ¬D   | ∑_{k=1}^m P_{n,k}       | 1 if m ≥ n, else 0  | P_{n,m}           | 1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [2].]