Probability and Statistics: Cookbook

Copyright © Matthias Vallentin, 2014
vallentin@icir.org
[Figure: PMF plots of the discrete distributions for several parameter settings: Uniform (discrete), Binomial, Geometric, and Poisson.]
¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
For each distribution we list the notation, cdf F_X(x), pdf f_X(x), mean E[X], variance V[X], and moment generating function M_X(s).

Uniform Unif(a, b)
  F_X(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f_X(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12    M_X(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(µ, σ²)
  F_X(x) = Φ(x) = ∫_{−∞}^x φ(t) dt
  f_X(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²    M_X(s) = exp(µs + σ²s²/2)

Log-Normal ln N(µ, σ²)
  F_X(x) = 1/2 + (1/2) erf[(ln x − µ)/√(2σ²)]
  f_X(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ+σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ+σ²}

Multivariate Normal MVN(µ, Σ)
  f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))
  E[X] = µ    V[X] = Σ    M_X(s) = exp(µᵀs + (1/2)sᵀΣs)

Student's t Student(ν)
  F_X(x) = I_x(ν/2, ν/2)
  f_X(x) = (Γ((ν+1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1)    V[X] = ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2

Chi-square χ²_k
  F_X(x) = γ(k/2, x/2)/Γ(k/2)
  f_X(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k    V[X] = 2k    M_X(s) = (1 − 2s)^{−k/2} for s < 1/2

F F(d₁, d₂)
  F_X(x) = I_{d₁x/(d₁x+d₂)}(d₁/2, d₂/2)
  f_X(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2)    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4))

Exponential Exp(β)
  F_X(x) = 1 − e^{−x/β}
  f_X(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²    M_X(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β)
  F_X(x) = γ(α, x/β)/Γ(α)
  f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²    M_X(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β)
  F_X(x) = Γ(α, β/x)/Γ(α)
  f_X(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1)    V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f_X(x) = (Γ(Σ_{i=1}^k αᵢ)/∏_{i=1}^k Γ(αᵢ)) ∏_{i=1}^k xᵢ^{αᵢ−1}
  E[Xᵢ] = αᵢ/Σ_{i=1}^k αᵢ    V[Xᵢ] = E[Xᵢ](1 − E[Xᵢ])/(Σ_{i=1}^k αᵢ + 1)

Beta Beta(α, β)
  F_X(x) = I_x(α, β)
  f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M_X(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F_X(x) = 1 − e^{−(x/λ)^k}
  f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²
  M_X(s) = Σ_{n=0}^∞ (sⁿλⁿ/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F_X(x) = 1 − (x_m/x)^α for x ≥ x_m
  f_X(x) = α x_m^α/x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1)    V[X] = x_m²α/((α − 1)²(α − 2)) (α > 2)
  M_X(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)
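The moment columns above are easy to sanity-check numerically; a minimal sketch using scipy.stats (assuming, as is the case, that scipy's scale arguments match the β and λ parameterizations used here):

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as Gamma

# Gamma(alpha, beta), scale parameterization: E[X] = alpha*beta, V[X] = alpha*beta^2
alpha, beta = 3.0, 2.0
g = stats.gamma(a=alpha, scale=beta)
assert np.isclose(g.mean(), alpha * beta)
assert np.isclose(g.var(), alpha * beta**2)

# Weibull(lambda, k): E[X] = lambda * Gamma(1 + 1/k)
lam, k = 1.0, 1.5
w = stats.weibull_min(c=k, scale=lam)
assert np.isclose(w.mean(), lam * Gamma(1 + 1 / k))
```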
[Figure: PDF plots of the continuous distributions for several parameter settings: Uniform (continuous), Normal, Log-Normal, Student's t, χ², F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto.]
2 Probability Theory

Definitions
• Sample space Ω

Law of Total Probability
P[B] = Σ_{i=1}^n P[B | Aᵢ] P[Aᵢ], where Ω = ⨆_{i=1}^n Aᵢ

Independence
A ⊥⊥ B ⇐⇒ P[A ∩ B] = P[A] P[B]

Conditional Probability
P[A | B] = P[A ∩ B]/P[B]  (P[B] > 0)

Conditional density
f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence of random variables
1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
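As a quick numeric illustration of the total-probability and conditional-probability identities (the partition and all probabilities below are made-up example values):

```python
# Partition A1, A2, A3 of the sample space, with P[Ai] and P[B|Ai] (hypothetical)
P_A = [0.5, 0.3, 0.2]
P_B_given_A = [0.1, 0.4, 0.8]

# Law of total probability: P[B] = sum_i P[B|Ai] * P[Ai]
P_B = sum(pb * pa for pb, pa in zip(P_B_given_A, P_A))

# Conditional probability (Bayes): P[A1|B] = P[B|A1] P[A1] / P[B]
P_A1_given_B = P_B_given_A[0] * P_A[0] / P_B
print(P_B, P_A1_given_B)  # 0.33, ~0.152
```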
3.1 Transformations

Transformation function
Z = ϕ(X)

Discrete
f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ⁻¹(z)] = Σ_{x∈ϕ⁻¹(z)} f(x)

Continuous
F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx, with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ strictly monotone
f_Z(z) = f_X(ϕ⁻¹(z)) |dϕ⁻¹(z)/dz| = f_X(x) |dx/dz| = f_X(x)/|J|

The Rule of the Lazy Statistician
E[Z] = ∫ ϕ(x) dF_X(x)
E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx, which for X, Y ≥ 0 equals ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^∞ |x| f_{X,Y}(x, xz) dx, which for X ⊥⊥ Y equals ∫_{−∞}^∞ |x| f_X(x) f_Y(xz) dx
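The monotone-transformation formula and the lazy-statistician rule are easy to verify by simulation; a minimal sketch with X ∼ Exp(1) and Z = X², for which f_Z(z) = e^{−√z}/(2√z):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
z = x**2

# Lazy statistician: E[Z] = integral of phi(x) dF_X(x); for Exp(1), E[X^2] = 2
print(z.mean())  # ~2.0

# Monotone transformation: f_Z(z) = f_X(sqrt(z)) * |d sqrt(z)/dz|
z0 = 1.5
hist, edges = np.histogram(z, bins=200, range=(0, 10), density=True)
print(np.exp(-np.sqrt(z0)) / (2 * np.sqrt(z0)),  # closed form
      hist[np.searchsorted(edges, z0) - 1])      # empirical density near z0
```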
4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) if X is discrete, ∫ x f_X(x) dx if X is continuous
• P[X = c] = 1 ⟹ E[X] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy f_{X,Y}(x, y) dx dy
• E[ϕ(X)] ≠ ϕ(E[X]) in general (cf. Jensen's inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y], and P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^∞ P[X ≥ x] for integer-valued X ≥ 0

Sample mean
X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫_{−∞}^∞ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_{i=1}^n Xᵢ] = Σ_{i=1}^n V[Xᵢ] + 2 Σ_{i≠j} Cov[Xᵢ, Xⱼ]
• V[Σ_{i=1}^n Xᵢ] = Σ_{i=1}^n V[Xᵢ] if Xᵢ ⊥⊥ Xⱼ

Standard deviation
sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_{i=1}^n Xᵢ, Σ_{j=1}^m Yⱼ] = Σ_{i=1}^n Σ_{j=1}^m Cov[Xᵢ, Yⱼ]

Correlation
ρ[X, Y] = Cov[X, Y]/√(V[X] V[Y])

Independence
X ⊥⊥ Y ⟹ ρ[X, Y] = 0 ⇐⇒ Cov[X, Y] = 0 ⇐⇒ E[XY] = E[X] E[Y]

Sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (Xᵢ − X̄ₙ)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]

6 Inequalities

Cauchy-Schwarz
E[XY]² ≤ E[X²] E[Y²]

Markov
P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t

Chebyshev
P[|X − E[X]| ≥ t] ≤ V[X]/t²

Chernoff
P[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ for δ > −1

Jensen
E[ϕ(X)] ≥ ϕ(E[X]) for convex ϕ

7 Distribution Relationships

Binomial
• Xᵢ ∼ Bern(p) ⟹ Σ_{i=1}^n Xᵢ ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p), X ⊥⊥ Y ⟹ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)  (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
• Xᵢ ∼ NBin(rᵢ, p) ⟹ Σ Xᵢ ∼ NBin(Σ rᵢ, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• Xᵢ ∼ Po(λᵢ) ∧ Xᵢ ⊥⊥ Xⱼ ⟹ Σ_{i=1}^n Xᵢ ∼ Po(Σ_{i=1}^n λᵢ)
• Xᵢ ∼ Po(λᵢ) ∧ Xᵢ ⊥⊥ Xⱼ ⟹ Xᵢ | Σ_{j=1}^n Xⱼ ∼ Bin(Σ_{j=1}^n Xⱼ, λᵢ/Σ_{j=1}^n λⱼ)

Exponential
• Xᵢ ∼ Exp(β) ∧ Xᵢ ⊥⊥ Xⱼ ⟹ Σ_{i=1}^n Xᵢ ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ²) ⟹ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²) ∧ Z = aX + b ⟹ Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²), Y ∼ N(µ₂, σ₂²), X ⊥⊥ Y ⟹ X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• Xᵢ ∼ N(µᵢ, σᵢ²) independent ⟹ Σᵢ Xᵢ ∼ N(Σᵢ µᵢ, Σᵢ σᵢ²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma
• X ∼ Gamma(α, β) ⇐⇒ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^α Exp(β) for integer α
• Xᵢ ∼ Gamma(αᵢ, β) ∧ Xᵢ ⊥⊥ Xⱼ ⟹ Σᵢ Xᵢ ∼ Gamma(Σᵢ αᵢ, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)
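A quick numeric check of the Binomial-to-Poisson limit above (parameters chosen arbitrarily, with n large and p small):

```python
import numpy as np
from scipy import stats

n, p = 10_000, 3e-4  # n large, p small => Bin(n, p) ~ Po(np)
k = np.arange(10)
binom_pmf = stats.binom.pmf(k, n, p)
pois_pmf = stats.poisson.pmf(k, n * p)
print(np.max(np.abs(binom_pmf - pois_pmf)))  # tiny discrepancy
```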
8 Probability and Moment Generating Functions

• G_X(t) = E[t^X], |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)ⁱ/i!] = Σ_{i=0}^∞ (E[Xⁱ]/i!) tⁱ
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X =ᴰ Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let X, Z ∼ N(0, 1) with X ⊥⊥ Z, and set Y = ρX + √(1 − ρ²) Z.

Joint density
f(x, y) = (1/(2π√(1 − ρ²))) exp{−(x² + y² − 2ρxy)/(2(1 − ρ²))}

Conditionals
(Y | X = x) ∼ N(ρx, 1 − ρ²) and (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
X ⊥⊥ Y ⇐⇒ ρ = 0

9.2 Bivariate Normal
f(x, y) = (1/(2πσ_x σ_y √(1 − ρ²))) exp{−z/(2(1 − ρ²))}
where z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance
E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal
Covariance matrix Σ (precision matrix Σ⁻¹):

Σ = [ V[X₁]         ⋯  Cov[X₁, Xₖ]
      ⋮             ⋱  ⋮
      Cov[Xₖ, X₁]   ⋯  V[Xₖ] ]

If X ∼ N(µ, Σ),
f_X(x) = (2π)^{−k/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)ᵀΣ⁻¹(x − µ)}

Properties
• Z ∼ N(0, 1) ∧ X = µ + Σ^{1/2}Z ⟹ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⟹ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ), a a vector of length k ⟹ aᵀX ∼ N(aᵀµ, aᵀΣa)

10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let Fₙ denote the cdf of Xₙ and let F denote the cdf of X.

Types of convergence
1. In distribution (weakly, in law): Xₙ →ᴰ X
   lim_{n→∞} Fₙ(t) = F(t) at all t where F is continuous
2. In probability: Xₙ →ᴾ X
   (∀ε > 0) lim_{n→∞} P[|Xₙ − X| > ε] = 0
3. Almost surely (strongly): Xₙ →ᵃˢ X
   P[lim_{n→∞} Xₙ = X] = P[ω ∈ Ω : lim_{n→∞} Xₙ(ω) = X(ω)] = 1
4. In quadratic mean (L²): Xₙ →^{qm} X
   lim_{n→∞} E[(Xₙ − X)²] = 0
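The first property in §9.3 (X = µ + Σ^{1/2}Z) is also the standard sampling recipe for the multivariate normal; a minimal sketch using a Cholesky factor in place of Σ^{1/2}, which induces the same distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)          # L @ L.T == Sigma
Z = rng.standard_normal((100_000, 2))  # iid N(0, 1) rows
X = mu + Z @ L.T                       # each row ~ N(mu, Sigma)

print(X.mean(axis=0))  # ~ mu
print(np.cov(X.T))     # ~ Sigma
```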
Multiparameter delta method (standard error)
ŝe(τ̂) = √(∇ϕ̂ᵀ Ĵₙ ∇ϕ̂), where Ĵₙ = Jₙ(θ̂) and ∇ϕ̂ = ∇ϕ evaluated at θ = θ̂

12.4 Parametric Bootstrap
Sample from f(x; θ̂ₙ) instead of from F̂ₙ, where θ̂ₙ could be the mle or the method of moments estimator.

13 Hypothesis Testing

p-value       evidence
< 0.01        very strong evidence against H₀
0.01–0.05     strong evidence against H₀
0.05–0.1      weak evidence against H₀
> 0.1         little or no evidence against H₀

Wald test
• Two-sided test
• Reject H₀ when |W| > z_{α/2}, where W = (θ̂ − θ₀)/ŝe
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)
• T(X) = sup_{θ∈Θ} Lₙ(θ)/sup_{θ∈Θ₀} Lₙ(θ) = Lₙ(θ̂ₙ)/Lₙ(θ̂ₙ,₀)
• λ(X) = 2 log T(X) →ᴰ χ²_{r−q}, where χ²_k ∼ Σ_{i=1}^k Zᵢ² with Z₁, …, Zₖ iid ∼ N(0, 1)
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT
• mle: p̂ₙ = (X₁/n, …, Xₖ/n)
• T(X) = Lₙ(p̂ₙ)/Lₙ(p₀) = ∏_{j=1}^k (p̂ⱼ/p₀ⱼ)^{Xⱼ}
• λ(X) = 2 Σ_{j=1}^k Xⱼ log(p̂ⱼ/p₀ⱼ) →ᴰ χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test
• T = Σ_{j=1}^k (Xⱼ − E[Xⱼ])²/E[Xⱼ], where E[Xⱼ] = np₀ⱼ under H₀
• T →ᴰ χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges in distribution to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing
• I rows, J columns; X a multinomial sample of size n over the I × J cells
• mles unconstrained: p̂ᵢⱼ = Xᵢⱼ/n
• mles under H₀: p̂₀ᵢⱼ = p̂ᵢ. p̂.ⱼ = (Xᵢ./n)(X.ⱼ/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J Xᵢⱼ log(nXᵢⱼ/(Xᵢ. X.ⱼ))
• Pearson: T = Σ_{i=1}^I Σ_{j=1}^J (Xᵢⱼ − E[Xᵢⱼ])²/E[Xᵢⱼ]
• LRT and Pearson →ᴰ χ²_ν, where ν = (I − 1)(J − 1)

14 Bayesian Inference

Definitions
• Xⁿ = (X₁, …, Xₙ) and xⁿ = (x₁, …, xₙ)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data; in particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(xᵢ | θ) = Lₙ(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant cₙ = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄ₙ = ∫ θ f(θ | xⁿ) dθ = ∫ θLₙ(θ)f(θ) dθ / ∫ Lₙ(θ)f(θ) dθ

Bayes' Theorem
f(θ | xⁿ) = f(xⁿ | θ)f(θ)/f(xⁿ) = f(xⁿ | θ)f(θ)/∫ f(xⁿ | θ)f(θ) dθ ∝ Lₙ(θ)f(θ)

14.1 Credible Intervals

Posterior interval
P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

Equal-tail credible interval
∫_{−∞}^a f(θ | xⁿ) dθ = ∫_b^∞ f(θ | xⁿ) dθ = α/2

Highest posterior density (HPD) region Rₙ
1. P[θ ∈ Rₙ] = 1 − α
2. Rₙ = {θ : f(θ | xⁿ) > k} for some k
Rₙ is unimodal ⟹ Rₙ is an interval

14.2 Function of parameters
Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.

Posterior CDF for τ
H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ

Posterior density
h(τ | xⁿ) = H′(τ | xⁿ)

Bayesian delta method
τ | Xⁿ ≈ N(ϕ(θ̂), (ŝe ϕ′(θ̂))²)
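A minimal sketch of an equal-tail credible interval for a Bernoulli proportion under a Beta(1, 1) prior, where the posterior is Beta(s + 1, n − s + 1) in closed form (the data vector is made up for the example):

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1])
n, s = len(x), x.sum()

posterior = stats.beta(s + 1, n - s + 1)            # conjugate update of Beta(1, 1)
a, b = posterior.ppf(0.025), posterior.ppf(0.975)   # equal-tail 95% interval
print(posterior.mean(), (a, b))
```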
14.3 Priors

Conjugate priors, continuous likelihood (surviving fragment of the table):

Likelihood     Conjugate prior    Posterior hyperparameters
Unif(0, θ)     Pareto(x_m, k)     max{x₍ₙ₎, x_m}, k + n

16.2 Rejection Sampling

Algorithm
1. Draw θ^{cand} ∼ f(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ^{cand} if u ≤ Lₙ(θ^{cand})/Lₙ(θ̂ₙ)

16.3 Importance Sampling
Sample from an importance function g rather than the target density h.
Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior: θ₁, …, θ_B iid ∼ f(θ)
2. wᵢ = Lₙ(θᵢ)/Σ_{j=1}^B Lₙ(θⱼ) for i = 1, …, B
3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θᵢ)wᵢ

17 Decision Theory

Definitions
• Unknown quantity affecting our decision: θ ∈ Θ

(Frequentist) risk
R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk
r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]

17.2 Admissibility
• θ̂′ dominates θ̂ if
  ∀θ: R(θ, θ̂′) ≤ R(θ, θ̂) and ∃θ: R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.
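A minimal sketch estimating the frequentist risk R(θ, θ̂) under squared error loss for two estimators of a Bernoulli proportion: the mle X̄ and the posterior mean under a Beta(√n/2, √n/2) prior, whose risk is constant in θ (all settings hypothetical). Neither estimator dominates the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 20, 200_000
a = np.sqrt(n) / 2  # Beta(a, a) prior => Bayes estimator with constant risk

for theta in [0.1, 0.3, 0.5]:
    s = rng.binomial(n, theta, size=B)
    mle = s / n
    bayes = (s + a) / (n + 2 * a)  # posterior mean under Beta(a, a)
    # R(theta, theta_hat) = E[(theta_hat - theta)^2], estimated by Monte Carlo
    print(theta, np.mean((mle - theta)**2), np.mean((bayes - theta)**2))
```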
17.3 Bayes Rule

Bayes rule (or Bayes estimator)
• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) minimizes r(θ̃ | x) for every x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk
R̄(θ̂) = sup_θ R(θ, θ̂)    R̄(a) = sup_θ R(θ, a)

Minimax rule
sup_θ R(θ, θ̂) = inf_{θ̃} sup_θ R(θ, θ̃) = inf_{θ̃} R̄(θ̃)

θ̂ = Bayes rule ∧ ∃c: R(θ, θ̂) = c ⟹ θ̂ is minimax

Least favorable prior
θ̂_f = Bayes rule ∧ R(θ, θ̂_f) ≤ r(f, θ̂_f) ∀θ ⟹ θ̂_f is minimax

18 Linear Regression

Definitions
• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
Yᵢ = β₀ + β₁Xᵢ + εᵢ,  E[εᵢ | Xᵢ] = 0,  V[εᵢ | Xᵢ] = σ²

Fitted line
r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values
Ŷᵢ = r̂(Xᵢ)

Residuals
ε̂ᵢ = Yᵢ − Ŷᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ)

Residual sums of squares (rss)
rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂ᵢ²

Least square estimates
β̂ = (β̂₀, β̂₁)ᵀ minimizes rss:
β̂₀ = Ȳₙ − β̂₁X̄ₙ
β̂₁ = Σ_{i=1}^n (Xᵢ − X̄ₙ)(Yᵢ − Ȳₙ)/Σ_{i=1}^n (Xᵢ − X̄ₙ)² = (Σᵢ XᵢYᵢ − nX̄Ȳ)/(Σᵢ Xᵢ² − nX̄²)

E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
V[β̂ | Xⁿ] = (σ²/(n s_X²)) [ n⁻¹Σᵢ Xᵢ²   −X̄ₙ ; −X̄ₙ   1 ]
ŝe(β̂₀) = (σ̂/(s_X √n)) √(Σᵢ Xᵢ²/n)
ŝe(β̂₁) = σ̂/(s_X √n)
where s_X² = n⁻¹ Σ_{i=1}^n (Xᵢ − X̄ₙ)² and σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂ᵢ² (unbiased estimate).

Further properties:
• Consistency: β̂₀ →ᴾ β₀ and β̂₁ →ᴾ β₁
• Asymptotic normality:
  (β̂₀ − β₀)/ŝe(β̂₀) →ᴰ N(0, 1) and (β̂₁ − β₁)/ŝe(β̂₁) →ᴰ N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁:
  β̂₀ ± z_{α/2} ŝe(β̂₀) and β̂₁ ± z_{α/2} ŝe(β̂₁)
• Wald test for H₀: β₁ = 0 vs. H₁: β₁ ≠ 0: reject H₀ if |W| > z_{α/2}, where W = β̂₁/ŝe(β̂₁)

R²
R² = Σᵢ(Ŷᵢ − Ȳ)²/Σᵢ(Yᵢ − Ȳ)² = 1 − Σᵢ ε̂ᵢ²/Σᵢ(Yᵢ − Ȳ)² = 1 − rss/tss
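The closed-form least squares estimates and standard errors translate directly into code; a minimal sketch on synthetic data (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 200, 1.0, 2.5, 0.5
X = rng.uniform(0, 10, n)
Y = beta0 + beta1 * X + rng.normal(0, sigma, n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2_hat = np.sum(resid**2) / (n - 2)          # unbiased estimate of sigma^2
sX = np.sqrt(np.mean((X - X.mean())**2))
se_b1 = np.sqrt(sigma2_hat) / (sX * np.sqrt(n))  # se(beta1_hat)
print(b0, b1, b1 / se_b1)                        # Wald statistic for H0: beta1 = 0
```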
Likelihood
L = ∏_{i=1}^n f(Xᵢ, Yᵢ) = ∏_{i=1}^n f_X(Xᵢ) × ∏_{i=1}^n f_{Y|X}(Yᵢ | Xᵢ) = L₁ × L₂
L₁ = ∏_{i=1}^n f_X(Xᵢ)
L₂ = ∏_{i=1}^n f_{Y|X}(Yᵢ | Xᵢ) ∝ σ^{−n} exp{−(1/(2σ²)) Σᵢ (Yᵢ − (β₀ + β₁Xᵢ))²}

Under the assumption of Normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE.

18.2 Prediction
Observe the covariate value X = x∗ and predict the outcome Y∗:
Ŷ∗ = β̂₀ + β̂₁x∗
V[Ŷ∗] = V[β̂₀] + x∗² V[β̂₁] + 2x∗ Cov[β̂₀, β̂₁]

Prediction interval
ξ̂ₙ² = σ̂² (Σᵢ(Xᵢ − x∗)²/(n Σᵢ(Xᵢ − X̄)²) + 1)
An approximate 1 − α prediction interval for Y∗ is Ŷ∗ ± z_{α/2} ξ̂ₙ.

18.3 Multiple Regression
If the (k × k) matrix XᵀX is invertible,
β̂ = (XᵀX)⁻¹XᵀY
V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function
r̂(x) = Σ_{j=1}^k β̂ⱼxⱼ

Unbiased estimate for σ²
σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂ᵢ², with ε̂ = Xβ̂ − Y
mle: µ̂ = X̄,  σ̂²_mle = (1/n) Σ_{i=1}^n ε̂ᵢ² = ((n − k)/n) σ̂²

1 − α confidence interval
β̂ⱼ ± z_{α/2} ŝe(β̂ⱼ)

18.4 Model Selection
Consider predicting a new observation Y∗ for covariates X∗ and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues: the training error is a downward-biased estimate of the prediction risk:
E[R̂_tr(S)] < R(S)
bias(R̂_tr(S)) = E[R̂_tr(S) − R(S)] = −2 Σ_{i=1}^n Cov[Ŷᵢ, Yᵢ]

Adjusted R²
R̄²(S) = 1 − ((n − 1)/(n − k)) rss/tss

Mallow's Cp statistic
R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
AIC(S) = ℓₙ(β̂_S, σ̂²_S) − k

Bayesian Information Criterion (BIC)
BIC(S) = ℓₙ(β̂_S, σ̂²_S) − (k/2) log n

Validation and training
R̂_V(S) = Σ_{i=1}^m (Ŷᵢ∗(S) − Yᵢ∗)², with m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation
R̂_CV(S) = Σ_{i=1}^n (Yᵢ − Ŷ₍ᵢ₎)² = Σ_{i=1}^n ((Yᵢ − Ŷᵢ(S))/(1 − Uᵢᵢ(S)))²
where U(S) = X_S(X_Sᵀ X_S)⁻¹X_Sᵀ (the "hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Frequentist risk
R(f, f̂ₙ) = E[L(f, f̂ₙ)] = ∫ b²(x) dx + ∫ v(x) dx
where b(x) = E[f̂ₙ(x)] − f(x) and v(x) = V[f̂ₙ(x)]

19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin Bⱼ has νⱼ observations
• Define p̂ⱼ = νⱼ/n and pⱼ = ∫_{Bⱼ} f(u) du

Histogram estimator
f̂ₙ(x) = Σ_{j=1}^m (p̂ⱼ/h) I(x ∈ Bⱼ)
E[f̂ₙ(x)] = pⱼ/h
V[f̂ₙ(x)] = pⱼ(1 − pⱼ)/(nh²)
R(f̂ₙ, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
h∗ = n^{−1/3} (6/∫ (f′(u))² du)^{1/3}
R∗(f̂ₙ, f) ≈ C/n^{2/3}, with C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]
Ĵ_CV(h) = ∫ f̂ₙ²(x) dx − (2/n) Σ_{i=1}^n f̂₍₋ᵢ₎(Xᵢ) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂ⱼ²
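A minimal numpy sketch of the histogram estimator's cross-validation score Ĵ_CV(h), assuming the data live on [0, 1] so that h = 1/m:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2, 5, size=500)  # hypothetical data on [0, 1]
n = len(x)

def J_cv(m):
    """Cross-validation estimate of E[J(h)] for m bins of width h = 1/m."""
    h = 1.0 / m
    counts, _ = np.histogram(x, bins=m, range=(0, 1))
    p_hat = counts / n
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat**2)

best_m = min(range(1, 101), key=J_cv)
print(best_m)  # CV-chosen number of bins
```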
19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ²_K > 0

KDE
f̂ₙ(x) = (1/n) Σ_{i=1}^n (1/h) K((x − Xᵢ)/h)
R(f, f̂ₙ) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
h∗ = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5}, with c₁ = σ²_K, c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx
R∗(f, f̂ₙ) = c₄/n^{4/5}, with c₄ = (5/4)(σ²_K)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}; the kernel-dependent factor of c₄ is denoted C(K)

19.2 Non-parametric Regression

k-nearest Neighbor Estimator
r̂(x) = (1/k) Σ_{i: xᵢ∈Nₖ(x)} Yᵢ, where Nₖ(x) = {k values of x₁, …, xₙ closest to x}

Nadaraya-Watson Kernel Estimator
r̂(x) = Σ_{i=1}^n wᵢ(x)Yᵢ
wᵢ(x) = K((x − xᵢ)/h)/Σ_{j=1}^n K((x − xⱼ)/h) ∈ [0, 1]
R(r̂ₙ, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ²/(nh)) ∫ K²(x) dx ∫ dx/f(x)
h∗ ≈ c₁/n^{1/5}
R∗(r̂ₙ, r) ≈ c₂/n^{4/5}
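A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (synthetic data; the bandwidth h is picked by hand rather than by risk minimization):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

def nw(x0, h=0.3):
    """Nadaraya-Watson estimate r_hat(x0) = sum_i w_i(x0) y_i."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # Gaussian kernel K((x0 - xi)/h)
    w = k / k.sum()                          # weights in [0, 1], summing to 1
    return np.dot(w, y)

grid = np.linspace(0, 2 * np.pi, 5)
print([round(nw(g), 2) for g in grid])  # roughly sin(grid)
```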
20 Stochastic Processes

Markov chains
• Chapman-Kolmogorov: P_{m+n} = P_m P_n
• n-step transition matrix: Pₙ = P × ··· × P = Pⁿ
• Marginal probability: µₙ = µ₀Pⁿ, where µ₀ is the initial distribution

Poisson process
• Waiting times between events: Sₜ ∼ Exp(1/λ)
21 Time Series

Sample autocorrelation function
ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
γ̂ₓᵧ(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(yₜ − ȳ)

Sample cross-correlation function
ρ̂ₓᵧ(h) = γ̂ₓᵧ(h)/√(γ̂ₓ(0) γ̂ᵧ(0))

Properties
• σ_{ρ̂ₓ}(h) = 1/√n if xₜ is white noise
• σ_{ρ̂ₓᵧ}(h) = 1/√n if xₜ or yₜ is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
xₜ = µₜ + sₜ + wₜ
• µₜ = trend
• sₜ = seasonal component
• wₜ = random noise term

Moving average smoothing
• If (1/(2k+1)) Σ_{j=−k}^k w_{t−j} ≈ 0, a linear trend function µₜ = β₀ + β₁t passes without distortion

Differencing
• µₜ = β₀ + β₁t ⟹ ∇xₜ = β₁

21.4 ARIMA models

Autoregressive polynomial
φ(z) = 1 − φ₁z − ··· − φₚzᵖ, z ∈ ℂ ∧ φₚ ≠ 0

Autoregressive operator
φ(B) = 1 − φ₁B − ··· − φₚBᵖ

Autoregressive model order p, AR(p)
xₜ = φ₁x_{t−1} + ··· + φₚx_{t−p} + wₜ ⇐⇒ φ(B)xₜ = wₜ

AR(1)
• xₜ = φᵏx_{t−k} + Σ_{j=0}^{k−1} φʲw_{t−j} → Σ_{j=0}^∞ φʲw_{t−j} as k → ∞ if |φ| < 1 (linear process)
• E[xₜ] = Σ_{j=0}^∞ φʲ E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, xₜ] = σ_w² φʰ/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φʰ
• ρ(h) = φρ(h − 1), h = 1, 2, …
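A minimal sketch checking the AR(1) result ρ(h) = φʰ against the sample autocorrelation function defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 10_000, 0.7
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):            # AR(1): x_t = phi * x_{t-1} + w_t
    x[t] = phi * x[t - 1] + w[t]

def acf(x, h):
    xbar = x.mean()
    gamma_h = np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)
    gamma_0 = np.sum((x - xbar) ** 2) / len(x)
    return gamma_h / gamma_0

print([round(acf(x, h), 3) for h in range(4)])  # ~ [1, 0.7, 0.49, 0.343]
```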
Moving average polynomial
θ(z) = 1 + θ₁z + ··· + θ_qz^q, z ∈ ℂ ∧ θ_q ≠ 0

Moving average operator
θ(B) = 1 + θ₁B + ··· + θ_qB^q

MA(q) (moving average model order q)
xₜ = wₜ + θ₁w_{t−1} + ··· + θ_qw_{t−q} ⇐⇒ xₜ = θ(B)wₜ
E[xₜ] = Σ_{j=0}^q θⱼ E[w_{t−j}] = 0
γ(h) = Cov[x_{t+h}, xₜ] = σ_w² Σ_{j=0}^{q−h} θⱼθ_{j+h} for 0 ≤ h ≤ q, and 0 for h > q

MA(1)
xₜ = wₜ + θw_{t−1}
γ(h) = (1 + θ²)σ_w² for h = 0;  θσ_w² for h = 1;  0 for h > 1
ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)
xₜ = φ₁x_{t−1} + ··· + φₚx_{t−p} + wₜ + θ₁w_{t−1} + ··· + θ_qw_{t−q} ⇐⇒ φ(B)xₜ = θ(B)wₜ

Partial autocorrelation function (PACF)
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_{hh} = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}) for h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)

ARIMA(p, d, q)
∇ᵈxₜ = (1 − B)ᵈxₜ is ARMA(p, q):
φ(B)(1 − B)ᵈxₜ = θ(B)wₜ

Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)ₛ
• Φ_P(Bˢ)φ(B)∇ₛᴰ∇ᵈxₜ = δ + Θ_Q(Bˢ)θ(B)wₜ

21.4.1 Causality and Invertibility
ARMA(p, q) is causal (future-independent) ⇐⇒ ∃{ψⱼ}: Σ_{j=0}^∞ |ψⱼ| < ∞ such that
xₜ = Σ_{j=0}^∞ ψⱼw_{t−j} = ψ(B)wₜ

ARMA(p, q) is invertible ⇐⇒ ∃{πⱼ}: Σ_{j=0}^∞ |πⱼ| < ∞ such that
π(B)xₜ = Σ_{j=0}^∞ πⱼx_{t−j} = wₜ

Properties
• ARMA(p, q) causal ⇐⇒ roots of φ(z) lie outside the unit circle
  ψ(z) = Σ_{j=0}^∞ ψⱼzʲ = θ(z)/φ(z), |z| ≤ 1
• ARMA(p, q) invertible ⇐⇒ roots of θ(z) lie outside the unit circle
  π(z) = Σ_{j=0}^∞ πⱼzʲ = φ(z)/θ(z), |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models

        AR(p)                  MA(q)                  ARMA(p, q)
ACF     tails off              cuts off after lag q   tails off
PACF    cuts off after lag p   tails off              tails off
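A quick simulation check of the MA(1) autocorrelation ρ(1) = θ/(1 + θ²):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.6, 100_000
w = rng.normal(size=n + 1)
x = w[1:] + theta * w[:-1]           # MA(1): x_t = w_t + theta * w_{t-1}

rho1 = np.corrcoef(x[1:], x[:-1])[0, 1]
print(rho1, theta / (1 + theta**2))  # both ~0.441
```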
Exponentially Weighted Moving Average (EWMA)
xₜ = x_{t−1} + wₜ − λw_{t−1}
xₜ = Σ_{j=1}^∞ (1 − λ)λ^{j−1}x_{t−j} + wₜ when |λ| < 1
x̃_{n+1} = (1 − λ)xₙ + λx̃ₙ

21.5 Spectral Analysis

Periodic process
xₜ = A cos(2πωt + φ) = U₁cos(2πωt) + U₂sin(2πωt)
• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = −A sin φ, often normally distributed rv's

Periodic mixture
xₜ = Σ_{k=1}^q (U_{k1}cos(2πωₖt) + U_{k2}sin(2πωₖt))
• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σₖ²
• γ(h) = Σ_{k=1}^q σₖ² cos(2πωₖh)
• γ(0) = E[xₜ²] = Σ_{k=1}^q σₖ²

Discrete Fourier Transform (DFT)
d(ωⱼ) = n^{−1/2} Σ_{t=1}^n xₜe^{−2πiωⱼt}

Fourier/Fundamental frequencies
ωⱼ = j/n

Inverse DFT
xₜ = n^{−1/2} Σ_{j=0}^{n−1} d(ωⱼ)e^{2πiωⱼt}
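Up to the n^{−1/2} normalization (and numpy's t = 0, …, n−1 indexing, versus t = 1, …, n above), the DFT here is numpy's FFT; a minimal round-trip check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)
n = x.size

# d(omega_j) = n^{-1/2} sum_t x_t exp(-2*pi*i*j*t/n); numpy sums over t = 0..n-1
d = np.fft.fft(x) / np.sqrt(n)

# Inverse DFT recovers the series
x_rec = np.fft.ifft(d) * np.sqrt(n)
print(np.allclose(x, x_rec.real))  # True
```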
22.4 Combinatorics

Sampling k out of n:

k out of n    w/o replacement                          w/ replacement
ordered       n(n − 1)⋯(n − k + 1) = n!/(n − k)!       nᵏ
unordered     C(n, k) = n!/(k!(n − k)!)                C(n − 1 + k, k) = C(n − 1 + k, n − 1)
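These counts map directly onto Python's math module; a minimal check for n = 5, k = 3:

```python
import math

n, k = 5, 3
ordered_without = math.perm(n, k)         # n!/(n-k)! = 60
ordered_with = n**k                       # n^k = 125
unordered_without = math.comb(n, k)       # n!/(k!(n-k)!) = 10
unordered_with = math.comb(n - 1 + k, k)  # multiset coefficient = 35
print(ordered_without, ordered_with, unordered_without, unordered_with)
```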
References

[2] L. M. Leemis and J. T. McQueston. Univariate distribution relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].]