
Probability and Statistics Cookbook

Version 0.2.1
14th July, 2016
http://statistics.zone/

Copyright (c) Matthias Vallentin, 2016

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
References
This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and on in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.

1 Distribution Overview

1.1 Discrete Distributions

Note: We use \gamma(s, x) and \Gamma(x) to refer to the Gamma functions (see 22.1), and B(x, y) and I_x to refer to the Beta functions (see 22.2).

Uniform Unif{a, ..., b}
  F_X(x) = 0 (x < a);  (\lfloor x \rfloor - a + 1)/(b - a + 1) (a \le x \le b);  1 (x > b)
  f_X(x) = I(a \le x \le b) / (b - a + 1)
  E[X] = (a + b)/2,  V[X] = ((b - a + 1)^2 - 1)/12
  M_X(s) = \frac{e^{as} - e^{(b+1)s}}{s(b - a)}

Bernoulli Bern(p)
  f_X(x) = p^x (1 - p)^{1 - x}
  E[X] = p,  V[X] = p(1 - p)
  M_X(s) = 1 - p + p e^s

Binomial Bin(n, p)
  F_X(x) = I_{1-p}(n - x, x + 1)
  f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}
  E[X] = np,  V[X] = np(1 - p)
  M_X(s) = (1 - p + p e^s)^n

Multinomial Mult(n, p)
  f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k},  with \sum_{i=1}^{k} x_i = n
  E[X_i] = n p_i,  V[X_i] = n p_i (1 - p_i),  Cov[X_i, X_j] = -n p_i p_j
  M_X(s) = \left( \sum_{i} p_i e^{s_i} \right)^n

Hypergeometric Hyp(N, m, n)
  f_X(x) = \binom{m}{x} \binom{N - m}{n - x} \Big/ \binom{N}{n}
  E[X] = \frac{nm}{N},  V[X] = \frac{nm(N - n)(N - m)}{N^2 (N - 1)}

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x
  E[X] = r\frac{1 - p}{p},  V[X] = r\frac{1 - p}{p^2}
  M_X(s) = \left( \frac{p}{1 - (1 - p) e^s} \right)^r

Geometric Geo(p)
  F_X(x) = 1 - (1 - p)^x,  x \in \mathbb{N}^+
  f_X(x) = p (1 - p)^{x - 1}
  E[X] = \frac{1}{p},  V[X] = \frac{1 - p}{p^2}
  M_X(s) = \frac{p e^s}{1 - (1 - p) e^s}

Poisson Po(\lambda)
  F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}
  f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}
  E[X] = \lambda,  V[X] = \lambda
  M_X(s) = e^{\lambda(e^s - 1)}

[Figure] PMF and CDF plots of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (lambda = 1, 4, 10) distributions.
1.2 Continuous Distributions

Note: The Exponential and Gamma distributions are parameterized by \beta = 1/\lambda, so that f(x) \propto e^{-x/\beta}; some textbooks use \beta as a rate parameter instead [6].

Uniform Unif(a, b)
  F_X(x) = 0 (x < a);  \frac{x - a}{b - a} (a < x < b);  1 (x > b)
  f_X(x) = \frac{I(a < x < b)}{b - a}
  E[X] = \frac{a + b}{2},  V[X] = \frac{(b - a)^2}{12}
  M_X(s) = \frac{e^{sb} - e^{sa}}{s(b - a)}

Normal N(\mu, \sigma^2)
  F_X(x) = \Phi\!\left( \frac{x - \mu}{\sigma} \right),  \Phi(x) = \int_{-\infty}^{x} \phi(t) \, dt
  f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
  E[X] = \mu,  V[X] = \sigma^2
  M_X(s) = \exp\!\left( \mu s + \frac{\sigma^2 s^2}{2} \right)

Log-Normal ln N(\mu, \sigma^2)
  F_X(x) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\!\left[ \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right]
  f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)
  E[X] = e^{\mu + \sigma^2/2},  V[X] = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}

Multivariate Normal MVN(\mu, \Sigma)
  f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
  E[X] = \mu,  V[X] = \Sigma
  M_X(s) = \exp\!\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)

Student's t Student(\nu)
  f_X(x) = \frac{\Gamma\!\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\,\Gamma\!\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu + 1)/2}
  E[X] = 0 (\nu > 1),  V[X] = \frac{\nu}{\nu - 2} (\nu > 2), \infty (1 < \nu \le 2)

Chi-square \chi^2_k
  F_X(x) = \frac{1}{\Gamma(k/2)} \gamma\!\left( \frac{k}{2}, \frac{x}{2} \right)
  f_X(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}
  E[X] = k,  V[X] = 2k
  M_X(s) = (1 - 2s)^{-k/2},  s < 1/2

F F(d_1, d_2)
  F_X(x) = I_{d_1 x / (d_1 x + d_2)}\!\left( \frac{d_1}{2}, \frac{d_2}{2} \right)
  f_X(x) = \frac{\sqrt{(d_1 x)^{d_1} d_2^{d_2} / (d_1 x + d_2)^{d_1 + d_2}}}{x \, B(d_1/2, d_2/2)}
  E[X] = \frac{d_2}{d_2 - 2} (d_2 > 2),  V[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}

Exponential Exp(\beta)
  F_X(x) = 1 - e^{-x/\beta}
  f_X(x) = \frac{1}{\beta} e^{-x/\beta}
  E[X] = \beta,  V[X] = \beta^2
  M_X(s) = \frac{1}{1 - \beta s}  (s < 1/\beta)

Gamma Gamma(\alpha, \beta)
  F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}
  f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha - 1} e^{-x/\beta}
  E[X] = \alpha\beta,  V[X] = \alpha\beta^2
  M_X(s) = \left( \frac{1}{1 - \beta s} \right)^{\alpha}  (s < 1/\beta)

Inverse Gamma InvGamma(\alpha, \beta)
  F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}
  f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}
  E[X] = \frac{\beta}{\alpha - 1} (\alpha > 1),  V[X] = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\!\left( \sqrt{-4\beta s} \right)

Dirichlet Dir(\alpha)
  f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}
  E[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k} \alpha_i},  V[X_i] = \frac{E[X_i](1 - E[X_i])}{\sum_{i=1}^{k} \alpha_i + 1}

Beta Beta(\alpha, \beta)
  F_X(x) = I_x(\alpha, \beta)
  f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}
  E[X] = \frac{\alpha}{\alpha + \beta},  V[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
  M_X(s) = 1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^k}{k!}

Weibull Weibull(\lambda, k)
  F_X(x) = 1 - e^{-(x/\lambda)^k},  x \ge 0
  f_X(x) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k},  x \ge 0
  E[X] = \lambda \Gamma\!\left( 1 + \frac{1}{k} \right),  V[X] = \lambda^2 \Gamma\!\left( 1 + \frac{2}{k} \right) - \mu^2
  M_X(s) = \sum_{n=0}^{\infty} \frac{s^n \lambda^n}{n!} \Gamma\!\left( 1 + \frac{n}{k} \right)

Pareto Pareto(x_m, \alpha)
  F_X(x) = 1 - \left( \frac{x_m}{x} \right)^{\alpha},  x \ge x_m
  f_X(x) = \alpha \frac{x_m^\alpha}{x^{\alpha + 1}},  x \ge x_m
  E[X] = \frac{\alpha x_m}{\alpha - 1} (\alpha > 1),  V[X] = \frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \alpha (-x_m s)^{\alpha} \Gamma(-\alpha, -x_m s)  (s < 0)

[Figure] PDF and CDF plots of the continuous Uniform, Normal, Log-Normal, Student's t, Chi-square, F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for various parameter settings.
2 Probability Theory

Definitions
  Sample space \Omega
  Outcome (point or element) \omega \in \Omega
  Event A \subseteq \Omega
  \sigma-algebra \mathcal{A}:
    1. \emptyset \in \mathcal{A}
    2. A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}
    3. A \in \mathcal{A} \implies \neg A \in \mathcal{A}
  Probability distribution P:
    1. P[A] \ge 0 for every A
    2. P[\Omega] = 1
    3. P\!\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]
  Probability space (\Omega, \mathcal{A}, P)

Properties
  P[\emptyset] = 0
  B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)
  P[\neg A] = 1 - P[A]
  P[B] = P[A \cap B] + P[\neg A \cap B]
  DeMorgan: \neg\!\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n,  \neg\!\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n
  P\!\left[ \bigcup_n A_n \right] = 1 - P\!\left[ \bigcap_n \neg A_n \right]
  P[A \cup B] = P[A] + P[B] - P[A \cap B] \le P[A] + P[B]
  P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]
  P[A \cap \neg B] = P[A] - P[A \cap B]

Continuity of Probabilities
  A_1 \subset A_2 \subset \ldots \implies \lim_{n\to\infty} P[A_n] = P[A]  where A = \bigcup_{i=1}^{\infty} A_i
  A_1 \supset A_2 \supset \ldots \implies \lim_{n\to\infty} P[A_n] = P[A]  where A = \bigcap_{i=1}^{\infty} A_i

Independence
  A \perp B \iff P[A \cap B] = P[A]\,P[B]

Conditional Probability
  P[A \mid B] = \frac{P[A \cap B]}{P[B]},  P[B] > 0

Law of Total Probability
  P[B] = \sum_{i=1}^{n} P[B \mid A_i]\,P[A_i],  \Omega = \bigsqcup_{i=1}^{n} A_i

Bayes' Theorem
  P[A_i \mid B] = \frac{P[B \mid A_i]\,P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\,P[A_j]},  \Omega = \bigsqcup_{i=1}^{n} A_i

Inclusion-Exclusion Principle
  \left| \bigcup_{i=1}^{n} A_i \right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i \le i_1 < \cdots < i_r \le n} \left| \bigcap_{j=1}^{r} A_{i_j} \right|

3 Random Variables

Random Variable (RV)
  X : \Omega \to \mathbb{R}

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]

Probability Density Function (PDF)
  P[a \le X \le b] = \int_a^b f(x) \, dx

Cumulative Distribution Function (CDF)
  F_X : \mathbb{R} \to [0, 1],  F_X(x) = P[X \le x]
  1. Nondecreasing: x_1 < x_2 \implies F(x_1) \le F(x_2)
  2. Normalized: \lim_{x\to-\infty} F(x) = 0 and \lim_{x\to\infty} F(x) = 1
  3. Right-continuous: \lim_{y \downarrow x} F(y) = F(x)

Conditional density
  P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x) \, dy,  f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}

Independence
  X \perp Y if
  1. P[X \le x, Y \le y] = P[X \le x]\,P[Y \le y]  for all x, y
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)

3.1 Transformations

Transformation function
  Z = \varphi(X)

Discrete
  f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f_X(x)

Continuous
  F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x) \, dx  with A_z = \{x : \varphi(x) \le z\}

Special case if \varphi strictly monotone
  f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x) \frac{1}{|J|}

Convolution
  Z := X + Y:   f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x) \, dx  (for X, Y \ge 0: \int_0^z f_{X,Y}(x, z - x) \, dx)
  Z := |X - Y|: f_Z(z) = 2 \int_0^{\infty} f_{X,Y}(x, z + x) \, dx
  Z := X/Y:     f_Z(z) = \int_{-\infty}^{\infty} |x| f_{X,Y}(x, xz) \, dx  (if X \perp Y: \int |x| f_X(x) f_Y(xz) \, dx)

4 Expectation

Definition and properties
  E[X] = \mu_X = \int x \, dF_X(x) = \sum_x x f_X(x) (X discrete),  \int x f_X(x) \, dx (X continuous)
  P[X = c] = 1 \implies E[X] = c
  E[cX] = c\,E[X]
  E[X + Y] = E[X] + E[Y]
  E[XY] = \int\!\!\int_{X,Y} x y \, dF_X(x) \, dF_Y(y)
  E[\varphi(Y)] \ne \varphi(E[Y])  in general (cf. Jensen's inequality)
  P[X \ge Y] = 1 \implies E[X] \ge E[Y]
  P[X = Y] = 1 \implies E[X] = E[Y]
  E[X] = \sum_{x=1}^{\infty} P[X \ge x]  (X taking values in the non-negative integers)

Sample mean
  \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

Conditional expectation
  E[Y \mid X = x] = \int y f(y \mid x) \, dy
  E[X] = E[E[X \mid Y]]
  E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y) f_{Y|X}(y \mid x) \, dy
  E[\varphi(Y, Z) \mid X = x] = \int\!\!\int \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x) \, dy \, dz
  E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]
  E[\varphi(X) Y \mid X] = \varphi(X) E[Y \mid X]
  E[Y \mid X] = c \implies Cov[X, Y] = 0

The Rule of the Lazy Statistician
  E[Z] = \int \varphi(x) \, dF_X(x)
  E[I_A(x)] = \int I_A(x) \, dF_X(x) = \int_A dF_X(x) = P[X \in A]

5 Variance

Definition and properties
  V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2
  V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2 \sum_{i \ne j} Cov[X_i, X_j]
  V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]  if the X_i are independent

Standard deviation
  sd[X] = \sqrt{V[X]} = \sigma_X

Covariance
  Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
  Cov[X, a] = 0
  Cov[X, X] = V[X]
  Cov[X, Y] = Cov[Y, X]
  Cov[aX, bY] = ab\,Cov[X, Y]
  Cov[X + a, Y + b] = Cov[X, Y]
  Cov\!\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov[X_i, Y_j]

Correlation
  \rho[X, Y] = \frac{Cov[X, Y]}{\sqrt{V[X] V[Y]}}

Independence
  X \perp Y \implies \rho[X, Y] = 0 \iff Cov[X, Y] = 0 \iff E[XY] = E[X]E[Y]

Sample variance
  S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2

Conditional variance
  V[Y \mid X] = E[(Y - E[Y \mid X])^2 \mid X] = E[Y^2 \mid X] - E[Y \mid X]^2
  V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]

6 Inequalities

  Cauchy-Schwarz: E[XY]^2 \le E[X^2] E[Y^2]
  Markov: P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}
  Chebyshev: P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}
  Chernoff: P[X \ge (1 + \delta)\mu] \le \left( \frac{e^\delta}{(1 + \delta)^{1+\delta}} \right)^{\mu},  \delta > -1
  Hoeffding: X_1, \ldots, X_n independent with P[X_i \in [a_i, b_i]] = 1, 1 \le i \le n:
    P[\bar{X} - E[\bar{X}] \ge t] \le e^{-2nt^2},  t > 0
    P[|\bar{X} - E[\bar{X}]| \ge t] \le 2 \exp\!\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),  t > 0
  Jensen: E[\varphi(X)] \ge \varphi(E[X])  for \varphi convex

7 Distribution Relationships

Binomial
  X_i \sim Bern(p) \implies \sum_{i=1}^{n} X_i \sim Bin(n, p)
  X \sim Bin(n, p), Y \sim Bin(m, p) \implies X + Y \sim Bin(n + m, p)
  \lim_{n\to\infty} Bin(n, p) = Po(np)  (n large, p small)
  \lim_{n\to\infty} Bin(n, p) = N(np, np(1 - p))  (n large, p far from 0 and 1)

Negative Binomial
  X \sim NBin(1, p) = Geo(p)
  X \sim NBin(r, p) = \sum_{i=1}^{r} Geo(p)
  X_i \sim NBin(r_i, p) \implies \sum X_i \sim NBin(\sum r_i, p)
  X \sim NBin(r, p), Y \sim Bin(s + r, p) \implies P[X \le s] = P[Y \ge r]

Poisson
  X_i \sim Po(\lambda_i), X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim Po\!\left( \sum_{i=1}^{n} \lambda_i \right)
  X_i \sim Po(\lambda_i), X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim Bin\!\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \right)

Exponential
  X_i \sim Exp(\beta), X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim Gamma(n, \beta)
  Memoryless property: P[X > x + y \mid X > y] = P[X > x]

Normal
  X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1)
  X \sim N(\mu, \sigma^2), Z = aX + b \implies Z \sim N(a\mu + b, a^2\sigma^2)
  X_i \sim N(\mu_i, \sigma_i^2), X_i \perp X_j \implies \sum_i X_i \sim N\!\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)
  P[a < X \le b] = \Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)
  \Phi(-x) = 1 - \Phi(x),  \phi'(x) = -x\phi(x),  \phi''(x) = (x^2 - 1)\phi(x)
  Upper quantile of N(0, 1): z_\alpha = \Phi^{-1}(1 - \alpha)

Gamma
  X \sim Gamma(\alpha, \beta) \iff X/\beta \sim Gamma(\alpha, 1)
  Gamma(\alpha, \beta) \overset{d}{=} \sum_{i=1}^{\alpha} Exp(\beta)  (integer \alpha)
  X_i \sim Gamma(\alpha_i, \beta), X_i \perp X_j \implies \sum_i X_i \sim Gamma\!\left( \sum_i \alpha_i, \beta \right)
  \frac{\Gamma(\alpha)}{\lambda^\alpha} = \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} \, dx

Beta
  \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}(1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}
  E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]
  Beta(1, 1) \sim Unif(0, 1)

8 Probability and Moment Generating Functions

  G_X(t) = E[t^X],  |t| < 1
  M_X(t) = G_X(e^t) = E[e^{Xt}] = E\!\left[ \sum_{i=0}^{\infty} \frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty} \frac{E[X^i]}{i!} t^i
  P[X = 0] = G_X(0)
  P[X = 1] = G_X'(0)
  P[X = i] = \frac{G_X^{(i)}(0)}{i!}
  E[X] = G_X'(1^-)
  E[X^k] = M_X^{(k)}(0)
  E\!\left[ \frac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)
  V[X] = G_X''(1^-) + G_X'(1^-) - \left( G_X'(1^-) \right)^2
  G_X(t) = G_Y(t) \implies X \overset{d}{=} Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z \sim N(0, 1) with X \perp Z, and Y = \rho X + \sqrt{1 - \rho^2}\, Z.

Joint density
  f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\!\left( -\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)} \right)

Conditionals
  (Y \mid X = x) \sim N(\rho x, 1 - \rho^2)  and  (X \mid Y = y) \sim N(\rho y, 1 - \rho^2)

Independence
  X \perp Y \iff \rho = 0

9.2 Bivariate Normal

Let X \sim N(\mu_x, \sigma_x^2) and Y \sim N(\mu_y, \sigma_y^2).

  f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\!\left( -\frac{z}{2(1 - \rho^2)} \right)
  z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right)\!\left( \frac{y - \mu_y}{\sigma_y} \right)

Conditional mean and variance
  E[X \mid Y] = E[X] + \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y])
  V[X \mid Y] = \sigma_X^2 (1 - \rho^2)

9.3 Multivariate Normal

Covariance matrix \Sigma (precision matrix \Sigma^{-1})
  \Sigma = \begin{pmatrix} V[X_1] & \cdots & Cov[X_1, X_k] \\ \vdots & \ddots & \vdots \\ Cov[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}

If X \sim N(\mu, \Sigma),
  f_X(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Properties
  Z \sim N(0, 1), X = \mu + \Sigma^{1/2} Z \implies X \sim N(\mu, \Sigma)
  X \sim N(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim N(0, 1)
  X \sim N(\mu, \Sigma) \implies AX \sim N(A\mu, A\Sigma A^T)
  X \sim N(\mu, \Sigma), a a vector of length k \implies a^T X \sim N(a^T \mu, a^T \Sigma a)

10 Convergence

Let {X_1, X_2, ...} be a sequence of rvs and let X be another rv. Let F_n denote the cdf of X_n and F the cdf of X.

Types of Convergence
  1. In distribution (weakly, in law): X_n \overset{D}{\to} X
     \lim_{n\to\infty} F_n(t) = F(t)  at all t where F is continuous
  2. In probability: X_n \overset{P}{\to} X
     (\forall \varepsilon > 0)\ \lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0
  3. Almost surely (strongly): X_n \overset{as}{\to} X
     P[\lim_{n\to\infty} X_n = X] = P[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)] = 1
  4. In quadratic mean (L_2): X_n \overset{qm}{\to} X
     \lim_{n\to\infty} E[(X_n - X)^2] = 0

Relationships
  X_n \overset{qm}{\to} X \implies X_n \overset{P}{\to} X \implies X_n \overset{D}{\to} X
  X_n \overset{as}{\to} X \implies X_n \overset{P}{\to} X
  X_n \overset{D}{\to} X and (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \overset{P}{\to} X
  X_n \overset{P}{\to} X, Y_n \overset{P}{\to} Y \implies X_n + Y_n \overset{P}{\to} X + Y
  X_n \overset{qm}{\to} X, Y_n \overset{qm}{\to} Y \implies X_n + Y_n \overset{qm}{\to} X + Y
  X_n \overset{P}{\to} X, Y_n \overset{P}{\to} Y \implies X_n Y_n \overset{P}{\to} XY
  X_n \overset{P}{\to} X \implies \varphi(X_n) \overset{P}{\to} \varphi(X)
  X_n \overset{D}{\to} X \implies \varphi(X_n) \overset{D}{\to} \varphi(X)
  X_n \overset{qm}{\to} b \iff \lim_{n\to\infty} E[X_n] = b and \lim_{n\to\infty} V[X_n] = 0
  X_1, \ldots, X_n iid, E[X] = \mu, V[X] < \infty \implies \bar{X}_n \overset{qm}{\to} \mu

Slutsky's Theorem
  X_n \overset{D}{\to} X and Y_n \overset{P}{\to} c \implies X_n + Y_n \overset{D}{\to} X + c
  X_n \overset{D}{\to} X and Y_n \overset{P}{\to} c \implies X_n Y_n \overset{D}{\to} cX
  In general, X_n \overset{D}{\to} X and Y_n \overset{D}{\to} Y does not imply X_n + Y_n \overset{D}{\to} X + Y

10.1 Law of Large Numbers (LLN)

Let {X_1, ..., X_n} be a sequence of iid rvs with E[X_1] = \mu.
  Weak (WLLN): \bar{X}_n \overset{P}{\to} \mu as n \to \infty
  Strong (SLLN): \bar{X}_n \overset{as}{\to} \mu as n \to \infty

10.2 Central Limit Theorem (CLT)

Let {X_1, ..., X_n} be a sequence of iid rvs with E[X_1] = \mu and V[X_1] = \sigma^2.
  Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \overset{D}{\to} Z,  where Z \sim N(0, 1)
  \lim_{n\to\infty} P[Z_n \le z] = \Phi(z),  z \in \mathbb{R}

CLT notations
  Z_n \approx N(0, 1)
  \bar{X}_n \approx N(\mu, \sigma^2/n)
  \bar{X}_n - \mu \approx N(0, \sigma^2/n)
  \sqrt{n}(\bar{X}_n - \mu) \approx N(0, \sigma^2)
  \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx N(0, 1)

Continuity correction
  P[\bar{X}_n \le x] \approx \Phi\!\left( \frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)
  P[\bar{X}_n \ge x] \approx 1 - \Phi\!\left( \frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)

Delta method
  Y_n \approx N\!\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx N\!\left( \varphi(\mu), (\varphi'(\mu))^2 \frac{\sigma^2}{n} \right)
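A minimal NumPy sketch illustrating the CLT numerically (an illustration, not part of the formal statement; the choice of Exponential data and all variable names are ours): standardized means of iid draws behave approximately like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 50, 10_000, 2.0        # sample size, replications, Exp scale
mu, sigma = beta, beta                 # for Exp(beta): E[X] = beta, sd[X] = beta

x = rng.exponential(scale=beta, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # Z_n = sqrt(n)(Xbar_n - mu)/sigma

# Empirical P[Z_n <= c] should be close to Phi(c) for moderate n.
for c in (-1.0, 0.0, 1.0, 1.96):
    print(c, np.mean(z <= c))
```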

11 Statistical Inference

Let X_1, ..., X_n iid ~ F if not otherwise noted.

11.1 Point Estimation

  Point estimator \hat\theta_n of \theta is a rv: \hat\theta_n = g(X_1, ..., X_n)
  bias(\hat\theta_n) = E[\hat\theta_n] - \theta
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Sampling distribution: F(\hat\theta_n)
  Standard error: se(\hat\theta_n) = \sqrt{V[\hat\theta_n]}
  Mean squared error: mse = E[(\hat\theta_n - \theta)^2] = bias(\hat\theta_n)^2 + V[\hat\theta_n]
  \lim_{n\to\infty} bias(\hat\theta_n) = 0 and \lim_{n\to\infty} se(\hat\theta_n) = 0 \implies \hat\theta_n is consistent
  Asymptotic normality: \frac{\hat\theta_n - \theta}{se} \overset{D}{\to} N(0, 1)
  Slutsky's Theorem often lets us replace se(\hat\theta_n) by some (weakly) consistent estimator \hat\sigma_n.

11.2 Normal-Based Confidence Interval

Suppose \hat\theta_n \approx N(\theta, \hat{se}^2). Let z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2), i.e., P[Z > z_{\alpha/2}] = \alpha/2 and P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha where Z \sim N(0, 1). Then
  C_n = \hat\theta_n \pm z_{\alpha/2}\,\hat{se}

11.3 Empirical distribution

Empirical Distribution Function (ECDF)
  \hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n},  where I(X_i \le x) = 1 if X_i \le x and 0 if X_i > x

Properties (for any fixed x)
  E[\hat{F}_n(x)] = F(x)
  V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}
  mse = \frac{F(x)(1 - F(x))}{n} \to 0
  \hat{F}_n(x) \overset{P}{\to} F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X_1, ..., X_n ~ F)
  P\!\left[ \sup_x |F(x) - \hat{F}_n(x)| > \varepsilon \right] \le 2 e^{-2n\varepsilon^2}

Nonparametric 1 - \alpha confidence band for F
  L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}
  U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}
  \varepsilon_n = \sqrt{\frac{1}{2n} \log\!\left( \frac{2}{\alpha} \right)}
  P[L(x) \le F(x) \le U(x) \ \forall x] \ge 1 - \alpha

11.4 Statistical Functionals

  Statistical functional: T(F)
  Plug-in estimator of \theta = T(F): \hat\theta_n = T(\hat{F}_n)
  Linear functional: T(F) = \int \varphi(x) \, dF_X(x)
  Plug-in estimator for linear functional: T(\hat{F}_n) = \int \varphi(x) \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)
  Often: T(\hat{F}_n) \approx N(T(F), \hat{se}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\hat{se}
  pth quantile: F^{-1}(p) = \inf\{x : F(x) \ge p\}
  \hat\mu = \bar{X}_n
  \hat\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2
  \hat\kappa = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat\mu)^3}{\hat\sigma^3}
  \hat\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}
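A minimal NumPy sketch of the ECDF and the DKW confidence band above (function and variable names are ours):

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """ECDF F_n and the DKW 1-alpha confidence band, evaluated at the order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    F = np.arange(1, n + 1) / n                      # F_n at the sorted data points
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW half-width eps_n
    return x, F, np.clip(F - eps, 0, 1), np.clip(F + eps, 0, 1)

x_sorted, F_hat, lower, upper = ecdf_band(np.random.default_rng(1).normal(size=200))
```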

12 Parametric Inference

Let F = {f(x; \theta) : \theta \in \Theta} be a parametric model with parameter space \Theta \subseteq \mathbb{R}^k and parameter \theta = (\theta_1, ..., \theta_k).

12.1 Method of Moments

jth moment
  \alpha_j(\theta) = E[X^j] = \int x^j \, dF_X(x)

jth sample moment
  \hat\alpha_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j

Method of Moments estimator (MoM): \hat\theta_n solves
  \alpha_1(\theta) = \hat\alpha_1,  \alpha_2(\theta) = \hat\alpha_2,  \ldots,  \alpha_k(\theta) = \hat\alpha_k

Properties of the MoM estimator
  \hat\theta_n exists with probability tending to 1
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Asymptotic normality: \sqrt{n}(\hat\theta - \theta) \overset{D}{\to} N(0, \Sigma)
    where \Sigma = g E[Y Y^T] g^T, Y = (X, X^2, \ldots, X^k)^T, g = (g_1, \ldots, g_k), g_j = \frac{\partial}{\partial\theta} \alpha_j^{-1}(\theta)

12.2 Maximum Likelihood

Likelihood: L_n : \Theta \to [0, \infty)
  L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)

Log-likelihood
  \ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)

Maximum likelihood estimator (mle)
  L_n(\hat\theta_n) = \sup_\theta L_n(\theta)

Score function
  s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta)

Fisher information
  I(\theta) = V_\theta[s(X; \theta)],  I_n(\theta) = n I(\theta)

Fisher information (exponential family)
  I(\theta) = -E_\theta\!\left[ \frac{\partial}{\partial\theta} s(X; \theta) \right]

Observed Fisher information
  I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2} \sum_{i=1}^{n} \log f(X_i; \theta)

Properties of the mle
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Equivariance: \hat\theta_n is the mle \implies \varphi(\hat\theta_n) is the mle of \varphi(\theta)
  Asymptotic normality:
    1. se \approx \sqrt{1/I_n(\theta)} and \frac{\hat\theta_n - \theta}{se} \overset{D}{\to} N(0, 1)
    2. \hat{se} \approx \sqrt{1/I_n(\hat\theta_n)} and \frac{\hat\theta_n - \theta}{\hat{se}} \overset{D}{\to} N(0, 1)
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If \tilde\theta_n is any other estimator, the asymptotic relative efficiency is
    are(\tilde\theta_n, \hat\theta_n) = \frac{V[\hat\theta_n]}{V[\tilde\theta_n]} \le 1
  Approximately the Bayes estimator

12.2.1 Delta Method

If \tau = \varphi(\theta) where \varphi is differentiable and \varphi'(\theta) \ne 0:
  \frac{\hat\tau_n - \tau}{\hat{se}(\hat\tau)} \overset{D}{\to} N(0, 1)
  where \hat\tau = \varphi(\hat\theta) is the mle of \tau and \hat{se}(\hat\tau) = |\varphi'(\hat\theta)|\,\hat{se}(\hat\theta_n)

12.3 Multiparameter Models

Let \theta = (\theta_1, ..., \theta_k) and \hat\theta = (\hat\theta_1, ..., \hat\theta_k) be the mle.
  H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2},  H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j \partial\theta_k}

Fisher information matrix
  I_n(\theta) = -\begin{pmatrix} E[H_{11}] & \cdots & E[H_{1k}] \\ \vdots & \ddots & \vdots \\ E[H_{k1}] & \cdots & E[H_{kk}] \end{pmatrix}

Under appropriate regularity conditions
  (\hat\theta - \theta) \approx N(0, J_n),  with J_n(\theta) = I_n^{-1}(\theta)
Further, if \hat\theta_j is the jth component of \hat\theta, then
  \frac{\hat\theta_j - \theta_j}{\hat{se}_j} \overset{D}{\to} N(0, 1)
  where \hat{se}_j^2 = J_n(j, j) and Cov[\hat\theta_j, \hat\theta_k] = J_n(j, k)

12.3.1 Multiparameter delta method

Let \tau = \varphi(\theta_1, ..., \theta_k) and let the gradient of \varphi be
  \nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T

Suppose \nabla\varphi evaluated at \hat\theta is not zero and \hat\tau = \varphi(\hat\theta). Then
  \frac{\hat\tau - \tau}{\hat{se}(\hat\tau)} \overset{D}{\to} N(0, 1)
  where \hat{se}(\hat\tau) = \sqrt{(\hat\nabla\varphi)^T \hat{J}_n (\hat\nabla\varphi)},  \hat{J}_n = J_n(\hat\theta) and \hat\nabla\varphi = \nabla\varphi evaluated at \theta = \hat\theta

12.4 Parametric Bootstrap

Sample from f(x; \hat\theta_n) instead of from \hat{F}_n, where \hat\theta_n could be the mle or the method of moments estimator.
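A minimal numerical sketch of mle plus Fisher-information standard error for one concrete model (the Poisson example, data, and variable names are ours): for Po(\lambda) the mle is \hat\lambda = \bar{x} and I_n(\lambda) = n/\lambda, giving a Wald-type interval \hat\lambda \pm z_{\alpha/2}\sqrt{\hat\lambda/n}.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.5, size=500)

lam_hat = x.mean()                   # mle of the Poisson rate
fisher_n = x.size / lam_hat          # I_n(lambda) evaluated at the mle
se_hat = np.sqrt(1.0 / fisher_n)     # se ~ sqrt(1 / I_n(lambda_hat))

z = 1.959963984540054                # z_{alpha/2} for alpha = 0.05
print(lam_hat, se_hat, (lam_hat - z * se_hat, lam_hat + z * se_hat))
```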

13 Hypothesis Testing

  H_0 : \theta \in \Theta_0  versus  H_1 : \theta \in \Theta_1

Definitions
  Null hypothesis H_0
  Alternative hypothesis H_1
  Simple hypothesis: \theta = \theta_0
  Composite hypothesis: \theta > \theta_0 or \theta < \theta_0
  Two-sided test: H_0 : \theta = \theta_0 versus H_1 : \theta \ne \theta_0
  One-sided test: H_0 : \theta \le \theta_0 versus H_1 : \theta > \theta_0
  Critical value c
  Test statistic T
  Rejection region R = {x : T(x) > c}
  Power function \beta(\theta) = P_\theta[X \in R]
  Power of a test: 1 - P[Type II error] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)
  Test size: \alpha = P[Type I error] = \sup_{\theta \in \Theta_0} \beta(\theta)

              Retain H_0              Reject H_0
  H_0 true    ok                      Type I error (\alpha)
  H_1 true    Type II error (\beta)   ok (power)

p-value
  p-value = \sup_{\theta \in \Theta_0} P_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}
  p-value = \sup_{\theta \in \Theta_0} P_\theta[T(X^\star) \ge T(X)] = \inf\{\alpha : T(X) \in R_\alpha\}
  (since T(X^\star) \sim F_\theta, the inner probability equals 1 - F_\theta(T(X)))

  p-value        evidence
  < 0.01         very strong evidence against H_0
  0.01 - 0.05    strong evidence against H_0
  0.05 - 0.1     weak evidence against H_0
  > 0.1          little or no evidence against H_0

Wald test
  Two-sided test: reject H_0 when |W| > z_{\alpha/2} where W = \frac{\hat\theta - \theta_0}{\hat{se}}
  P[|W| > z_{\alpha/2}] \to \alpha
  p-value = P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)

Likelihood ratio test (LRT)
  T(X) = \frac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \frac{L_n(\hat\theta_n)}{L_n(\hat\theta_{n,0})}
  \lambda(X) = 2 \log T(X) \overset{D}{\to} \chi^2_{r-q},  where \chi^2_k = \sum_{i=1}^{k} Z_i^2 and Z_1, ..., Z_k iid ~ N(0, 1)
  p-value = P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]

Multinomial LRT
  mle: \hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)
  T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^{k} \left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}
  \lambda(X) = 2 \sum_{j=1}^{k} X_j \log\!\left( \frac{\hat{p}_j}{p_{0j}} \right) \overset{D}{\to} \chi^2_{k-1}
  The approximate size \alpha LRT rejects H_0 when \lambda(X) \ge \chi^2_{k-1,\alpha}

Pearson Chi-square Test
  T = \sum_{j=1}^{k} \frac{(X_j - E[X_j])^2}{E[X_j]},  where E[X_j] = n p_{0j} under H_0
  T \overset{D}{\to} \chi^2_{k-1}
  p-value = P[\chi^2_{k-1} > T(x)]
  Faster convergence to \chi^2_{k-1} than the LRT, hence preferable for small n.

Independence testing
  I rows, J columns, X multinomial sample of size n = I \cdot J
  mles unconstrained: \hat{p}_{ij} = \frac{X_{ij}}{n}
  mles under H_0: \hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}
  LRT: \lambda = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\!\left( \frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)
  Pearson chi-square: T = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}
  Both LRT and Pearson \overset{D}{\to} \chi^2_\nu, where \nu = (I - 1)(J - 1)

14 Exponential Family

Scalar parameter
  f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}

Vector parameter
  f_X(x \mid \theta) = h(x) \exp\!\left\{ \sum_{i=1}^{s} \eta_i(\theta) T_i(x) - A(\theta) \right\}
                     = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\}
                     = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}

Natural form
  f_X(x \mid \eta) = h(x) \exp\{\eta \cdot T(x) - A(\eta)\}
                   = h(x) g(\eta) \exp\{\eta \cdot T(x)\}
                   = h(x) g(\eta) \exp\{\eta^T T(x)\}

15 Bayesian Inference

Bayes' Theorem
  f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x^n)} = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta) \, d\theta} \propto L_n(\theta) f(\theta)

Definitions
  X^n = (X_1, ..., X_n),  x^n = (x_1, ..., x_n)
  Prior density f(\theta)
  Likelihood f(x^n \mid \theta): joint density of the data; in particular, if X^n iid, then f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta)
  Posterior density f(\theta \mid x^n)
  Normalizing constant c_n = f(x^n) = \int f(x \mid \theta) f(\theta) \, d\theta
  Kernel: the part of a density that depends on \theta
  Posterior mean \bar\theta_n = \int \theta f(\theta \mid x^n) \, d\theta = \frac{\int \theta L_n(\theta) f(\theta) \, d\theta}{\int L_n(\theta) f(\theta) \, d\theta}

15.1 Credible Intervals

Posterior interval
  P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n) \, d\theta = 1 - \alpha

Equal-tail credible interval
  \int_{-\infty}^{a} f(\theta \mid x^n) \, d\theta = \int_{b}^{\infty} f(\theta \mid x^n) \, d\theta = \alpha/2

Highest posterior density (HPD) region R_n
  1. P[\theta \in R_n] = 1 - \alpha
  2. R_n = \{\theta : f(\theta \mid x^n) > k\} for some k
  R_n is unimodal \implies R_n is an interval

15.2 Function of parameters

Let \tau = \varphi(\theta) and A = \{\theta : \varphi(\theta) \le \tau\}.

Posterior CDF for \tau
  H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n) \, d\theta

Posterior density
  h(\tau \mid x^n) = H'(\tau \mid x^n)

Bayesian delta method
  \tau \mid X^n \approx N\!\left( \varphi(\hat\theta),\ \hat{se}\,|\varphi'(\hat\theta)| \right)

15.3 Priors

Choice
  Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
  Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
  Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
  Flat: f(\theta) \propto constant
  Proper: \int f(\theta) \, d\theta = 1
  Improper: \int f(\theta) \, d\theta = \infty
  Jeffreys' prior (transformation-invariant): f(\theta) \propto \sqrt{I(\theta)},  f(\theta) \propto \sqrt{\det(I(\theta))}
  Conjugate: f(\theta) and f(\theta \mid x^n) belong to the same parametric family

15.3.1 Conjugate Priors

Continuous likelihoods (subscript c denotes a fixed, known constant):

  Likelihood              Conjugate prior                      Posterior hyperparameters
  Unif(0, \theta)         Pareto(x_m, k)                       \max\{x_{(n)}, x_m\},  k + n
  Exp(\lambda)            Gamma(\alpha, \beta)                 \alpha + n,  \beta + \sum_{i=1}^{n} x_i
  N(\mu, \sigma_c^2)      N(\mu_0, \sigma_0^2)                 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right),   \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}
  N(\mu_c, \sigma^2)      Scaled Inverse Chi-square(\nu, \sigma_0^2)   \nu + n,  \frac{\nu\sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu_c)^2}{\nu + n}
  N(\mu, \sigma^2)        Normal-scaled Inverse Gamma(\lambda, \nu, \alpha, \beta)
  MVN(\mu, \Sigma_c)      MVN(\mu_0, \Sigma_0)                 \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}\!\left( \Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x} \right),   \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}
  MVN(\mu_c, \Sigma)      Inverse-Wishart(\kappa, \Psi)        \kappa + n,  \Psi + \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T
  Pareto(x_{mc}, k)       Gamma(\alpha, \beta)                 \alpha + n,  \beta + \sum_{i=1}^{n} \log\frac{x_i}{x_{mc}}
  Pareto(x_m, k_c)        Pareto(x_0, k_0)                     x_0,  k_0 - k_c n  (where k_0 > k_c n)
  Gamma(\alpha_c, \beta)  Gamma(\alpha_0, \beta_0)             \alpha_0 + n\alpha_c,  \beta_0 + \sum_{i=1}^{n} x_i

Discrete likelihoods:

  Likelihood        Conjugate prior        Posterior hyperparameters
  Bern(p)           Beta(\alpha, \beta)    \alpha + \sum_{i=1}^{n} x_i,  \beta + n - \sum_{i=1}^{n} x_i
  Bin(p)            Beta(\alpha, \beta)    \alpha + \sum_{i=1}^{n} x_i,  \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i
  NBin(p)           Beta(\alpha, \beta)    \alpha + rn,  \beta + \sum_{i=1}^{n} x_i
  Po(\lambda)       Gamma(\alpha, \beta)   \alpha + \sum_{i=1}^{n} x_i,  \beta + n
  Multinomial(p)    Dir(\alpha)            \alpha + \sum_{i=1}^{n} x^{(i)}
  Geo(p)            Beta(\alpha, \beta)    \alpha + n,  \beta + \sum_{i=1}^{n} x_i
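A minimal sketch of the first row of the table, the Beta-Bernoulli update (the helper name and example data are ours):

```python
import numpy as np

def beta_bernoulli_update(x, alpha=1.0, beta=1.0):
    """Posterior hyperparameters for Bern(p) data under a Beta(alpha, beta) prior."""
    x = np.asarray(x)
    s = x.sum()
    return alpha + s, beta + x.size - s      # (alpha + sum x_i, beta + n - sum x_i)

a_post, b_post = beta_bernoulli_update([1, 0, 1, 1, 0, 1, 1, 1], alpha=2, beta=2)
print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean of p
```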

15.4 Bayesian Testing

If H_0 : \theta \in \Theta_0:
  Prior probability P[H_0] = \int_{\Theta_0} f(\theta) \, d\theta
  Posterior probability P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n) \, d\theta

Let H_0, ..., H_{K-1} be K hypotheses. Suppose \theta \sim f(\theta \mid H_k):
  P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,P[H_k]}{\sum_{k=1}^{K} f(x^n \mid H_k)\,P[H_k]}

Marginal likelihood
  f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i) f(\theta \mid H_i) \, d\theta

Posterior odds (of H_i relative to H_j)
  \frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \frac{f(x^n \mid H_i)}{f(x^n \mid H_j)} \times \frac{P[H_i]}{P[H_j]}
  (the first factor is the Bayes factor BF_{ij}, the second the prior odds)

Bayes factor
  log10 BF_10    BF_10       evidence
  0 - 0.5        1 - 1.5     Weak
  0.5 - 1        1.5 - 10    Moderate
  1 - 2          10 - 100    Strong
  > 2            > 100       Decisive

  p^* = \frac{\frac{p}{1 - p} BF_{10}}{1 + \frac{p}{1 - p} BF_{10}}  where p = P[H_1] and p^* = P[H_1 \mid x^n]
16 Sampling Methods

16.1 Inverse Transform Sampling

Setup
  U \sim Unif(0, 1)
  X \sim F
  F^{-1}(u) = \inf\{x : F(x) \ge u\}

Algorithm
  1. Generate u \sim Unif(0, 1)
  2. Compute x = F^{-1}(u)
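A minimal NumPy sketch of the algorithm above (the helper name and the Exponential example are ours):

```python
import numpy as np

def inverse_transform(F_inv, size, rng=None):
    """Draw from F by applying the quantile function F^{-1} to Unif(0, 1) draws."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=size)
    return F_inv(u)

# Exponential with scale beta: F^{-1}(u) = -beta * log(1 - u)
beta = 2.0
x = inverse_transform(lambda u: -beta * np.log(1.0 - u), size=10_000)
print(x.mean())   # should be close to beta
```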
16.2 The Bootstrap

Let T_n = g(X_1, ..., X_n) be a statistic.

  1. Estimate V_F[T_n] with V_{\hat{F}_n}[T_n].
  2. Approximate V_{\hat{F}_n}[T_n] using simulation:
     (a) Repeat the following B times to get T^*_{n,1}, ..., T^*_{n,B}, an iid sample from the sampling distribution implied by \hat{F}_n:
         i. Sample uniformly with replacement X^*_1, ..., X^*_n \sim \hat{F}_n.
         ii. Compute T^*_n = g(X^*_1, ..., X^*_n).
     (b) Then
         v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^{B} \left( T^*_{n,b} - \frac{1}{B} \sum_{r=1}^{B} T^*_{n,r} \right)^2
16.2.1 Bootstrap Confidence Intervals

Normal-based interval
  T_n \pm z_{\alpha/2}\,\hat{se}_{boot}

Pivotal interval
  1. Location parameter \theta = T(F)
  2. Pivot R_n = \hat\theta_n - \theta
  3. Let H(r) = P[R_n \le r] be the cdf of R_n
  4. Let R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n. Approximate H using the bootstrap:
     \hat{H}(r) = \frac{1}{B} \sum_{b=1}^{B} I(R^*_{n,b} \le r)
  5. \theta^*_\beta = \beta sample quantile of (\hat\theta^*_{n,1}, ..., \hat\theta^*_{n,B})
  6. r^*_\beta = \beta sample quantile of (R^*_{n,1}, ..., R^*_{n,B}), i.e., r^*_\beta = \theta^*_\beta - \hat\theta_n
  7. Approximate 1 - \alpha confidence interval C_n = (\hat{a}, \hat{b}) where
     \hat{a} = \hat\theta_n - \hat{H}^{-1}\!\left( 1 - \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{1 - \alpha/2} = 2\hat\theta_n - \theta^*_{1 - \alpha/2}
     \hat{b} = \hat\theta_n - \hat{H}^{-1}\!\left( \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}

Percentile interval
  C_n = \left( \theta^*_{\alpha/2},\ \theta^*_{1 - \alpha/2} \right)

16.3 Rejection Sampling

Setup
  We can easily sample from g(\theta)
  We want to sample from h(\theta), but it is difficult
  We know h(\theta) up to a proportional constant: h(\theta) = \frac{k(\theta)}{\int k(\theta) \, d\theta}
  Envelope condition: we can find M > 0 such that k(\theta) \le M g(\theta)

Algorithm
  1. Draw \theta^{cand} \sim g(\theta)
  2. Generate u \sim Unif(0, 1)
  3. Accept \theta^{cand} if u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}
  4. Repeat until B values of \theta^{cand} have been accepted

Example
  We can easily sample from the prior, g(\theta) = f(\theta)
  Target is the posterior: h(\theta) \propto k(\theta) = f(x^n \mid \theta) f(\theta)
  Envelope condition: f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = L_n(\hat\theta_n) =: M
  Algorithm
    1. Draw \theta^{cand} \sim f(\theta)
    2. Generate u \sim Unif(0, 1)
    3. Accept \theta^{cand} if u \le \frac{L_n(\theta^{cand})}{L_n(\hat\theta_n)}
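A minimal NumPy sketch of the generic rejection sampler above (the helper name and the Gaussian-target/Cauchy-envelope example are ours; M is chosen so that k \le M g holds):

```python
import numpy as np

def rejection_sample(k, g_sample, g_pdf, M, B, rng=None):
    """Accept draws theta ~ g whenever u <= k(theta) / (M * g(theta))."""
    rng = rng or np.random.default_rng()
    out = []
    while len(out) < B:
        cand = g_sample(rng)
        if rng.uniform() <= k(cand) / (M * g_pdf(cand)):
            out.append(cand)
    return np.array(out)

# Target h(x) proportional to k(x) = exp(-x^2/2); envelope g = standard Cauchy.
k = lambda x: np.exp(-0.5 * x * x)
g_pdf = lambda x: 1.0 / (np.pi * (1.0 + x * x))
g_sample = lambda rng: rng.standard_cauchy()
M = 2 * np.pi * np.exp(-0.5)          # max of k/g is attained at x = +/- 1
samples = rejection_sample(k, g_sample, g_pdf, M, B=5000)   # ~ N(0, 1) draws
```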

16.4 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(\theta) \mid x^n]:
  1. Sample from the prior: \theta_1, ..., \theta_B iid ~ f(\theta)
  2. w_i = \frac{L_n(\theta_i)}{\sum_{i=1}^{B} L_n(\theta_i)},  i = 1, ..., B
  3. E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i) w_i
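A minimal NumPy sketch of the algorithm above, using the prior as the importance function (the Bernoulli example, data, and helper names are ours; log-weights are stabilized before normalizing):

```python
import numpy as np

def importance_posterior_mean(q, log_lik, prior_sample, B=10_000, rng=None):
    """Self-normalized importance-sampling estimate of E[q(theta) | x^n]."""
    rng = rng or np.random.default_rng()
    theta = prior_sample(rng, B)              # theta_1, ..., theta_B ~ f(theta)
    logw = log_lik(theta)                     # log L_n(theta_i)
    w = np.exp(logw - logw.max())             # stabilize, then normalize
    w /= w.sum()
    return np.sum(q(theta) * w)

# Bernoulli data with a Unif(0, 1) prior on p; posterior mean of p.
x = np.array([1, 0, 1, 1, 1, 0, 1])
log_lik = lambda p: x.sum() * np.log(p) + (x.size - x.sum()) * np.log(1 - p)
est = importance_posterior_mean(q=lambda p: p, log_lik=log_lik,
                                prior_sample=lambda rng, B: rng.uniform(1e-9, 1 - 1e-9, B))
print(est)   # close to the exact posterior mean (sum(x) + 1)/(n + 2) = 6/9
```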

17 Decision Theory

Definitions
  Unknown quantity affecting our decision: \theta \in \Theta
  Decision rule: synonymous with an estimator \hat\theta
  Action a \in A: a possible value of the decision rule. In the estimation context, the action is just an estimate of \theta, \hat\theta(x).
  Loss function L: consequences of taking action a when the true state is \theta, i.e., the discrepancy between \theta and \hat\theta.  L : \Theta \times A \to [-k, \infty).

Loss functions
  Squared error loss: L(\theta, a) = (\theta - a)^2
  Linear loss: L(\theta, a) = K_1(\theta - a) if a - \theta < 0;  K_2(a - \theta) if a - \theta \ge 0
  Absolute error loss: L(\theta, a) = |\theta - a|  (linear loss with K_1 = K_2 = 1)
  L_p loss: L(\theta, a) = |\theta - a|^p
  Zero-one loss: L(\theta, a) = 0 if a = \theta;  1 if a \ne \theta

17.1 Risk

Posterior risk
  r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x)) f(\theta \mid x) \, d\theta = E_{\theta|X}[L(\theta, \hat\theta(x))]

(Frequentist) risk
  R(\theta, \hat\theta) = \int L(\theta, \hat\theta(x)) f(x \mid \theta) \, dx = E_{X|\theta}[L(\theta, \hat\theta(X))]

Bayes risk
  r(f, \hat\theta) = \iint L(\theta, \hat\theta(x)) f(x, \theta) \, dx \, d\theta = E_{\theta,X}[L(\theta, \hat\theta(X))]
  r(f, \hat\theta) = E_\theta[E_{X|\theta}[L(\theta, \hat\theta(X))]] = E_\theta[R(\theta, \hat\theta)]
  r(f, \hat\theta) = E_X[E_{\theta|X}[L(\theta, \hat\theta(X))]] = E_X[r(\hat\theta \mid X)]

17.2 Admissibility

  \hat\theta' dominates \hat\theta if
    \forall\theta : R(\theta, \hat\theta') \le R(\theta, \hat\theta)
    \exists\theta : R(\theta, \hat\theta') < R(\theta, \hat\theta)
  \hat\theta is inadmissible if there is at least one other estimator \hat\theta' that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
  r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)
  \hat\theta(x) = \inf r(\hat\theta \mid x) for all x \implies r(f, \hat\theta) = \int r(\hat\theta \mid x) f(x) \, dx

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk
  \bar{R}(\hat\theta) = \sup_\theta R(\theta, \hat\theta),  \bar{R}(a) = \sup_\theta R(\theta, a)

Minimax rule
  \sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \bar{R}(\tilde\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta)

  \hat\theta = Bayes rule with \exists c : R(\theta, \hat\theta) = c \implies \hat\theta is minimax
  Least favorable prior: \hat\theta^f = Bayes rule with R(\theta, \hat\theta^f) \le r(f, \hat\theta^f) \ \forall\theta \implies \hat\theta^f is minimax

18 Linear Regression

Definitions
  Response variable Y
  Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  E[\varepsilon_i \mid X_i] = 0, V[\varepsilon_i \mid X_i] = \sigma^2

Fitted line
  \hat{r}(x) = \hat\beta_0 + \hat\beta_1 x

Predicted (fitted) values
  \hat{Y}_i = \hat{r}(X_i)

Residuals
  \hat\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)

Residual sum of squares (rss)
  rss(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^{n} \hat\varepsilon_i^2

Least squares estimates
  \hat\beta^T = (\hat\beta_0, \hat\beta_1)^T minimizes rss:
  \hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{X}_n
  \hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}
  E[\hat\beta \mid X^n] = (\beta_0, \beta_1)^T
  V[\hat\beta \mid X^n] = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}
  \hat{se}(\hat\beta_0) = \frac{\hat\sigma}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}},   \hat{se}(\hat\beta_1) = \frac{\hat\sigma}{s_X \sqrt{n}}
  where s_X^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 and \hat\sigma^2 = \frac{1}{n - 2}\sum_{i=1}^{n} \hat\varepsilon_i^2 (unbiased estimate)

Further properties
  Consistency: \hat\beta_0 \overset{P}{\to} \beta_0 and \hat\beta_1 \overset{P}{\to} \beta_1
  Asymptotic normality:
    \frac{\hat\beta_0 - \beta_0}{\hat{se}(\hat\beta_0)} \overset{D}{\to} N(0, 1)  and  \frac{\hat\beta_1 - \beta_1}{\hat{se}(\hat\beta_1)} \overset{D}{\to} N(0, 1)
  Approximate 1 - \alpha confidence intervals: \hat\beta_0 \pm z_{\alpha/2}\,\hat{se}(\hat\beta_0) and \hat\beta_1 \pm z_{\alpha/2}\,\hat{se}(\hat\beta_1)
  Wald test for H_0 : \beta_1 = 0 vs. H_1 : \beta_1 \ne 0: reject H_0 if |W| > z_{\alpha/2} where W = \hat\beta_1 / \hat{se}(\hat\beta_1)

R^2
  R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat\varepsilon_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{rss}{tss}

Likelihood
  L = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = L_1 \cdot L_2
  L_1 = \prod_{i=1}^{n} f_X(X_i)
  L_2 \propto \sigma^{-n} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_i \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}

18.2 Prediction

Observe the covariate value X = x_* and predict the outcome Y_*.
  \hat{Y}_* = \hat\beta_0 + \hat\beta_1 x_*
  V[\hat{Y}_*] = V[\hat\beta_0] + x_*^2 V[\hat\beta_1] + 2 x_* Cov[\hat\beta_0, \hat\beta_1]

Prediction interval
  \hat\xi_n^2 = \hat\sigma^2 \left( \frac{\sum_{i=1}^{n} (X_i - x_*)^2}{n \sum_i (X_i - \bar{X})^2} + 1 \right)
  \hat{Y}_* \pm z_{\alpha/2}\,\hat\xi_n

18.3 Multiple Regression

  Y = X\beta + \varepsilon
  where
  X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix},
  \beta = (\beta_1, \ldots, \beta_k)^T,  \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T

Likelihood
  L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} rss \right)
  rss = (y - X\beta)^T (y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N} (Y_i - x_i^T \beta)^2

If the (k x k) matrix X^T X is invertible,
  \hat\beta = (X^T X)^{-1} X^T Y
  V[\hat\beta \mid X^n] = \sigma^2 (X^T X)^{-1}
  \hat\beta \approx N(\beta, \sigma^2 (X^T X)^{-1})

Estimate of the regression function
  \hat{r}(x) = \sum_{j=1}^{k} \hat\beta_j x_j

Unbiased estimate for \sigma^2
  \hat\sigma^2 = \frac{1}{n - k} \sum_{i=1}^{n} \hat\varepsilon_i^2,  \hat\varepsilon = X\hat\beta - Y

mle
  \hat\sigma^2_{mle} = \frac{n - k}{n} \hat\sigma^2
  Under the assumption of normality, the least squares estimator is also the mle, but the least squares variance estimator is not the mle.

1 - \alpha confidence interval for \beta_j
  \hat\beta_j \pm z_{\alpha/2}\,\hat{se}(\hat\beta_j)

18.4 Model Selection

Consider predicting a new observation Y^* for covariates X^* and let S \subseteq J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
  Underfitting: too few covariates yields high bias
  Overfitting: too many covariates yields high variance

Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score

Hypothesis testing
  H_0 : \beta_j = 0 vs. H_1 : \beta_j \ne 0,  \forall j \in J

Mean squared prediction error (mspe)
  mspe = E[(\hat{Y}(S) - Y^*)^2]

Prediction risk
  R(S) = \sum_{i=1}^{n} mspe_i = \sum_{i=1}^{n} E[(\hat{Y}_i(S) - Y_i^*)^2]

Training error
  \hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2

R^2
  R^2(S) = 1 - \frac{rss(S)}{tss} = 1 - \frac{\hat{R}_{tr}(S)}{tss} = \frac{\sum_{i=1}^{n} (\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}

The training error is a downward-biased estimate of the prediction risk:
  E[\hat{R}_{tr}(S)] < R(S)
  bias(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2 \sum_{i=1}^{n} Cov[\hat{Y}_i, Y_i]

Adjusted R^2
  \bar{R}^2(S) = 1 - \frac{n - 1}{n - k} \frac{rss}{tss}

Mallows' C_p statistic
  \hat{R}(S) = \hat{R}_{tr}(S) + 2 k \hat\sigma^2 = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
  AIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k

Bayesian Information Criterion (BIC)
  BIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \frac{k}{2} \log n

Validation and training
  \hat{R}_V(S) = \sum_{i=1}^{m} (\hat{Y}_i^*(S) - Y_i^*)^2,  m = |{validation data}|, often \frac{n}{4} or \frac{n}{2}

Leave-one-out cross-validation
  \hat{R}_{CV}(S) = \sum_{i=1}^{n} (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2
  U(S) = X_S (X_S^T X_S)^{-1} X_S^T  (hat matrix)

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where f(x) satisfies P[X \in A] = \int_A f(x) \, dx.

Integrated squared error (ise)
  L(f, \hat{f}_n) = \int \left( f(x) - \hat{f}_n(x) \right)^2 dx = J(h) + \int f^2(x) \, dx

Frequentist risk
  R(f, \hat{f}_n) = E[L(f, \hat{f}_n)] = \int b^2(x) \, dx + \int v(x) \, dx
  b(x) = E[\hat{f}_n(x)] - f(x)
  v(x) = V[\hat{f}_n(x)]

19.1.1 Histograms

Definitions
  Number of bins m
  Binwidth h = 1/m
  Bin B_j has \nu_j observations
  Define \hat{p}_j = \nu_j / n and p_j = \int_{B_j} f(u) \, du

Histogram estimator
  \hat{f}_n(x) = \sum_{j=1}^{m} \frac{\hat{p}_j}{h} I(x \in B_j)
  E[\hat{f}_n(x)] = \frac{p_j}{h}
  V[\hat{f}_n(x)] = \frac{p_j (1 - p_j)}{n h^2}
  R(\hat{f}_n, f) \approx \frac{h^2}{12} \int (f'(u))^2 \, du + \frac{1}{nh}
  h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 \, du} \right)^{1/3}
  R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}},  C = \left( \frac{3}{4} \right)^{2/3} \left( \int (f'(u))^2 \, du \right)^{1/3}

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \int \hat{f}_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i)}(X_i) = \frac{2}{(n - 1)h} - \frac{n + 1}{(n - 1)h} \sum_{j=1}^{m} \hat{p}_j^2

19.1.2 Kernel Density Estimator (KDE)

Kernel K
  K(x) \ge 0
  \int K(x) \, dx = 1
  \int x K(x) \, dx = 0
  \int x^2 K(x) \, dx \equiv \sigma_K^2 > 0

KDE
  \hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right)
  R(f, \hat{f}_n) \approx \frac{1}{4} (h\sigma_K)^4 \int (f''(x))^2 \, dx + \frac{1}{nh} \int K^2(x) \, dx
  h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}},  c_1 = \sigma_K^2,  c_2 = \int K^2(x) \, dx,  c_3 = \int (f''(x))^2 \, dx
  R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}},  c_4 = \frac{5}{4} (\sigma_K^2)^{2/5} \left( \int K^2(x) \, dx \right)^{4/5} \left( \int (f'')^2 \, dx \right)^{1/5}

Epanechnikov kernel
  K(x) = \frac{3}{4\sqrt{5}} \left( 1 - \frac{x^2}{5} \right) for |x| < \sqrt{5},  0 otherwise

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \int \hat{f}_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i)}(X_i) \approx \frac{1}{h n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K^*\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)
  K^*(x) = K^{(2)}(x) - 2K(x),  K^{(2)}(x) = \int K(x - y) K(y) \, dy
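A minimal NumPy sketch of the KDE formula above with a Gaussian kernel (the helper name, the data, and the normal-reference bandwidth choice are ours):

```python
import numpy as np

def kde(x_grid, data, h,
        kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """f_n(x) = (1/n) sum_i K((x - X_i)/h) / h, evaluated on a grid."""
    data = np.asarray(data, float)
    u = (x_grid[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h

data = np.random.default_rng(5).normal(size=300)
grid = np.linspace(-4, 4, 200)
h = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)   # normal-reference bandwidth
f_hat = kde(grid, data, h)
```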

19.2 Non-parametric Regression

Estimate r(x) where r(x) = E[Y | X = x]. Consider pairs of points (x_1, Y_1), ..., (x_n, Y_n) related by
  Y_i = r(x_i) + \varepsilon_i,  E[\varepsilon_i] = 0, V[\varepsilon_i] = \sigma^2

k-nearest-neighbor estimator
  \hat{r}(x) = \frac{1}{k} \sum_{i : x_i \in N_k(x)} Y_i,  where N_k(x) = {the k values of x_1, ..., x_n closest to x}

Nadaraya-Watson kernel estimator
  \hat{r}(x) = \sum_{i=1}^{n} w_i(x) Y_i,  w_i(x) = \frac{K\!\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^{n} K\!\left( \frac{x - x_j}{h} \right)}
  R(\hat{r}_n, r) \approx \frac{h^4}{4} \left( \int x^2 K(x) \, dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \frac{\sigma^2 \int K^2(x) \, dx}{nh} \int \frac{1}{f(x)} \, dx
  h^* \approx \frac{c_1}{n^{1/5}},  R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \sum_{i=1}^{n} (Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{r}(x_i)}{1 - \frac{K(0)}{\sum_{j=1}^{n} K\left( \frac{x_i - x_j}{h} \right)}} \right)^2
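A minimal NumPy sketch of the Nadaraya-Watson estimator above with a Gaussian kernel (the helper name and the simulated sine-curve data are ours):

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """r_hat(x) = sum_i w_i(x) Y_i with Gaussian kernel weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    u = (x_grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)
    w = K / K.sum(axis=1, keepdims=True)   # w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h)
    return w @ y

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
r_hat = nadaraya_watson(np.linspace(0, 2 * np.pi, 100), x, y, h=0.3)
```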

19.3 Smoothing Using Orthogonal Functions

Approximation
  r(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x) \approx \sum_{j=1}^{J} \beta_j \phi_j(x)

Multivariate regression form
  Y = \Phi\beta + \eta,  where \eta_i = \varepsilon_i and
  \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}

Least squares estimator
  \hat\beta = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \frac{1}{n} \Phi^T Y  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
  \hat{R}_{CV}(J) = \sum_{i=1}^{n} \left( Y_i - \sum_{j=1}^{J} \phi_j(x_i) \hat\beta_{j,(-i)} \right)^2

20 Stochastic Processes

Stochastic process
  \{X_t : t \in T\},  with T = \{0, 1, \ldots\} = \mathbb{Z}^{\ge 0} (discrete) or T = [0, \infty) (continuous)
  Notation: X_t, X(t)
  State space \mathcal{X}
  Index set T

20.1 Markov Chains

Markov chain
  P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}]  \forall n \in T, x \in \mathcal{X}

Transition probabilities
  p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]
  p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]  (n-step)

Transition matrix P (n-step: P_n)
  (i, j) element is p_{ij}
  p_{ij} \ge 0 and \sum_j p_{ij} = 1

Chapman-Kolmogorov
  p_{ij}(m + n) = \sum_k p_{ik}(m)\,p_{kj}(n)
  P_{m+n} = P_m P_n
  P_n = P \cdot P \cdots P = P^n

Marginal probability
  \mu_n = (\mu_n(1), \ldots, \mu_n(N)),  where \mu_n(i) = P[X_n = i]
  \mu_0: initial distribution
  \mu_n = \mu_0 P^n
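A minimal NumPy sketch of the Chapman-Kolmogorov relation P_n = P^n and the marginal \mu_n = \mu_0 P^n for a small chain (the two-state example is ours):

```python
import numpy as np

# Two-state chain: n-step transition matrix via matrix powers, marginal via mu_0 P^n.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])            # start in state 0

Pn = np.linalg.matrix_power(P, 10)    # 10-step transition probabilities
mu10 = mu0 @ Pn                       # marginal distribution after 10 steps
print(Pn)
print(mu10)                           # approaches the stationary distribution (5/6, 1/6)
```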

20.2 Poisson Processes

Poisson process
  \{X_t : t \in [0, \infty)\} = number of events up to and including time t
  X_0 = 0
  Independent increments: \forall t_0 < \cdots < t_n :  X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}
  Intensity function \lambda(t):
    P[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)
    P[X_{t+h} - X_t = 2] = o(h)
  X_{s+t} - X_s \sim Po(m(s + t) - m(s)),  where m(t) = \int_0^t \lambda(s) \, ds

Homogeneous Poisson process
  \lambda(t) \equiv \lambda > 0 \implies X_t \sim Po(\lambda t)

Waiting times
  W_t := time at which X_t occurs
  W_t \sim Gamma\!\left( t, \frac{1}{\lambda} \right)

Interarrival times
  S_t = W_{t+1} - W_t
  S_t \sim Exp\!\left( \frac{1}{\lambda} \right)
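A minimal NumPy sketch simulating a homogeneous Poisson process via its Exp(1/\lambda) interarrival times (the helper name and parameters are ours):

```python
import numpy as np

def homogeneous_poisson(lam, t_max, rng=None):
    """Event times on [0, t_max] built from Exponential interarrivals with mean 1/lam."""
    rng = rng or np.random.default_rng()
    times, t = [], 0.0
    while True:
        t += rng.exponential(scale=1.0 / lam)   # S_t has mean 1/lam
        if t > t_max:
            return np.array(times)
        times.append(t)

events = homogeneous_poisson(lam=2.0, t_max=10.0)
print(len(events))   # X_{t_max} ~ Po(lam * t_max), so about 20 on average
```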

21 Time Series

Mean function
  \mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x) \, dx

Autocovariance function
  \gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s \mu_t
  \gamma_x(t, t) = E[(x_t - \mu_t)^2] = V[x_t]

Autocorrelation function (ACF)
  \rho(s, t) = \frac{Cov[x_s, x_t]}{\sqrt{V[x_s] V[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}

Cross-covariance function (CCV)
  \gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]

Cross-correlation function (CCF)
  \rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}

Backshift operator
  B^k(x_t) = x_{t-k}

Difference operator
  \nabla^d = (1 - B)^d

White noise
  w_t \sim wn(0, \sigma_w^2)
  Gaussian: w_t iid ~ N(0, \sigma_w^2)
  E[w_t] = 0 \ \forall t \in T
  V[w_t] = \sigma_w^2 \ \forall t \in T
  \gamma_w(s, t) = 0 \ \forall s \ne t

Random walk
  Drift \delta
  x_t = \delta t + \sum_{j=1}^{t} w_j
  E[x_t] = \delta t

Symmetric moving average
  m_t = \sum_{j=-k}^{k} a_j x_{t-j},  where a_j = a_{-j} \ge 0 and \sum_{j=-k}^{k} a_j = 1

21.1 Stationary Time Series

Strictly stationary
  P[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = P[x_{t_1 + h} \le c_1, \ldots, x_{t_k + h} \le c_k]  \forall k \in \mathbb{N}, t_k, c_k, h \in \mathbb{Z}

Weakly stationary
  E[x_t^2] < \infty \ \forall t \in \mathbb{Z}
  E[x_t] = m \ \forall t \in \mathbb{Z}
  \gamma_x(s, t) = \gamma_x(s + r, t + r) \ \forall r, s, t \in \mathbb{Z}

Autocovariance function
  \gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)]
  \gamma(0) = E[(x_t - \mu)^2]
  \gamma(0) \ge 0,  \gamma(0) \ge |\gamma(h)|,  \gamma(h) = \gamma(-h)

Autocorrelation function (ACF)
  \rho_x(h) = \frac{Cov[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}] V[x_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}

Jointly stationary time series
  \gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]
  \rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}

Linear process
  x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j},  where \sum_{j=-\infty}^{\infty} |\psi_j| < \infty
  \gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j

21.2 Estimation of Correlation

Sample mean
  \bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t

Variance of the sample mean
  V[\bar{x}] = \frac{1}{n} \sum_{h=-n}^{n} \left( 1 - \frac{|h|}{n} \right) \gamma_x(h)

Sample autocovariance function
  \hat\gamma(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x})

Sample autocorrelation function
  \hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}

Sample cross-covariance function
  \hat\gamma_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})

Sample cross-correlation function
  \hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\hat\gamma_y(0)}}

Properties
  \sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}  if x_t is white noise
  \sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}  if x_t or y_t is white noise
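A minimal NumPy sketch of the sample autocovariance and autocorrelation above (the helper name and white-noise example are ours):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample ACF rho_hat(h) = gamma_hat(h) / gamma_hat(0) for h = 0..max_lag."""
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / n
    acf = [1.0]
    for h in range(1, max_lag + 1):
        gamma_h = np.sum((x[h:] - xbar) * (x[:-h] - xbar)) / n
        acf.append(gamma_h / gamma0)
    return np.array(acf)

w = np.random.default_rng(7).normal(size=500)   # white noise
print(sample_acf(w, 5))   # lags >= 1 should lie within about +/- 1/sqrt(n) of zero
```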

21.3 Non-Stationary Time Series

Classical decomposition model
  x_t = \mu_t + s_t + w_t
  \mu_t = trend,  s_t = seasonal component,  w_t = random noise term

21.3.1 Detrending

Least squares
  1. Choose a trend model, e.g., \mu_t = \beta_0 + \beta_1 t + \beta_2 t^2
  2. Minimize rss to obtain the trend estimate \hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2
  3. The residuals then approximate the noise w_t

Moving average
  The low-pass filter v_t is a symmetric moving average m_t with a_j = \frac{1}{2k + 1}:
  v_t = \frac{1}{2k + 1} \sum_{i=-k}^{k} x_{t-i}
  If \frac{1}{2k + 1} \sum_{i=-k}^{k} w_{t-i} \approx 0, a linear trend function \mu_t = \beta_0 + \beta_1 t passes without distortion

Differencing
  \mu_t = \beta_0 + \beta_1 t \implies \nabla x_t = \beta_1 + w_t - w_{t-1}

21.4 ARIMA models

Autoregressive polynomial
  \phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p,  z \in \mathbb{C}, \phi_p \ne 0

Autoregressive operator
  \phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p

Autoregressive model of order p, AR(p)
  x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t

AR(1)
  x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1} \phi^j w_{t-j}  \to  \sum_{j=0}^{\infty} \phi^j w_{t-j}  as k \to \infty when |\phi| < 1  (linear process)
  E[x_t] = \sum_{j=0}^{\infty} \phi^j E[w_{t-j}] = 0
  \gamma(h) = Cov[x_{t+h}, x_t] = \frac{\sigma_w^2 \phi^h}{1 - \phi^2}
  \rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h,  \rho(h) = \phi\,\rho(h - 1),  h = 1, 2, \ldots

Moving average polynomial
  \theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,  z \in \mathbb{C}, \theta_q \ne 0

Moving average operator
  \theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q

MA(q) (moving average model of order q)
  x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t
  E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0
  \gamma(h) = Cov[x_{t+h}, x_t] = \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h}  for 0 \le h \le q,  0 for h > q

MA(1)
  x_t = w_t + \theta w_{t-1}
  \gamma(h) = (1 + \theta^2)\sigma_w^2 (h = 0),  \theta\sigma_w^2 (h = 1),  0 (h > 1)
  \rho(h) = \frac{\theta}{1 + \theta^2} (h = 1),  0 (h > 1)

ARMA(p, q)
  x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}
  \phi(B) x_t = \theta(B) w_t

Partial autocorrelation function (PACF)
  x_i^{h-1}: regression of x_i on {x_{h-1}, x_{h-2}, \ldots, x_1}
  \phi_{hh} = corr(x_h - x_h^{h-1}, x_0 - x_0^{h-1}),  h \ge 2
  E.g., \phi_{11} = corr(x_1, x_0) = \rho(1)

ARIMA(p, d, q)
  \nabla^d x_t = (1 - B)^d x_t is ARMA(p, q)
  \phi(B)(1 - B)^d x_t = \theta(B) w_t

Exponentially Weighted Moving Average (EWMA)
  x_t = x_{t-1} + w_t - \lambda w_{t-1}
  x_t = \sum_{j=1}^{\infty} (1 - \lambda)\lambda^{j-1} x_{t-j} + w_t  when |\lambda| < 1
  \tilde{x}_{n+1} = (1 - \lambda) x_n + \lambda \tilde{x}_n

Seasonal ARIMA
  Denoted by ARIMA(p, d, q) x (P, D, Q)_s
  \Phi_P(B^s)\,\phi(B)\,\nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) iff there exist constants {\psi_j} with \sum_{j=0}^{\infty} |\psi_j| < \infty such that
  x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t

ARMA(p, q) is invertible iff there exist constants {\pi_j} with \sum_{j=0}^{\infty} |\pi_j| < \infty such that
  \pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t

Properties
  ARMA(p, q) causal \iff the roots of \phi(z) lie outside the unit circle
    \psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)},  |z| \le 1
  ARMA(p, q) invertible \iff the roots of \theta(z) lie outside the unit circle
    \pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)},  |z| \le 1

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
  x_t = A\cos(2\pi\omega t + \phi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)
  Frequency index \omega (cycles per unit time), period 1/\omega
  Amplitude A
  Phase \phi
  U_1 = A\cos\phi and U_2 = A\sin\phi, often normally distributed rvs

Periodic mixture
  x_t = \sum_{k=1}^{q} \left( U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t) \right)
  U_{k1}, U_{k2}, for k = 1, ..., q, are independent zero-mean rvs with variances \sigma_k^2
  \gamma(h) = \sum_{k=1}^{q} \sigma_k^2 \cos(2\pi\omega_k h)
  \gamma(0) = E[x_t^2] = \sum_{k=1}^{q} \sigma_k^2

Spectral representation of a periodic process
  \gamma(h) = \sigma^2\cos(2\pi\omega_0 h) = \frac{\sigma^2}{2} e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2} e^{2\pi i\omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i\omega h} \, dF(\omega)

Spectral distribution function
  F(\omega) = 0 (\omega < -\omega_0),  \sigma^2/2 (-\omega_0 \le \omega < \omega_0),  \sigma^2 (\omega \ge \omega_0)
  F(-\infty) = F(-1/2) = 0
  F(\infty) = F(1/2) = \gamma(0)

Spectral density
  f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i\omega h},  -\frac{1}{2} \le \omega \le \frac{1}{2}
  Needs \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i\omega h} f(\omega) \, d\omega,  h = 0, \pm 1, \ldots
  f(\omega) \ge 0
  f(\omega) = f(-\omega)
  f(\omega) = f(1 - \omega)
  \gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega) \, d\omega
  White noise: f_w(\omega) = \sigma_w^2
  ARMA(p, q), \phi(B)x_t = \theta(B)w_t:
    f_x(\omega) = \sigma_w^2 \frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2},
    where \phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k and \theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k

Discrete Fourier Transform (DFT)
  d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i\omega_j t}

Fourier/fundamental frequencies
  \omega_j = j/n

Inverse DFT
  x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i\omega_j t}

Periodogram
  I(j/n) = |d(j/n)|^2

Scaled periodogram
  P(j/n) = \frac{4}{n} I(j/n) = \left( \frac{2}{n} \sum_{t=1}^{n} x_t \cos(2\pi t j/n) \right)^2 + \left( \frac{2}{n} \sum_{t=1}^{n} x_t \sin(2\pi t j/n) \right)^2

22 Math

22.1 Gamma Function

  Ordinary: \Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t} \, dt
  Upper incomplete: \Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t} \, dt
  Lower incomplete: \gamma(s, x) = \int_0^{x} t^{s-1} e^{-t} \, dt
  \Gamma(\alpha + 1) = \alpha\Gamma(\alpha),  \alpha > 1
  \Gamma(n) = (n - 1)!,  n \in \mathbb{N}
  \Gamma(0) = \Gamma(-1) = \infty
  \Gamma(1/2) = \sqrt{\pi},  \Gamma(-1/2) = -2\sqrt{\pi}

22.2 Beta Function

  Ordinary: B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1 - t)^{y-1} \, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}
  Incomplete: B(x; a, b) = \int_0^x t^{a-1}(1 - t)^{b-1} \, dt
  Regularized incomplete:
    I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \frac{(a + b - 1)!}{j!(a + b - 1 - j)!} x^j (1 - x)^{a+b-1-j}
  I_0(a, b) = 0,  I_1(a, b) = 1
  I_x(a, b) = 1 - I_{1-x}(b, a)

22.3 Series

Finite
  \sum_{k=1}^{n} k = \frac{n(n + 1)}{2}
  \sum_{k=1}^{n} (2k - 1) = n^2
  \sum_{k=1}^{n} k^2 = \frac{n(n + 1)(2n + 1)}{6}
  \sum_{k=1}^{n} k^3 = \left( \frac{n(n + 1)}{2} \right)^2
  \sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1},  c \ne 1

Binomial
  Binomial theorem: \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = (a + b)^n
  \sum_{k=0}^{n} \binom{n}{k} = 2^n
  \sum_{k=0}^{n} \binom{r + k}{k} = \binom{r + n + 1}{n}
  \sum_{k=0}^{n} \binom{k}{m} = \binom{n + 1}{m + 1}
  Vandermonde's identity: \sum_{k=0}^{r} \binom{m}{k}\binom{n}{r - k} = \binom{m + n}{r}

Infinite
  \sum_{k=0}^{\infty} p^k = \frac{1}{1 - p},  \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p},  |p| < 1
  \sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp}\!\left( \sum_{k=0}^{\infty} p^k \right) = \frac{d}{dp}\!\left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2},  |p| < 1
  \sum_{k=0}^{\infty} \binom{r + k - 1}{k} x^k = (1 - x)^{-r},  r \in \mathbb{N}^+
  \sum_{k=0}^{\infty} \binom{\alpha}{k} p^k = (1 + p)^{\alpha},  |p| < 1, \alpha \in \mathbb{C}

22.4 Combinatorics

Sampling k out of n
                      ordered                                              unordered
  w/o replacement     n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}    \binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}
  w/ replacement      n^k                                                  \binom{n - 1 + r}{r} = \binom{n - 1 + r}{n - 1}

Stirling numbers, 2nd kind
  S(n, k) = k\,S(n - 1, k) + S(n - 1, k - 1),  1 \le k \le n
  S(n, 0) = 1 if n = 0,  0 otherwise

Partitions
  P_{n+k,k} = \sum_{i=1}^{n} P_{n,i},  P_{n,k} = 0 for k > n,  P_{n,0} = 0 for n \ge 1,  P_{0,0} = 1

Balls and Urns  (|B| = n balls, |U| = m urns, map f : B -> U; D = distinguishable, ~D = indistinguishable)

                    f arbitrary                 f injective                       f surjective         f bijective
  B: D,  U: D      m^n                          m^{\underline{n}} if m >= n, else 0   m!\,S(n, m)          n! if m = n, else 0
  B: ~D, U: D      \binom{m + n - 1}{n}         \binom{m}{n} if m >= n, else 0        \binom{n - 1}{m - 1}  1 if m = n, else 0
  B: D,  U: ~D     \sum_{k=1}^{m} S(n, k)       1 if m >= n, else 0                   S(n, m)              1 if m = n, else 0
  B: ~D, U: ~D     \sum_{k=1}^{m} P_{n,k}       1 if m >= n, else 0                   P_{n,m}              1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.


Univariate distribution relationships, courtesy Leemis and McQueston [2].
