
# Probability and Statistics Cookbook

vallentin@icir.org
13th August, 2012
This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To
Contents
1 Distribution Overview
1.1 Discrete Distributions
1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
9.1 Standard Bivariate Normal
9.2 Bivariate Normal
9.3 Multivariate Normal
10 Convergence
10.1 Law of Large Numbers (LLN)
10.2 Central Limit Theorem (CLT)
11 Statistical Inference
11.1 Point Estimation
11.2 Normal-Based Confidence Interval
11.3 Empirical distribution
11.4 Statistical Functionals
12 Parametric Inference
12.1 Method of Moments
12.2 Maximum Likelihood
12.2.1 Delta Method
12.3 Multiparameter Models
12.3.1 Multiparameter delta method
12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
14.1 Credible Intervals
14.2 Function of parameters
14.3 Priors
14.3.1 Conjugate Priors
14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
16.1 The Bootstrap
16.1.1 Bootstrap Confidence Intervals
16.2 Rejection Sampling
16.3 Importance Sampling
17 Decision Theory
17.1 Risk
17.2 Admissibility
17.3 Bayes Rule
17.4 Minimax Rules
18 Linear Regression
18.1 Simple Linear Regression
18.2 Prediction
18.3 Multiple Regression
18.4 Model Selection
19 Non-parametric Function Estimation
19.1 Density Estimation
19.1.1 Histograms
19.1.2 Kernel Density Estimator (KDE)
19.2 Non-parametric Regression
19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
20.1 Markov Chains
20.2 Poisson Processes
21 Time Series
21.1 Stationary Time Series
21.2 Estimation of Correlation
21.3 Non-Stationary Time Series
21.3.1 Detrending
21.4 ARIMA models
21.4.1 Causality and Invertibility
21.5 Spectral Analysis
22 Math
22.1 Gamma Function
22.2 Beta Function
22.3 Series
22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
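| Notation¹ | $F_X(x)$ | $f_X(x)$ | $E[X]$ | $V[X]$ | $M_X(s)$ |
|---|---|---|---|---|---|
| Uniform $\text{Unif}\{a, \ldots, b\}$ | $\begin{cases} 0 & x < a \\ \frac{\lfloor x \rfloor - a + 1}{b - a + 1} & a \le x \le b \\ 1 & x > b \end{cases}$ | $\frac{I(a \le x \le b)}{b - a + 1}$ | $\frac{a + b}{2}$ | $\frac{(b - a + 1)^2 - 1}{12}$ | $\frac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}$ |
| Bernoulli $\text{Bern}(p)$ | $(1 - p)^{1 - x}$ | $p^x (1 - p)^{1 - x}$ | $p$ | $p(1 - p)$ | $1 - p + p e^s$ |
| Binomial $\text{Bin}(n, p)$ | $I_{1-p}(n - x, x + 1)$ | $\binom{n}{x} p^x (1 - p)^{n - x}$ | $np$ | $np(1 - p)$ | $(1 - p + p e^s)^n$ |
| Multinomial $\text{Mult}(n, p)$ | | $\frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^k x_i = n$ | $n p_i$ | $n p_i (1 - p_i)$ | $\left( \sum_{i=1}^k p_i e^{s_i} \right)^n$ |
| Hypergeometric $\text{Hyp}(N, m, n)$ | $\approx \Phi\left( \frac{x - np}{\sqrt{np(1 - p)}} \right)$ with $p = m/N$ | $\frac{\binom{m}{x} \binom{N - m}{n - x}}{\binom{N}{n}}$ | $\frac{nm}{N}$ | $\frac{nm(N - n)(N - m)}{N^2 (N - 1)}$ | |
| Negative Binomial $\text{NBin}(r, p)$ | $I_p(r, x + 1)$ | $\binom{x + r - 1}{r - 1} p^r (1 - p)^x$ | $r \frac{1 - p}{p}$ | $r \frac{1 - p}{p^2}$ | $\left( \frac{p}{1 - (1 - p) e^s} \right)^r$ |
| Geometric $\text{Geo}(p)$ | $1 - (1 - p)^x$, $x \in \mathbb{N}^+$ | $p (1 - p)^{x - 1}$, $x \in \mathbb{N}^+$ | $\frac{1}{p}$ | $\frac{1 - p}{p^2}$ | $\frac{p e^s}{1 - (1 - p) e^s}$ |
| Poisson $\text{Po}(\lambda)$ | $e^{-\lambda} \sum_{i=0}^{\lfloor x \rfloor} \frac{\lambda^i}{i!}$ | $\frac{\lambda^x e^{-\lambda}}{x!}$ | $\lambda$ | $\lambda$ | $e^{\lambda (e^s - 1)}$ |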
[Figure: PMF plots (PMF against x) in four panels: Uniform (discrete) on a, ..., b with height 1/n; Binomial for (n = 40, p = 0.3), (n = 30, p = 0.6), (n = 25, p = 0.9); Geometric for p = 0.2, 0.5, 0.8; Poisson for λ = 1, 4, 10.]
¹ We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see §22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
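| Notation | $F_X(x)$ | $f_X(x)$ | $E[X]$ | $V[X]$ | $M_X(s)$ |
|---|---|---|---|---|---|
| Uniform $\text{Unif}(a, b)$ | $\begin{cases} 0 & x < a \\ \frac{x - a}{b - a} & a < x < b \\ 1 & x > b \end{cases}$ | $\frac{I(a < x < b)}{b - a}$ | $\frac{a + b}{2}$ | $\frac{(b - a)^2}{12}$ | $\frac{e^{sb} - e^{sa}}{s(b - a)}$ |
| Normal $N(\mu, \sigma^2)$ | $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$ | $\phi(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$ | $\mu$ | $\sigma^2$ | $\exp\left( \mu s + \frac{\sigma^2 s^2}{2} \right)$ |
| Log-Normal $\ln N(\mu, \sigma^2)$ | $\frac{1}{2} + \frac{1}{2} \text{erf}\left[ \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right]$ | $\frac{1}{x \sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)$ | $e^{\mu + \sigma^2/2}$ | $(e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$ | |
| Multivariate Normal $\text{MVN}(\mu, \Sigma)$ | | $(2\pi)^{-k/2} \lvert\Sigma\rvert^{-1/2} e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}$ | $\mu$ | $\Sigma$ | $\exp\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)$ |
| Student's t $\text{Student}(\nu)$ | $I_x\left( \frac{\nu}{2}, \frac{\nu}{2} \right)$ | $\frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu + 1)/2}$ | $0$ | $\begin{cases} \frac{\nu}{\nu - 2} & \nu > 2 \\ \infty & 1 < \nu \le 2 \end{cases}$ | |
| Chi-square $\chi^2_k$ | $\frac{1}{\Gamma(k/2)} \gamma\left( \frac{k}{2}, \frac{x}{2} \right)$ | $\frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$ | $k$ | $2k$ | $(1 - 2s)^{-k/2}$, $s < 1/2$ |
| F $\text{F}(d_1, d_2)$ | $I_{\frac{d_1 x}{d_1 x + d_2}}\left( \frac{d_1}{2}, \frac{d_2}{2} \right)$ | $\frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\, B\left( \frac{d_1}{2}, \frac{d_2}{2} \right)}$ | $\frac{d_2}{d_2 - 2}$ | $\frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$ | |
| Exponential $\text{Exp}(\beta)$ | $1 - e^{-x/\beta}$ | $\frac{1}{\beta} e^{-x/\beta}$ | $\beta$ | $\beta^2$ | $\frac{1}{1 - \beta s}$ $(s < 1/\beta)$ |
| Gamma $\text{Gamma}(\alpha, \beta)$ | $\frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$ | $\frac{1}{\Gamma(\alpha) \beta^\alpha} x^{\alpha - 1} e^{-x/\beta}$ | $\alpha\beta$ | $\alpha\beta^2$ | $\left( \frac{1}{1 - \beta s} \right)^\alpha$ $(s < 1/\beta)$ |
| Inverse Gamma $\text{InvGamma}(\alpha, \beta)$ | $\frac{\Gamma\left( \alpha, \frac{\beta}{x} \right)}{\Gamma(\alpha)}$ | $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}$ | $\frac{\beta}{\alpha - 1}$, $\alpha > 1$ | $\frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$ | $\frac{2 (-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\left( \sqrt{-4\beta s} \right)$ |
| Dirichlet $\text{Dir}(\alpha)$ | | $\frac{\Gamma\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}$ | $\frac{\alpha_i}{\sum_{i=1}^k \alpha_i}$ | $\frac{E[X_i] (1 - E[X_i])}{\sum_{i=1}^k \alpha_i + 1}$ | |
| Beta $\text{Beta}(\alpha, \beta)$ | $I_x(\alpha, \beta)$ | $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$ | $\frac{\alpha}{\alpha + \beta}$ | $\frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$ | $1 + \sum_{k=1}^\infty \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^k}{k!}$ |
| Weibull $\text{Weibull}(\lambda, k)$ | $1 - e^{-(x/\lambda)^k}$ | $\frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k - 1} e^{-(x/\lambda)^k}$ | $\lambda \Gamma\left( 1 + \frac{1}{k} \right)$ | $\lambda^2 \Gamma\left( 1 + \frac{2}{k} \right) - \mu^2$ | $\sum_{n=0}^\infty \frac{s^n \lambda^n}{n!} \Gamma\left( 1 + \frac{n}{k} \right)$ |
| Pareto $\text{Pareto}(x_m, \alpha)$ | $1 - \left( \frac{x_m}{x} \right)^\alpha$, $x \ge x_m$ | $\alpha \frac{x_m^\alpha}{x^{\alpha + 1}}$, $x \ge x_m$ | $\frac{\alpha x_m}{\alpha - 1}$, $\alpha > 1$ | $\frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)}$, $\alpha > 2$ | $\alpha (-x_m s)^\alpha \Gamma(-\alpha, -x_m s)$, $s < 0$ |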
[Figure: PDF plots (PDF against x) in twelve panels: Uniform (continuous) on (a, b) with height 1/(b − a); Normal for several (μ, σ²); Log-Normal for several (μ, σ²); Student's t for several ν; χ² for k = 1, ..., 5; F for several (d₁, d₂); Exponential for β = 2, 1, 0.4; Gamma for several (α, β); Inverse Gamma for several (α, β); Beta for several (α, β); Weibull for λ = 1 and k = 0.5, 1, 1.5, 5; Pareto for x_m = 1 and α = 1, 2, 4.]
2 Probability Theory
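Definitions
- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
- Probability distribution $P$
  1. $P[A] \ge 0 \quad \forall A$
  2. $P[\Omega] = 1$
  3. $P\left[ \bigsqcup_{i=1}^\infty A_i \right] = \sum_{i=1}^\infty P[A_i]$ (disjoint union)
- Probability space $(\Omega, \mathcal{A}, P)$

Properties
- $P[\emptyset] = 0$
- $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
- $P[\neg A] = 1 - P[A]$
- $P[B] = P[A \cap B] + P[\neg A \cap B]$
- $P[\Omega] = 1, \quad P[\emptyset] = 0$
- $\neg\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n, \quad \neg\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n$ (DeMorgan)
- $P\left[ \bigcup_n A_n \right] = 1 - P\left[ \bigcap_n \neg A_n \right]$
- $P[A \cup B] = P[A] + P[B] - P[A \cap B] \implies P[A \cup B] \le P[A] + P[B]$
- $P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]$
- $P[A \cap \neg B] = P[A] - P[A \cap B]$

Continuity of Probabilities
- $A_1 \subset A_2 \subset \ldots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcup_{i=1}^\infty A_i$
- $A_1 \supset A_2 \supset \ldots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcap_{i=1}^\infty A_i$

Independence
$A \perp\!\!\!\perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability
$P[A \mid B] = \frac{P[A \cap B]}{P[B]}, \quad P[B] > 0$

Law of Total Probability
$P[B] = \sum_{i=1}^n P[B \mid A_i]\, P[A_i], \quad \Omega = \bigsqcup_{i=1}^n A_i$

Bayes' Theorem
$P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^n P[B \mid A_j]\, P[A_j]}, \quad \Omega = \bigsqcup_{i=1}^n A_i$

Inclusion-Exclusion Principle
$\left| \bigcup_{i=1}^n A_i \right| = \sum_{r=1}^n (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le n} \left| \bigcap_{j=1}^r A_{i_j} \right|$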
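As an illustration of the law of total probability and Bayes' theorem, the following minimal Python sketch computes posterior probabilities over a two-set partition. The function name `posterior` and all numbers are hypothetical, chosen only for the example.

```python
def posterior(priors, likelihoods):
    """Return P[A_i | B] for a partition A_1, ..., A_n via Bayes' theorem."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P[B | A_i] P[A_i]
    total = sum(joint)  # law of total probability: P[B]
    return [j / total for j in joint]

# Hypothetical numbers: P[A_1] = 0.01, P[B | A_1] = 0.95, P[B | A_2] = 0.10.
priors = [0.01, 0.99]
likelihoods = [0.95, 0.10]
post = posterior(priors, likelihoods)
print(post)  # P[A_1 | B] and P[A_2 | B]
```

Even with a high $P[B \mid A_1]$, the small prior keeps $P[A_1 \mid B]$ below 10% in this example.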

3 Random Variables
Random Variable (RV)
$X : \Omega \to \mathbb{R}$

Probability Mass Function (PMF)
$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)
$P[a \le X \le b] = \int_a^b f(x)\,dx$

Cumulative Distribution Function (CDF)
$F_X : \mathbb{R} \to [0, 1], \quad F_X(x) = P[X \le x]$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density
$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\,dy, \quad a \le b$
$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence
1. $P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$
3.1 Transformations
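Transformation function
$Z = \varphi(X)$

Discrete
$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous
$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\,dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone
$f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x) \frac{1}{|J|}$

The Rule of the Lazy Statistician
$E[Z] = \int \varphi(x)\,dF_X(x)$
$E[I_A(X)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution
- $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\,dx$
- $Z := |X - Y|$: $f_Z(z) = 2 \int_0^\infty f_{X,Y}(x, z + x)\,dx$
- $Z := \frac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(yz, y)\,dy$, which for $X \perp\!\!\!\perp Y$ equals $\int_{-\infty}^{\infty} |y|\, f_X(yz)\, f_Y(y)\,dy$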
4 Expectation
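Definition and properties
- $E[X] = \mu_X = \int x\,dF_X(x) = \begin{cases} \sum_x x f_X(x) & X \text{ discrete} \\ \int x f_X(x)\,dx & X \text{ continuous} \end{cases}$
- $P[X = c] = 1 \implies E[X] = c$
- $E[cX] = c\, E[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = \iint xy\, f_{X,Y}(x, y)\,dx\,dy$
- $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen inequality)
- $P[X < Y] = 0 \implies E[X] \ge E[Y]$ and $P[X = Y] = 1 \implies E[X] = E[Y]$
- $E[X] = \sum_{x=1}^\infty P[X \ge x]$ for $X$ taking values in $\{0, 1, 2, \ldots\}$

Sample mean
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$

Conditional expectation
- $E[Y \mid X = x] = \int y f(y \mid x)\,dy$
- $E[X] = E[E[X \mid Y]]$
- $E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y) f_{Y|X}(y \mid x)\,dy$
- $E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
- $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
- $E[\varphi(X) Y \mid X] = \varphi(X)\, E[Y \mid X]$
- $E[Y \mid X] = c \implies \text{Cov}[X, Y] = 0$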
5 Variance
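Definition and properties
- $V[X] = \sigma_X^2 = E\left[ (X - E[X])^2 \right] = E[X^2] - E[X]^2$
- $V\left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n V[X_i] + 2 \sum_{i < j} \text{Cov}[X_i, X_j]$
- $V\left[ \sum_{i=1}^n X_i \right] = \sum_{i=1}^n V[X_i]$ if $X_i \perp\!\!\!\perp X_j$ for all $i \ne j$

Standard deviation
$\text{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance
- $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\, E[Y]$
- $\text{Cov}[X, a] = 0$
- $\text{Cov}[X, X] = V[X]$
- $\text{Cov}[X, Y] = \text{Cov}[Y, X]$
- $\text{Cov}[aX, bY] = ab\, \text{Cov}[X, Y]$
- $\text{Cov}[X + a, Y + b] = \text{Cov}[X, Y]$
- $\text{Cov}\left[ \sum_{i=1}^n X_i, \sum_{j=1}^m Y_j \right] = \sum_{i=1}^n \sum_{j=1}^m \text{Cov}[X_i, Y_j]$

Correlation
$\rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{V[X]\, V[Y]}}$

Independence
$X \perp\!\!\!\perp Y \implies \rho[X, Y] = 0 \iff \text{Cov}[X, Y] = 0 \iff E[XY] = E[X]\, E[Y]$

Sample variance
$S^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$

Conditional variance
- $V[Y \mid X] = E\left[ (Y - E[Y \mid X])^2 \mid X \right] = E[Y^2 \mid X] - E[Y \mid X]^2$
- $V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$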
6 Inequalities
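Cauchy-Schwarz
$E[XY]^2 \le E[X^2]\, E[Y^2]$

Markov
$P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$ for $\varphi \ge 0$ and $t > 0$

Chebyshev
$P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$

Chernoff
$P[X \ge (1 + \delta)\mu] \le \left( \frac{e^\delta}{(1 + \delta)^{1 + \delta}} \right)^\mu, \quad \delta > 0$, for $X$ a sum of independent Bernoulli RVs with $\mu = E[X]$

Jensen
$E[\varphi(X)] \ge \varphi(E[X])$ for $\varphi$ convex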
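The Markov and Chebyshev bounds can be checked empirically. The sketch below is only an illustration under assumed settings: it draws from an exponential distribution with mean 1 and variance 1 (an arbitrary choice) and compares tail frequencies against the two bounds.

```python
import random

random.seed(0)
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]  # Exp with E[X] = 1, V[X] = 1

# Markov with phi(x) = x and t = 3: P[X >= t] <= E[X] / t
t = 3.0
markov_lhs = sum(x >= t for x in xs) / n
markov_rhs = 1.0 / t

# Chebyshev with t = 2: P[|X - E[X]| >= t] <= V[X] / t^2
cheb_lhs = sum(abs(x - 1.0) >= 2.0 for x in xs) / n
cheb_rhs = 1.0 / 2.0 ** 2

print(markov_lhs, "<=", markov_rhs)
print(cheb_lhs, "<=", cheb_rhs)
```

Both bounds hold with a wide margin here, which reflects how loose they can be for a specific distribution.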
7 Distribution Relationships
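Binomial
- $X_i \sim \text{Bern}(p) \implies \sum_{i=1}^n X_i \sim \text{Bin}(n, p)$
- $X \sim \text{Bin}(n, p),\ Y \sim \text{Bin}(m, p) \implies X + Y \sim \text{Bin}(n + m, p)$
- $\lim_{n \to \infty} \text{Bin}(n, p) = \text{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n \to \infty} \text{Bin}(n, p) = N(np, np(1 - p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \text{NBin}(1, p) = \text{Geo}(p)$
- $X \sim \text{NBin}(r, p) = \sum_{i=1}^r \text{Geo}(p)$
- $X_i \sim \text{NBin}(r_i, p) \implies \sum X_i \sim \text{NBin}\left( \sum r_i, p \right)$
- $X \sim \text{NBin}(r, p),\ Y \sim \text{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^n X_i \sim \text{Po}\left( \sum_{i=1}^n \lambda_i \right)$
- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies X_i \,\Big|\, \sum_{j=1}^n X_j \sim \text{Bin}\left( \sum_{j=1}^n X_j, \frac{\lambda_i}{\sum_{j=1}^n \lambda_j} \right)$

Exponential
- $X_i \sim \text{Exp}(\beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^n X_i \sim \text{Gamma}(n, \beta)$
- Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
- $X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1)$
- $X \sim N(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim N(a\mu + b, a^2\sigma^2)$
- $X \sim N(\mu_1, \sigma_1^2) \wedge Y \sim N(\mu_2, \sigma_2^2) \implies X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim N(\mu_i, \sigma_i^2) \implies \sum_i X_i \sim N\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
- $P[a < X \le b] = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right)$
- $\Phi(-x) = 1 - \Phi(x), \quad \phi'(x) = -x\phi(x), \quad \phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $N(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

Gamma
- $X \sim \text{Gamma}(\alpha, \beta) \iff X/\beta \sim \text{Gamma}(\alpha, 1)$
- $\text{Gamma}(\alpha, \beta) \sim \sum_{i=1}^\alpha \text{Exp}(\beta)$ (integer $\alpha$)
- $X_i \sim \text{Gamma}(\alpha_i, \beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_i X_i \sim \text{Gamma}\left( \sum_i \alpha_i, \beta \right)$
- $\frac{\Gamma(\alpha)}{\lambda^\alpha} = \int_0^\infty x^{\alpha - 1} e^{-\lambda x}\,dx$

Beta
- $\frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
- $E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]$
- $\text{Beta}(1, 1) \sim \text{Unif}(0, 1)$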
8 Probability and Moment Generating Functions
- $G_X(t) = E[t^X], \quad |t| < 1$
- $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\left[ \sum_{i=0}^\infty \frac{(Xt)^i}{i!} \right] = \sum_{i=0}^\infty \frac{E[X^i]}{i!} t^i$
- $P[X = 0] = G_X(0)$
- $P[X = 1] = G_X'(0)$
- $P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
- $E[X] = G_X'(1^-)$
- $E[X^k] = M_X^{(k)}(0)$
- $E\left[ \frac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)$
- $V[X] = G_X''(1^-) + G_X'(1^-) - (G_X'(1^-))^2$
- $G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$
9 Multivariate Distributions
9.1 Standard Bivariate Normal
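Let $X, Z \sim N(0, 1)$ with $X \perp\!\!\!\perp Z$, and let $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left( -\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)} \right)$

Conditionals
$(Y \mid X = x) \sim N(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim N(\rho y, 1 - \rho^2)$

Independence
$X \perp\!\!\!\perp Y \iff \rho = 0$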
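The construction $Y = \rho X + \sqrt{1 - \rho^2} Z$ also gives a simple way to simulate a standard bivariate normal pair. The following sketch assumes a hypothetical $\rho = 0.6$ and checks that the sample correlation lands near it.

```python
import math
import random

random.seed(1)
rho = 0.6  # hypothetical correlation, chosen for illustration
n = 50_000

xs, ys = [], []
for _ in range(n):
    x, z = random.gauss(0, 1), random.gauss(0, 1)  # independent N(0, 1) draws
    xs.append(x)
    ys.append(rho * x + math.sqrt(1 - rho ** 2) * z)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
vx = sum((x - mx) ** 2 for x in xs) / (n - 1)
vy = sum((y - my) ** 2 for y in ys) / (n - 1)
rho_hat = cov / math.sqrt(vx * vy)  # sample correlation
print(rho_hat, vy)
```

The sample variance of `ys` should also be close to 1, since $\rho^2 + (1 - \rho^2) = 1$.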
9.2 Bivariate Normal
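Let $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$.
$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left( -\frac{z}{2(1 - \rho^2)} \right)$
$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right) \left( \frac{y - \mu_y}{\sigma_y} \right)$

Conditional mean and variance
$E[X \mid Y] = E[X] + \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y])$
$V[X \mid Y] = \sigma_X^2 (1 - \rho^2)$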
9.3 Multivariate Normal
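Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \text{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \text{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim N(\mu, \Sigma)$,
$f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$

Properties
- $Z \sim N(0, I) \wedge X = \mu + \Sigma^{1/2} Z \implies X \sim N(\mu, \Sigma)$
- $X \sim N(\mu, \Sigma) \implies \Sigma^{-1/2} (X - \mu) \sim N(0, I)$
- $X \sim N(\mu, \Sigma) \implies AX \sim N(A\mu, A \Sigma A^T)$
- $X \sim N(\mu, \Sigma) \wedge a \in \mathbb{R}^k \implies a^T X \sim N(a^T \mu, a^T \Sigma a)$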
10 Convergence
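Let $\{X_1, X_2, \ldots\}$ be a sequence of RVs and let $X$ be another RV. Let $F_n$ denote the CDF of $X_n$ and let $F$ denote the CDF of $X$.

Types of convergence
1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$
   $\lim_{n \to \infty} F_n(t) = F(t) \quad \forall t$ where $F$ is continuous
2. In probability: $X_n \xrightarrow{P} X$
   $(\forall \varepsilon > 0) \lim_{n \to \infty} P[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \xrightarrow{as} X$
   $P\left[ \lim_{n \to \infty} X_n = X \right] = P\left[ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right] = 1$
4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$
   $\lim_{n \to \infty} E\left[ (X_n - X)^2 \right] = 0$

Relationships
- $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
- $X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{D} X \wedge (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
- $X_n \xrightarrow{qm} X \wedge Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n Y_n \xrightarrow{P} XY$
- $X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$ for continuous $\varphi$
- $X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$ for continuous $\varphi$
- $X_n \xrightarrow{qm} b \iff \lim_{n \to \infty} E[X_n] = b \wedge \lim_{n \to \infty} V[X_n] = 0$
- $X_1, \ldots, X_n$ iid $\wedge\ E[X] = \mu \wedge V[X] < \infty \implies \bar{X}_n \xrightarrow{qm} \mu$

Slutzky's Theorem
- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n Y_n \xrightarrow{D} cX$
- In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$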
10.1 Law of Large Numbers (LLN)
Let $\{X_1, \ldots, X_n\}$ be a sequence of iid RVs with $E[X_1] = \mu$ and $V[X_1] < \infty$.

Weak (WLLN)
$\bar{X}_n \xrightarrow{P} \mu$ as $n \to \infty$

Strong (SLLN)
$\bar{X}_n \xrightarrow{as} \mu$ as $n \to \infty$
10.2 Central Limit Theorem (CLT)
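Let $\{X_1, \ldots, X_n\}$ be a sequence of iid RVs with $E[X_1] = \mu$ and $V[X_1] = \sigma^2$.
$Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z$ where $Z \sim N(0, 1)$
$\lim_{n \to \infty} P[Z_n \le z] = \Phi(z), \quad z \in \mathbb{R}$

CLT notations
- $Z_n \approx N(0, 1)$
- $\bar{X}_n \approx N\left( \mu, \frac{\sigma^2}{n} \right)$
- $\bar{X}_n - \mu \approx N\left( 0, \frac{\sigma^2}{n} \right)$
- $\sqrt{n} (\bar{X}_n - \mu) \approx N(0, \sigma^2)$
- $\frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \approx N(0, 1)$

Continuity correction
$P[\bar{X}_n \le x] \approx \Phi\left( \frac{x + \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$
$P[\bar{X}_n \ge x] \approx 1 - \Phi\left( \frac{x - \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$

Delta method
$Y_n \approx N\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx N\left( \varphi(\mu), (\varphi'(\mu))^2 \frac{\sigma^2}{n} \right)$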
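A small simulation can make the CLT statement concrete. The sketch below assumes iid Unif(0, 1) draws (an arbitrary choice), standardizes the sample mean as $Z_n$, and compares the empirical value of $P[Z_n \le 1]$ with $\Phi(1)$.

```python
import math
import random
from statistics import NormalDist

random.seed(2)
n, reps = 50, 4000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of Unif(0, 1)

def z_n():
    """One realization of Z_n = sqrt(n) (X_bar_n - mu) / sigma."""
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

zs = [z_n() for _ in range(reps)]
frac = sum(z <= 1.0 for z in zs) / reps  # empirical P[Z_n <= 1]
phi = NormalDist().cdf(1.0)              # Phi(1)
print(frac, phi)
```

The two numbers should agree to roughly two decimal places; the remaining gap is Monte Carlo noise plus the finite-$n$ approximation error.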
11 Statistical Inference
Let $X_1, \ldots, X_n \overset{iid}{\sim} F$ if not otherwise noted.
11.1 Point Estimation
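- Point estimator $\hat\theta_n$ of $\theta$ is a RV: $\hat\theta_n = g(X_1, \ldots, X_n)$
- $\text{bias}(\hat\theta_n) = E[\hat\theta_n] - \theta$
- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Sampling distribution: $F(\hat\theta_n)$
- Standard error: $\text{se}(\hat\theta_n) = \sqrt{V[\hat\theta_n]}$
- Mean squared error: $\text{mse} = E\left[ (\hat\theta_n - \theta)^2 \right] = \text{bias}(\hat\theta_n)^2 + V[\hat\theta_n]$
- $\lim_{n \to \infty} \text{bias}(\hat\theta_n) = 0 \wedge \lim_{n \to \infty} \text{se}(\hat\theta_n) = 0 \implies \hat\theta_n$ is consistent
- Asymptotic normality: $\frac{\hat\theta_n - \theta}{\text{se}} \xrightarrow{D} N(0, 1)$
- Slutzky's Theorem often lets us replace $\text{se}(\hat\theta_n)$ by some (weakly) consistent estimator $\hat\sigma_n$.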
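11.2 Normal-Based Confidence Interval
Suppose $\hat\theta_n \approx N(\theta, \widehat{\text{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim N(0, 1)$. Then
$C_n = \hat\theta_n \pm z_{\alpha/2}\, \widehat{\text{se}}$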
11.3 Empirical distribution
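Empirical Distribution Function (ECDF)
$\hat{F}_n(x) = \frac{\sum_{i=1}^n I(X_i \le x)}{n}, \quad I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$

Properties (for any fixed $x$)
- $E[\hat{F}_n] = F(x)$
- $V[\hat{F}_n] = \frac{F(x)(1 - F(x))}{n}$
- $\text{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
- $\hat{F}_n \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \ldots, X_n \sim F$)
$P\left[ \sup_x \left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2 e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for $F$
$L(x) = \max\{\hat{F}_n - \varepsilon_n, 0\}$
$U(x) = \min\{\hat{F}_n + \varepsilon_n, 1\}$
$\varepsilon_n = \sqrt{\frac{1}{2n} \log\left( \frac{2}{\alpha} \right)}$
$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$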
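The ECDF and the DKW confidence band translate directly into code. The following is a minimal sketch; the helper names `ecdf` and `dkw_band` and the data values are made up for illustration.

```python
import math

def ecdf(sample):
    """Return F_hat_n as a function: fraction of observations <= x."""
    xs = list(sample)
    n = len(xs)
    return lambda x: sum(1 for xi in xs if xi <= x) / n

def dkw_band(sample, alpha=0.05):
    """Return (L, U, eps_n) for the nonparametric 1 - alpha confidence band."""
    n = len(sample)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    f_hat = ecdf(sample)
    lower = lambda x: max(f_hat(x) - eps, 0.0)
    upper = lambda x: min(f_hat(x) + eps, 1.0)
    return lower, upper, eps

data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.2, 2.0]  # made-up observations
F_hat = ecdf(data)
L, U, eps = dkw_band(data)
print(F_hat(2.0), L(2.0), U(2.0), eps)
```

With only eight observations the band is very wide, which matches the $1/\sqrt{n}$ rate of $\varepsilon_n$.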
11.4 Statistical Functionals
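- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat\theta_n = T(\hat{F}_n)$
- Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
- Plug-in estimator for linear functional: $T(\hat{F}_n) = \int \varphi(x)\,d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \varphi(X_i)$
- Often: $T(\hat{F}_n) \approx N(T(F), \widehat{\text{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\, \widehat{\text{se}}$
- $p$th quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat\mu = \bar{X}_n$
- $\hat\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$
- $\hat\kappa = \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \hat\mu)^3}{\hat\sigma^3}$
- $\hat\rho = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^n (X_i - \bar{X}_n)^2} \sqrt{\sum_{i=1}^n (Y_i - \bar{Y}_n)^2}}$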
12 Parametric Inference
Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subset \mathbb{R}^k$ and parameter $\theta = (\theta_1, \ldots, \theta_k)$.
12.1 Method of Moments
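$j$th moment
$\alpha_j(\theta) = E[X^j] = \int x^j\,dF_X(x)$

$j$th sample moment
$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^n X_i^j$

Method of moments estimator (MoM)
$\alpha_1(\theta) = \hat\alpha_1, \quad \alpha_2(\theta) = \hat\alpha_2, \quad \ldots, \quad \alpha_k(\theta) = \hat\alpha_k$

Properties of the MoM estimator
- $\hat\theta_n$ exists with probability tending to 1
- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Asymptotic normality: $\sqrt{n} (\hat\theta - \theta) \xrightarrow{D} N(0, \Sigma)$ where $\Sigma = g\, E[Y Y^T]\, g^T$, $Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and $g_j = \frac{\partial}{\partial\theta} \alpha_j^{-1}(\theta)$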
12.2 Maximum Likelihood
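Likelihood: $L_n : \Theta \to [0, \infty)$
$L_n(\theta) = \prod_{i=1}^n f(X_i; \theta)$

Log-likelihood
$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^n \log f(X_i; \theta)$

Maximum likelihood estimator (mle)
$L_n(\hat\theta_n) = \sup_\theta L_n(\theta)$

Score function
$s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta)$

Fisher information
$I(\theta) = V_\theta[s(X; \theta)]$
$I_n(\theta) = n I(\theta)$

Fisher information (exponential family)
$I(\theta) = E_\theta\left[ -\frac{\partial}{\partial\theta} s(X; \theta) \right]$

Observed Fisher information
$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2} \sum_{i=1}^n \log f(X_i; \theta)$

Properties of the mle
- Consistency: $\hat\theta_n \xrightarrow{P} \theta$
- Equivariance: $\hat\theta_n$ is the mle $\implies \varphi(\hat\theta_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\text{se} \approx \sqrt{1 / I_n(\theta)}$ and $\frac{\hat\theta_n - \theta}{\text{se}} \xrightarrow{D} N(0, 1)$
  2. $\widehat{\text{se}} \approx \sqrt{1 / I_n(\hat\theta_n)}$ and $\frac{\hat\theta_n - \theta}{\widehat{\text{se}}} \xrightarrow{D} N(0, 1)$
- Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde\theta_n$ is any other estimator, the asymptotic relative efficiency is $\text{are}(\tilde\theta_n, \hat\theta_n) = \frac{V[\hat\theta_n]}{V[\tilde\theta_n]} \le 1$
- Approximately the Bayes estimator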
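As a worked instance of the mle and its asymptotic standard error, consider a Bernoulli($p$) sample, for which $\hat{p} = \bar{X}_n$ and $I_n(p) = n / (p(1 - p))$. The sketch below is illustrative only; the function name `bernoulli_mle` and the data are made up.

```python
import math
from statistics import NormalDist

def bernoulli_mle(xs, alpha=0.05):
    """Return (p_hat, se_hat, normal-based 1 - alpha interval)."""
    n = len(xs)
    p_hat = sum(xs) / n                    # maximizes L_n(p)
    fisher_n = n / (p_hat * (1 - p_hat))   # I_n evaluated at p_hat
    se_hat = math.sqrt(1 / fisher_n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat, se_hat, (p_hat - z * se_hat, p_hat + z * se_hat)

xs = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # made-up data
p_hat, se_hat, ci = bernoulli_mle(xs)
print(p_hat, se_hat, ci)
```

The interval is the normal-based interval $\hat\theta_n \pm z_{\alpha/2}\, \widehat{\text{se}}$, so it can extend outside $[0, 1]$ when $n$ is small or $\hat{p}$ is near the boundary.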
12.2.1 Delta Method
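If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$\frac{\hat\tau_n - \tau}{\widehat{\text{se}}(\hat\tau)} \xrightarrow{D} N(0, 1)$
where $\hat\tau = \varphi(\hat\theta)$ is the mle of $\tau$ and
$\widehat{\text{se}}(\hat\tau) = \left| \varphi'(\hat\theta) \right| \widehat{\text{se}}(\hat\theta_n)$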
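For example, with a Bernoulli mle $\hat{p}$ and the log-odds $\tau = \varphi(p) = \log(p / (1 - p))$, the delta method gives $\widehat{\text{se}}(\hat\tau) = \widehat{\text{se}}(\hat{p}) / (\hat{p}(1 - \hat{p}))$. A minimal sketch, with a hypothetical function name and input values:

```python
import math

def log_odds_delta(p_hat, n):
    """Delta-method estimate and standard error for tau = log(p / (1 - p))."""
    se_p = math.sqrt(p_hat * (1 - p_hat) / n)  # se of the Bernoulli mle
    tau_hat = math.log(p_hat / (1 - p_hat))    # phi(p_hat)
    dphi = 1 / (p_hat * (1 - p_hat))           # phi'(p) evaluated at p_hat
    return tau_hat, abs(dphi) * se_p

tau_hat, se_tau = log_odds_delta(0.7, 20)      # hypothetical p_hat and n
print(tau_hat, se_tau)
```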
12.3 Multiparameter Models
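Let $\theta = (\theta_1, \ldots, \theta_k)$ and $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the mle.
$H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2}, \quad H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\, \partial\theta_k}$

Fisher information matrix
$I_n(\theta) = -\begin{pmatrix} E_\theta[H_{11}] & \cdots & E_\theta[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_\theta[H_{k1}] & \cdots & E_\theta[H_{kk}] \end{pmatrix}$

Under appropriate regularity conditions
$(\hat\theta - \theta) \approx N(0, J_n)$
with $J_n(\theta) = I_n^{-1}$. Further, if $\hat\theta_j$ is the $j$th component of $\hat\theta$, then
$\frac{\hat\theta_j - \theta_j}{\widehat{\text{se}}_j} \xrightarrow{D} N(0, 1)$
where $\widehat{\text{se}}_j^2 = J_n(j, j)$ and $\text{Cov}[\hat\theta_j, \hat\theta_k] = J_n(j, k)$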
12.3.1 Multiparameter delta method
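Let $\tau = \varphi(\theta_1, \ldots, \theta_k)$ and let the gradient of $\varphi$ be
$\nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T$
Suppose $\nabla\varphi \big|_{\theta = \hat\theta} \ne 0$ and $\hat\tau = \varphi(\hat\theta)$. Then,
$\frac{\hat\tau - \tau}{\widehat{\text{se}}(\hat\tau)} \xrightarrow{D} N(0, 1)$
where
$\widehat{\text{se}}(\hat\tau) = \sqrt{(\hat\nabla\varphi)^T \hat{J}_n (\hat\nabla\varphi)}$
and $\hat{J}_n = J_n(\hat\theta)$ and $\hat\nabla\varphi = \nabla\varphi \big|_{\theta = \hat\theta}$.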
12.4 Parametric Bootstrap
Sample from $f(x; \hat\theta_n)$ instead of from $\hat{F}_n$, where $\hat\theta_n$ could be the mle or method of moments estimator.
13 Hypothesis Testing
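$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions
- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis $\theta = \theta_0$
- Composite hypothesis $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
- One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = P[X \in R]$
- Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
- Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

| | Retain $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ true | correct | Type I Error ($\alpha$) |
| $H_1$ true | Type II Error ($\beta$) | correct (power) |

p-value
- p-value $= \sup_{\theta \in \Theta_0} P_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}$
- p-value $= \sup_{\theta \in \Theta_0} P_\theta[T(X^\star) \ge T(X)] = \inf\{\alpha : T(X) \in R_\alpha\}$, where $P_\theta[T(X^\star) \ge T(X)] = 1 - F_\theta(T(X))$ since $T(X^\star) \sim F_\theta$

| p-value | evidence |
|---|---|
| < 0.01 | very strong evidence against $H_0$ |
| 0.01 to 0.05 | strong evidence against $H_0$ |
| 0.05 to 0.1 | weak evidence against $H_0$ |
| > 0.1 | little or no evidence against $H_0$ |

Wald test
- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat\theta - \theta_0}{\widehat{\text{se}}}$
- $P[|W| > z_{\alpha/2}] \to \alpha$
- p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood ratio test (LRT)
- $T(X) = \frac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \frac{L_n(\hat\theta_n)}{L_n(\hat\theta_{n,0})}$
- $\lambda(X) = 2 \log T(X) \xrightarrow{D} \chi^2_{r-q}$ where $\sum_{i=1}^k Z_i^2 \sim \chi^2_k$ and $Z_1, \ldots, Z_k \overset{iid}{\sim} N(0, 1)$
- p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
- mle: $\hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)$
- $T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^k \left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
- $\lambda(X) = 2 \sum_{j=1}^k X_j \log\left( \frac{\hat{p}_j}{p_{0j}} \right) \xrightarrow{D} \chi^2_{k-1}$
- The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$

Pearson Chi-square Test
- $T = \sum_{j=1}^k \frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = n p_{0j}$ under $H_0$
- $T \xrightarrow{D} \chi^2_{k-1}$
- p-value $= P[\chi^2_{k-1} > T(x)]$
- Converges in distribution to $\chi^2_{k-1}$ faster than the LRT, hence preferable for small $n$

Independence testing
- $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I * J$
- mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n} \frac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2 \sum_{i=1}^I \sum_{j=1}^J X_{ij} \log\left( \frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)$
- PearsonChiSq: $T = \sum_{i=1}^I \sum_{j=1}^J \frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
- LRT and Pearson $\xrightarrow{D} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$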
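The Wald statistic and the multinomial Pearson and LRT statistics above are straightforward to compute. The sketch below is illustrative: the function names and all inputs are hypothetical, and chi-square tail probabilities are left out because the Python standard library has no chi-square CDF.

```python
import math
from statistics import NormalDist

def wald_test(theta_hat, theta_0, se_hat):
    """Return (w, two-sided p-value) with p = 2 * Phi(-|w|)."""
    w = (theta_hat - theta_0) / se_hat
    return w, 2 * NormalDist().cdf(-abs(w))

def multinomial_stats(counts, p0):
    """Return (Pearson T, LRT lambda) for H0: p = p0."""
    n = sum(counts)
    pearson = sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, p0))
    # Terms with x = 0 contribute 0 to the LRT sum and are skipped.
    lrt = 2 * sum(x * math.log(x / (n * p)) for x, p in zip(counts, p0) if x > 0)
    return pearson, lrt

w, p_value = wald_test(0.7, 0.5, math.sqrt(0.7 * 0.3 / 20))
pearson, lrt = multinomial_stats([18, 22, 20], [1 / 3, 1 / 3, 1 / 3])
print(w, p_value, pearson, lrt)
```

Under $H_0$ both `pearson` and `lrt` would be compared against a $\chi^2_{k-1}$ quantile, here with $k - 1 = 2$ degrees of freedom.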
14 Bayesian Inference
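Bayes' Theorem
$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x^n)} = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta)\,d\theta} \propto L_n(\theta) f(\theta)$

Definitions
- $X^n = (X_1, \ldots, X_n)$
- $x^n = (x_1, \ldots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = L_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x \mid \theta) f(\theta)\,d\theta$
- Kernel: part of a density that depends on $\theta$
- Posterior mean $\bar\theta_n = \int \theta f(\theta \mid x^n)\,d\theta = \frac{\int \theta L_n(\theta) f(\theta)\,d\theta}{\int L_n(\theta) f(\theta)\,d\theta}$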
14.1 Credible Intervals
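Posterior interval
$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1 - \alpha$

Equal-tail credible interval
$\int_{-\infty}^a f(\theta \mid x^n)\,d\theta = \int_b^\infty f(\theta \mid x^n)\,d\theta = \alpha/2$

Highest posterior density (HPD) region $R_n$
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$

If the posterior $f(\theta \mid x^n)$ is unimodal, then $R_n$ is an interval.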
14.2 Function of parameters
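Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$

Posterior density
$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method
$\tau \mid X^n \approx N\left( \varphi(\hat\theta), \widehat{\text{se}} \left| \varphi'(\hat\theta) \right| \right)$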
14.3 Priors
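Choice
- Subjective bayesianism.
- Objective bayesianism.
- Robust bayesianism.

Types
- Flat: $f(\theta) \propto$ constant
- Proper: $\int_{-\infty}^\infty f(\theta)\,d\theta = 1$
- Improper: $\int_{-\infty}^\infty f(\theta)\,d\theta = \infty$
- Jeffreys prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$ and, in the multiparameter case, $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family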
14.3.1 Conjugate Priors
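Discrete likelihood

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\text{Bern}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^n x_i,\ \beta + n - \sum_{i=1}^n x_i$ |
| $\text{Bin}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^n x_i,\ \beta + \sum_{i=1}^n N_i - \sum_{i=1}^n x_i$ |
| $\text{NBin}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + rn,\ \beta + \sum_{i=1}^n x_i$ |
| $\text{Po}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^n x_i,\ \beta + n$ |
| $\text{Multinomial}(p)$ | $\text{Dir}(\alpha)$ | $\alpha + \sum_{i=1}^n x^{(i)}$ |
| $\text{Geo}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + n,\ \beta + \sum_{i=1}^n x_i$ |

Continuous likelihood (subscript $c$ denotes constant)

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\text{Unif}(0, \theta)$ | $\text{Pareto}(x_m, k)$ | $\max\{x_{(n)}, x_m\},\ k + n$ |
| $\text{Exp}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + n,\ \beta + \sum_{i=1}^n x_i$ |
| $N(\mu, \sigma_c^2)$ | $N(\mu_0, \sigma_0^2)$ | $\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^n x_i}{\sigma_c^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right),\ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}$ |
| $N(\mu_c, \sigma^2)$ | Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ | $\nu + n,\ \frac{\nu\sigma_0^2 + \sum_{i=1}^n (x_i - \mu)^2}{\nu + n}$ |
| $N(\mu, \sigma^2)$ | Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$ | $\frac{\nu\lambda + n\bar{x}}{\nu + n},\ \nu + n,\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{n\nu (\bar{x} - \lambda)^2}{2(n + \nu)}$ |
| $\text{MVN}(\mu, \Sigma_c)$ | $\text{MVN}(\mu_0, \Sigma_0)$ | $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1} \left( \Sigma_0^{-1} \mu_0 + n\Sigma_c^{-1} \bar{x} \right),\ \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}$ |
| $\text{MVN}(\mu_c, \Sigma)$ | Inverse-Wishart$(\kappa, \Psi)$ | $n + \kappa,\ \Psi + \sum_{i=1}^n (x_i - \mu_c)(x_i - \mu_c)^T$ |
| $\text{Pareto}(x_{m_c}, k)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + n,\ \beta + \sum_{i=1}^n \log\frac{x_i}{x_{m_c}}$ |
| $\text{Pareto}(x_m, k_c)$ | $\text{Pareto}(x_0, k_0)$ | $x_0,\ k_0 - kn$ where $k_0 > kn$ |
| $\text{Gamma}(\alpha_c, \beta)$ | $\text{Gamma}(\alpha_0, \beta_0)$ | $\alpha_0 + n\alpha_c,\ \beta_0 + \sum_{i=1}^n x_i$ |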
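To illustrate how a conjugate update works in practice, the following sketch applies the Bernoulli row of the table: a $\text{Beta}(\alpha, \beta)$ prior becomes $\text{Beta}(\alpha + \sum x_i, \beta + n - \sum x_i)$. The function names, the prior, and the data are made up for the example.

```python
def beta_bernoulli_update(alpha, beta, xs):
    """Posterior hyperparameters for a Beta(alpha, beta) prior and Bernoulli data."""
    s = sum(xs)
    return alpha + s, beta + len(xs) - s

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior Beta(2, 2) and eight made-up observations.
a_post, b_post = beta_bernoulli_update(2, 2, [1, 0, 1, 1, 0, 1, 1, 1])
print(a_post, b_post, beta_mean(a_post, b_post))
```

The posterior mean lies between the prior mean (0.5) and the sample proportion (0.75), weighted by the prior's pseudo-counts and the sample size.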
14.4 Bayesian Testing
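If $H_0 : \theta \in \Theta_0$:
- Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\,d\theta$
- Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\,d\theta$

Let $H_0, \ldots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$,
$P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\, P[H_k]}{\sum_{j=0}^{K-1} f(x^n \mid H_j)\, P[H_j]}$

Marginal likelihood
$f(x^n \mid H_i) = \int_\Theta f(x^n \mid \theta, H_i) f(\theta \mid H_i)\,d\theta$

Posterior odds (of $H_i$ relative to $H_j$)
$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \times \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$

Bayes factor

| $\log_{10} BF_{10}$ | $BF_{10}$ | evidence |
|---|---|---|
| 0 to 0.5 | 1 to 3.2 | Weak |
| 0.5 to 1 | 3.2 to 10 | Moderate |
| 1 to 2 | 10 to 100 | Strong |
| > 2 | > 100 | Decisive |

$p^\star = \frac{\frac{p}{1 - p} BF_{10}}{1 + \frac{p}{1 - p} BF_{10}}$ where $p = P[H_1]$ and $p^\star = P[H_1 \mid x^n]$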
15 Exponential Family
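Scalar parameter
$f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}$

Vector parameter
$f_X(x \mid \theta) = h(x) \exp\left\{ \sum_{i=1}^s \eta_i(\theta) T_i(x) - A(\theta) \right\} = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$

Natural form
$f_X(x \mid \eta) = h(x) \exp\{\eta \cdot T(x) - A(\eta)\} = h(x) g(\eta) \exp\{\eta \cdot T(x)\} = h(x) g(\eta) \exp\{\eta^T T(x)\}$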
16 Sampling Methods
16.1 The Bootstrap
Let $T_n = g(X_1, \ldots, X_n)$ be a statistic.
1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \ldots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly $X^*_1, \ldots, X^*_n \sim \hat{F}_n$.
       ii. Compute $T^*_n = g(X^*_1, \ldots, X^*_n)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^B \left( T^*_{n,b} - \frac{1}{B} \sum_{r=1}^B T^*_{n,r} \right)^2$
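16.1.1 Bootstrap Confidence Intervals
Normal-based interval
$T_n \pm z_{\alpha/2}\, \widehat{\text{se}}_{boot}$

Pivotal interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat\theta_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n$. Approximate $H$ using bootstrap:
   $\hat{H}(r) = \frac{1}{B} \sum_{b=1}^B I(R^*_{n,b} \le r)$
5. $\theta^*_\beta$ = $\beta$ sample quantile of $(\hat\theta^*_{n,1}, \ldots, \hat\theta^*_{n,B})$
6. $r^*_\beta$ = $\beta$ sample quantile of $(R^*_{n,1}, \ldots, R^*_{n,B})$, i.e., $r^*_\beta = \theta^*_\beta - \hat\theta_n$
7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
   $\hat{a} = \hat\theta_n - \hat{H}^{-1}\left( 1 - \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{1-\alpha/2} = 2\hat\theta_n - \theta^*_{1-\alpha/2}$
   $\hat{b} = \hat\theta_n - \hat{H}^{-1}\left( \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}$

Percentile interval
$C_n = \left( \theta^*_{\alpha/2}, \theta^*_{1-\alpha/2} \right)$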
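The bootstrap variance estimate and the percentile and pivotal intervals can be sketched in a few lines. This is an illustration under assumptions: the function name `bootstrap`, the crude index-based sample quantiles, the choice of the median as the statistic, and the data are all made up.

```python
import random
import statistics

def bootstrap(sample, stat, B=2000, alpha=0.05, seed=0):
    """Return (v_boot, percentile interval, pivotal interval) for stat."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample n points with replacement from the data (i.e., from F_hat_n), B times.
    reps = sorted(stat([rng.choice(sample) for _ in range(n)]) for _ in range(B))
    v_boot = statistics.pvariance(reps)       # (1/B) * sum (T*_b - mean(T*))^2
    lo = reps[int((alpha / 2) * B)]           # approx. theta*_{alpha/2}
    hi = reps[int((1 - alpha / 2) * B) - 1]   # approx. theta*_{1 - alpha/2}
    theta_hat = stat(sample)
    percentile = (lo, hi)
    pivotal = (2 * theta_hat - hi, 2 * theta_hat - lo)
    return v_boot, percentile, pivotal

data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.2, 2.0, 1.5, 2.4]  # made-up observations
v_boot, percentile, pivotal = bootstrap(data, statistics.median)
print(v_boot, percentile, pivotal)
```

The pivotal interval reflects the percentile interval around $\hat\theta_n$, so the two coincide only when the bootstrap distribution is symmetric about $\hat\theta_n$.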
16.2 Rejection Sampling
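Setup
- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportional constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\,d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)\ \forall\theta$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \text{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta) f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = L_n(\hat\theta_n) \equiv M$
- Algorithm
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \text{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \frac{L_n(\theta^{cand})}{L_n(\hat\theta_n)}$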
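The prior-as-proposal example can be sketched for a concrete model. The code below assumes Bernoulli data with $s$ successes in $n$ trials and a Unif(0, 1) prior, so the exact posterior is Beta($s + 1$, $n - s + 1$); the function name and the values of $s$ and $n$ are hypothetical.

```python
import random

def rejection_sample_posterior(s, n, B=2000, seed=0):
    """Draw B posterior samples of theta, proposing from a Unif(0, 1) prior."""
    rng = random.Random(seed)
    lik = lambda th: th ** s * (1 - th) ** (n - s)  # L_n(theta)
    M = lik(s / n)                                  # L_n at the mle bounds L_n
    accepted = []
    while len(accepted) < B:
        cand = rng.random()                         # theta_cand ~ prior
        if rng.random() <= lik(cand) / M:           # accept with prob L_n(cand) / M
            accepted.append(cand)
    return accepted

draws = rejection_sample_posterior(s=7, n=10)
post_mean = sum(draws) / len(draws)
print(post_mean)  # exact posterior mean is 8 / 12
```

The acceptance rate falls as the likelihood becomes more peaked relative to the prior, which is the main practical limitation of this scheme.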
16.3 Importance Sampling
Sample from an importance function $g$ rather than target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior $\theta_1, \ldots, \theta_B \overset{iid}{\sim} f(\theta)$
2. $w_i = \frac{L_n(\theta_i)}{\sum_{i=1}^B L_n(\theta_i)} \quad \forall i = 1, \ldots, B$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^B q(\theta_i) w_i$
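The three steps above can be sketched for the same illustrative model as before: Bernoulli data with $s$ successes in $n$ trials and a Unif(0, 1) prior, with $q(\theta) = \theta$ so that the target is the posterior mean. All names and values are assumptions made for the example.

```python
import random

def importance_posterior_mean(s, n, q=lambda th: th, B=20000, seed=0):
    """Approximate E[q(theta) | x^n] using prior draws weighted by the likelihood."""
    rng = random.Random(seed)
    thetas = [rng.random() for _ in range(B)]                # step 1: theta_i ~ prior
    liks = [th ** s * (1 - th) ** (n - s) for th in thetas]  # L_n(theta_i)
    total = sum(liks)
    weights = [l / total for l in liks]                      # step 2: w_i
    return sum(q(th) * w for th, w in zip(thetas, weights))  # step 3

est = importance_posterior_mean(s=7, n=10)
print(est)  # exact posterior mean under this model is 8 / 12
```

Unlike rejection sampling, every draw is used, but draws with negligible likelihood contribute almost nothing to the estimate.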
17 Decision Theory
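Definitions
- Unknown quantity affecting our decision: $\theta \in \Theta$
- Decision rule: synonymous for an estimator $\hat\theta$
- Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat\theta(x)$.
- Loss function $L$: consequences of taking action $a$ when true state is $\theta$ or discrepancy between $\theta$ and $\hat\theta$, $L : \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions
- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = \begin{cases} K_1 (\theta - a) & a - \theta < 0 \\ K_2 (a - \theta) & a - \theta \ge 0 \end{cases}$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
- $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
- Zero-one loss: $L(\theta, a) = \begin{cases} 0 & a = \theta \\ 1 & a \ne \theta \end{cases}$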
17.1 Risk
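Posterior risk
$r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x)) f(\theta \mid x)\,d\theta = E_{\theta|X}\left[ L(\theta, \hat\theta(x)) \right]$

(Frequentist) risk
$R(\theta, \hat\theta) = \int L(\theta, \hat\theta(x)) f(x \mid \theta)\,dx = E_{X|\theta}\left[ L(\theta, \hat\theta(X)) \right]$

Bayes risk
$r(f, \hat\theta) = \iint L(\theta, \hat\theta(x)) f(x, \theta)\,dx\,d\theta = E_{\theta,X}\left[ L(\theta, \hat\theta(X)) \right]$
$r(f, \hat\theta) = E_\theta\left[ E_{X|\theta}\left[ L(\theta, \hat\theta(X)) \right] \right] = E_\theta\left[ R(\theta, \hat\theta) \right]$
$r(f, \hat\theta) = E_X\left[ E_{\theta|X}\left[ L(\theta, \hat\theta(X)) \right] \right] = E_X\left[ r(\hat\theta \mid X) \right]$

17.2 Admissibility
- $\hat\theta'$ dominates $\hat\theta$ if
  $\forall\theta : R(\theta, \hat\theta') \le R(\theta, \hat\theta)$
  $\exists\theta : R(\theta, \hat\theta') < R(\theta, \hat\theta)$
- $\hat\theta$ is inadmissible if there is at least one other estimator $\hat\theta'$ that dominates it. Otherwise it is called admissible.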
17.3 Bayes Rule
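Bayes rule (or Bayes estimator)
- $r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)$
- $\hat\theta(x) = \inf r(\hat\theta \mid x)\ \forall x \implies r(f, \hat\theta) = \int r(\hat\theta \mid x) f(x)\,dx$

Theorems
- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode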
17.4 Minimax Rules
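Maximum risk
$\bar{R}(\hat\theta) = \sup_\theta R(\theta, \hat\theta), \quad \bar{R}(a) = \sup_\theta R(\theta, a)$

Minimax rule
$\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \bar{R}(\tilde\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta)$
$\hat\theta$ = Bayes rule $\wedge\ \exists c : R(\theta, \hat\theta) = c$

Least favorable prior
$\hat\theta^f$ = Bayes rule $\wedge\ R(\theta, \hat\theta^f) \le r(f, \hat\theta^f)\ \forall\theta$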
18 Linear Regression
Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)
18.1 Simple Linear Regression
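Model
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad E[\epsilon_i \mid X_i] = 0, \quad V[\epsilon_i \mid X_i] = \sigma^2$

Fitted line
$\hat{r}(x) = \hat\beta_0 + \hat\beta_1 x$

Predicted (fitted) values
$\hat{Y}_i = \hat{r}(X_i)$

Residuals
$\hat\epsilon_i = Y_i - \hat{Y}_i = Y_i - \left( \hat\beta_0 + \hat\beta_1 X_i \right)$

Residual sums of squares (rss)
$\text{rss}(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^n \hat\epsilon_i^2$

Least square estimates
$\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T : \min_{\hat\beta_0, \hat\beta_1} \text{rss}$
$\hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{X}_n$
$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^n (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^n X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^n X_i^2 - n \bar{X}^2}$
$E[\hat\beta \mid X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$
$V[\hat\beta \mid X^n] = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} n^{-1} \sum_{i=1}^n X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}$
$\widehat{\text{se}}(\hat\beta_0) = \frac{\hat\sigma}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^n X_i^2}{n}}$
$\widehat{\text{se}}(\hat\beta_1) = \frac{\hat\sigma}{s_X \sqrt{n}}$
where $s_X^2 = n^{-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$ and $\hat\sigma^2 = \frac{1}{n - 2} \sum_{i=1}^n \hat\epsilon_i^2$ (unbiased estimate).

Further properties:
- Consistency: $\hat\beta_0 \xrightarrow{P} \beta_0$ and $\hat\beta_1 \xrightarrow{P} \beta_1$
- Asymptotic normality: $\frac{\hat\beta_0 - \beta_0}{\widehat{\text{se}}(\hat\beta_0)} \xrightarrow{D} N(0, 1)$ and $\frac{\hat\beta_1 - \beta_1}{\widehat{\text{se}}(\hat\beta_1)} \xrightarrow{D} N(0, 1)$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat\beta_0 \pm z_{\alpha/2}\, \widehat{\text{se}}(\hat\beta_0)$ and $\hat\beta_1 \pm z_{\alpha/2}\, \widehat{\text{se}}(\hat\beta_1)$
- Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1 / \widehat{\text{se}}(\hat\beta_1)$.

$R^2$
$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n \hat\epsilon_i^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\text{rss}}{\text{tss}}$

Likelihood
$L = \prod_{i=1}^n f(X_i, Y_i) = \prod_{i=1}^n f_X(X_i) \times \prod_{i=1}^n f_{Y|X}(Y_i \mid X_i) = L_1 \times L_2$
$L_1 = \prod_{i=1}^n f_X(X_i)$
$L_2 = \prod_{i=1}^n f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}$

Under the assumption of Normality, the least squares estimator is also the mle, and the mle of the variance is
$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n \hat\epsilon_i^2$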
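The least squares formulas above can be evaluated directly. The sketch below is a minimal illustration; the function name `simple_ols` and the five data points are made up.

```python
import math

def simple_ols(xs, ys):
    """Return (b0, b1, se_b0, se_b1, r2) for the model Y = b0 + b1 X + eps."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - ybar) ** 2 for y in ys)
    sigma_hat = math.sqrt(rss / (n - 2))   # unbiased variance estimate, square-rooted
    s_x = math.sqrt(sxx / n)
    se_b1 = sigma_hat / (s_x * math.sqrt(n))
    se_b0 = se_b1 * math.sqrt(sum(x ** 2 for x in xs) / n)
    return b0, b1, se_b0, se_b1, 1 - rss / tss

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]  # made-up responses, roughly y = 2x
b0, b1, se_b0, se_b1, r2 = simple_ols(xs, ys)
print(b0, b1, se_b0, se_b1, r2)
```

A Wald statistic for $H_0 : \beta_1 = 0$ is then simply `b1 / se_b1`.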
18.2 Prediction
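Observe $X = x_*$ of the covariate and want to predict their outcome $Y_*$.
$\hat{Y}_* = \hat\beta_0 + \hat\beta_1 x_*$
$V[\hat{Y}_*] = V[\hat\beta_0] + x_*^2 V[\hat\beta_1] + 2 x_* \text{Cov}[\hat\beta_0, \hat\beta_1]$

Prediction interval
$\hat\xi_n^2 = \hat\sigma^2 \left( \frac{\sum_{i=1}^n (X_i - X_*)^2}{n \sum_i (X_i - \bar{X})^2} + 1 \right)$
$\hat{Y}_* \pm z_{\alpha/2}\, \hat\xi_n$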
18.3 Multiple Regression
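$Y = X\beta + \epsilon$
where
$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$

Likelihood
$L(\mu, \Sigma) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \text{rss} \right\}$
$\text{rss} = (y - X\beta)^T (y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^N (Y_i - x_i^T \beta)^2$

If the $(k \times k)$ matrix $X^T X$ is invertible,
$\hat\beta = (X^T X)^{-1} X^T Y$
$V[\hat\beta \mid X^n] = \sigma^2 (X^T X)^{-1}$
$\hat\beta \approx N\left( \beta, \sigma^2 (X^T X)^{-1} \right)$

Estimate regression function
$\hat{r}(x) = \sum_{j=1}^k \hat\beta_j x_j$

Unbiased estimate for $\sigma^2$
$\hat\sigma^2 = \frac{1}{n - k} \sum_{i=1}^n \hat\epsilon_i^2, \quad \hat\epsilon = X\hat\beta - Y$

mle
$\hat\mu = \bar{X}, \quad \hat\sigma^2_{mle} = \frac{n - k}{n} \hat\sigma^2$

$1 - \alpha$ Confidence interval
$\hat\beta_j \pm z_{\alpha/2}\, \widehat{\text{se}}(\hat\beta_j)$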
18.4 Model Selection
Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subset J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
$H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0 \quad \forall j \in J$

Mean squared prediction error (mspe)
$\text{mspe} = E\left[ (\hat{Y}(S) - Y^*)^2 \right]$

Prediction risk
$R(S) = \sum_{i=1}^n \text{mspe}_i = \sum_{i=1}^n E\left[ (\hat{Y}_i(S) - Y_i^*)^2 \right]$

Training error
$\hat{R}_{tr}(S) = \sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2$

$R^2$
$R^2(S) = 1 - \frac{\text{rss}(S)}{\text{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\text{tss}} = 1 - \frac{\sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$

The training error is a downward-biased estimate of the prediction risk.
$E[\hat{R}_{tr}(S)] < R(S)$
$\text{bias}(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2 \sum_{i=1}^n \text{Cov}[\hat{Y}_i, Y_i]$

Adjusted $R^2$
$R^2(S) = 1 - \frac{n - 1}{n - k} \frac{\text{rss}}{\text{tss}}$

Mallow's $C_p$ statistic
$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat\sigma^2$ = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
$\text{AIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k$

Bayesian Information Criterion (BIC)
$\text{BIC}(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \frac{k}{2} \log n$

Validation and training
$\hat{R}_V(S) = \sum_{i=1}^m (\hat{Y}_i^*(S) - Y_i^*)^2, \quad m = |\{\text{validation data}\}|$, often $\frac{n}{4}$ or $\frac{n}{2}$

Leave-one-out cross-validation
$\hat{R}_{CV}(S) = \sum_{i=1}^n (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^n \left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2$
$U(S) = X_S (X_S^T X_S)^{-1} X_S^T$ ("hat matrix")
19 Non-parametric Function Estimation
19.1 Density Estimation
Estimate $f(x)$, where $P[X \in A] = \int_A f(x)\,dx$.

Integrated square error (ise)
$L(f, \hat{f}_n) = \int \left( f(x) - \hat{f}_n(x) \right)^2 dx = J(h) + \int f^2(x)\,dx$

Frequentist risk
$R(f, \hat{f}_n) = E\left[ L(f, \hat{f}_n) \right] = \int b^2(x)\,dx + \int v(x)\,dx$
$b(x) = E[\hat{f}_n(x)] - f(x)$
$v(x) = V[\hat{f}_n(x)]$
19.1.1 Histograms
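Definitions
Number of bins $m$
Binwidth $h = \frac{1}{m}$
Bin $B_j$ has $\nu_j$ observations
Define $\hat p_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\,du$
Histogram estimator
$$\hat f_n(x) = \sum_{j=1}^{m} \frac{\hat p_j}{h}\, I(x \in B_j)$$
$$E\left[\hat f_n(x)\right] = \frac{p_j}{h}$$
$$V\left[\hat f_n(x)\right] = \frac{p_j (1 - p_j)}{n h^2}$$
$$R(\hat f_n, f) \approx \frac{h^2}{12} \int (f'(u))^2\,du + \frac{1}{nh}$$
$$h^* = \frac{1}{n^{1/3}} \left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}$$
$$R^*(\hat f_n, f) \approx \frac{C}{n^{2/3}} \qquad C = \left(\frac{3}{4}\right)^{2/3} \left(\int (f'(u))^2\,du\right)^{1/3}$$
Cross-validation estimate of $E[J(h)]$
$$\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^{m} \hat p_j^2$$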
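The histogram estimator and its cross-validation score can be sketched as follows. This is a minimal illustration that assumes the data lie in $[0, 1]$, so that $h = 1/m$; the function names and sample values are hypothetical.

```python
def histogram_density(data, m):
    """Return (f_hat, p_hat) for m equal-width bins on [0, 1]."""
    n = len(data)
    h = 1.0 / m
    counts = [0] * m
    for x in data:
        j = min(int(x / h), m - 1)  # x = 1.0 falls into the last bin
        counts[j] += 1
    p_hat = [c / n for c in counts]

    def f_hat(x):
        j = min(int(x / h), m - 1)
        return p_hat[j] / h

    return f_hat, p_hat


def cv_score(data, m):
    """Closed-form cross-validation estimate J_CV(h) with h = 1/m."""
    n = len(data)
    h = 1.0 / m
    _, p_hat = histogram_density(data, m)
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum(p * p for p in p_hat)


# Hypothetical sample on [0, 1].
data = [0.05, 0.15, 0.2, 0.55, 0.6, 0.65, 0.7, 0.95]
f_hat, p_hat = histogram_density(data, 4)
best_m = min(range(1, 9), key=lambda m: cv_score(data, m))
print(f_hat(0.1), best_m)
```

In practice one would choose the number of bins $m$ by minimizing `cv_score`, as the last lines do over a small grid.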
19.1.2 Kernel Density Estimator (KDE)
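Kernel $K$
$K(x) \ge 0$
$\int K(x)\,dx = 1$
$\int x K(x)\,dx = 0$
$\int x^2 K(x)\,dx \equiv \sigma_K^2 > 0$
KDE
$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - X_i}{h}\right)$$
$$R(f, \hat f_n) \approx \frac{1}{4} (h \sigma_K)^4 \int (f''(x))^2\,dx + \frac{1}{nh} \int K^2(x)\,dx$$
$$h^* = \frac{c_1^{-2/5}\, c_2^{1/5}\, c_3^{-1/5}}{n^{1/5}} \qquad c_1 = \sigma_K^2,\; c_2 = \int K^2(x)\,dx,\; c_3 = \int (f''(x))^2\,dx$$
$$R^*(f, \hat f_n) = \frac{c_4}{n^{4/5}} \qquad c_4 = \underbrace{\frac{5}{4} (\sigma_K^2)^{2/5} \left(\int K^2(x)\,dx\right)^{4/5}}_{C(K)} \left(\int (f'')^2\,dx\right)^{1/5}$$
Epanechnikov Kernel
$$K(x) = \begin{cases} \frac{3}{4\sqrt{5}} \left(1 - \frac{x^2}{5}\right) & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$
Cross-validation estimate of $E[J(h)]$
$$\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{(-i)}(X_i) \approx \frac{1}{h n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K^*\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh} K(0)$$
$$K^*(x) = K^{(2)}(x) - 2K(x) \qquad K^{(2)}(x) = \int K(x - y) K(y)\,dy$$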
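A kernel density estimate with the Epanechnikov kernel can be sketched as below. This is a rough illustration, not a tuned implementation: the bandwidth is fixed by hand, the integrals are approximated with a midpoint rule, and all names and data values are assumptions made for the example.

```python
import math


def epanechnikov(x):
    """Epanechnikov kernel scaled to have unit variance."""
    root5 = math.sqrt(5.0)
    if abs(x) < root5:
        return 3.0 / (4.0 * root5) * (1.0 - x * x / 5.0)
    return 0.0


def kde(data, h, kernel=epanechnikov):
    """Return the function f_hat(x) = (1/n) * sum_i K((x - X_i)/h) / h."""
    n = len(data)

    def f_hat(x):
        return sum(kernel((x - xi) / h) for xi in data) / (n * h)

    return f_hat


def integrate(f, a, b, steps=4000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    w = (b - a) / steps
    return w * sum(f(a + (k + 0.5) * w) for k in range(steps))


# Hypothetical sample and bandwidth.
data = [0.2, 1.0, 1.1, 2.5, 3.0]
f_hat = kde(data, h=0.5)
print(integrate(epanechnikov, -3.0, 3.0), integrate(f_hat, -2.0, 6.0))
```

Both printed integrals should be close to 1, reflecting that $K$ and $\hat f_n$ are densities.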
19.2 Non-parametric Regression
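Estimate $r(x)$ where $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \dots, (x_n, Y_n)$ related by
$$Y_i = r(x_i) + \epsilon_i \qquad E[\epsilon_i] = 0 \qquad V[\epsilon_i] = \sigma^2$$
k-nearest Neighbor Estimator
$$\hat r(x) = \frac{1}{k} \sum_{i:\, x_i \in N_k(x)} Y_i \qquad \text{where } N_k(x) = \{k \text{ values of } x_1, \dots, x_n \text{ closest to } x\}$$
Nadaraya-Watson kernel estimator
$$\hat r(x) = \sum_{i=1}^{n} w_i(x) Y_i \qquad w_i(x) = \frac{K\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{x - x_j}{h}\right)} \in [0, 1]$$
$$R(\hat r_n, r) \approx \frac{h^4}{4} \left(\int x^2 K(x)\,dx\right)^2 \int \left(r''(x) + 2 r'(x) \frac{f'(x)}{f(x)}\right)^2 dx + \int \frac{\sigma^2 \int K^2(x)\,dx}{n h f(x)}\,dx$$
Here $f$ denotes the density of the covariate.
$$h^* \approx \frac{c_1}{n^{1/5}} \qquad R^*(\hat r_n, r) \approx \frac{c_2}{n^{4/5}}$$
Cross-validation estimate of $E[J(h)]$
$$\hat J_{CV}(h) = \sum_{i=1}^{n} (Y_i - \hat r_{(-i)}(x_i))^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat r(x_i))^2}{\left(1 - \frac{K(0)}{\sum_{j=1}^{n} K\left(\frac{x_i - x_j}{h}\right)}\right)^2}$$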
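The kernel regression estimator and its leave-one-out shortcut can be illustrated as follows. This is a minimal sketch assuming a Gaussian kernel and a hand-picked bandwidth; the function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_estimate(x, xs, ys, h, skip=None):
    """Kernel-weighted average of ys at x; optionally skip one index."""
    num = den = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        k = gaussian_kernel((x - xj) / h)
        num += k * yj
        den += k
    return num / den


def cv_brute_force(xs, ys, h):
    """Leave each point out in turn and predict it from the rest."""
    return sum((ys[i] - nw_estimate(xs[i], xs, ys, h, skip=i)) ** 2
               for i in range(len(xs)))


def cv_shortcut(xs, ys, h):
    """Same score from a single pass, rescaling by 1 - w_i(x_i)."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        denom = sum(gaussian_kernel((xi - xj) / h) for xj in xs)
        w_ii = gaussian_kernel(0.0) / denom
        total += ((yi - nw_estimate(xi, xs, ys, h)) / (1.0 - w_ii)) ** 2
    return total


# Hypothetical noisy samples of a smooth curve.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.0, 0.48, 0.84, 1.0, 0.91, 0.6, 0.14]
print(cv_brute_force(xs, ys, 0.5), cv_shortcut(xs, ys, 0.5))
```

The two scores should agree up to rounding, so the bandwidth can be tuned with the cheaper shortcut.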
19.3 Smoothing Using Orthogonal Functions
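Approximation
$$r(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x) \approx \sum_{j=1}^{J} \beta_j \phi_j(x)$$
Multivariate regression
$$Y = \Phi\beta + \eta \qquad \text{where } \eta_i = \epsilon_i \text{ and } \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}$$
Least squares estimator
$$\hat\beta = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \frac{1}{n} \Phi^T Y \quad \text{(for equally spaced observations only)}$$
Cross-validation estimate of $E[J(h)]$
$$\hat R_{CV}(J) = \sum_{i=1}^{n} \left(Y_i - \sum_{j=1}^{J} \phi_j(x_i)\, \hat\beta_{j,(-i)}\right)^2$$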
20 Stochastic Processes
Stochastic Process
$$\{X_t : t \in T\} \qquad T = \begin{cases} \{0, \pm 1, \dots\} = \mathbb{Z} & \text{discrete} \\ [0, \infty) & \text{continuous} \end{cases}$$
Notations $X_t$, $X(t)$
State space $\mathcal{X}$
Index set $T$
20.1 Markov Chains
Markov chain
$$P[X_n = x \mid X_0, \dots, X_{n-1}] = P[X_n = x \mid X_{n-1}] \qquad \forall n \in T,\; x \in \mathcal{X}$$
Transition probabilities
$$p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$$
$$p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i] \quad \text{n-step}$$
Transition matrix $P$ (n-step: $P_n$)
$(i, j)$ element is $p_{ij}$
$p_{ij} \ge 0$
$\sum_j p_{ij} = 1$
Chapman-Kolmogorov
$$p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$$
$$P_{m+n} = P_m P_n$$
$$P_n = P \times \cdots \times P = P^n$$
Marginal probability
$$\mu_n = (\mu_n(1), \dots, \mu_n(N)) \qquad \text{where } \mu_n(i) = P[X_n = i]$$
$\mu_0$ initial distribution
$$\mu_n = \mu_0 P^n$$
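The marginal distribution and the Chapman-Kolmogorov relation can be checked on a small chain. The sketch below uses plain nested lists for matrices; the two-state transition matrix is a made-up example and the helper names are illustrative.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_pow(p, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


def marginal(mu0, p, n):
    """Marginal distribution mu_n = mu_0 P^n as a flat list."""
    return mat_mul([mu0], mat_pow(p, n))[0]


# Hypothetical two-state chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]
print(marginal(mu0, P, 2))
```

Starting in state 1, the chain is in state 1 after two steps with probability $0.9 \cdot 0.9 + 0.1 \cdot 0.5 = 0.86$.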
20.2 Poisson Processes
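Poisson process
$\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
$X_0 = 0$
Independent increments:
$$\forall t_0 < \cdots < t_n : \; X_{t_1} - X_{t_0} \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_{t_n} - X_{t_{n-1}}$$
Intensity function $\lambda(t)$
$$P[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)$$
$$P[X_{t+h} - X_t \ge 2] = o(h)$$
$$X_{s+t} - X_s \sim \text{Po}(m(s + t) - m(s)) \qquad \text{where } m(t) = \int_0^t \lambda(s)\,ds$$
Homogeneous Poisson process
$$\lambda(t) \equiv \lambda \implies X_t \sim \text{Po}(\lambda t) \qquad \lambda > 0$$
Waiting times
$W_t$ := time at which $X_t$ occurs
$$W_t \sim \text{Gamma}\left(t, \frac{1}{\lambda}\right)$$
Interarrival times
$$S_t = W_{t+1} - W_t \qquad S_t \sim \text{Exp}\left(\frac{1}{\lambda}\right)$$
Figure: time axis $t$ marking consecutive waiting times $W_{t-1}$ and $W_t$ with the interarrival time $S_t$ between them.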
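A homogeneous Poisson process can be simulated by accumulating exponential interarrival times. This is a minimal sketch: the rate, horizon, seed and function name are arbitrary choices for illustration. Note that Python's `random.expovariate` takes the rate $\lambda$ as its argument, whereas the notation above parameterizes the exponential by its mean $1/\lambda$.

```python
import random


def simulate_arrivals(rate, horizon, rng):
    """Waiting times W_1 < W_2 < ... that fall within [0, horizon]."""
    arrivals = []
    t = rng.expovariate(rate)  # first interarrival time, mean 1/rate
    while t <= horizon:
        arrivals.append(t)
        t += rng.expovariate(rate)
    return arrivals


rng = random.Random(0)  # fixed seed so the run is reproducible
rate, horizon = 2.0, 5.0
counts = [len(simulate_arrivals(rate, horizon, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
print(mean_count)  # should be near rate * horizon
```

The average count over many runs should be close to $\lambda t$, the mean of $\text{Po}(\lambda t)$.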
21 Time Series
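Mean function
$$\mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$$
Autocovariance function
$$\gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s \mu_t$$
$$\gamma_x(t, t) = E\left[(x_t - \mu_t)^2\right] = V[x_t]$$
Autocorrelation function (ACF)
$$\rho(s, t) = \frac{\text{Cov}[x_s, x_t]}{\sqrt{V[x_s]\, V[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\, \gamma(t, t)}}$$
Cross-covariance function (CCV)
$$\gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]$$
Cross-correlation function (CCF)
$$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\, \gamma_y(t, t)}}$$
Backshift operator
$$B^k(x_t) = x_{t-k}$$
Difference operator
$$\nabla^d = (1 - B)^d$$
White noise
$w_t \sim \text{wn}(0, \sigma_w^2)$
Gaussian: $w_t \overset{iid}{\sim} N(0, \sigma_w^2)$
$E[w_t] = 0 \quad \forall t \in T$
$V[w_t] = \sigma_w^2 \quad \forall t \in T$
$\gamma_w(s, t) = 0 \quad s \neq t,\; \forall s, t \in T$
Random walk
Drift $\delta$
$$x_t = \delta t + \sum_{j=1}^{t} w_j \qquad E[x_t] = \delta t$$
Symmetric moving average
$$m_t = \sum_{j=-k}^{k} a_j x_{t-j} \qquad \text{where } a_j = a_{-j} \ge 0 \text{ and } \sum_{j=-k}^{k} a_j = 1$$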
21.1 Stationary Time Series
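Strictly stationary
$$P[x_{t_1} \le c_1, \dots, x_{t_k} \le c_k] = P[x_{t_1 + h} \le c_1, \dots, x_{t_k + h} \le c_k] \qquad \forall k \in \mathbb{N},\; t_k, c_k, h \in \mathbb{Z}$$
Weakly stationary
$E\left[x_t^2\right] < \infty \quad \forall t \in \mathbb{Z}$
$E[x_t] = m \quad \forall t \in \mathbb{Z}$
$\gamma_x(s, t) = \gamma_x(s + r, t + r) \quad \forall r, s, t \in \mathbb{Z}$
Autocovariance function
$\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)] \quad \forall h \in \mathbb{Z}$
$\gamma(0) = E\left[(x_t - \mu)^2\right]$
$\gamma(0) \ge 0$
$\gamma(0) \ge |\gamma(h)|$
$\gamma(h) = \gamma(-h)$
Autocorrelation function (ACF)
$$\rho_x(h) = \frac{\text{Cov}[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}]\, V[x_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\, \gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$$
Jointly stationary time series
$$\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]$$
$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\, \gamma_y(0)}}$$
Linear process
$$x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j} \qquad \text{where } \sum_{j=-\infty}^{\infty} |\psi_j| < \infty$$
$$\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j$$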
21.2 Estimation of Correlation
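Sample mean
$$\bar x = \frac{1}{n} \sum_{t=1}^{n} x_t$$
Variance of the sample mean
$$V[\bar x] = \frac{1}{n} \sum_{h=-n}^{n} \left(1 - \frac{|h|}{n}\right) \gamma_x(h)$$
Sample autocovariance function
$$\hat\gamma(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(x_t - \bar x)$$
Sample autocorrelation function
$$\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}$$
Sample cross-covariance function
$$\hat\gamma_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(y_t - \bar y)$$
Sample cross-correlation function
$$\hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\, \hat\gamma_y(0)}}$$
Properties
$\sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
$\sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise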
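The sample autocovariance and autocorrelation translate directly into code. The sketch below follows the definitions above, dividing by $n$ rather than $n - h$; the function names and the short series are illustrative.

```python
def sample_autocov(x, h):
    """gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} (x_{t+h} - x_bar)(x_t - x_bar)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x[t + h] - x_bar) * (x[t] - x_bar) for t in range(n - h)) / n


def sample_acf(x, h):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0)."""
    return sample_autocov(x, h) / sample_autocov(x, 0)


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print([sample_acf(series, h) for h in range(3)])
```

For this short trending series the lag-1 value works out to $0.8 / 2 = 0.4$ and the lag-2 value to $-0.2 / 2 = -0.1$.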
21.3 Non-Stationary Time Series
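Classical decomposition model
$$x_t = \mu_t + s_t + w_t$$
$\mu_t$ = trend
$s_t$ = seasonal component
$w_t$ = random noise term
21.3.1 Detrending
Least squares
1. Choose trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
2. Minimize rss to obtain trend estimate $\hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2$
3. Residuals $\triangleq$ noise $w_t$
Moving average
The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
$$v_t = \frac{1}{2k + 1} \sum_{i=-k}^{k} x_{t-i}$$
If $\frac{1}{2k+1} \sum_{i=-k}^{k} w_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion
Differencing
$$\mu_t = \beta_0 + \beta_1 t \implies \nabla x_t = \beta_1 + \nabla w_t \quad \text{(for } x_t = \mu_t + w_t\text{)}$$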
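Both detrending devices can be tried on a noise-free linear trend, where their effect is exact. The following sketch uses illustrative function names and a made-up trend $\mu_t = 2 + 3t$.

```python
def difference(x):
    """First difference: x_t - x_{t-1} for t = 1, ..., n-1."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


def moving_average(x, k):
    """Symmetric moving average with equal weights 1/(2k+1), interior points only."""
    width = 2 * k + 1
    return [sum(x[t - k:t + k + 1]) / width for t in range(k, len(x) - k)]


trend = [2.0 + 3.0 * t for t in range(10)]
print(difference(trend))         # constant slope
print(moving_average(trend, 1))  # interior of the trend, unchanged
```

Differencing reduces the linear trend to its slope, while the symmetric moving average passes it through unchanged at interior time points.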
21.4 ARIMA models
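Autoregressive polynomial
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C} \wedge \phi_p \neq 0$$
Autoregressive operator
$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$
Autoregressive model order $p$, AR($p$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$$
AR(1)
$$x_t = \phi^k (x_{t-k}) + \sum_{j=0}^{k-1} \phi^j (w_{t-j}) \overset{k \to \infty,\, |\phi| < 1}{=} \underbrace{\sum_{j=0}^{\infty} \phi^j (w_{t-j})}_{\text{linear process}}$$
$$E[x_t] = \sum_{j=0}^{\infty} \phi^j (E[w_{t-j}]) = 0$$
$$\gamma(h) = \text{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2 \phi^h}{1 - \phi^2}$$
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$$
$$\rho(h) = \phi \rho(h - 1) \qquad h = 1, 2, \dots$$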
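Moving average polynomial
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C} \wedge \theta_q \neq 0$$
Moving average operator
$$\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$$
MA($q$) (moving average model order $q$)
$$x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$$
$$E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0$$
$$\gamma(h) = \text{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h} & 0 \le h \le q \\ 0 & h > q \end{cases}$$
MA(1)
$$x_t = w_t + \theta w_{t-1}$$
$$\gamma(h) = \begin{cases} (1 + \theta^2) \sigma_w^2 & h = 0 \\ \theta \sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases}$$
$$\rho(h) = \begin{cases} \frac{\theta}{1 + \theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$$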
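The MA($q$) autocovariance formula is easy to evaluate directly. The sketch below assumes the convention $\theta_0 = 1$; the function names and the coefficient value are illustrative.

```python
def ma_autocov(theta, sigma2_w, h):
    """gamma(h) for x_t = w_t + theta_1 w_{t-1} + ... + theta_q w_{t-q}."""
    coeffs = [1.0] + list(theta)  # prepend theta_0 = 1
    q = len(theta)
    h = abs(h)                    # gamma(h) = gamma(-h)
    if h > q:
        return 0.0
    return sigma2_w * sum(coeffs[j] * coeffs[j + h] for j in range(q - h + 1))


def ma_acf(theta, h):
    """rho(h) = gamma(h) / gamma(0); the noise variance cancels."""
    return ma_autocov(theta, 1.0, h) / ma_autocov(theta, 1.0, 0)


print([ma_autocov([0.5], 1.0, h) for h in range(3)], ma_acf([0.5], 1))
```

For MA(1) with $\theta = 0.5$ and $\sigma_w^2 = 1$ this reproduces $\gamma(0) = 1.25$, $\gamma(1) = 0.5$ and $\rho(1) = \theta / (1 + \theta^2) = 0.4$.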
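ARMA($p, q$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}$$
$$\phi(B) x_t = \theta(B) w_t$$
Partial autocorrelation function (PACF)
$x_i^{h-1} \triangleq$ regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \dots, x_1\}$
$$\phi_{hh} = \text{corr}(x_h - x_h^{h-1},\; x_0 - x_0^{h-1}) \qquad h \ge 2$$
E.g., $\phi_{11} = \text{corr}(x_1, x_0) = \rho(1)$
ARIMA($p, d, q$)
$$\nabla^d x_t = (1 - B)^d x_t \text{ is ARMA}(p, q)$$
$$\phi(B)(1 - B)^d x_t = \theta(B) w_t$$
Exponentially Weighted Moving Average (EWMA)
$$x_t = x_{t-1} + w_t - \lambda w_{t-1}$$
$$x_t = \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} x_{t-j} + w_t \qquad \text{when } |\lambda| < 1$$
$$\tilde x_{n+1} = (1 - \lambda) x_n + \lambda \tilde x_n$$
Seasonal ARIMA
Denoted by ARIMA($p, d, q$) $\times$ $(P, D, Q)_s$
$$\Phi_P(B^s)\, \phi(B)\, \nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s)\, \theta(B)\, w_t$$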
21.4.1 Causality and Invertibility
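ARMA($p, q$) is causal (future-independent) $\iff \exists \{\psi_j\} : \sum_{j=0}^{\infty} |\psi_j| < \infty$ such that
$$x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t$$
ARMA($p, q$) is invertible $\iff \exists \{\pi_j\} : \sum_{j=0}^{\infty} |\pi_j| < \infty$ such that
$$\pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t$$
Properties
ARMA($p, q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
$$\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)} \qquad |z| \le 1$$
ARMA($p, q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
$$\pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)} \qquad |z| \le 1$$
Behavior of the ACF and PACF for causal and invertible ARMA models

| | AR($p$) | MA($q$) | ARMA($p, q$) |
|---|---|---|---|
| ACF | tails off | cuts off after lag $q$ | tails off |
| PACF | cuts off after lag $p$ | tails off | tails off |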
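For a causal model, the coefficients $\psi_j$ of $\psi(z) = \theta(z) / \phi(z)$ can be obtained by matching powers of $z$ in $\psi(z)\phi(z) = \theta(z)$, which gives $\psi_j = \theta_j + \sum_{k=1}^{\min(j, p)} \phi_k \psi_{j-k}$ with $\theta_0 = 1$ and $\theta_j = 0$ for $j > q$. The sketch below implements this recursion; the function name and the ARMA(1,1) coefficients are illustrative.

```python
def psi_weights(phi, theta, count):
    """First `count` coefficients psi_0, psi_1, ... of theta(z) / phi(z)."""
    psi = []
    for j in range(count):
        if j == 0:
            theta_j = 1.0
        elif j <= len(theta):
            theta_j = theta[j - 1]
        else:
            theta_j = 0.0
        # Autoregressive part: sum of phi_k * psi_{j-k} over the available lags.
        ar_part = sum(phi[k - 1] * psi[j - k]
                      for k in range(1, min(j, len(phi)) + 1))
        psi.append(theta_j + ar_part)
    return psi


print(psi_weights([0.5], [0.4], 4))
```

For ARMA(1,1) the recursion reproduces the closed form $\psi_j = (\phi + \theta)\phi^{j-1}$ for $j \ge 1$.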
21.5 Spectral Analysis
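Periodic process
$$x_t = A \cos(2\pi\omega t + \phi) = U_1 \cos(2\pi\omega t) + U_2 \sin(2\pi\omega t)$$
Frequency index $\omega$ (cycles per unit time), period $1/\omega$
Amplitude $A$
Phase $\phi$
$U_1 = A \cos\phi$ and $U_2 = -A \sin\phi$ often normally distributed rvs
Periodic mixture
$$x_t = \sum_{k=1}^{q} \left(U_{k1} \cos(2\pi\omega_k t) + U_{k2} \sin(2\pi\omega_k t)\right)$$
$U_{k1}, U_{k2}$, for $k = 1, \dots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
$\gamma(h) = \sum_{k=1}^{q} \sigma_k^2 \cos(2\pi\omega_k h)$
$\gamma(0) = E\left[x_t^2\right] = \sum_{k=1}^{q} \sigma_k^2$
Spectral representation of a periodic process
$$\gamma(h) = \sigma^2 \cos(2\pi\omega_0 h) = \frac{\sigma^2}{2} e^{-2\pi i \omega_0 h} + \frac{\sigma^2}{2} e^{2\pi i \omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i \omega h}\,dF(\omega)$$
Spectral distribution function
$$F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2 / 2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0 \end{cases}$$
$F(-\infty) = F(-1/2) = 0$
$F(\infty) = F(1/2) = \gamma(0)$
Spectral density
$$f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i \omega h} \qquad -\frac{1}{2} \le \omega \le \frac{1}{2}$$
Needs $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i \omega h} f(\omega)\,d\omega \quad h = 0, \pm 1, \dots$
$f(\omega) \ge 0$
$f(\omega) = f(-\omega)$
$f(\omega) = f(1 - \omega)$
$\gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega)\,d\omega$
White noise: $f_w(\omega) = \sigma_w^2$
ARMA($p, q$), $\phi(B) x_t = \theta(B) w_t$:
$$f_x(\omega) = \sigma_w^2 \frac{|\theta(e^{-2\pi i \omega})|^2}{|\phi(e^{-2\pi i \omega})|^2}$$
where $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$ and $\theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k$
Discrete Fourier Transform (DFT)
$$d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t}$$
Fourier/Fundamental frequencies
$$\omega_j = j/n$$
Inverse DFT
$$x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i \omega_j t}$$
Periodogram
$$I(j/n) = |d(j/n)|^2$$
Scaled Periodogram
$$P(j/n) = \frac{4}{n} I(j/n) = \left(\frac{2}{n} \sum_{t=1}^{n} x_t \cos(2\pi t j / n)\right)^2 + \left(\frac{2}{n} \sum_{t=1}^{n} x_t \sin(2\pi t j / n)\right)^2$$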
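The DFT and scaled periodogram can be computed naively in $O(n)$ per frequency, which is enough to see how a pure cosine shows up. This is a minimal sketch, not an FFT; the function names and the test signal are illustrative assumptions.

```python
import cmath
import math


def dft(x, idx):
    """d(omega) at the Fourier frequency omega = idx / n, with t = 1, ..., n."""
    n = len(x)
    omega = idx / n
    total = sum(x[t - 1] * cmath.exp(-2j * math.pi * omega * t)
                for t in range(1, n + 1))
    return total / math.sqrt(n)


def periodogram(x, idx):
    return abs(dft(x, idx)) ** 2


def scaled_periodogram(x, idx):
    return 4.0 / len(x) * periodogram(x, idx)


# Unit-amplitude cosine with frequency 2/8 cycles per time step.
n = 8
x = [math.cos(2.0 * math.pi * t * 2 / n) for t in range(1, n + 1)]
print([round(scaled_periodogram(x, idx), 6) for idx in range(1, 4)])
```

The scaled periodogram equals the squared amplitude at the signal's own frequency and is numerically zero at the neighbouring Fourier frequencies.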
22 Math
22.1 Gamma Function
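Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\,dt$
Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t}\,dt$
Lower incomplete: $\gamma(s, x) = \int_0^{x} t^{s-1} e^{-t}\,dt$
$\Gamma(\alpha + 1) = \alpha \Gamma(\alpha) \qquad \alpha > 1$
$\Gamma(n) = (n - 1)! \qquad n \in \mathbb{N}$
$\Gamma(1/2) = \sqrt{\pi}$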
## 22.2 Beta Function

Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1} (1 - t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}$
Incomplete: $B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1}\,dt$
Regularized incomplete:
$$I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!}\, x^j (1 - x)^{a+b-1-j}$$
$I_0(a, b) = 0 \qquad I_1(a, b) = 1$
$I_x(a, b) = 1 - I_{1-x}(b, a)$
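The gamma and beta identities above can be spot-checked with Python's standard library. The sketch below compares the binomial-sum form of $I_x(a, b)$ for integer arguments with a crude midpoint-rule integral; the function names and the chosen arguments are illustrative.

```python
import math


def beta(x, y):
    """B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def reg_inc_beta_int(x, a, b):
    """I_x(a, b) for positive integers a, b via the binomial sum."""
    n = a + b - 1
    return sum(math.comb(n, j) * x ** j * (1 - x) ** (n - j)
               for j in range(a, n + 1))


def reg_inc_beta_numeric(x, a, b, steps=20000):
    """I_x(a, b) by midpoint-rule integration of B(x; a, b)."""
    w = x / steps
    integral = w * sum(((k + 0.5) * w) ** (a - 1) * (1 - (k + 0.5) * w) ** (b - 1)
                       for k in range(steps))
    return integral / beta(a, b)


print(reg_inc_beta_int(0.3, 2, 3), reg_inc_beta_numeric(0.3, 2, 3))
```

The same functions also make it easy to confirm the symmetry $I_x(a, b) = 1 - I_{1-x}(b, a)$ and values such as $\Gamma(5) = 4!$.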
22.3 Series
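Finite
$$\sum_{k=1}^{n} k = \frac{n(n + 1)}{2}$$
$$\sum_{k=1}^{n} (2k - 1) = n^2$$
$$\sum_{k=1}^{n} k^2 = \frac{n(n + 1)(2n + 1)}{6}$$
$$\sum_{k=1}^{n} k^3 = \left(\frac{n(n + 1)}{2}\right)^2$$
$$\sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1} \qquad c \neq 1$$
Binomial
$$\sum_{k=0}^{n} \binom{n}{k} = 2^n$$
$$\sum_{k=0}^{n} \binom{r + k}{k} = \binom{r + n + 1}{n}$$
$$\sum_{k=0}^{n} \binom{k}{m} = \binom{n + 1}{m + 1}$$
Vandermonde's Identity:
$$\sum_{k=0}^{r} \binom{m}{k} \binom{n}{r - k} = \binom{m + n}{r}$$
Binomial Theorem:
$$\sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = (a + b)^n$$
Infinite
$$\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}, \qquad \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p} \qquad |p| < 1$$
$$\sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp} \left(\sum_{k=0}^{\infty} p^k\right) = \frac{d}{dp} \left(\frac{1}{1 - p}\right) = \frac{1}{(1 - p)^2} \qquad |p| < 1$$
$$\sum_{k=0}^{\infty} \binom{r + k - 1}{k} x^k = (1 - x)^{-r} \qquad r \in \mathbb{N}^+$$
$$\sum_{k=0}^{\infty} \binom{\alpha}{k} p^k = (1 + p)^{\alpha} \qquad |p| < 1,\; \alpha \in \mathbb{C}$$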
22.4 Combinatorics
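Sampling

| $k$ out of $n$ | w/o replacement | w/ replacement |
|---|---|---|
| ordered | $n^{\underline{k}} = \prod_{i=0}^{k-1} (n - i) = \frac{n!}{(n - k)!}$ | $n^k$ |
| unordered | $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!\,(n - k)!}$ | $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$ |

Stirling numbers, 2nd kind
$$\left\{{n \atop k}\right\} = k \left\{{n - 1 \atop k}\right\} + \left\{{n - 1 \atop k - 1}\right\} \quad 1 \le k \le n \qquad \left\{{n \atop 0}\right\} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$$
Partitions
$$P_{n+k,k} = \sum_{i=1}^{k} P_{n,i} \qquad k > n : P_{n,k} = 0 \qquad n \ge 1 : P_{n,0} = 0,\; P_{0,0} = 1$$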
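The Stirling and partition numbers can be tabulated with short memoized recursions. The Stirling recursion below is the one stated above; for partitions the sketch assumes the standard recurrence $P_{n,k} = P_{n-1,k-1} + P_{n-k,k}$ (a partition either contains a part of size 1 or it does not), which is not taken from the cookbook. Function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to split n labelled items into k non-empty blocks."""
    if n == 0 and k == 0:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


@lru_cache(maxsize=None)
def partitions(n, k):
    """Number of partitions of the integer n into exactly k positive parts."""
    if n == 0 and k == 0:
        return 1
    if n <= 0 or k <= 0 or k > n:
        return 0
    # Either some part equals 1 (remove it) or all parts are >= 2 (subtract 1 from each).
    return partitions(n - 1, k - 1) + partitions(n - k, k)


print(stirling2(4, 2), stirling2(5, 3), partitions(7, 3))
```

The partition table can be used to confirm identities of the form $P_{n+k,k} = \sum_{i=1}^{k} P_{n,i}$, for example $P_{7,3} = P_{4,1} + P_{4,2} + P_{4,3} = 4$.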
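Balls and Urns $f : B \to U$, $D$ = distinguishable, $\neg D$ = indistinguishable.

| $\lvert B \rvert = n$, $\lvert U \rvert = m$ | $f$ arbitrary | $f$ injective | $f$ surjective | $f$ bijective |
|---|---|---|---|---|
| $B : D$, $U : D$ | $m^n$ | $m^{\underline{n}}$ if $m \ge n$, 0 else | $m! \left\{{n \atop m}\right\}$ | $n!$ if $m = n$, 0 else |
| $B : \neg D$, $U : D$ | $\binom{m + n - 1}{n}$ | $\binom{m}{n}$ | $\binom{n - 1}{m - 1}$ | 1 if $m = n$, 0 else |
| $B : D$, $U : \neg D$ | $\sum_{k=1}^{m} \left\{{n \atop k}\right\}$ | 1 if $m \ge n$, 0 else | $\left\{{n \atop m}\right\}$ | 1 if $m = n$, 0 else |
| $B : \neg D$, $U : \neg D$ | $\sum_{k=1}^{m} P_{n,k}$ | 1 if $m \ge n$, 0 else | $P_{n,m}$ | 1 if $m = n$, 0 else |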
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory.
Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships.
The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications:
With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie,
Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und
Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference.
Springer, 2003.
Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].