
# Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2012
vallentin@icir.org
13th August, 2012

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.
Contents
1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview
1.1 Discrete Distributions
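For each distribution the entries are the notation, the CDF $F_X(x)$, the PMF $f_X(x)$, the mean $E[X]$, the variance $V[X]$, and the MGF $M_X(s)$.¹

Uniform, $\text{Unif}\{a, \dots, b\}$
- $F_X(x) = 0$ for $x < a$; $\dfrac{\lfloor x \rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$; $1$ for $x > b$
- $f_X(x) = \dfrac{I(a \le x \le b)}{b - a + 1}$
- $E[X] = \dfrac{a + b}{2}$, $V[X] = \dfrac{(b - a + 1)^2 - 1}{12}$
- $M_X(s) = \dfrac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}$

Bernoulli, $\text{Bern}(p)$
- $F_X(x) = (1 - p)^{1 - x}$, $f_X(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$
- $E[X] = p$, $V[X] = p(1 - p)$
- $M_X(s) = 1 - p + p e^{s}$

Binomial, $\text{Bin}(n, p)$
- $F_X(x) = I_{1-p}(n - x, x + 1)$
- $f_X(x) = \dbinom{n}{x} p^x (1 - p)^{n - x}$
- $E[X] = np$, $V[X] = np(1 - p)$
- $M_X(s) = (1 - p + p e^{s})^n$

Multinomial, $\text{Mult}(n, p)$
- $f_X(x) = \dfrac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^{k} x_i = n$
- $E[X_i] = n p_i$, $V[X_i] = n p_i (1 - p_i)$
- $M_X(s) = \left( \sum_{i=1}^{k} p_i e^{s_i} \right)^n$

Hypergeometric, $\text{Hyp}(N, m, n)$
- $F_X(x) \approx \Phi\!\left( \dfrac{x - np}{\sqrt{np(1 - p)}} \right)$ with $p = m/N$
- $f_X(x) = \dfrac{\binom{m}{x} \binom{N - m}{n - x}}{\binom{N}{n}}$
- $E[X] = \dfrac{nm}{N}$, $V[X] = \dfrac{nm(N - n)(N - m)}{N^2 (N - 1)}$

Negative Binomial, $\text{NBin}(r, p)$
- $F_X(x) = I_p(r, x + 1)$
- $f_X(x) = \dbinom{x + r - 1}{r - 1} p^r (1 - p)^x$
- $E[X] = r\,\dfrac{1 - p}{p}$, $V[X] = r\,\dfrac{1 - p}{p^2}$
- $M_X(s) = \left( \dfrac{p}{1 - (1 - p) e^{s}} \right)^r$

Geometric, $\text{Geo}(p)$
- $F_X(x) = 1 - (1 - p)^x$, $x \in \mathbb{N}^+$
- $f_X(x) = p (1 - p)^{x - 1}$, $x \in \mathbb{N}^+$
- $E[X] = \dfrac{1}{p}$, $V[X] = \dfrac{1 - p}{p^2}$
- $M_X(s) = \dfrac{p e^{s}}{1 - (1 - p) e^{s}}$

Poisson, $\text{Po}(\lambda)$
- $F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \dfrac{\lambda^i}{i!}$
- $f_X(x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$
- $E[X] = \lambda$, $V[X] = \lambda$
- $M_X(s) = e^{\lambda (e^{s} - 1)}$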
[Figure: PMF plots of the discrete Uniform, Binomial ($n = 40, p = 0.3$; $n = 30, p = 0.6$; $n = 25, p = 0.9$), Geometric ($p = 0.2, 0.5, 0.8$), and Poisson ($\lambda = 1, 4, 10$) distributions.]
¹ We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).
1.2 Continuous Distributions
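For each distribution the entries are the notation, the CDF $F_X(x)$, the PDF $f_X(x)$, the mean $E[X]$, the variance $V[X]$, and the MGF $M_X(s)$.

Uniform, $\text{Unif}(a, b)$
- $F_X(x) = 0$ for $x < a$; $\dfrac{x - a}{b - a}$ for $a < x < b$; $1$ for $x > b$
- $f_X(x) = \dfrac{I(a < x < b)}{b - a}$
- $E[X] = \dfrac{a + b}{2}$, $V[X] = \dfrac{(b - a)^2}{12}$
- $M_X(s) = \dfrac{e^{sb} - e^{sa}}{s(b - a)}$

Normal, $\mathcal{N}(\mu, \sigma^2)$
- $\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$
- $\phi(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\!\left\{ -\dfrac{(x - \mu)^2}{2\sigma^2} \right\}$
- $E[X] = \mu$, $V[X] = \sigma^2$
- $M_X(s) = \exp\!\left\{ \mu s + \dfrac{\sigma^2 s^2}{2} \right\}$

Log-Normal, $\ln\mathcal{N}(\mu, \sigma^2)$
- $F_X(x) = \dfrac{1}{2} + \dfrac{1}{2}\,\text{erf}\!\left[ \dfrac{\ln x - \mu}{\sqrt{2\sigma^2}} \right]$
- $f_X(x) = \dfrac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\dfrac{(\ln x - \mu)^2}{2\sigma^2} \right\}$
- $E[X] = e^{\mu + \sigma^2/2}$, $V[X] = (e^{\sigma^2} - 1)\, e^{2\mu + \sigma^2}$

Multivariate Normal, $\text{MVN}(\mu, \Sigma)$
- $f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\!\left\{ -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$
- $E[X] = \mu$, $V[X] = \Sigma$
- $M_X(s) = \exp\!\left\{ \mu^T s + \tfrac{1}{2} s^T \Sigma s \right\}$

Student's t, $\text{Student}(\nu)$
- $F_X(x) = I_x\!\left( \dfrac{\nu}{2}, \dfrac{\nu}{2} \right)$
- $f_X(x) = \dfrac{\Gamma\!\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\,\Gamma\!\left( \frac{\nu}{2} \right)} \left( 1 + \dfrac{x^2}{\nu} \right)^{-(\nu + 1)/2}$
- $E[X] = 0$; $V[X] = \dfrac{\nu}{\nu - 2}$ for $\nu > 2$, $\infty$ for $1 < \nu \le 2$

Chi-square, $\chi^2_k$
- $F_X(x) = \dfrac{1}{\Gamma(k/2)}\,\gamma\!\left( \dfrac{k}{2}, \dfrac{x}{2} \right)$
- $f_X(x) = \dfrac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$
- $E[X] = k$, $V[X] = 2k$
- $M_X(s) = (1 - 2s)^{-k/2}$, $s < 1/2$

F, $\text{F}(d_1, d_2)$
- $F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\!\left( \dfrac{d_1}{2}, \dfrac{d_2}{2} \right)$
- $f_X(x) = \dfrac{\sqrt{\dfrac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\,B\!\left( \frac{d_1}{2}, \frac{d_2}{2} \right)}$
- $E[X] = \dfrac{d_2}{d_2 - 2}$, $V[X] = \dfrac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$

Exponential, $\text{Exp}(\beta)$
- $F_X(x) = 1 - e^{-x/\beta}$, $f_X(x) = \dfrac{1}{\beta} e^{-x/\beta}$
- $E[X] = \beta$, $V[X] = \beta^2$
- $M_X(s) = \dfrac{1}{1 - \beta s}$, $s < 1/\beta$

Gamma, $\text{Gamma}(\alpha, \beta)$
- $F_X(x) = \dfrac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$
- $f_X(x) = \dfrac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha - 1} e^{-x/\beta}$
- $E[X] = \alpha\beta$, $V[X] = \alpha\beta^2$
- $M_X(s) = \left( \dfrac{1}{1 - \beta s} \right)^{\alpha}$, $s < 1/\beta$

Inverse Gamma, $\text{InvGamma}(\alpha, \beta)$
- $F_X(x) = \dfrac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$
- $f_X(x) = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha - 1} e^{-\beta/x}$
- $E[X] = \dfrac{\beta}{\alpha - 1}$ for $\alpha > 1$; $V[X] = \dfrac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$ for $\alpha > 2$
- $M_X(s) = \dfrac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}\, K_{\alpha}\!\left( \sqrt{-4\beta s} \right)$

Dirichlet, $\text{Dir}(\alpha)$
- $f_X(x) = \dfrac{\Gamma\!\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}$
- $E[X_i] = \dfrac{\alpha_i}{\sum_{i=1}^{k} \alpha_i}$, $V[X_i] = \dfrac{E[X_i]\,(1 - E[X_i])}{\sum_{i=1}^{k} \alpha_i + 1}$

Beta, $\text{Beta}(\alpha, \beta)$
- $F_X(x) = I_x(\alpha, \beta)$
- $f_X(x) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$
- $E[X] = \dfrac{\alpha}{\alpha + \beta}$, $V[X] = \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
- $M_X(s) = 1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \dfrac{\alpha + r}{\alpha + \beta + r} \right) \dfrac{s^k}{k!}$

Weibull, $\text{Weibull}(\lambda, k)$
- $F_X(x) = 1 - e^{-(x/\lambda)^k}$
- $f_X(x) = \dfrac{k}{\lambda} \left( \dfrac{x}{\lambda} \right)^{k - 1} e^{-(x/\lambda)^k}$
- $E[X] = \lambda\,\Gamma\!\left( 1 + \dfrac{1}{k} \right)$, $V[X] = \lambda^2\,\Gamma\!\left( 1 + \dfrac{2}{k} \right) - \mu^2$
- $M_X(s) = \sum_{n=0}^{\infty} \dfrac{s^n \lambda^n}{n!}\,\Gamma\!\left( 1 + \dfrac{n}{k} \right)$

Pareto, $\text{Pareto}(x_m, \alpha)$
- $F_X(x) = 1 - \left( \dfrac{x_m}{x} \right)^{\alpha}$, $x \ge x_m$
- $f_X(x) = \alpha\,\dfrac{x_m^{\alpha}}{x^{\alpha + 1}}$, $x \ge x_m$
- $E[X] = \dfrac{\alpha x_m}{\alpha - 1}$ for $\alpha > 1$; $V[X] = \dfrac{x_m^2\,\alpha}{(\alpha - 1)^2 (\alpha - 2)}$ for $\alpha > 2$
- $M_X(s) = \alpha (-x_m s)^{\alpha}\,\Gamma(-\alpha, -x_m s)$, $s < 0$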
[Figure: PDF plots of the continuous Uniform, Normal, Log-Normal, Student's t, $\chi^2$, F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings.]
2 Probability Theory
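Definitions
- Sample space $\Omega$
- Outcome (point or element) $\omega \in \Omega$
- Event $A \subseteq \Omega$
- $\sigma$-algebra $\mathcal{A}$
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
- Probability distribution $P$
  1. $P[A] \ge 0 \quad \forall A$
  2. $P[\Omega] = 1$
  3. $P\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]$ for disjoint $A_i$
- Probability space $(\Omega, \mathcal{A}, P)$

Properties
- $P[\emptyset] = 0$
- $B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
- $P[\neg A] = 1 - P[A]$
- $P[B] = P[A \cap B] + P[\neg A \cap B]$
- $P[\Omega] = 1$, $P[\emptyset] = 0$
- DeMorgan: $\neg\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n$ and $\neg\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n$
- $P\left[ \bigcup_n A_n \right] = 1 - P\left[ \bigcap_n \neg A_n \right]$
- $P[A \cup B] = P[A] + P[B] - P[A \cap B] \implies P[A \cup B] \le P[A] + P[B]$
- $P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]$
- $P[A \cap \neg B] = P[A] - P[A \cap B]$

Continuity of Probabilities
- $A_1 \subset A_2 \subset \dots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
- $A_1 \supset A_2 \supset \dots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence
$$A \perp\!\!\!\perp B \iff P[A \cap B] = P[A]\,P[B]$$

Conditional Probability
$$P[A \mid B] = \frac{P[A \cap B]}{P[B]}, \qquad P[B] > 0$$

Law of Total Probability
$$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\,P[A_i], \qquad \Omega = \bigsqcup_{i=1}^{n} A_i$$

Bayes' Theorem
$$P[A_i \mid B] = \frac{P[B \mid A_i]\,P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\,P[A_j]}, \qquad \Omega = \bigsqcup_{i=1}^{n} A_i$$

Inclusion-Exclusion Principle
$$P\left[ \bigcup_{i=1}^{n} A_i \right] = \sum_{r=1}^{n} (-1)^{r-1} \sum_{1 \le i_1 < \dots < i_r \le n} P\left[ \bigcap_{j=1}^{r} A_{i_j} \right]$$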
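The law of total probability and Bayes' theorem can be sketched in a few lines of Python. The function name `posterior` and the diagnostic-test numbers are illustrative assumptions, not part of the cookbook.

```python
def posterior(priors, likelihoods):
    """Return P[A_i | B] for a partition A_1, ..., A_n of the sample space.

    priors[i]      = P[A_i]      (assumed to sum to 1)
    likelihoods[i] = P[B | A_i]
    """
    # Law of total probability: P[B] = sum_i P[B | A_i] P[A_i]
    p_b = sum(l * p for l, p in zip(likelihoods, priors))
    if p_b == 0:
        raise ValueError("P[B] = 0: the conditional probability is undefined")
    # Bayes' theorem, applied to every element of the partition
    return [l * p / p_b for l, p in zip(likelihoods, priors)]


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate.
post = posterior(priors=[0.01, 0.99], likelihoods=[0.95, 0.10])
```

Here `post[0]` is the probability of the condition given a positive test; the small prior keeps it well below the 95% sensitivity.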

3 Random Variables
Random Variable (RV)
$$X : \Omega \to \mathbb{R}$$

Probability Mass Function (PMF)
$$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$$

Probability Density Function (PDF)
$$P[a \le X \le b] = \int_a^b f(x)\,dx$$

Cumulative Distribution Function (CDF)
$$F_X : \mathbb{R} \to [0, 1], \qquad F_X(x) = P[X \le x]$$
1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density
$$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\,dy, \qquad a \le b$$
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$$

Independence
1. $P[X \le x, Y \le y] = P[X \le x]\,P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$
3.1 Transformations
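Transformation function
$$Z = \varphi(X)$$

Discrete
$$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$$

Continuous
$$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\,dx \quad \text{with } A_z = \{x : \varphi(x) \le z\}$$

Special case if $\varphi$ is strictly monotone
$$f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x)\,\frac{1}{|J|}$$

The Rule of the Lazy Statistician
$$E[Z] = \int \varphi(x)\,dF_X(x)$$
$$E[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = P[X \in A]$$

Convolution
- $Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\,dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\,dx$
- $Z := |X - Y|$: $f_Z(z) = 2 \int_0^{\infty} f_{X,Y}(x, z + x)\,dx$
- $Z := \dfrac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(yz, y)\,dy$, which for $X \perp\!\!\!\perp Y$ equals $\int_{-\infty}^{\infty} |y|\, f_X(yz)\, f_Y(y)\,dy$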
4 Expectation
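Definition and properties
- $E[X] = \mu_X = \int x\,dF_X(x) = \sum_x x f_X(x)$ if $X$ is discrete, and $\int x f_X(x)\,dx$ if $X$ is continuous
- $P[X = c] = 1 \implies E[X] = c$
- $E[cX] = c\,E[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = \iint xy\, f_{X,Y}(x, y)\,dx\,dy$
- $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen inequality)
- $P[X \ge Y] = 1 \implies E[X] \ge E[Y]$, and $P[X = Y] = 1 \implies E[X] = E[Y]$
- $E[X] = \sum_{x=1}^{\infty} P[X \ge x]$ for $X$ taking values in $\{0, 1, 2, \dots\}$

Sample mean
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

Conditional expectation
- $E[Y \mid X = x] = \int y\, f(y \mid x)\,dy$
- $E[X] = E[E[X \mid Y]]$
- $E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y)\, f_{Y|X}(y \mid x)\,dy$
- $E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z)\, f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
- $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
- $E[\varphi(X) Y \mid X] = \varphi(X)\,E[Y \mid X]$
- $E[Y \mid X] = c \implies \text{Cov}[X, Y] = 0$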
5 Variance
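Definition and properties
- $V[X] = \sigma_X^2 = E\left[ (X - E[X])^2 \right] = E[X^2] - E[X]^2$
- $V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2 \sum_{i < j} \text{Cov}[X_i, X_j]$
- $V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]$ if $X_i \perp\!\!\!\perp X_j$ for $i \ne j$

Standard deviation
$$\text{sd}[X] = \sqrt{V[X]} = \sigma_X$$

Covariance
- $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$
- $\text{Cov}[X, a] = 0$
- $\text{Cov}[X, X] = V[X]$
- $\text{Cov}[X, Y] = \text{Cov}[Y, X]$
- $\text{Cov}[aX, bY] = ab\,\text{Cov}[X, Y]$
- $\text{Cov}[X + a, Y + b] = \text{Cov}[X, Y]$
- $\text{Cov}\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} \text{Cov}[X_i, Y_j]$

Correlation
$$\rho[X, Y] = \frac{\text{Cov}[X, Y]}{\sqrt{V[X]\,V[Y]}}$$

Independence
$$X \perp\!\!\!\perp Y \implies \rho[X, Y] = 0 \iff \text{Cov}[X, Y] = 0 \iff E[XY] = E[X]\,E[Y]$$

Sample variance
$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

Conditional variance
- $V[Y \mid X] = E\left[ (Y - E[Y \mid X])^2 \mid X \right] = E[Y^2 \mid X] - E[Y \mid X]^2$
- $V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$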
6 Inequalities
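Cauchy-Schwarz
$$E[XY]^2 \le E[X^2]\,E[Y^2]$$

Markov
$$P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$$

Chebyshev
$$P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$$

Chernoff
$$P[X \ge (1 + \delta)\mu] \le \left( \frac{e^{\delta}}{(1 + \delta)^{1 + \delta}} \right)^{\mu}, \qquad \delta > -1$$

Jensen
$$E[\varphi(X)] \ge \varphi(E[X]), \qquad \varphi \text{ convex}$$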
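As a quick numerical illustration of the Markov and Chebyshev bounds, the sketch below compares empirical tail frequencies with the bounds on a simulated sample. The exponential sample, the seed, and the threshold `t` are arbitrary assumptions chosen for the example.

```python
import random
import statistics

random.seed(0)
# Hypothetical sample: 10,000 draws from an exponential with mean 2.
# The values are nonnegative, so Markov applies with phi(x) = x.
xs = [random.expovariate(0.5) for _ in range(10_000)]
mean = statistics.fmean(xs)
var = statistics.pvariance(xs, mu=mean)  # 1/n variance of the empirical distribution


def tail(pred):
    """Empirical probability of the event described by pred."""
    return sum(1 for x in xs if pred(x)) / len(xs)


t = 5.0
markov_lhs, markov_rhs = tail(lambda x: x >= t), mean / t
cheb_lhs, cheb_rhs = tail(lambda x: abs(x - mean) >= t), var / t**2
```

Both bounds hold for the empirical distribution itself, so the left-hand sides never exceed the right-hand sides; in this example they are far from tight.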
7 Distribution Relationships
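Binomial
- $X_i \sim \text{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \text{Bin}(n, p)$
- $X \sim \text{Bin}(n, p),\ Y \sim \text{Bin}(m, p) \implies X + Y \sim \text{Bin}(n + m, p)$
- $\lim_{n \to \infty} \text{Bin}(n, p) = \text{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n \to \infty} \text{Bin}(n, p) = \mathcal{N}(np, np(1 - p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \text{NBin}(1, p) = \text{Geo}(p)$
- $X \sim \text{NBin}(r, p) = \sum_{i=1}^{r} \text{Geo}(p)$
- $X_i \sim \text{NBin}(r_i, p) \implies \sum X_i \sim \text{NBin}\left( \sum r_i, p \right)$
- $X \sim \text{NBin}(r, p),\ Y \sim \text{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Po}\left( \sum_{i=1}^{n} \lambda_i \right)$
- $X_i \sim \text{Po}(\lambda_i) \wedge X_i \perp\!\!\!\perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \text{Bin}\left( \sum_{j=1}^{n} X_j,\ \dfrac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \right)$

Exponential
- $X_i \sim \text{Exp}(\beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \beta)$
- Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \dfrac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X \sim \mathcal{N}(\mu_1, \sigma_1^2) \wedge Y \sim \mathcal{N}(\mu_2, \sigma_2^2) \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \implies \sum_i X_i \sim \mathcal{N}\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
- $P[a < X \le b] = \Phi\left( \dfrac{b - \mu}{\sigma} \right) - \Phi\left( \dfrac{a - \mu}{\sigma} \right)$
- $\Phi(-x) = 1 - \Phi(x)$, $\quad \phi'(x) = -x\phi(x)$, $\quad \phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_{\alpha} = \Phi^{-1}(1 - \alpha)$

Gamma
- $X \sim \text{Gamma}(\alpha, \beta) \iff X/\beta \sim \text{Gamma}(\alpha, 1)$
- $\text{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \text{Exp}(\beta)$
- $X_i \sim \text{Gamma}(\alpha_i, \beta) \wedge X_i \perp\!\!\!\perp X_j \implies \sum_i X_i \sim \text{Gamma}\left( \sum_i \alpha_i, \beta \right)$
- $\dfrac{\Gamma(\alpha)}{\lambda^{\alpha}} = \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\,dx$

Beta
- $\dfrac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$
- $E[X^k] = \dfrac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \dfrac{\alpha + k - 1}{\alpha + \beta + k - 1}\, E[X^{k-1}]$
- $\text{Beta}(1, 1) \sim \text{Unif}(0, 1)$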
8 Probability and Moment Generating Functions
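- $G_X(t) = E[t^X]$, $\quad |t| < 1$
- $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\left[ \sum_{i=0}^{\infty} \dfrac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty} \dfrac{E[X^i]}{i!}\, t^i$
- $P[X = 0] = G_X(0)$
- $P[X = 1] = G_X'(0)$
- $P[X = i] = \dfrac{G_X^{(i)}(0)}{i!}$
- $E[X] = G_X'(1^-)$
- $E[X^k] = M_X^{(k)}(0)$
- $E\left[ \dfrac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)$
- $V[X] = G_X''(1^-) + G_X'(1^-) - \left( G_X'(1^-) \right)^2$
- $G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$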
9 Multivariate Distributions
9.1 Standard Bivariate Normal
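Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp\!\!\!\perp Z$, and let $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density
$$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{ -\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)} \right\}$$

Conditionals
$$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2) \quad \text{and} \quad (X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$$

Independence
$$X \perp\!\!\!\perp Y \iff \rho = 0$$

9.2 Bivariate Normal
Let $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and $Y \sim \mathcal{N}(\mu_y, \sigma_y^2)$.
$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\left\{ -\frac{z}{2(1 - \rho^2)} \right\}$$
$$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right) \left( \frac{y - \mu_y}{\sigma_y} \right)$$

Conditional mean and variance
$$E[X \mid Y] = E[X] + \rho\,\frac{\sigma_X}{\sigma_Y}\,(Y - E[Y])$$
$$V[X \mid Y] = \sigma_X^2\,(1 - \rho^2)$$

9.3 Multivariate Normal
Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$)
$$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \text{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \text{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$$

If $X \sim \mathcal{N}(\mu, \Sigma)$,
$$f_X(x) = (2\pi)^{-n/2}\, |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

Properties
- $Z \sim \mathcal{N}(0, 1) \wedge X = \mu + \Sigma^{1/2} Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
- $X \sim \mathcal{N}(\mu, \Sigma) \wedge a$ a vector of length $k \implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$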
10 Convergence
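Let $\{X_1, X_2, \dots\}$ be a sequence of rv's and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of convergence
1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$
   $$\lim_{n \to \infty} F_n(t) = F(t) \quad \forall t \text{ where } F \text{ is continuous}$$
2. In probability: $X_n \xrightarrow{P} X$
   $$(\forall \varepsilon > 0) \quad \lim_{n \to \infty} P[|X_n - X| > \varepsilon] = 0$$
3. Almost surely (strongly): $X_n \xrightarrow{as} X$
   $$P\left[ \lim_{n \to \infty} X_n = X \right] = P\left[ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right] = 1$$
4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$
   $$\lim_{n \to \infty} E\left[ (X_n - X)^2 \right] = 0$$

Relationships
- $X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
- $X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{D} X \wedge (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \xrightarrow{P} X$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
- $X_n \xrightarrow{qm} X \wedge Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
- $X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n Y_n \xrightarrow{P} XY$
- $X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$ for continuous $\varphi$
- $X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$ for continuous $\varphi$
- $X_n \xrightarrow{qm} b \iff \lim_{n \to \infty} E[X_n] = b \wedge \lim_{n \to \infty} V[X_n] = 0$
- $X_1, \dots, X_n$ iid $\wedge\ E[X] = \mu \wedge V[X] < \infty \implies \bar{X}_n \xrightarrow{qm} \mu$

Slutzky's Theorem
- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
- $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n Y_n \xrightarrow{D} cX$
- In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$

10.1 Law of Large Numbers (LLN)
Let $\{X_1, \dots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] < \infty$.

Weak (WLLN)
$$\bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty$$

Strong (SLLN)
$$\bar{X}_n \xrightarrow{as} \mu \quad \text{as } n \to \infty$$

10.2 Central Limit Theorem (CLT)
Let $\{X_1, \dots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] = \sigma^2$.
$$Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z \quad \text{where } Z \sim \mathcal{N}(0, 1)$$
$$\lim_{n \to \infty} P[Z_n \le z] = \Phi(z) \quad z \in \mathbb{R}$$

CLT notations
- $Z_n \approx \mathcal{N}(0, 1)$
- $\bar{X}_n \approx \mathcal{N}\left( \mu, \dfrac{\sigma^2}{n} \right)$
- $\bar{X}_n - \mu \approx \mathcal{N}\left( 0, \dfrac{\sigma^2}{n} \right)$
- $\sqrt{n}(\bar{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$
- $\dfrac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx \mathcal{N}(0, 1)$

Continuity correction
$$P[\bar{X}_n \le x] \approx \Phi\left( \frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)$$
$$P[\bar{X}_n \ge x] \approx 1 - \Phi\left( \frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)$$

Delta method
$$Y_n \approx \mathcal{N}\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx \mathcal{N}\left( \varphi(\mu), (\varphi'(\mu))^2\,\frac{\sigma^2}{n} \right)$$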
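A small simulation makes the CLT statement concrete. The sketch below standardizes sample means of exponential draws and compares the empirical value of $P[Z_n \le 1.96]$ with $\Phi(1.96)$. The choice of distribution, sample size `n`, number of replications `reps`, and seed are assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 1.0, 1.0, 50, 4000  # Exp with rate 1 has mean 1 and sd 1


def z_n():
    """Standardized sample mean sqrt(n) * (Xbar_n - mu) / sigma."""
    xbar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    return math.sqrt(n) * (xbar - mu) / sigma


zs = [z_n() for _ in range(reps)]
# Fraction of simulated Z_n at or below 1.96, next to the normal CDF value.
frac = sum(z <= 1.96 for z in zs) / reps
phi = statistics.NormalDist().cdf(1.96)
```

For a skewed distribution such as the exponential, `frac` approaches `phi` only gradually as `n` grows; the agreement at `n = 50` is close but not exact.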
11 Statistical Inference
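Let $X_1, \dots, X_n \overset{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation
- Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \dots, X_n)$
- $\text{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$
- Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
- Sampling distribution: $F(\hat{\theta}_n)$
- Standard error: $\text{se}(\hat{\theta}_n) = \sqrt{V[\hat{\theta}_n]}$
- Mean squared error: $\text{mse} = E\left[ (\hat{\theta}_n - \theta)^2 \right] = \text{bias}(\hat{\theta}_n)^2 + V[\hat{\theta}_n]$
- $\lim_{n \to \infty} \text{bias}(\hat{\theta}_n) = 0 \wedge \lim_{n \to \infty} \text{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
- Asymptotic normality: $\dfrac{\hat{\theta}_n - \theta}{\text{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
- Slutzky's Theorem often lets us replace $\text{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.

11.2 Normal-Based Confidence Interval
Suppose $\hat{\theta}_n \approx \mathcal{N}(\theta, \widehat{\text{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - (\alpha/2))$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim \mathcal{N}(0, 1)$. Then
$$C_n = \hat{\theta}_n \pm z_{\alpha/2}\,\widehat{\text{se}}$$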
11.3 Empirical distribution
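Empirical Distribution Function (ECDF)
$$\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}, \qquad I(X_i \le x) = \begin{cases} 1 & X_i \le x \\ 0 & X_i > x \end{cases}$$

Properties (for any fixed $x$)
- $E[\hat{F}_n] = F(x)$
- $V[\hat{F}_n] = \dfrac{F(x)(1 - F(x))}{n}$
- $\text{mse} = \dfrac{F(x)(1 - F(x))}{n} \to 0$
- $\hat{F}_n \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \dots, X_n \sim F$)
$$P\left[ \sup_x \left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2e^{-2n\varepsilon^2}$$

Nonparametric $1 - \alpha$ confidence band for $F$
$$L(x) = \max\{\hat{F}_n - \varepsilon_n, 0\}$$
$$U(x) = \min\{\hat{F}_n + \varepsilon_n, 1\}$$
$$\varepsilon_n = \sqrt{\frac{1}{2n} \log\left( \frac{2}{\alpha} \right)}$$
$$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$$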
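The ECDF and the DKW confidence band translate directly into code. The following is a minimal sketch; the function names `ecdf` and `dkw_band` and the five data points are hypothetical.

```python
import bisect
import math


def ecdf(sample):
    """Return F_hat_n as a function x -> (number of X_i <= x) / n."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n


def dkw_band(sample, alpha=0.05):
    """Return (L, U), the lower and upper 1 - alpha band functions."""
    n = len(sample)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))  # epsilon_n from the DKW inequality
    f_hat = ecdf(sample)
    lower = lambda x: max(f_hat(x) - eps, 0.0)
    upper = lambda x: min(f_hat(x) + eps, 1.0)
    return lower, upper


data = [2.1, 0.4, 3.3, 1.7, 0.9]  # hypothetical observations
F = ecdf(data)
L, U = dkw_band(data, alpha=0.05)
```

With only five observations the band is very wide (here $\varepsilon_n \approx 0.61$), which reflects that the guarantee holds simultaneously for all $x$ and for any $F$.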
11.4 Statistical Functionals
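- Statistical functional: $T(F)$
- Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
- Linear functional: $T(F) = \int \varphi(x)\,dF_X(x)$
- Plug-in estimator for linear functional:
  $$T(\hat{F}_n) = \int \varphi(x)\,d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)$$
- Often: $T(\hat{F}_n) \approx \mathcal{N}(T(F), \widehat{\text{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\text{se}}$
- $p$-th quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
- $\hat{\mu} = \bar{X}_n$
- $\hat{\sigma}^2 = \dfrac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
- $\hat{\kappa} = \dfrac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
- $\hat{\rho} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$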
12 Parametric Inference
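Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subset \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

12.1 Method of Moments
$j$-th moment
$$\alpha_j(\theta) = E[X^j] = \int x^j\,dF_X(x)$$

$j$-th sample moment
$$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$$

Method of moments estimator (MoM): $\hat{\theta}_n$ solves
$$\alpha_1(\theta) = \hat{\alpha}_1, \quad \alpha_2(\theta) = \hat{\alpha}_2, \quad \dots, \quad \alpha_k(\theta) = \hat{\alpha}_k$$

Properties of the MoM estimator
- $\hat{\theta}_n$ exists with probability tending to 1
- Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
- Asymptotic normality: $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{D} \mathcal{N}(0, \Sigma)$, where $\Sigma = g\,E[YY^T]\,g^T$, $Y = (X, X^2, \dots, X^k)^T$, $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial}{\partial\theta} \alpha_j^{-1}(\theta)$

12.2 Maximum Likelihood
Likelihood: $L_n : \Theta \to [0, \infty)$
$$L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$$

Log-likelihood
$$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$$

Maximum likelihood estimator (mle)
$$L_n(\hat{\theta}_n) = \sup_{\theta} L_n(\theta)$$

Score function
$$s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta)$$

Fisher information
$$I(\theta) = V_{\theta}[s(X; \theta)], \qquad I_n(\theta) = n I(\theta)$$

Fisher information (exponential family)
$$I(\theta) = E_{\theta}\left[ -\frac{\partial}{\partial\theta} s(X; \theta) \right]$$

Observed Fisher information
$$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2} \sum_{i=1}^{n} \log f(X_i; \theta)$$

Properties of the mle
- Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
- Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
- Asymptotic normality:
  1. $\text{se} \approx \sqrt{1/I_n(\theta)}$ and $\dfrac{\hat{\theta}_n - \theta}{\text{se}} \xrightarrow{D} \mathcal{N}(0, 1)$
  2. $\widehat{\text{se}} \approx \sqrt{1/I_n(\hat{\theta}_n)}$ and $\dfrac{\hat{\theta}_n - \theta}{\widehat{\text{se}}} \xrightarrow{D} \mathcal{N}(0, 1)$
- Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is
  $$\text{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{V[\hat{\theta}_n]}{V[\tilde{\theta}_n]} \le 1$$
- Approximately the Bayes estimator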
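As a worked instance of these definitions, consider $\text{Exp}(\beta)$ with density $\frac{1}{\beta} e^{-x/\beta}$. The log-likelihood is $\ell_n(\beta) = -n \log\beta - \sum_i X_i/\beta$, so $\hat{\beta}_n = \bar{X}_n$, $I(\beta) = 1/\beta^2$, and $\widehat{\text{se}} = \hat{\beta}_n/\sqrt{n}$. The sketch below computes these quantities; the function name `exp_mle` and the data are illustrative assumptions.

```python
import math
import statistics


def exp_mle(xs, alpha=0.05):
    """MLE, standard error, and normal-based CI for the mean beta of Exp(beta)."""
    n = len(xs)
    beta_hat = statistics.fmean(xs)      # maximizes -n log(beta) - sum(x) / beta
    fisher_n = n / beta_hat**2           # I_n(beta_hat) = n * I(beta_hat)
    se_hat = math.sqrt(1 / fisher_n)     # equals beta_hat / sqrt(n)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return beta_hat, se_hat, (beta_hat - z * se_hat, beta_hat + z * se_hat)


beta_hat, se_hat, ci = exp_mle([1.2, 0.3, 2.5, 0.8, 1.7, 0.5])  # hypothetical data
```

The interval relies on asymptotic normality, so with six observations it should be read as a rough approximation only.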
12.2.1 Delta Method
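If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:
$$\frac{\hat{\tau}_n - \tau}{\widehat{\text{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$$
where $\hat{\tau} = \varphi(\hat{\theta})$ is the mle of $\tau$ and
$$\widehat{\text{se}} = \left| \varphi'(\hat{\theta}) \right| \widehat{\text{se}}(\hat{\theta}_n)$$

12.3 Multiparameter Models
Let $\theta = (\theta_1, \dots, \theta_k)$ and $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.
$$H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2}, \qquad H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j\,\partial\theta_k}$$

Fisher information matrix
$$I_n(\theta) = -\begin{pmatrix} E_{\theta}[H_{11}] & \cdots & E_{\theta}[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_{\theta}[H_{k1}] & \cdots & E_{\theta}[H_{kk}] \end{pmatrix}$$

Under appropriate regularity conditions
$$(\hat{\theta} - \theta) \approx \mathcal{N}(0, J_n)$$
with $J_n(\theta) = I_n^{-1}$. Further, if $\hat{\theta}_j$ is the $j$-th component of $\hat{\theta}$, then
$$\frac{\hat{\theta}_j - \theta_j}{\widehat{\text{se}}_j} \xrightarrow{D} \mathcal{N}(0, 1)$$
where $\widehat{\text{se}}_j^2 = J_n(j, j)$ and $\text{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$.

12.3.1 Multiparameter delta method
Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ and let the gradient of $\varphi$ be
$$\nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \dots, \frac{\partial\varphi}{\partial\theta_k} \right)^T$$
Suppose $\nabla\varphi\big|_{\theta = \hat{\theta}} \ne 0$ and $\hat{\tau} = \varphi(\hat{\theta})$. Then
$$\frac{\hat{\tau} - \tau}{\widehat{\text{se}}(\hat{\tau})} \xrightarrow{D} \mathcal{N}(0, 1)$$
where
$$\widehat{\text{se}}(\hat{\tau}) = \sqrt{(\hat{\nabla}\varphi)^T\, \hat{J}_n\, (\hat{\nabla}\varphi)}$$
and $\hat{J}_n = J_n(\hat{\theta})$ and $\hat{\nabla}\varphi = \nabla\varphi\big|_{\theta = \hat{\theta}}$.

12.4 Parametric Bootstrap
Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or the method of moments estimator.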
13 Hypothesis Testing
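$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1$$

Definitions
- Null hypothesis $H_0$
- Alternative hypothesis $H_1$
- Simple hypothesis $\theta = \theta_0$
- Composite hypothesis $\theta > \theta_0$ or $\theta < \theta_0$
- Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
- One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
- Critical value $c$
- Test statistic $T$
- Rejection region $R = \{x : T(x) > c\}$
- Power function $\beta(\theta) = P[X \in R]$
- Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
- Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

| | Retain $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ true | correct | Type I Error ($\alpha$) |
| $H_1$ true | Type II Error ($\beta$) | correct (power) |

p-value
- $\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_{\alpha}\}$
- $\text{p-value} = \sup_{\theta \in \Theta_0} P_{\theta}[T(X^*) \ge T(X)] = \inf\{\alpha : T(X) \in R_{\alpha}\}$, where $P_{\theta}[T(X^*) \ge T(X)] = 1 - F_{\theta}(T(X))$ since $T(X^*) \sim F_{\theta}$

| p-value | evidence |
|---|---|
| $< 0.01$ | very strong evidence against $H_0$ |
| $0.01 - 0.05$ | strong evidence against $H_0$ |
| $0.05 - 0.1$ | weak evidence against $H_0$ |
| $> 0.1$ | little or no evidence against $H_0$ |

Wald test
- Two-sided test
- Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \dfrac{\hat{\theta} - \theta_0}{\widehat{\text{se}}}$
- $P[|W| > z_{\alpha/2}] \to \alpha$
- $\text{p-value} = P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$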
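The Wald test is short enough to write out in full. This sketch assumes a hypothetical coin-tossing experiment and uses the plug-in standard error of a proportion; the function name `wald_test` is made up for the example.

```python
import statistics


def wald_test(theta_hat, theta_0, se_hat, alpha=0.05):
    """Two-sided Wald test; returns (W, p_value, reject)."""
    std_normal = statistics.NormalDist()
    w = (theta_hat - theta_0) / se_hat
    p_value = 2 * std_normal.cdf(-abs(w))       # 2 * Phi(-|w|)
    z = std_normal.inv_cdf(1 - alpha / 2)       # z_{alpha/2}
    return w, p_value, abs(w) > z


# Hypothetical coin: 60 heads in 100 tosses, H0: p = 0.5.
p_hat = 60 / 100
se_hat = (p_hat * (1 - p_hat) / 100) ** 0.5
w, p_value, reject = wald_test(p_hat, 0.5, se_hat)
```

In this example $W \approx 2.04$, which exceeds $z_{0.025} \approx 1.96$, so the test rejects $H_0$ at the 5% level.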
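Likelihood ratio test (LRT)
- $T(X) = \dfrac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \dfrac{L_n(\hat{\theta}_n)}{L_n(\hat{\theta}_{n,0})}$
- $\lambda(X) = 2 \log T(X) \xrightarrow{D} \chi^2_{r-q}$ where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ and $Z_1, \dots, Z_k \overset{iid}{\sim} \mathcal{N}(0, 1)$
- $\text{p-value} = P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT
- mle: $\hat{p}_n = \left( \dfrac{X_1}{n}, \dots, \dfrac{X_k}{n} \right)$
- $T(X) = \dfrac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^{k} \left( \dfrac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
- $\lambda(X) = 2 \sum_{j=1}^{k} X_j \log\left( \dfrac{\hat{p}_j}{p_{0j}} \right) \xrightarrow{D} \chi^2_{k-1}$
- The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1, \alpha}$

Pearson Chi-square Test
- $T = \sum_{j=1}^{k} \dfrac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = n p_{0j}$ under $H_0$
- $T \xrightarrow{D} \chi^2_{k-1}$
- $\text{p-value} = P[\chi^2_{k-1} > T(x)]$
- Converges in distribution to $\chi^2_{k-1}$ faster than the LRT, hence preferable for small $n$

Independence testing
- $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
- mles unconstrained: $\hat{p}_{ij} = \dfrac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \dfrac{X_{i\cdot}}{n} \dfrac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\left( \dfrac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)$
- Pearson chi-square: $T = \sum_{i=1}^{I} \sum_{j=1}^{J} \dfrac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
- LRT and Pearson $\xrightarrow{D} \chi^2_{\nu}$, where $\nu = (I - 1)(J - 1)$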
14 Bayesian Inference
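Bayes' Theorem
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{f(x^n)} = \frac{f(x \mid \theta)\, f(\theta)}{\int f(x \mid \theta)\, f(\theta)\,d\theta} \propto L_n(\theta)\, f(\theta)$$

Definitions
- $X^n = (X_1, \dots, X_n)$
- $x^n = (x_1, \dots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x \mid \theta)\, f(\theta)\,d\theta$
- Kernel: part of a density that depends on $\theta$
- Posterior mean $\bar{\theta}_n = \int \theta\, f(\theta \mid x^n)\,d\theta = \dfrac{\int \theta\, L_n(\theta)\, f(\theta)\,d\theta}{\int L_n(\theta)\, f(\theta)\,d\theta}$

14.1 Credible Intervals
Posterior interval
$$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1 - \alpha$$

Equal-tail credible interval
$$\int_{-\infty}^{a} f(\theta \mid x^n)\,d\theta = \int_b^{\infty} f(\theta \mid x^n)\,d\theta = \alpha/2$$

Highest posterior density (HPD) region $R_n$
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$

If the posterior is unimodal, $R_n$ is an interval.

14.2 Function of parameters
Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$
$$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$$

Posterior density
$$h(\tau \mid x^n) = H'(\tau \mid x^n)$$

Bayesian delta method
$$\tau \mid X^n \approx \mathcal{N}\left( \varphi(\hat{\theta}),\ \widehat{\text{se}}\, \left| \varphi'(\hat{\theta}) \right| \right)$$

14.3 Priors
Choice
- Subjective Bayesianism.
- Objective Bayesianism.
- Robust Bayesianism.

Types
- Flat: $f(\theta) \propto \text{constant}$
- Proper: $\int_{-\infty}^{\infty} f(\theta)\,d\theta = 1$
- Improper: $\int_{-\infty}^{\infty} f(\theta)\,d\theta = \infty$
- Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$, and in the multiparameter case $f(\theta) \propto \sqrt{\det(I(\theta))}$
- Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family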
14.3.1 Conjugate Priors
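Discrete likelihood

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\text{Bern}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^{n} x_i,\ \ \beta + n - \sum_{i=1}^{n} x_i$ |
| $\text{Bin}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^{n} x_i,\ \ \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$ |
| $\text{NBin}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + rn,\ \ \beta + \sum_{i=1}^{n} x_i$ |
| $\text{Po}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + \sum_{i=1}^{n} x_i,\ \ \beta + n$ |
| $\text{Multinomial}(p)$ | $\text{Dir}(\alpha)$ | $\alpha + \sum_{i=1}^{n} x^{(i)}$ |
| $\text{Geo}(p)$ | $\text{Beta}(\alpha, \beta)$ | $\alpha + n,\ \ \beta + \sum_{i=1}^{n} x_i$ |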
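The first row of the table (Bernoulli likelihood, Beta prior) can be checked with a few lines of code. The function name and the data below are hypothetical.

```python
def beta_bernoulli_update(alpha, beta, xs):
    """Posterior Beta hyperparameters after observing Bernoulli data xs (0/1 values)."""
    successes = sum(xs)
    return alpha + successes, beta + len(xs) - successes


# Hypothetical data: flat Beta(1, 1) prior, 7 successes in 10 trials.
a_post, b_post = beta_bernoulli_update(1, 1, [1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
posterior_mean = a_post / (a_post + b_post)  # mean of Beta(a, b) is a / (a + b)
```

The posterior is $\text{Beta}(8, 4)$ with mean $2/3$, slightly shrunk from the sample proportion $0.7$ toward the prior mean $0.5$.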
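Continuous likelihood (subscript $c$ denotes constant)

| Likelihood | Conjugate prior | Posterior hyperparameters |
|---|---|---|
| $\text{Unif}(0, \theta)$ | $\text{Pareto}(x_m, k)$ | $\max\{x_{(n)}, x_m\},\ \ k + n$ |
| $\text{Exp}(\lambda)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + n,\ \ \beta + \sum_{i=1}^{n} x_i$ |
| $\mathcal{N}(\mu, \sigma_c^2)$ | $\mathcal{N}(\mu_0, \sigma_0^2)$ | $\left( \dfrac{\mu_0}{\sigma_0^2} + \dfrac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right) \Big/ \left( \dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma_c^2} \right),\ \ \left( \dfrac{1}{\sigma_0^2} + \dfrac{n}{\sigma_c^2} \right)^{-1}$ |
| $\mathcal{N}(\mu_c, \sigma^2)$ | Scaled Inverse Chi-square$(\nu, \sigma_0^2)$ | $\nu + n,\ \ \dfrac{\nu\sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu)^2}{\nu + n}$ |
| $\mathcal{N}(\mu, \sigma^2)$ | Normal-scaled Inverse Gamma$(\lambda, \nu, \alpha, \beta)$ | $\dfrac{\nu\lambda + n\bar{x}}{\nu + n},\ \ \nu + n,\ \ \alpha + \dfrac{n}{2},\ \ \beta + \dfrac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 + \dfrac{n\nu(\bar{x} - \lambda)^2}{2(n + \nu)}$ |
| $\text{MVN}(\mu, \Sigma_c)$ | $\text{MVN}(\mu_0, \Sigma_0)$ | $\left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1} \left( \Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x} \right),\ \ \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}$ |
| $\text{MVN}(\mu_c, \Sigma)$ | Inverse-Wishart$(\kappa, \Psi)$ | $n + \kappa,\ \ \Psi + \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T$ |
| $\text{Pareto}(x_{m_c}, k)$ | $\text{Gamma}(\alpha, \beta)$ | $\alpha + n,\ \ \beta + \sum_{i=1}^{n} \log\dfrac{x_i}{x_{m_c}}$ |
| $\text{Pareto}(x_m, k_c)$ | $\text{Pareto}(x_0, k_0)$ | $x_0,\ \ k_0 - kn$ where $k_0 > kn$ |
| $\text{Gamma}(\alpha_c, \beta)$ | $\text{Gamma}(\alpha_0, \beta_0)$ | $\alpha_0 + n\alpha_c,\ \ \beta_0 + \sum_{i=1}^{n} x_i$ |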
14.4 Bayesian Testing
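If $H_0 : \theta \in \Theta_0$:
$$\text{Prior probability } P[H_0] = \int_{\Theta_0} f(\theta)\,d\theta$$
$$\text{Posterior probability } P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\,d\theta$$

Let $H_0, \dots, H_{K-1}$ be $K$ hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$,
$$P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,P[H_k]}{\sum_{j=0}^{K-1} f(x^n \mid H_j)\,P[H_j]}$$

Marginal likelihood
$$f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i)\, f(\theta \mid H_i)\,d\theta$$

Posterior odds (of $H_i$ relative to $H_j$)
$$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes factor } BF_{ij}} \times \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$$

Bayes factor

| $\log_{10} BF_{10}$ | $BF_{10}$ | evidence |
|---|---|---|
| $0 - 0.5$ | $1 - 1.5$ | Weak |
| $0.5 - 1$ | $1.5 - 10$ | Moderate |
| $1 - 2$ | $10 - 100$ | Strong |
| $> 2$ | $> 100$ | Decisive |

$$p^* = \frac{\frac{p}{1 - p}\,BF_{10}}{1 + \frac{p}{1 - p}\,BF_{10}} \quad \text{where } p = P[H_1] \text{ and } p^* = P[H_1 \mid x^n]$$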
15 Exponential Family
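Scalar parameter
$$f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x)\, g(\theta) \exp\{\eta(\theta) T(x)\}$$

Vector parameter
$$f_X(x \mid \theta) = h(x) \exp\left\{ \sum_{i=1}^{s} \eta_i(\theta) T_i(x) - A(\theta) \right\} = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\} = h(x)\, g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$$

Natural form
$$f_X(x \mid \eta) = h(x) \exp\{\eta \cdot T(x) - A(\eta)\} = h(x)\, g(\eta) \exp\{\eta \cdot T(x)\} = h(x)\, g(\eta) \exp\{\eta^T T(x)\}$$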
16 Sampling Methods
16.1 The Bootstrap
Let $T_n = g(X_1, \dots, X_n)$ be a statistic.
1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following $B$ times to get $T^*_{n,1}, \dots, T^*_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$:
       i. Sample uniformly $X^*_1, \dots, X^*_n \sim \hat{F}_n$.
       ii. Compute $T^*_n = g(X^*_1, \dots, X^*_n)$.
   (b) Then
   $$v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^{B} \left( T^*_{n,b} - \frac{1}{B} \sum_{r=1}^{B} T^*_{n,r} \right)^2$$
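16.1.1 Bootstrap Confidence Intervals
Normal-based interval
$$T_n \pm z_{\alpha/2}\,\widehat{\text{se}}_{boot}$$

Pivotal interval
1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat{\theta}_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^*_{n,b} = \hat{\theta}^*_{n,b} - \hat{\theta}_n$. Approximate $H$ using bootstrap:
   $$\hat{H}(r) = \frac{1}{B} \sum_{b=1}^{B} I(R^*_{n,b} \le r)$$
5. $\theta^*_{\beta}$ = $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \dots, \hat{\theta}^*_{n,B})$
6. $r^*_{\beta}$ = $\beta$ sample quantile of $(R^*_{n,1}, \dots, R^*_{n,B})$, i.e., $r^*_{\beta} = \theta^*_{\beta} - \hat{\theta}_n$
7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
   $$\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left( 1 - \frac{\alpha}{2} \right) = \hat{\theta}_n - r^*_{1 - \alpha/2} = 2\hat{\theta}_n - \theta^*_{1 - \alpha/2}$$
   $$\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left( \frac{\alpha}{2} \right) = \hat{\theta}_n - r^*_{\alpha/2} = 2\hat{\theta}_n - \theta^*_{\alpha/2}$$

Percentile interval
$$C_n = \left( \theta^*_{\alpha/2},\ \theta^*_{1 - \alpha/2} \right)$$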
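The bootstrap variance estimate and the three intervals can be combined in one routine. The sketch below is a minimal illustration; the function name `bootstrap`, the simple index-based sample quantile, the seed, and the data are assumptions rather than the cookbook's prescription.

```python
import random
import statistics


def bootstrap(sample, stat, B=2000, alpha=0.05, seed=0):
    """Bootstrap variance plus normal, pivotal, and percentile intervals for stat."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = stat(sample)
    # Resample n points with replacement from the empirical distribution, B times.
    reps = sorted(stat(rng.choices(sample, k=n)) for _ in range(B))
    v_boot = statistics.pvariance(reps)  # 1/B normalization, as in step 2(b)
    # Simple sample quantiles theta*_{alpha/2} and theta*_{1 - alpha/2}.
    lo = reps[int((alpha / 2) * (B - 1))]
    hi = reps[int((1 - alpha / 2) * (B - 1))]
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    se = v_boot ** 0.5
    return {
        "v_boot": v_boot,
        "normal": (theta_hat - z * se, theta_hat + z * se),
        "pivotal": (2 * theta_hat - hi, 2 * theta_hat - lo),
        "percentile": (lo, hi),
    }


data = [2.3, 1.9, 3.1, 2.8, 2.2, 3.6, 1.5, 2.7, 3.0, 2.4]  # hypothetical sample
result = bootstrap(data, statistics.median)
```

The median is used here because it has no simple closed-form standard error, which is the typical situation where the bootstrap is helpful.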
16.2 Rejection Sampling
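Setup
- We can easily sample from $g(\theta)$
- We want to sample from $h(\theta)$, but it is difficult
- We know $h(\theta)$ up to a proportional constant: $h(\theta) = \dfrac{k(\theta)}{\int k(\theta)\,d\theta}$
- Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta) \quad \forall\theta$

Algorithm
1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \text{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \dfrac{k(\theta^{cand})}{M g(\theta^{cand})}$
4. Repeat until $B$ values of $\theta^{cand}$ have been accepted

Example
- We can easily sample from the prior $g(\theta) = f(\theta)$
- Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta)\, f(\theta)$
- Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = L_n(\hat{\theta}_n) \equiv M$
- Algorithm
  1. Draw $\theta^{cand} \sim f(\theta)$
  2. Generate $u \sim \text{Unif}(0, 1)$
  3. Accept $\theta^{cand}$ if $u \le \dfrac{L_n(\theta^{cand})}{L_n(\hat{\theta}_n)}$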
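The example above (prior as proposal, likelihood at the mle as envelope) can be sketched for a binomial success probability with a $\text{Unif}(0, 1)$ prior. The function name, the data (7 successes in 10 trials), and the seed are illustrative assumptions.

```python
import random


def rejection_sample_posterior(successes, n, B=2000, seed=0):
    """Draw B values from the posterior of p under a Unif(0, 1) prior.

    The proposal g is the prior; a candidate is accepted with probability
    L_n(p_cand) / L_n(p_hat), where p_hat = successes / n is the mle.
    """
    rng = random.Random(seed)

    def likelihood(p):  # binomial likelihood up to a constant factor
        return p**successes * (1 - p) ** (n - successes)

    M = likelihood(successes / n)  # the likelihood is maximal at the mle
    draws = []
    while len(draws) < B:
        p_cand = rng.random()            # 1. draw from the prior
        u = rng.random()                 # 2. u ~ Unif(0, 1)
        if u <= likelihood(p_cand) / M:  # 3. accept or reject
            draws.append(p_cand)
    return draws


draws = rejection_sample_posterior(successes=7, n=10)
post_mean = sum(draws) / len(draws)
```

The exact posterior here is $\text{Beta}(8, 4)$ with mean $2/3$, so `post_mean` should land close to that value. The acceptance rate drops as the posterior concentrates, which is the main practical limitation of using the prior as the proposal.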
16.3 Importance Sampling
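Sample from an importance function $g$ rather than the target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior $\theta_1, \dots, \theta_B \overset{iid}{\sim} f(\theta)$
2. $w_i = \dfrac{L_n(\theta_i)}{\sum_{i=1}^{B} L_n(\theta_i)} \quad \forall i = 1, \dots, B$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i)\, w_i$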
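The three steps map onto a short function. As before, the binomial likelihood with 7 successes in 10 trials, the uniform prior, the number of draws, and the function name are assumptions for illustration.

```python
import random


def posterior_expectation(q, likelihood, prior_draws):
    """Approximate E[q(theta) | x^n] using prior draws as the importance sample."""
    raw = [likelihood(t) for t in prior_draws]
    total = sum(raw)
    weights = [r / total for r in raw]                          # step 2: normalized w_i
    return sum(q(t) * w for t, w in zip(prior_draws, weights))  # step 3


rng = random.Random(0)
thetas = [rng.random() for _ in range(20_000)]   # step 1: draws from a Unif(0, 1) prior
lik = lambda p: p**7 * (1 - p) ** 3              # hypothetical data: 7 successes in 10
est = posterior_expectation(lambda p: p, lik, thetas)
```

Unlike rejection sampling, every draw is used, but draws with negligible weight contribute little, so the effective sample size can be much smaller than $B$.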
17 Decision Theory
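Definitions
- Unknown quantity affecting our decision: $\theta \in \Theta$
- Decision rule: synonymous for an estimator $\hat{\theta}$
- Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
- Loss function $L$: consequences of taking action $a$ when the true state is $\theta$, or discrepancy between $\theta$ and $\hat{\theta}$, $L : \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions
- Squared error loss: $L(\theta, a) = (\theta - a)^2$
- Linear loss: $L(\theta, a) = K_1(\theta - a)$ if $a - \theta < 0$, and $K_2(a - \theta)$ if $a - \theta \ge 0$
- Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
- $L_p$ loss: $L(\theta, a) = |\theta - a|^p$
- Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, and $1$ if $a \ne \theta$

17.1 Risk
Posterior risk
$$r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\,d\theta = E_{\theta|X}\left[ L(\theta, \hat{\theta}(x)) \right]$$

(Frequentist) risk
$$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\, f(x \mid \theta)\,dx = E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right]$$

Bayes risk
$$r(f, \hat{\theta}) = \iint L(\theta, \hat{\theta}(x))\, f(x, \theta)\,dx\,d\theta = E_{\theta,X}\left[ L(\theta, \hat{\theta}(X)) \right]$$
$$r(f, \hat{\theta}) = E_{\theta}\left[ E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_{\theta}\left[ R(\theta, \hat{\theta}) \right]$$
$$r(f, \hat{\theta}) = E_X\left[ E_{\theta|X}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_X\left[ r(\hat{\theta} \mid X) \right]$$

17.2 Admissibility
- $\hat{\theta}'$ dominates $\hat{\theta}$ if
  $$\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta}) \quad \text{and} \quad \exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$$
- $\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule
Bayes rule (or Bayes estimator)
- $r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
- $\hat{\theta}(x) = \arg\inf r(\hat{\theta} \mid x)\ \forall x \implies r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x)\, f(x)\,dx$

Theorems
- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk
$$\bar{R}(\hat{\theta}) = \sup_{\theta} R(\theta, \hat{\theta}), \qquad \bar{R}(a) = \sup_{\theta} R(\theta, a)$$

Minimax rule
$$\sup_{\theta} R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}} \sup_{\theta} R(\theta, \tilde{\theta})$$
$$\hat{\theta} = \text{Bayes rule} \wedge \exists c : R(\theta, \hat{\theta}) = c \implies \hat{\theta} \text{ is minimax}$$

Least favorable prior
$$\hat{\theta}^f = \text{Bayes rule} \wedge R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall\theta \implies \hat{\theta}^f \text{ is minimax}$$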
18 Linear Regression
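Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression
Model
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad E[\epsilon_i \mid X_i] = 0, \quad V[\epsilon_i \mid X_i] = \sigma^2$$

Fitted line
$$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$$

Predicted (fitted) values
$$\hat{Y}_i = \hat{r}(X_i)$$

Residuals
$$\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right)$$

Residual sums of squares (rss)
$$\text{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{\epsilon}_i^2$$

Least square estimates
$$\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T : \min_{\hat{\beta}_0, \hat{\beta}_1} \text{rss}$$
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}$$
$$E[\hat{\beta} \mid X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$
$$V[\hat{\beta} \mid X^n] = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} n^{-1} \sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}$$
$$\widehat{\text{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}, \qquad \widehat{\text{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$$
where $s_X^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \hat{\epsilon}_i^2$ (unbiased estimate).

Further properties:
- Consistency: $\hat{\beta}_0 \xrightarrow{P} \beta_0$ and $\hat{\beta}_1 \xrightarrow{P} \beta_1$
- Asymptotic normality: $\dfrac{\hat{\beta}_0 - \beta_0}{\widehat{\text{se}}(\hat{\beta}_0)} \xrightarrow{D} \mathcal{N}(0, 1)$ and $\dfrac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}(\hat{\beta}_1)} \xrightarrow{D} \mathcal{N}(0, 1)$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat{\beta}_1)$
- Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1 / \widehat{\text{se}}(\hat{\beta}_1)$.

$R^2$
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat{\epsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\text{rss}}{\text{tss}}$$

Likelihood
$$L = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \times \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = L_1 \times L_2$$
$$L_1 = \prod_{i=1}^{n} f_X(X_i)$$
$$L_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}$$

Under the assumption of Normality, the least squares estimator is also the mle, with
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2$$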
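The closed-form estimates for simple linear regression are easy to compute directly. The sketch below follows the formulas for $\hat{\beta}_0$, $\hat{\beta}_1$, their standard errors, and $R^2$; the function name `simple_ols` and the five data points are hypothetical.

```python
import math


def simple_ols(xs, ys):
    """Least squares fit of y = b0 + b1 * x with standard errors and R^2."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(rss / (n - 2))   # square root of the unbiased variance estimate
    s_x = math.sqrt(sxx / n)
    se_b1 = sigma_hat / (s_x * math.sqrt(n))
    se_b0 = se_b1 * math.sqrt(sum(x * x for x in xs) / n)
    tss = sum((y - y_bar) ** 2 for y in ys)
    return {"b0": b0, "b1": b1, "se_b0": se_b0, "se_b1": se_b1, "r2": 1 - rss / tss}


# Hypothetical data roughly following y = 1 + 2x.
fit = simple_ols([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.0])
```

For these points the fit gives $\hat{\beta}_1 = 1.97$ and $\hat{\beta}_0 = 1.06$, and a Wald statistic for $H_0 : \beta_1 = 0$ follows as `fit["b1"] / fit["se_b1"]`.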
18.2 Prediction
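Observe the value $X = x^*$ of the covariate and want to predict the outcome $Y^*$.
$$\hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$$
$$V[\hat{Y}^*] = V[\hat{\beta}_0] + x^{*2}\, V[\hat{\beta}_1] + 2x^*\, \text{Cov}[\hat{\beta}_0, \hat{\beta}_1]$$

Prediction interval
$$\hat{\xi}_n^2 = \hat{\sigma}^2 \left( \frac{\sum_{i=1}^{n} (X_i - X^*)^2}{n \sum_i (X_i - \bar{X})^2} + 1 \right)$$
$$\hat{Y}^* \pm z_{\alpha/2}\,\hat{\xi}_n$$

18.3 Multiple Regression
$$Y = X\beta + \epsilon$$
where
$$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \qquad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

Likelihood
$$L(\mu, \Sigma) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2}\,\text{rss} \right\}$$
$$\text{rss} = (y - X\beta)^T (y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N} (Y_i - x_i^T\beta)^2$$

If the $(k \times k)$ matrix $X^T X$ is invertible,
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
$$V[\hat{\beta} \mid X^n] = \sigma^2 (X^T X)^{-1}$$
$$\hat{\beta} \approx \mathcal{N}\left( \beta, \sigma^2 (X^T X)^{-1} \right)$$

Estimate regression function
$$\hat{r}(x) = \sum_{j=1}^{k} \hat{\beta}_j x_j$$

Unbiased estimate for $\sigma^2$
$$\hat{\sigma}^2 = \frac{1}{n - k} \sum_{i=1}^{n} \hat{\epsilon}_i^2, \qquad \hat{\epsilon} = X\hat{\beta} - Y$$

mle
$$\hat{\mu} = \bar{X}, \qquad \hat{\sigma}^2_{mle} = \frac{n - k}{n}\,\hat{\sigma}^2$$

$1 - \alpha$ Confidence interval
$$\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\text{se}}(\hat{\beta}_j)$$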
18.4 Model Selection
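Consider predicting a new observation $Y^*$ for covariates $X^*$ and let $S \subset J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing
$$H_0 : \beta_j = 0 \quad \text{vs.} \quad H_1 : \beta_j \ne 0 \qquad \forall j \in J$$

Mean squared prediction error (mspe)
$$\text{mspe} = E\left[ (\hat{Y}(S) - Y^*)^2 \right]$$

Prediction risk
$$R(S) = \sum_{i=1}^{n} \text{mspe}_i = \sum_{i=1}^{n} E\left[ (\hat{Y}_i(S) - Y_i^*)^2 \right]$$

Training error
$$\hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2$$

$R^2$
$$R^2(S) = 1 - \frac{\text{rss}(S)}{\text{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\text{tss}} = 1 - \frac{\sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

The training error is a downward-biased estimate of the prediction risk.
$$E\left[ \hat{R}_{tr}(S) \right] < R(S)$$
$$\text{bias}(\hat{R}_{tr}(S)) = E\left[ \hat{R}_{tr}(S) \right] - R(S) = -2 \sum_{i=1}^{n} \text{Cov}\left[ \hat{Y}_i, Y_i \right]$$

Adjusted $R^2$
$$R^2(S) = 1 - \frac{n - 1}{n - k}\,\frac{\text{rss}}{\text{tss}}$$

Mallow's $C_p$ statistic
$$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$$

Akaike Information Criterion (AIC)
$$AIC(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - k$$

Bayesian Information Criterion (BIC)
$$BIC(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - \frac{k}{2} \log n$$

Validation and training
$$\hat{R}_V(S) = \sum_{i=1}^{m} (\hat{Y}_i^*(S) - Y_i^*)^2, \qquad m = |\{\text{validation data}\}|, \text{ often } \frac{n}{4} \text{ or } \frac{n}{2}$$

Leave-one-out cross-validation
$$\hat{R}_{CV}(S) = \sum_{i=1}^{n} (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2$$
$$U(S) = X_S (X_S^T X_S)^{-1} X_S^T \quad \text{(hat matrix)}$$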
19 Non-parametric Function Estimation
19.1 Density Estimation
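Estimate $f(x)$, where $\mathbb{P}[X \in A] = \int_A f(x)\,dx$.
Integrated square error (ise)
$$L(f, \hat f_n) = \int \left( f(x) - \hat f_n(x) \right)^2 dx = J(h) + \int f^2(x)\,dx$$
Frequentist risk
$$R(f, \hat f_n) = \mathbb{E}\left[L(f, \hat f_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$$
$$b(x) = \mathbb{E}\left[\hat f_n(x)\right] - f(x)$$
$$v(x) = \mathbb{V}\left[\hat f_n(x)\right]$$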
19.1.1 Histograms
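Definitions
- Number of bins $m$
- Binwidth $h = \frac{1}{m}$
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat p_j = \nu_j / n$ and $p_j = \int_{B_j} f(u)\,du$
Histogram estimator
$$\hat f_n(x) = \sum_{j=1}^{m} \frac{\hat p_j}{h} I(x \in B_j)$$
$$\mathbb{E}\left[\hat f_n(x)\right] = \frac{p_j}{h}$$
$$\mathbb{V}\left[\hat f_n(x)\right] = \frac{p_j (1 - p_j)}{n h^2}$$
$$R(\hat f_n, f) \approx \frac{h^2}{12} \int (f'(u))^2\,du + \frac{1}{nh}$$
$$h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2\,du} \right)^{1/3}$$
$$R^*(\hat f_n, f) \approx \frac{C}{n^{2/3}} \qquad C = \left( \frac{3}{4} \right)^{2/3} \left( \int (f'(u))^2\,du \right)^{1/3}$$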
Cross-validation estimate of $\mathbb{E}[J(h)]$
$$\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{(-i)}(X_i) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^{m} \hat p_j^2$$
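The closed-form cross-validation score for a histogram can be compared against its definition. The sketch below assumes data on $[0, 1]$ split into $m$ equal bins, so that $h = 1/m$; the function names and the sample are hypothetical.

```python
def histogram_cv_score(data, m):
    """Closed-form J_CV(h) for a histogram with m equal bins on [0, 1]."""
    n, h = len(data), 1.0 / m
    counts = [0] * m
    for x in data:
        counts[min(int(x * m), m - 1)] += 1  # x == 1.0 goes into the last bin
    sum_p2 = sum((c / n) ** 2 for c in counts)
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum_p2


def histogram_cv_score_direct(data, m):
    """Same score from the definition, using leave-one-out densities."""
    n, h = len(data), 1.0 / m
    bins = [min(int(x * m), m - 1) for x in data]
    counts = [bins.count(j) for j in range(m)]
    # Integral of the squared histogram: height^2 * width, summed over bins.
    integral = sum((c / (n * h)) ** 2 * h for c in counts)
    # Leave-one-out density at X_i: remove X_i from its own bin.
    loo = sum((counts[b] - 1) / ((n - 1) * h) for b in bins)
    return integral - 2.0 / n * loo


# Toy sample (assumed for illustration).
data = [0.05, 0.12, 0.18, 0.22, 0.31, 0.35, 0.41, 0.44,
        0.52, 0.58, 0.63, 0.77, 0.81, 0.93]
best_m = min(range(1, 11), key=lambda m: histogram_cv_score(data, m))
```

In practice one would evaluate the closed form over a grid of bin counts and keep the minimizer, as `best_m` does here.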
19.1.2 Kernel Density Estimator (KDE)
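Kernel $K$
- $K(x) \ge 0$
- $\int K(x)\,dx = 1$
- $\int x K(x)\,dx = 0$
- $\int x^2 K(x)\,dx \equiv \sigma_K^2 > 0$
KDE
$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left( \frac{x - X_i}{h} \right)$$
$$R(f, \hat f_n) \approx \frac{1}{4} (h \sigma_K)^4 \int (f''(x))^2\,dx + \frac{1}{nh} \int K^2(x)\,dx$$
$$h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}} \qquad c_1 = \sigma_K^2,\; c_2 = \int K^2(x)\,dx,\; c_3 = \int (f''(x))^2\,dx$$
$$R^*(f, \hat f_n) = \frac{c_4}{n^{4/5}} \qquad c_4 = \underbrace{\frac{5}{4} (\sigma_K^2)^{2/5} \left( \int K^2(x)\,dx \right)^{4/5}}_{C(K)} \left( \int (f'')^2\,dx \right)^{1/5}$$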
Epanechnikov Kernel
$$K(x) = \begin{cases} \frac{3}{4\sqrt{5}} \left( 1 - \frac{x^2}{5} \right) & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$
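A small sketch of a kernel density estimate with the Epanechnikov kernel, in plain Python. The kernel is taken in its unit-variance form, $K(x) = \frac{3}{4\sqrt{5}}(1 - x^2/5)$ for $|x| < \sqrt{5}$; the helper `integrate`, the data, and the bandwidth are assumptions chosen only to sanity-check the kernel conditions.

```python
import math

SQRT5 = math.sqrt(5.0)


def epanechnikov(x):
    """Epanechnikov kernel scaled to unit variance (support |x| < sqrt(5))."""
    if abs(x) >= SQRT5:
        return 0.0
    return 3.0 / (4.0 * SQRT5) * (1.0 - x * x / 5.0)


def kde(x, data, h, kernel=epanechnikov):
    """Kernel density estimate at x with bandwidth h."""
    n = len(data)
    return sum(kernel((x - xi) / h) for xi in data) / (n * h)


def integrate(f, a, b, steps=20000):
    """Midpoint rule, accurate enough for a sanity check."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Toy sample and bandwidth (assumed for illustration).
data = [-1.3, -0.4, 0.0, 0.2, 0.9, 1.7]
h = 0.5
density_at_zero = kde(0.0, data, h)
```

Numerically, the kernel should integrate to one with zero mean and unit second moment, and the resulting estimate should integrate to one as well.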
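Cross-validation estimate of $\mathbb{E}[J(h)]$
$$\hat J_{CV}(h) = \int \hat f_n^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{(-i)}(X_i) \approx \frac{1}{hn^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K^*\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)$$
$$K^*(x) = K^{(2)}(x) - 2K(x) \qquad K^{(2)}(x) = \int K(x - y) K(y)\,dy$$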
19.2 Non-parametric Regression
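Estimate $r(x)$ where $r(x) = \mathbb{E}[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
$$Y_i = r(x_i) + \epsilon_i \qquad \mathbb{E}[\epsilon_i] = 0 \qquad \mathbb{V}[\epsilon_i] = \sigma^2$$
k-nearest Neighbor Estimator
$$\hat r(x) = \frac{1}{k} \sum_{i : x_i \in N_k(x)} Y_i \qquad \text{where } N_k(x) = \{k \text{ values of } x_1, \ldots, x_n \text{ closest to } x\}$$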
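Nadaraya-Watson Kernel Estimator
$$\hat r(x) = \sum_{i=1}^{n} w_i(x) Y_i$$
$$w_i(x) = \frac{K\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^{n} K\left( \frac{x - x_j}{h} \right)} \in [0, 1]$$
$$R(\hat r_n, r) \approx \frac{h^4}{4} \left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \int \frac{\sigma^2 \int K^2(x)\,dx}{n h f(x)}\,dx$$
$$h^* \approx \frac{c_1}{n^{1/5}} \qquad R^*(\hat r_n, r) \approx \frac{c_2}{n^{4/5}}$$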
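Cross-validation estimate of $\mathbb{E}[J(h)]$
$$\hat J_{CV}(h) = \sum_{i=1}^{n} (Y_i - \hat r_{(-i)}(x_i))^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat r(x_i))^2}{\left( 1 - \frac{K(0)}{\sum_{j=1}^{n} K\left( \frac{x_i - x_j}{h} \right)} \right)^2}$$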
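The Nadaraya-Watson estimator and its leave-one-out shortcut can be sketched as follows. The Gaussian kernel, the function names, and the data are assumptions for illustration; any kernel satisfying the conditions in the KDE section could be substituted.

```python
import math


def gaussian(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw(x, xs, ys, h, skip=None):
    """Nadaraya-Watson estimate at x; `skip` optionally leaves one index out."""
    num = den = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        k = gaussian((x - xj) / h)
        num += k * yj
        den += k
    return num / den


def cv_brute(xs, ys, h):
    """Sum of squared leave-one-out prediction errors, refitting n times."""
    return sum((yi - nw(xi, xs, ys, h, skip=i)) ** 2
               for i, (xi, yi) in enumerate(zip(xs, ys)))


def cv_shortcut(xs, ys, h):
    """Same score from the full fit and the self-weights w_i(x_i)."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        w_ii = gaussian(0.0) / sum(gaussian((xi - xj) / h) for xj in xs)
        total += ((yi - nw(xi, xs, ys, h)) / (1.0 - w_ii)) ** 2
    return total


# Toy data and bandwidth (assumed for illustration).
xs = [0.0, 0.4, 1.1, 1.5, 2.3, 3.0]
ys = [0.1, 0.5, 0.8, 1.0, 0.7, 0.2]
h = 0.7
```

The two scores coincide because removing observation $i$ only rescales the remaining weights by $1 / (1 - w_i(x_i))$.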
19.3 Smoothing Using Orthogonal Functions
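Approximation
$$r(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x) \approx \sum_{j=1}^{J} \beta_j \phi_j(x)$$
Multivariate regression
$$Y = \Phi\beta + \eta$$
$$\text{where } \eta_i = \epsilon_i \text{ and } \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}$$
Least squares estimator
$$\hat\beta = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \frac{1}{n} \Phi^T Y \quad \text{(for equally spaced observations only)}$$
Cross-validation estimate of $\mathbb{E}[J(h)]$
$$\hat R_{CV}(J) = \sum_{i=1}^{n} \left( Y_i - \sum_{j=1}^{J} \phi_j(x_i) \hat\beta_{j,(-i)} \right)^2$$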
20 Stochastic Processes
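Stochastic Process
$$\{X_t : t \in T\} \qquad T = \begin{cases} \{0, \pm 1, \ldots\} = \mathbb{Z} & \text{discrete} \\ [0, \infty) & \text{continuous} \end{cases}$$
- Notations: $X_t$, $X(t)$
- State space $\mathcal{X}$
- Index set $T$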
20.1 Markov Chains
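Markov chain
$$\mathbb{P}[X_n = x \mid X_0, \ldots, X_{n-1}] = \mathbb{P}[X_n = x \mid X_{n-1}] \qquad \forall n \in T, x \in \mathcal{X}$$
Transition probabilities
$$p_{ij} \equiv \mathbb{P}[X_{n+1} = j \mid X_n = i]$$
$$p_{ij}(n) \equiv \mathbb{P}[X_{m+n} = j \mid X_m = i] \qquad \text{n-step}$$
Transition matrix $P$ (n-step: $P_n$)
- $(i, j)$ element is $p_{ij}$
- $p_{ij} \ge 0$
- $\sum_j p_{ij} = 1$
Chapman-Kolmogorov
$$p_{ij}(m+n) = \sum_k p_{ik}(m)\, p_{kj}(n)$$
$$P_{m+n} = P_m P_n$$
$$P_n = P \times \cdots \times P = P^n$$
Marginal probability
$$\mu_n = (\mu_n(1), \ldots, \mu_n(N)) \quad \text{where} \quad \mu_n(i) = \mathbb{P}[X_n = i]$$
$$\mu_0 \triangleq \text{initial distribution}$$
$$\mu_n = \mu_0 P^n$$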
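The Chapman-Kolmogorov relation and the marginal distribution $\mu_n = \mu_0 P^n$ can be illustrated with a two-state chain. The transition matrix, initial distribution, and helper names below are assumptions made up for this sketch.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matpow(P, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, P)
    return result


def marginal(mu0, P, n):
    """Marginal distribution mu_n = mu_0 P^n as a row vector."""
    return matmul([mu0], matpow(P, n))[0]


# Hypothetical two-state chain; each row sums to one.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]
mu2 = marginal(mu0, P, 2)
```

Starting in state 1 with certainty, two steps give $\mu_2 = (0.86, 0.14)$ for this matrix.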
20.2 Poisson Processes
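Poisson process
- $\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
- $X_0 = 0$
- Independent increments: $\forall t_0 < \cdots < t_n : X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}$
- Intensity function $\lambda(t)$
  - $\mathbb{P}[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)$
  - $\mathbb{P}[X_{t+h} - X_t = 2] = o(h)$
- $X_{s+t} - X_s \sim \mathrm{Po}(m(s+t) - m(s))$ where $m(t) = \int_0^t \lambda(s)\,ds$
Homogeneous Poisson process
$$\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t) \qquad \lambda > 0$$
Waiting times
$$W_t := \text{time at which } X_t \text{ occurs}$$
$$W_t \sim \mathrm{Gamma}\left( t, \frac{1}{\lambda} \right)$$
Interarrival times
$$S_t = W_{t+1} - W_t$$
$$S_t \sim \mathrm{Exp}\left( \frac{1}{\lambda} \right)$$
[Figure: timeline in $t$ marking the waiting times $W_{t-1}$ and $W_t$ and the interarrival time $S_t$.]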
21 Time Series
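Mean function
$$\mu_{xt} = \mathbb{E}[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$$
Autocovariance function
$$\gamma_x(s, t) = \mathbb{E}[(x_s - \mu_s)(x_t - \mu_t)] = \mathbb{E}[x_s x_t] - \mu_s \mu_t$$
$$\gamma_x(t, t) = \mathbb{E}\left[(x_t - \mu_t)^2\right] = \mathbb{V}[x_t]$$
Autocorrelation function (ACF)
$$\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{\mathbb{V}[x_s]\, \mathbb{V}[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s) \gamma(t, t)}}$$
Cross-covariance function (CCV)
$$\gamma_{xy}(s, t) = \mathbb{E}[(x_s - \mu_{xs})(y_t - \mu_{yt})]$$
Cross-correlation function (CCF)
$$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s) \gamma_y(t, t)}}$$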
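Backshift operator
$$B^k(x_t) = x_{t-k}$$
Difference operator
$$\nabla^d = (1 - B)^d$$
White noise
- $w_t \sim \mathrm{wn}(0, \sigma_w^2)$
- Gaussian: $w_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$
- $\mathbb{E}[w_t] = 0 \quad \forall t \in T$
- $\mathbb{V}[w_t] = \sigma_w^2 \quad \forall t \in T$
- $\gamma_w(s, t) = 0 \quad s \neq t \wedge s, t \in T$
Random walk
- Drift $\delta$
- $x_t = \delta t + \sum_{j=1}^{t} w_j$
- $\mathbb{E}[x_t] = \delta t$
Symmetric moving average
$$m_t = \sum_{j=-k}^{k} a_j x_{t-j} \quad \text{where } a_j = a_{-j} \ge 0 \text{ and } \sum_{j=-k}^{k} a_j = 1$$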
21.1 Stationary Time Series
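Strictly stationary
$$\mathbb{P}[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = \mathbb{P}[x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k] \qquad \forall k \in \mathbb{N}, t_k, c_k, h \in \mathbb{Z}$$
Weakly stationary
- $\mathbb{E}[x_t^2] < \infty \quad \forall t \in \mathbb{Z}$
- $\mathbb{E}[x_t] = m \quad \forall t \in \mathbb{Z}$
- $\gamma_x(s, t) = \gamma_x(s + r, t + r) \quad \forall r, s, t \in \mathbb{Z}$
Autocovariance function
- $\gamma(h) = \mathbb{E}[(x_{t+h} - \mu)(x_t - \mu)] \quad \forall h \in \mathbb{Z}$
- $\gamma(0) = \mathbb{E}[(x_t - \mu)^2]$
- $\gamma(0) \ge 0$
- $\gamma(0) \ge |\gamma(h)|$
- $\gamma(h) = \gamma(-h)$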
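Autocorrelation function (ACF)
$$\rho_x(h) = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{\mathbb{V}[x_{t+h}]\, \mathbb{V}[x_t]}} = \frac{\gamma(t+h, t)}{\sqrt{\gamma(t+h, t+h) \gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$$
Jointly stationary time series
$$\gamma_{xy}(h) = \mathbb{E}[(x_{t+h} - \mu_x)(y_t - \mu_y)]$$
$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0) \gamma_y(0)}}$$
Linear process
$$x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j} \quad \text{where} \quad \sum_{j=-\infty}^{\infty} |\psi_j| < \infty$$
$$\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j$$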
21.2 Estimation of Correlation
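Sample mean
$$\bar x = \frac{1}{n} \sum_{t=1}^{n} x_t$$
Variance of the sample mean
$$\mathbb{V}[\bar x] = \frac{1}{n} \sum_{h=-n}^{n} \left( 1 - \frac{|h|}{n} \right) \gamma_x(h)$$
Sample autocovariance function
$$\hat\gamma(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(x_t - \bar x)$$
Sample autocorrelation function
$$\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}$$
Sample cross-covariance function
$$\hat\gamma_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(y_t - \bar y)$$
Sample cross-correlation function
$$\hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0) \hat\gamma_y(0)}}$$
Properties
- $\sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
- $\sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise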
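The sample autocovariance and autocorrelation functions translate almost directly into code. This is a small sketch in plain Python; the function names and the five-point series are assumptions for illustration.

```python
def sample_autocov(x, h):
    """gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar)."""
    n = len(x)
    h = abs(h)  # gamma_hat(-h) = gamma_hat(h)
    xbar = sum(x) / n
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n


def sample_acf(x, max_lag):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0) for h = 0, ..., max_lag."""
    g0 = sample_autocov(x, 0)
    return [sample_autocov(x, h) / g0 for h in range(max_lag + 1)]


# Toy series (assumed for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
acf = sample_acf(x, 2)
```

Note the divisor $n$ rather than $n - h$ at every lag, matching the definition above. For a white noise series, values of $\hat\rho(h)$ within roughly $\pm 2/\sqrt{n}$ are consistent with zero correlation.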
21.3 Non-Stationary Time Series
Classical decomposition model
$$x_t = \mu_t + s_t + w_t$$
- $\mu_t$ = trend
- $s_t$ = seasonal component
- $w_t$ = random noise term
21.3.1 Detrending
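Least squares
1. Choose trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
2. Minimize rss to obtain trend estimate $\hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2$
3. Residuals $\triangleq$ noise $w_t$
Moving average
- The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
$$v_t = \frac{1}{2k+1} \sum_{i=-k}^{k} x_{t-i}$$
- If $\frac{1}{2k+1} \sum_{i=-k}^{k} w_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion
Differencing
- $\mu_t = \beta_0 + \beta_1 t \implies \nabla \mu_t = \beta_1$, so differencing $x_t$ removes a linear trend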
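Two of the detrending tools above, differencing and the equal-weight moving average, can be sketched in a few lines. The function names and the noise-free linear trend used as input are assumptions for illustration.

```python
def difference(x, d=1):
    """Apply (1 - B)^d: each pass replaces x_t by x_t - x_{t-1}."""
    for _ in range(d):
        x = [x[t] - x[t - 1] for t in range(1, len(x))]
    return x


def moving_average(x, k):
    """Symmetric moving average with equal weights 1/(2k+1), interior points only."""
    width = 2 * k + 1
    return [sum(x[t - k:t + k + 1]) / width for t in range(k, len(x) - k)]


# Noise-free linear trend mu_t = 2 + 3t (assumed for illustration).
trend = [2.0 + 3.0 * t for t in range(10)]
first_diff = difference(trend)        # constant slope beta_1 = 3
smoothed = moving_average(trend, 2)   # equals trend[2:-2] for a linear trend
```

Each pass of `difference` shortens the series by one point, and `moving_average` drops $k$ points at each end because the window is not defined there.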
21.4 ARIMA models
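Autoregressive polynomial
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C} \wedge \phi_p \neq 0$$
Autoregressive operator
$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$
Autoregressive model order $p$, AR($p$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$$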
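AR(1)
$$x_t = \phi^k (x_{t-k}) + \sum_{j=0}^{k-1} \phi^j (w_{t-j}) \overset{k \to \infty,\, |\phi| < 1}{=} \underbrace{\sum_{j=0}^{\infty} \phi^j (w_{t-j})}_{\text{linear process}}$$
$$\mathbb{E}[x_t] = \sum_{j=0}^{\infty} \phi^j (\mathbb{E}[w_{t-j}]) = 0$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2 \phi^h}{1 - \phi^2}$$
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$$
$$\rho(h) = \phi \rho(h - 1) \qquad h = 1, 2, \ldots$$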
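Moving average polynomial
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C} \wedge \theta_q \neq 0$$
Moving average operator
$$\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$$
MA($q$) (moving average model order $q$)
$$x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$$
$$\mathbb{E}[x_t] = \sum_{j=0}^{q} \theta_j \mathbb{E}[w_{t-j}] = 0 \qquad (\theta_0 = 1)$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h} & 0 \le h \le q \\ 0 & h > q \end{cases}$$
MA(1)
$$x_t = w_t + \theta w_{t-1}$$
$$\gamma(h) = \begin{cases} (1 + \theta^2) \sigma_w^2 & h = 0 \\ \theta \sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases}$$
$$\rho(h) = \begin{cases} \frac{\theta}{1 + \theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$$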
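The MA($q$) autocovariance formula can be evaluated directly. The sketch below takes $\theta_0 = 1$ and passes the remaining coefficients as a list; the function names are assumptions for illustration.

```python
def ma_autocov(theta, sigma2_w, h):
    """gamma(h) for x_t = w_t + theta[0] w_{t-1} + ... + theta[q-1] w_{t-q}."""
    coeffs = [1.0] + list(theta)  # theta_0 = 1
    q = len(theta)
    h = abs(h)                    # gamma(h) = gamma(-h)
    if h > q:
        return 0.0                # the autocovariance cuts off after lag q
    return sigma2_w * sum(coeffs[j] * coeffs[j + h] for j in range(q - h + 1))


def ma_acf(theta, h):
    """rho(h) = gamma(h) / gamma(0); the noise variance cancels."""
    return ma_autocov(theta, 1.0, h) / ma_autocov(theta, 1.0, 0)


# MA(1) with theta = 0.5 and sigma_w^2 = 2 (values assumed for illustration).
gamma0 = ma_autocov([0.5], 2.0, 0)   # (1 + theta^2) * sigma_w^2
gamma1 = ma_autocov([0.5], 2.0, 1)   # theta * sigma_w^2
rho1 = ma_acf([0.5], 1)              # theta / (1 + theta^2)
```

For these values the general formula reproduces the MA(1) special case: $\gamma(0) = 2.5$, $\gamma(1) = 1$, and $\rho(1) = 0.4$.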
ARMA($p$, $q$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}$$
$$\phi(B) x_t = \theta(B) w_t$$
Partial autocorrelation function (PACF)
- $x_i^{h-1} \triangleq$ regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \ldots, x_1\}$
- $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1}) \quad h \ge 2$
- E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$
ARIMA($p$, $d$, $q$)
$$\nabla^d x_t = (1 - B)^d x_t \text{ is ARMA}(p, q)$$
$$\phi(B)(1 - B)^d x_t = \theta(B) w_t$$
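Exponentially Weighted Moving Average (EWMA)
$$x_t = x_{t-1} + w_t - \lambda w_{t-1}$$
$$x_t = \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} x_{t-j} + w_t \quad \text{when } |\lambda| < 1$$
$$\tilde x_{n+1} = (1 - \lambda) x_n + \lambda \tilde x_n$$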
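The EWMA forecast recursion $\tilde x_{n+1} = (1 - \lambda) x_n + \lambda \tilde x_n$ can be sketched as a simple loop. The recursion needs a starting forecast, which the formulas above do not specify; using the first observation is an assumption of this example, as are the function name and data.

```python
def ewma_forecasts(x, lam, initial=None):
    """One-step-ahead forecasts via x~_{n+1} = (1 - lam) * x_n + lam * x~_n."""
    # Assumed starting value: the first observation, unless one is supplied.
    forecast = x[0] if initial is None else initial
    out = []
    for value in x:
        forecast = (1.0 - lam) * value + lam * forecast
        out.append(forecast)  # out[n-1] is the forecast of x_{n+1}
    return out


# Toy series and smoothing parameter (assumed for illustration).
forecasts = ewma_forecasts([2.0, 4.0, 6.0], lam=0.5)
```

Smaller values of $\lambda$ put more weight on the latest observation, so the forecast reacts faster to changes in the series.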
Seasonal ARIMA
- Denoted by ARIMA$(p, d, q) \times (P, D, Q)_s$
- $\Phi_P(B^s) \phi(B) \nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s) \theta(B) w_t$
21.4.1 Causality and Invertibility
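ARMA($p$, $q$) is causal (future-independent) $\iff \exists \{\psi_j\} : \sum_{j=0}^{\infty} |\psi_j| < \infty$ such that
$$x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t$$
ARMA($p$, $q$) is invertible $\iff \exists \{\pi_j\} : \sum_{j=0}^{\infty} |\pi_j| < \infty$ such that
$$\pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t$$
Properties
- ARMA($p$, $q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
$$\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)} \qquad |z| \le 1$$
- ARMA($p$, $q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
$$\pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)} \qquad |z| \le 1$$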
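The coefficients $\psi_j$ of the causal representation follow from matching powers of $z$ in $\psi(z)\phi(z) = \theta(z)$, which gives the recursion $\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)} \phi_k \psi_{j-k}$ with $\theta_0 = 1$ and $\theta_j = 0$ for $j > q$. The sketch below implements that recursion; the function name and the example coefficients are assumptions.

```python
def psi_weights(phi, theta, count):
    """First `count` coefficients of psi(z) = theta(z) / phi(z).

    phi   = [phi_1, ..., phi_p]     (AR coefficients)
    theta = [theta_1, ..., theta_q] (MA coefficients)
    """
    psi = []
    for j in range(count):
        if j == 0:
            theta_j = 1.0                       # theta_0 = 1
        elif j <= len(theta):
            theta_j = theta[j - 1]
        else:
            theta_j = 0.0                       # theta_j = 0 for j > q
        ar_part = sum(phi[k - 1] * psi[j - k]
                      for k in range(1, min(j, len(phi)) + 1))
        psi.append(theta_j + ar_part)
    return psi


# AR(1) with phi = 0.5: psi_j = phi^j, which is absolutely summable.
ar1 = psi_weights([0.5], [], 4)
# ARMA(1, 1) with phi = 0.5 and theta = 0.4.
arma11 = psi_weights([0.5], [0.4], 3)
```

For a non-causal model such as AR(1) with $\phi = 1.5$, the same recursion produces weights that grow without bound, which is consistent with the root of $\phi(z)$ lying inside the unit circle.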
Behavior of the ACF and PACF for causal and invertible ARMA models

| | AR($p$) | MA($q$) | ARMA($p$, $q$) |
|---|---|---|---|
| ACF | tails off | cuts off after lag $q$ | tails off |
| PACF | cuts off after lag $p$ | tails off | tails off |
21.5 Spectral Analysis
Periodic process
$$x_t = A \cos(2\pi\omega t + \phi) = U_1 \cos(2\pi\omega t) + U_2 \sin(2\pi\omega t)$$
- Frequency index $\omega$ (cycles per unit time), period $1/\omega$
- Amplitude $A$
- Phase $\phi$
- $U_1 = A \cos\phi$ and $U_2 = -A \sin\phi$ often normally distributed rvs
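Periodic mixture
$$x_t = \sum_{k=1}^{q} \left( U_{k1} \cos(2\pi\omega_k t) + U_{k2} \sin(2\pi\omega_k t) \right)$$
- $U_{k1}, U_{k2}$, for $k = 1, \ldots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
- $\gamma(h) = \sum_{k=1}^{q} \sigma_k^2 \cos(2\pi\omega_k h)$
- $\gamma(0) = \mathbb{E}[x_t^2] = \sum_{k=1}^{q} \sigma_k^2$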
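Spectral representation of a periodic process
$$\gamma(h) = \sigma^2 \cos(2\pi\omega_0 h) = \frac{\sigma^2}{2} e^{-2\pi i \omega_0 h} + \frac{\sigma^2}{2} e^{2\pi i \omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i \omega h}\,dF(\omega)$$
Spectral distribution function
$$F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \le \omega < \omega_0 \\ \sigma^2 & \omega \ge \omega_0 \end{cases}$$
- $F(-\infty) = F(-1/2) = 0$
- $F(\infty) = F(1/2) = \gamma(0)$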
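Spectral density
$$f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i \omega h} \qquad -\frac{1}{2} \le \omega \le \frac{1}{2}$$
- Needs $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i \omega h} f(\omega)\,d\omega \quad h = 0, \pm 1, \ldots$
- $f(\omega) \ge 0$
- $f(\omega) = f(-\omega)$
- $f(\omega) = f(1 - \omega)$
- $\gamma(0) = \mathbb{V}[x_t] = \int_{-1/2}^{1/2} f(\omega)\,d\omega$
- White noise: $f_w(\omega) = \sigma_w^2$
- ARMA($p$, $q$), $\phi(B) x_t = \theta(B) w_t$:
$$f_x(\omega) = \sigma_w^2 \frac{|\theta(e^{-2\pi i \omega})|^2}{|\phi(e^{-2\pi i \omega})|^2} \quad \text{where } \phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k \text{ and } \theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k$$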
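Discrete Fourier Transform (DFT)
$$d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t}$$
Fourier/Fundamental frequencies
$$\omega_j = j/n$$
Inverse DFT
$$x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i \omega_j t}$$
Periodogram
$$I(j/n) = |d(j/n)|^2$$
Scaled Periodogram
$$P(j/n) = \frac{4}{n} I(j/n) = \left( \frac{2}{n} \sum_{t=1}^{n} x_t \cos(2\pi t j/n) \right)^2 + \left( \frac{2}{n} \sum_{t=1}^{n} x_t \sin(2\pi t j/n) \right)^2$$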
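The DFT and scaled periodogram can be sketched with the standard library alone. For a pure cosine of amplitude $A$ at a Fourier frequency, the scaled periodogram at that frequency equals $A^2$, which makes a convenient check. The function names and the test signal are assumptions for illustration.

```python
import cmath
import math


def dft(x, freq_index):
    """d(omega_j) = n^{-1/2} * sum_{t=1}^{n} x_t * exp(-2 pi i j t / n)."""
    n = len(x)
    total = sum(
        x[t - 1] * cmath.exp(-2j * math.pi * freq_index * t / n)
        for t in range(1, n + 1)
    )
    return total / math.sqrt(n)


def periodogram(x, freq_index):
    """I(j/n) = |d(j/n)|^2."""
    return abs(dft(x, freq_index)) ** 2


def scaled_periodogram(x, freq_index):
    """P(j/n) = (4/n) * I(j/n)."""
    return 4.0 / len(x) * periodogram(x, freq_index)


# Test signal (assumed): amplitude 2, frequency 5/100, phase 0.3.
n = 100
x = [2.0 * math.cos(2.0 * math.pi * 5 * t / n + 0.3) for t in range(1, n + 1)]
peak = scaled_periodogram(x, 5)
```

This direct sum costs $O(n)$ per frequency; a fast Fourier transform would be the usual choice when all $n$ frequencies are needed.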
22 Math
22.1 Gamma Function
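- Ordinary: $\Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t}\,dt$
- Upper incomplete: $\Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t}\,dt$
- Lower incomplete: $\gamma(s, x) = \int_0^{x} t^{s-1} e^{-t}\,dt$
- $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha) \quad \alpha > 1$
- $\Gamma(n) = (n - 1)! \quad n \in \mathbb{N}$
- $\Gamma(1/2) = \sqrt{\pi}$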

## 22.2 Beta Function

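- Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1} (1 - t)^{y-1}\,dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}$
- Incomplete: $B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1}\,dt$
- Regularized incomplete:
$$I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!} x^j (1 - x)^{a+b-1-j}$$
- $I_0(a, b) = 0 \qquad I_1(a, b) = 1$
- $I_x(a, b) = 1 - I_{1-x}(b, a)$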
22.3 Series
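Finite
- $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$
- $\sum_{k=1}^{n} (2k - 1) = n^2$
- $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$
- $\sum_{k=1}^{n} k^3 = \left( \frac{n(n+1)}{2} \right)^2$
- $\sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1} \quad c \neq 1$
Binomial
- $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
- $\sum_{k=0}^{n} \binom{r+k}{k} = \binom{r+n+1}{n}$
- $\sum_{k=0}^{n} \binom{k}{m} = \binom{n+1}{m+1}$
- Vandermonde's Identity: $\sum_{k=0}^{r} \binom{m}{k} \binom{n}{r-k} = \binom{m+n}{r}$
- Binomial Theorem: $\sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = (a + b)^n$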
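The binomial identities above are easy to spot-check with exact integer arithmetic. The helper names and the particular arguments below are assumptions for illustration.

```python
from math import comb


def vandermonde_lhs(m, n, r):
    """Left side of Vandermonde's identity: sum_k C(m, k) * C(n, r - k)."""
    return sum(comb(m, k) * comb(n, r - k) for k in range(r + 1))


def parallel_sum(r, n):
    """sum_{k=0}^{n} C(r + k, k), which should equal C(r + n + 1, n)."""
    return sum(comb(r + k, k) for k in range(n + 1))


def upper_sum(n, m):
    """sum_{k=0}^{n} C(k, m), which should equal C(n + 1, m + 1)."""
    return sum(comb(k, m) for k in range(n + 1))


def binomial_theorem_lhs(a, b, n):
    """sum_{k=0}^{n} C(n, k) * a^(n-k) * b^k, which should equal (a + b)^n."""
    return sum(comb(n, k) * a ** (n - k) * b ** k for k in range(n + 1))
```

`math.comb` returns zero when the lower index exceeds the upper one, so the sums need no special handling at the boundaries.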
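Infinite
- $\sum_{k=0}^{\infty} p^k = \frac{1}{1 - p}, \quad \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p} \quad |p| < 1$
- $\sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp} \left( \sum_{k=0}^{\infty} p^k \right) = \frac{d}{dp} \left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2} \quad |p| < 1$
- $\sum_{k=0}^{\infty} \binom{r+k-1}{k} x^k = (1 - x)^{-r} \quad r \in \mathbb{N}^+$
- $\sum_{k=0}^{\infty} \binom{\alpha}{k} p^k = (1 + p)^{\alpha} \quad |p| < 1, \alpha \in \mathbb{C}$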
22.4 Combinatorics
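Sampling

| $k$ out of $n$ | w/o replacement | w/ replacement |
|---|---|---|
| ordered | $n^{\underline{k}} = \prod_{i=0}^{k-1} (n - i) = \frac{n!}{(n-k)!}$ | $n^k$ |
| unordered | $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!\,(n-k)!}$ | $\binom{n-1+k}{k} = \binom{n-1+k}{n-1}$ |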
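Stirling numbers, 2nd kind
$${n \brace k} = k {n-1 \brace k} + {n-1 \brace k-1} \quad 1 \le k \le n \qquad {n \brace 0} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$$
Partitions
$$P_{n+k,k} = \sum_{i=1}^{k} P_{n,i} \quad (n \ge 1) \qquad k > n : P_{n,k} = 0 \qquad n \ge 1 : P_{n,0} = 0, \; P_{0,0} = 1$$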
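The recurrence for Stirling numbers of the second kind maps directly onto a memoized function. The sketch below also counts surjections from $n$ distinguishable balls onto $m$ distinguishable urns as $m!$ times a Stirling number; the function names are assumptions for illustration.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n labeled items into k non-empty blocks."""
    if k == 0:
        return 1 if n == 0 else 0  # base case from the definition
    if k > n:
        return 0                   # cannot fill more blocks than items
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def surjections(n, m):
    """Surjective maps from n labeled balls onto m labeled urns."""
    return factorial(m) * stirling2(n, m)


def set_partitions_up_to(n, m):
    """Ways to place n labeled balls into at most m unlabeled, non-empty urns."""
    return sum(stirling2(n, k) for k in range(1, m + 1))
```

As a cross-check, the number of surjections from four balls onto two urns is $2^4 - 2 = 14$, which matches $2! \cdot 7$.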
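Balls and Urns: $f : B \to U$, where D = distinguishable and ¬D = indistinguishable.

| $\lvert B \rvert = n$, $\lvert U \rvert = m$ | $f$ arbitrary | $f$ injective | $f$ surjective | $f$ bijective |
|---|---|---|---|---|
| $B$: D, $U$: D | $m^n$ | $m^{\underline{n}}$ if $m \ge n$, else $0$ | $m!\,{n \brace m}$ | $n!$ if $m = n$, else $0$ |
| $B$: ¬D, $U$: D | $\binom{m+n-1}{n}$ | $\binom{m}{n}$ | $\binom{n-1}{m-1}$ | $1$ if $m = n$, else $0$ |
| $B$: D, $U$: ¬D | $\sum_{k=1}^{m} {n \brace k}$ | $1$ if $m \ge n$, else $0$ | ${n \brace m}$ | $1$ if $m = n$, else $0$ |
| $B$: ¬D, $U$: ¬D | $\sum_{k=1}^{m} P_{n,k}$ | $1$ if $m \ge n$, else $0$ | $P_{n,m}$ | $1$ if $m = n$, else $0$ |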
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen, Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen, Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.
[Figure: Univariate distribution relationships, courtesy Leemis and McQueston [2].]
