
Probability and Statistics Cookbook

Version 0.2.1
14th July, 2016
http://statistics.zone/

Copyright (c) Matthias Vallentin, 2016

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
References
This cookbook integrates various topics in probability theory and statistics. It is based on the literature [1, 6, 3] and on in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [4, 5]. If you find errors or have suggestions for improvements, please get in touch at http://statistics.zone/.

1 Distribution Overview

1.1 Discrete Distributions

Note: We use \gamma(s, x) and \Gamma(x) to refer to the Gamma functions (see 22.1), and B(x, y) and I_x to refer to the Beta functions (see 22.2).

Uniform Unif{a, ..., b}
  F_X(x) = 0 (x < a);  (\lfloor x \rfloor - a + 1)/(b - a + 1) (a \le x \le b);  1 (x > b)
  f_X(x) = I(a \le x \le b) / (b - a + 1)
  E[X] = (a + b)/2,  V[X] = ((b - a + 1)^2 - 1)/12
  M_X(s) = \frac{e^{as} - e^{(b+1)s}}{s(b - a)}

Bernoulli Bern(p)
  f_X(x) = p^x (1 - p)^{1 - x}
  E[X] = p,  V[X] = p(1 - p)
  M_X(s) = 1 - p + p e^s

Binomial Bin(n, p)
  F_X(x) = I_{1-p}(n - x, x + 1)
  f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}
  E[X] = np,  V[X] = np(1 - p)
  M_X(s) = (1 - p + p e^s)^n

Multinomial Mult(n, p)
  f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k},  with \sum_{i=1}^{k} x_i = n
  E[X_i] = n p_i,  V[X_i] = n p_i (1 - p_i),  Cov[X_i, X_j] = -n p_i p_j
  M_X(s) = \left( \sum_{i} p_i e^{s_i} \right)^n

Hypergeometric Hyp(N, m, n)
  f_X(x) = \binom{m}{x} \binom{N - m}{n - x} \Big/ \binom{N}{n}
  E[X] = \frac{nm}{N},  V[X] = \frac{nm(N - n)(N - m)}{N^2 (N - 1)}

Negative Binomial NBin(r, p)
  F_X(x) = I_p(r, x + 1)
  f_X(x) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x
  E[X] = r\frac{1 - p}{p},  V[X] = r\frac{1 - p}{p^2}
  M_X(s) = \left( \frac{p}{1 - (1 - p) e^s} \right)^r

Geometric Geo(p)
  F_X(x) = 1 - (1 - p)^x,  x \in \mathbb{N}^+
  f_X(x) = p (1 - p)^{x - 1}
  E[X] = \frac{1}{p},  V[X] = \frac{1 - p}{p^2}
  M_X(s) = \frac{p e^s}{1 - (1 - p) e^s}

Poisson Po(\lambda)
  F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}
  f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}
  E[X] = \lambda,  V[X] = \lambda
  M_X(s) = e^{\lambda(e^s - 1)}

[Figure] PMF and CDF plots of the discrete Uniform, Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (lambda = 1, 4, 10) distributions.
1.2 Continuous Distributions

Note: The Exponential and Gamma distributions are parameterized by \beta = 1/\lambda, so that f(x) \propto e^{-x/\beta}; some textbooks use \beta as a rate parameter instead [6].

Uniform Unif(a, b)
  F_X(x) = 0 (x < a);  \frac{x - a}{b - a} (a < x < b);  1 (x > b)
  f_X(x) = \frac{I(a < x < b)}{b - a}
  E[X] = \frac{a + b}{2},  V[X] = \frac{(b - a)^2}{12}
  M_X(s) = \frac{e^{sb} - e^{sa}}{s(b - a)}

Normal N(\mu, \sigma^2)
  F_X(x) = \Phi\!\left( \frac{x - \mu}{\sigma} \right),  \Phi(x) = \int_{-\infty}^{x} \phi(t) \, dt
  f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
  E[X] = \mu,  V[X] = \sigma^2
  M_X(s) = \exp\!\left( \mu s + \frac{\sigma^2 s^2}{2} \right)

Log-Normal ln N(\mu, \sigma^2)
  F_X(x) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\!\left[ \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right]
  f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)
  E[X] = e^{\mu + \sigma^2/2},  V[X] = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}

Multivariate Normal MVN(\mu, \Sigma)
  f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
  E[X] = \mu,  V[X] = \Sigma
  M_X(s) = \exp\!\left( \mu^T s + \frac{1}{2} s^T \Sigma s \right)

Student's t Student(\nu)
  f_X(x) = \frac{\Gamma\!\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\,\Gamma\!\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu + 1)/2}
  E[X] = 0 (\nu > 1),  V[X] = \frac{\nu}{\nu - 2} (\nu > 2), \infty (1 < \nu \le 2)

Chi-square \chi^2_k
  F_X(x) = \frac{1}{\Gamma(k/2)} \gamma\!\left( \frac{k}{2}, \frac{x}{2} \right)
  f_X(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}
  E[X] = k,  V[X] = 2k
  M_X(s) = (1 - 2s)^{-k/2},  s < 1/2

F F(d_1, d_2)
  F_X(x) = I_{d_1 x / (d_1 x + d_2)}\!\left( \frac{d_1}{2}, \frac{d_2}{2} \right)
  f_X(x) = \frac{\sqrt{(d_1 x)^{d_1} d_2^{d_2} / (d_1 x + d_2)^{d_1 + d_2}}}{x \, B(d_1/2, d_2/2)}
  E[X] = \frac{d_2}{d_2 - 2} (d_2 > 2),  V[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}

Exponential Exp(\beta)
  F_X(x) = 1 - e^{-x/\beta}
  f_X(x) = \frac{1}{\beta} e^{-x/\beta}
  E[X] = \beta,  V[X] = \beta^2
  M_X(s) = \frac{1}{1 - \beta s}  (s < 1/\beta)

Gamma Gamma(\alpha, \beta)
  F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}
  f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha - 1} e^{-x/\beta}
  E[X] = \alpha\beta,  V[X] = \alpha\beta^2
  M_X(s) = \left( \frac{1}{1 - \beta s} \right)^{\alpha}  (s < 1/\beta)

Inverse Gamma InvGamma(\alpha, \beta)
  F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}
  f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}
  E[X] = \frac{\beta}{\alpha - 1} (\alpha > 1),  V[X] = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\!\left( \sqrt{-4\beta s} \right)

Dirichlet Dir(\alpha)
  f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^{k} \alpha_i \right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1}
  E[X_i] = \frac{\alpha_i}{\sum_{i=1}^{k} \alpha_i},  V[X_i] = \frac{E[X_i](1 - E[X_i])}{\sum_{i=1}^{k} \alpha_i + 1}

Beta Beta(\alpha, \beta)
  F_X(x) = I_x(\alpha, \beta)
  f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}
  E[X] = \frac{\alpha}{\alpha + \beta},  V[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
  M_X(s) = 1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^k}{k!}

Weibull Weibull(\lambda, k)
  F_X(x) = 1 - e^{-(x/\lambda)^k},  x \ge 0
  f_X(x) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} e^{-(x/\lambda)^k},  x \ge 0
  E[X] = \lambda \Gamma\!\left( 1 + \frac{1}{k} \right),  V[X] = \lambda^2 \Gamma\!\left( 1 + \frac{2}{k} \right) - \mu^2
  M_X(s) = \sum_{n=0}^{\infty} \frac{s^n \lambda^n}{n!} \Gamma\!\left( 1 + \frac{n}{k} \right)

Pareto Pareto(x_m, \alpha)
  F_X(x) = 1 - \left( \frac{x_m}{x} \right)^{\alpha},  x \ge x_m
  f_X(x) = \alpha \frac{x_m^\alpha}{x^{\alpha + 1}},  x \ge x_m
  E[X] = \frac{\alpha x_m}{\alpha - 1} (\alpha > 1),  V[X] = \frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)} (\alpha > 2)
  M_X(s) = \alpha (-x_m s)^{\alpha} \Gamma(-\alpha, -x_m s)  (s < 0)

[Figure] PDF and CDF plots of the continuous Uniform, Normal, Log-Normal, Student's t, Chi-square, F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for various parameter settings.
2 Probability Theory

Definitions
  Sample space \Omega
  Outcome (point or element) \omega \in \Omega
  Event A \subseteq \Omega
  \sigma-algebra \mathcal{A}:
    1. \emptyset \in \mathcal{A}
    2. A_1, A_2, \ldots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}
    3. A \in \mathcal{A} \implies \neg A \in \mathcal{A}
  Probability distribution P:
    1. P[A] \ge 0 for every A
    2. P[\Omega] = 1
    3. P\!\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]
  Probability space (\Omega, \mathcal{A}, P)

Properties
  P[\emptyset] = 0
  B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)
  P[\neg A] = 1 - P[A]
  P[B] = P[A \cap B] + P[\neg A \cap B]
  DeMorgan: \neg\!\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n,  \neg\!\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n
  P\!\left[ \bigcup_n A_n \right] = 1 - P\!\left[ \bigcap_n \neg A_n \right]
  P[A \cup B] = P[A] + P[B] - P[A \cap B] \le P[A] + P[B]
  P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]
  P[A \cap \neg B] = P[A] - P[A \cap B]

Continuity of Probabilities
  A_1 \subset A_2 \subset \ldots \implies \lim_{n\to\infty} P[A_n] = P[A]  where A = \bigcup_{i=1}^{\infty} A_i
  A_1 \supset A_2 \supset \ldots \implies \lim_{n\to\infty} P[A_n] = P[A]  where A = \bigcap_{i=1}^{\infty} A_i

Independence
  A \perp B \iff P[A \cap B] = P[A]\,P[B]

Conditional Probability
  P[A \mid B] = \frac{P[A \cap B]}{P[B]},  P[B] > 0

Law of Total Probability
  P[B] = \sum_{i=1}^{n} P[B \mid A_i]\,P[A_i],  \Omega = \bigsqcup_{i=1}^{n} A_i

Bayes' Theorem
  P[A_i \mid B] = \frac{P[B \mid A_i]\,P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\,P[A_j]},  \Omega = \bigsqcup_{i=1}^{n} A_i

Inclusion-Exclusion Principle
  \left| \bigcup_{i=1}^{n} A_i \right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{i \le i_1 < \cdots < i_r \le n} \left| \bigcap_{j=1}^{r} A_{i_j} \right|

3 Random Variables

Random Variable (RV)
  X : \Omega \to \mathbb{R}

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]

Probability Density Function (PDF)
  P[a \le X \le b] = \int_a^b f(x) \, dx

Cumulative Distribution Function (CDF)
  F_X : \mathbb{R} \to [0, 1],  F_X(x) = P[X \le x]
  1. Nondecreasing: x_1 < x_2 \implies F(x_1) \le F(x_2)
  2. Normalized: \lim_{x\to-\infty} F(x) = 0 and \lim_{x\to\infty} F(x) = 1
  3. Right-continuous: \lim_{y \downarrow x} F(y) = F(x)

Conditional density
  P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x) \, dy,  f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}

Independence
  X \perp Y if
  1. P[X \le x, Y \le y] = P[X \le x]\,P[Y \le y]  for all x, y
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)

3.1 Transformations

Transformation function
  Z = \varphi(X)

Discrete
  f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f_X(x)

Continuous
  F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x) \, dx  with A_z = \{x : \varphi(x) \le z\}

Special case if \varphi strictly monotone
  f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x) \frac{1}{|J|}

Convolution
  Z := X + Y:   f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x) \, dx  (for X, Y \ge 0: \int_0^z f_{X,Y}(x, z - x) \, dx)
  Z := |X - Y|: f_Z(z) = 2 \int_0^{\infty} f_{X,Y}(x, z + x) \, dx
  Z := X/Y:     f_Z(z) = \int_{-\infty}^{\infty} |x| f_{X,Y}(x, xz) \, dx  (if X \perp Y: \int |x| f_X(x) f_Y(xz) \, dx)

4 Expectation

Definition and properties
  E[X] = \mu_X = \int x \, dF_X(x) = \sum_x x f_X(x) (X discrete),  \int x f_X(x) \, dx (X continuous)
  P[X = c] = 1 \implies E[X] = c
  E[cX] = c\,E[X]
  E[X + Y] = E[X] + E[Y]
  E[XY] = \int\!\!\int_{X,Y} x y \, dF_X(x) \, dF_Y(y)
  E[\varphi(Y)] \ne \varphi(E[Y])  in general (cf. Jensen's inequality)
  P[X \ge Y] = 1 \implies E[X] \ge E[Y]
  P[X = Y] = 1 \implies E[X] = E[Y]
  E[X] = \sum_{x=1}^{\infty} P[X \ge x]  (X taking values in the non-negative integers)

Sample mean
  \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i

Conditional expectation
  E[Y \mid X = x] = \int y f(y \mid x) \, dy
  E[X] = E[E[X \mid Y]]
  E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y) f_{Y|X}(y \mid x) \, dy
  E[\varphi(Y, Z) \mid X = x] = \int\!\!\int \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x) \, dy \, dz
  E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]
  E[\varphi(X) Y \mid X] = \varphi(X) E[Y \mid X]
  E[Y \mid X] = c \implies Cov[X, Y] = 0

The Rule of the Lazy Statistician
  E[Z] = \int \varphi(x) \, dF_X(x)
  E[I_A(x)] = \int I_A(x) \, dF_X(x) = \int_A dF_X(x) = P[X \in A]

5 Variance

Definition and properties
  V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2
  V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2 \sum_{i \ne j} Cov[X_i, X_j]
  V\!\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]  if the X_i are independent

Standard deviation
  sd[X] = \sqrt{V[X]} = \sigma_X

Covariance
  Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
  Cov[X, a] = 0
  Cov[X, X] = V[X]
  Cov[X, Y] = Cov[Y, X]
  Cov[aX, bY] = ab\,Cov[X, Y]
  Cov[X + a, Y + b] = Cov[X, Y]
  Cov\!\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov[X_i, Y_j]

Correlation
  \rho[X, Y] = \frac{Cov[X, Y]}{\sqrt{V[X] V[Y]}}

Independence
  X \perp Y \implies \rho[X, Y] = 0 \iff Cov[X, Y] = 0 \iff E[XY] = E[X]E[Y]

Sample variance
  S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2

Conditional variance
  V[Y \mid X] = E[(Y - E[Y \mid X])^2 \mid X] = E[Y^2 \mid X] - E[Y \mid X]^2
  V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]

6 Inequalities

  Cauchy-Schwarz: E[XY]^2 \le E[X^2] E[Y^2]
  Markov: P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}
  Chebyshev: P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}
  Chernoff: P[X \ge (1 + \delta)\mu] \le \left( \frac{e^\delta}{(1 + \delta)^{1+\delta}} \right)^{\mu},  \delta > -1
  Hoeffding: X_1, \ldots, X_n independent with P[X_i \in [a_i, b_i]] = 1, 1 \le i \le n:
    P[\bar{X} - E[\bar{X}] \ge t] \le e^{-2nt^2},  t > 0
    P[|\bar{X} - E[\bar{X}]| \ge t] \le 2 \exp\!\left( -\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right),  t > 0
  Jensen: E[\varphi(X)] \ge \varphi(E[X])  for \varphi convex

7 Distribution Relationships

Binomial
  X_i \sim Bern(p) \implies \sum_{i=1}^{n} X_i \sim Bin(n, p)
  X \sim Bin(n, p), Y \sim Bin(m, p) \implies X + Y \sim Bin(n + m, p)
  \lim_{n\to\infty} Bin(n, p) = Po(np)  (n large, p small)
  \lim_{n\to\infty} Bin(n, p) = N(np, np(1 - p))  (n large, p far from 0 and 1)

Negative Binomial
  X \sim NBin(1, p) = Geo(p)
  X \sim NBin(r, p) = \sum_{i=1}^{r} Geo(p)
  X_i \sim NBin(r_i, p) \implies \sum X_i \sim NBin(\sum r_i, p)
  X \sim NBin(r, p), Y \sim Bin(s + r, p) \implies P[X \le s] = P[Y \ge r]

Poisson
  X_i \sim Po(\lambda_i), X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim Po\!\left( \sum_{i=1}^{n} \lambda_i \right)
  X_i \sim Po(\lambda_i), X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim Bin\!\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \right)

Exponential
  X_i \sim Exp(\beta), X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim Gamma(n, \beta)
  Memoryless property: P[X > x + y \mid X > y] = P[X > x]

Normal
  X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1)
  X \sim N(\mu, \sigma^2), Z = aX + b \implies Z \sim N(a\mu + b, a^2\sigma^2)
  X_i \sim N(\mu_i, \sigma_i^2), X_i \perp X_j \implies \sum_i X_i \sim N\!\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)
  P[a < X \le b] = \Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)
  \Phi(-x) = 1 - \Phi(x),  \phi'(x) = -x\phi(x),  \phi''(x) = (x^2 - 1)\phi(x)
  Upper quantile of N(0, 1): z_\alpha = \Phi^{-1}(1 - \alpha)

Gamma
  X \sim Gamma(\alpha, \beta) \iff X/\beta \sim Gamma(\alpha, 1)
  Gamma(\alpha, \beta) \overset{d}{=} \sum_{i=1}^{\alpha} Exp(\beta)  (integer \alpha)
  X_i \sim Gamma(\alpha_i, \beta), X_i \perp X_j \implies \sum_i X_i \sim Gamma\!\left( \sum_i \alpha_i, \beta \right)
  \frac{\Gamma(\alpha)}{\lambda^\alpha} = \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x} \, dx

Beta
  \frac{1}{B(\alpha, \beta)} x^{\alpha - 1}(1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1}(1 - x)^{\beta - 1}
  E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]
  Beta(1, 1) \sim Unif(0, 1)

8 Probability and Moment Generating Functions

  G_X(t) = E[t^X],  |t| < 1
  M_X(t) = G_X(e^t) = E[e^{Xt}] = E\!\left[ \sum_{i=0}^{\infty} \frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty} \frac{E[X^i]}{i!} t^i
  P[X = 0] = G_X(0)
  P[X = 1] = G_X'(0)
  P[X = i] = \frac{G_X^{(i)}(0)}{i!}
  E[X] = G_X'(1^-)
  E[X^k] = M_X^{(k)}(0)
  E\!\left[ \frac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)
  V[X] = G_X''(1^-) + G_X'(1^-) - \left( G_X'(1^-) \right)^2
  G_X(t) = G_Y(t) \implies X \overset{d}{=} Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z \sim N(0, 1) with X \perp Z, and Y = \rho X + \sqrt{1 - \rho^2}\, Z.

Joint density
  f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\!\left( -\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)} \right)

Conditionals
  (Y \mid X = x) \sim N(\rho x, 1 - \rho^2)  and  (X \mid Y = y) \sim N(\rho y, 1 - \rho^2)

Independence
  X \perp Y \iff \rho = 0

9.2 Bivariate Normal

Let X \sim N(\mu_x, \sigma_x^2) and Y \sim N(\mu_y, \sigma_y^2).

  f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\!\left( -\frac{z}{2(1 - \rho^2)} \right)
  z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right)\!\left( \frac{y - \mu_y}{\sigma_y} \right)

Conditional mean and variance
  E[X \mid Y] = E[X] + \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y])
  V[X \mid Y] = \sigma_X^2 (1 - \rho^2)

9.3 Multivariate Normal

Covariance matrix \Sigma (precision matrix \Sigma^{-1})
  \Sigma = \begin{pmatrix} V[X_1] & \cdots & Cov[X_1, X_k] \\ \vdots & \ddots & \vdots \\ Cov[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}

If X \sim N(\mu, \Sigma),
  f_X(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Properties
  Z \sim N(0, 1), X = \mu + \Sigma^{1/2} Z \implies X \sim N(\mu, \Sigma)
  X \sim N(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim N(0, 1)
  X \sim N(\mu, \Sigma) \implies AX \sim N(A\mu, A\Sigma A^T)
  X \sim N(\mu, \Sigma), a a vector of length k \implies a^T X \sim N(a^T \mu, a^T \Sigma a)

10 Convergence

Let {X_1, X_2, ...} be a sequence of rvs and let X be another rv. Let F_n denote the cdf of X_n and F the cdf of X.

Types of Convergence
  1. In distribution (weakly, in law): X_n \overset{D}{\to} X
     \lim_{n\to\infty} F_n(t) = F(t)  at all t where F is continuous
  2. In probability: X_n \overset{P}{\to} X
     (\forall \varepsilon > 0)\ \lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0
  3. Almost surely (strongly): X_n \overset{as}{\to} X
     P[\lim_{n\to\infty} X_n = X] = P[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)] = 1
  4. In quadratic mean (L_2): X_n \overset{qm}{\to} X
     \lim_{n\to\infty} E[(X_n - X)^2] = 0

Relationships
  X_n \overset{qm}{\to} X \implies X_n \overset{P}{\to} X \implies X_n \overset{D}{\to} X
  X_n \overset{as}{\to} X \implies X_n \overset{P}{\to} X
  X_n \overset{D}{\to} X and (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \overset{P}{\to} X
  X_n \overset{P}{\to} X, Y_n \overset{P}{\to} Y \implies X_n + Y_n \overset{P}{\to} X + Y
  X_n \overset{qm}{\to} X, Y_n \overset{qm}{\to} Y \implies X_n + Y_n \overset{qm}{\to} X + Y
  X_n \overset{P}{\to} X, Y_n \overset{P}{\to} Y \implies X_n Y_n \overset{P}{\to} XY
  X_n \overset{P}{\to} X \implies \varphi(X_n) \overset{P}{\to} \varphi(X)
  X_n \overset{D}{\to} X \implies \varphi(X_n) \overset{D}{\to} \varphi(X)
  X_n \overset{qm}{\to} b \iff \lim_{n\to\infty} E[X_n] = b and \lim_{n\to\infty} V[X_n] = 0
  X_1, \ldots, X_n iid, E[X] = \mu, V[X] < \infty \implies \bar{X}_n \overset{qm}{\to} \mu

Slutsky's Theorem
  X_n \overset{D}{\to} X and Y_n \overset{P}{\to} c \implies X_n + Y_n \overset{D}{\to} X + c
  X_n \overset{D}{\to} X and Y_n \overset{P}{\to} c \implies X_n Y_n \overset{D}{\to} cX
  In general, X_n \overset{D}{\to} X and Y_n \overset{D}{\to} Y does not imply X_n + Y_n \overset{D}{\to} X + Y

10.1 Law of Large Numbers (LLN)

Let {X_1, ..., X_n} be a sequence of iid rvs with E[X_1] = \mu.
  Weak (WLLN): \bar{X}_n \overset{P}{\to} \mu as n \to \infty
  Strong (SLLN): \bar{X}_n \overset{as}{\to} \mu as n \to \infty

10.2 Central Limit Theorem (CLT)

Let {X_1, ..., X_n} be a sequence of iid rvs with E[X_1] = \mu and V[X_1] = \sigma^2.
  Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \overset{D}{\to} Z,  where Z \sim N(0, 1)
  \lim_{n\to\infty} P[Z_n \le z] = \Phi(z),  z \in \mathbb{R}

CLT notations
  Z_n \approx N(0, 1)
  \bar{X}_n \approx N(\mu, \sigma^2/n)
  \bar{X}_n - \mu \approx N(0, \sigma^2/n)
  \sqrt{n}(\bar{X}_n - \mu) \approx N(0, \sigma^2)
  \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \approx N(0, 1)

Continuity correction
  P[\bar{X}_n \le x] \approx \Phi\!\left( \frac{x + \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)
  P[\bar{X}_n \ge x] \approx 1 - \Phi\!\left( \frac{x - \frac{1}{2} - \mu}{\sigma/\sqrt{n}} \right)

Delta method
  Y_n \approx N\!\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx N\!\left( \varphi(\mu), (\varphi'(\mu))^2 \frac{\sigma^2}{n} \right)
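A minimal NumPy sketch illustrating the CLT numerically (an illustration, not part of the formal statement; the choice of Exponential data and all variable names are ours): standardized means of iid draws behave approximately like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 50, 10_000, 2.0        # sample size, replications, Exp scale
mu, sigma = beta, beta                 # for Exp(beta): E[X] = beta, sd[X] = beta

x = rng.exponential(scale=beta, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma   # Z_n = sqrt(n)(Xbar_n - mu)/sigma

# Empirical P[Z_n <= c] should be close to Phi(c) for moderate n.
for c in (-1.0, 0.0, 1.0, 1.96):
    print(c, np.mean(z <= c))
```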

11 Statistical Inference

Let X_1, ..., X_n iid ~ F if not otherwise noted.

11.1 Point Estimation

  Point estimator \hat\theta_n of \theta is a rv: \hat\theta_n = g(X_1, ..., X_n)
  bias(\hat\theta_n) = E[\hat\theta_n] - \theta
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Sampling distribution: F(\hat\theta_n)
  Standard error: se(\hat\theta_n) = \sqrt{V[\hat\theta_n]}
  Mean squared error: mse = E[(\hat\theta_n - \theta)^2] = bias(\hat\theta_n)^2 + V[\hat\theta_n]
  \lim_{n\to\infty} bias(\hat\theta_n) = 0 and \lim_{n\to\infty} se(\hat\theta_n) = 0 \implies \hat\theta_n is consistent
  Asymptotic normality: \frac{\hat\theta_n - \theta}{se} \overset{D}{\to} N(0, 1)
  Slutsky's Theorem often lets us replace se(\hat\theta_n) by some (weakly) consistent estimator \hat\sigma_n.

11.2 Normal-Based Confidence Interval

Suppose \hat\theta_n \approx N(\theta, \hat{se}^2). Let z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2), i.e., P[Z > z_{\alpha/2}] = \alpha/2 and P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha where Z \sim N(0, 1). Then
  C_n = \hat\theta_n \pm z_{\alpha/2}\,\hat{se}

11.3 Empirical distribution

Empirical Distribution Function (ECDF)
  \hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n},  where I(X_i \le x) = 1 if X_i \le x and 0 if X_i > x

Properties (for any fixed x)
  E[\hat{F}_n(x)] = F(x)
  V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}
  mse = \frac{F(x)(1 - F(x))}{n} \to 0
  \hat{F}_n(x) \overset{P}{\to} F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X_1, ..., X_n ~ F)
  P\!\left[ \sup_x |F(x) - \hat{F}_n(x)| > \varepsilon \right] \le 2 e^{-2n\varepsilon^2}

Nonparametric 1 - \alpha confidence band for F
  L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}
  U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}
  \varepsilon_n = \sqrt{\frac{1}{2n} \log\!\left( \frac{2}{\alpha} \right)}
  P[L(x) \le F(x) \le U(x) \ \forall x] \ge 1 - \alpha

11.4 Statistical Functionals

  Statistical functional: T(F)
  Plug-in estimator of \theta = T(F): \hat\theta_n = T(\hat{F}_n)
  Linear functional: T(F) = \int \varphi(x) \, dF_X(x)
  Plug-in estimator for linear functional: T(\hat{F}_n) = \int \varphi(x) \, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)
  Often: T(\hat{F}_n) \approx N(T(F), \hat{se}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\,\hat{se}
  pth quantile: F^{-1}(p) = \inf\{x : F(x) \ge p\}
  \hat\mu = \bar{X}_n
  \hat\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2
  \hat\kappa = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat\mu)^3}{\hat\sigma^3}
  \hat\rho = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}
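A minimal NumPy sketch of the ECDF and the DKW confidence band above (function and variable names are ours):

```python
import numpy as np

def ecdf_band(x, alpha=0.05):
    """ECDF F_n and the DKW 1-alpha confidence band, evaluated at the order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    F = np.arange(1, n + 1) / n                      # F_n at the sorted data points
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # DKW half-width eps_n
    return x, F, np.clip(F - eps, 0, 1), np.clip(F + eps, 0, 1)

x_sorted, F_hat, lower, upper = ecdf_band(np.random.default_rng(1).normal(size=200))
```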

12 Parametric Inference

Let F = {f(x; \theta) : \theta \in \Theta} be a parametric model with parameter space \Theta \subseteq \mathbb{R}^k and parameter \theta = (\theta_1, ..., \theta_k).

12.1 Method of Moments

jth moment
  \alpha_j(\theta) = E[X^j] = \int x^j \, dF_X(x)

jth sample moment
  \hat\alpha_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j

Method of Moments estimator (MoM): \hat\theta_n solves
  \alpha_1(\theta) = \hat\alpha_1,  \alpha_2(\theta) = \hat\alpha_2,  \ldots,  \alpha_k(\theta) = \hat\alpha_k

Properties of the MoM estimator
  \hat\theta_n exists with probability tending to 1
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Asymptotic normality: \sqrt{n}(\hat\theta - \theta) \overset{D}{\to} N(0, \Sigma)
    where \Sigma = g E[Y Y^T] g^T, Y = (X, X^2, \ldots, X^k)^T, g = (g_1, \ldots, g_k), g_j = \frac{\partial}{\partial\theta} \alpha_j^{-1}(\theta)

12.2 Maximum Likelihood

Likelihood: L_n : \Theta \to [0, \infty)
  L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)

Log-likelihood
  \ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)

Maximum likelihood estimator (mle)
  L_n(\hat\theta_n) = \sup_\theta L_n(\theta)

Score function
  s(X; \theta) = \frac{\partial}{\partial\theta} \log f(X; \theta)

Fisher information
  I(\theta) = V_\theta[s(X; \theta)],  I_n(\theta) = n I(\theta)

Fisher information (exponential family)
  I(\theta) = -E_\theta\!\left[ \frac{\partial}{\partial\theta} s(X; \theta) \right]

Observed Fisher information
  I_n^{obs}(\theta) = -\frac{\partial^2}{\partial\theta^2} \sum_{i=1}^{n} \log f(X_i; \theta)

Properties of the mle
  Consistency: \hat\theta_n \overset{P}{\to} \theta
  Equivariance: \hat\theta_n is the mle \implies \varphi(\hat\theta_n) is the mle of \varphi(\theta)
  Asymptotic normality:
    1. se \approx \sqrt{1/I_n(\theta)} and \frac{\hat\theta_n - \theta}{se} \overset{D}{\to} N(0, 1)
    2. \hat{se} \approx \sqrt{1/I_n(\hat\theta_n)} and \frac{\hat\theta_n - \theta}{\hat{se}} \overset{D}{\to} N(0, 1)
  Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If \tilde\theta_n is any other estimator, the asymptotic relative efficiency is
    are(\tilde\theta_n, \hat\theta_n) = \frac{V[\hat\theta_n]}{V[\tilde\theta_n]} \le 1
  Approximately the Bayes estimator

12.2.1 Delta Method

If \tau = \varphi(\theta) where \varphi is differentiable and \varphi'(\theta) \ne 0:
  \frac{\hat\tau_n - \tau}{\hat{se}(\hat\tau)} \overset{D}{\to} N(0, 1)
  where \hat\tau = \varphi(\hat\theta) is the mle of \tau and \hat{se}(\hat\tau) = |\varphi'(\hat\theta)|\,\hat{se}(\hat\theta_n)

12.3 Multiparameter Models

Let \theta = (\theta_1, ..., \theta_k) and \hat\theta = (\hat\theta_1, ..., \hat\theta_k) be the mle.
  H_{jj} = \frac{\partial^2 \ell_n}{\partial\theta_j^2},  H_{jk} = \frac{\partial^2 \ell_n}{\partial\theta_j \partial\theta_k}

Fisher information matrix
  I_n(\theta) = -\begin{pmatrix} E[H_{11}] & \cdots & E[H_{1k}] \\ \vdots & \ddots & \vdots \\ E[H_{k1}] & \cdots & E[H_{kk}] \end{pmatrix}

Under appropriate regularity conditions
  (\hat\theta - \theta) \approx N(0, J_n),  with J_n(\theta) = I_n^{-1}(\theta)
Further, if \hat\theta_j is the jth component of \hat\theta, then
  \frac{\hat\theta_j - \theta_j}{\hat{se}_j} \overset{D}{\to} N(0, 1)
  where \hat{se}_j^2 = J_n(j, j) and Cov[\hat\theta_j, \hat\theta_k] = J_n(j, k)

12.3.1 Multiparameter delta method

Let \tau = \varphi(\theta_1, ..., \theta_k) and let the gradient of \varphi be
  \nabla\varphi = \left( \frac{\partial\varphi}{\partial\theta_1}, \ldots, \frac{\partial\varphi}{\partial\theta_k} \right)^T

Suppose \nabla\varphi evaluated at \hat\theta is not zero and \hat\tau = \varphi(\hat\theta). Then
  \frac{\hat\tau - \tau}{\hat{se}(\hat\tau)} \overset{D}{\to} N(0, 1)
  where \hat{se}(\hat\tau) = \sqrt{(\hat\nabla\varphi)^T \hat{J}_n (\hat\nabla\varphi)},  \hat{J}_n = J_n(\hat\theta) and \hat\nabla\varphi = \nabla\varphi evaluated at \theta = \hat\theta

12.4 Parametric Bootstrap

Sample from f(x; \hat\theta_n) instead of from \hat{F}_n, where \hat\theta_n could be the mle or the method of moments estimator.
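A minimal numerical sketch of mle plus Fisher-information standard error for one concrete model (the Poisson example, data, and variable names are ours): for Po(\lambda) the mle is \hat\lambda = \bar{x} and I_n(\lambda) = n/\lambda, giving a Wald-type interval \hat\lambda \pm z_{\alpha/2}\sqrt{\hat\lambda/n}.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.5, size=500)

lam_hat = x.mean()                   # mle of the Poisson rate
fisher_n = x.size / lam_hat          # I_n(lambda) evaluated at the mle
se_hat = np.sqrt(1.0 / fisher_n)     # se ~ sqrt(1 / I_n(lambda_hat))

z = 1.959963984540054                # z_{alpha/2} for alpha = 0.05
print(lam_hat, se_hat, (lam_hat - z * se_hat, lam_hat + z * se_hat))
```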

13 Hypothesis Testing

  H_0 : \theta \in \Theta_0  versus  H_1 : \theta \in \Theta_1

Definitions
  Null hypothesis H_0
  Alternative hypothesis H_1
  Simple hypothesis: \theta = \theta_0
  Composite hypothesis: \theta > \theta_0 or \theta < \theta_0
  Two-sided test: H_0 : \theta = \theta_0 versus H_1 : \theta \ne \theta_0
  One-sided test: H_0 : \theta \le \theta_0 versus H_1 : \theta > \theta_0
  Critical value c
  Test statistic T
  Rejection region R = {x : T(x) > c}
  Power function \beta(\theta) = P_\theta[X \in R]
  Power of a test: 1 - P[Type II error] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)
  Test size: \alpha = P[Type I error] = \sup_{\theta \in \Theta_0} \beta(\theta)

              Retain H_0              Reject H_0
  H_0 true    ok                      Type I error (\alpha)
  H_1 true    Type II error (\beta)   ok (power)

p-value
  p-value = \sup_{\theta \in \Theta_0} P_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}
  p-value = \sup_{\theta \in \Theta_0} P_\theta[T(X^\star) \ge T(X)] = \inf\{\alpha : T(X) \in R_\alpha\}
  (since T(X^\star) \sim F_\theta, the inner probability equals 1 - F_\theta(T(X)))

  p-value        evidence
  < 0.01         very strong evidence against H_0
  0.01 - 0.05    strong evidence against H_0
  0.05 - 0.1     weak evidence against H_0
  > 0.1          little or no evidence against H_0

Wald test
  Two-sided test: reject H_0 when |W| > z_{\alpha/2} where W = \frac{\hat\theta - \theta_0}{\hat{se}}
  P[|W| > z_{\alpha/2}] \to \alpha
  p-value = P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)

Likelihood ratio test (LRT)
  T(X) = \frac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \frac{L_n(\hat\theta_n)}{L_n(\hat\theta_{n,0})}
  \lambda(X) = 2 \log T(X) \overset{D}{\to} \chi^2_{r-q},  where \chi^2_k = \sum_{i=1}^{k} Z_i^2 and Z_1, ..., Z_k iid ~ N(0, 1)
  p-value = P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]

Multinomial LRT
  mle: \hat{p}_n = \left( \frac{X_1}{n}, \ldots, \frac{X_k}{n} \right)
  T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^{k} \left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}
  \lambda(X) = 2 \sum_{j=1}^{k} X_j \log\!\left( \frac{\hat{p}_j}{p_{0j}} \right) \overset{D}{\to} \chi^2_{k-1}
  The approximate size \alpha LRT rejects H_0 when \lambda(X) \ge \chi^2_{k-1,\alpha}

Pearson Chi-square Test
  T = \sum_{j=1}^{k} \frac{(X_j - E[X_j])^2}{E[X_j]},  where E[X_j] = n p_{0j} under H_0
  T \overset{D}{\to} \chi^2_{k-1}
  p-value = P[\chi^2_{k-1} > T(x)]
  Faster convergence to \chi^2_{k-1} than the LRT, hence preferable for small n.

Independence testing
  I rows, J columns, X multinomial sample of size n = I \cdot J
  mles unconstrained: \hat{p}_{ij} = \frac{X_{ij}}{n}
  mles under H_0: \hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\frac{X_{\cdot j}}{n}
  LRT: \lambda = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\!\left( \frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)
  Pearson chi-square: T = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}
  Both LRT and Pearson \overset{D}{\to} \chi^2_\nu, where \nu = (I - 1)(J - 1)

14 Exponential Family

Scalar parameter
  f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}

Vector parameter
  f_X(x \mid \theta) = h(x) \exp\!\left\{ \sum_{i=1}^{s} \eta_i(\theta) T_i(x) - A(\theta) \right\}
                     = h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\}
                     = h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}

Natural form
  f_X(x \mid \eta) = h(x) \exp\{\eta \cdot T(x) - A(\eta)\}
                   = h(x) g(\eta) \exp\{\eta \cdot T(x)\}
                   = h(x) g(\eta) \exp\{\eta^T T(x)\}

15 Bayesian Inference

Bayes' Theorem
  f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x^n)} = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta) \, d\theta} \propto L_n(\theta) f(\theta)

Definitions
  X^n = (X_1, ..., X_n),  x^n = (x_1, ..., x_n)
  Prior density f(\theta)
  Likelihood f(x^n \mid \theta): joint density of the data; in particular, if X^n iid, then f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta)
  Posterior density f(\theta \mid x^n)
  Normalizing constant c_n = f(x^n) = \int f(x \mid \theta) f(\theta) \, d\theta
  Kernel: the part of a density that depends on \theta
  Posterior mean \bar\theta_n = \int \theta f(\theta \mid x^n) \, d\theta = \frac{\int \theta L_n(\theta) f(\theta) \, d\theta}{\int L_n(\theta) f(\theta) \, d\theta}

15.1 Credible Intervals

Posterior interval
  P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n) \, d\theta = 1 - \alpha

Equal-tail credible interval
  \int_{-\infty}^{a} f(\theta \mid x^n) \, d\theta = \int_{b}^{\infty} f(\theta \mid x^n) \, d\theta = \alpha/2

Highest posterior density (HPD) region R_n
  1. P[\theta \in R_n] = 1 - \alpha
  2. R_n = \{\theta : f(\theta \mid x^n) > k\} for some k
  R_n is unimodal \implies R_n is an interval

15.2 Function of parameters

Let \tau = \varphi(\theta) and A = \{\theta : \varphi(\theta) \le \tau\}.

Posterior CDF for \tau
  H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n) \, d\theta

Posterior density
  h(\tau \mid x^n) = H'(\tau \mid x^n)

Bayesian delta method
  \tau \mid X^n \approx N\!\left( \varphi(\hat\theta),\ \hat{se}\,|\varphi'(\hat\theta)| \right)

15.3 Priors

Choice
  Subjective Bayesianism: the prior should incorporate as much detail as possible of the researcher's a priori knowledge, via prior elicitation.
  Objective Bayesianism: the prior should incorporate as little detail as possible (non-informative prior).
  Robust Bayesianism: consider various priors and determine the sensitivity of our inferences to changes in the prior.

Types
  Flat: f(\theta) \propto constant
  Proper: \int f(\theta) \, d\theta = 1
  Improper: \int f(\theta) \, d\theta = \infty
  Jeffreys' prior (transformation-invariant): f(\theta) \propto \sqrt{I(\theta)},  f(\theta) \propto \sqrt{\det(I(\theta))}
  Conjugate: f(\theta) and f(\theta \mid x^n) belong to the same parametric family

15.3.1 Conjugate Priors

Continuous likelihoods (subscript c denotes a fixed, known constant):

  Likelihood              Conjugate prior                      Posterior hyperparameters
  Unif(0, \theta)         Pareto(x_m, k)                       \max\{x_{(n)}, x_m\},  k + n
  Exp(\lambda)            Gamma(\alpha, \beta)                 \alpha + n,  \beta + \sum_{i=1}^{n} x_i
  N(\mu, \sigma_c^2)      N(\mu_0, \sigma_0^2)                 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right),   \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}
  N(\mu_c, \sigma^2)      Scaled Inverse Chi-square(\nu, \sigma_0^2)   \nu + n,  \frac{\nu\sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu_c)^2}{\nu + n}
  N(\mu, \sigma^2)        Normal-scaled Inverse Gamma(\lambda, \nu, \alpha, \beta)
  MVN(\mu, \Sigma_c)      MVN(\mu_0, \Sigma_0)                 \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}\!\left( \Sigma_0^{-1}\mu_0 + n\Sigma_c^{-1}\bar{x} \right),   \left( \Sigma_0^{-1} + n\Sigma_c^{-1} \right)^{-1}
  MVN(\mu_c, \Sigma)      Inverse-Wishart(\kappa, \Psi)        \kappa + n,  \Psi + \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T
  Pareto(x_{mc}, k)       Gamma(\alpha, \beta)                 \alpha + n,  \beta + \sum_{i=1}^{n} \log\frac{x_i}{x_{mc}}
  Pareto(x_m, k_c)        Pareto(x_0, k_0)                     x_0,  k_0 - k_c n  (where k_0 > k_c n)
  Gamma(\alpha_c, \beta)  Gamma(\alpha_0, \beta_0)             \alpha_0 + n\alpha_c,  \beta_0 + \sum_{i=1}^{n} x_i

Discrete likelihoods:

  Likelihood        Conjugate prior        Posterior hyperparameters
  Bern(p)           Beta(\alpha, \beta)    \alpha + \sum_{i=1}^{n} x_i,  \beta + n - \sum_{i=1}^{n} x_i
  Bin(p)            Beta(\alpha, \beta)    \alpha + \sum_{i=1}^{n} x_i,  \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i
  NBin(p)           Beta(\alpha, \beta)    \alpha + rn,  \beta + \sum_{i=1}^{n} x_i
  Po(\lambda)       Gamma(\alpha, \beta)   \alpha + \sum_{i=1}^{n} x_i,  \beta + n
  Multinomial(p)    Dir(\alpha)            \alpha + \sum_{i=1}^{n} x^{(i)}
  Geo(p)            Beta(\alpha, \beta)    \alpha + n,  \beta + \sum_{i=1}^{n} x_i
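A minimal sketch of the first row of the table, the Beta-Bernoulli update (the helper name and example data are ours):

```python
import numpy as np

def beta_bernoulli_update(x, alpha=1.0, beta=1.0):
    """Posterior hyperparameters for Bern(p) data under a Beta(alpha, beta) prior."""
    x = np.asarray(x)
    s = x.sum()
    return alpha + s, beta + x.size - s      # (alpha + sum x_i, beta + n - sum x_i)

a_post, b_post = beta_bernoulli_update([1, 0, 1, 1, 0, 1, 1, 1], alpha=2, beta=2)
print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean of p
```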

15.4 Bayesian Testing

If H_0 : \theta \in \Theta_0:
  Prior probability P[H_0] = \int_{\Theta_0} f(\theta) \, d\theta
  Posterior probability P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n) \, d\theta

Let H_0, ..., H_{K-1} be K hypotheses. Suppose \theta \sim f(\theta \mid H_k):
  P[H_k \mid x^n] = \frac{f(x^n \mid H_k)\,P[H_k]}{\sum_{k=1}^{K} f(x^n \mid H_k)\,P[H_k]}

Marginal likelihood
  f(x^n \mid H_i) = \int_{\Theta} f(x^n \mid \theta, H_i) f(\theta \mid H_i) \, d\theta

Posterior odds (of H_i relative to H_j)
  \frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \frac{f(x^n \mid H_i)}{f(x^n \mid H_j)} \times \frac{P[H_i]}{P[H_j]}
  (the first factor is the Bayes factor BF_{ij}, the second the prior odds)

Bayes factor
  log10 BF_10    BF_10       evidence
  0 - 0.5        1 - 1.5     Weak
  0.5 - 1        1.5 - 10    Moderate
  1 - 2          10 - 100    Strong
  > 2            > 100       Decisive

  p^* = \frac{\frac{p}{1 - p} BF_{10}}{1 + \frac{p}{1 - p} BF_{10}}  where p = P[H_1] and p^* = P[H_1 \mid x^n]
16 Sampling Methods

16.1 Inverse Transform Sampling

Setup
  U \sim Unif(0, 1)
  X \sim F
  F^{-1}(u) = \inf\{x : F(x) \ge u\}

Algorithm
  1. Generate u \sim Unif(0, 1)
  2. Compute x = F^{-1}(u)
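A minimal NumPy sketch of the algorithm above (the helper name and the Exponential example are ours):

```python
import numpy as np

def inverse_transform(F_inv, size, rng=None):
    """Draw from F by applying the quantile function F^{-1} to Unif(0, 1) draws."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=size)
    return F_inv(u)

# Exponential with scale beta: F^{-1}(u) = -beta * log(1 - u)
beta = 2.0
x = inverse_transform(lambda u: -beta * np.log(1.0 - u), size=10_000)
print(x.mean())   # should be close to beta
```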
16.2 The Bootstrap

Let T_n = g(X_1, ..., X_n) be a statistic.

  1. Estimate V_F[T_n] with V_{\hat{F}_n}[T_n].
  2. Approximate V_{\hat{F}_n}[T_n] using simulation:
     (a) Repeat the following B times to get T^*_{n,1}, ..., T^*_{n,B}, an iid sample from the sampling distribution implied by \hat{F}_n:
         i. Sample uniformly with replacement X^*_1, ..., X^*_n \sim \hat{F}_n.
         ii. Compute T^*_n = g(X^*_1, ..., X^*_n).
     (b) Then
         v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^{B} \left( T^*_{n,b} - \frac{1}{B} \sum_{r=1}^{B} T^*_{n,r} \right)^2
16.2.1 Bootstrap Confidence Intervals

Normal-based interval
  T_n \pm z_{\alpha/2}\,\hat{se}_{boot}

Pivotal interval
  1. Location parameter \theta = T(F)
  2. Pivot R_n = \hat\theta_n - \theta
  3. Let H(r) = P[R_n \le r] be the cdf of R_n
  4. Let R^*_{n,b} = \hat\theta^*_{n,b} - \hat\theta_n. Approximate H using the bootstrap:
     \hat{H}(r) = \frac{1}{B} \sum_{b=1}^{B} I(R^*_{n,b} \le r)
  5. \theta^*_\beta = \beta sample quantile of (\hat\theta^*_{n,1}, ..., \hat\theta^*_{n,B})
  6. r^*_\beta = \beta sample quantile of (R^*_{n,1}, ..., R^*_{n,B}), i.e., r^*_\beta = \theta^*_\beta - \hat\theta_n
  7. Approximate 1 - \alpha confidence interval C_n = (\hat{a}, \hat{b}) where
     \hat{a} = \hat\theta_n - \hat{H}^{-1}\!\left( 1 - \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{1 - \alpha/2} = 2\hat\theta_n - \theta^*_{1 - \alpha/2}
     \hat{b} = \hat\theta_n - \hat{H}^{-1}\!\left( \frac{\alpha}{2} \right) = \hat\theta_n - r^*_{\alpha/2} = 2\hat\theta_n - \theta^*_{\alpha/2}

Percentile interval
  C_n = \left( \theta^*_{\alpha/2},\ \theta^*_{1 - \alpha/2} \right)

16.3 Rejection Sampling

Setup
  We can easily sample from g(\theta)
  We want to sample from h(\theta), but it is difficult
  We know h(\theta) up to a proportional constant: h(\theta) = \frac{k(\theta)}{\int k(\theta) \, d\theta}
  Envelope condition: we can find M > 0 such that k(\theta) \le M g(\theta)

Algorithm
  1. Draw \theta^{cand} \sim g(\theta)
  2. Generate u \sim Unif(0, 1)
  3. Accept \theta^{cand} if u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}
  4. Repeat until B values of \theta^{cand} have been accepted

Example
  We can easily sample from the prior, g(\theta) = f(\theta)
  Target is the posterior: h(\theta) \propto k(\theta) = f(x^n \mid \theta) f(\theta)
  Envelope condition: f(x^n \mid \theta) \le f(x^n \mid \hat\theta_n) = L_n(\hat\theta_n) =: M
  Algorithm
    1. Draw \theta^{cand} \sim f(\theta)
    2. Generate u \sim Unif(0, 1)
    3. Accept \theta^{cand} if u \le \frac{L_n(\theta^{cand})}{L_n(\hat\theta_n)}
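A minimal NumPy sketch of the generic rejection sampler above (the helper name and the Gaussian-target/Cauchy-envelope example are ours; M is chosen so that k \le M g holds):

```python
import numpy as np

def rejection_sample(k, g_sample, g_pdf, M, B, rng=None):
    """Accept draws theta ~ g whenever u <= k(theta) / (M * g(theta))."""
    rng = rng or np.random.default_rng()
    out = []
    while len(out) < B:
        cand = g_sample(rng)
        if rng.uniform() <= k(cand) / (M * g_pdf(cand)):
            out.append(cand)
    return np.array(out)

# Target h(x) proportional to k(x) = exp(-x^2/2); envelope g = standard Cauchy.
k = lambda x: np.exp(-0.5 * x * x)
g_pdf = lambda x: 1.0 / (np.pi * (1.0 + x * x))
g_sample = lambda rng: rng.standard_cauchy()
M = 2 * np.pi * np.exp(-0.5)          # max of k/g is attained at x = +/- 1
samples = rejection_sample(k, g_sample, g_pdf, M, B=5000)   # ~ N(0, 1) draws
```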

16.4 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(\theta) \mid x^n]:
  1. Sample from the prior: \theta_1, ..., \theta_B iid ~ f(\theta)
  2. w_i = \frac{L_n(\theta_i)}{\sum_{i=1}^{B} L_n(\theta_i)},  i = 1, ..., B
  3. E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i) w_i
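A minimal NumPy sketch of the algorithm above, using the prior as the importance function (the Bernoulli example, data, and helper names are ours; log-weights are stabilized before normalizing):

```python
import numpy as np

def importance_posterior_mean(q, log_lik, prior_sample, B=10_000, rng=None):
    """Self-normalized importance-sampling estimate of E[q(theta) | x^n]."""
    rng = rng or np.random.default_rng()
    theta = prior_sample(rng, B)              # theta_1, ..., theta_B ~ f(theta)
    logw = log_lik(theta)                     # log L_n(theta_i)
    w = np.exp(logw - logw.max())             # stabilize, then normalize
    w /= w.sum()
    return np.sum(q(theta) * w)

# Bernoulli data with a Unif(0, 1) prior on p; posterior mean of p.
x = np.array([1, 0, 1, 1, 1, 0, 1])
log_lik = lambda p: x.sum() * np.log(p) + (x.size - x.sum()) * np.log(1 - p)
est = importance_posterior_mean(q=lambda p: p, log_lik=log_lik,
                                prior_sample=lambda rng, B: rng.uniform(1e-9, 1 - 1e-9, B))
print(est)   # close to the exact posterior mean (sum(x) + 1)/(n + 2) = 6/9
```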

17 Decision Theory

Definitions
  Unknown quantity affecting our decision: \theta \in \Theta
  Decision rule: synonymous with an estimator \hat\theta
  Action a \in A: a possible value of the decision rule. In the estimation context, the action is just an estimate of \theta, \hat\theta(x).
  Loss function L: consequences of taking action a when the true state is \theta, i.e., the discrepancy between \theta and \hat\theta.  L : \Theta \times A \to [-k, \infty).

Loss functions
  Squared error loss: L(\theta, a) = (\theta - a)^2
  Linear loss: L(\theta, a) = K_1(\theta - a) if a - \theta < 0;  K_2(a - \theta) if a - \theta \ge 0
  Absolute error loss: L(\theta, a) = |\theta - a|  (linear loss with K_1 = K_2 = 1)
  L_p loss: L(\theta, a) = |\theta - a|^p
  Zero-one loss: L(\theta, a) = 0 if a = \theta;  1 if a \ne \theta

17.1 Risk

Posterior risk
  r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x)) f(\theta \mid x) \, d\theta = E_{\theta|X}[L(\theta, \hat\theta(x))]

(Frequentist) risk
  R(\theta, \hat\theta) = \int L(\theta, \hat\theta(x)) f(x \mid \theta) \, dx = E_{X|\theta}[L(\theta, \hat\theta(X))]

Bayes risk
  r(f, \hat\theta) = \iint L(\theta, \hat\theta(x)) f(x, \theta) \, dx \, d\theta = E_{\theta,X}[L(\theta, \hat\theta(X))]
  r(f, \hat\theta) = E_\theta[E_{X|\theta}[L(\theta, \hat\theta(X))]] = E_\theta[R(\theta, \hat\theta)]
  r(f, \hat\theta) = E_X[E_{\theta|X}[L(\theta, \hat\theta(X))]] = E_X[r(\hat\theta \mid X)]

17.2 Admissibility

  \hat\theta' dominates \hat\theta if
    \forall\theta : R(\theta, \hat\theta') \le R(\theta, \hat\theta)
    \exists\theta : R(\theta, \hat\theta') < R(\theta, \hat\theta)
  \hat\theta is inadmissible if there is at least one other estimator \hat\theta' that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)
  r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta)
  \hat\theta(x) = \inf r(\hat\theta \mid x) for all x \implies r(f, \hat\theta) = \int r(\hat\theta \mid x) f(x) \, dx

Theorems
  Squared error loss: posterior mean
  Absolute error loss: posterior median
  Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk
  \bar{R}(\hat\theta) = \sup_\theta R(\theta, \hat\theta),  \bar{R}(a) = \sup_\theta R(\theta, a)

Minimax rule
  \sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \bar{R}(\tilde\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta)

  \hat\theta = Bayes rule with \exists c : R(\theta, \hat\theta) = c \implies \hat\theta is minimax
  Least favorable prior: \hat\theta^f = Bayes rule with R(\theta, \hat\theta^f) \le r(f, \hat\theta^f) \ \forall\theta \implies \hat\theta^f is minimax

18 Linear Regression

Definitions
  Response variable Y
  Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model
  Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,  E[\varepsilon_i \mid X_i] = 0, V[\varepsilon_i \mid X_i] = \sigma^2

Fitted line
  \hat{r}(x) = \hat\beta_0 + \hat\beta_1 x

Predicted (fitted) values
  \hat{Y}_i = \hat{r}(X_i)

Residuals
  \hat\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)

Residual sum of squares (rss)
  rss(\hat\beta_0, \hat\beta_1) = \sum_{i=1}^{n} \hat\varepsilon_i^2

Least squares estimates
  \hat\beta^T = (\hat\beta_0, \hat\beta_1)^T minimizes rss:
  \hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{X}_n
  \hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}
  E[\hat\beta \mid X^n] = (\beta_0, \beta_1)^T
  V[\hat\beta \mid X^n] = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} \frac{1}{n}\sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}
  \hat{se}(\hat\beta_0) = \frac{\hat\sigma}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}},   \hat{se}(\hat\beta_1) = \frac{\hat\sigma}{s_X \sqrt{n}}
  where s_X^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 and \hat\sigma^2 = \frac{1}{n - 2}\sum_{i=1}^{n} \hat\varepsilon_i^2 (unbiased estimate)

Further properties
  Consistency: \hat\beta_0 \overset{P}{\to} \beta_0 and \hat\beta_1 \overset{P}{\to} \beta_1
  Asymptotic normality:
    \frac{\hat\beta_0 - \beta_0}{\hat{se}(\hat\beta_0)} \overset{D}{\to} N(0, 1)  and  \frac{\hat\beta_1 - \beta_1}{\hat{se}(\hat\beta_1)} \overset{D}{\to} N(0, 1)
  Approximate 1 - \alpha confidence intervals: \hat\beta_0 \pm z_{\alpha/2}\,\hat{se}(\hat\beta_0) and \hat\beta_1 \pm z_{\alpha/2}\,\hat{se}(\hat\beta_1)
  Wald test for H_0 : \beta_1 = 0 vs. H_1 : \beta_1 \ne 0: reject H_0 if |W| > z_{\alpha/2} where W = \hat\beta_1 / \hat{se}(\hat\beta_1)

R^2
  R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat\varepsilon_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{rss}{tss}

Likelihood
  L = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = L_1 \cdot L_2
  L_1 = \prod_{i=1}^{n} f_X(X_i)
  L_2 \propto \sigma^{-n} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_i \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}

18.2 Prediction

Observe the covariate value X = x_* and predict the outcome Y_*.
  \hat{Y}_* = \hat\beta_0 + \hat\beta_1 x_*
  V[\hat{Y}_*] = V[\hat\beta_0] + x_*^2 V[\hat\beta_1] + 2 x_* Cov[\hat\beta_0, \hat\beta_1]

Prediction interval
  \hat\xi_n^2 = \hat\sigma^2 \left( \frac{\sum_{i=1}^{n} (X_i - x_*)^2}{n \sum_i (X_i - \bar{X})^2} + 1 \right)
  \hat{Y}_* \pm z_{\alpha/2}\,\hat\xi_n

18.3 Multiple Regression

  Y = X\beta + \varepsilon
  where
  X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix},
  \beta = (\beta_1, \ldots, \beta_k)^T,  \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T

Likelihood
  L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} rss \right)
  rss = (y - X\beta)^T (y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{N} (Y_i - x_i^T \beta)^2

If the (k x k) matrix X^T X is invertible,
  \hat\beta = (X^T X)^{-1} X^T Y
  V[\hat\beta \mid X^n] = \sigma^2 (X^T X)^{-1}
  \hat\beta \approx N(\beta, \sigma^2 (X^T X)^{-1})

Estimate of the regression function
  \hat{r}(x) = \sum_{j=1}^{k} \hat\beta_j x_j

Unbiased estimate for \sigma^2
  \hat\sigma^2 = \frac{1}{n - k} \sum_{i=1}^{n} \hat\varepsilon_i^2,  \hat\varepsilon = X\hat\beta - Y

mle
  \hat\sigma^2_{mle} = \frac{n - k}{n} \hat\sigma^2
  Under the assumption of normality, the least squares estimator is also the mle, but the least squares variance estimator is not the mle.

1 - \alpha confidence interval for \beta_j
  \hat\beta_j \pm z_{\alpha/2}\,\hat{se}(\hat\beta_j)

18.4 Model Selection

Consider predicting a new observation Y^* for covariates X^* and let S \subseteq J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
  Underfitting: too few covariates yields high bias
  Overfitting: too many covariates yields high variance

Procedure
  1. Assign a score to each model
  2. Search through all models to find the one with the highest score

Hypothesis testing
  H_0 : \beta_j = 0 vs. H_1 : \beta_j \ne 0,  \forall j \in J

Mean squared prediction error (mspe)
  mspe = E[(\hat{Y}(S) - Y^*)^2]

Prediction risk
  R(S) = \sum_{i=1}^{n} mspe_i = \sum_{i=1}^{n} E[(\hat{Y}_i(S) - Y_i^*)^2]

Training error
  \hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2

R^2
  R^2(S) = 1 - \frac{rss(S)}{tss} = 1 - \frac{\hat{R}_{tr}(S)}{tss} = \frac{\sum_{i=1}^{n} (\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}

The training error is a downward-biased estimate of the prediction risk:
  E[\hat{R}_{tr}(S)] < R(S)
  bias(\hat{R}_{tr}(S)) = E[\hat{R}_{tr}(S)] - R(S) = -2 \sum_{i=1}^{n} Cov[\hat{Y}_i, Y_i]

Adjusted R^2
  \bar{R}^2(S) = 1 - \frac{n - 1}{n - k} \frac{rss}{tss}

Mallows' C_p statistic
  \hat{R}(S) = \hat{R}_{tr}(S) + 2 k \hat\sigma^2 = lack of fit + complexity penalty

Akaike Information Criterion (AIC)
  AIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - k

Bayesian Information Criterion (BIC)
  BIC(S) = \ell_n(\hat\beta_S, \hat\sigma_S^2) - \frac{k}{2} \log n

Validation and training
  \hat{R}_V(S) = \sum_{i=1}^{m} (\hat{Y}_i^*(S) - Y_i^*)^2,  m = |{validation data}|, often \frac{n}{4} or \frac{n}{2}

Leave-one-out cross-validation
  \hat{R}_{CV}(S) = \sum_{i=1}^{n} (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)} \right)^2
  U(S) = X_S (X_S^T X_S)^{-1} X_S^T  (hat matrix)

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where f(x) satisfies P[X \in A] = \int_A f(x) \, dx.

Integrated squared error (ise)
  L(f, \hat{f}_n) = \int \left( f(x) - \hat{f}_n(x) \right)^2 dx = J(h) + \int f^2(x) \, dx

Frequentist risk
  R(f, \hat{f}_n) = E[L(f, \hat{f}_n)] = \int b^2(x) \, dx + \int v(x) \, dx
  b(x) = E[\hat{f}_n(x)] - f(x)
  v(x) = V[\hat{f}_n(x)]

19.1.1 Histograms

Definitions
  Number of bins m
  Binwidth h = 1/m
  Bin B_j has \nu_j observations
  Define \hat{p}_j = \nu_j / n and p_j = \int_{B_j} f(u) \, du

Histogram estimator
  \hat{f}_n(x) = \sum_{j=1}^{m} \frac{\hat{p}_j}{h} I(x \in B_j)
  E[\hat{f}_n(x)] = \frac{p_j}{h}
  V[\hat{f}_n(x)] = \frac{p_j (1 - p_j)}{n h^2}
  R(\hat{f}_n, f) \approx \frac{h^2}{12} \int (f'(u))^2 \, du + \frac{1}{nh}
  h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 \, du} \right)^{1/3}
  R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}},  C = \left( \frac{3}{4} \right)^{2/3} \left( \int (f'(u))^2 \, du \right)^{1/3}

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \int \hat{f}_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i)}(X_i) = \frac{2}{(n - 1)h} - \frac{n + 1}{(n - 1)h} \sum_{j=1}^{m} \hat{p}_j^2

19.1.2 Kernel Density Estimator (KDE)

Kernel K
  K(x) \ge 0
  \int K(x) \, dx = 1
  \int x K(x) \, dx = 0
  \int x^2 K(x) \, dx \equiv \sigma_K^2 > 0

KDE
  \hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right)
  R(f, \hat{f}_n) \approx \frac{1}{4} (h\sigma_K)^4 \int (f''(x))^2 \, dx + \frac{1}{nh} \int K^2(x) \, dx
  h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}},  c_1 = \sigma_K^2,  c_2 = \int K^2(x) \, dx,  c_3 = \int (f''(x))^2 \, dx
  R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}},  c_4 = \frac{5}{4} (\sigma_K^2)^{2/5} \left( \int K^2(x) \, dx \right)^{4/5} \left( \int (f'')^2 \, dx \right)^{1/5}

Epanechnikov kernel
  K(x) = \frac{3}{4\sqrt{5}} \left( 1 - \frac{x^2}{5} \right) for |x| < \sqrt{5},  0 otherwise

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \int \hat{f}_n^2(x) \, dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{(-i)}(X_i) \approx \frac{1}{h n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K^*\!\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)
  K^*(x) = K^{(2)}(x) - 2K(x),  K^{(2)}(x) = \int K(x - y) K(y) \, dy
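A minimal NumPy sketch of the KDE formula above with a Gaussian kernel (the helper name, the data, and the normal-reference bandwidth choice are ours):

```python
import numpy as np

def kde(x_grid, data, h,
        kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """f_n(x) = (1/n) sum_i K((x - X_i)/h) / h, evaluated on a grid."""
    data = np.asarray(data, float)
    u = (x_grid[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h

data = np.random.default_rng(5).normal(size=300)
grid = np.linspace(-4, 4, 200)
h = 1.06 * data.std(ddof=1) * data.size ** (-1 / 5)   # normal-reference bandwidth
f_hat = kde(grid, data, h)
```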

19.2 Non-parametric Regression

Estimate r(x) where r(x) = E[Y | X = x]. Consider pairs of points (x_1, Y_1), ..., (x_n, Y_n) related by
  Y_i = r(x_i) + \varepsilon_i,  E[\varepsilon_i] = 0, V[\varepsilon_i] = \sigma^2

k-nearest-neighbor estimator
  \hat{r}(x) = \frac{1}{k} \sum_{i : x_i \in N_k(x)} Y_i,  where N_k(x) = {the k values of x_1, ..., x_n closest to x}

Nadaraya-Watson kernel estimator
  \hat{r}(x) = \sum_{i=1}^{n} w_i(x) Y_i,  w_i(x) = \frac{K\!\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^{n} K\!\left( \frac{x - x_j}{h} \right)}
  R(\hat{r}_n, r) \approx \frac{h^4}{4} \left( \int x^2 K(x) \, dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \frac{\sigma^2 \int K^2(x) \, dx}{nh} \int \frac{1}{f(x)} \, dx
  h^* \approx \frac{c_1}{n^{1/5}},  R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}

Cross-validation estimate of E[J(h)]
  \hat{J}_{CV}(h) = \sum_{i=1}^{n} (Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^{n} \left( \frac{Y_i - \hat{r}(x_i)}{1 - \frac{K(0)}{\sum_{j=1}^{n} K\left( \frac{x_i - x_j}{h} \right)}} \right)^2
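A minimal NumPy sketch of the Nadaraya-Watson estimator above with a Gaussian kernel (the helper name and the simulated sine-curve data are ours):

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """r_hat(x) = sum_i w_i(x) Y_i with Gaussian kernel weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    u = (x_grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)
    w = K / K.sum(axis=1, keepdims=True)   # w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h)
    return w @ y

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
r_hat = nadaraya_watson(np.linspace(0, 2 * np.pi, 100), x, y, h=0.3)
```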

19.3 Smoothing Using Orthogonal Functions

Approximation
  r(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x) \approx \sum_{j=1}^{J} \beta_j \phi_j(x)

Multivariate regression form
  Y = \Phi\beta + \eta,  where \eta_i = \varepsilon_i and
  \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}

Least squares estimator
  \hat\beta = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \frac{1}{n} \Phi^T Y  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
  \hat{R}_{CV}(J) = \sum_{i=1}^{n} \left( Y_i - \sum_{j=1}^{J} \phi_j(x_i) \hat\beta_{j,(-i)} \right)^2

20 Stochastic Processes

Stochastic process
  \{X_t : t \in T\},  with T = \{0, 1, \ldots\} = \mathbb{Z}^{\ge 0} (discrete) or T = [0, \infty) (continuous)
  Notation: X_t, X(t)
  State space \mathcal{X}
  Index set T

20.1 Markov Chains

Markov chain
  P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}]  \forall n \in T, x \in \mathcal{X}

Transition probabilities
  p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]
  p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i]  (n-step)

Transition matrix P (n-step: P_n)
  (i, j) element is p_{ij}
  p_{ij} \ge 0 and \sum_j p_{ij} = 1

Chapman-Kolmogorov
  p_{ij}(m + n) = \sum_k p_{ik}(m)\,p_{kj}(n)
  P_{m+n} = P_m P_n
  P_n = P \cdot P \cdots P = P^n

Marginal probability
  \mu_n = (\mu_n(1), \ldots, \mu_n(N)),  where \mu_n(i) = P[X_n = i]
  \mu_0: initial distribution
  \mu_n = \mu_0 P^n
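A minimal NumPy sketch of the Chapman-Kolmogorov relation P_n = P^n and the marginal \mu_n = \mu_0 P^n for a small chain (the two-state example is ours):

```python
import numpy as np

# Two-state chain: n-step transition matrix via matrix powers, marginal via mu_0 P^n.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu0 = np.array([1.0, 0.0])            # start in state 0

Pn = np.linalg.matrix_power(P, 10)    # 10-step transition probabilities
mu10 = mu0 @ Pn                       # marginal distribution after 10 steps
print(Pn)
print(mu10)                           # approaches the stationary distribution (5/6, 1/6)
```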

20.2 Poisson Processes

Poisson process
  \{X_t : t \in [0, \infty)\} = number of events up to and including time t
  X_0 = 0
  Independent increments: \forall t_0 < \cdots < t_n :  X_{t_1} - X_{t_0} \perp \cdots \perp X_{t_n} - X_{t_{n-1}}
  Intensity function \lambda(t):
    P[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)
    P[X_{t+h} - X_t = 2] = o(h)
  X_{s+t} - X_s \sim Po(m(s + t) - m(s)),  where m(t) = \int_0^t \lambda(s) \, ds

Homogeneous Poisson process
  \lambda(t) \equiv \lambda > 0 \implies X_t \sim Po(\lambda t)

Waiting times
  W_t := time at which X_t occurs
  W_t \sim Gamma\!\left( t, \frac{1}{\lambda} \right)

Interarrival times
  S_t = W_{t+1} - W_t
  S_t \sim Exp\!\left( \frac{1}{\lambda} \right)
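A minimal NumPy sketch simulating a homogeneous Poisson process via its Exp(1/\lambda) interarrival times (the helper name and parameters are ours):

```python
import numpy as np

def homogeneous_poisson(lam, t_max, rng=None):
    """Event times on [0, t_max] built from Exponential interarrivals with mean 1/lam."""
    rng = rng or np.random.default_rng()
    times, t = [], 0.0
    while True:
        t += rng.exponential(scale=1.0 / lam)   # S_t has mean 1/lam
        if t > t_max:
            return np.array(times)
        times.append(t)

events = homogeneous_poisson(lam=2.0, t_max=10.0)
print(len(events))   # X_{t_max} ~ Po(lam * t_max), so about 20 on average
```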

21 Time Series

Mean function
  \mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x) \, dx

Autocovariance function
  \gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s \mu_t
  \gamma_x(t, t) = E[(x_t - \mu_t)^2] = V[x_t]

Autocorrelation function (ACF)
  \rho(s, t) = \frac{Cov[x_s, x_t]}{\sqrt{V[x_s] V[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}

Cross-covariance function (CCV)
  \gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]

Cross-correlation function (CCF)
  \rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}

Backshift operator
  B^k(x_t) = x_{t-k}

Difference operator
  \nabla^d = (1 - B)^d

White noise
  w_t \sim wn(0, \sigma_w^2)
  Gaussian: w_t iid ~ N(0, \sigma_w^2)
  E[w_t] = 0 \ \forall t \in T
  V[w_t] = \sigma_w^2 \ \forall t \in T
  \gamma_w(s, t) = 0 \ \forall s \ne t

Random walk
  Drift \delta
  x_t = \delta t + \sum_{j=1}^{t} w_j
  E[x_t] = \delta t

Symmetric moving average
  m_t = \sum_{j=-k}^{k} a_j x_{t-j},  where a_j = a_{-j} \ge 0 and \sum_{j=-k}^{k} a_j = 1

21.1 Stationary Time Series

Strictly stationary
  P[x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k] = P[x_{t_1 + h} \le c_1, \ldots, x_{t_k + h} \le c_k]  \forall k \in \mathbb{N}, t_k, c_k, h \in \mathbb{Z}

Weakly stationary
  E[x_t^2] < \infty \ \forall t \in \mathbb{Z}
  E[x_t] = m \ \forall t \in \mathbb{Z}
  \gamma_x(s, t) = \gamma_x(s + r, t + r) \ \forall r, s, t \in \mathbb{Z}

Autocovariance function
  \gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)]
  \gamma(0) = E[(x_t - \mu)^2]
  \gamma(0) \ge 0,  \gamma(0) \ge |\gamma(h)|,  \gamma(h) = \gamma(-h)

Autocorrelation function (ACF)
  \rho_x(h) = \frac{Cov[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}] V[x_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}

Jointly stationary time series
  \gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]
  \rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}

Linear process
  x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j},  where \sum_{j=-\infty}^{\infty} |\psi_j| < \infty
  \gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j

21.2 Estimation of Correlation

Sample mean
  \bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t

Variance of the sample mean
  V[\bar{x}] = \frac{1}{n} \sum_{h=-n}^{n} \left( 1 - \frac{|h|}{n} \right) \gamma_x(h)

Sample autocovariance function
  \hat\gamma(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x})

Sample autocorrelation function
  \hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}

Sample cross-covariance function
  \hat\gamma_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})

Sample cross-correlation function
  \hat\rho_{xy}(h) = \frac{\hat\gamma_{xy}(h)}{\sqrt{\hat\gamma_x(0)\hat\gamma_y(0)}}

Properties
  \sigma_{\hat\rho_x(h)} = \frac{1}{\sqrt{n}}  if x_t is white noise
  \sigma_{\hat\rho_{xy}(h)} = \frac{1}{\sqrt{n}}  if x_t or y_t is white noise
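A minimal NumPy sketch of the sample autocovariance and autocorrelation above (the helper name and white-noise example are ours):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample ACF rho_hat(h) = gamma_hat(h) / gamma_hat(0) for h = 0..max_lag."""
    x = np.asarray(x, float)
    n, xbar = x.size, x.mean()
    gamma0 = np.sum((x - xbar) ** 2) / n
    acf = [1.0]
    for h in range(1, max_lag + 1):
        gamma_h = np.sum((x[h:] - xbar) * (x[:-h] - xbar)) / n
        acf.append(gamma_h / gamma0)
    return np.array(acf)

w = np.random.default_rng(7).normal(size=500)   # white noise
print(sample_acf(w, 5))   # lags >= 1 should lie within about +/- 1/sqrt(n) of zero
```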

21.3 Non-Stationary Time Series

Classical decomposition model
  x_t = \mu_t + s_t + w_t
  \mu_t = trend,  s_t = seasonal component,  w_t = random noise term

21.3.1 Detrending

Least squares
  1. Choose a trend model, e.g., \mu_t = \beta_0 + \beta_1 t + \beta_2 t^2
  2. Minimize rss to obtain the trend estimate \hat\mu_t = \hat\beta_0 + \hat\beta_1 t + \hat\beta_2 t^2
  3. The residuals then approximate the noise w_t

Moving average
  The low-pass filter v_t is a symmetric moving average m_t with a_j = \frac{1}{2k + 1}:
  v_t = \frac{1}{2k + 1} \sum_{i=-k}^{k} x_{t-i}
  If \frac{1}{2k + 1} \sum_{i=-k}^{k} w_{t-i} \approx 0, a linear trend function \mu_t = \beta_0 + \beta_1 t passes without distortion

Differencing
  \mu_t = \beta_0 + \beta_1 t \implies \nabla x_t = \beta_1 + w_t - w_{t-1}

21.4 ARIMA models

Autoregressive polynomial
  \phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p,  z \in \mathbb{C}, \phi_p \ne 0

Autoregressive operator
  \phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p

Autoregressive model of order p, AR(p)
  x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t

AR(1)
  x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1} \phi^j w_{t-j}  \to  \sum_{j=0}^{\infty} \phi^j w_{t-j}  as k \to \infty when |\phi| < 1  (linear process)
  E[x_t] = \sum_{j=0}^{\infty} \phi^j E[w_{t-j}] = 0
  \gamma(h) = Cov[x_{t+h}, x_t] = \frac{\sigma_w^2 \phi^h}{1 - \phi^2}
  \rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h,  \rho(h) = \phi\,\rho(h - 1),  h = 1, 2, \ldots

Moving average polynomial
  \theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,  z \in \mathbb{C}, \theta_q \ne 0

Moving average operator
  \theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q

MA(q) (moving average model of order q)
  x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t
  E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0
  \gamma(h) = Cov[x_{t+h}, x_t] = \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h}  for 0 \le h \le q,  0 for h > q

MA(1)
  x_t = w_t + \theta w_{t-1}
  \gamma(h) = (1 + \theta^2)\sigma_w^2 (h = 0),  \theta\sigma_w^2 (h = 1),  0 (h > 1)
  \rho(h) = \frac{\theta}{1 + \theta^2} (h = 1),  0 (h > 1)

ARMA(p, q)
  x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}
  \phi(B) x_t = \theta(B) w_t

Partial autocorrelation function (PACF)
  x_i^{h-1}: regression of x_i on {x_{h-1}, x_{h-2}, \ldots, x_1}
  \phi_{hh} = corr(x_h - x_h^{h-1}, x_0 - x_0^{h-1}),  h \ge 2
  E.g., \phi_{11} = corr(x_1, x_0) = \rho(1)

ARIMA(p, d, q)
  \nabla^d x_t = (1 - B)^d x_t is ARMA(p, q)
  \phi(B)(1 - B)^d x_t = \theta(B) w_t

Exponentially Weighted Moving Average (EWMA)
  x_t = x_{t-1} + w_t - \lambda w_{t-1}
  x_t = \sum_{j=1}^{\infty} (1 - \lambda)\lambda^{j-1} x_{t-j} + w_t  when |\lambda| < 1
  \tilde{x}_{n+1} = (1 - \lambda) x_n + \lambda \tilde{x}_n

Seasonal ARIMA
  Denoted by ARIMA(p, d, q) x (P, D, Q)_s
  \Phi_P(B^s)\,\phi(B)\,\nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t

21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) iff there exist constants {\psi_j} with \sum_{j=0}^{\infty} |\psi_j| < \infty such that
  x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t

ARMA(p, q) is invertible iff there exist constants {\pi_j} with \sum_{j=0}^{\infty} |\pi_j| < \infty such that
  \pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t

Properties
  ARMA(p, q) causal \iff the roots of \phi(z) lie outside the unit circle
    \psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)},  |z| \le 1
  ARMA(p, q) invertible \iff the roots of \theta(z) lie outside the unit circle
    \pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)},  |z| \le 1

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
  x_t = A\cos(2\pi\omega t + \phi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)
  Frequency index \omega (cycles per unit time), period 1/\omega
  Amplitude A
  Phase \phi
  U_1 = A\cos\phi and U_2 = A\sin\phi, often normally distributed rvs

Periodic mixture
  x_t = \sum_{k=1}^{q} \left( U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t) \right)
  U_{k1}, U_{k2}, for k = 1, ..., q, are independent zero-mean rvs with variances \sigma_k^2
  \gamma(h) = \sum_{k=1}^{q} \sigma_k^2 \cos(2\pi\omega_k h)
  \gamma(0) = E[x_t^2] = \sum_{k=1}^{q} \sigma_k^2

Spectral representation of a periodic process
  \gamma(h) = \sigma^2\cos(2\pi\omega_0 h) = \frac{\sigma^2}{2} e^{-2\pi i\omega_0 h} + \frac{\sigma^2}{2} e^{2\pi i\omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i\omega h} \, dF(\omega)

Spectral distribution function
  F(\omega) = 0 (\omega < -\omega_0),  \sigma^2/2 (-\omega_0 \le \omega < \omega_0),  \sigma^2 (\omega \ge \omega_0)
  F(-\infty) = F(-1/2) = 0
  F(\infty) = F(1/2) = \gamma(0)

Spectral density
  f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i\omega h},  -\frac{1}{2} \le \omega \le \frac{1}{2}
  Needs \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i\omega h} f(\omega) \, d\omega,  h = 0, \pm 1, \ldots
  f(\omega) \ge 0
  f(\omega) = f(-\omega)
  f(\omega) = f(1 - \omega)
  \gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega) \, d\omega
  White noise: f_w(\omega) = \sigma_w^2
  ARMA(p, q), \phi(B)x_t = \theta(B)w_t:
    f_x(\omega) = \sigma_w^2 \frac{|\theta(e^{-2\pi i\omega})|^2}{|\phi(e^{-2\pi i\omega})|^2},
    where \phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k and \theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k

Discrete Fourier Transform (DFT)
  d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i\omega_j t}

Fourier/fundamental frequencies
  \omega_j = j/n

Inverse DFT
  x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i\omega_j t}

Periodogram
  I(j/n) = |d(j/n)|^2

Scaled periodogram
  P(j/n) = \frac{4}{n} I(j/n) = \left( \frac{2}{n} \sum_{t=1}^{n} x_t \cos(2\pi t j/n) \right)^2 + \left( \frac{2}{n} \sum_{t=1}^{n} x_t \sin(2\pi t j/n) \right)^2

22 Math

22.1 Gamma Function

  Ordinary: \Gamma(s) = \int_0^{\infty} t^{s-1} e^{-t} \, dt
  Upper incomplete: \Gamma(s, x) = \int_x^{\infty} t^{s-1} e^{-t} \, dt
  Lower incomplete: \gamma(s, x) = \int_0^{x} t^{s-1} e^{-t} \, dt
  \Gamma(\alpha + 1) = \alpha\Gamma(\alpha),  \alpha > 1
  \Gamma(n) = (n - 1)!,  n \in \mathbb{N}
  \Gamma(0) = \Gamma(-1) = \infty
  \Gamma(1/2) = \sqrt{\pi},  \Gamma(-1/2) = -2\sqrt{\pi}

22.2 Beta Function

  Ordinary: B(x, y) = B(y, x) = \int_0^1 t^{x-1}(1 - t)^{y-1} \, dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}
  Incomplete: B(x; a, b) = \int_0^x t^{a-1}(1 - t)^{b-1} \, dt
  Regularized incomplete:
    I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \frac{(a + b - 1)!}{j!(a + b - 1 - j)!} x^j (1 - x)^{a+b-1-j}
  I_0(a, b) = 0,  I_1(a, b) = 1
  I_x(a, b) = 1 - I_{1-x}(b, a)

22.3 Series

Finite
  \sum_{k=1}^{n} k = \frac{n(n + 1)}{2}
  \sum_{k=1}^{n} (2k - 1) = n^2
  \sum_{k=1}^{n} k^2 = \frac{n(n + 1)(2n + 1)}{6}
  \sum_{k=1}^{n} k^3 = \left( \frac{n(n + 1)}{2} \right)^2
  \sum_{k=0}^{n} c^k = \frac{c^{n+1} - 1}{c - 1},  c \ne 1

Binomial
  Binomial theorem: \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = (a + b)^n
  \sum_{k=0}^{n} \binom{n}{k} = 2^n
  \sum_{k=0}^{n} \binom{r + k}{k} = \binom{r + n + 1}{n}
  \sum_{k=0}^{n} \binom{k}{m} = \binom{n + 1}{m + 1}
  Vandermonde's identity: \sum_{k=0}^{r} \binom{m}{k}\binom{n}{r - k} = \binom{m + n}{r}

Infinite
  \sum_{k=0}^{\infty} p^k = \frac{1}{1 - p},  \sum_{k=1}^{\infty} p^k = \frac{p}{1 - p},  |p| < 1
  \sum_{k=0}^{\infty} k p^{k-1} = \frac{d}{dp}\!\left( \sum_{k=0}^{\infty} p^k \right) = \frac{d}{dp}\!\left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2},  |p| < 1
  \sum_{k=0}^{\infty} \binom{r + k - 1}{k} x^k = (1 - x)^{-r},  r \in \mathbb{N}^+
  \sum_{k=0}^{\infty} \binom{\alpha}{k} p^k = (1 + p)^{\alpha},  |p| < 1, \alpha \in \mathbb{C}

22.4 Combinatorics

Sampling k out of n
                      ordered                                              unordered
  w/o replacement     n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \frac{n!}{(n - k)!}    \binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n - k)!}
  w/ replacement      n^k                                                  \binom{n - 1 + r}{r} = \binom{n - 1 + r}{n - 1}

Stirling numbers, 2nd kind
  S(n, k) = k\,S(n - 1, k) + S(n - 1, k - 1),  1 \le k \le n
  S(n, 0) = 1 if n = 0,  0 otherwise

Partitions
  P_{n+k,k} = \sum_{i=1}^{n} P_{n,i},  P_{n,k} = 0 for k > n,  P_{n,0} = 0 for n \ge 1,  P_{0,0} = 1

Balls and Urns  (|B| = n balls, |U| = m urns, map f : B -> U; D = distinguishable, ~D = indistinguishable)

                    f arbitrary                 f injective                       f surjective         f bijective
  B: D,  U: D      m^n                          m^{\underline{n}} if m >= n, else 0   m!\,S(n, m)          n! if m = n, else 0
  B: ~D, U: D      \binom{m + n - 1}{n}         \binom{m}{n} if m >= n, else 0        \binom{n - 1}{m - 1}  1 if m = n, else 0
  B: D,  U: ~D     \sum_{k=1}^{m} S(n, k)       1 if m >= n, else 0                   S(n, m)              1 if m = n, else 0
  B: ~D, U: ~D     \sum_{k=1}^{m} P_{n,k}       1 if m >= n, else 0                   P_{n,m}              1 if m = n, else 0

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.


Univariate distribution relationships, courtesy Leemis and McQueston [2].
