
Probability and Statistics

Cookbook

Copyright © Matthias Vallentin, 2011
vallentin@icir.org
12th December, 2011

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature [1, 6, 3] and in-class material from courses of the statistics department at the University of California in Berkeley but also influenced by other sources [4, 5]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter delta method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Bayesian Inference
  14.1 Credible Intervals
  14.2 Function of parameters
  14.3 Priors
    14.3.1 Conjugate Priors
  14.4 Bayesian Testing
15 Exponential Family
16 Sampling Methods
  16.1 The Bootstrap
    16.1.1 Bootstrap Confidence Intervals
  16.2 Rejection Sampling
  16.3 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
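1 Distribution Overview

1.1 Discrete Distributions

For each distribution: notation (see footnote 1), CDF $F_X(x)$, PMF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$.

Uniform, $\mathrm{Unif}\{a, \dots, b\}$
  $F_X(x) = 0$ for $x < a$; $\frac{\lfloor x \rfloor - a + 1}{b - a + 1}$ for $a \le x \le b$; $1$ for $x > b$
  $f_X(x) = \frac{I(a \le x \le b)}{b - a + 1}$
  $E[X] = \frac{a + b}{2}$
  $V[X] = \frac{(b - a + 1)^2 - 1}{12}$
  $M_X(s) = \frac{e^{as} - e^{(b+1)s}}{(b - a + 1)(1 - e^{s})}$

Bernoulli, $\mathrm{Bern}(p)$
  $F_X(x) = (1 - p)^{1 - x}$
  $f_X(x) = p^x (1 - p)^{1 - x}$
  $E[X] = p$
  $V[X] = p(1 - p)$
  $M_X(s) = 1 - p + p e^s$

Binomial, $\mathrm{Bin}(n, p)$
  $F_X(x) = I_{1-p}(n - x, x + 1)$
  $f_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}$
  $E[X] = np$
  $V[X] = np(1 - p)$
  $M_X(s) = (1 - p + p e^s)^n$

Multinomial, $\mathrm{Mult}(n, p)$
  $f_X(x) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ with $\sum_{i=1}^k x_i = n$
  $E[X_i] = n p_i$
  $V[X_i] = n p_i (1 - p_i)$
  $M_X(s) = \left( \sum_{i=1}^k p_i e^{s_i} \right)^n$

Hypergeometric, $\mathrm{Hyp}(N, m, n)$
  $F_X(x) \approx \Phi\left( \frac{x - np}{\sqrt{np(1 - p)}} \right)$ with $p = m/N$
  $f_X(x) = \frac{\binom{m}{x} \binom{N - m}{n - x}}{\binom{N}{n}}$
  $E[X] = \frac{nm}{N}$
  $V[X] = \frac{nm(N - n)(N - m)}{N^2 (N - 1)}$

Negative Binomial, $\mathrm{NBin}(r, p)$
  $F_X(x) = I_p(r, x + 1)$
  $f_X(x) = \binom{x + r - 1}{r - 1} p^r (1 - p)^x$
  $E[X] = r \frac{1 - p}{p}$
  $V[X] = r \frac{1 - p}{p^2}$
  $M_X(s) = \left( \frac{p}{1 - (1 - p) e^s} \right)^r$

Geometric, $\mathrm{Geo}(p)$
  $F_X(x) = 1 - (1 - p)^x$, $x \in \mathbb{N}^+$
  $f_X(x) = p (1 - p)^{x - 1}$, $x \in \mathbb{N}^+$
  $E[X] = \frac{1}{p}$
  $V[X] = \frac{1 - p}{p^2}$
  $M_X(s) = \frac{p e^s}{1 - (1 - p) e^s}$

Poisson, $\mathrm{Po}(\lambda)$
  $F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$
  $f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
  $E[X] = \lambda$
  $V[X] = \lambda$
  $M_X(s) = e^{\lambda (e^s - 1)}$

[Figure: PMF plots of the discrete Uniform, the Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), the Geometric (p = 0.2, 0.5, 0.8), and the Poisson ($\lambda$ = 1, 4, 10) distributions; x on the horizontal axis, PMF on the vertical axis.]

Footnote 1: We use the notation $\gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see 22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see 22.2).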

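1.2 Continuous Distributions

For each distribution: notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$ where available.

Uniform, $\mathrm{Unif}(a, b)$
  $F_X(x) = 0$ for $x < a$; $\frac{x - a}{b - a}$ for $a < x < b$; $1$ for $x > b$
  $f_X(x) = \frac{I(a < x < b)}{b - a}$
  $E[X] = \frac{a + b}{2}$
  $V[X] = \frac{(b - a)^2}{12}$
  $M_X(s) = \frac{e^{sb} - e^{sa}}{s(b - a)}$

Normal, $N(\mu, \sigma^2)$
  $F_X(x) = \Phi(x) = \int_{-\infty}^{x} \phi(t)\, dt$
  $f_X(x) = \phi(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$
  $E[X] = \mu$
  $V[X] = \sigma^2$
  $M_X(s) = \exp\left\{ \mu s + \frac{\sigma^2 s^2}{2} \right\}$

Log-Normal, $\ln N(\mu, \sigma^2)$
  $F_X(x) = \frac{1}{2} + \frac{1}{2} \operatorname{erf}\left[ \frac{\ln x - \mu}{\sqrt{2\sigma^2}} \right]$
  $f_X(x) = \frac{1}{x \sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right\}$
  $E[X] = e^{\mu + \sigma^2/2}$
  $V[X] = (e^{\sigma^2} - 1) e^{2\mu + \sigma^2}$

Multivariate Normal, $\mathrm{MVN}(\mu, \Sigma)$
  $f_X(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$
  $E[X] = \mu$
  $V[X] = \Sigma$
  $M_X(s) = \exp\left\{ \mu^T s + \frac{1}{2} s^T \Sigma s \right\}$

Student's t, $\mathrm{Student}(\nu)$
  $F_X(x) = 1 - \frac{1}{2} I_{\nu/(x^2 + \nu)}\left( \frac{\nu}{2}, \frac{1}{2} \right)$ for $x \ge 0$, and $F_X(-x) = 1 - F_X(x)$
  $f_X(x) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu + 1)/2}$
  $E[X] = 0$ for $\nu > 1$
  $V[X] = \frac{\nu}{\nu - 2}$ for $\nu > 2$; $\infty$ for $1 < \nu \le 2$

Chi-square, $\chi^2_k$
  $F_X(x) = \frac{1}{\Gamma(k/2)} \gamma\left( \frac{k}{2}, \frac{x}{2} \right)$
  $f_X(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{k/2 - 1} e^{-x/2}$
  $E[X] = k$
  $V[X] = 2k$
  $M_X(s) = (1 - 2s)^{-k/2}$, $s < 1/2$

F, $F(d_1, d_2)$
  $F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\left( \frac{d_1}{2}, \frac{d_2}{2} \right)$
  $f_X(x) = \frac{\sqrt{\frac{(d_1 x)^{d_1} d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\, B\left( \frac{d_1}{2}, \frac{d_2}{2} \right)}$
  $E[X] = \frac{d_2}{d_2 - 2}$ for $d_2 > 2$
  $V[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$ for $d_2 > 4$

Exponential, $\mathrm{Exp}(\beta)$
  $F_X(x) = 1 - e^{-x/\beta}$
  $f_X(x) = \frac{1}{\beta} e^{-x/\beta}$
  $E[X] = \beta$
  $V[X] = \beta^2$
  $M_X(s) = \frac{1}{1 - \beta s}$, $s < 1/\beta$

Gamma, $\mathrm{Gamma}(\alpha, \beta)$
  $F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{1}{\Gamma(\alpha) \beta^\alpha} x^{\alpha - 1} e^{-x/\beta}$
  $E[X] = \alpha\beta$
  $V[X] = \alpha\beta^2$
  $M_X(s) = \left( \frac{1}{1 - \beta s} \right)^\alpha$, $s < 1/\beta$

Inverse Gamma, $\mathrm{InvGamma}(\alpha, \beta)$
  $F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)}$
  $f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} e^{-\beta/x}$
  $E[X] = \frac{\beta}{\alpha - 1}$ for $\alpha > 1$
  $V[X] = \frac{\beta^2}{(\alpha - 1)^2 (\alpha - 2)}$ for $\alpha > 2$
  $M_X(s) = \frac{2 (-\beta s)^{\alpha/2}}{\Gamma(\alpha)} K_\alpha\left( \sqrt{-4\beta s} \right)$

Dirichlet, $\mathrm{Dir}(\alpha)$
  $f_X(x) = \frac{\Gamma\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}$
  $E[X_i] = \frac{\alpha_i}{\sum_{i=1}^k \alpha_i}$
  $V[X_i] = \frac{E[X_i] (1 - E[X_i])}{\sum_{i=1}^k \alpha_i + 1}$

Beta, $\mathrm{Beta}(\alpha, \beta)$
  $F_X(x) = I_x(\alpha, \beta)$
  $f_X(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
  $E[X] = \frac{\alpha}{\alpha + \beta}$
  $V[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$
  $M_X(s) = 1 + \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^k}{k!}$

Weibull, $\mathrm{Weibull}(\lambda, k)$
  $F_X(x) = 1 - e^{-(x/\lambda)^k}$
  $f_X(x) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k - 1} e^{-(x/\lambda)^k}$
  $E[X] = \lambda \Gamma\left( 1 + \frac{1}{k} \right)$
  $V[X] = \lambda^2 \Gamma\left( 1 + \frac{2}{k} \right) - \mu^2$
  $M_X(s) = \sum_{n=0}^{\infty} \frac{s^n \lambda^n}{n!} \Gamma\left( 1 + \frac{n}{k} \right)$

Pareto, $\mathrm{Pareto}(x_m, \alpha)$
  $F_X(x) = 1 - \left( \frac{x_m}{x} \right)^\alpha$, $x \ge x_m$
  $f_X(x) = \alpha \frac{x_m^\alpha}{x^{\alpha + 1}}$, $x \ge x_m$
  $E[X] = \frac{\alpha x_m}{\alpha - 1}$ for $\alpha > 1$
  $V[X] = \frac{x_m^2 \alpha}{(\alpha - 1)^2 (\alpha - 2)}$ for $\alpha > 2$
  $M_X(s) = \alpha (-x_m s)^\alpha \Gamma(-\alpha, -x_m s)$, $s < 0$

[Figure: PDF plots of the continuous Uniform, Normal, Log-normal, Student's t, $\chi^2$ (k = 1, ..., 5), F, Exponential, Gamma, Inverse Gamma, Beta, Weibull, and Pareto distributions for several parameter settings; x on the horizontal axis, PDF on the vertical axis.]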

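2 Probability Theory

Definitions

Sample space $\Omega$
Outcome (point or element) $\omega \in \Omega$
Event $A \subseteq \Omega$
$\sigma$-algebra $\mathcal{A}$
  1. $\emptyset \in \mathcal{A}$
  2. $A_1, A_2, \dots \in \mathcal{A} \implies \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$
  3. $A \in \mathcal{A} \implies \neg A \in \mathcal{A}$
Probability Distribution $P$
  1. $P[A] \ge 0 \quad \forall A$
  2. $P[\Omega] = 1$
  3. $P\left[ \bigsqcup_{i=1}^{\infty} A_i \right] = \sum_{i=1}^{\infty} P[A_i]$
Probability space $(\Omega, \mathcal{A}, P)$

Properties

$P[\emptyset] = 0$
$B = \Omega \cap B = (A \cup \neg A) \cap B = (A \cap B) \cup (\neg A \cap B)$
$P[\neg A] = 1 - P[A]$
$P[B] = P[A \cap B] + P[\neg A \cap B]$
$P[\Omega] = 1$, $P[\emptyset] = 0$
DeMorgan: $\neg\left( \bigcup_n A_n \right) = \bigcap_n \neg A_n$ and $\neg\left( \bigcap_n A_n \right) = \bigcup_n \neg A_n$
$P\left[ \bigcup_n A_n \right] = 1 - P\left[ \bigcap_n \neg A_n \right]$
$P[A \cup B] = P[A] + P[B] - P[A \cap B] \implies P[A \cup B] \le P[A] + P[B]$
$P[A \cup B] = P[A \cap \neg B] + P[\neg A \cap B] + P[A \cap B]$
$P[A \cap \neg B] = P[A] - P[A \cap B]$

Continuity of Probabilities

$A_1 \subset A_2 \subset \dots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcup_{i=1}^{\infty} A_i$
$A_1 \supset A_2 \supset \dots \implies \lim_{n \to \infty} P[A_n] = P[A]$ where $A = \bigcap_{i=1}^{\infty} A_i$

Independence

$A \perp B \iff P[A \cap B] = P[A]\, P[B]$

Conditional Probability

$P[A \mid B] = \frac{P[A \cap B]}{P[B]}, \quad P[B] > 0$

Law of Total Probability

$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Bayes' Theorem

$P[A_i \mid B] = \frac{P[B \mid A_i]\, P[A_i]}{\sum_{j=1}^{n} P[B \mid A_j]\, P[A_j]}$ where $\Omega = \bigsqcup_{i=1}^{n} A_i$

Inclusion-Exclusion Principle

$\left| \bigcup_{i=1}^{n} A_i \right| = \sum_{r=1}^{n} (-1)^{r-1} \sum_{1 \le i_1 < \cdots < i_r \le n} \left| \bigcap_{j=1}^{r} A_{i_j} \right|$

3 Random Variables

Random Variable (RV)

$X : \Omega \to \mathbb{R}$

Probability Mass Function (PMF)

$f_X(x) = P[X = x] = P[\{\omega \in \Omega : X(\omega) = x\}]$

Probability Density Function (PDF)

$P[a \le X \le b] = \int_a^b f(x)\, dx$

Cumulative Distribution Function (CDF)

$F_X : \mathbb{R} \to [0, 1]$, $F_X(x) = P[X \le x]$
  1. Nondecreasing: $x_1 < x_2 \implies F(x_1) \le F(x_2)$
  2. Normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
  3. Right-continuous: $\lim_{y \downarrow x} F(y) = F(x)$

Conditional density

$P[a \le Y \le b \mid X = x] = \int_a^b f_{Y|X}(y \mid x)\, dy, \quad a \le b$
$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

Independence

  1. $P[X \le x, Y \le y] = P[X \le x]\, P[Y \le y]$
  2. $f_{X,Y}(x, y) = f_X(x) f_Y(y)$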
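The law of total probability and Bayes' theorem from Section 2 can be checked numerically. The sketch below is only an illustration: the function names and the three-machine defect numbers are made up for this example and are not part of the cookbook.

```python
def total_probability(priors, likelihoods):
    """P[B] = sum_i P[B | A_i] P[A_i] for a partition A_1, ..., A_n."""
    return sum(l * p for l, p in zip(likelihoods, priors))


def bayes_posterior(priors, likelihoods):
    """P[A_i | B] for every i, via Bayes' theorem."""
    p_b = total_probability(priors, likelihoods)
    return [l * p / p_b for l, p in zip(likelihoods, priors)]


# Hypothetical numbers: three machines produce 50%, 30%, and 20% of all
# items, with defect rates of 1%, 2%, and 3% respectively.
priors = [0.5, 0.3, 0.2]          # P[A_i]
likelihoods = [0.01, 0.02, 0.03]  # P[B | A_i], B = "item is defective"

p_defective = total_probability(priors, likelihoods)
posterior = bayes_posterior(priors, likelihoods)
print(p_defective)
print(posterior)
```

The posterior probabilities sum to one because the denominator in Bayes' theorem is exactly the total probability of B.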

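3.1 Transformations

Transformation function

$Z = \varphi(X)$

Discrete

$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P[X \in \varphi^{-1}(z)] = \sum_{x \in \varphi^{-1}(z)} f(x)$

Continuous

$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\, dx$ with $A_z = \{x : \varphi(x) \le z\}$

Special case if $\varphi$ strictly monotone

$f_Z(z) = f_X(\varphi^{-1}(z)) \left| \frac{d}{dz} \varphi^{-1}(z) \right| = f_X(x) \left| \frac{dx}{dz} \right| = f_X(x) \frac{1}{|J|}$

The Rule of the Lazy Statistician

$E[Z] = \int \varphi(x)\, dF_X(x)$
$E[I_A(X)] = \int I_A(x)\, dF_X(x) = \int_A dF_X(x) = P[X \in A]$

Convolution

$Z := X + Y$: $f_Z(z) = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx$, which for $X, Y \ge 0$ equals $\int_0^z f_{X,Y}(x, z - x)\, dx$
$Z := |X - Y|$: $f_Z(z) = 2 \int_0^{\infty} f_{X,Y}(x, z + x)\, dx$
$Z := \frac{X}{Y}$: $f_Z(z) = \int_{-\infty}^{\infty} |y|\, f_{X,Y}(yz, y)\, dy$, which for $X \perp Y$ equals $\int_{-\infty}^{\infty} |y|\, f_X(yz) f_Y(y)\, dy$

4 Expectation

Definition and properties

$E[X] = \mu_X = \int x\, dF_X(x) = \sum_x x f_X(x)$ if X is discrete, $\int x f_X(x)\, dx$ if X is continuous
$P[X = c] = 1 \implies E[X] = c$
$E[cX] = c\, E[X]$
$E[X + Y] = E[X] + E[Y]$
$E[XY] = \iint x y\, f_{X,Y}(x, y)\, dx\, dy$
$E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen inequality)
$P[X \ge Y] = 1 \implies E[X] \ge E[Y]$ and $P[X = Y] = 1 \implies E[X] = E[Y]$
$E[X] = \sum_{x=1}^{\infty} P[X \ge x]$ for X taking values in $\{0, 1, 2, \dots\}$

Sample mean

$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$

Conditional expectation

$E[Y \mid X = x] = \int y f(y \mid x)\, dy$
$E[X] = E[E[X \mid Y]]$
$E[\varphi(X, Y) \mid X = x] = \int_{-\infty}^{\infty} \varphi(x, y) f_{Y|X}(y \mid x)\, dy$
$E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z) f_{(Y,Z)|X}(y, z \mid x)\, dy\, dz$
$E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
$E[\varphi(X) Y \mid X] = \varphi(X) E[Y \mid X]$
$E[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Definition and properties

$V[X] = \sigma_X^2 = E[(X - E[X])^2] = E[X^2] - E[X]^2$
$V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j]$
$V\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} V[X_i]$ if $X_i \perp X_j$ for all $i \ne j$

Standard deviation

$\mathrm{sd}[X] = \sqrt{V[X]} = \sigma_X$

Covariance

$\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]$
$\mathrm{Cov}[X, a] = 0$
$\mathrm{Cov}[X, X] = V[X]$
$\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
$\mathrm{Cov}[aX, bY] = ab\, \mathrm{Cov}[X, Y]$
$\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
$\mathrm{Cov}\left[ \sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Cov}[X_i, Y_j]$

Correlation

$\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{V[X] V[Y]}}$

Independence

$X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff E[XY] = E[X] E[Y]$

Sample variance

$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$

Conditional variance

$V[Y \mid X] = E[(Y - E[Y \mid X])^2 \mid X] = E[Y^2 \mid X] - E[Y \mid X]^2$
$V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$

6 Inequalities

Cauchy-Schwarz: $E[XY]^2 \le E[X^2] E[Y^2]$
Markov: $P[\varphi(X) \ge t] \le \frac{E[\varphi(X)]}{t}$ for $\varphi \ge 0$ and $t > 0$
Chebyshev: $P[|X - E[X]| \ge t] \le \frac{V[X]}{t^2}$
Chernoff: $P[X \ge (1 + \delta)\mu] \le \left( \frac{e^\delta}{(1 + \delta)^{1 + \delta}} \right)^{\mu}$ for $\delta > 0$, where X is a sum of independent Bernoulli variables with $E[X] = \mu$
Jensen: $E[\varphi(X)] \ge \varphi(E[X])$ for $\varphi$ convex

7 Distribution Relationships

Binomial

$X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p)$ (independent $X_i$)
$X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p),\ X \perp Y \implies X + Y \sim \mathrm{Bin}(n + m, p)$
$\lim_{n \to \infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$ (n large, p small)
$\lim_{n \to \infty} \mathrm{Bin}(n, p) = N(np, np(1 - p))$ (n large, p far from 0 and 1)

Negative Binomial

$X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
$X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^{r} \mathrm{Geo}(p)$
$X_i \sim \mathrm{NBin}(r_i, p) \implies \sum X_i \sim \mathrm{NBin}\left( \sum r_i, p \right)$
$X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson

$X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Po}\left( \sum_{i=1}^{n} \lambda_i \right)$
$X_i \sim \mathrm{Po}(\lambda_i) \wedge X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^{n} X_j \sim \mathrm{Bin}\left( \sum_{j=1}^{n} X_j, \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \right)$

Exponential

$X_i \sim \mathrm{Exp}(\beta) \wedge X_i \perp X_j \implies \sum_{i=1}^{n} X_i \sim \mathrm{Gamma}(n, \beta)$
Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal

$X \sim N(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim N(0, 1)$
$X \sim N(\mu, \sigma^2) \wedge Z = aX + b \implies Z \sim N(a\mu + b, a^2 \sigma^2)$
$X \sim N(\mu_1, \sigma_1^2) \wedge Y \sim N(\mu_2, \sigma_2^2) \wedge X \perp Y \implies X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
$X_i \sim N(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim N\left( \sum_i \mu_i, \sum_i \sigma_i^2 \right)$
$P[a < X \le b] = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right)$
$\Phi(-x) = 1 - \Phi(x)$, $\phi'(x) = -x \phi(x)$, $\phi''(x) = (x^2 - 1) \phi(x)$
Upper quantile of $N(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

Gamma

$X \sim \mathrm{Gamma}(\alpha, \beta) \iff X / \beta \sim \mathrm{Gamma}(\alpha, 1)$
$\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^{\alpha} \mathrm{Exp}(\beta)$ (integer $\alpha$)
$X_i \sim \mathrm{Gamma}(\alpha_i, \beta) \wedge X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left( \sum_i \alpha_i, \beta \right)$
$\frac{\Gamma(\alpha)}{\lambda^\alpha} = \int_0^{\infty} x^{\alpha - 1} e^{-\lambda x}\, dx$

Beta

$\frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$
$E[X^k] = \frac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \frac{\alpha + k - 1}{\alpha + \beta + k - 1} E[X^{k-1}]$
$\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$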
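The relationship $\lim_{n \to \infty} \mathrm{Bin}(n, p) = \mathrm{Po}(np)$ from Section 7 can be illustrated by comparing the two PMFs while holding $np$ fixed. This is a minimal sketch with hypothetical helper names; it measures the largest pointwise gap between the PMFs and is not a proof of convergence.

```python
import math


def binom_pmf(x, n, p):
    """PMF of Bin(n, p) at x."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """PMF of Po(lam) at x, computed in log space to avoid overflow."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def max_pmf_gap(n, p):
    """Largest absolute PMF difference between Bin(n, p) and Po(np) on 0..n."""
    lam = n * p
    return max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(n + 1))


# Same mean np = 2, but larger n and smaller p gives a closer match.
print(max_pmf_gap(10, 0.2))
print(max_pmf_gap(100, 0.02))
```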

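8 Probability and Moment Generating Functions

$G_X(t) = E[t^X], \quad |t| < 1$
$M_X(t) = G_X(e^t) = E[e^{Xt}] = E\left[ \sum_{i=0}^{\infty} \frac{(Xt)^i}{i!} \right] = \sum_{i=0}^{\infty} \frac{E[X^i]}{i!} t^i$
$P[X = 0] = G_X(0)$
$P[X = 1] = G_X'(0)$
$P[X = i] = \frac{G_X^{(i)}(0)}{i!}$
$E[X] = G_X'(1^-)$
$E[X^k] = M_X^{(k)}(0)$
$E\left[ \frac{X!}{(X - k)!} \right] = G_X^{(k)}(1^-)$
$V[X] = G_X''(1^-) + G_X'(1^-) - (G_X'(1^-))^2$
$G_X(t) = G_Y(t) \implies X \overset{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let $X, Z \sim N(0, 1)$ with $X \perp Z$, and let $Y = \rho X + \sqrt{1 - \rho^2}\, Z$.

Joint density

$f(x, y) = \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left\{ -\frac{x^2 + y^2 - 2\rho x y}{2(1 - \rho^2)} \right\}$

Conditionals

$(Y \mid X = x) \sim N(\rho x, 1 - \rho^2)$ and $(X \mid Y = y) \sim N(\rho y, 1 - \rho^2)$

Independence

$X \perp Y \iff \rho = 0$

9.2 Bivariate Normal

Let $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$.

$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{z}{2(1 - \rho^2)} \right\}$
$z = \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho \left( \frac{x - \mu_x}{\sigma_x} \right) \left( \frac{y - \mu_y}{\sigma_y} \right)$

Conditional mean and variance

$E[X \mid Y] = E[X] + \rho \frac{\sigma_X}{\sigma_Y} (Y - E[Y])$
$V[X \mid Y] = \sigma_X^2 (1 - \rho^2)$

9.3 Multivariate Normal

Covariance matrix $\Sigma$ (Precision matrix $\Sigma^{-1}$)

$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \mathrm{Cov}[X_1, X_k] \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$

If $X \sim N(\mu, \Sigma)$,

$f_X(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$

Properties

$Z \sim N(0, I) \wedge X = \mu + \Sigma^{1/2} Z \implies X \sim N(\mu, \Sigma)$
$X \sim N(\mu, \Sigma) \implies \Sigma^{-1/2} (X - \mu) \sim N(0, I)$
$X \sim N(\mu, \Sigma) \implies AX \sim N(A\mu, A \Sigma A^T)$
$X \sim N(\mu, \Sigma)$ and a a vector of length k $\implies a^T X \sim N(a^T \mu, a^T \Sigma a)$

10 Convergence

Let $\{X_1, X_2, \dots\}$ be a sequence of rv's and let X be another rv. Let $F_n$ denote the cdf of $X_n$ and let F denote the cdf of X.

Types of convergence

1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$
   $\lim_{n \to \infty} F_n(t) = F(t)$ for all t where F is continuous
2. In probability: $X_n \xrightarrow{P} X$
   $(\forall \varepsilon > 0) \lim_{n \to \infty} P[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \xrightarrow{as} X$
   $P\left[ \lim_{n \to \infty} X_n = X \right] = P\left[ \omega \in \Omega : \lim_{n \to \infty} X_n(\omega) = X(\omega) \right] = 1$
4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$
   $\lim_{n \to \infty} E[(X_n - X)^2] = 0$

Relationships

$X_n \xrightarrow{qm} X \implies X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X$
$X_n \xrightarrow{as} X \implies X_n \xrightarrow{P} X$
$X_n \xrightarrow{D} X \wedge (\exists c \in \mathbb{R})\ P[X = c] = 1 \implies X_n \xrightarrow{P} X$
$X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n + Y_n \xrightarrow{P} X + Y$
$X_n \xrightarrow{qm} X \wedge Y_n \xrightarrow{qm} Y \implies X_n + Y_n \xrightarrow{qm} X + Y$
$X_n \xrightarrow{P} X \wedge Y_n \xrightarrow{P} Y \implies X_n Y_n \xrightarrow{P} XY$
$X_n \xrightarrow{P} X \implies \varphi(X_n) \xrightarrow{P} \varphi(X)$ ($\varphi$ continuous)
$X_n \xrightarrow{D} X \implies \varphi(X_n) \xrightarrow{D} \varphi(X)$ ($\varphi$ continuous)
$X_n \xrightarrow{qm} b \iff \lim_{n \to \infty} E[X_n] = b \wedge \lim_{n \to \infty} V[X_n] = 0$
$X_1, \dots, X_n$ iid $\wedge\ E[X] = \mu \wedge V[X] < \infty \implies \bar{X}_n \xrightarrow{qm} \mu$

Slutzky's Theorem

$X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n + Y_n \xrightarrow{D} X + c$
$X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \implies X_n Y_n \xrightarrow{D} cX$
In general: $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{D} Y$ does not imply $X_n + Y_n \xrightarrow{D} X + Y$

10.1 Law of Large Numbers (LLN)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] < \infty$.

Weak (WLLN): $\bar{X}_n \xrightarrow{P} \mu$ as $n \to \infty$
Strong (SLLN): $\bar{X}_n \xrightarrow{as} \mu$ as $n \to \infty$

10.2 Central Limit Theorem (CLT)

Let $\{X_1, \dots, X_n\}$ be a sequence of iid rv's, $E[X_1] = \mu$, and $V[X_1] = \sigma^2$.

$Z_n := \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \xrightarrow{D} Z$ where $Z \sim N(0, 1)$
$\lim_{n \to \infty} P[Z_n \le z] = \Phi(z), \quad z \in \mathbb{R}$

CLT notations

$Z_n \approx N(0, 1)$
$\bar{X}_n \approx N\left( \mu, \frac{\sigma^2}{n} \right)$
$\bar{X}_n - \mu \approx N\left( 0, \frac{\sigma^2}{n} \right)$
$\sqrt{n} (\bar{X}_n - \mu) \approx N(0, \sigma^2)$
$\frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \approx N(0, 1)$

Continuity correction

$P[\bar{X}_n \le x] \approx \Phi\left( \frac{x + \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$
$P[\bar{X}_n \ge x] \approx 1 - \Phi\left( \frac{x - \frac{1}{2} - \mu}{\sigma / \sqrt{n}} \right)$

Delta method

$Y_n \approx N\left( \mu, \frac{\sigma^2}{n} \right) \implies \varphi(Y_n) \approx N\left( \varphi(\mu), (\varphi'(\mu))^2 \frac{\sigma^2}{n} \right)$

11 Statistical Inference

Let $X_1, \dots, X_n \overset{iid}{\sim} F$ if not otherwise noted.

11.1 Point Estimation

Point estimator $\hat{\theta}_n$ of $\theta$ is a rv: $\hat{\theta}_n = g(X_1, \dots, X_n)$
$\mathrm{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$
Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
Sampling distribution: $F(\hat{\theta}_n)$
Standard error: $\mathrm{se}(\hat{\theta}_n) = \sqrt{V[\hat{\theta}_n]}$
Mean squared error: $\mathrm{mse} = E[(\hat{\theta}_n - \theta)^2] = \mathrm{bias}(\hat{\theta}_n)^2 + V[\hat{\theta}_n]$
$\lim_{n \to \infty} \mathrm{bias}(\hat{\theta}_n) = 0 \wedge \lim_{n \to \infty} \mathrm{se}(\hat{\theta}_n) = 0 \implies \hat{\theta}_n$ is consistent
Asymptotic normality: $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} N(0, 1)$
Slutzky's Theorem often lets us replace $\mathrm{se}(\hat{\theta}_n)$ by some (weakly) consistent estimator $\hat{\sigma}_n$.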
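The CLT statement $\sqrt{n}(\bar{X}_n - \mu)/\sigma \approx N(0, 1)$ from Section 10.2 can be probed by simulation. The sketch below is an illustration only: the function names, the choice of Exp(1) data (so $\mu = \sigma = 1$), the sample size, and the seed are all assumptions made for this example.

```python
import math
import random


def standardized_mean(sample, mu, sigma):
    """Z_n = sqrt(n) * (sample mean - mu) / sigma."""
    n = len(sample)
    x_bar = sum(sample) / n
    return math.sqrt(n) * (x_bar - mu) / sigma


def clt_coverage(n=200, reps=2000, seed=0):
    """Fraction of simulated Z_n values with |Z_n| <= 1.96.

    Data are iid Exp(1), which has mean 1 and standard deviation 1.
    Under the CLT this fraction should be close to 0.95 for large n.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        if abs(standardized_mean(sample, 1.0, 1.0)) <= 1.96:
            hits += 1
    return hits / reps


print(clt_coverage())
```

Even though the exponential distribution is strongly skewed, the standardized mean is already close to standard normal at this sample size.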

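11.2 Normal-Based Confidence Interval

Suppose $\hat{\theta}_n \approx N(\theta, \widehat{\mathrm{se}}^2)$. Let $z_{\alpha/2} = \Phi^{-1}(1 - (\alpha/2))$, i.e., $P[Z > z_{\alpha/2}] = \alpha/2$ and $P[-z_{\alpha/2} < Z < z_{\alpha/2}] = 1 - \alpha$ where $Z \sim N(0, 1)$. Then

$C_n = \hat{\theta}_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$

11.3 Empirical distribution

Empirical Distribution Function (ECDF)

$\hat{F}_n(x) = \frac{\sum_{i=1}^{n} I(X_i \le x)}{n}$
$I(X_i \le x) = 1$ if $X_i \le x$, and $0$ if $X_i > x$

Properties (for any fixed x)

$E[\hat{F}_n(x)] = F(x)$
$V[\hat{F}_n(x)] = \frac{F(x)(1 - F(x))}{n}$
$\mathrm{mse} = \frac{F(x)(1 - F(x))}{n} \to 0$
$\hat{F}_n(x) \xrightarrow{P} F(x)$

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality ($X_1, \dots, X_n \sim F$)

$P\left[ \sup_x \left| F(x) - \hat{F}_n(x) \right| > \varepsilon \right] \le 2 e^{-2n\varepsilon^2}$

Nonparametric $1 - \alpha$ confidence band for F

$L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}$
$U(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}$
$\varepsilon_n = \sqrt{\frac{1}{2n} \log\left( \frac{2}{\alpha} \right)}$
$P[L(x) \le F(x) \le U(x)\ \forall x] \ge 1 - \alpha$

11.4 Statistical Functionals

Statistical functional: $T(F)$
Plug-in estimator of $\theta = T(F)$: $\hat{\theta}_n = T(\hat{F}_n)$
Linear functional: $T(F) = \int \varphi(x)\, dF_X(x)$
Plug-in estimator for linear functional: $T(\hat{F}_n) = \int \varphi(x)\, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i)$
Often: $T(\hat{F}_n) \approx N(T(F), \widehat{\mathrm{se}}^2) \implies T(\hat{F}_n) \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$
pth quantile: $F^{-1}(p) = \inf\{x : F(x) \ge p\}$
$\hat{\mu} = \bar{X}_n$
$\hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$
$\hat{\kappa} = \frac{\frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^3}{\hat{\sigma}^3}$
$\hat{\rho} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y}_n)^2}}$

12 Parametric Inference

Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model with parameter space $\Theta \subset \mathbb{R}^k$ and parameter $\theta = (\theta_1, \dots, \theta_k)$.

12.1 Method of Moments

jth moment

$\alpha_j(\theta) = E[X^j] = \int x^j\, dF_X(x)$

jth sample moment

$\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$

Method of moments estimator (MoM): the value $\hat{\theta}_n$ solving

$\alpha_1(\theta) = \hat{\alpha}_1$
$\alpha_2(\theta) = \hat{\alpha}_2$
$\vdots$
$\alpha_k(\theta) = \hat{\alpha}_k$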
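The ECDF and the DKW confidence band from Section 11.3 translate almost directly into code. The following is a sketch under the assumption that the data fit in a Python list; the function names `ecdf` and `dkw_band` are made up for this example.

```python
import bisect
import math


def ecdf(sample):
    """Return the empirical distribution function F_hat_n as a callable."""
    xs = sorted(sample)
    n = len(xs)

    def f_hat(x):
        # Number of observations X_i <= x, divided by n.
        return bisect.bisect_right(xs, x) / n

    return f_hat


def dkw_band(sample, alpha=0.05):
    """Return (L, U, eps) for the nonparametric 1 - alpha confidence band."""
    n = len(sample)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    f_hat = ecdf(sample)

    def lower(x):
        return max(f_hat(x) - eps, 0.0)

    def upper(x):
        return min(f_hat(x) + eps, 1.0)

    return lower, upper, eps


f_hat = ecdf([3, 1, 2, 2])
print(f_hat(2))  # three of the four observations are <= 2
```

The band has the same half-width `eps` at every x, which is what makes it a simultaneous band rather than a pointwise interval.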

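Properties of the MoM estimator

$\hat{\theta}_n$ exists with probability tending to 1
Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
Asymptotic normality: $\sqrt{n} (\hat{\theta} - \theta) \xrightarrow{D} N(0, \Sigma)$
where $\Sigma = g E[Y Y^T] g^T$, $Y = (X, X^2, \dots, X^k)^T$, $g = (g_1, \dots, g_k)$ and $g_j = \frac{\partial}{\partial \theta} \alpha_j^{-1}(\theta)$

12.2 Maximum Likelihood

Likelihood: $L_n : \Theta \to [0, \infty)$

$L_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$

Log-likelihood

$\ell_n(\theta) = \log L_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$

Maximum likelihood estimator (mle)

$L_n(\hat{\theta}_n) = \sup_\theta L_n(\theta)$

Score function

$s(X; \theta) = \frac{\partial}{\partial \theta} \log f(X; \theta)$

Fisher information

$I(\theta) = V_\theta[s(X; \theta)]$
$I_n(\theta) = n I(\theta)$

Fisher information (exponential family)

$I(\theta) = E_\theta\left[ -\frac{\partial}{\partial \theta} s(X; \theta) \right]$

Observed Fisher information

$I_n^{obs}(\theta) = -\frac{\partial^2}{\partial \theta^2} \sum_{i=1}^{n} \log f(X_i; \theta)$

Properties of the mle

Consistency: $\hat{\theta}_n \xrightarrow{P} \theta$
Equivariance: $\hat{\theta}_n$ is the mle $\implies \varphi(\hat{\theta}_n)$ is the mle of $\varphi(\theta)$
Asymptotic normality:
  1. $\mathrm{se} \approx \sqrt{1 / I_n(\theta)}$ and $\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \xrightarrow{D} N(0, 1)$
  2. $\widehat{\mathrm{se}} \approx \sqrt{1 / I_n(\hat{\theta}_n)}$ and $\frac{\hat{\theta}_n - \theta}{\widehat{\mathrm{se}}} \xrightarrow{D} N(0, 1)$
Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If $\tilde{\theta}_n$ is any other estimator, the asymptotic relative efficiency is
  $\mathrm{are}(\tilde{\theta}_n, \hat{\theta}_n) = \frac{V[\hat{\theta}_n]}{V[\tilde{\theta}_n]} \le 1$
Approximately the Bayes estimator

12.2.1 Delta Method

If $\tau = \varphi(\theta)$ where $\varphi$ is differentiable and $\varphi'(\theta) \ne 0$:

$\frac{\hat{\tau}_n - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} N(0, 1)$

where $\hat{\tau} = \varphi(\hat{\theta})$ is the mle of $\tau$ and

$\widehat{\mathrm{se}}(\hat{\tau}) = \left| \varphi'(\hat{\theta}) \right| \widehat{\mathrm{se}}(\hat{\theta}_n)$

12.3 Multiparameter Models

Let $\theta = (\theta_1, \dots, \theta_k)$ and $\hat{\theta} = (\hat{\theta}_1, \dots, \hat{\theta}_k)$ be the mle.

$H_{jj} = \frac{\partial^2 \ell_n}{\partial \theta_j^2}$, $H_{jk} = \frac{\partial^2 \ell_n}{\partial \theta_j \partial \theta_k}$

Fisher information matrix

$I_n(\theta) = -\begin{pmatrix} E_\theta[H_{11}] & \cdots & E_\theta[H_{1k}] \\ \vdots & \ddots & \vdots \\ E_\theta[H_{k1}] & \cdots & E_\theta[H_{kk}] \end{pmatrix}$

Under appropriate regularity conditions

$(\hat{\theta} - \theta) \approx N(0, J_n)$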
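As a worked instance of Sections 12.2 and 12.2.1, consider the $\mathrm{Exp}(\beta)$ model with scale $\beta$ used in Section 1.2. There $\ell_n(\beta) = -n \log \beta - \sum_i X_i / \beta$, so $\hat{\beta} = \bar{X}_n$, $I_n(\beta) = n / \beta^2$, and $\widehat{\mathrm{se}} = \hat{\beta} / \sqrt{n}$. The sketch below (function names are hypothetical) computes these and then applies the delta method to the rate $\tau = \varphi(\beta) = 1/\beta$.

```python
import math


def exp_scale_mle(sample):
    """MLE of the scale beta in Exp(beta), f(x) = exp(-x / beta) / beta.

    Returns (beta_hat, se_hat) with se_hat = sqrt(1 / I_n(beta_hat)),
    where I_n(beta) = n / beta**2.
    """
    n = len(sample)
    beta_hat = sum(sample) / n
    se_hat = beta_hat / math.sqrt(n)
    return beta_hat, se_hat


def rate_via_delta_method(beta_hat, se_hat):
    """Estimate tau = phi(beta) = 1 / beta and its standard error.

    Delta method: se(tau_hat) = |phi'(beta_hat)| * se(beta_hat),
    with phi'(beta) = -1 / beta**2.
    """
    tau_hat = 1.0 / beta_hat
    se_tau = se_hat / beta_hat**2
    return tau_hat, se_tau


beta_hat, se_hat = exp_scale_mle([1.0, 2.0, 3.0, 4.0])
print(beta_hat, se_hat)
print(rate_via_delta_method(beta_hat, se_hat))
```

By equivariance, `tau_hat` is itself the mle of the rate, so no second optimization is needed.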

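with $J_n(\theta) = I_n^{-1}$. Further, if $\hat{\theta}_j$ is the jth component of $\hat{\theta}$, then

$\frac{\hat{\theta}_j - \theta_j}{\widehat{\mathrm{se}}_j} \xrightarrow{D} N(0, 1)$

where $\widehat{\mathrm{se}}_j^2 = J_n(j, j)$ and $\mathrm{Cov}[\hat{\theta}_j, \hat{\theta}_k] = J_n(j, k)$

12.3.1 Multiparameter delta method

Let $\tau = \varphi(\theta_1, \dots, \theta_k)$ and let the gradient of $\varphi$ be

$\nabla\varphi = \left( \frac{\partial \varphi}{\partial \theta_1}, \dots, \frac{\partial \varphi}{\partial \theta_k} \right)^T$

Suppose $\nabla\varphi \big|_{\theta = \hat{\theta}} \ne 0$ and $\hat{\tau} = \varphi(\hat{\theta})$. Then,

$\frac{\hat{\tau} - \tau}{\widehat{\mathrm{se}}(\hat{\tau})} \xrightarrow{D} N(0, 1)$

where

$\widehat{\mathrm{se}}(\hat{\tau}) = \sqrt{\left( \hat{\nabla}\varphi \right)^T \hat{J}_n \left( \hat{\nabla}\varphi \right)}$

and $\hat{J}_n = J_n(\hat{\theta})$ and $\hat{\nabla}\varphi = \nabla\varphi \big|_{\theta = \hat{\theta}}$.

12.4 Parametric Bootstrap

Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or method of moments estimator.

13 Hypothesis Testing

$H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$

Definitions

Null hypothesis $H_0$
Alternative hypothesis $H_1$
Simple hypothesis $\theta = \theta_0$
Composite hypothesis $\theta > \theta_0$ or $\theta < \theta_0$
Two-sided test: $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
One-sided test: $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$
Critical value c
Test statistic T
Rejection region $R = \{x : T(x) > c\}$
Power function $\beta(\theta) = P_\theta[X \in R]$
Power of a test: $1 - P[\text{Type II error}] = 1 - \beta = \inf_{\theta \in \Theta_1} \beta(\theta)$
Test size: $\alpha = P[\text{Type I error}] = \sup_{\theta \in \Theta_0} \beta(\theta)$

              Retain H0               Reject H0
  H0 true     correct                 Type I Error ($\alpha$)
  H1 true     Type II Error ($\beta$)   correct (power)

p-value

p-value $= \sup_{\theta \in \Theta_0} P_\theta[T(X) \ge T(x)] = \inf\{\alpha : T(x) \in R_\alpha\}$
p-value $= \sup_{\theta \in \Theta_0} P_\theta[T(X^\star) \ge T(X)] = \inf\{\alpha : T(X) \in R_\alpha\}$, where $P_\theta[T(X^\star) \ge T(X)] = 1 - F_\theta(T(X))$ since $T(X^\star) \sim F_\theta$

  p-value        evidence
  < 0.01         very strong evidence against H0
  0.01 - 0.05    strong evidence against H0
  0.05 - 0.1     weak evidence against H0
  > 0.1          little or no evidence against H0

Wald test

Two-sided test
Reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}$
$P[|W| > z_{\alpha/2}] \to \alpha$
p-value $= P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$

Likelihood ratio test (LRT)

$T(X) = \frac{\sup_{\theta \in \Theta} L_n(\theta)}{\sup_{\theta \in \Theta_0} L_n(\theta)} = \frac{L_n(\hat{\theta}_n)}{L_n(\hat{\theta}_{n,0})}$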
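The Wald test above can be sketched for a Bernoulli proportion, where $\hat{\theta} = \bar{X}_n$ and $\widehat{\mathrm{se}} = \sqrt{\hat{\theta}(1 - \hat{\theta})/n}$. The function name and the 60-out-of-100 numbers are assumptions made for this example; `statistics.NormalDist` from the Python standard library supplies $\Phi$.

```python
import math
from statistics import NormalDist


def wald_test_proportion(successes, n, p0):
    """Two-sided Wald test of H0: p = p0 for Bernoulli data.

    Returns (w, p_value) with w = (p_hat - p0) / se_hat and
    p_value = 2 * Phi(-|w|).
    """
    p_hat = successes / n
    se_hat = math.sqrt(p_hat * (1.0 - p_hat) / n)
    w = (p_hat - p0) / se_hat
    p_value = 2.0 * NormalDist().cdf(-abs(w))
    return w, p_value


# Hypothetical data: 60 successes in 100 trials, testing p0 = 0.5.
w, p_value = wald_test_proportion(60, 100, 0.5)
print(w, p_value)
```

With these numbers the p-value falls between 0.01 and 0.05, which the evidence table above labels as strong evidence against $H_0$.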

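$\lambda(X) = 2 \log T(X) \xrightarrow{D} \chi^2_{r-q}$ where $\sum_{i=1}^{k} Z_i^2 \sim \chi^2_k$ and $Z_1, \dots, Z_k \overset{iid}{\sim} N(0, 1)$
p-value $= P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P[\chi^2_{r-q} > \lambda(x)]$

Multinomial LRT

mle: $\hat{p}_n = \left( \frac{X_1}{n}, \dots, \frac{X_k}{n} \right)$
$T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^{k} \left( \frac{\hat{p}_j}{p_{0j}} \right)^{X_j}$
$\lambda(X) = 2 \sum_{j=1}^{k} X_j \log\left( \frac{\hat{p}_j}{p_{0j}} \right) \xrightarrow{D} \chi^2_{k-1}$
The approximate size $\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1, \alpha}$

Pearson Chi-square Test

$T = \sum_{j=1}^{k} \frac{(X_j - E[X_j])^2}{E[X_j]}$ where $E[X_j] = n p_{0j}$ under $H_0$
$T \xrightarrow{D} \chi^2_{k-1}$
p-value $= P[\chi^2_{k-1} > T(x)]$
Converges in distribution to $\chi^2_{k-1}$ faster than the LRT, hence preferable for small n

Independence testing

I rows, J columns, X a multinomial sample of size n over the $I \cdot J$ cells
mles unconstrained: $\hat{p}_{ij} = \frac{X_{ij}}{n}$
mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\, \hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n} \frac{X_{\cdot j}}{n}$
LRT: $\lambda = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\left( \frac{n X_{ij}}{X_{i\cdot} X_{\cdot j}} \right)$
PearsonChiSq: $T = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
LRT and Pearson $\xrightarrow{D} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$

14 Bayesian Inference

Bayes' Theorem

$f(\theta \mid x^n) = \frac{f(x^n \mid \theta) f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta) f(\theta)}{\int f(x^n \mid \theta) f(\theta)\, d\theta} \propto L_n(\theta) f(\theta)$

Definitions

$X^n = (X_1, \dots, X_n)$
$x^n = (x_1, \dots, x_n)$
Prior density $f(\theta)$
Likelihood $f(x^n \mid \theta)$: joint density of the data
In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta)$
Posterior density $f(\theta \mid x^n)$
Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta) f(\theta)\, d\theta$
Kernel: part of a density that depends on $\theta$
Posterior mean $\bar{\theta}_n = \int \theta f(\theta \mid x^n)\, d\theta = \frac{\int \theta L_n(\theta) f(\theta)\, d\theta}{\int L_n(\theta) f(\theta)\, d\theta}$

14.1 Credible Intervals

Posterior interval

$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$

Equal-tail credible interval

$\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_b^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$

Highest posterior density (HPD) region $R_n$

1. $P[\theta \in R_n \mid x^n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some k
$f(\theta \mid x^n)$ is unimodal $\implies R_n$ is an interval

14.2 Function of parameters

Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.

Posterior CDF for $\tau$

$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\, d\theta$

Posterior density

$h(\tau \mid x^n) = H'(\tau \mid x^n)$

Bayesian delta method

$\tau \mid X^n \approx N\left( \varphi(\hat{\theta}), \left( \widehat{\mathrm{se}}\, |\varphi'(\hat{\theta})| \right)^2 \right)$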

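14.3 Priors

Choice

Subjective bayesianism.
Objective bayesianism.
Robust bayesianism.

Types

Flat: $f(\theta) \propto$ constant
Proper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = 1$
Improper: $\int_{-\infty}^{\infty} f(\theta)\, d\theta = \infty$
Jeffreys' prior (transformation-invariant): $f(\theta) \propto \sqrt{I(\theta)}$ in the scalar case, $f(\theta) \propto \sqrt{\det(I(\theta))}$ in the multiparameter case
Conjugate: $f(\theta)$ and $f(\theta \mid x^n)$ belong to the same parametric family

14.3.1 Conjugate Priors

Gamma priors in the tables below use the rate parameterization, $f(\lambda) \propto \lambda^{\alpha - 1} e^{-\beta\lambda}$, unlike the scale parameterization of Section 1.2.

Discrete likelihood

  Likelihood                Conjugate prior                 Posterior hyperparameters
  $\mathrm{Bern}(p)$        $\mathrm{Beta}(\alpha, \beta)$  $\alpha + \sum_{i=1}^{n} x_i,\ \beta + n - \sum_{i=1}^{n} x_i$
  $\mathrm{Bin}(p)$         $\mathrm{Beta}(\alpha, \beta)$  $\alpha + \sum_{i=1}^{n} x_i,\ \beta + \sum_{i=1}^{n} N_i - \sum_{i=1}^{n} x_i$
  $\mathrm{NBin}(p)$        $\mathrm{Beta}(\alpha, \beta)$  $\alpha + rn,\ \beta + \sum_{i=1}^{n} x_i$
  $\mathrm{Po}(\lambda)$    $\mathrm{Gamma}(\alpha, \beta)$ $\alpha + \sum_{i=1}^{n} x_i,\ \beta + n$
  Multinomial(p)            $\mathrm{Dir}(\alpha)$          $\alpha + \sum_{i=1}^{n} x^{(i)}$
  $\mathrm{Geo}(p)$         $\mathrm{Beta}(\alpha, \beta)$  $\alpha + n,\ \beta + \sum_{i=1}^{n} x_i - n$ (for the PMF $p(1 - p)^{x - 1}$ of Section 1.1)

Continuous likelihood (subscript c denotes constant)

  Likelihood                    Conjugate prior                                            Posterior hyperparameters
  $\mathrm{Unif}(0, \theta)$    $\mathrm{Pareto}(x_m, k)$                                  $\max\{x_{(n)}, x_m\},\ k + n$
  $\mathrm{Exp}(\lambda)$       $\mathrm{Gamma}(\alpha, \beta)$                            $\alpha + n,\ \beta + \sum_{i=1}^{n} x_i$
  $N(\mu, \sigma_c^2)$          $N(\mu_0, \sigma_0^2)$                                     $\left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma_c^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right),\ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma_c^2} \right)^{-1}$
  $N(\mu_c, \sigma^2)$          Scaled Inverse Chi-square($\nu, \sigma_0^2$)               $\nu + n,\ \frac{\nu \sigma_0^2 + \sum_{i=1}^{n} (x_i - \mu_c)^2}{\nu + n}$
  $N(\mu, \sigma^2)$            Normal-scaled Inverse Gamma($\lambda, \nu, \alpha, \beta$)  $\frac{\nu\lambda + n\bar{x}}{\nu + n},\ \nu + n,\ \alpha + \frac{n}{2},\ \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{n\nu (\bar{x} - \lambda)^2}{2(n + \nu)}$
  $\mathrm{MVN}(\mu, \Sigma_c)$ $\mathrm{MVN}(\mu_0, \Sigma_0)$                            $\left( \Sigma_0^{-1} + n \Sigma_c^{-1} \right)^{-1} \left( \Sigma_0^{-1} \mu_0 + n \Sigma_c^{-1} \bar{x} \right),\ \left( \Sigma_0^{-1} + n \Sigma_c^{-1} \right)^{-1}$
  $\mathrm{MVN}(\mu_c, \Sigma)$ InverseWishart($\kappa, \Psi$)                             $n + \kappa,\ \Psi + \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T$
  $\mathrm{Pareto}(x_{m_c}, k)$ $\mathrm{Gamma}(\alpha, \beta)$                            $\alpha + n,\ \beta + \sum_{i=1}^{n} \log \frac{x_i}{x_{m_c}}$
  $\mathrm{Pareto}(x_m, k_c)$   $\mathrm{Pareto}(x_0, k_0)$                                $x_0,\ k_0 - k n$ where $k_0 > k n$
  $\mathrm{Gamma}(\alpha_c, \beta)$  $\mathrm{Gamma}(\alpha_0, \beta_0)$                   $\alpha_0 + n \alpha_c,\ \beta_0 + \sum_{i=1}^{n} x_i$

14.4 Bayesian Testing

If $H_0 : \theta \in \Theta_0$:

Prior probability $P[H_0] = \int_{\Theta_0} f(\theta)\, d\theta$
Posterior probability $P[H_0 \mid x^n] = \int_{\Theta_0} f(\theta \mid x^n)\, d\theta$

Let $H_0, \dots, H_{K-1}$ be K hypotheses. Suppose $\theta \sim f(\theta \mid H_k)$,

$P[H_k \mid x^n] = \frac{f(x^n \mid H_k) P[H_k]}{\sum_{k=0}^{K-1} f(x^n \mid H_k) P[H_k]}$,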
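The Bernoulli row of the conjugate-prior table can be exercised in a few lines. The sketch below uses made-up function names and a made-up data vector; it only restates the update $\mathrm{Beta}(\alpha, \beta) \to \mathrm{Beta}(\alpha + \sum x_i, \beta + n - \sum x_i)$ and the Beta mean from Section 1.2.

```python
def beta_bernoulli_update(alpha, beta, xs):
    """Posterior hyperparameters for Bern(p) data under a Beta(alpha, beta) prior."""
    n = len(xs)
    s = sum(xs)
    return alpha + s, beta + n - s


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# Hypothetical data: four Bernoulli trials, flat Beta(1, 1) prior.
alpha_post, beta_post = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
print(alpha_post, beta_post)
print(beta_mean(alpha_post, beta_post))
```

Because the posterior is again a Beta distribution, updating on a second batch of data only requires calling the same function with the new hyperparameters as the prior.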

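Marginal likelihood

$f(x^n \mid H_i) = \int_\Theta f(x^n \mid \theta, H_i) f(\theta \mid H_i)\, d\theta$

Posterior odds (of $H_i$ relative to $H_j$)

$\frac{P[H_i \mid x^n]}{P[H_j \mid x^n]} = \underbrace{\frac{f(x^n \mid H_i)}{f(x^n \mid H_j)}}_{\text{Bayes Factor } BF_{ij}} \times \underbrace{\frac{P[H_i]}{P[H_j]}}_{\text{prior odds}}$

Bayes factor

  $\log_{10} BF_{10}$    $BF_{10}$      evidence
  0 - 0.5                1 - 3.2        Weak
  0.5 - 1                3.2 - 10       Moderate
  1 - 2                  10 - 100       Strong
  > 2                    > 100          Decisive

$p^\star = \frac{\frac{p}{1 - p} BF_{10}}{1 + \frac{p}{1 - p} BF_{10}}$ where $p = P[H_1]$ and $p^\star = P[H_1 \mid x^n]$

15 Exponential Family

Scalar parameter

$f_X(x \mid \theta) = h(x) \exp\{\eta(\theta) T(x) - A(\theta)\} = h(x) g(\theta) \exp\{\eta(\theta) T(x)\}$

Vector parameter

$f_X(x \mid \theta) = h(x) \exp\left\{ \sum_{i=1}^{s} \eta_i(\theta) T_i(x) - A(\theta) \right\}$
$= h(x) \exp\{\eta(\theta) \cdot T(x) - A(\theta)\}$
$= h(x) g(\theta) \exp\{\eta(\theta) \cdot T(x)\}$

Natural form

$f_X(x \mid \eta) = h(x) \exp\{\eta \cdot T(x) - A(\eta)\}$
$= h(x) g(\eta) \exp\{\eta \cdot T(x)\}$
$= h(x) g(\eta) \exp\{\eta^T T(x)\}$

16 Sampling Methods

16.1 The Bootstrap

Let $T_n = g(X_1, \dots, X_n)$ be a statistic.

1. Estimate $V_F[T_n]$ with $V_{\hat{F}_n}[T_n]$.
2. Approximate $V_{\hat{F}_n}[T_n]$ using simulation:
   (a) Repeat the following B times to get $T^\star_{n,1}, \dots, T^\star_{n,B}$, an iid sample from the sampling distribution implied by $\hat{F}_n$
       i. Sample uniformly $X^\star_1, \dots, X^\star_n \sim \hat{F}_n$.
       ii. Compute $T^\star_n = g(X^\star_1, \dots, X^\star_n)$.
   (b) Then
       $v_{boot} = \hat{V}_{\hat{F}_n} = \frac{1}{B} \sum_{b=1}^{B} \left( T^\star_{n,b} - \frac{1}{B} \sum_{r=1}^{B} T^\star_{n,r} \right)^2$

16.1.1 Bootstrap Confidence Intervals

Normal-based interval

$T_n \pm z_{\alpha/2}\, \widehat{\mathrm{se}}_{boot}$

Pivotal interval

1. Location parameter $\theta = T(F)$
2. Pivot $R_n = \hat{\theta}_n - \theta$
3. Let $H(r) = P[R_n \le r]$ be the cdf of $R_n$
4. Let $R^\star_{n,b} = \hat{\theta}^\star_{n,b} - \hat{\theta}_n$. Approximate H using bootstrap:
   $\hat{H}(r) = \frac{1}{B} \sum_{b=1}^{B} I(R^\star_{n,b} \le r)$
5. $\theta^\star_\beta$ = $\beta$ sample quantile of $(\hat{\theta}^\star_{n,1}, \dots, \hat{\theta}^\star_{n,B})$
6. $r^\star_\beta$ = $\beta$ sample quantile of $(R^\star_{n,1}, \dots, R^\star_{n,B})$, i.e., $r^\star_\beta = \theta^\star_\beta - \hat{\theta}_n$
7. Approximate $1 - \alpha$ confidence interval $C_n = (\hat{a}, \hat{b})$ where
   $\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left( 1 - \frac{\alpha}{2} \right) = \hat{\theta}_n - r^\star_{1 - \alpha/2} = 2\hat{\theta}_n - \theta^\star_{1 - \alpha/2}$
   $\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left( \frac{\alpha}{2} \right) = \hat{\theta}_n - r^\star_{\alpha/2} = 2\hat{\theta}_n - \theta^\star_{\alpha/2}$

Percentile interval

$C_n = \left( \theta^\star_{\alpha/2}, \theta^\star_{1 - \alpha/2} \right)$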
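The bootstrap variance estimate and the percentile and pivotal intervals of Section 16.1 can be sketched as follows. Function names, the seed, the simple order-statistic quantile rule, and the example data are assumptions for this illustration, not part of the cookbook.

```python
import math
import random
import statistics


def bootstrap_replicates(sample, statistic, B=1000, seed=0):
    """Draw B resamples from F_hat_n (with replacement) and apply the statistic."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(B)]


def bootstrap_variance(reps):
    """v_boot: average squared deviation of the replicates from their mean."""
    mean = sum(reps) / len(reps)
    return sum((t - mean) ** 2 for t in reps) / len(reps)


def percentile_interval(reps, alpha=0.05):
    """(theta*_{alpha/2}, theta*_{1 - alpha/2}) from sorted replicates."""
    ordered = sorted(reps)
    last = len(ordered) - 1
    lo = ordered[math.floor((alpha / 2.0) * last)]
    hi = ordered[math.ceil((1.0 - alpha / 2.0) * last)]
    return lo, hi


def pivotal_interval(theta_hat, reps, alpha=0.05):
    """(2 theta_hat - theta*_{1 - alpha/2}, 2 theta_hat - theta*_{alpha/2})."""
    lo, hi = percentile_interval(reps, alpha)
    return 2.0 * theta_hat - hi, 2.0 * theta_hat - lo


data = list(range(1, 51))  # hypothetical sample
reps = bootstrap_replicates(data, statistics.median, B=500, seed=1)
print(bootstrap_variance(reps))
print(percentile_interval(reps))
print(pivotal_interval(statistics.median(data), reps))
```

The pivotal interval is the percentile interval reflected around $\hat{\theta}_n$, so the two coincide only when the bootstrap distribution is symmetric about the estimate.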

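16.2 Rejection Sampling

Setup

We can easily sample from $g(\theta)$
We want to sample from $h(\theta)$, but it is difficult
We know $h(\theta)$ up to a proportional constant: $h(\theta) = \frac{k(\theta)}{\int k(\theta)\, d\theta}$
Envelope condition: we can find $M > 0$ such that $k(\theta) \le M g(\theta)\ \forall \theta$

Algorithm

1. Draw $\theta^{cand} \sim g(\theta)$
2. Generate $u \sim \mathrm{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{k(\theta^{cand})}{M g(\theta^{cand})}$
4. Repeat until B values of $\theta^{cand}$ have been accepted

Example

We can easily sample from the prior $g(\theta) = f(\theta)$
Target is the posterior $h(\theta) \propto k(\theta) = f(x^n \mid \theta) f(\theta)$
Envelope condition: $f(x^n \mid \theta) \le f(x^n \mid \hat{\theta}_n) = L_n(\hat{\theta}_n) \equiv M$
Algorithm
1. Draw $\theta^{cand} \sim f(\theta)$
2. Generate $u \sim \mathrm{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \frac{L_n(\theta^{cand})}{L_n(\hat{\theta}_n)}$

16.3 Importance Sampling

Sample from an importance function g rather than target density h.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:

1. Sample from the prior $\theta_1, \dots, \theta_B \overset{iid}{\sim} f(\theta)$
2. $w_i = \frac{L_n(\theta_i)}{\sum_{i=1}^{B} L_n(\theta_i)} \quad \forall i = 1, \dots, B$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^{B} q(\theta_i) w_i$

17 Decision Theory

Definitions

Unknown quantity affecting our decision: $\theta \in \Theta$
Decision rule: synonymous for an estimator $\hat{\theta}$
Action $a \in \mathcal{A}$: possible value of the decision rule. In the estimation context, the action is just an estimate of $\theta$, $\hat{\theta}(x)$.
Loss function L: consequences of taking action a when true state is $\theta$ or discrepancy between $\theta$ and $\hat{\theta}$, $L : \Theta \times \mathcal{A} \to [-k, \infty)$.

Loss functions

Squared error loss: $L(\theta, a) = (\theta - a)^2$
Linear loss: $L(\theta, a) = K_1 (\theta - a)$ if $a - \theta < 0$, and $K_2 (a - \theta)$ if $a - \theta \ge 0$
Absolute error loss: $L(\theta, a) = |\theta - a|$ (linear loss with $K_1 = K_2$)
$L_p$ loss: $L(\theta, a) = |\theta - a|^p$
Zero-one loss: $L(\theta, a) = 0$ if $a = \theta$, and $1$ if $a \ne \theta$

17.1 Risk

Posterior risk

$r(\hat{\theta} \mid x) = \int L(\theta, \hat{\theta}(x)) f(\theta \mid x)\, d\theta = E_{\theta|X}\left[ L(\theta, \hat{\theta}(x)) \right]$

(Frequentist) risk

$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x)) f(x \mid \theta)\, dx = E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right]$

Bayes risk

$r(f, \hat{\theta}) = \iint L(\theta, \hat{\theta}(x)) f(x, \theta)\, dx\, d\theta = E_{\theta,X}\left[ L(\theta, \hat{\theta}(X)) \right]$
$r(f, \hat{\theta}) = E_\theta\left[ E_{X|\theta}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_\theta\left[ R(\theta, \hat{\theta}) \right]$
$r(f, \hat{\theta}) = E_X\left[ E_{\theta|X}\left[ L(\theta, \hat{\theta}(X)) \right] \right] = E_X\left[ r(\hat{\theta} \mid X) \right]$

17.2 Admissibility

$\hat{\theta}'$ dominates $\hat{\theta}$ if

$\forall \theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta})$
$\exists \theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$

$\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.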
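The rejection sampling algorithm of Section 16.2 can be sketched with a toy target. Everything specific here is an assumption for illustration: the target kernel $k(\theta) = \theta(1 - \theta)$ on $[0, 1]$ (an unnormalized Beta(2, 2) density), the $\mathrm{Unif}(0, 1)$ proposal with $g(\theta) = 1$, and the envelope constant $M = 1/4 = \max_\theta k(\theta)$.

```python
import random


def rejection_sample(kernel, proposal_draw, proposal_pdf, M, B, seed=0):
    """Collect B accepted draws from the density proportional to `kernel`.

    Requires the envelope condition kernel(t) <= M * proposal_pdf(t) for all t.
    """
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < B:
        cand = proposal_draw(rng)          # step 1: draw candidate from g
        u = rng.random()                   # step 2: u ~ Unif(0, 1)
        if u <= kernel(cand) / (M * proposal_pdf(cand)):
            accepted.append(cand)          # step 3: accept
    return accepted                        # step 4: stop after B acceptances


def beta22_kernel(t):
    """Unnormalized Beta(2, 2) density on [0, 1]."""
    return t * (1.0 - t)


draws = rejection_sample(
    beta22_kernel,
    proposal_draw=lambda rng: rng.random(),
    proposal_pdf=lambda t: 1.0,
    M=0.25,
    B=2000,
)
print(sum(draws) / len(draws))  # should be near the Beta(2, 2) mean of 0.5
```

The acceptance rate equals $\int k(\theta)\, d\theta / M$, here $(1/6) / (1/4) = 2/3$, so a tighter envelope wastes fewer candidates.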

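17.3 Bayes Rule

Bayes rule (or Bayes estimator)

$r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$
$\hat{\theta}(x) = \inf r(\hat{\theta} \mid x)\ \forall x \implies r(f, \hat{\theta}) = \int r(\hat{\theta} \mid x) f(x)\, dx$

Theorems

Squared error loss: posterior mean
Absolute error loss: posterior median
Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

$\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta})$
$\bar{R}(a) = \sup_\theta R(\theta, a)$

Minimax rule

$\sup_\theta R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta})$
$\hat{\theta}$ = Bayes rule $\wedge\ \exists c : R(\theta, \hat{\theta}) = c$

Least favorable prior

$\hat{\theta}^f$ = Bayes rule $\wedge\ R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f)\ \forall \theta$

18 Linear Regression

Definitions

Response variable Y
Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model

$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, with $E[\epsilon_i \mid X_i] = 0$ and $V[\epsilon_i \mid X_i] = \sigma^2$

Fitted line

$\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$

Predicted (fitted) values

$\hat{Y}_i = \hat{r}(X_i)$

Residuals

$\hat{\epsilon}_i = Y_i - \hat{Y}_i = Y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 X_i \right)$

Residual sums of squares (rss)

$\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} \hat{\epsilon}_i^2$

Least square estimates

$\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T : \min_{\hat{\beta}_0, \hat{\beta}_1} \mathrm{rss}$
$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{X}_n$
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}$
$E[\hat{\beta} \mid X^n] = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$
$V[\hat{\beta} \mid X^n] = \frac{\sigma^2}{n s_X^2} \begin{pmatrix} n^{-1} \sum_{i=1}^{n} X_i^2 & -\bar{X}_n \\ -\bar{X}_n & 1 \end{pmatrix}$
$\widehat{\mathrm{se}}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}$
$\widehat{\mathrm{se}}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X \sqrt{n}}$
where $s_X^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \hat{\epsilon}_i^2$ (unbiased estimate).

Further properties:

Consistency: $\hat{\beta}_0 \xrightarrow{P} \beta_0$ and $\hat{\beta}_1 \xrightarrow{P} \beta_1$
Asymptotic normality: $\frac{\hat{\beta}_0 - \beta_0}{\widehat{\mathrm{se}}(\hat{\beta}_0)} \xrightarrow{D} N(0, 1)$ and $\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}(\hat{\beta}_1)} \xrightarrow{D} N(0, 1)$
Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$: $\hat{\beta}_0 \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat{\beta}_0)$ and $\hat{\beta}_1 \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat{\beta}_1)$
Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1 / \widehat{\mathrm{se}}(\hat{\beta}_1)$.

$R^2$

$R^2 = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^{n} \hat{\epsilon}_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$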
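The least squares formulas of Section 18.1 can be collected into one small routine. This is a sketch in plain Python with hypothetical names; it assumes at least three observations (so that $n - 2 > 0$) and non-constant covariate values.

```python
import math


def simple_linear_regression(xs, ys):
    """Least squares fit of Y = b0 + b1 * X with standard errors and R^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)            # equals n * s_X^2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    sigma2_hat = rss / (n - 2)                         # unbiased estimate
    se_b1 = math.sqrt(sigma2_hat / sxx)                # sigma_hat / (s_X sqrt(n))
    se_b0 = se_b1 * math.sqrt(sum(x * x for x in xs) / n)
    return {
        "b0": b0,
        "b1": b1,
        "se_b0": se_b0,
        "se_b1": se_b1,
        "sigma2_hat": sigma2_hat,
        "r2": 1.0 - rss / tss,
    }


# Hypothetical data lying close to the line y = 1 + 2x.
fit = simple_linear_regression([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.0])
print(fit)
```

The Wald statistic for $H_0 : \beta_1 = 0$ is then `fit["b1"] / fit["se_b1"]`.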

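Likelihood

$L = \prod_{i=1}^{n} f(X_i, Y_i) = \prod_{i=1}^{n} f_X(X_i) \times \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) = L_1 \times L_2$
$L_1 = \prod_{i=1}^{n} f_X(X_i)$
$L_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \right\}$

Under the assumption of Normality, the least squares estimator is also the mle, with

$\hat{\sigma}^2_{mle} = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2$

18.2 Prediction

Observe $X = x_\star$ of the covariate and want to predict the outcome $Y_\star$.

$\hat{Y}_\star = \hat{\beta}_0 + \hat{\beta}_1 x_\star$
$V[\hat{Y}_\star] = V[\hat{\beta}_0] + x_\star^2 V[\hat{\beta}_1] + 2 x_\star \mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$

Prediction interval

$\hat{\xi}_n^2 = \hat{\sigma}^2 \left( \frac{\sum_{i=1}^{n} (X_i - x_\star)^2}{n \sum_i (X_i - \bar{X})^2} + 1 \right)$
$\hat{Y}_\star \pm z_{\alpha/2}\, \hat{\xi}_n$

18.3 Multiple Regression

$Y = X\beta + \epsilon$

where

$X = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ \vdots & \ddots & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}$, $\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$, $\epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}$

Likelihood

$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \mathrm{rss} \right\}$
$\mathrm{rss} = (Y - X\beta)^T (Y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^{n} (Y_i - x_i^T \beta)^2$

If the $(k \times k)$ matrix $X^T X$ is invertible,

$\hat{\beta} = (X^T X)^{-1} X^T Y$
$V[\hat{\beta} \mid X^n] = \sigma^2 (X^T X)^{-1}$
$\hat{\beta} \approx N\left( \beta, \sigma^2 (X^T X)^{-1} \right)$

Estimate regression function

$\hat{r}(x) = \sum_{j=1}^{k} \hat{\beta}_j x_j$

Unbiased estimate for $\sigma^2$

$\hat{\sigma}^2 = \frac{1}{n - k} \sum_{i=1}^{n} \hat{\epsilon}_i^2$ with $\hat{\epsilon} = X\hat{\beta} - Y$

mle

$\hat{\mu} = \bar{X}$, $\hat{\sigma}^2_{mle} = \frac{n - k}{n} \hat{\sigma}^2$

$1 - \alpha$ Confidence interval

$\hat{\beta}_j \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat{\beta}_j)$

18.4 Model Selection

Consider predicting a new observation $Y^\star$ for covariates $X^\star$ and let $S \subset J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.

Issues

Underfitting: too few covariates yields high bias
Overfitting: too many covariates yields high variance

Procedure

1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing

$H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0 \quad \forall j \in J$

Mean squared prediction error (mspe)

$\mathrm{mspe} = E\left[ (\hat{Y}(S) - Y^\star)^2 \right]$

Prediction risk

$R(S) = \sum_{i=1}^{n} \mathrm{mspe}_i = \sum_{i=1}^{n} E\left[ (\hat{Y}_i(S) - Y_i^\star)^2 \right]$

Training error

$\hat{R}_{tr}(S) = \sum_{i=1}^{n} (\hat{Y}_i(S) - Y_i)^2$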

R2
btr (S)
rss(S)
R
R2 (S) = 1
=1
=1
tss
tss

Pn b
2
i=1 (Yi (S) Y )
P
n
2
i=1 (Yi Y )

Frequentist risk
Z
h
i Z
R(f, fbn ) = E L(f, fbn ) = b2 (x) dx + v(x) dx

The training error is a downward-biased estimate of the prediction risk.


h
i
btr (S) < R(S)
E R
h
i
btr (S)) = E R
btr (S) R(S) = 2
bias(R

n
X

h
i
b(x) = E fbn (x) f (x)
h
i
v(x) = V fbn (x)

h
i
Cov Ybi , Yi

i=1

Adjusted R

R2 (S) = 1

n 1 rss
n k tss

19.1.1

Mallows Cp statistic

Histograms

Definitions

b
btr (S) + 2kb
R(S)
=R
2 = lack of fit + complexity penalty

Akaike Information Criterion (AIC)


AIC(S) = `n (bS ,
bS2 ) k
Bayesian Information Criterion (BIC)

Number of bins m
1
Binwidth h = m
Bin Bj has j observations
R
Define pbj = j /n and pj = Bj f (u) du

Histogram estimator

k
BIC(S) = `n (bS ,
bS2 ) log n
2
Validation and training
bV (S) =
R

m
X

fbn (x) =

(Ybi (S) Yi )2

m = |{validation data}|, often

i=1

Leave-one-out cross-validation
bCV (S) =
R

n
X

(Yi Yb(i) ) =

i=1

n
X
i=1

Yi Ybi (S)
1 Uii (S)

!2

U (S) = XS (XST XS )1 XS (hat matrix)

19
19.1

Non-parametric Function Estimation

n
n
or
4
2

m
X
pbj
j=1

I(x Bj )

h
i p
j
E fbn (x) =
h
h
i p (1 p )
j
j
V fbn (x) =
nh2
Z
1
h2
2
b
(f 0 (u)) du +
R(fn , f )
12
nh
!1/3
1
6
h = 1/3 R
2 du
n
(f 0 (u))
 2/3 Z
1/3
3
C
2
C=
(f 0 (u)) du
R (fbn , f ) 2/3
4
n

Density Estimation

R
Estimate f (x), where f (x) = P [X A] = A f (x) dx.
Integrated square error (ise)
Z 
Z
2
L(f, fbn ) =
f (x) fbn (x) dx = J(h) + f 2 (x) dx

Cross-validation estimate of E [J(h)]


Z
JbCV (h) =

2Xb
2
n+1 X 2
f(i) (Xi ) =
pb
fbn2 (x) dx

n i=1
(n 1)h (n 1)h j=1 j
20
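The histogram estimator and its cross-validation score can be sketched in a few lines of Python. This is a minimal illustration, assuming the data lie in $[0, 1]$ (which is what the binwidth $h = 1/m$ suggests); the function names `histogram_density`, `f_hat` and `cv_score` and the sample values are hypothetical and not part of the cookbook.

```python
def histogram_density(data, m):
    """Return (h, p_hat) for a histogram on [0, 1] with m equal-width bins."""
    n = len(data)
    h = 1.0 / m
    counts = [0] * m
    for x in data:
        j = min(int(x / h), m - 1)  # x == 1.0 is assigned to the last bin
        counts[j] += 1
    p_hat = [c / n for c in counts]  # p_hat_j = nu_j / n
    return h, p_hat


def f_hat(x, h, p_hat):
    """Histogram estimator: p_hat_j / h for the bin B_j containing x."""
    j = min(int(x / h), len(p_hat) - 1)
    return p_hat[j] / h


def cv_score(data, m):
    """J_CV(h) = 2 / ((n-1) h) - (n+1) / ((n-1) h) * sum_j p_hat_j^2."""
    n = len(data)
    h, p_hat = histogram_density(data, m)
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum(p * p for p in p_hat)


# Illustrative data (made up for this sketch), all inside [0, 1].
data = [0.05, 0.12, 0.18, 0.22, 0.31, 0.47, 0.52, 0.58, 0.81, 0.93]
h, p_hat = histogram_density(data, 4)
# Pick the number of bins with the smallest cross-validation score.
best_m = min(range(1, 8), key=lambda m: cv_score(data, m))
```

Minimizing `cv_score` over `m` is one way to choose the binwidth without knowing $f'$, which the formula for $h^*$ requires.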

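19.1.2 Kernel Density Estimator (KDE)

Kernel $K$
$K(x) \geq 0$
$\int K(x)\,dx = 1$
$\int x K(x)\,dx = 0$
$\int x^2 K(x)\,dx \equiv \sigma_K^2 > 0$

KDE
$$\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left( \frac{x - X_i}{h} \right)$$
$$R(f, \hat{f}_n) \approx \frac{1}{4} (h \sigma_K)^4 \int (f''(x))^2\,dx + \frac{1}{nh} \int K^2(x)\,dx$$
$$h^* = \frac{c_1^{-2/5} c_2^{1/5} c_3^{-1/5}}{n^{1/5}} \qquad c_1 = \sigma_K^2, \quad c_2 = \int K^2(x)\,dx, \quad c_3 = \int (f''(x))^2\,dx$$
$$R^*(f, \hat{f}_n) = \frac{c_4}{n^{4/5}} \qquad c_4 = \underbrace{\frac{5}{4} (\sigma_K^2)^{2/5} \left( \int K^2(x)\,dx \right)^{4/5}}_{C(K)} \left( \int (f'')^2\,dx \right)^{1/5}$$

Epanechnikov Kernel
$$K(x) = \begin{cases} \frac{3}{4\sqrt{5}} \left( 1 - x^2/5 \right) & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$

Cross-validation estimate of $E[J(h)]$
$$\hat{J}_{CV}(h) = \int \hat{f}_n^2(x)\,dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_{(-i)}(X_i) \approx \frac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0)$$
$$K^*(x) = K^{(2)}(x) - 2K(x) \qquad K^{(2)}(x) = \int K(x - y) K(y)\,dy$$

19.2 Non-parametric Regression

Estimate $r(x)$, where $r(x) = E[Y \mid X = x]$. Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by
$$Y_i = r(x_i) + \epsilon_i \qquad E[\epsilon_i] = 0 \qquad V[\epsilon_i] = \sigma^2$$

k-nearest Neighbor Estimator
$$\hat{r}(x) = \frac{1}{k} \sum_{i : x_i \in N_k(x)} Y_i \qquad \text{where } N_k(x) = \{k \text{ values of } x_1, \ldots, x_n \text{ closest to } x\}$$

Nadaraya-Watson Kernel Estimator
$$\hat{r}(x) = \sum_{i=1}^n w_i(x) Y_i$$
$$w_i(x) = \frac{K\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^n K\left( \frac{x - x_j}{h} \right)} \in [0, 1]$$
$$R(\hat{r}_n, r) \approx \frac{h^4}{4} \left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \int \frac{\sigma^2 \int K^2(x)\,dx}{n h f(x)}\,dx$$
$$h^* \approx \frac{c_1}{n^{1/5}} \qquad R^*(\hat{r}_n, r) \approx \frac{c_2}{n^{4/5}}$$

Cross-validation estimate of $E[J(h)]$
$$\hat{J}_{CV}(h) = \sum_{i=1}^n (Y_i - \hat{r}_{(-i)}(x_i))^2 = \sum_{i=1}^n \frac{(Y_i - \hat{r}(x_i))^2}{\left( 1 - \frac{K(0)}{\sum_{j=1}^n K\left( \frac{x_i - x_j}{h} \right)} \right)^2}$$

19.3 Smoothing Using Orthogonal Functions

Approximation
$$r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x) \approx \sum_{j=1}^J \beta_j \phi_j(x)$$

Multivariate regression
$$Y = \Phi\beta + \eta \qquad \text{where } \eta_i = \epsilon_i \text{ and } \Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_J(x_1) \\ \vdots & \ddots & \vdots \\ \phi_0(x_n) & \cdots & \phi_J(x_n) \end{pmatrix}$$

Least squares estimator
$$\hat{\beta} = (\Phi^T \Phi)^{-1} \Phi^T Y \approx \frac{1}{n} \Phi^T Y \quad \text{(for equally spaced observations only)}$$

Cross-validation estimate of $E[J(h)]$
$$\hat{R}_{CV}(J) = \sum_{i=1}^n \left( Y_i - \sum_{j=1}^J \phi_j(x_i) \hat{\beta}_{j,(-i)} \right)^2$$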
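The Nadaraya-Watson estimator of Section 19.2 and its leave-one-out shortcut can be illustrated with a short Python sketch. It assumes a Gaussian kernel (the cookbook does not fix one for regression), and the helper names and data are hypothetical. The brute-force function is included only to show that the shortcut $1 - K(0) / \sum_j K((x_i - x_j)/h)$ gives the same score as actually refitting without point $i$.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(x, xs, h, kernel=gaussian_kernel):
    """w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h)."""
    k = [kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]


def nw_estimate(x, xs, ys, h, kernel=gaussian_kernel):
    """r_hat(x) = sum_i w_i(x) Y_i."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h, kernel), ys))


def loo_cv_score(xs, ys, h, kernel=gaussian_kernel):
    """Leave-one-out score via the closed-form shortcut (no refitting)."""
    score = 0.0
    for xi, yi in zip(xs, ys):
        denom = sum(kernel((xi - xj) / h) for xj in xs)
        shrink = 1.0 - kernel(0.0) / denom
        score += ((yi - nw_estimate(xi, xs, ys, h, kernel)) / shrink) ** 2
    return score


def loo_cv_brute_force(xs, ys, h, kernel=gaussian_kernel):
    """Leave-one-out score by deleting each point and re-estimating."""
    score = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        score += (ys[i] - nw_estimate(xs[i], rest_x, rest_y, h, kernel)) ** 2
    return score


# Illustrative data (made up for this sketch).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.35, 0.85, 1.2, 1.65, 2.1]
fitted = nw_estimate(0.5, xs, ys, h=0.3)
```

Evaluating `loo_cv_score` on a grid of bandwidths and taking the minimizer is one common way to choose $h$ in practice.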

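20 Stochastic Processes

Stochastic Process
$$\{X_t : t \in T\} \qquad T = \begin{cases} \{0, \pm 1, \ldots\} = \mathbb{Z} & \text{discrete} \\ [0, \infty) & \text{continuous} \end{cases}$$
Notations $X_t$, $X(t)$
State space $\mathcal{X}$
Index set $T$

20.1 Markov Chains

Markov chain
$$P[X_n = x \mid X_0, \ldots, X_{n-1}] = P[X_n = x \mid X_{n-1}] \qquad \forall n \in T, x \in \mathcal{X}$$

Transition probabilities
$$p_{ij} \equiv P[X_{n+1} = j \mid X_n = i]$$
$$p_{ij}(n) \equiv P[X_{m+n} = j \mid X_m = i] \qquad n\text{-step}$$

Transition matrix $P$ ($n$-step: $P_n$)
$(i, j)$ element is $p_{ij}$
$p_{ij} \geq 0$
$\sum_j p_{ij} = 1$

Chapman-Kolmogorov
$$p_{ij}(m + n) = \sum_k p_{ik}(m)\, p_{kj}(n)$$
$$P_{m+n} = P_m P_n$$
$$P_n = P \times \cdots \times P = P^n$$

Marginal probability
$$\mu_n = (\mu_n(1), \ldots, \mu_n(N)) \qquad \text{where } \mu_n(i) = P[X_n = i]$$
$\mu_0 \triangleq$ initial distribution
$$\mu_n = \mu_0 P^n$$

20.2 Poisson Processes

Poisson process
$\{X_t : t \in [0, \infty)\}$ = number of events up to and including time $t$
$X_0 = 0$
Independent increments:
$$\forall t_0 < \cdots < t_n : X_{t_1} - X_{t_0} \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_{t_n} - X_{t_{n-1}}$$
Intensity function $\lambda(t)$
$$P[X_{t+h} - X_t = 1] = \lambda(t) h + o(h)$$
$$P[X_{t+h} - X_t = 2] = o(h)$$
$$X_{s+t} - X_s \sim \mathrm{Po}(m(s + t) - m(s)) \qquad \text{where } m(t) = \int_0^t \lambda(s)\,ds$$

Homogeneous Poisson process
$$\lambda(t) \equiv \lambda \implies X_t \sim \mathrm{Po}(\lambda t) \qquad \lambda > 0$$

Waiting times
$W_t$ := time at which $X_t$ occurs
$$W_t \sim \mathrm{Gamma}\left( t, \frac{1}{\lambda} \right)$$

Interarrival times
$$S_t = W_{t+1} - W_t$$
$$S_t \sim \mathrm{Exp}\left( \frac{1}{\lambda} \right)$$

[Figure: timeline marking the waiting times $W_{t-1}$ and $W_t$ and the interarrival time $S_t$ between them, along the time axis $t$.]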
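The Markov chain relations of Section 20.1 (Chapman-Kolmogorov and $\mu_n = \mu_0 P^n$) can be checked numerically. The sketch below uses plain Python lists; the two-state transition matrix and the helper names `mat_mul`, `mat_pow` and `marginal` are hypothetical choices for illustration, with rows of $P$ summing to one.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def mat_pow(p, n):
    """n-step transition matrix P_n = P^n (n >= 0) by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result


def marginal(mu0, p, n):
    """Marginal distribution mu_n = mu_0 P^n, with mu_0 as a row vector."""
    return mat_mul([mu0], mat_pow(p, n))[0]


# Illustrative two-state chain (made up for this sketch); each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]

mu2 = marginal(mu0, P, 2)                        # distribution after two steps
lhs = mat_pow(P, 5)                              # P_{2+3}
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))      # P_2 P_3
```

For long horizons, repeated squaring would be cheaper than the simple loop in `mat_pow`; the loop is kept here because it mirrors $P_n = P \times \cdots \times P$ directly.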

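21 Time Series

Mean function
$$\mu_{xt} = E[x_t] = \int_{-\infty}^{\infty} x f_t(x)\,dx$$

Autocovariance function
$$\gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)] = E[x_s x_t] - \mu_s \mu_t$$
$$\gamma_x(t, t) = E\left[(x_t - \mu_t)^2\right] = V[x_t]$$

Autocorrelation function (ACF)
$$\rho(s, t) = \frac{\mathrm{Cov}[x_s, x_t]}{\sqrt{V[x_s]\, V[x_t]}} = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\, \gamma(t, t)}}$$

Cross-covariance function (CCV)
$$\gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]$$

Cross-correlation function (CCF)
$$\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\, \gamma_y(t, t)}}$$

Backshift operator
$$B^k(x_t) = x_{t-k}$$

Difference operator
$$\nabla^d = (1 - B)^d$$

White noise
$w_t \sim \mathrm{wn}(0, \sigma_w^2)$
Gaussian: $w_t \overset{iid}{\sim} N(0, \sigma_w^2)$
$E[w_t] = 0 \quad t \in T$
$V[w_t] = \sigma^2 \quad t \in T$
$\gamma_w(s, t) = 0 \quad s \neq t \wedge s, t \in T$

Random walk
Drift $\delta$
$$x_t = \delta t + \sum_{j=1}^t w_j$$
$$E[x_t] = \delta t$$

Symmetric moving average
$$m_t = \sum_{j=-k}^k a_j x_{t-j} \qquad \text{where } a_j = a_{-j} \geq 0 \text{ and } \sum_{j=-k}^k a_j = 1$$

21.1 Stationary Time Series

Strictly stationary
$$P[x_{t_1} \leq c_1, \ldots, x_{t_k} \leq c_k] = P[x_{t_1 + h} \leq c_1, \ldots, x_{t_k + h} \leq c_k] \qquad \forall k \in \mathbb{N}, t_k, c_k, h \in \mathbb{Z}$$

Weakly stationary
$E[x_t^2] < \infty \quad \forall t \in \mathbb{Z}$
$E[x_t] = m \quad \forall t \in \mathbb{Z}$
$\gamma_x(s, t) = \gamma_x(s + r, t + r) \quad \forall r, s, t \in \mathbb{Z}$

Autocovariance function
$\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)] \quad \forall h \in \mathbb{Z}$
$\gamma(0) = E\left[(x_t - \mu)^2\right]$
$\gamma(0) \geq 0$
$\gamma(0) \geq |\gamma(h)|$
$\gamma(h) = \gamma(-h)$

Autocorrelation function (ACF)
$$\rho_x(h) = \frac{\mathrm{Cov}[x_{t+h}, x_t]}{\sqrt{V[x_{t+h}]\, V[x_t]}} = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\, \gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}$$

Jointly stationary time series
$$\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]$$
$$\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\, \gamma_y(0)}}$$

Linear process
$$x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j} \qquad \text{where } \sum_{j=-\infty}^{\infty} |\psi_j| < \infty$$
$$\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h} \psi_j$$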
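As a small worked example of the linear-process autocovariance $\gamma(h) = \sigma_w^2 \sum_j \psi_{j+h} \psi_j$, the following Python sketch evaluates the sum for a finite coefficient sequence. It assumes $\psi_j = 0$ outside the supplied list (indices $0, \ldots, q$); the function name and the example coefficients are hypothetical.

```python
def linear_process_acov(psi, sigma2_w, h):
    """gamma(h) = sigma_w^2 * sum_j psi_{j+h} * psi_j.

    psi holds psi_0, ..., psi_q; all other coefficients are taken to be zero
    (an assumption of this sketch, so the infinite sum becomes finite).
    """
    h = abs(h)  # gamma(h) = gamma(-h)
    return sigma2_w * sum(psi[j + h] * psi[j] for j in range(len(psi) - h))


def linear_process_acf(psi, h):
    """rho(h) = gamma(h) / gamma(0); the noise variance cancels."""
    return linear_process_acov(psi, 1.0, h) / linear_process_acov(psi, 1.0, 0)


# Illustrative coefficients: psi_0 = 1, psi_1 = 0.5, noise variance 2.
psi = [1.0, 0.5]
gamma0 = linear_process_acov(psi, 2.0, 0)
gamma1 = linear_process_acov(psi, 2.0, 1)
rho1 = linear_process_acf(psi, 1)
```

With two non-zero coefficients the autocovariance vanishes for $|h| > 1$, and the properties $\gamma(0) \geq |\gamma(h)|$ and $\gamma(h) = \gamma(-h)$ can be read off directly.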

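21.2 Estimation of Correlation

Sample mean
$$\bar{x} = \frac{1}{n} \sum_{t=1}^n x_t$$

Sample variance
$$V[\bar{x}] = \frac{1}{n} \sum_{h=-n}^{n} \left( 1 - \frac{|h|}{n} \right) \gamma_x(h)$$

Sample autocovariance function
$$\hat{\gamma}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x})$$

Sample autocorrelation function
$$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$$

Sample cross-variance function
$$\hat{\gamma}_{xy}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})$$

Sample cross-correlation function
$$\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\, \hat{\gamma}_y(0)}}$$

Properties
$\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
$\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series

Classical decomposition model
$$x_t = \mu_t + s_t + w_t$$
$\mu_t$ = trend
$s_t$ = seasonal component
$w_t$ = random noise term

21.3.1 Detrending

Least squares
1. Choose trend model, e.g., $\mu_t = \beta_0 + \beta_1 t + \beta_2 t^2$
2. Minimize rss to obtain trend estimate $\hat{\mu}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2$
3. Residuals $\triangleq$ noise $w_t$

Moving average
The low-pass filter $v_t$ is a symmetric moving average $m_t$ with $a_j = \frac{1}{2k+1}$:
$$v_t = \frac{1}{2k+1} \sum_{i=-k}^k x_{t-i}$$
If $\frac{1}{2k+1} \sum_{i=-k}^k w_{t-i} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion.

Differencing
$\mu_t = \beta_0 + \beta_1 t \implies \nabla \mu_t = \beta_1$, so differencing $x_t$ removes the linear trend.

21.4 ARIMA models

Autoregressive polynomial
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C} \wedge \phi_p \neq 0$$

Autoregressive operator
$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$

Autoregressive model order $p$, AR($p$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B) x_t = w_t$$

AR(1)
$$x_t = \phi^k (x_{t-k}) + \sum_{j=0}^{k-1} \phi^j (w_{t-j}) \overset{k \to \infty, |\phi| < 1}{=} \underbrace{\sum_{j=0}^{\infty} \phi^j (w_{t-j})}_{\text{linear process}}$$
$$E[x_t] = \sum_{j=0}^{\infty} \phi^j (E[w_{t-j}]) = 0$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2 \phi^h}{1 - \phi^2}$$
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h$$
$$\rho(h) = \phi \rho(h - 1) \qquad h = 1, 2, \ldots$$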
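The sample autocovariance and autocorrelation estimators, and the AR(1) result $\rho(h) = \phi^h$, can be sketched as follows. The function names, the seed, and the simulation settings are assumptions made for this illustration; the simulation starts the recursion at zero rather than from the stationary distribution, which only matters for the first few observations.

```python
import random


def sample_acov(x, h):
    """gamma_hat(h) = (1/n) * sum_{t=1}^{n-h} (x_{t+h} - xbar)(x_t - xbar)."""
    n = len(x)
    xbar = sum(x) / n
    h = abs(h)
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n


def sample_acf(x, h):
    """rho_hat(h) = gamma_hat(h) / gamma_hat(0)."""
    return sample_acov(x, h) / sample_acov(x, 0)


def simulate_ar1(phi, n, sigma_w=1.0, seed=0):
    """Simulate x_t = phi * x_{t-1} + w_t with Gaussian white noise w_t."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma_w)
        x.append(prev)
    return x


# Sample ACF of a simulated AR(1) path, to compare with phi^h.
phi = 0.6
path = simulate_ar1(phi, n=20000)
acf_lag1 = sample_acf(path, 1)   # should be close to phi
acf_lag2 = sample_acf(path, 2)   # should be close to phi ** 2
```

Note that the estimator divides by $n$ rather than $n - h$, as in the formula above.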

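Moving average polynomial
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C} \wedge \theta_q \neq 0$$

Moving average operator
$$\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$$

MA($q$) (moving average model order $q$)
$$x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B) w_t$$
$$E[x_t] = \sum_{j=0}^{q} \theta_j E[w_{t-j}] = 0$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases} \sigma_w^2 \sum_{j=0}^{q-h} \theta_j \theta_{j+h} & 0 \leq h \leq q \\ 0 & h > q \end{cases}$$

MA(1)
$$x_t = w_t + \theta w_{t-1}$$
$$\gamma(h) = \begin{cases} (1 + \theta^2) \sigma_w^2 & h = 0 \\ \theta \sigma_w^2 & h = 1 \\ 0 & h > 1 \end{cases}$$
$$\rho(h) = \begin{cases} \frac{\theta}{1 + \theta^2} & h = 1 \\ 0 & h > 1 \end{cases}$$

ARMA($p, q$)
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}$$
$$\phi(B) x_t = \theta(B) w_t$$

Partial autocorrelation function (PACF)
$x_i^{h-1} \triangleq$ regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \ldots, x_1\}$
$\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1}, x_0 - x_0^{h-1}) \quad h \geq 2$
E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$

ARIMA($p, d, q$)
$\nabla^d x_t = (1 - B)^d x_t$ is ARMA($p, q$)
$$\phi(B)(1 - B)^d x_t = \theta(B) w_t$$

Exponentially Weighted Moving Average (EWMA)
$$x_t = x_{t-1} + w_t - \lambda w_{t-1}$$
$$x_t = \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} x_{t-j} + w_t \qquad \text{when } |\lambda| < 1$$
$$\tilde{x}_{n+1} = (1 - \lambda) x_n + \lambda \tilde{x}_n$$

Seasonal ARIMA
Denoted by ARIMA($p, d, q$) $\times$ $(P, D, Q)_s$
$$\Phi_P(B^s)\, \phi(B)\, \nabla_s^D \nabla^d x_t = \delta + \Theta_Q(B^s)\, \theta(B)\, w_t$$

21.4.1 Causality and Invertibility

ARMA($p, q$) is causal (future-independent) $\iff \exists \{\psi_j\} : \sum_{j=0}^{\infty} |\psi_j| < \infty$ such that
$$x_t = \sum_{j=0}^{\infty} \psi_j w_{t-j} = \psi(B) w_t$$

ARMA($p, q$) is invertible $\iff \exists \{\pi_j\} : \sum_{j=0}^{\infty} |\pi_j| < \infty$ such that
$$\pi(B) x_t = \sum_{j=0}^{\infty} \pi_j x_{t-j} = w_t$$

Properties
ARMA($p, q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
$$\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \frac{\theta(z)}{\phi(z)} \qquad |z| \leq 1$$
ARMA($p, q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
$$\pi(z) = \sum_{j=0}^{\infty} \pi_j z^j = \frac{\phi(z)}{\theta(z)} \qquad |z| \leq 1$$

Behavior of the ACF and PACF for causal and invertible ARMA models

        AR(p)                  MA(q)                  ARMA(p, q)
ACF     tails off              cuts off after lag q   tails off
PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
$$x_t = A \cos(2\pi\omega t + \phi) = U_1 \cos(2\pi\omega t) + U_2 \sin(2\pi\omega t)$$
Frequency index $\omega$ (cycles per unit time), period $1/\omega$
Amplitude $A$
Phase $\phi$
$U_1 = A \cos\phi$ and $U_2 = -A \sin\phi$, often normally distributed rvs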
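The two forms of the periodic process can be checked numerically. The sketch below uses $U_1 = A \cos\phi$ and $U_2 = -A \sin\phi$ (the sign of $U_2$ follows from the cosine addition formula) and then recovers both coefficients by projecting the series onto the cosine and sine at the same frequency. The function names and the choice $\omega = j/n$ with integer $j$ are assumptions of this illustration.

```python
import math


def amplitude_phase_to_coeffs(A, phi):
    """U1 = A cos(phi), U2 = -A sin(phi)."""
    return A * math.cos(phi), -A * math.sin(phi)


def periodic_process(A, phi, j, n):
    """x_t = A cos(2 pi (j/n) t + phi) for t = 1, ..., n."""
    return [A * math.cos(2.0 * math.pi * j * t / n + phi) for t in range(1, n + 1)]


def project(x, j):
    """Return ((2/n) sum x_t cos(2 pi t j/n), (2/n) sum x_t sin(2 pi t j/n))."""
    n = len(x)
    c = 2.0 / n * sum(xt * math.cos(2.0 * math.pi * j * t / n)
                      for t, xt in enumerate(x, start=1))
    s = 2.0 / n * sum(xt * math.sin(2.0 * math.pi * j * t / n)
                      for t, xt in enumerate(x, start=1))
    return c, s


# Illustrative values: amplitude 2, phase 0.7, three cycles in n = 40 points.
A, phi, j, n = 2.0, 0.7, 3, 40
u1, u2 = amplitude_phase_to_coeffs(A, phi)
x = periodic_process(A, phi, j, n)
c, s = project(x, j)             # recovers (u1, u2) for 0 < j < n/2
squared_amplitude = c * c + s * s
```

The sum of the two squared projections equals $A^2$ here, which is why quantities of this form are used to measure the strength of a frequency component.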

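Periodic mixture
$$x_t = \sum_{k=1}^{q} \left( U_{k1} \cos(2\pi\omega_k t) + U_{k2} \sin(2\pi\omega_k t) \right)$$
$U_{k1}, U_{k2}$, for $k = 1, \ldots, q$, are independent zero-mean rvs with variances $\sigma_k^2$
$\gamma(h) = \sum_{k=1}^q \sigma_k^2 \cos(2\pi\omega_k h)$
$\gamma(0) = E[x_t^2] = \sum_{k=1}^q \sigma_k^2$

Spectral representation of a periodic process
$$\gamma(h) = \sigma^2 \cos(2\pi\omega_0 h) = \frac{\sigma^2}{2} e^{-2\pi i \omega_0 h} + \frac{\sigma^2}{2} e^{2\pi i \omega_0 h} = \int_{-1/2}^{1/2} e^{2\pi i \omega h}\,dF(\omega)$$

Spectral distribution function
$$F(\omega) = \begin{cases} 0 & \omega < -\omega_0 \\ \sigma^2/2 & -\omega_0 \leq \omega < \omega_0 \\ \sigma^2 & \omega \geq \omega_0 \end{cases}$$
$F(-\infty) = F(-1/2) = 0$
$F(\infty) = F(1/2) = \gamma(0)$

Spectral density
$$f(\omega) = \sum_{h=-\infty}^{\infty} \gamma(h) e^{-2\pi i \omega h} \qquad -\frac{1}{2} \leq \omega \leq \frac{1}{2}$$
Needs $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty \implies \gamma(h) = \int_{-1/2}^{1/2} e^{2\pi i \omega h} f(\omega)\,d\omega \quad h = 0, \pm 1, \ldots$
$f(\omega) \geq 0$
$f(\omega) = f(-\omega)$
$f(\omega) = f(1 - \omega)$
$\gamma(0) = V[x_t] = \int_{-1/2}^{1/2} f(\omega)\,d\omega$
White noise: $f_w(\omega) = \sigma_w^2$
ARMA($p, q$), $\phi(B) x_t = \theta(B) w_t$:
$$f_x(\omega) = \sigma_w^2 \frac{|\theta(e^{-2\pi i \omega})|^2}{|\phi(e^{-2\pi i \omega})|^2}$$
where $\phi(z) = 1 - \sum_{k=1}^p \phi_k z^k$ and $\theta(z) = 1 + \sum_{k=1}^q \theta_k z^k$

Discrete Fourier Transform (DFT)
$$d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t}$$

Fourier/Fundamental frequencies
$$\omega_j = j/n$$

Inverse DFT
$$x_t = n^{-1/2} \sum_{j=0}^{n-1} d(\omega_j) e^{2\pi i \omega_j t}$$

Periodogram
$$I(j/n) = |d(j/n)|^2$$

Scaled Periodogram
$$P(j/n) = \frac{4}{n} I(j/n) = \left( \frac{2}{n} \sum_{t=1}^n x_t \cos(2\pi t j/n) \right)^2 + \left( \frac{2}{n} \sum_{t=1}^n x_t \sin(2\pi t j/n) \right)^2$$

22 Math

22.1 Gamma Function

Ordinary: $\Gamma(s) = \int_0^\infty t^{s-1} e^{-t}\,dt$
Upper incomplete: $\Gamma(s, x) = \int_x^\infty t^{s-1} e^{-t}\,dt$
Lower incomplete: $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\,dt$
$\Gamma(\alpha + 1) = \alpha\Gamma(\alpha) \quad \alpha > 1$
$\Gamma(n) = (n - 1)! \quad n \in \mathbb{N}$
$\Gamma(1/2) = \sqrt{\pi}$

22.2 Beta Function

Ordinary: $B(x, y) = B(y, x) = \int_0^1 t^{x-1} (1 - t)^{y-1}\,dt = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x + y)}$
Incomplete: $B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1}\,dt$
Regularized incomplete:
$$I_x(a, b) = \frac{B(x; a, b)}{B(a, b)} \overset{a, b \in \mathbb{N}}{=} \sum_{j=a}^{a+b-1} \frac{(a + b - 1)!}{j!\,(a + b - 1 - j)!} x^j (1 - x)^{a+b-1-j}$$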
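For positive integers $a$ and $b$, the finite sum for the regularized incomplete beta function is easy to evaluate directly. The sketch below is a minimal illustration of that sum only (it does not handle non-integer arguments); the function name `reg_inc_beta` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def reg_inc_beta(x, a, b):
    """I_x(a, b) for positive integers a, b via the finite binomial sum.

    Uses (a+b-1)! / (j! (a+b-1-j)!) = C(a+b-1, j).
    """
    n = a + b - 1
    return sum(math.comb(n, j) * x ** j * (1.0 - x) ** (n - j)
               for j in range(a, n + 1))


# Illustrative evaluation: I_x(2, 3) and its mirror image I_{1-x}(3, 2).
value = reg_inc_beta(0.3, 2, 3)
mirrored = reg_inc_beta(0.7, 3, 2)
```

The sum is the probability that a Binomial($a + b - 1$, $x$) variable is at least $a$, which gives a quick sanity check on any value it returns.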

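$I_0(a, b) = 0$
$I_1(a, b) = 1$
$I_x(a, b) = 1 - I_{1-x}(b, a)$

22.3 Series

Finite
$$\sum_{k=1}^n k = \frac{n(n + 1)}{2}$$
$$\sum_{k=1}^n (2k - 1) = n^2$$
$$\sum_{k=1}^n k^2 = \frac{n(n + 1)(2n + 1)}{6}$$
$$\sum_{k=1}^n k^3 = \left( \frac{n(n + 1)}{2} \right)^2$$
$$\sum_{k=0}^n c^k = \frac{c^{n+1} - 1}{c - 1} \qquad c \neq 1$$

Binomial
$$\sum_{k=0}^n \binom{n}{k} = 2^n$$
$$\sum_{k=0}^n \binom{r + k}{k} = \binom{r + n + 1}{n}$$
$$\sum_{k=0}^n \binom{k}{m} = \binom{n + 1}{m + 1}$$
Vandermonde's Identity:
$$\sum_{k=0}^r \binom{m}{k} \binom{n}{r - k} = \binom{m + n}{r}$$
Binomial Theorem:
$$\sum_{k=0}^n \binom{n}{k} a^{n-k} b^k = (a + b)^n$$

Infinite
$$\sum_{k=0}^\infty p^k = \frac{1}{1 - p}, \qquad \sum_{k=1}^\infty p^k = \frac{p}{1 - p} \qquad |p| < 1$$
$$\sum_{k=0}^\infty k p^{k-1} = \frac{d}{dp} \left( \sum_{k=0}^\infty p^k \right) = \frac{d}{dp} \left( \frac{1}{1 - p} \right) = \frac{1}{(1 - p)^2} \qquad |p| < 1$$
$$\sum_{k=0}^\infty \binom{r + k - 1}{k} x^k = (1 - x)^{-r} \qquad r \in \mathbb{N}^+$$
$$\sum_{k=0}^\infty \binom{\alpha}{k} p^k = (1 + p)^\alpha \qquad |p| < 1, \alpha \in \mathbb{C}$$

22.4 Combinatorics

Sampling ($k$ out of $n$)
ordered, w/o replacement: $n^{\underline{k}} = \prod_{i=0}^{k-1} (n - i) = \frac{n!}{(n - k)!}$
ordered, w/ replacement: $n^k$
unordered, w/o replacement: $\binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!\,(n - k)!}$
unordered, w/ replacement: $\binom{n - 1 + k}{k} = \binom{n - 1 + k}{n - 1}$

Stirling numbers, 2nd kind
$$\left\{ {n \atop k} \right\} = k \left\{ {n - 1 \atop k} \right\} + \left\{ {n - 1 \atop k - 1} \right\} \qquad 1 \leq k \leq n$$
$$\left\{ {n \atop 0} \right\} = \begin{cases} 1 & n = 0 \\ 0 & \text{else} \end{cases}$$

Partitions
$$P_{n+k,k} = \sum_{i=1}^k P_{n,i}$$
$k > n : P_{n,k} = 0$
$n \geq 1 : P_{n,0} = 0, \quad P_{0,0} = 1$

Balls and Urns
$f : B \to U$, with $|B| = n$, $|U| = m$. $D$ = distinguishable, $\neg D$ = indistinguishable.

$B : D$, $U : D$
f arbitrary: $m^n$
f injective: $m^{\underline{n}}$ if $m \geq n$, 0 else
f surjective: $m! \left\{ {n \atop m} \right\}$
f bijective: $n!$ if $m = n$, 0 else

$B : \neg D$, $U : D$
f arbitrary: $\binom{m + n - 1}{n}$
f injective: $\binom{m}{n}$
f surjective: $\binom{n - 1}{m - 1}$
f bijective: 1 if $m = n$, 0 else

$B : D$, $U : \neg D$
f arbitrary: $\sum_{k=1}^m \left\{ {n \atop k} \right\}$
f injective: 1 if $m \geq n$, 0 else
f surjective: $\left\{ {n \atop m} \right\}$
f bijective: 1 if $m = n$, 0 else

$B : \neg D$, $U : \neg D$
f arbitrary: $\sum_{k=1}^m P_{n,k}$
f injective: 1 if $m \geq n$, 0 else
f surjective: $P_{n,m}$
f bijective: 1 if $m = n$, 0 else

References

[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole, 1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.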

Univariate distribution relationships, courtesy Leemis and McQueston [2].