Stat Cookbook
Version 0.2.1
14th July, 2016
http://statistics.zone/
Copyright © Matthias Vallentin, 2016
¹ We use the notation Γ(s, x) and Γ(x) to refer to the Gamma functions (see 22.1), and use B(x, y) and I_x to refer to the Beta functions (see 22.2).
[Figure: PMFs (top) and CDFs (bottom) of the discrete distributions — Uniform (discrete), Binomial (n = 40, p = 0.3; n = 30, p = 0.6; n = 25, p = 0.9), Geometric (p = 0.2, 0.5, 0.8), and Poisson (λ = 1, 4, 10).]
1.2 Continuous Distributions

For each distribution: notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s).

Uniform Unif(a, b)
  F(x) = 0 for x < a; (x − a)/(b − a) for a ≤ x ≤ b; 1 for x > b
  f(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2,  V[X] = (b − a)²/12,  M(s) = (e^{sb} − e^{sa})/(s(b − a))

Normal N(μ, σ²)
  F(x) = Φ((x − μ)/σ), where Φ(x) = ∫_{−∞}^x φ(t) dt
  f(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
  E[X] = μ,  V[X] = σ²,  M(s) = exp(μs + σ²s²/2)

Log-Normal ln N(μ, σ²)
  F(x) = 1/2 + (1/2) erf((ln x − μ)/√(2σ²))
  f(x) = (1/(x√(2πσ²))) exp(−(ln x − μ)²/(2σ²))
  E[X] = e^{μ + σ²/2},  V[X] = (e^{σ²} − 1) e^{2μ + σ²}

Multivariate Normal MVN(μ, Σ)
  f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ))
  E[X] = μ,  V[X] = Σ,  M(s) = exp(μ^T s + (1/2) s^T Σ s)

Student's t  Student(ν)
  F(x) = 1 − (1/2) I_{ν/(ν + x²)}(ν/2, 1/2) for x > 0
  f(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1),  V[X] = ν/(ν − 2) (ν > 2), ∞ for 1 < ν ≤ 2

Chi-square χ²_k
  F(x) = γ(k/2, x/2)/Γ(k/2)
  f(x) = x^{k/2 − 1} e^{−x/2}/(2^{k/2} Γ(k/2))
  E[X] = k,  V[X] = 2k,  M(s) = (1 − 2s)^{−k/2} (s < 1/2)

F  F(d₁, d₂)
  F(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f(x) = √((d₁x)^{d₁} d₂^{d₂}/(d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2),  V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential Exp(β)¹
  F(x) = 1 − e^{−x/β},  f(x) = (1/β) e^{−x/β}
  E[X] = β,  V[X] = β²,  M(s) = 1/(1 − βs) (s < 1/β)

Gamma Gamma(α, β)¹
  F(x) = γ(α, x/β)/Γ(α),  f(x) = x^{α−1} e^{−x/β}/(Γ(α) β^α)
  E[X] = αβ,  V[X] = αβ²,  M(s) = (1/(1 − βs))^α (s < 1/β)

Inverse Gamma InvGamma(α, β)
  F(x) = Γ(α, β/x)/Γ(α),  f(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1),  V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet Dir(α)
  f(x) = (Γ(Σ_{i=1}^k α_i)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k x_i^{α_i − 1}
  E[X_i] = α_i/Σ_j α_j,  V[X_i] = E[X_i](1 − E[X_i])/(Σ_j α_j + 1)

Beta Beta(α, β)
  F(x) = I_x(α, β),  f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β),  V[X] = αβ/((α + β)²(α + β + 1))
  M(s) = 1 + Σ_{k=1}^∞ (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull Weibull(λ, k)
  F(x) = 1 − e^{−(x/λ)^k},  f(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λ Γ(1 + 1/k),  V[X] = λ² Γ(1 + 2/k) − (E[X])²
  M(s) = Σ_{n=0}^∞ (s^n λ^n/n!) Γ(1 + n/k)

Pareto Pareto(x_m, α)
  F(x) = 1 − (x_m/x)^α (x ≥ x_m),  f(x) = α x_m^α/x^{α+1} (x ≥ x_m)
  E[X] = αx_m/(α − 1) (α > 1),  V[X] = x_m² α/((α − 1)²(α − 2)) (α > 2)
  M(s) = α(−x_m s)^α Γ(−α, −x_m s) (s < 0)

¹ We use the rate parameterization where λ = 1/β. Some textbooks use λ as scale parameter instead [6].
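As a quick sanity check of the Exponential row (E[X] = β, V[X] = β²), here is a minimal Monte Carlo sketch in Python, using only the standard library; the choice β = 2 and the sample size are purely illustrative:

```python
import random
import statistics

# For X ~ Exp(beta) with scale beta, the table gives E[X] = beta and
# V[X] = beta^2; expovariate takes the rate lambda = 1/beta.
random.seed(42)
beta = 2.0
samples = [random.expovariate(1.0 / beta) for _ in range(200_000)]

mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
print(round(mean, 2), round(var, 2))  # close to 2.0 and 4.0
```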
[Figure: PDFs of the continuous distributions — Uniform (continuous); Normal (μ = 0 with σ² = 0.2, 1, 5; μ = −2, σ² = 0.5); Log-Normal (various μ, σ²); Student's t (ν = 1, 2, 5, ∞); χ²_k (k = 1, …, 5); F (various d₁, d₂); Exponential (β = 0.5, 1, 2.5); Gamma (various α, β); Inverse Gamma; Beta; Weibull; Pareto.]

[Figure: the corresponding CDFs for the same distributions and parameter choices.]
2 Probability Theory

Definitions
  Sample space Ω
  Probability distribution P:
    1. P[A] ≥ 0 for every event A
    2. P[Ω] = 1
    3. P[⊔_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
  Probability space (Ω, A, P)

Properties
  P[∅] = 0
  B = Ω ∩ B = (A ∪ A^c) ∩ B = (A ∩ B) ∪ (A^c ∩ B)
  P[A^c] = 1 − P[A]
  P[B] = P[A ∩ B] + P[A^c ∩ B]
  (∪_n A_n)^c = ∩_n A_n^c,  (∩_n A_n)^c = ∪_n A_n^c  (DeMorgan)
  P[∪_n A_n] = 1 − P[∩_n A_n^c]
  P[A ∪ B] = P[A] + P[B] − P[A ∩ B]
  P[A ∩ B] ≥ P[A] + P[B] − 1
  P[A ∪ B] = P[A ∩ B^c] + P[A^c ∩ B] + P[A ∩ B]
  P[A ∩ B^c] = P[A] − P[A ∩ B]

Continuity of Probabilities
  A₁ ⊂ A₂ ⊂ … ⇒ lim_{n→∞} P[A_n] = P[A], where A = ∪_{i=1}^∞ A_i
  A₁ ⊃ A₂ ⊃ … ⇒ lim_{n→∞} P[A_n] = P[A], where A = ∩_{i=1}^∞ A_i

Independence
  A ⊥ B ⇔ P[A ∩ B] = P[A] P[B]

Conditional Probability
  P[A | B] = P[A ∩ B]/P[B], provided P[B] > 0

Law of Total Probability
  P[B] = Σ_{i=1}^n P[B | A_i] P[A_i], where Ω = ⊔_{i=1}^n A_i

3 Random Variables

Random Variable (RV)
  X : Ω → R

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[{ω : X(ω) = x}]

Probability Density Function (PDF)
  P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF)
  F_X : R → [0, 1],  F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

  P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy,  a ≤ b
  f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
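The identity P[A ∪ B] = P[A] + P[B] − P[A ∩ B] can be checked by brute-force enumeration on a finite sample space; the two dice events below are an illustrative choice:

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered rolls of two fair dice, uniform measure.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))

A = {w for w in omega if w[0] % 2 == 0}  # first die is even
B = {w for w in omega if sum(w) == 7}    # sum is seven

lhs = p * len(A | B)
rhs = p * len(A) + p * len(B) - p * len(A & B)
assert lhs == rhs
print(lhs)  # 7/12
```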
3.1 Transformations

Transformation function
  Z = φ(X)

Discrete
  f_Z(z) = P[φ(X) = z] = P[{x : φ(x) = z}] = P[X ∈ φ^{−1}(z)] = Σ_{x ∈ φ^{−1}(z)} f_X(x)

Expectation
  E[XY] = ∬ xy dF_{X,Y}(x, y)
  E[φ(X)] ≠ φ(E[X]) in general (cf. Jensen inequality)
  P[X ≥ Y] = 1 ⇒ E[X] ≥ E[Y]
  P[X = Y] = 1 ⇒ E[X] = E[Y]
  E[X] = Σ_{x=1}^∞ P[X ≥ x]  (X nonnegative and integer-valued)

Probability Inequalities
  Cauchy–Schwarz: E[XY]² ≤ E[X²] E[Y²]
  Markov: P[φ(X) ≥ t] ≤ E[φ(X)]/t
  Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X]/t²
  Chernoff: P[X ≥ (1 + δ)μ] ≤ (e^δ/(1 + δ)^{1+δ})^μ,  δ > −1
  Hoeffding: X₁, …, X_n independent with P[X_i ∈ [a_i, b_i]] = 1, 1 ≤ i ≤ n:
    P[X̄ − E[X̄] ≥ t] ≤ e^{−2nt²},  t > 0
    P[|X̄ − E[X̄]| ≥ t] ≤ 2 exp(−2n²t²/Σ_{i=1}^n (b_i − a_i)²),  t > 0
  Jensen: E[φ(X)] ≥ φ(E[X]) for convex φ

Distribution Relationships

Exponential
  X_i ~ Exp(β), X_i ⊥ X_j ⇒ Σ_{i=1}^n X_i ~ Gamma(n, β)
  Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
  X ~ N(μ, σ²) ⇒ (X − μ)/σ ~ N(0, 1)
  X ~ N(μ, σ²), Z = aX + b ⇒ Z ~ N(aμ + b, a²σ²)
  X_i ~ N(μ_i, σ_i²), X_i ⊥ X_j ⇒ Σ_i X_i ~ N(Σ_i μ_i, Σ_i σ_i²)
  P[a < X ≤ b] = Φ((b − μ)/σ) − Φ((a − μ)/σ)
  Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
  Upper quantile of N(0, 1): z_α = Φ^{−1}(1 − α)

Gamma
  X ~ Gamma(α, β) ⇒ X/β ~ Gamma(α, 1)
  Gamma(α, β) ~ Σ_{i=1}^α Exp(β)  (integer α)
  X_i ~ Gamma(α_i, β), X_i ⊥ X_j ⇒ Σ_i X_i ~ Gamma(Σ_i α_i, β)
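The inequalities above are bounds, not approximations; a short empirical comparison for Chebyshev (illustrative distribution and thresholds, standard library only):

```python
import random

# Chebyshev: P[|X - E[X]| >= t] <= V[X]/t^2, checked on Unif(0,1)
# samples, which have mean 1/2 and variance 1/12.
random.seed(0)
xs = [random.random() for _ in range(100_000)]
mu, var = 0.5, 1.0 / 12.0

for t in (0.2, 0.3, 0.4):
    freq = sum(abs(x - mu) >= t for x in xs) / len(xs)
    bound = var / t**2
    assert freq <= bound  # the bound holds, usually loosely
    print(t, round(freq, 3), round(bound, 3))
```

For Unif(0, 1) the exact tail probability is 1 − 2t, so the bound is far from tight — which is the point: it holds for any distribution with finite variance.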
  Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

Beta
  f(x) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
  Beta(1, 1) ~ Unif(0, 1)

8 Probability and Moment Generating Functions

  G_X(t) = E[t^X],  |t| < 1
  M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ E[X^i] t^i/i!
  P[X = 0] = G_X(0)
  P[X = 1] = G′_X(0)
  P[X = i] = G_X^{(i)}(0)/i!
  E[X] = G′_X(1⁻)
  E[X^k] = M_X^{(k)}(0)
  E[X!/(X − k)!] = G_X^{(k)}(1⁻)
  V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
  G_X(t) = G_Y(t) ⇒ X =_d Y

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ~ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.

Joint density
  f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals
  (Y | X = x) ~ N(ρx, 1 − ρ²)  and  (X | Y = y) ~ N(ρy, 1 − ρ²)

Independence
  X ⊥ Y ⇔ ρ = 0

9.2 Bivariate Normal

Let X ~ N(μ_x, σ_x²) and Y ~ N(μ_y, σ_y²).
  f(x, y) = (1/(2πσ_x σ_y √(1 − ρ²))) exp(−z/(2(1 − ρ²)))
  z = ((x − μ_x)/σ_x)² + ((y − μ_y)/σ_y)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y)

Conditional mean and variance
  E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
  V[X | Y] = σ_X √(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ^{−1})
  Σ = [ V[X₁] ⋯ Cov[X₁, X_k] ; ⋮ ⋱ ⋮ ; Cov[X_k, X₁] ⋯ V[X_k] ]

If X ~ N(μ, Σ),
  f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1}(x − μ))

Properties
  Z ~ N(0, 1), X = μ + Σ^{1/2} Z ⇒ X ~ N(μ, Σ)
  X ~ N(μ, Σ) ⇒ Σ^{−1/2}(X − μ) ~ N(0, 1)
  X ~ N(μ, Σ) ⇒ AX ~ N(Aμ, AΣA^T)
  X ~ N(μ, Σ), ‖a‖ = k ⇒ a^T X ~ N(a^T μ, a^T Σ a)

10 Convergence
Let {X₁, X₂, …} be a sequence of rvs and let X be another rv. Let F_n denote the cdf of X_n and let F denote the cdf of X.

Types of Convergence
  1. In distribution (weakly, in law): X_n →D X
     lim_{n→∞} F_n(t) = F(t) at all t where F is continuous
  2. In probability: X_n →P X
     (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
  3. Almost surely (strongly): X_n →as X
     P[lim_{n→∞} X_n = X] = P[{ω : lim_{n→∞} X_n(ω) = X(ω)}] = 1
  4. In quadratic mean (L²): X_n →qm X
     lim_{n→∞} E[(X_n − X)²] = 0

Relationships
  X_n →qm X ⇒ X_n →P X ⇒ X_n →D X
  X_n →as X ⇒ X_n →P X
  X_n →D X and (∃c ∈ R) P[X = c] = 1 ⇒ X_n →P X
  X_n →P X ∧ Y_n →P Y ⇒ X_n + Y_n →P X + Y
  X_n →qm X ∧ Y_n →qm Y ⇒ X_n + Y_n →qm X + Y
  X_n →P X ∧ Y_n →P Y ⇒ X_n Y_n →P XY
  X_n →P X ⇒ φ(X_n) →P φ(X)
  X_n →D X ⇒ φ(X_n) →D φ(X)
  X_n →qm b ⇔ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
  X₁, …, X_n iid ∧ E[X] = μ ∧ V[X] < ∞ ⇒ X̄_n →qm μ

Slutsky's Theorem
  X_n →D X and Y_n →P c ⇒ X_n + Y_n →D X + c
  X_n →D X and Y_n →P c ⇒ X_n Y_n →D cX
  In general: X_n →D X and Y_n →D Y ⇏ X_n + Y_n →D X + Y

Central Limit Theorem (CLT)
Let X₁, …, X_n be iid with E[X] = μ and V[X] = σ². Then
  Z_n := (X̄_n − μ)/√(V[X̄_n]) = √n(X̄_n − μ)/σ →D Z, where Z ~ N(0, 1)
  lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ R

CLT notations
  Z_n ≈ N(0, 1)
  X̄_n ≈ N(μ, σ²/n)
  X̄_n − μ ≈ N(0, σ²/n)
  √n(X̄_n − μ) ≈ N(0, σ²)
  √n(X̄_n − μ)/σ ≈ N(0, 1)

Continuity correction
  P[X̄_n ≤ x] ≈ Φ((x + 1/2 − μ)/(σ/√n))
  P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − μ)/(σ/√n))

Delta method
  Y_n ≈ N(μ, σ²/n) ⇒ φ(Y_n) ≈ N(φ(μ), (φ′(μ))² σ²/n)
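The CLT can be watched in action with a small simulation: standardized means of uniforms should behave like N(0, 1), so about 68.3% of them land within one standard deviation of zero. Sample sizes below are an illustrative choice:

```python
import math
import random

# Standardize means of n Unif(0,1) draws: mu = 1/2, sigma = sqrt(1/12).
random.seed(1)
n, reps = 30, 20_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)

zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)

inside = sum(abs(z) <= 1 for z in zs) / reps
print(round(inside, 2))  # close to 0.68
```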
Likelihood ratio test
  H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁
  The approximate size-α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}.

Pearson Chi-square Test
  T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j], where E[X_j] = np⁰_j under H₀
  T →D χ²_{k−1}
  p-value = P[χ²_{k−1} > T(x)]
  T →D χ²_{k−1} faster than the LRT statistic, hence preferable for small n.

Independence testing
  I rows, J columns, X multinomial sample of size n = I · J
  MLEs unconstrained: p̂_{ij} = X_{ij}/n
  MLEs under H₀: p̂⁰_{ij} = p̂_{i·} p̂_{·j} = (X_{i·}/n)(X_{·j}/n)
  LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_{ij} log(nX_{ij}/(X_{i·}X_{·j}))
  Pearson χ²: T = Σ_{i=1}^I Σ_{j=1}^J (X_{ij} − E[X_{ij}])²/E[X_{ij}]
  Both LRT and Pearson →D χ²_ν, where ν = (I − 1)(J − 1)

14 Exponential Family

Scalar parameter
  f_X(x | θ) = h(x) exp(η(θ) T(x) − A(θ))

15 Bayesian Inference

Bayes' Theorem
  f(θ | x) = f(x | θ)f(θ)/f(xⁿ) = f(x | θ)f(θ)/∫ f(x | θ)f(θ) dθ ∝ L_n(θ)f(θ)

Definitions
  Xⁿ = (X₁, …, X_n),  xⁿ = (x₁, …, x_n)
  Prior density f(θ)
  Likelihood f(xⁿ | θ): joint density of the data
    In particular, Xⁿ iid ⇒ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
  Posterior density f(θ | xⁿ)
  Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
  Kernel: part of a density that depends on θ
  Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ)f(θ) dθ / ∫ L_n(θ)f(θ) dθ

15.1 Credible Intervals

Posterior interval
  P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

Types of priors
  Flat: f(θ) ∝ constant
  Proper: ∫ f(θ) dθ = 1
  Improper: ∫ f(θ) dθ = ∞
  Jeffreys' prior (transformation-invariant): f(θ) ∝ √(I(θ)); multiparameter: f(θ) ∝ √(det(I(θ)))

Regression

Under the assumption of Normality, the least squares estimator is also the MLE, but the least squares variance estimator is not the MLE.
  σ̂² = (1/n) Σ_{i=1}^n r̂_i²

Estimate regression function
  r̂(x) = Σ_{j=1}^k β̂_j x_j

Training error
  R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²
  bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²
  R²(S) = 1 − ((n − 1)/(n − k)) (rss/tss)

Mallows' C_p statistic
  R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

19 Non-parametric Function Estimation

Frequentist risk
  R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
  b(x) = E[f̂_n(x)] − f(x)
  v(x) = V[f̂_n(x)]
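As a concrete instance of Bayes' theorem and the posterior mean above: with a Beta(a, b) prior on a Bernoulli parameter θ and k successes in n trials, the posterior is Beta(a + k, b + n − k), so no integration is needed. The prior and data below are illustrative:

```python
from fractions import Fraction

# Beta-Binomial conjugacy: posterior mean (a + k) / (a + b + n),
# computed exactly with Fraction.
a, b = 2, 2    # prior pseudo-counts (illustrative choice)
n, k = 10, 7   # observed trials and successes

post_a, post_b = a + k, b + (n - k)
post_mean = Fraction(post_a, post_a + post_b)
print(post_mean)  # 9/14
```

The posterior mean 9/14 sits between the prior mean 1/2 and the MLE 7/10, which is the usual shrinkage behavior of conjugate updates.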
19.1.1 Histograms

Definitions
  Number of bins m
  Binwidth h = 1/m
  Bin B_j has ν_j observations
  Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator
  f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)
  E[f̂_n(x)] = p_j/h
  V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
  R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
  h* = (6/∫ (f′(u))² du)^{1/3} n^{−1/3}
  R*(f̂_n, f) ≈ C/n^{2/3},  C = (3/4)^{2/3} (∫ (f′(u))² du)^{1/3}

Cross-validation estimate of E[J(h)]
  Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)
          = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²

19.1.2 Kernel Density Estimator (KDE)
  f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)
  R(f, f̂_n) ≈ (σ_K⁴ h⁴/4) ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
  h* = c₁^{−2/5} c₂^{1/5} c₃^{−1/5} n^{−1/5},
    c₁ = σ_K²,  c₂ = ∫ K²(x) dx,  c₃ = ∫ (f″(x))² dx
  R*(f, f̂_n) = c₄/n^{4/5},
    c₄ = (5/4)(σ_K²)^{2/5} (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5}

Epanechnikov Kernel
  K(x) = (3/(4√5))(1 − x²/5) for |x| < √5, and 0 otherwise

Cross-validation estimate of E[J(h)]
  Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_{(−i)}(X_i)
          ≈ (1/(hn²)) Σ_{i=1}^n Σ_{j=1}^n K*((X_i − X_j)/h) + (2/(nh)) K(0)
  K*(x) = K^{(2)}(x) − 2K(x),  K^{(2)}(x) = ∫ K(x − y)K(y) dy

19.2 Non-parametric Regression

Estimate f(x), where f(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n).

Least squares estimator
  β̂ = (Φ^T Φ)^{−1} Φ^T Y ≈ (1/n) Φ^T Y  (for equally spaced observations only)

Cross-validation estimate of E[J(h)]
  R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_j φ_j(x_i) β̂_{j,(−i)})²

20.1 Markov Chains
  Chapman–Kolmogorov: P_{m+n} = P_m P_n
  P_n = P · P_{n−1} = P^n
  Marginal probability: μ_n = (μ_n(1), …, μ_n(N)), where μ_n(i) = P[X_n = i]
  μ₀: initial distribution
  μ_n = μ₀ P^n
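The kernel density estimator f̂_n(x) = (1/n) Σ_i (1/h) K((x − X_i)/h) fits in a few lines; the data, bandwidth, and Gaussian kernel below are illustrative choices:

```python
import math

def kde(x, data, h):
    """Gaussian-kernel density estimate at x."""
    K = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

data = [0.2, 0.5, 0.9, 1.1, 1.4]
h = 0.3

# f_hat is itself a density: a Riemann sum over a wide grid gives ~ 1.
grid = [i * 0.01 for i in range(-300, 500)]
area = sum(kde(x, data, h) for x in grid) * 0.01
print(round(area, 3))  # ~ 1.0
```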
20.2 Poisson Processes

Poisson process
  {X_t : t ∈ [0, ∞)} = number of events up to and including time t
  X₀ = 0
  Independent increments:
    ∀t₀ < ⋯ < t_n : X_{t₁} − X_{t₀} ⊥ ⋯ ⊥ X_{t_n} − X_{t_{n−1}}
  Intensity function λ(t)
    P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
    P[X_{t+h} − X_t = 2] = o(h)
  X_{s+t} − X_s ~ Po(m(s + t) − m(s)), where m(t) = ∫_0^t λ(s) ds

Homogeneous Poisson process
  λ(t) ≡ λ > 0 ⇒ X_t ~ Po(λt)

Waiting times
  W_t := time at which X_t occurs
  W_t ~ Gamma(t, 1/λ)

Interarrival times
  S_t = W_{t+1} − W_t
  S_t ~ Exp(1/λ)

21 Time Series

Mean function
  μ_{xt} = E[x_t] = ∫_{−∞}^∞ x f_t(x) dx

Autocovariance function
  γ_x(s, t) = E[(x_s − μ_s)(x_t − μ_t)] = E[x_s x_t] − μ_s μ_t
  γ_x(t, t) = E[(x_t − μ_t)²] = V[x_t]

Autocorrelation function (ACF)
  ρ(s, t) = Cov[x_s, x_t]/√(V[x_s] V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)
  γ_{xy}(s, t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]

Cross-correlation function (CCF)
  ρ_{xy}(s, t) = γ_{xy}(s, t)/√(γ_x(s, s)γ_y(t, t))

Backshift operator
  B^k(x_t) = x_{t−k}

Difference operator
  ∇^d = (1 − B)^d

White noise
  w_t ~ wn(0, σ_w²)
  Gaussian: w_t iid ~ N(0, σ_w²)
  E[w_t] = 0, ∀t ∈ T
  V[w_t] = σ_w², ∀t ∈ T
  γ_w(s, t) = 0, s ≠ t, s, t ∈ T

Random walk
  Drift δ
  x_t = δt + Σ_{j=1}^t w_j
  E[x_t] = δt

Symmetric moving average
  m_t = Σ_{j=−k}^k a_j x_{t−j}, where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^k a_j = 1
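A homogeneous Poisson process can be simulated by accumulating Exp(1/λ) interarrival times, as the definitions above suggest; the rate and horizon below are illustrative:

```python
import random

# Count events by time t when interarrivals are Exp(1/lam);
# the count should average lam * t.
random.seed(7)
lam, t, reps = 2.0, 5.0, 10_000

counts = []
for _ in range(reps):
    clock, n = 0.0, 0
    while True:
        clock += random.expovariate(lam)  # interarrival ~ Exp(1/lam)
        if clock > t:
            break
        n += 1
    counts.append(n)

avg = sum(counts) / reps
print(round(avg, 1))  # close to lam * t = 10
```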
21.1 Stationary Time Series

Strictly stationary
  P[x_{t₁} ≤ c₁, …, x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, …, x_{t_k+h} ≤ c_k],
  ∀k ∈ N, t_k, c_k, h ∈ Z

Weakly stationary
  E[x_t²] < ∞, ∀t ∈ Z
  E[x_t] = m, ∀t ∈ Z
  γ_x(s, t) = γ_x(s + r, t + r), ∀r, s, t ∈ Z
  γ(h) = E[(x_{t+h} − μ)(x_t − μ)], h ∈ Z
  γ(0) = E[(x_t − μ)²]
  γ(0) ≥ 0,  γ(0) ≥ |γ(h)|,  γ(h) = γ(−h)

Autocorrelation function (ACF)
  ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}] V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

21.2 Estimation of Correlation

Sample mean
  x̄ = (1/n) Σ_{t=1}^n x_t

Sample variance
  V[x̄] = (1/n) Σ_{h=−n}^n (1 − |h|/n) γ_x(h)

Sample autocovariance function
  γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function
  ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function
  γ̂_{xy}(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function
  ρ̂_{xy}(h) = γ̂_{xy}(h)/√(γ̂_x(0)γ̂_y(0))
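The sample autocovariance γ̂(h) and sample ACF ρ̂(h) translate directly into code; the short periodic series below is an illustrative example:

```python
# gamma_hat(h) = (1/n) * sum_{t} (x_{t+h} - xbar)(x_t - xbar),
# rho_hat(h) = gamma_hat(h) / gamma_hat(0).
def acov(x, h):
    n = len(x)
    xbar = sum(x) / n
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

def acf(x, h):
    return acov(x, h) / acov(x, 0)

# A period-4 series should show strong correlation at lag 4.
x = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]
print(acf(x, 0), acf(x, 4))  # 1.0 0.5
```

Note ρ̂(0) = 1 by construction, and the 1/n (not 1/(n − h)) normalization keeps the estimated autocovariance sequence nonnegative definite.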
Jointly stationary time series
  γ_{xy}(h) = E[(x_{t+h} − μ_x)(y_t − μ_y)]
  ρ_{xy}(h) = γ_{xy}(h)/√(γ_x(0)γ_y(0))

ARMA models
  ARMA(p, q) causal ⇔ roots of φ(z) lie outside the unit circle
    ψ(z) = Σ_{j=0}^∞ ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
  ARMA(p, q) invertible ⇔ roots of θ(z) lie outside the unit circle
    π(z) = Σ_{j=0}^∞ π_j z^j = φ(z)/θ(z),  |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models:
          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process
  x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)

Spectral distribution function
  F(ω) = 0 for ω < −ω₀;  σ²/2 for −ω₀ ≤ ω < ω₀;  σ² for ω ≥ ω₀
  F(−∞) = F(−1/2) = 0
  F(∞) = F(1/2) = γ(0)

Spectral density
  f(ω) = Σ_{h=−∞}^∞ γ(h) e^{−2πiωh},  −1/2 ≤ ω ≤ 1/2
  Needs Σ_{h=−∞}^∞ |γ(h)| < ∞ ⇒ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,  h = 0, ±1, ±2, …
  f(ω) ≥ 0,  f(ω) = f(−ω),  f(ω) = f(1 − ω)
  γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
  White noise: f_w(ω) = σ_w²
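The DFT-based periodogram (defined below) conserves energy: Σ_j |d(ω_j)|² equals Σ_t x_t², which makes a compact sanity check. The short series is illustrative:

```python
import cmath
import math

# d(w_k) = n^(-1/2) * sum_t x_t exp(-2*pi*i*k*t/n); Parseval's relation
# says sum_k |d(w_k)|^2 = sum_t x_t^2.
x = [2.0, -1.0, 3.0, 0.5, -2.0, 1.0]
n = len(x)

def d(k):
    return sum(xt * cmath.exp(-2 * math.pi * 1j * k * t / n)
               for t, xt in enumerate(x)) / math.sqrt(n)

periodogram = [abs(d(k)) ** 2 for k in range(n)]
energy = sum(xt * xt for xt in x)
print(round(sum(periodogram), 6), round(energy, 6))  # both ~ 19.25
```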
ARMA(p, q), φ(B)x_t = θ(B)w_t:
  f_x(ω) = σ_w² |θ(e^{−2πiω})|²/|φ(e^{−2πiω})|²
  where φ(z) = 1 − Σ_{k=1}^p φ_k z^k and θ(z) = 1 + Σ_{k=1}^q θ_k z^k

Discrete Fourier Transform (DFT)
  d(ω_j) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiω_j t}

Fourier/Fundamental frequencies
  ω_j = j/n

Inverse DFT
  x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}

Periodogram
  I(j/n) = |d(j/n)|²

Scaled periodogram
  P(j/n) = (4/n) I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Gamma Function
  Ordinary: Γ(s) = ∫_0^∞ t^{s−1} e^{−t} dt
  Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1} e^{−t} dt
  Lower incomplete: γ(s, x) = ∫_0^x t^{s−1} e^{−t} dt

22.2 Beta Function
  Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
  Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
  Regularized incomplete:
    I_x(a, b) = B(x; a, b)/B(a, b)
      = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j (1 − x)^{a+b−1−j}  (a, b ∈ N)
    I₀(a, b) = 0,  I₁(a, b) = 1
    I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series

Finite
  Σ_{k=1}^n k = n(n + 1)/2
  Σ_{k=1}^n (2k − 1) = n²
  Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
  Σ_{k=1}^n k³ = (n(n + 1)/2)²
  Σ_{k=0}^n c^k = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial
  Σ_{k=0}^n C(n, k) = 2ⁿ
  Σ_{k=0}^n C(r + k, k) = C(r + n + 1, n)
  Σ_{k=0}^n C(k, m) = C(n + 1, m + 1)
  Vandermonde's identity: Σ_{k=0}^r C(m, k) C(n, r − k) = C(m + n, r)
  Binomial theorem: Σ_{k=0}^n C(n, k) a^{n−k} b^k = (a + b)ⁿ

Partitions
  P_{n+k,k} = Σ_{i=1}^k P_{n,i};  P_{n,k} = 0 for k > n;  P_{n,0} = 0 for n ≥ 1;  P_{0,0} = 1
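For integer a and b the regularized incomplete Beta function reduces to the finite binomial sum above, which is easy to evaluate directly; the test values are illustrative:

```python
from math import comb

def reg_inc_beta(x, a, b):
    """I_x(a, b) for integer a, b via the finite binomial sum."""
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(a, m + 1))

# Checks: I_x(1, 1) = x, symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
print(reg_inc_beta(0.3, 1, 1))                                      # 0.3
print(round(reg_inc_beta(0.3, 2, 5) + reg_inc_beta(0.7, 5, 2), 6))  # 1.0
```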
References
[1] P. G. Hoel, S. C. Port, and C. J. Stone. Introduction to Probability Theory. Brooks Cole,
1972.
[2] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American
Statistician, 62(1):45–53, 2008.
Univariate distribution relationships, courtesy Leemis and McQueston [2].