Cookbook
Copyright © Matthias Vallentin, 2011
vallentin@icir.org
[Figure: PMFs of the discrete distributions. Panels: Uniform on {a, ..., b}, Binomial (n = 30, p = 0.6), Geometric (p = 0.5), Poisson (λ = 4); each panel plots PMF against x.]
¹ We use the notation $\Gamma(s, x)$ and $\Gamma(x)$ to refer to the Gamma functions (see §22.1), and use $B(x, y)$ and $I_x$ to refer to the Beta functions (see §22.2).
1.2 Continuous Distributions
For each distribution: notation, CDF $F_X(x)$, PDF $f_X(x)$, mean $E[X]$, variance $V[X]$, and MGF $M_X(s)$.

Uniform $\mathrm{Unif}(a, b)$:
$$F_X(x) = \begin{cases}0 & x < a\\ \frac{x-a}{b-a} & a < x < b\\ 1 & x > b\end{cases} \qquad f_X(x) = \frac{I(a < x < b)}{b - a}$$
$$E[X] = \frac{a+b}{2} \qquad V[X] = \frac{(b-a)^2}{12} \qquad M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}$$

Normal $\mathcal{N}(\mu, \sigma^2)$:
$$F_X(x) = \Phi(x) = \int_{-\infty}^x \phi(t)\,dt \qquad f_X(x) = \phi(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
$$E[X] = \mu \qquad V[X] = \sigma^2 \qquad M_X(s) = \exp\left(\mu s + \frac{\sigma^2 s^2}{2}\right)$$

Log-Normal $\ln\mathcal{N}(\mu, \sigma^2)$:
$$F_X(x) = \frac12 + \frac12\,\mathrm{erf}\!\left[\frac{\ln x - \mu}{\sqrt{2\sigma^2}}\right] \qquad f_X(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$
$$E[X] = e^{\mu + \sigma^2/2} \qquad V[X] = (e^{\sigma^2} - 1)\,e^{2\mu + \sigma^2}$$

Multivariate Normal $\mathrm{MVN}(\mu, \Sigma)$:
$$f_X(x) = (2\pi)^{-k/2}|\Sigma|^{-1/2}\,e^{-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)}$$
$$E[X] = \mu \qquad V[X] = \Sigma \qquad M_X(s) = \exp\left(\mu^T s + \frac12 s^T\Sigma s\right)$$

Student's t $\mathrm{Student}(\nu)$:
$$F_X(x) = I_x\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right) \qquad f_X(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$
$$E[X] = 0 \ (\nu > 1) \qquad V[X] = \frac{\nu}{\nu - 2} \ (\nu > 2)$$

Chi-square $\chi^2_k$:
$$F_X(x) = \frac{1}{\Gamma(k/2)}\,\gamma\!\left(\frac{k}{2}, \frac{x}{2}\right) \qquad f_X(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\,x^{k/2-1}e^{-x/2}$$
$$E[X] = k \qquad V[X] = 2k \qquad M_X(s) = (1 - 2s)^{-k/2} \ (s < 1/2)$$

F $\mathrm{F}(d_1, d_2)$:
$$F_X(x) = I_{\frac{d_1 x}{d_1 x + d_2}}\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right) \qquad f_X(x) = \frac{\sqrt{\dfrac{(d_1 x)^{d_1}\,d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x\,B\!\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}$$
$$E[X] = \frac{d_2}{d_2 - 2} \ (d_2 > 2) \qquad V[X] = \frac{2d_2^2(d_1 + d_2 - 2)}{d_1(d_2 - 2)^2(d_2 - 4)} \ (d_2 > 4)$$

Exponential $\mathrm{Exp}(\beta)$:
$$F_X(x) = 1 - e^{-x/\beta} \qquad f_X(x) = \frac{1}{\beta}\,e^{-x/\beta}$$
$$E[X] = \beta \qquad V[X] = \beta^2 \qquad M_X(s) = \frac{1}{1 - \beta s} \ (s < 1/\beta)$$

Gamma $\mathrm{Gamma}(\alpha, \beta)$:
$$F_X(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)} \qquad f_X(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\,x^{\alpha-1}e^{-x/\beta}$$
$$E[X] = \alpha\beta \qquad V[X] = \alpha\beta^2 \qquad M_X(s) = \left(\frac{1}{1 - \beta s}\right)^\alpha \ (s < 1/\beta)$$

Inverse Gamma $\mathrm{InvGamma}(\alpha, \beta)$:
$$F_X(x) = \frac{\Gamma(\alpha, \beta/x)}{\Gamma(\alpha)} \qquad f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{-\alpha-1}e^{-\beta/x}$$
$$E[X] = \frac{\beta}{\alpha - 1} \ (\alpha > 1) \qquad V[X] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)} \ (\alpha > 2) \qquad M_X(s) = \frac{2(-\beta s)^{\alpha/2}}{\Gamma(\alpha)}\,K_\alpha\!\left(\sqrt{-4\beta s}\right)$$

Dirichlet $\mathrm{Dir}(\alpha)$:
$$f_X(x) = \frac{\Gamma\!\left(\sum_{i=1}^k \alpha_i\right)}{\prod_{i=1}^k \Gamma(\alpha_i)}\prod_{i=1}^k x_i^{\alpha_i - 1}$$
$$E[X_i] = \frac{\alpha_i}{\sum_{i=1}^k \alpha_i} \qquad V[X_i] = \frac{E[X_i]\,(1 - E[X_i])}{\sum_{i=1}^k \alpha_i + 1}$$

Beta $\mathrm{Beta}(\alpha, \beta)$:
$$F_X(x) = I_x(\alpha, \beta) \qquad f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$$
$$E[X] = \frac{\alpha}{\alpha+\beta} \qquad V[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \qquad M_X(s) = 1 + \sum_{k=1}^\infty\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{s^k}{k!}$$

Weibull $\mathrm{Weibull}(\lambda, k)$:
$$F_X(x) = 1 - e^{-(x/\lambda)^k} \qquad f_X(x) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^k}$$
$$E[X] = \lambda\,\Gamma\!\left(1 + \frac{1}{k}\right) \qquad V[X] = \lambda^2\,\Gamma\!\left(1 + \frac{2}{k}\right) - \mu^2 \qquad M_X(s) = \sum_{n=0}^\infty \frac{s^n\lambda^n}{n!}\,\Gamma\!\left(1 + \frac{n}{k}\right)$$

Pareto $\mathrm{Pareto}(x_m, \alpha)$:
$$F_X(x) = 1 - \left(\frac{x_m}{x}\right)^\alpha \ (x \ge x_m) \qquad f_X(x) = \alpha\,\frac{x_m^\alpha}{x^{\alpha+1}} \ (x \ge x_m)$$
$$E[X] = \frac{\alpha x_m}{\alpha - 1} \ (\alpha > 1) \qquad V[X] = \frac{x_m^2\,\alpha}{(\alpha-1)^2(\alpha-2)} \ (\alpha > 2) \qquad M_X(s) = \alpha(-x_m s)^\alpha\,\Gamma(-\alpha, -x_m s) \ (s < 0)$$
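A quick simulation check of one table row, the Gamma mean and variance (a minimal sketch assuming NumPy; the parameters α = 3, β = 2 and the sample size are arbitrary):

```python
import numpy as np

# Simulation check of the Gamma(alpha, beta) row: E[X] = alpha*beta,
# V[X] = alpha*beta^2 (beta is the scale parameter, as in the table).
alpha, beta = 3.0, 2.0
rng = np.random.default_rng(0)
x = rng.gamma(shape=alpha, scale=beta, size=1_000_000)

print(x.mean(), alpha * beta)       # both ~ 6.0
print(x.var(), alpha * beta**2)     # both ~ 12.0
```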
[Figure: PDFs of the continuous distributions. Row 1: Uniform (continuous, height 1/(b−a)), Normal (μ = 0 with σ² ∈ {0.2, 1, 5}; μ = −2, σ² = 0.5), Lognormal (various μ, σ²), Student's t (ν ∈ {1, 2, 5, ∞}). Row 2: χ² (k = 1, ..., 5), F (various d₁, d₂), Exponential (λ ∈ {2, 1, 0.4}), Gamma (various α, β). Row 3: Inverse Gamma, Beta, Weibull (λ = 1, k ∈ {1, 1.5, 5}), Pareto (x_m = 1, α ∈ {2, 4}). Each panel plots PDF against x.]
2 Probability Theory

Definitions
- Sample space $\Omega$

Law of Total Probability
$$P[B] = \sum_{i=1}^n P[B \mid A_i]\,P[A_i] \qquad \text{where } \Omega = \bigsqcup_{i=1}^n A_i$$

Conditional Probability
$$P[A \mid B] = \frac{P[A \cap B]}{P[B]}, \qquad P[B] > 0$$

Conditional density
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$$

Independence
$$A \perp B \iff P[A \cap B] = P[A]\,P[B]$$

For random variables, $X \perp Y$ iff
1. $P[X \le x, Y \le y] = P[X \le x]\,P[Y \le y]$
2. $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$
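A small worked instance of the two identities above (the numbers are illustrative, not from the original):

```latex
% Two-event partition A_1 \sqcup A_2 = \Omega with
% P[A_1] = 0.3, P[A_2] = 0.7, P[B|A_1] = 0.5, P[B|A_2] = 0.1:
\[
  P[B] = P[B \mid A_1]P[A_1] + P[B \mid A_2]P[A_2]
       = 0.5 \cdot 0.3 + 0.1 \cdot 0.7 = 0.22,
  \qquad
  P[A_1 \mid B] = \frac{P[B \mid A_1]\,P[A_1]}{P[B]}
                = \frac{0.15}{0.22} \approx 0.68.
\]
```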
3.1 Transformations

Transformation function: $Z = \varphi(X)$

Discrete case:
$$f_Z(z) = P[\varphi(X) = z] = P[\{x : \varphi(x) = z\}] = P\left[X \in \varphi^{-1}(z)\right] = \sum_{x \in \varphi^{-1}(z)} f(x)$$

Continuous case:
$$F_Z(z) = P[\varphi(X) \le z] = \int_{A_z} f(x)\,dx \quad \text{with } A_z = \{x : \varphi(x) \le z\}$$

Special case if $\varphi$ is strictly monotone:
$$f_Z(z) = f_X\!\left(\varphi^{-1}(z)\right)\left|\frac{d}{dz}\varphi^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x)\,\frac{1}{|J|}$$

The Rule of the Lazy Statistician
$$E[Z] = \int \varphi(x)\,dF_X(x)$$
$$E[I_A(x)] = \int I_A(x)\,dF_X(x) = \int_A dF_X(x) = P[X \in A]$$

Convolution
- $Z := X + Y$: $\ f_Z(z) = \int_{-\infty}^\infty f_{X,Y}(x, z - x)\,dx \overset{X,Y \ge 0}{=} \int_0^z f_{X,Y}(x, z - x)\,dx$
- $Z := |X - Y|$: $\ f_Z(z) = 2\int_0^\infty f_{X,Y}(x, z + x)\,dx$
- $Z := \frac{X}{Y}$: $\ f_Z(z) = \int_{-\infty}^\infty |x|\,f_{X,Y}(x, xz)\,dx \overset{X \perp Y}{=} \int |x|\,f_X(x)\,f_Y(xz)\,dx$

4 Expectation

Definition and properties
$$E[X] = \mu_X = \int x\,dF_X(x) = \begin{cases}\sum_x x f_X(x) & X \text{ discrete}\\[2pt] \int x f_X(x)\,dx & X \text{ continuous}\end{cases}$$
- $P[X = c] = 1 \implies E[c] = c$
- $E[cX] = c\,E[X]$
- $E[X + Y] = E[X] + E[Y]$
- $E[XY] = \iint xy\,f_{X,Y}(x, y)\,dx\,dy$
- $E[\varphi(X)] \ne \varphi(E[X])$ in general (cf. Jensen inequality)
- $P[X \ge Y] = 1 \implies E[X] \ge E[Y]$, and $P[X = Y] = 1 \implies E[X] = E[Y]$
- For $X$ taking values in $\{0, 1, 2, \ldots\}$: $E[X] = \sum_{x=1}^\infty P[X \ge x]$

Sample mean
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

Conditional expectation
$$E[Y \mid X = x] = \int y\,f(y \mid x)\,dy$$
- $E[X] = E[E[X \mid Y]]$
- $E[\varphi(X, Y) \mid X = x] = \int \varphi(x, y)\,f_{Y|X}(y \mid x)\,dy$
- $E[\varphi(Y, Z) \mid X = x] = \iint \varphi(y, z)\,f_{(Y,Z)|X}(y, z \mid x)\,dy\,dz$
- $E[Y + Z \mid X] = E[Y \mid X] + E[Z \mid X]$
- $E[\varphi(X)\,Y \mid X] = \varphi(X)\,E[Y \mid X]$
- $E[Y \mid X] = c \implies \mathrm{Cov}[X, Y] = 0$

5 Variance

Definition and properties
$$V[X] = \sigma_X^2 = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$
$$V\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n V[X_i] + 2\sum_{i \ne j}\mathrm{Cov}[X_i, X_j]$$
$$V\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n V[X_i] \quad \text{if the } X_i \text{ are independent}$$

Standard deviation
$$\mathrm{sd}[X] = \sqrt{V[X]} = \sigma_X$$

Covariance
$$\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$
- $\mathrm{Cov}[X, a] = 0$
- $\mathrm{Cov}[X, X] = V[X]$
- $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$
- $\mathrm{Cov}[aX, bY] = ab\,\mathrm{Cov}[X, Y]$
- $\mathrm{Cov}[X + a, Y + b] = \mathrm{Cov}[X, Y]$
- $\mathrm{Cov}\left[\sum_{i=1}^n X_i, \sum_{j=1}^m Y_j\right] = \sum_{i=1}^n\sum_{j=1}^m \mathrm{Cov}[X_i, Y_j]$

Correlation
$$\rho[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{V[X]\,V[Y]}}$$

Independence
$$X \perp Y \implies \rho[X, Y] = 0 \iff \mathrm{Cov}[X, Y] = 0 \iff E[XY] = E[X]\,E[Y]$$

Sample variance
$$S^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$$

Conditional variance
$$V[Y \mid X] = E\left[(Y - E[Y \mid X])^2 \mid X\right] = E[Y^2 \mid X] - E[Y \mid X]^2$$
$$V[Y] = E[V[Y \mid X]] + V[E[Y \mid X]]$$

6 Inequalities

- Cauchy-Schwarz: $E[XY]^2 \le E[X^2]\,E[Y^2]$
- Markov (for non-negative $\varphi$): $P[\varphi(X) \ge t] \le \dfrac{E[\varphi(X)]}{t}$
- Chebyshev: $P[|X - E[X]| \ge t] \le \dfrac{V[X]}{t^2}$
- Chernoff: $P[X \ge (1 + \delta)\mu] \le \left(\dfrac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$, $\ \delta > -1$
- Jensen: $E[\varphi(X)] \ge \varphi(E[X])$ for convex $\varphi$

7 Distribution Relationships

Binomial
- $X_i \sim \mathrm{Bern}(p) \implies \sum_{i=1}^n X_i \sim \mathrm{Bin}(n, p)$
- $X \sim \mathrm{Bin}(n, p),\ Y \sim \mathrm{Bin}(m, p),\ X \perp Y \implies X + Y \sim \mathrm{Bin}(n + m, p)$
- $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathrm{Po}(np)$ ($n$ large, $p$ small)
- $\lim_{n\to\infty}\mathrm{Bin}(n, p) = \mathcal{N}(np, np(1-p))$ ($n$ large, $p$ far from 0 and 1)

Negative Binomial
- $X \sim \mathrm{NBin}(1, p) = \mathrm{Geo}(p)$
- $X \sim \mathrm{NBin}(r, p) = \sum_{i=1}^r \mathrm{Geo}(p)$
- $X_i \sim \mathrm{NBin}(r_i, p) \implies \sum_i X_i \sim \mathrm{NBin}\left(\sum_i r_i, p\right)$
- $X \sim \mathrm{NBin}(r, p),\ Y \sim \mathrm{Bin}(s + r, p) \implies P[X \le s] = P[Y \ge r]$

Poisson
- $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies \sum_{i=1}^n X_i \sim \mathrm{Po}\left(\sum_{i=1}^n \lambda_i\right)$
- $X_i \sim \mathrm{Po}(\lambda_i),\ X_i \perp X_j \implies X_i \,\Big|\, \sum_{j=1}^n X_j \sim \mathrm{Bin}\left(\sum_{j=1}^n X_j,\ \frac{\lambda_i}{\sum_{j=1}^n \lambda_j}\right)$

Exponential
- $X_i \sim \mathrm{Exp}(\beta),\ X_i \perp X_j \implies \sum_{i=1}^n X_i \sim \mathrm{Gamma}(n, \beta)$
- Memoryless property: $P[X > x + y \mid X > y] = P[X > x]$

Normal
- $X \sim \mathcal{N}(\mu, \sigma^2) \implies \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \sigma^2),\ Z = aX + b \implies Z \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- $X \sim \mathcal{N}(\mu_1, \sigma_1^2),\ Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\ X \perp Y \implies X + Y \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
- $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ independent $\implies \sum_i X_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$
- $P[a < X \le b] = \Phi\!\left(\frac{b - \mu}{\sigma}\right) - \Phi\!\left(\frac{a - \mu}{\sigma}\right)$
- $\Phi(-x) = 1 - \Phi(x)$, $\ \phi'(x) = -x\phi(x)$, $\ \phi''(x) = (x^2 - 1)\phi(x)$
- Upper quantile of $\mathcal{N}(0, 1)$: $z_\alpha = \Phi^{-1}(1 - \alpha)$

Gamma
- $X \sim \mathrm{Gamma}(\alpha, \beta) \iff X/\beta \sim \mathrm{Gamma}(\alpha, 1)$
- $\mathrm{Gamma}(\alpha, \beta) \sim \sum_{i=1}^\alpha \mathrm{Exp}(\beta)$ (integer $\alpha$)
- $X_i \sim \mathrm{Gamma}(\alpha_i, \beta),\ X_i \perp X_j \implies \sum_i X_i \sim \mathrm{Gamma}\left(\sum_i \alpha_i, \beta\right)$
- $\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1}e^{-x}\,dx$

Beta
- $f(x) = \frac{1}{B(\alpha, \beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$
- $E[X^k] = \dfrac{B(\alpha + k, \beta)}{B(\alpha, \beta)} = \dfrac{\alpha + k - 1}{\alpha + \beta + k - 1}\,E[X^{k-1}]$
- $\mathrm{Beta}(1, 1) \sim \mathrm{Unif}(0, 1)$
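A Monte Carlo check of one relationship above, $\sum_{i=1}^n X_i \sim \mathrm{Gamma}(n, \beta)$ for iid $X_i \sim \mathrm{Exp}(\beta)$ (a sketch assuming NumPy; n, β, and the sample sizes are arbitrary):

```python
import numpy as np

# Compare quantiles of sums of n iid Exp(beta) draws against direct
# Gamma(n, beta) draws; the two samples should agree closely.
rng = np.random.default_rng(1)
n, beta = 5, 2.0
sums = rng.exponential(scale=beta, size=(200_000, n)).sum(axis=1)
gamma_draws = rng.gamma(shape=n, scale=beta, size=200_000)

qs = [0.1, 0.5, 0.9]
print(np.quantile(sums, qs))
print(np.quantile(gamma_draws, qs))
```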
8 Probability and Moment Generating Functions

- $G_X(t) = E[t^X]$, $\ |t| < 1$
- $M_X(t) = G_X(e^t) = E[e^{Xt}] = E\left[\sum_{i=0}^\infty \frac{(Xt)^i}{i!}\right] = \sum_{i=0}^\infty \frac{E[X^i]}{i!}\,t^i$
- $P[X = 0] = G_X(0)$
- $P[X = 1] = G_X'(0)$
- $P[X = i] = \dfrac{G_X^{(i)}(0)}{i!}$
- $E[X] = G_X'(1^-)$
- $E[X^k] = M_X^{(k)}(0)$
- $E\left[\dfrac{X!}{(X - k)!}\right] = G_X^{(k)}(1^-)$
- $V[X] = G_X''(1^-) + G_X'(1^-) - \left(G_X'(1^-)\right)^2$
- $G_X(t) = G_Y(t) \implies X \stackrel{d}{=} Y$

9 Multivariate Distributions

9.1 Standard Bivariate Normal
Let $X, Z \sim \mathcal{N}(0, 1)$ with $X \perp Z$, and $Y = \rho X + \sqrt{1 - \rho^2}\,Z$.
Joint density:
$$f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\left(-\frac{x^2 + y^2 - 2\rho xy}{2(1 - \rho^2)}\right)$$
Conditionals:
$$(Y \mid X = x) \sim \mathcal{N}(\rho x, 1 - \rho^2) \quad \text{and} \quad (X \mid Y = y) \sim \mathcal{N}(\rho y, 1 - \rho^2)$$
Independence: $X \perp Y \iff \rho = 0$

9.2 Bivariate Normal
$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}\exp\left(-\frac{z}{2(1 - \rho^2)}\right)$$
with
$$z = \left(\frac{x - \mu_x}{\sigma_x}\right)^2 + \left(\frac{y - \mu_y}{\sigma_y}\right)^2 - 2\rho\left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$$
Conditional mean and variance:
$$E[X \mid Y] = E[X] + \rho\,\frac{\sigma_X}{\sigma_Y}(Y - E[Y]) \qquad V[X \mid Y] = \sigma_X\sqrt{1 - \rho^2}$$

9.3 Multivariate Normal
Covariance matrix $\Sigma$ (precision matrix $\Sigma^{-1}$):
$$\Sigma = \begin{pmatrix} V[X_1] & \cdots & \mathrm{Cov}[X_1, X_k]\\ \vdots & \ddots & \vdots\\ \mathrm{Cov}[X_k, X_1] & \cdots & V[X_k] \end{pmatrix}$$
If $X \sim \mathcal{N}(\mu, \Sigma)$,
$$f_X(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left(-\frac12(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$
Properties:
- $Z \sim \mathcal{N}(0, 1)$ and $X = \mu + \Sigma^{1/2}Z \implies X \sim \mathcal{N}(\mu, \Sigma)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies \Sigma^{-1/2}(X - \mu) \sim \mathcal{N}(0, 1)$
- $X \sim \mathcal{N}(\mu, \Sigma) \implies AX \sim \mathcal{N}(A\mu, A\Sigma A^T)$
- $X \sim \mathcal{N}(\mu, \Sigma)$, $a$ a vector of length $k$ $\implies a^T X \sim \mathcal{N}(a^T\mu, a^T\Sigma a)$

10 Convergence
Let $\{X_1, X_2, \ldots\}$ be a sequence of rvs and let $X$ be another rv. Let $F_n$ denote the cdf of $X_n$ and let $F$ denote the cdf of $X$.

Types of convergence:
1. In distribution (weakly, in law): $X_n \xrightarrow{D} X$ if $\lim_{n\to\infty} F_n(t) = F(t)$ at all continuity points $t$ of $F$
2. In probability: $X_n \xrightarrow{P} X$ if $(\forall\varepsilon > 0)\ \lim_{n\to\infty} P[|X_n - X| > \varepsilon] = 0$
3. Almost surely (strongly): $X_n \xrightarrow{as} X$ if
$$P\left[\lim_{n\to\infty} X_n = X\right] = P\left[\omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega)\right] = 1$$
4. In quadratic mean ($L_2$): $X_n \xrightarrow{qm} X$ if $\lim_{n\to\infty} E\left[(X_n - X)^2\right] = 0$

CLT notations:
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{D} Z \sim \mathcal{N}(0, 1)$$
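A simulation sketch of the CLT statement above (assuming NumPy; the Exp(1) population, with μ = σ = 1, is an arbitrary choice):

```python
import numpy as np

# Standardized sample means of iid Exp(1) draws should be approximately
# N(0, 1) for large n, illustrating Z_n -> Z in distribution.
rng = np.random.default_rng(9)
n, reps = 1_000, 50_000
x = rng.exponential(scale=1.0, size=(reps, n))
z = (x.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))   # Z_n = (Xbar - mu)/(sigma/sqrt(n))

print(np.quantile(z, [0.025, 0.5, 0.975]))        # ~ [-1.96, 0.00, 1.96]
```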
Estimated standard error of $\hat{\tau} = \tau(\hat{\theta})$ (multiparameter delta method):
$$\hat{se}(\hat{\tau}) = \sqrt{\hat{\nabla}^T\,\hat{J}_n\,\hat{\nabla}} \quad \text{where } \hat{J}_n = J_n(\hat{\theta}) \text{ and } \hat{\nabla} = \nabla\tau(\theta)\big|_{\theta = \hat{\theta}}$$

p-values:

p-value        evidence
< 0.01         very strong evidence against $H_0$
0.01 - 0.05    strong evidence against $H_0$
0.05 - 0.1     weak evidence against $H_0$
> 0.1          little or no evidence against $H_0$

Wald test
Two-sided test: reject $H_0$ when $|W| > z_{\alpha/2}$ where $W = \dfrac{\hat{\theta} - \theta_0}{\hat{se}}$
$$\text{p-value} = P_{\theta_0}[|W| > |w|] \approx P[|Z| > |w|] = 2\Phi(-|w|)$$

Likelihood ratio test (LRT)
$$T(X) = \frac{\sup_{\theta\in\Theta} L_n(\theta)}{\sup_{\theta\in\Theta_0} L_n(\theta)} = \frac{L_n(\hat{\theta}_n)}{L_n(\hat{\theta}_{n,0})}$$
$$\lambda(X) = 2\log T(X) \xrightarrow{D} \chi^2_{r-q} \quad \text{where } \chi^2_k \sim \sum_{i=1}^k Z_i^2 \text{ and } Z_1, \ldots, Z_k \stackrel{iid}{\sim} \mathcal{N}(0, 1)$$
$$\text{p-value} = P_{\theta_0}[\lambda(X) > \lambda(x)] \approx P\left[\chi^2_{r-q} > \lambda(x)\right]$$

Multinomial LRT
mle: $\hat{p}_n = \left(\dfrac{X_1}{n}, \ldots, \dfrac{X_k}{n}\right)$
$$T(X) = \frac{L_n(\hat{p}_n)}{L_n(p_0)} = \prod_{j=1}^k \left(\frac{\hat{p}_j}{p_{0j}}\right)^{X_j}$$
$$\lambda(X) = 2\sum_{j=1}^k X_j\log\left(\frac{\hat{p}_j}{p_{0j}}\right) \xrightarrow{D} \chi^2_{k-1}$$
The approximate size-$\alpha$ LRT rejects $H_0$ when $\lambda(X) \ge \chi^2_{k-1,\alpha}$.

Pearson Chi-square Test
$$T = \sum_{j=1}^k \frac{(X_j - E[X_j])^2}{E[X_j]} \quad \text{where } E[X_j] = np_{0j} \text{ under } H_0$$
$$T \xrightarrow{D} \chi^2_{k-1} \qquad \text{p-value} = P\left[\chi^2_{k-1} > T(x)\right]$$
The Pearson statistic converges to $\chi^2_{k-1}$ faster than the LRT, hence it is preferable for small $n$.

Independence testing
- $I$ rows, $J$ columns, $X$ multinomial sample of size $n = I \cdot J$
- mles unconstrained: $\hat{p}_{ij} = \dfrac{X_{ij}}{n}$
- mles under $H_0$: $\hat{p}_{0ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \dfrac{X_{i\cdot}}{n}\,\dfrac{X_{\cdot j}}{n}$
- LRT: $\lambda = 2\sum_{i=1}^I\sum_{j=1}^J X_{ij}\log\left(\dfrac{X_{ij}\,n}{X_{i\cdot}X_{\cdot j}}\right)$
- Pearson: $T = \sum_{i=1}^I\sum_{j=1}^J \dfrac{(X_{ij} - E[X_{ij}])^2}{E[X_{ij}]}$
- LRT and Pearson $\xrightarrow{D} \chi^2_\nu$, where $\nu = (I - 1)(J - 1)$

12.4 Parametric Bootstrap
Sample from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$, where $\hat{\theta}_n$ could be the mle or method of moments estimator.

14 Bayesian Inference

Definitions
- $X^n = (X_1, \ldots, X_n)$, $\ x^n = (x_1, \ldots, x_n)$
- Prior density $f(\theta)$
- Likelihood $f(x^n \mid \theta)$: joint density of the data. In particular, $X^n$ iid $\implies f(x^n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = L_n(\theta)$
- Posterior density $f(\theta \mid x^n)$
- Normalizing constant $c_n = f(x^n) = \int f(x^n \mid \theta)\,f(\theta)\,d\theta$
- Kernel: part of a density that depends on $\theta$
- Posterior mean $\bar{\theta}_n = \int \theta\,f(\theta \mid x^n)\,d\theta = \dfrac{\int \theta\,L_n(\theta)f(\theta)\,d\theta}{\int L_n(\theta)f(\theta)\,d\theta}$

Bayes' Theorem
$$f(\theta \mid x^n) = \frac{f(x^n \mid \theta)\,f(\theta)}{f(x^n)} = \frac{f(x^n \mid \theta)\,f(\theta)}{\int f(x^n \mid \theta)\,f(\theta)\,d\theta} \propto L_n(\theta)\,f(\theta)$$

14.1 Credible Intervals
Posterior interval:
$$P[\theta \in (a, b) \mid x^n] = \int_a^b f(\theta \mid x^n)\,d\theta = 1 - \alpha$$
Equal-tail credible interval:
$$\int_{-\infty}^a f(\theta \mid x^n)\,d\theta = \int_b^\infty f(\theta \mid x^n)\,d\theta = \alpha/2$$
Highest posterior density (HPD) region $R_n$:
1. $P[\theta \in R_n] = 1 - \alpha$
2. $R_n = \{\theta : f(\theta \mid x^n) > k\}$ for some $k$
If the posterior is unimodal, $R_n$ is an interval.

14.2 Function of parameters
Let $\tau = \varphi(\theta)$ and $A = \{\theta : \varphi(\theta) \le \tau\}$.
Posterior CDF for $\tau$:
$$H(\tau \mid x^n) = P[\varphi(\theta) \le \tau \mid x^n] = \int_A f(\theta \mid x^n)\,d\theta$$
Posterior density: $h(\tau \mid x^n) = H'(\tau \mid x^n)$
Bayesian delta method:
$$\tau \mid X^n \approx \mathcal{N}\left(\varphi(\hat{\theta}),\ \hat{se}\,\varphi'(\hat{\theta})\right)$$
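A sketch of an equal-tail credible interval computed on a grid (assuming NumPy and SciPy; the Bernoulli data and Beta(2, 2) prior are a toy setup, not from the original):

```python
import numpy as np
from scipy import stats

# Posterior on a grid: f(theta | x^n) ∝ L_n(theta) f(theta), normalized
# numerically, then inverted for the alpha/2 and 1 - alpha/2 quantiles.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])           # observed data
theta = np.linspace(1e-6, 1 - 1e-6, 10_000)      # parameter grid
post = theta**x.sum() * (1 - theta)**(len(x) - x.sum()) * stats.beta.pdf(theta, 2, 2)
post /= np.trapz(post, theta)                    # normalizing constant c_n by quadrature

cdf = np.cumsum(post) * (theta[1] - theta[0])
a = theta[np.searchsorted(cdf, 0.025)]           # lower endpoint
b = theta[np.searchsorted(cdf, 0.975)]           # upper endpoint
print(a, b)   # 95% equal-tail credible interval for theta
```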
14.3 Priors
Conjugate priors, continuous likelihood (a subscript $c$ denotes a constant; only the uniform row is legible here):

Likelihood                 Conjugate prior             Posterior hyperparameters
$\mathrm{Unif}(0, \theta)$   $\mathrm{Pareto}(x_m, k)$   $\max\{x_{(n)}, x_m\},\ k + n$

16.2 Rejection Sampling
Algorithm:
1. Draw $\theta^{cand} \sim f(\theta)$
2. Generate $u \sim \mathrm{Unif}(0, 1)$
3. Accept $\theta^{cand}$ if $u \le \dfrac{L_n(\theta^{cand})}{L_n(\hat{\theta}_n)}$

16.3 Importance Sampling
Sample from an importance function $g$ rather than the target density $h$.
Algorithm to obtain an approximation to $E[q(\theta) \mid x^n]$:
1. Sample from the prior: $\theta_1, \ldots, \theta_B \stackrel{iid}{\sim} f(\theta)$
2. $w_i = \dfrac{L_n(\theta_i)}{\sum_{j=1}^B L_n(\theta_j)}$, $\ i = 1, \ldots, B$
3. $E[q(\theta) \mid x^n] \approx \sum_{i=1}^B q(\theta_i)\,w_i$

17 Decision Theory

Definitions
- Unknown quantity affecting our decision: $\theta \in \Theta$

(Frequentist) risk
$$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta}(x))\,f(x \mid \theta)\,dx = E_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]$$

Bayes risk
$$r(f, \hat{\theta}) = \iint L(\theta, \hat{\theta}(x))\,f(x, \theta)\,dx\,d\theta = E_{\theta,X}\left[L(\theta, \hat{\theta}(X))\right]$$
$$r(f, \hat{\theta}) = E_\theta\left[E_{X|\theta}\left[L(\theta, \hat{\theta}(X))\right]\right] = E_\theta\left[R(\theta, \hat{\theta})\right]$$
$$r(f, \hat{\theta}) = E_X\left[E_{\theta|X}\left[L(\theta, \hat{\theta}(X))\right]\right] = E_X\left[r(\hat{\theta} \mid X)\right]$$

17.2 Admissibility
$\hat{\theta}'$ dominates $\hat{\theta}$ if
$$\forall\theta : R(\theta, \hat{\theta}') \le R(\theta, \hat{\theta}) \quad \text{and} \quad \exists\theta : R(\theta, \hat{\theta}') < R(\theta, \hat{\theta})$$
$\hat{\theta}$ is inadmissible if there is at least one other estimator $\hat{\theta}'$ that dominates it. Otherwise it is called admissible.
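A minimal sketch of the importance-sampling algorithm above, sampling from the prior (assuming NumPy; the Bernoulli likelihood with a Beta(1, 1) prior is a toy setup chosen so the answer is known in closed form):

```python
import numpy as np

# Approximate the posterior mean E[theta | x^n] with prior draws weighted
# by normalized likelihoods, exactly as in steps 1-3 above.
rng = np.random.default_rng(3)
x = np.array([1, 1, 0, 1, 0, 1, 1])              # 5 successes out of 7

B = 50_000
theta = rng.beta(1.0, 1.0, size=B)               # 1. theta_1..theta_B ~ f(theta)
log_L = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log1p(-theta)
w = np.exp(log_L - log_L.max())
w /= w.sum()                                     # 2. w_i = L_n(theta_i) / sum_j L_n(theta_j)

print(np.sum(theta * w))                         # 3. ~ 2/3, the Beta(6, 3) posterior mean
```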
17.3 Bayes Rule
Bayes rule (or Bayes estimator):
$$r(f, \hat{\theta}) = \inf_{\tilde{\theta}} r(f, \tilde{\theta})$$
$$\hat{\theta}(x) = \inf_{\tilde{\theta}} r(\tilde{\theta} \mid x)\ \forall x \implies r(f, \hat{\theta}) = \int r(\hat{\theta}(x) \mid x)\,f(x)\,dx$$
Theorems:
- Squared error loss: posterior mean
- Absolute error loss: posterior median
- Zero-one loss: posterior mode

17.4 Minimax Rules
Maximum risk:
$$\bar{R}(\hat{\theta}) = \sup_\theta R(\theta, \hat{\theta}) \qquad \bar{R}(a) = \sup_\theta R(\theta, a)$$
Minimax rule:
$$\sup_\theta R(\theta, \hat{\theta}) = \inf_{\tilde{\theta}} \bar{R}(\tilde{\theta}) = \inf_{\tilde{\theta}}\sup_\theta R(\theta, \tilde{\theta})$$
$$\hat{\theta} = \text{Bayes rule and } \exists c : R(\theta, \hat{\theta}) = c \implies \hat{\theta} \text{ is minimax}$$
Least favorable prior:
$$\hat{\theta}^f = \text{Bayes rule and } \forall\theta : R(\theta, \hat{\theta}^f) \le r(f, \hat{\theta}^f) \implies \hat{\theta}^f \text{ is minimax}$$

18 Linear Regression

Definitions
- Response variable $Y$
- Covariate $X$ (aka predictor variable or feature)

18.1 Simple Linear Regression
Model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad E[\varepsilon_i \mid X_i] = 0,\ V[\varepsilon_i \mid X_i] = \sigma^2$$
Fitted line: $\hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$
Predicted (fitted) values: $\hat{Y}_i = \hat{r}(X_i)$
Residuals: $\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$
Residual sums of squares (rss):
$$\mathrm{rss}(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^n \hat{\varepsilon}_i^2$$
Least squares estimates: $\hat{\beta}^T = (\hat{\beta}_0, \hat{\beta}_1)^T$ minimizing $\mathrm{rss}$:
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{X}_n \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{\sum_{i=1}^n (X_i - \bar{X}_n)^2} = \frac{\sum_{i=1}^n X_iY_i - n\bar{X}\bar{Y}}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}$$
$$E[\hat{\beta} \mid X^n] = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix} \qquad V[\hat{\beta} \mid X^n] = \frac{\sigma^2}{n s_X^2}\begin{pmatrix}\frac{1}{n}\sum_{i=1}^n X_i^2 & -\bar{X}_n\\ -\bar{X}_n & 1\end{pmatrix}$$
$$\hat{se}(\hat{\beta}_0) = \frac{\hat{\sigma}}{s_X\sqrt{n}}\sqrt{\frac{\sum_{i=1}^n X_i^2}{n}} \qquad \hat{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{s_X\sqrt{n}}$$
where $s_X^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ and $\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n \hat{\varepsilon}_i^2$ (unbiased estimate).
Further properties:
- Consistency: $\hat{\beta}_0 \xrightarrow{P} \beta_0$ and $\hat{\beta}_1 \xrightarrow{P} \beta_1$
- Asymptotic normality:
$$\frac{\hat{\beta}_0 - \beta_0}{\hat{se}(\hat{\beta}_0)} \xrightarrow{D} \mathcal{N}(0, 1) \quad \text{and} \quad \frac{\hat{\beta}_1 - \beta_1}{\hat{se}(\hat{\beta}_1)} \xrightarrow{D} \mathcal{N}(0, 1)$$
- Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$:
$$\hat{\beta}_0 \pm z_{\alpha/2}\,\hat{se}(\hat{\beta}_0) \quad \text{and} \quad \hat{\beta}_1 \pm z_{\alpha/2}\,\hat{se}(\hat{\beta}_1)$$
- Wald test for $H_0 : \beta_1 = 0$ vs. $H_1 : \beta_1 \ne 0$: reject $H_0$ if $|W| > z_{\alpha/2}$ where $W = \hat{\beta}_1/\hat{se}(\hat{\beta}_1)$.
$R^2$:
$$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n \hat{\varepsilon}_i^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}}{\mathrm{tss}}$$
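A sketch of the closed-form least squares estimates and the standard error of the slope (assuming NumPy; the simulated data with true β₀ = 1, β₁ = 2, σ = 0.5 are an assumption for illustration):

```python
import numpy as np

# Simple linear regression via the closed-form estimates above, plus an
# approximate 95% confidence interval for beta_1.
rng = np.random.default_rng(4)
n = 200
X = rng.uniform(0, 10, size=n)
Y = 1.0 + 2.0 * X + rng.normal(0, 0.5, size=n)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
b0 = Y.mean() - b1 * X.mean()

resid = Y - (b0 + b1 * X)
sigma2 = resid @ resid / (n - 2)                 # unbiased estimate of sigma^2
s_X = np.sqrt(np.mean((X - X.mean())**2))
se_b1 = np.sqrt(sigma2) / (s_X * np.sqrt(n))

print(b0, b1)                                    # ~ 1.0 and ~ 2.0
print(b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)      # 95% CI for beta_1
```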
Likelihood:
$$L = \prod_{i=1}^n f(X_i, Y_i) = \prod_{i=1}^n f_X(X_i)\prod_{i=1}^n f_{Y|X}(Y_i \mid X_i) = L_1 L_2$$
$$L_1 = \prod_{i=1}^n f_X(X_i) \qquad L_2 = \prod_{i=1}^n f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2\right)$$
Under the assumption of normality, the least squares estimator is also the mle, with
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n \hat{\varepsilon}_i^2$$

18.2 Prediction
Observe $X = x_*$ of the covariate and want to predict the outcome $Y_*$.
$$\hat{Y}_* = \hat{\beta}_0 + \hat{\beta}_1 x_*$$
$$V[\hat{Y}_*] = V[\hat{\beta}_0] + x_*^2\,V[\hat{\beta}_1] + 2x_*\,\mathrm{Cov}[\hat{\beta}_0, \hat{\beta}_1]$$
Prediction interval:
$$\hat{\xi}_n^2 = \hat{\sigma}^2\left(\frac{\sum_{i=1}^n (X_i - x_*)^2}{n\sum_i (X_i - \bar{X})^2} + 1\right) \qquad \hat{Y}_* \pm z_{\alpha/2}\,\hat{\xi}_n$$

18.3 Multiple Regression
$$Y = X\beta + \varepsilon$$
where
$$X = \begin{pmatrix} X_{11} & \cdots & X_{1k}\\ \vdots & \ddots & \vdots\\ X_{n1} & \cdots & X_{nk}\end{pmatrix} \qquad \beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_k\end{pmatrix} \qquad \varepsilon = \begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}$$
Likelihood:
$$L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\,\mathrm{rss}\right)$$
$$\mathrm{rss} = (Y - X\beta)^T(Y - X\beta) = \|Y - X\beta\|^2 = \sum_{i=1}^n (Y_i - x_i^T\beta)^2$$
If the $(k \times k)$ matrix $X^T X$ is invertible,
$$\hat{\beta} = (X^T X)^{-1}X^T Y \qquad V[\hat{\beta} \mid X^n] = \sigma^2(X^T X)^{-1} \qquad \hat{\beta} \approx \mathcal{N}\left(\beta, \sigma^2(X^T X)^{-1}\right)$$
Estimate regression function: $\hat{r}(x) = \sum_{j=1}^k \hat{\beta}_j x_j$
Unbiased estimate for $\sigma^2$:
$$\hat{\sigma}^2 = \frac{1}{n - k}\sum_{i=1}^n \hat{\varepsilon}_i^2 \qquad \hat{\varepsilon} = X\hat{\beta} - Y$$
mle: $\hat{\mu} = X\hat{\beta}$, $\ \hat{\sigma}^2_{mle} = \dfrac{n - k}{n}\,\hat{\sigma}^2$
$1 - \alpha$ confidence interval:
$$\hat{\beta}_j \pm z_{\alpha/2}\,\hat{se}(\hat{\beta}_j)$$
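A sketch of the multiple-regression estimates above, solving the normal equations rather than forming $(X^T X)^{-1}$ explicitly (assuming NumPy; the simulated design and coefficients are illustrative):

```python
import numpy as np

# beta_hat = (X^T X)^{-1} X^T Y computed via a linear solve, plus the
# unbiased sigma^2 estimate and standard errors from V[beta_hat | X^n].
rng = np.random.default_rng(5)
n, k = 500, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([0.5, -1.0, 2.0])
Y = X @ beta + rng.normal(0, 1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # least squares / mle under normality
eps = X @ beta_hat - Y
sigma2_hat = eps @ eps / (n - k)                 # unbiased estimate of sigma^2
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated V[beta_hat | X^n]

print(beta_hat)                                  # ~ (0.5, -1.0, 2.0)
print(np.sqrt(np.diag(cov_beta)))                # standard errors se(beta_hat_j)
```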
18.4 Model Selection
Consider predicting a new observation $Y_*$ for covariates $X_*$ and let $S \subseteq J$ denote a subset of the covariates in the model, where $|S| = k$ and $|J| = n$.
Issues:
- Underfitting: too few covariates yields high bias
- Overfitting: too many covariates yields high variance
Procedure:
1. Assign a score to each model
2. Search through all models to find the one with the highest score
Hypothesis testing: $H_0 : \beta_j = 0$ vs. $H_1 : \beta_j \ne 0$, $\ \forall j \in J$
Mean squared prediction error (mspe):
$$\mathrm{mspe} = E\left[(\hat{Y}(S) - Y_*)^2\right]$$
Prediction risk:
$$R(S) = \sum_{i=1}^n \mathrm{mspe}_i = \sum_{i=1}^n E\left[(\hat{Y}_i(S) - Y_i^*)^2\right]$$
Training error:
$$\hat{R}_{tr}(S) = \sum_{i=1}^n (\hat{Y}_i(S) - Y_i)^2$$
$R^2$:
$$R^2(S) = \frac{\sum_{i=1}^n (\hat{Y}_i(S) - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{rss}(S)}{\mathrm{tss}} = 1 - \frac{\hat{R}_{tr}(S)}{\mathrm{tss}}$$
The training error is a downward-biased estimate of the prediction risk:
$$E\left[\hat{R}_{tr}(S)\right] < R(S)$$
$$\mathrm{bias}(\hat{R}_{tr}(S)) = E\left[\hat{R}_{tr}(S)\right] - R(S) = -2\sum_{i=1}^n \mathrm{Cov}\left[\hat{Y}_i, Y_i\right]$$
Adjusted $R^2$:
$$\bar{R}^2(S) = 1 - \frac{n - 1}{n - k}\,\frac{\mathrm{rss}}{\mathrm{tss}}$$
Mallows $C_p$ statistic:
$$\hat{R}(S) = \hat{R}_{tr}(S) + 2k\hat{\sigma}^2 = \text{lack of fit} + \text{complexity penalty}$$
Akaike Information Criterion (AIC):
$$\mathrm{AIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - k$$
Bayesian Information Criterion (BIC):
$$\mathrm{BIC}(S) = \ell_n(\hat{\beta}_S, \hat{\sigma}_S^2) - \frac{k}{2}\log n$$
Validation and training:
$$\hat{R}_V(S) = \sum_{i=1}^m (\hat{Y}_i^*(S) - Y_i^*)^2 \qquad m = |\{\text{validation data}\}|, \text{ often } \frac{n}{4} \text{ or } \frac{n}{2}$$
Leave-one-out cross-validation:
$$\hat{R}_{CV}(S) = \sum_{i=1}^n (Y_i - \hat{Y}_{(i)})^2 = \sum_{i=1}^n \left(\frac{Y_i - \hat{Y}_i(S)}{1 - U_{ii}(S)}\right)^2$$
$$U(S) = X_S(X_S^T X_S)^{-1}X_S^T \quad \text{(hat matrix)}$$

19 Non-parametric Function Estimation

Frequentist risk:
$$R(f, \hat{f}_n) = E\left[L(f, \hat{f}_n)\right] = \int b^2(x)\,dx + \int v(x)\,dx$$
$$b(x) = E\left[\hat{f}_n(x)\right] - f(x) \qquad v(x) = V\left[\hat{f}_n(x)\right]$$

19.1.1 Histograms
Definitions:
- Number of bins $m$
- Binwidth $h = \frac{1}{m}$
- Bin $B_j$ has $\nu_j$ observations
- Define $\hat{p}_j = \nu_j/n$ and $p_j = \int_{B_j} f(u)\,du$
Histogram estimator:
$$\hat{f}_n(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h}\,I(x \in B_j)$$
$$E\left[\hat{f}_n(x)\right] = \frac{p_j}{h} \qquad V\left[\hat{f}_n(x)\right] = \frac{p_j(1 - p_j)}{nh^2}$$
$$R(\hat{f}_n, f) \approx \frac{h^2}{12}\int (f'(u))^2\,du + \frac{1}{nh}$$
$$h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}$$
$$R^*(\hat{f}_n, f) \approx \frac{C}{n^{2/3}} \qquad C = \left(\frac{3}{4}\right)^{2/3}\left(\int (f'(u))^2\,du\right)^{1/3}$$
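A minimal sketch of the histogram estimator on [0, 1] (assuming NumPy; the Beta(2, 5) sample and the choice m = 25 are illustrative, not the optimal $h^*$):

```python
import numpy as np

# f_hat(x) = p_hat_j / h for x in bin B_j, with h = 1/m and p_hat_j = nu_j/n,
# exactly as defined above.
rng = np.random.default_rng(6)
x = rng.beta(2.0, 5.0, size=2_000)               # sample supported on [0, 1]

m = 25                                           # number of bins
h = 1.0 / m
counts, edges = np.histogram(x, bins=m, range=(0.0, 1.0))
p_hat = counts / x.size                          # p_hat_j = nu_j / n

def f_hat(t):
    j = min(int(t / h), m - 1)                   # index of the bin containing t
    return p_hat[j] / h

print(f_hat(0.2), f_hat(0.8))                    # estimated density at two points
print(p_hat.sum())                               # 1.0: f_hat integrates to one
```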
Markov Chains
Chapman-Kolmogorov: $P_{m+n} = P_m P_n$, so $P_n = P \cdot P \cdots P = P^n$.
Marginal probability: $\mu_n = \mu_0 P^n$, where $\mu_0$ is the initial distribution.
Sample autocorrelation function:
$$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$$
Sample cross-covariance function:
$$\hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h}(x_{t+h} - \bar{x})(y_t - \bar{y})$$
Sample cross-correlation function:
$$\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}$$
Properties:
- $\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ is white noise
- $\sigma_{\hat{\rho}_{xy}(h)} = \frac{1}{\sqrt{n}}$ if $x_t$ or $y_t$ is white noise

21.3 Non-Stationary Time Series
Classical decomposition model:
$$x_t = \mu_t + s_t + w_t$$
- $\mu_t$ = trend
- $s_t$ = seasonal component
- $w_t$ = random noise term
Moving average smoothing: if $\frac{1}{2k+1}\sum_{j=-k}^k w_{t-j} \approx 0$, a linear trend function $\mu_t = \beta_0 + \beta_1 t$ passes without distortion.
Differencing:
$$\mu_t = \beta_0 + \beta_1 t \implies \nabla x_t = \beta_1$$

21.4 ARIMA models
Autoregressive polynomial:
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \qquad z \in \mathbb{C},\ \phi_p \ne 0$$
Autoregressive operator:
$$\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$$
Autoregressive model of order $p$, AR($p$):
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t \iff \phi(B)x_t = w_t$$
AR(1):
$$x_t = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j} \ \xrightarrow{k\to\infty,\ |\phi| < 1}\ \sum_{j=0}^\infty \phi^j w_{t-j} \quad \text{(linear process)}$$
$$E[x_t] = \sum_{j=0}^\infty \phi^j\,E[w_{t-j}] = 0$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \frac{\sigma_w^2\,\phi^h}{1 - \phi^2}$$
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h \qquad \rho(h) = \phi\,\rho(h-1),\ h = 1, 2, \ldots$$
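A simulation sketch checking the AR(1) ACF above, $\rho(h) = \phi^h$, against the sample autocorrelation function (assuming NumPy; n and φ are arbitrary):

```python
import numpy as np

# Simulate x_t = phi x_{t-1} + w_t, then compare rho_hat(h) = gamma_hat(h)/gamma_hat(0)
# to the theoretical phi^h for a few lags.
rng = np.random.default_rng(7)
n, phi = 5_000, 0.7
w = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + w[t]

def gamma_hat(h):
    xbar = x.mean()
    return np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n

for h in range(4):
    print(h, gamma_hat(h) / gamma_hat(0), phi**h)  # rho_hat(h) vs phi^h
```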
Moving average polynomial:
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \qquad z \in \mathbb{C},\ \theta_q \ne 0$$
Moving average operator:
$$\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$$
MA($q$) (moving average model of order $q$):
$$x_t = w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff x_t = \theta(B)w_t$$
$$E[x_t] = \sum_{j=0}^q \theta_j\,E[w_{t-j}] = 0$$
$$\gamma(h) = \mathrm{Cov}[x_{t+h}, x_t] = \begin{cases}\sigma_w^2\sum_{j=0}^{q-h}\theta_j\theta_{j+h} & 0 \le h \le q\\ 0 & h > q\end{cases}$$
MA(1):
$$x_t = w_t + \theta w_{t-1}$$
$$\gamma(h) = \begin{cases}(1 + \theta^2)\sigma_w^2 & h = 0\\ \theta\sigma_w^2 & h = 1\\ 0 & h > 1\end{cases} \qquad \rho(h) = \begin{cases}\frac{\theta}{1 + \theta^2} & h = 1\\ 0 & h > 1\end{cases}$$
ARMA($p, q$):
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q} \iff \phi(B)x_t = \theta(B)w_t$$
Partial autocorrelation function (PACF):
- $x_i^{h-1}$: regression of $x_i$ on $\{x_{h-1}, x_{h-2}, \ldots, x_1\}$
- $\phi_{hh} = \mathrm{corr}(x_h - x_h^{h-1},\ x_0 - x_0^{h-1})$ for $h \ge 2$
- E.g., $\phi_{11} = \mathrm{corr}(x_1, x_0) = \rho(1)$
ARIMA($p, d, q$):
$$\nabla^d x_t = (1 - B)^d x_t \text{ is ARMA}(p, q) \qquad \phi(B)(1 - B)^d x_t = \theta(B)w_t$$
Exponentially Weighted Moving Average (EWMA):
$$x_t = x_{t-1} + w_t - \lambda w_{t-1}$$
$$x_t = \sum_{j=1}^\infty (1 - \lambda)\lambda^{j-1}x_{t-j} + w_t \quad \text{when } |\lambda| < 1$$
$$\tilde{x}_{n+1} = (1 - \lambda)x_n + \lambda\tilde{x}_n$$
Seasonal ARIMA, denoted by ARIMA$(p, d, q) \times (P, D, Q)_s$:
$$\Phi_P(B^s)\,\phi(B)\,\nabla_s^D\nabla^d x_t = \delta + \Theta_Q(B^s)\,\theta(B)\,w_t$$

21.4.1 Causality and Invertibility
ARMA($p, q$) is causal (future-independent) $\iff \exists\{\psi_j\} : \sum_{j=0}^\infty |\psi_j| < \infty$ such that
$$x_t = \sum_{j=0}^\infty \psi_j w_{t-j} = \psi(B)w_t$$
ARMA($p, q$) is invertible $\iff \exists\{\pi_j\} : \sum_{j=0}^\infty |\pi_j| < \infty$ such that
$$\pi(B)x_t = \sum_{j=0}^\infty \pi_j x_{t-j} = w_t$$
Properties:
- ARMA($p, q$) causal $\iff$ roots of $\phi(z)$ lie outside the unit circle
$$\psi(z) = \sum_{j=0}^\infty \psi_j z^j = \frac{\theta(z)}{\phi(z)} \qquad |z| \le 1$$
- ARMA($p, q$) invertible $\iff$ roots of $\theta(z)$ lie outside the unit circle
$$\pi(z) = \sum_{j=0}^\infty \pi_j z^j = \frac{\phi(z)}{\theta(z)} \qquad |z| \le 1$$

Behavior of the ACF and PACF for causal and invertible ARMA models:

        AR(p)                  MA(q)                  ARMA(p, q)
ACF     tails off              cuts off after lag q   tails off
PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis
Periodic process:
$$x_t = A\cos(2\pi\omega t + \phi) = U_1\cos(2\pi\omega t) + U_2\sin(2\pi\omega t)$$
Frequency index $\omega$ (cycles per unit time), period $1/\omega$.
Amplitude $A$, phase $\phi$; $U_1 = A\cos\phi$ and $U_2 = -A\sin\phi$, often normally distributed rvs.
Periodic mixture:
$$x_t = \sum_{k=1}^q\left(U_{k1}\cos(2\pi\omega_k t) + U_{k2}\sin(2\pi\omega_k t)\right)$$
where $U_{k1}, U_{k2}$, for $k = 1, \ldots, q$, are independent zero-mean rvs with variances $\sigma_k^2$, and
$$\gamma(h) = \sum_{k=1}^q \sigma_k^2\cos(2\pi\omega_k h) \qquad \gamma(0) = E[x_t^2] = \sum_{k=1}^q \sigma_k^2$$
Discrete Fourier Transform (DFT):
$$d(\omega_j) = n^{-1/2}\sum_{t=1}^n x_t\,e^{-2\pi i\omega_j t}$$
Fourier/fundamental frequencies: $\omega_j = j/n$
Inverse DFT:
$$x_t = n^{-1/2}\sum_{j=0}^{n-1} d(\omega_j)\,e^{2\pi i\omega_j t}$$
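A quick check of the DFT definition against NumPy's FFT (a sketch; with the time index running over 0, ..., n−1, the two agree up to the $n^{-1/2}$ normalization):

```python
import numpy as np

# d(omega_k) = n^{-1/2} sum_t x_t exp(-2 pi i omega_k t) at a fundamental
# frequency omega_k = k/n, compared to the k-th unnormalized FFT coefficient.
rng = np.random.default_rng(10)
n = 64
x = rng.standard_normal(n)
t = np.arange(n)

k = 5
omega_k = k / n
d = n**-0.5 * np.sum(x * np.exp(-2j * np.pi * omega_k * t))

print(d)
print(np.fft.fft(x)[k] / np.sqrt(n))   # same value
```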
22.4 Combinatorics

Sampling $k$ out of $n$:

             w/o replacement                                                      w/ replacement
ordered      $n^{\underline{k}} = \prod_{i=0}^{k-1}(n - i) = \dfrac{n!}{(n-k)!}$    $n^k$
unordered    $\dbinom{n}{k} = \dfrac{n^{\underline{k}}}{k!} = \dfrac{n!}{k!(n-k)!}$  $\dbinom{n-1+r}{r} = \dbinom{n-1+r}{n-1}$
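The four counts from the table, evaluated for a small case (a sketch using the Python standard library; n = 5, k = 3 are arbitrary, with r = k for the unordered with-replacement entry):

```python
from math import comb, perm

# Sampling k out of n, all four combinations of ordered/unordered and
# with/without replacement.
n, k = 5, 3
r = k
print(perm(n, k))             # ordered, w/o replacement: n!/(n-k)! = 60
print(n**k)                   # ordered, w/ replacement: n^k = 125
print(comb(n, k))             # unordered, w/o replacement: C(n, k) = 10
print(comb(n - 1 + r, r))     # unordered, w/ replacement: C(n-1+r, r) = 35
```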
[Figure: univariate distribution relationships, courtesy of Leemis and McQueston [2].]

References
[2] L. M. Leemis and J. T. McQueston. Univariate distribution relationships. The American Statistician, 62(1):45-53, 2008.
[3] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, 2006.
[4] A. Steger. Diskrete Strukturen - Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[5] A. Steger. Diskrete Strukturen - Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[6] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.