Probability and Statistics Cheat Sheet

Flavio Schneider

ETH Zürich - D-INFK

1 Probability

1.1 Basics

Def. 1.1: Sample Space
The sample space, denoted by $\Omega \neq \emptyset$, is the set of all possible outcomes of an experiment; it can be finite or infinite.

Def. 1.2: Event
An event $A$ is a subset of the sample space, $A \subseteq \Omega$, or equivalently an element of the power set of the sample space, $A \in 2^\Omega$.

Def. 1.3: Observable Event Set
The set of all observable events is denoted by $\mathcal{F}$, where $\mathcal{F} \subseteq 2^\Omega$.

Note
· Usually, if $\Omega$ is countable, $\mathcal{F} = 2^\Omega$; however, sometimes events are excluded from $\mathcal{F}$ since it is not possible for them to happen.

Def. 1.4: $\sigma$-Algebra
The set $\mathcal{F}$ is called a $\sigma$-algebra if:
(i) $\Omega \in \mathcal{F}$
(ii) $\forall A \subseteq \Omega : A \in \mathcal{F} \Rightarrow A^C \in \mathcal{F}$
(iii) $\forall (A_n)_{n \in \mathbb{N}} : A_n \in \mathcal{F} \Rightarrow \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$

Def. 1.5: Probability Function
$P : \mathcal{F} \to [0, 1]$ is a probability function if it satisfies the following 3 axioms:
(i) $\forall A \in \mathcal{F} : P[A] \geq 0$
(ii) $P[\Omega] = 1$
(iii) $P\left[\bigcup_{n=1}^{\infty} A_n\right] = \sum_{n=1}^{\infty} P[A_n]$, where the $A_n$ are disjoint.

Properties (derived from the 3 axioms):
· $P[A^C] = 1 - P[A]$
· $P[\emptyset] = 0$
· $A \subseteq B \Rightarrow P[A] \leq P[B]$
· $P[A \cup B] = P[A] + P[B] - P[A \cap B]$

Theorem 1: Inclusion-Exclusion
Let $A_1, \ldots, A_n$ be a set of events, then:
$$P\left[\bigcup_{i=1}^{n} A_i\right] = \sum_{k=1}^{n} (-1)^{k-1} S_k \quad \text{where} \quad S_k = \sum_{\substack{I \subseteq \{1,\ldots,n\} \\ |I| = k}} P\left[\bigcap_{i \in I} A_i\right]$$

1.2 Discrete Probability

We talk about discrete probability if $\Omega$ is countable (finite or infinite).

Def. 1.6: Laplace Space
If $\Omega = \{\omega_1, \ldots, \omega_N\}$ with $|\Omega| = N$, where all $\omega_i$ have the same probability $p_i = \frac{1}{N}$, then $\Omega$ is called a Laplace space and $P$ has a discrete uniform distribution. For some event $A$ we have:
$$P[A] = \frac{|A|}{|\Omega|}$$

Note
· The discrete uniform distribution exists only if $\Omega$ is finite.

1.3 Conditional Probability

Def. 1.7: Conditional Probability
Given two events $A$ and $B$ with $P[A] > 0$, the probability of $B$ given $A$ is defined as:
$$P[B \mid A] := \frac{P[B \cap A]}{P[A]}$$

Theorem 2: Total Probability
Let $A_1, \ldots, A_n$ be a set of disjoint events ($\forall i \neq j : A_i \cap A_j = \emptyset$) with $\bigcup_{i=1}^{n} A_i = \Omega$, then for any event $B \subseteq \Omega$:
$$P[B] = \sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]$$

Theorem 3: Bayes' Rule
Let $A_1, \ldots, A_n$ be a set of disjoint events ($\forall i \neq j : A_i \cap A_j = \emptyset$) with $\bigcup_{i=1}^{n} A_i = \Omega$ and $P[A_i] > 0$ for all $i = 1, \ldots, n$. Then for an event $B \subseteq \Omega$ with $P[B] > 0$ we have:
$$P[A_k \mid B] = \frac{P[B \mid A_k]\, P[A_k]}{\sum_{i=1}^{n} P[B \mid A_i]\, P[A_i]}$$

Note
· If we have only two events $A$ and $B$ it simplifies to: $P[A \mid B] = \frac{P[B \mid A]\, P[A]}{P[B]}$

1.4 Independence

Def. 1.8: Independence
A set of events $A_1, \ldots, A_n$ is independent if for all $m \in \mathbb{N}$ with $\{k_1, \ldots, k_m\} \subseteq \{1, \ldots, n\}$ we have:
$$P\left[\bigcap_{i=1}^{m} A_{k_i}\right] = \prod_{i=1}^{m} P[A_{k_i}]$$

Properties
With only two events:
· $A$ and $B$ are independent iff $P[A \cap B] = P[A]\, P[B]$
· $A$ and $B$ are independent iff $P[B \mid A] = P[B]$

2 Combinatorics

Let $n$ be the number of total objects and $k$ the number of objects that we want to select ($k = n$ if we consider all objects), then:

Def. 2.1: Permutation
A permutation $P_n(k)$ is an arrangement of elements where we care about ordering.
(i) Repetition not allowed: $P_n(k) = \frac{n!}{(n-k)!}$
(ii) Repetition allowed: $P_n(k) = n^k$

Def. 2.2: Combination
A combination $C_n(k)$ is an arrangement of elements where we do not care about ordering.
(i) Repetition not allowed: $C_n(k) = \binom{n}{k} = \frac{P_n(k)}{k!} = \frac{n!}{k!(n-k)!}$
(ii) Repetition allowed: $C_n(k) = \binom{n+k-1}{k}$

Note
· Repetition is the same as replacement, since by replacing an object in the set we'll be able to use it again.

Properties
· $0! = 1$
· $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
· $\binom{n}{0} = \binom{n}{n} = 1$
· $\binom{n}{1} = \binom{n}{n-1} = n$
· $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$
· $\sum_{k=0}^{n} \binom{n}{k} = 2^n$
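The counting formulas of Section 2 can be sanity-checked with Python's `math` module (a minimal sketch; `math.perm` and `math.comb` require Python 3.8+, and the values for $n = 10$, $k = 3$ are just an illustration):

```python
import math

n, k = 10, 3

# Permutations without repetition: n! / (n-k)!
perm_no_rep = math.factorial(n) // math.factorial(n - k)
assert perm_no_rep == math.perm(n, k)  # 10 * 9 * 8 = 720

# Permutations with repetition: n^k
perm_rep = n ** k  # 1000

# Combinations without repetition: n! / (k! (n-k)!)
comb_no_rep = math.comb(n, k)  # 120

# Combinations with repetition: C(n+k-1, k)
comb_rep = math.comb(n + k - 1, k)  # C(12, 3) = 220

# Pascal's rule: C(n,k) = C(n-1,k) + C(n-1,k-1)
assert math.comb(n, k) == math.comb(n - 1, k) + math.comb(n - 1, k - 1)

# Row sum of Pascal's triangle: sum_k C(n,k) = 2^n
assert sum(math.comb(n, i) for i in range(n + 1)) == 2 ** n
```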

Flavio Schneider ETH Zürich · D-INFK Page 1
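Looking back at Section 1.3, Bayes' rule and the law of total probability can be checked on a small worked example (all numbers hypothetical: a test with 99% sensitivity, 5% false-positive rate, 1% prevalence):

```python
# Partition: A = condition present, A^C = condition absent (hypothetical numbers)
p_A = 0.01                # prior P[A]
p_B_given_A = 0.99        # P[B|A], where B = positive test
p_B_given_notA = 0.05     # P[B|A^C]

# Total probability: P[B] = P[B|A] P[A] + P[B|A^C] P[A^C]
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' rule: P[A|B] = P[B|A] P[A] / P[B]
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 4))  # 0.1667: most positives are false alarms
```

Note how the small prior $P[A]$ dominates: even a 99%-sensitive test yields a posterior of only about 1/6.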


3 Random Variables

3.1 Basics

Def. 3.1: Random Variable
Let $(\Omega, \mathcal{F}, P)$ be a probability space, then a random variable (RV) on $\Omega$ is a function:
$$X : \Omega \to \mathcal{W}(X) \subseteq \mathbb{R}$$
If the image $\mathcal{W}(X)$ is countable, $X$ is called a discrete random variable; otherwise it is called a continuous random variable.

Def. 3.2: Probability Density
The probability density function (PDF) $f_X : \mathbb{R} \to [0, 1]$ of a RV $X$ is a function defined as:
$$f_X(x) := P[X = x] := P[\{\omega \mid X(\omega) = x\}]$$
With $X$ discrete we use $p_X(t)$ instead of $f_X(t)$.

Properties
· $f_X \geq 0$, and $f_X = 0$ outside of $\mathcal{W}(X)$.
· $\int_{-\infty}^{\infty} f_X(t)\,dt = 1$

Def. 3.3: Cumulative Distribution
The cumulative distribution function (CDF) $F_X : \mathbb{R} \to [0, 1]$ of a RV $X$ is a function defined as:
$$F_X(x) := P[X \leq x] := P[\{\omega \mid X(\omega) \leq x\}]$$
If the PDF is given it can be expressed as:
$$F_X(x) = \begin{cases} \sum_{x_i \leq x} p_X(x_i) & X \text{ discr.} \\ \int_{-\infty}^{x} f_X(t)\,dt & X \text{ cont.} \end{cases}$$

Properties
· Monotone: if $t \leq s$ then $F_X(t) \leq F_X(s)$.
· Right-continuous: $\lim_{t \downarrow s} F_X(t) = F_X(s)$.
· Limits: $\lim_{t \to -\infty} F_X(t) = 0$ and $\lim_{t \to \infty} F_X(t) = 1$.
· $P[a < X \leq b] = F_X(b) - F_X(a) = \int_a^b f_X(t)\,dt$
· $P[X > t] = 1 - P[X \leq t] = 1 - F_X(t)$
· $\frac{d}{dx} F_X(x) = f_X(x)$

3.2 Expected Value

Def. 3.4: Expected Value
Let $X$ be a RV, then the expected value is defined as:
$$E[X] = \mu := \begin{cases} \sum_{x_k \in \mathcal{W}(X)} x_k \cdot p_X(x_k) & X \text{ discr.} \\ \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx & X \text{ cont.} \end{cases}$$

Properties
· $E[X] \leq E[Y]$ if $\forall \omega : X(\omega) \leq Y(\omega)$
· $E\left[\sum_{i=0}^{n} a_i X_i\right] = \sum_{i=0}^{n} a_i E[X_i]$
· $E[X] = \sum_{j=1}^{\infty} P[X \geq j]$, if $\mathcal{W}(X) \subseteq \mathbb{N}_0$.
· $E\left[\sum_{i=0}^{\infty} X_i\right] \neq \sum_{i=0}^{\infty} E[X_i]$ in general.
· $E[E[X]] = E[X]$
· $E[XY]^2 \leq E[X^2]\, E[Y^2]$
· $E\left[\prod_{i=0}^{n} X_i\right] = \prod_{i=0}^{n} E[X_i]$ for independent $X_1, \ldots, X_n$.

Theorem 4: E of Functions
Let $X$ be a RV and $Y = g(X)$, with $g : \mathbb{R} \to \mathbb{R}$, then:
$$E[Y] = \begin{cases} \sum_{x_k \in \mathcal{W}(X)} g(x_k) \cdot p_X(x_k) & X \text{ discr.} \\ \int_{-\infty}^{\infty} g(x) \cdot f_X(x)\,dx & X \text{ cont.} \end{cases}$$

Def. 3.5: Moment-Generating Function
Let $X$ be a RV, then the moment-generating function of $X$ is defined as:
$$M_X(t) := E\left[e^{tX}\right]$$

3.3 Variance

Def. 3.6: Variance
Let $X$ be a RV with $E[X^2] < \infty$, then the variance of $X$ is defined as:
$$\mathrm{Var}[X] := E\left[(X - E[X])^2\right]$$
with the extended form:
$$\mathrm{Var}[X] = \begin{cases} \left(\sum_k p_X(x_k) \cdot x_k^2\right) - \mu^2 & X \text{ discr.} \\ \left(\int_{-\infty}^{\infty} x^2 \cdot f_X(x)\,dx\right) - \mu^2 & X \text{ cont.} \end{cases}$$

Properties
· $0 \leq \mathrm{Var}[X] \leq E[X^2]$
· $\mathrm{Var}[X] = E[X^2] - E[X]^2$
· $\mathrm{Var}[aX + b] = a^2\, \mathrm{Var}[X]$
· $\mathrm{Var}[X] = \mathrm{Cov}(X, X)$
· $\mathrm{Var}\left[\sum_{i=0}^{n} a_i X_i\right] = \sum_{i=0}^{n} a_i^2\, \mathrm{Var}[X_i] + 2 \sum_{1 \leq i < j \leq n} a_i a_j\, \mathrm{Cov}(X_i, X_j)$
· $\mathrm{Var}\left[\sum_{i=0}^{n} X_i\right] = \sum_{i=0}^{n} \mathrm{Var}[X_i]$ if $\forall (i \neq j) : \mathrm{Cov}(X_i, X_j) = 0$.

Def. 3.7: Standard Deviation
Let $X$ be a RV with $E[X^2] < \infty$, then the standard deviation of $X$ is defined as:
$$\sigma(X) = \mathrm{sd}(X) := \sqrt{\mathrm{Var}[X]}$$

3.4 Other Functions

Def. 3.8: Covariance
Let $X, Y$ be RVs with finite expected value, then the covariance of $X$ and $Y$ is defined as:
$$\mathrm{Cov}(X, Y) := E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$$

Note
· The covariance is a measure of correlation between two random variables: $\mathrm{Cov}(X, Y) > 0$ if $Y$ tends to increase as $X$ increases, and $\mathrm{Cov}(X, Y) < 0$ if $Y$ tends to decrease as $X$ increases. If $\mathrm{Cov}(X, Y) = 0$ then $X$ and $Y$ are uncorrelated.

Properties
· $\mathrm{Cov}(aX, bY) = ab\, \mathrm{Cov}(X, Y)$
· $\mathrm{Cov}(X + a, Y + b) = \mathrm{Cov}(X, Y)$
· $\mathrm{Cov}(a_1 X_1 + a_2 X_2,\ b_1 Y_1 + b_2 Y_2) = a_1 b_1 \mathrm{Cov}(X_1, Y_1) + a_1 b_2 \mathrm{Cov}(X_1, Y_2) + a_2 b_1 \mathrm{Cov}(X_2, Y_1) + a_2 b_2 \mathrm{Cov}(X_2, Y_2)$

Def. 3.9: Correlation
Let $X, Y$ be RVs with finite expected value, then the correlation of $X$ and $Y$ is defined as:
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X] \cdot \mathrm{Var}[Y]}}$$

Note
· Correlation is the same as covariance but normalized, with values between $-1$ and $1$.
· $X, Y$ indep. $\Rightarrow \mathrm{Corr}(X, Y) = \mathrm{Cov}(X, Y) = 0$.

Def. 3.10: Indicator Function
The indicator function $I_A$ for a set (event) $A$ is defined as:
$$I_A(\omega) := \begin{cases} 1 & \omega \in A \\ 0 & \omega \in A^C \end{cases}$$
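The identities $\mathrm{Var}[X] = E[X^2] - E[X]^2$ and $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$ can be verified numerically on a small discrete distribution (a sketch in plain Python; the joint pmf values are made up for illustration):

```python
# A small joint pmf on pairs (x, y): p[(x, y)] (toy numbers, sum to 1)
p = {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.3, (1, 2): 0.4}

def E(f):
    """Expected value of f(X, Y) under the joint pmf (Theorem 4, joint form)."""
    return sum(prob * f(x, y) for (x, y), prob in p.items())

ex, ey = E(lambda x, y: x), E(lambda x, y: y)       # E[X] = 0.7, E[Y] = 0.9
var_x = E(lambda x, y: (x - ex) ** 2)                # definition of Var[X]
cov = E(lambda x, y: (x - ex) * (y - ey))            # definition of Cov(X, Y)

# Var[X] = E[X^2] - E[X]^2
assert abs(var_x - (E(lambda x, y: x * x) - ex ** 2)) < 1e-12
# Cov(X, Y) = E[XY] - E[X] E[Y]
assert abs(cov - (E(lambda x, y: x * y) - ex * ey)) < 1e-12
```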
3.5 Joint Probability

Def. 3.11: Joint PDF
The joint probability density function $f_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$ with $\mathbf{X} = (X_1, \ldots, X_n)$ is a function defined as:
$$f_{\mathbf{X}}(x_1, \ldots, x_n) := P[X_1 = x_1, \ldots, X_n = x_n]$$
With $\mathbf{X}$ discrete we use $p_{\mathbf{X}}(\mathbf{x})$ instead of $f_{\mathbf{X}}(\mathbf{x})$.

Def. 3.12: Joint CDF
The joint cumulative distribution function $F_{\mathbf{X}} : \mathbb{R}^n \to [0, 1]$ with $\mathbf{X} = (X_1, \ldots, X_n)$ is a function defined as:
$$F_{\mathbf{X}}(x_1, \ldots, x_n) := P[X_1 \leq x_1, \ldots, X_n \leq x_n]$$
If the joint PDF is given it can be expressed as:
$$F_{\mathbf{X}}(\mathbf{x}) = \begin{cases} \sum_{t_1 \leq x_1} \cdots \sum_{t_n \leq x_n} p_{\mathbf{X}}(\mathbf{t}) & \text{discr.} \\ \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_{\mathbf{X}}(\mathbf{t})\,d\mathbf{t} & \text{cont.} \end{cases}$$
where $\mathbf{t} = (t_1, \ldots, t_n)$ and $\mathbf{x} = (x_1, \ldots, x_n)$.

Properties
· $\frac{\partial^n F_{\mathbf{X}}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{\mathbf{X}}(x_1, \ldots, x_n)$

Def. 3.13: Marginal PDF
The marginal probability density function $f_{X_i} : \mathbb{R} \to [0, 1]$ of $X_i$, given a joint PDF $f_{\mathbf{X}}(x_1, \ldots, x_n)$, is defined as:
$$f_{X_i}(t_i) = \begin{cases} \sum_{t_1} \cdots \sum_{t_{i-1}} \sum_{t_{i+1}} \cdots \sum_{t_n} p_{\mathbf{X}}(\mathbf{t}) & \text{discr.} \\ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{\mathbf{X}}(\mathbf{t})\,d\tilde{\mathbf{t}} & \text{cont.} \end{cases}$$
where $\tilde{\mathbf{t}} = (t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n)$, and in the discrete case $t_k \in \mathcal{W}(X_k)$.

Note
· The idea of the marginal probability is to ignore all other random variables and consider only the one we're interested in.

Def. 3.14: Marginal CDF
The marginal cumulative distribution function $F_{X_i} : \mathbb{R} \to [0, 1]$ of $X_i$, given a joint CDF $F_{\mathbf{X}}(x_1, \ldots, x_n)$, is defined as:
$$F_{X_i}(x_i) = \lim_{x_{j \neq i} \to \infty} F_{\mathbf{X}}(x_1, \ldots, x_n)$$

Def. 3.15: Conditional Distribution
The conditional distribution $f_{X|Y} : \mathbb{R} \to [0, 1]$ is defined as:
$$f_{X|Y}(x \mid y) := P[X = x \mid Y = y] = \frac{P[X = x, Y = y]}{P[Y = y]} = \frac{\text{Joint PDF}}{\text{Marginal PDF}}$$
With $X$ and $Y$ discrete we write $p_{X|Y}(x \mid y)$ instead of $f_{X|Y}(x \mid y)$.

3.6 Independence

Def. 3.16: Independence
The RVs $X_1, \ldots, X_n$ are independent if:
$$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i)$$
Similarly, if their PDF is absolutely continuous, they are independent if:
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$

Theorem 5: Function Independence
If the RVs $X_1, \ldots, X_n$ are independent, and $f_i : \mathbb{R} \to \mathbb{R}$ are functions with $Y_i := f_i(X_i)$, then $Y_1, \ldots, Y_n$ are also independent.

Theorem 6
The RVs $X_1, \ldots, X_n$ are independent iff $\forall B_i \subseteq \mathcal{W}(X_i)$:
$$P[X_1 \in B_1, \ldots, X_n \in B_n] = \prod_{i=1}^{n} P[X_i \in B_i]$$

3.7 Joint Functions

Def. 3.17: Joint Expected Value
The joint expected value of a RV $Y = g(X_1, \ldots, X_n) = g(\mathbf{X})$ is defined as:
$$E[Y] = \begin{cases} \sum_{t_1} \cdots \sum_{t_n} g(\mathbf{t})\, p_{\mathbf{X}}(\mathbf{t}) & \text{discr.} \\ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(\mathbf{t})\, f_{\mathbf{X}}(\mathbf{t})\,d\mathbf{t} & \text{cont.} \end{cases}$$
where $\mathbf{t} = (t_1, \ldots, t_n)$, and in the discrete case $t_k \in \mathcal{W}(X_k)$.

Def. 3.18: Conditional Expected Value
The conditional expected value of RVs $X, Y$ is:
$$E[X \mid Y](y) = \begin{cases} \sum_{x \in \mathbb{R}} x \cdot p_{X|Y}(x \mid y) & \text{discr.} \\ \int_{-\infty}^{\infty} x \cdot f_{X|Y}(x \mid y)\,dx & \text{cont.} \end{cases}$$

Properties
· $E[E[X \mid Y]] = E[X]$
· $E[X \mid Y](y) = E[X]$ if $X, Y$ indep.

Def. 3.19
Let $Y = g(X_1, \ldots, X_n) = g(\mathbf{X})$, then:
$$P[Y \in C] = \int_{A_C} f_{\mathbf{X}}(\mathbf{t})\,d\mathbf{t}$$
where $A_C = \{\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n \mid g(\mathbf{x}) \in C\}$ and $\mathbf{t} = (t_1, \ldots, t_n)$.

Theorem 7: Transformation
Let $F$ be a continuous and strictly increasing CDF and let $X \sim U(0, 1)$, then:
$$Y = F^{-1}(X) \Rightarrow F_Y = F$$

3.8 Evaluation

Guide 3.1: Monte Carlo Integration
Let $I = \int_a^b g(x)\,dx$ be the integral of a function that is hard to evaluate, then:
$$I = \int_a^b g(x)\,dx = (b - a) \int_a^b g(x)\, \frac{1}{b - a}\,dx = (b - a) \int_{-\infty}^{\infty} g(x)\, f_U(x)\,dx = (b - a) \cdot E[g(U)]$$
where $U \sim U(a, b)$ is uniformly distributed. Then by the LLN we know that we can approximate $E[g(U)]$ by randomly sampling $u_1, u_2, \ldots$ from $U(a, b)$:
$$\frac{b - a}{n} \sum_{i=1}^{n} g(u_i) \xrightarrow{n \to \infty} (b - a) \cdot E[g(U)]$$

Guide 3.2: Transformation
If we have a RV $X$ with known (strictly increasing) CDF and $Y = g(X)$, to evaluate $F_Y$ and $f_Y$ we proceed as follows:
(i) $F_Y(t) = P[g(X) \leq t] = \int_{A_g} f_X(s)\,ds$, where $A_g = \{s \in \mathbb{R} \mid g(s) \leq t\}$
(ii) $f_Y(t) = \frac{dF_Y(t)}{dt}$

Guide 3.3: Sum Convolution
Let $X_1, \ldots, X_n$ be independent RVs, then the sum $Z = X_1 + \cdots + X_n$ has a PDF $f_Z(z)$ evaluated with a convolution between all PDFs:
$$f_Z(z) = (f_{X_1} * \cdots * f_{X_n})(z)$$
In the special case that $Z = X + Y$:
$$f_Z(z) = \begin{cases} \sum_{x_k \in \mathcal{W}(X)} p_X(x_k)\, p_Y(z - x_k) & \text{discr.} \\ \int_{-\infty}^{\infty} f_X(t)\, f_Y(z - t)\,dt & \text{cont.} \end{cases}$$

Note
· It is often much easier to use properties of the RVs to find the sum instead of evaluating the convolution.



Guide 3.4: Product
Let $X, Y$ be independent RVs, then to evaluate the PDF and CDF of $Z = XY$ we proceed as follows:
$$F_Z(z) = P[XY \leq z] = P\left[X \geq \tfrac{z}{Y},\, Y < 0\right] + P\left[X \leq \tfrac{z}{Y},\, Y > 0\right]$$
$$= \int_{-\infty}^{0} \left[\int_{z/y}^{\infty} f_X(x)\,dx\right] f_Y(y)\,dy + \int_{0}^{\infty} \left[\int_{-\infty}^{z/y} f_X(x)\,dx\right] f_Y(y)\,dy$$
where the PDF is:
$$f_Z(z) = \frac{d}{dz} F_Z(z) = \int_{-\infty}^{\infty} f_Y(y)\, f_X\!\left(\frac{z}{y}\right) \frac{1}{|y|}\,dy$$

Guide 3.5: Quotient
Let $X, Y$ be independent RVs, then to evaluate the PDF and CDF of $Z = \frac{X}{Y}$ we proceed as follows:
$$F_Z(z) = P\left[\frac{X}{Y} \leq z\right] = P[X \geq zY,\, Y < 0] + P[X \leq zY,\, Y > 0]$$
$$= \int_{-\infty}^{0} \left[\int_{yz}^{\infty} f_X(x)\,dx\right] f_Y(y)\,dy + \int_{0}^{\infty} \left[\int_{-\infty}^{yz} f_X(x)\,dx\right] f_Y(y)\,dy$$
where the PDF is:
$$f_Z(z) = \frac{d}{dz} F_Z(z) = \int_{-\infty}^{\infty} |y|\, f_X(yz)\, f_Y(y)\,dy$$

3.9 Sum and Average
Let $X_1, \ldots, X_n$ be i.i.d. RVs with finite mean $\mu$ and standard deviation $\sigma$, and let $Z_n$ be the standardization of a RV $Y$. For the sum $S_n = \sum_{i=1}^{n} X_i$ and the average $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$:
· $E[S_n] = n\mu$ and $E[\overline{X}_n] = \mu$
· $\mathrm{Var}[S_n] = n\sigma^2$ and $\mathrm{Var}[\overline{X}_n] = \frac{\sigma^2}{n}$
· $\sigma(S_n) = \sqrt{n}\,\sigma$ and $\sigma(\overline{X}_n) = \frac{\sigma}{\sqrt{n}}$
· $Z_n = \frac{S_n - n\mu}{\sigma \sqrt{n}} = \frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}}$

3.10 Convergence

Def. 3.20: Probability Convergence
Let $X_1, X_2, \ldots$ and $Y$ be RVs on the same probability space, then:
(i) $X_1, X_2, \ldots$ converges to $Y$ in probability if:
$$\forall \varepsilon > 0 \quad \lim_{n \to \infty} P[|X_n - Y| > \varepsilon] = 0$$
(ii) $X_1, X_2, \ldots$ converges to $Y$ in $L^p$ for $p > 0$ if:
$$\lim_{n \to \infty} E[|X_n - Y|^p] = 0$$
(iii) $X_1, X_2, \ldots$ converges to $Y$ $P$-almost surely if:
$$P\left[\lim_{n \to \infty} X_n = Y\right] = P\left[\left\{\omega \in \Omega \mid \lim_{n \to \infty} X_n(\omega) = Y(\omega)\right\}\right] = 1$$

Def. 3.21: Distribution Convergence
Let $X_1, X_2, \ldots$ and $Y$ be RVs with CDFs $F_{X_1}, F_{X_2}, \ldots$ and $F_Y$, then $X_1, X_2, \ldots$ converges to $Y$ in distribution if:
$$\forall x \in \mathbb{R} \quad \lim_{n \to \infty} F_{X_n}(x) = F_Y(x)$$

3.11 Inequalities

Theorem 8: Markov Inequality
Let $X$ be a RV and $g : \mathcal{W}(X) \to [0, \infty)$ an increasing function, then for all $c$ with $g(c) > 0$ we have:
$$P[X \geq c] \leq \frac{E[g(X)]}{g(c)}$$
Note: for practical uses usually $g(x) = x$.

Theorem 9: Chebyshev Inequality
Let $X$ be a RV with $\mathrm{Var}[X] < \infty$, then for $b > 0$:
$$P[|X - E[X]| \geq b] \leq \frac{\mathrm{Var}[X]}{b^2}$$

Theorem 10
Let $X_1, \ldots, X_n$ be i.i.d. where $\forall t : M_X(t) < \infty$, then for any $b \in \mathbb{R}$:
$$P[S_n \geq b] \leq \exp\left(\inf_{t \in \mathbb{R}} \left(n \log M_X(t) - tb\right)\right)$$

Theorem 11: Chernoff Inequality
Let $X_1, \ldots, X_n$ with $X_i$ independent $\sim \mathrm{Be}(p_i)$ and $S_n := \sum_{i=1}^{n} X_i$, where $\mu_n = E[S_n] = \sum_{i=1}^{n} p_i$. Then if $\delta > 0$:
$$P[S_n \geq (1 + \delta)\mu_n] \leq \left(\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right)^{\mu_n} \approx O(e^{-n})$$

3.12 Limit Theorems

Theorem 12: Law of Large Numbers
Let $X_1, X_2, \ldots$ be i.i.d. RVs with finite mean $\mu$. Let $\overline{X}_n$ be the average of the first $n$ variables, then the law of large numbers (LLN) says that (different versions):
(i) Weak: $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{n \to \infty} \mu$
(ii) Weak: $\forall \varepsilon \quad P\left[\left|\overline{X}_n - \mu\right| > \varepsilon\right] \xrightarrow{n \to \infty} 0$
(iii) Weak: $\forall \varepsilon \quad P\left[\left|\overline{X}_n - \mu\right| < \varepsilon\right] \xrightarrow{n \to \infty} 1$
(iv) Strong: $P\left[\left\{\omega \in \Omega \mid \overline{X}_n(\omega) \xrightarrow{n \to \infty} \mu\right\}\right] = 1$

Note
· The law of large numbers says that if we average $n$ i.i.d. RVs, then as $n$ increases the average becomes ever more likely to be close to the expected value of the RVs: $\overline{X}_n \approx \mu$.

Properties
· $\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(X_i) = E[f(X)]$

Theorem 13: Central Limit Theorem
Let $X_1, \ldots, X_n$ be i.i.d. RVs with finite mean $\mu$ and standard deviation $\sigma$. Let $Z_n$ be a standardization, then for any $z \in \mathbb{R}$:
$$\lim_{n \to \infty} F_{Z_n}(z) = \lim_{n \to \infty} P[Z_n \leq z] = \Phi(z)$$
A practical application is that for $n$ big:
(i) $P[Z_n \leq z] \approx \Phi(z)$
(ii) $Z_n \approx \mathcal{N}(0, 1)$
(iii) $S_n \approx \mathcal{N}(n\mu, n\sigma^2)$
(iv) $\overline{X}_n \approx \mathcal{N}(\mu, \frac{\sigma^2}{n})$

Note
· The idea is that any (normalized) sum or average of RVs approaches a (standard) normal distribution as $n$ gets bigger.
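Theorem 7 is the basis of inverse-transform sampling; a sketch drawing $\mathrm{Exp}(\lambda)$ samples from $U(0, 1)$ using $F^{-1}(u) = -\ln(1 - u)/\lambda$ (the seed, $\lambda = 2$ and the sample size are arbitrary illustrative choices):

```python
import math
import random

random.seed(1)

def sample_exponential(lam, n):
    """Inverse-transform sampling (Theorem 7): Y = F^{-1}(U) with
    F(y) = 1 - exp(-lam*y)  =>  F^{-1}(u) = -ln(1 - u) / lam."""
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

lam = 2.0
ys = sample_exponential(lam, 100_000)
mean = sum(ys) / len(ys)
print(mean)  # should be close to E[Y] = 1/lam = 0.5, by the LLN
```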

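The central limit theorem can be seen empirically: standardized sums of i.i.d. uniform variables are approximately $\mathcal{N}(0, 1)$. A sketch comparing the empirical $P[Z_n \leq 1]$ against $\Phi(1) \approx 0.8413$ (seed, $n$ and trial count are arbitrary illustrative choices):

```python
import math
import random

random.seed(2)

def phi(z):
    """Standard normal CDF Phi(z), written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, trials = 30, 20_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and sd of U(0, 1)

hits = 0
for _ in range(trials):
    s_n = sum(random.random() for _ in range(n))
    z_n = (s_n - n * mu) / (sigma * math.sqrt(n))   # standardization Z_n
    if z_n <= 1.0:
        hits += 1

print(hits / trials, phi(1.0))  # both close to 0.8413
```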


4 Estimators

4.1 Basics
Let $X_1, \ldots, X_n$ be i.i.d. RVs drawn according to some distribution $P_\theta$ parametrized by $\theta = (\theta_1, \ldots, \theta_m) \in \Theta$, where $\Theta$ is the set of all possible parameters for the selected distribution. The goal is to find the best estimator $\hat{\theta} \in \Theta$ such that $\hat{\theta} \approx \theta$, since the real $\theta$ cannot be known exactly from a finite sample.

Def. 4.1: Estimator
An estimator $\hat{\theta}_j$ for a parameter $\theta_j$ is a RV $\hat{\theta}_j(X_1, \ldots, X_n)$, expressed as a function of the observed data.

Def. 4.2: Estimate
An estimate $\hat{\theta}_j(x_1, \ldots, x_n)$ is a realization of the estimator RV; it is the real value for the estimated parameter.

Def. 4.3: Bias
The bias of an estimator $\hat{\theta}$ is defined as:
$$\mathrm{Bias}_\theta[\hat{\theta}] := E_\theta[\hat{\theta}] - \theta = E_\theta[\hat{\theta} - \theta]$$
We say that an estimator is unbiased if:
$$\mathrm{Bias}_\theta[\hat{\theta}] = 0 \quad \text{or} \quad E_\theta[\hat{\theta}] = \theta$$

Def. 4.4: Mean Squared Error
The mean squared error (MSE) of an estimator $\hat{\theta}$ is defined as:
$$\mathrm{MSE}_\theta[\hat{\theta}] := E[(\hat{\theta} - \theta)^2] = \mathrm{Var}_\theta[\hat{\theta}] + (E_\theta[\hat{\theta}] - \theta)^2$$

Def. 4.5: Consistent
A sequence of estimators $\hat{\theta}^{(n)}$ of the parameter $\theta$ is called consistent if for any $\varepsilon > 0$:
$$P_\theta[|\hat{\theta}^{(n)} - \theta| > \varepsilon] \xrightarrow{n \to \infty} 0$$

Note
· The idea is that an estimator is consistent only if, as the sample data increases, the estimator approaches the real parameter.

4.2 Maximum-Likelihood Method

Def. 4.6: Likelihood Function
The likelihood function $L$ is defined as:
$$L(x_1, \ldots, x_n; \theta) = \begin{cases} p(x_1, \ldots, x_n; \theta) & \text{discr.} \\ f(x_1, \ldots, x_n; \theta) & \text{cont.} \end{cases}$$

Def. 4.7: MLE
The maximum likelihood estimator $\hat{\theta}$ for $\theta$ is defined as:
$$\hat{\theta} \in \operatorname*{argmax}_{\theta \in \Theta} L(X_1, \ldots, X_n; \theta)$$

Guide 4.1: Evaluation
Given an i.i.d. sample of data $x_1, \ldots, x_n$ and a distribution $P_\theta$:
(i) Identify the parameters $\theta = (\theta_1, \ldots, \theta_m)$ for the given distribution (e.g. if normal, $\theta = (\theta_1 = \mu, \theta_2 = \sigma^2)$).
(ii) Find the log-likelihood; we use the log of the likelihood since it is much easier to differentiate afterwards, and the maximum of $L$ is preserved ($\forall \theta_j$):
$$g(\theta_j) := \log L(x_1, \ldots, x_n; \theta_j) = \log \prod_{i=1}^{n} f(x_i; \theta_j)$$
The goal here is to split $f$ into as many sums as possible using log properties (easier to differentiate).
(iii) Find the maximum of the log-likelihood. Note that if the distribution is simple it might be easier to use the plain likelihood function and manually find the max, and if the distribution is hard we might have to use iterative methods instead of differentiation. Then for each parameter $\theta_j$:
$$\frac{dg}{d\theta_j} \overset{\text{MAX}}{=} 0$$
Often, inside the derivative set to 0, we want to find a sum or average ($S_n$, $\overline{X}_n$).
(iv) State the final MLE, where each parameter estimator is the max found for $\theta_j$: $\hat{\theta}_{MLE} = (\hat{\theta}_1, \ldots, \hat{\theta}_m)$

Properties
Useful to simplify the MLM:
· $\prod_{i=1}^{n} a \cdot x_i = a^n \prod_{i=1}^{n} x_i$
· $\log\left(\prod_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \log(x_i)$
· $\log\left(\prod_{i=1}^{n} e^{a \cdot x_i}\right) = a \sum_{i=1}^{n} x_i$

4.3 Method of Moments

Def. 4.8: Theoretical Moments
Let $X$ be a RV, then:
(i) The $k$th moment of $X$ is: $\mu_k := m_k = E[X^k]$
(ii) The $k$th central moment of $X$ is: $\mu_k^* := m_k^* = E[(X - \mu)^k]$
(iii) The $k$th absolute moment of $X$ is: $M_k := E[|X|^k]$ (not used for MOM)

Def. 4.9: Sample Moments
Let $X$ be a RV, then given a sample $x_1, \ldots, x_n$, using the law of large numbers:
(i) The $k$th sample moment is evaluated as:
$$\hat{\mu}_k(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i^k$$
(ii) The $k$th central sample moment is evaluated as:
$$\hat{\mu}_k^*(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_1)^k$$

Guide 4.2: Evaluation
Given an i.i.d. sample of data $x_1, \ldots, x_n$ and a distribution $P_\theta$:
(i) Identify the parameters $\theta = (\theta_1, \ldots, \theta_m)$ for the given distribution.
(ii) Since the distribution is given, the expected value $E_\theta[X] = g_1(\theta_1, \ldots, \theta_m)$ and variance $\mathrm{Var}_\theta[X] = g_2(\theta_1, \ldots, \theta_m)$ are known. The functions $g_i$ with $0 \leq i \leq m$ are parametrized by $\theta$, and each of them is equal to a theoretical moment.
(iii) Since we also have the sample data to work with, we can equate the theoretical moments to the moment estimators:
$$g_1(\theta_1, \ldots, \theta_m) = \hat{\mu}_1(x_1, \ldots, x_n)$$
$$g_2(\theta_1, \ldots, \theta_m) = \hat{\mu}_2^*(x_1, \ldots, x_n)$$
$$\vdots$$
$$g_m(\theta_1, \ldots, \theta_m) = \hat{\mu}_m^*(x_1, \ldots, x_n)$$
(iv) Now, since there are $m$ equations and $m$ unknown thetas, we can solve for each $\theta$ and set it as the estimator: $\hat{\theta}_{MOM} = (\hat{\theta}_1, \ldots, \hat{\theta}_m)$

Note
· The first moment is the expected value, estimated with $\hat{\mu}_1(x_1, \ldots, x_n) = \overline{x}_n$ (the average), and the second central moment is the variance, estimated with $\hat{\mu}_2^*(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x}_n)^2$. Note that we always use the central moments for $i > 1$.
· If we are given only the PDF of a distribution, we can still evaluate the theoretical moments by solving the expected value integral (or summation if discrete).
· To check if $\hat{\theta}_i$ is unbiased, we solve $E_\theta[\hat{\theta}_i]$ (that it is parametrized by $\theta$ is important) and check whether it equals $\theta$.
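The MLE recipe of Guide 4.1 can be checked on Bernoulli data, where the closed form is $\hat{p} = \overline{x}_n$; a sketch comparing a brute-force grid search over the log-likelihood against the closed form (the sample values are made up for illustration):

```python
import math

# Hypothetical Bernoulli sample (7 successes out of 10)
xs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p):
    """g(p) = log L(x_1..x_n; p) = sum_i log f(x_i; p), f Bernoulli(p)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in xs)

# Grid search over Theta = (0, 1) instead of differentiating
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

closed_form = sum(xs) / len(xs)   # the analytic MLE: p_hat = x_bar
print(p_hat, closed_form)         # 0.7 0.7
```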

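The method of moments of Guide 4.2 has a one-line solution for $\mathrm{Exp}(\lambda)$: equating the theoretical first moment $g_1(\lambda) = E[X] = 1/\lambda$ to the sample moment $\hat{\mu}_1 = \overline{x}_n$ gives $\hat{\lambda} = 1/\overline{x}_n$. A sketch (the sample values are made up for illustration):

```python
# Method of moments for Exp(lambda): E[X] = 1/lambda,
# so solving g_1(lambda) = x_bar yields lambda_hat = 1 / x_bar.
xs = [0.3, 1.2, 0.7, 2.1, 0.4, 0.9, 1.5, 0.5]  # hypothetical sample

x_bar = sum(xs) / len(xs)   # first sample moment mu_hat_1
lam_hat = 1.0 / x_bar       # MOM estimator

print(round(lam_hat, 4))    # 1.0526 for this sample (x_bar = 0.95)
```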


5 Hypothesis Testing

Let $X_1, \ldots, X_n$ be i.i.d. RVs distributed according to some distribution $P_\theta$ parametrized by $\theta = (\theta_1, \ldots, \theta_m) \in \Theta$, where $\Theta = \Theta_0 \cup \Theta_A$ is the set of all possible parameters for the selected distribution, divided into two distinct subsets $\Theta_0 \cap \Theta_A = \emptyset$. The goal is to test whether the unknown $\theta$ lies inside $\Theta_0$ or $\Theta_A$; this decision system is written as $H_0 : \theta \in \Theta_0$ (null hypothesis) and $H_A : \theta \in \Theta_A$ (alternative hypothesis).

Def. 5.1: Test
Concretely, a test is composed of a function of the sample $t(x_1, \ldots, x_n) = t$ and a rejection region $K \subseteq \mathbb{R}$. The decision of the test is then written as a RV:
$$I_{t \in K} = \begin{cases} 1, & t \in K : \text{reject } H_0 \\ 0, & t \notin K : \text{do not reject } H_0 \end{cases}$$

Def. 5.2: Test Statistic
The test statistic $T(X_1, \ldots, X_n)$ is a RV; it is distributed according to some standard statistic ($z$, $t$, $\chi^2$).

5.1 Steps
(i) Model: identify the model $P_\theta$, i.e. which distribution $X_i$ i.i.d. $\sim P_\theta$ follows and what the known and unknown parameters of $\theta$ are.
(ii) Hypothesis: identify the null and alternative hypotheses; in the null hypothesis we should explicitly state the given parameter values.
(iii) Statistic: identify the test statistic $T$ of $H_0$ and $H_A$ based on the sample size $n$ and the number of known parameters of $P_\theta$.
(iv) $H_0$ Statistic: state the distribution of the test statistic under $H_0$.
(v) Rejection Region: based on the test statistic and the significance level $\alpha$, evaluate the rejection region $K$.
(vi) Result: based on the observed data and the rejection region, reject $H_0$ or don't reject $H_0$.
(vii) Errors (optional): compute the probability of error, significance and power to decide how reliable the test result is.

5.2 Hypotheses
To test a hypothesis we must establish the null $H_0$ and alternative $H_A$ hypotheses. The null hypothesis is the default set of parameters $\theta$, or what we expect to happen if our experiment fails and the alternative hypothesis is rejected.

Right-Tailed (RT): $H_0 : \theta = \theta_0$, $H_A : \theta > \theta_0$ (accept $H_0$ for small values of the statistic, reject $H_0$ for large values).
Left-Tailed (LT): $H_0 : \theta = \theta_0$, $H_A : \theta < \theta_0$ (reject $H_0$ for small values of the statistic).
Two-Tailed (TT): $H_0 : \theta = \theta_0$, $H_A : \theta \neq \theta_0$ (reject $H_0$ in both tails, each of mass $\frac{\alpha}{2}$).
[Figures: for each case, the densities under $H_0$ and $H_A$ with the regions $\alpha$, $\beta$, $1 - \alpha$, $1 - \beta$ and the cutoff $c$ marked.]

5.3 Statistic
Choice of test by data distribution $X_i$, sample size $n$ and knowledge of $\sigma^2$:
· $\mathcal{N}(\mu, \sigma^2)$, any $n$, $\sigma^2$ known: z-Test
· $\mathcal{N}(\mu, \sigma^2)$, small $n$, $\sigma^2$ unknown: t-Test
· any distribution, any $n$, any $\sigma^2$: LR-Test

LR-Test

Def. 5.3: Likelihood-Ratio
Let $L(x_1, \ldots, x_n; \theta)$ be the likelihood function, where $\theta_0 \in \Theta_0$ and $\theta_A \in \Theta_A$; then the likelihood ratio is defined as:
$$R(x_1, \ldots, x_n; \theta_0, \theta_A) := \frac{L(x_1, \ldots, x_n; \theta_0)}{L(x_1, \ldots, x_n; \theta_A)}$$

Note
· The intuition is that the likelihood function will tend to be the highest near the true value of $\theta$; thus, by evaluating the likelihood ratio $R$ between $\theta_0$ and $\theta_A$, we can conclude that if $R < 1$ the probability of getting the observed data is higher under $H_A$, whereas if $R > 1$ the probability of getting the observed data is higher under $H_0$.

Theorem 14: Neyman-Pearson
Let $T := R(x_1, \ldots, x_n; \theta_0, \theta_A)$ be the test statistic, $K := [0, c)$ the rejection region, and $\alpha^* := P_{\theta_0}[T \in K] = P_{\theta_0}[T < c]$. Then for any other test $(T', K')$ with $P_{\theta_0}[T' \in K'] \leq \alpha^*$ we have:
$$P_{\theta_A}[T' \in K'] \leq P_{\theta_A}[T \in K]$$

Note
· The idea of the lemma is that making a decision based on the likelihood-ratio test with $T$ and $K$ will maximise the power of the test; any other test will have a smaller power. Thus, given a fixed $\alpha^*$, this is the best way to do hypothesis testing.

z-Test

Def. 5.4: z-Test
The z-test is used when the data follows a normal distribution and $\sigma^2$ is known.
(i) Statistic under $H_0$:
$$T = \frac{\overline{X}_n - \mu_0}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)$$
(ii) Rejection region:
· RT: $K = [z_{1-\alpha}, \infty)$
· LT: $K = (-\infty, z_\alpha]$
· TT: $K = (-\infty, z_{\alpha/2}] \cup [z_{1-\alpha/2}, \infty)$

Properties
· $\Phi^{-1}(\alpha) = z_\alpha = -z_{1-\alpha}$
· $z_{0.95} = 1.645$, $z_{0.975} = 1.960$

t-Test

Def. 5.5: t-Test
The t-test is used when the data follows a normal distribution, $n$ is small (usually $n < 30$) and $\sigma^2$ is unknown.
(i) Statistic under $H_0$:
$$T = \frac{\overline{X}_n - \mu_0}{S / \sqrt{n}} \sim t(n - 1) \quad \text{where} \quad S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$$
(ii) Rejection region:
· RT: $K = [t_{n-1,1-\alpha}, \infty)$
· LT: $K = (-\infty, t_{n-1,\alpha}]$
· TT: $K = (-\infty, t_{n-1,\alpha/2}] \cup [t_{n-1,1-\alpha/2}, \infty)$

Properties
· $t_{m,\alpha} = -t_{m,1-\alpha}$
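The z-test steps can be sketched end to end (all data hypothetical: testing $H_0 : \mu = 5$ against $H_A : \mu \neq 5$ with known $\sigma = 2$ at $\alpha = 0.05$):

```python
import math

# Hypothetical sample; H0: mu = 5 vs HA: mu != 5, sigma = 2 known
xs = [5.1, 6.2, 4.8, 5.9, 6.5, 5.4, 6.1, 5.8, 6.3, 5.7]
n = len(xs)
mu0, sigma = 5.0, 2.0

x_bar = sum(xs) / n
t_obs = (x_bar - mu0) / (sigma / math.sqrt(n))  # ~ N(0, 1) under H0

# Two-tailed rejection region at alpha = 0.05:
# K = (-inf, z_{0.025}] U [z_{0.975}, inf), with z_{0.975} = 1.960
z_crit = 1.960
reject = abs(t_obs) >= z_crit
print(round(t_obs, 3), reject)  # t_obs ~ 1.233, so H0 is not rejected
```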



Two-Sample Tests

Def. 5.6: Paired Two-Sample Test
The paired two-sample test is used when we have $Y_1, \ldots, Y_n$ i.i.d. $\sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ and $Z_1, \ldots, Z_n$ i.i.d. $\sim \mathcal{N}(\mu_Z, \sigma_Z^2)$, and we form the differences $X_i = Y_i - Z_i$. Then $X_1, \ldots, X_n$ are i.i.d. $\sim \mathcal{N}(\mu_Y - \mu_Z, \sigma^2)$; thus if $\sigma$ is known we proceed with a z-test on $X$, otherwise with a t-test on $X$.

Def. 5.7: Unpaired Two-Sample Test
The unpaired two-sample test is used when we have $X_1, \ldots, X_n$ i.i.d. $\sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y_1, \ldots, Y_m$ i.i.d. $\sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, with $X_i, Y_j$ independent.

For known $\sigma_X, \sigma_Y$:
(i) Hypothesis: $H_0 : \mu_X - \mu_Y = \mu_0$
(ii) Statistic under $H_0$:
$$T = \frac{\overline{X}_n - \overline{Y}_m - \mu_0}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} \sim \mathcal{N}(0, 1)$$
(iii) Rejection region:
· RT: $K = [z_{1-\alpha}, \infty)$
· LT: $K = (-\infty, z_\alpha]$
· TT: $K = (-\infty, z_{\alpha/2}] \cup [z_{1-\alpha/2}, \infty)$

For unknown $\sigma_X = \sigma_Y > 0$:
(i) Hypothesis: $H_0 : \mu_X - \mu_Y = \mu_0$
(ii) Statistic under $H_0$:
$$T = \frac{\overline{X}_n - \overline{Y}_m - \mu_0}{S \sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t_{n+m-2}$$
(iii) Rejection region ($d := n + m - 2$):
· RT: $K = [t_{d,1-\alpha}, \infty)$
· LT: $K = (-\infty, t_{d,\alpha}]$
· TT: $K = (-\infty, t_{d,\alpha/2}] \cup [t_{d,1-\alpha/2}, \infty)$

5.4 Errors, Significance, Power
We use the test statistic $T$, distributed according to $P_\theta$, to evaluate the probability of errors:
· $H_0$ true, reject ($T \in K$): Type 1 error ($\alpha$), false alarm / false positive.
· $H_0$ true, don't reject ($T \notin K$): correct decision.
· $H_0$ false, don't reject ($T \notin K$): Type 2 error ($\beta$), missed alarm / false negative.
· $H_0$ false, reject ($T \in K$): correct decision.

Probabilities:
· Type 1 error (significance level): $P[T \in K \mid H_0 \text{ true}] = P_{\theta_0}[T \in K] = \alpha$
· Type 2 error: $P[T \notin K \mid H_0 \text{ false}] = P_{\theta_A}[T \notin K] = \beta$
· Correct acceptance: $P[T \notin K \mid H_0 \text{ true}] = P_{\theta_0}[T \notin K] = 1 - \alpha$
· Power: $P[T \in K \mid H_0 \text{ false}] = P_{\theta_A}[T \in K] = 1 - \beta$

Note:
· The significance level should be small (near 0) and the power large (near 1).
· Smaller $\alpha$ $\Rightarrow$ smaller power.

5.5 P-Value

Def. 5.8: P-Value
The p-value is the probability of getting the observed value of the test statistic $T(\omega) = t(x_1, \ldots, x_n)$, or a value with even greater evidence against $H_0$, if the null hypothesis is actually true.
· RT: p-value $= P_{\theta_0}[T \geq T(\omega)]$
· LT: p-value $= P_{\theta_0}[T \leq T(\omega)]$
· TT: p-value $= P_{\theta_0}[|T| \geq |T(\omega)|]$

Note
· We can then still decide the test and reject $H_0$ if p-value $< \alpha$ ($\alpha = 0.01$: very strong evidence, $\alpha = 0.05$: strong evidence, $\alpha > 0.1$: weak evidence).
· The p-value can also be viewed as the smallest $\alpha^*$ such that $H_0$ is rejected given the observed value of the test statistic $t(x_1, \ldots, x_n)$.

5.6 Confidence Interval

Def. 5.9: Confidence Interval
Given $\alpha$ (the type-1 error) and an unknown parameter $\theta$, the confidence interval $C(X_1, \ldots, X_n) := [a, b]$ tells us that with probability at least $1 - \alpha$ the real parameter $\theta$ is contained in $C$ ($\theta \in C$). It is evaluated as:
$$1 - \alpha \leq P_\theta[\theta \in C(X_1, \ldots, X_n)] = P_\theta[a < \theta < b]$$
where $a$ and $b$ are:
(i) For $\theta := \mu$ and known $\sigma$:
$$a := \overline{X}_n - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \qquad b := \overline{X}_n + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$$
(ii) For $\theta := \mu$ and unknown $\sigma$:
$$a := \overline{X}_n - t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}, \qquad b := \overline{X}_n + t_{n-1,1-\alpha/2} \frac{S}{\sqrt{n}}$$
(iii) For $\theta := \sigma^2$ and unknown $\mu, \sigma$:
$$a := \frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \qquad b := \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}$$
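Case (i) of Def. 5.9 (known $\sigma$) in code; a sketch computing a 95% CI for $\mu$ (sample values and $\sigma = 0.5$ are hypothetical):

```python
import math

xs = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8]  # hypothetical sample
n = len(xs)
sigma = 0.5        # assumed known
z = 1.960          # z_{0.975}, since alpha = 0.05

x_bar = sum(xs) / n
half = z * sigma / math.sqrt(n)     # half-width z_{1-a/2} * sigma / sqrt(n)
a, b = x_bar - half, x_bar + half
print(f"95% CI for mu: [{a:.3f}, {b:.3f}]")  # centered on x_bar = 5.075
```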



6 Discrete Distributions

6.1 Discrete Uniform Distribution
Notation: $X \sim U(a, b)$
Experiment: What is the probability that we pick the value $x$, knowing that all $n = b - a + 1$ values between $a$ and $b$ are equally likely to be picked?
Support: $x \in \{a, a+1, \ldots, b-1, b\}$
$p_X(x) = \frac{1}{n}$
$F_X(x) = \frac{x - a + 1}{n}$
$E[X] = \frac{a + b}{2}$
$\mathrm{Var}[X] = \frac{(b - a + 1)^2 - 1}{12}$

6.2 Bernoulli Distribution
Notation: $X \sim \mathrm{Be}(p)$
Experiment: What is the probability of success or failure, if success has probability $p$?
Support: $x \in \{0, 1\}$
$p_X(x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \end{cases}$
$F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \leq x < 1 \\ 1 & x \geq 1 \end{cases}$
$E[X] = p$
$\mathrm{Var}[X] = p(1 - p)$

6.3 Binomial Distribution
Notation: $X \sim \mathrm{Bin}(n, p)$
Experiment: What is the probability of $x$ successes in $n$ trials, if one success has probability $p$?
Support: $x \in \{0, 1, \ldots, n\}$
$p_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}$
$F_X(x) = \sum_{i=0}^{x} p_X(i)$
$E[X] = np$
$\mathrm{Var}[X] = np(1 - p)$
Properties
· Poisson approximation: if $X \sim \mathrm{Bin}(n, p)$ with $n \gg 0$ and $np < 5$, then $X \approx \mathrm{Poi}(np)$.
· Normal approximation: if $X \sim \mathrm{Bin}(n, p)$ with $n \gg 0$, $np > 5$ and $n(1 - p) > 5$, then with $q = P[a < X \leq b]$:
$$q \approx \Phi\left(\frac{b + \frac{1}{2} - np}{\sqrt{np(1 - p)}}\right) - \Phi\left(\frac{a + \frac{1}{2} - np}{\sqrt{np(1 - p)}}\right)$$

6.4 Geometric Distribution
Notation: $X \sim \mathrm{Geo}(p)$
Experiment: What is the probability of one success in $x$ trials, if one success has probability $p$?
Support: $x \in \{1, 2, \ldots\}$
$p_X(x) = (1 - p)^{x-1} \cdot p$
$F_X(x) = 1 - (1 - p)^x$
$E[X] = \frac{1}{p}$
$\mathrm{Var}[X] = \frac{1 - p}{p^2}$
Properties
· Memoryless: $P[X > m + n \mid X > m] = P[X > n]$
· Sum: $\left(\sum_{i=1}^{n} X_i \text{ with } X_i \text{ i.i.d.} \sim \mathrm{Geo}(p)\right) \sim \mathrm{NB}(n, p)$

6.5 Negative Binomial Distribution
Notation: $X \sim \mathrm{NB}(r, p)$
Experiment: What is the probability of $r$ successes in $x$ trials, if one success has probability $p$?
Support: $x \in \{r, r+1, r+2, \ldots\}$
$p_X(x) = \binom{x - 1}{r - 1} (1 - p)^{x - r} \cdot p^r$
$F_X(x) = \sum_{i=r}^{x} p_X(i)$
$E[X] = \frac{r}{p}$
$\mathrm{Var}[X] = \frac{r(1 - p)}{p^2}$

6.6 Hypergeometric Distribution
Notation: $X \sim \mathrm{HGeom}(n, m, r)$
Experiment: What is the probability of picking $x$ elements of type 1 in $m$ picks, if there are $r$ elements of type 1 and $n - r$ elements of type 2?
Support: $x \in \{0, 1, \ldots, \min(m, r)\}$
$p_X(x) = \binom{r}{x} \binom{n - r}{m - x} \Big/ \binom{n}{m}$
$F_X(x) = \sum_{i=0}^{x} p_X(i)$
$E[X] = \frac{rm}{n}$
$\mathrm{Var}[X] = \frac{rm(n - r)(n - m)}{n^2(n - 1)}$
Note
· The items are picked without replacement.

6.7 Poisson Distribution
Notation: $X \sim \mathrm{Poi}(\lambda)$
Experiment: What is the probability that $x$ events happen in one unit of time, knowing that on average $\lambda$ events happen in one unit of time?
Support: $x \in \{0, 1, \ldots\} = \mathbb{N}_0$
$p_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}$
$F_X(x) = e^{-\lambda} \sum_{i=0}^{x} \frac{\lambda^i}{i!}$
$E[X] = \lambda$
$\mathrm{Var}[X] = \lambda$
Properties
· Sum: let $X = \sum_{i=1}^{n} X_i$ where the $X_i \sim \mathrm{Poi}(\lambda_i)$ are independent, then $X \sim \mathrm{Poi}\left(\sum_{i=1}^{n} \lambda_i\right)$.
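The Poisson approximation to $\mathrm{Bin}(n, p)$ from Section 6.3 (large $n$, $np < 5$) can be checked directly; a sketch comparing the two pmfs for the illustrative choice $n = 1000$, $p = 0.003$:

```python
import math

n, p = 1000, 0.003
lam = n * p  # 3.0; np < 5, so the approximation should be good

def binom_pmf(x):
    """Bin(n, p) pmf: C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x):
    """Poi(lam) pmf: exp(-lam) lam^x / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

for x in range(6):
    print(x, round(binom_pmf(x), 5), round(poisson_pmf(x), 5))
# The two columns agree to roughly 3 decimal places.
```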



7 Continuous Distributions

7.1 Uniform Distribution
Notation: $X \sim U(a, b)$
Experiment: What is the probability that we pick the value $x$, knowing that all values between $a$ and $b$ are equally likely to be picked?
Support: $x \in [a, b]$
$f_X(x) = \begin{cases} \frac{1}{b - a} & a \leq x \leq b \\ 0 & \text{else} \end{cases}$
$F_X(x) = \begin{cases} 0 & x < a \\ \frac{x - a}{b - a} & a \leq x \leq b \\ 1 & x > b \end{cases}$
$E[X] = \frac{a + b}{2}$
$\mathrm{Var}[X] = \frac{(b - a)^2}{12}$

7.2 Normal Distribution
Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$
Experiment: What is the probability that we pick the number $x$, knowing that all values have a mean of $\mu$ and a standard deviation of $\sigma$?
Support: $x \in \mathbb{R}$
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
$F_X(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$ (use table)
$E[X] = \mu$
$\mathrm{Var}[X] = \sigma^2$
[Figure: normal density, with about 68.3% of the mass in $(\mu - \sigma, \mu + \sigma)$, 27.2% in the two bands between $\sigma$ and $2\sigma$, 4.2% between $2\sigma$ and $3\sigma$, and 0.2% beyond $3\sigma$.]
Properties
· If $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ and $Z \sim \mathcal{N}(\mu_Z, \sigma_Z^2)$ are independent, then $Y + Z \sim \mathcal{N}(\mu_Y + \mu_Z, \sigma_Y^2 + \sigma_Z^2)$.

7.3 Exponential Distribution
Notation: $X \sim \mathrm{Exp}(\lambda)$
Experiment: What is the probability that there are $x$ units of time until the next event, knowing that on average $\lambda$ events happen in one unit of time?
Support: $x \in [0, \infty)$
$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0 \\ 0 & x < 0 \end{cases}$
$F_X(x) = \begin{cases} 1 - e^{-\lambda x} & x \geq 0 \\ 0 & x < 0 \end{cases}$
$E[X] = \frac{1}{\lambda}$
$\mathrm{Var}[X] = \frac{1}{\lambda^2}$
Properties
· Memoryless: $P[X > m + n \mid X > m] = P[X > n]$

7.4 Gamma Distribution
Notation: $X \sim \mathrm{Ga}(\alpha, \lambda)$
Experiment: What is the probability that there are $x$ units of time until the next $\alpha$ events, knowing that on average $\lambda$ events happen in one unit of time?
Support: $x \in \mathbb{R}^+$
$f_X(x) = \begin{cases} \frac{1}{\Gamma(\alpha)} \lambda^\alpha x^{\alpha - 1} e^{-\lambda x} & x \geq 0 \\ 0 & x < 0 \end{cases}$
$F_X(x) = \int_0^x f_X(t)\,dt$
$E[X] = \frac{\alpha}{\lambda}$
$\mathrm{Var}[X] = \frac{\alpha}{\lambda^2}$
Properties
· If $X = \sum_{i=1}^{\alpha} Y_i$ with $Y_i$ i.i.d. $\sim \mathrm{Exp}(\lambda)$, then $X \sim \mathrm{Ga}(\alpha, \lambda)$
· $\mathrm{Ga}(1, \lambda) = \mathrm{Exp}(\lambda)$
Note
· The gamma function $\Gamma(z)$ is the continuous analogue of the factorial: $\Gamma(n) = (n - 1)!$ for $n > 0$, and it is defined as $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx$.

7.5 Beta Distribution
Notation: $X \sim \mathrm{Beta}(\alpha, \beta)$
Experiment: -
Support: $x \in [0, 1]$
$f_X(x) = \begin{cases} \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)} & x \in [0, 1] \\ 0 & \text{else} \end{cases}$ where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$
$F_X(x) = \int_0^x f_X(t)\,dt$
$E[X] = \frac{\alpha}{\alpha + \beta}$
$\mathrm{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$

7.6 $\chi^2$ Distribution
Notation: $X \sim \chi^2(k)$
Experiment: -
Support: $x \in [0, \infty)$, or $x \in (0, \infty)$ if $k = 1$
$f_X(x) = \begin{cases} \frac{1}{2^{k/2}\, \Gamma(k/2)}\, x^{\frac{k}{2} - 1} e^{-\frac{x}{2}} & x \geq 0 \\ 0 & x < 0 \end{cases}$
$F_X(x) = \int_{-\infty}^x f_X(t)\,dt$
$E[X] = k$
$\mathrm{Var}[X] = 2k$
Properties
· Let $X_1, \ldots, X_n$ be i.i.d. with $X_i \sim \mathcal{N}(0, 1)$, then $Y = \sum_{i=1}^{n} X_i^2 \sim \chi^2(n)$
· $X \sim \chi^2(n) \Leftrightarrow X \sim \mathrm{Ga}(\alpha = \frac{n}{2}, \lambda = \frac{1}{2})$

7.7 t-Distribution
Notation: $X \sim t(n)$
Experiment: -
Support: $x \in \mathbb{R}$
$f_X(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\, \Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$
$F_X(x)$: $t_{n,x}$ (use t-table)
$E[X] = 0$ (for $n > 1$)
$\mathrm{Var}[X] = \frac{n}{n - 2}$ (for $n > 2$)
Properties
· $X \sim t(n = 1) \Rightarrow X \sim \mathrm{Cauchy}$
· $X \sim t(n \to \infty) \Rightarrow X \sim \mathcal{N}(0, 1)$
· If $n > 30$ we can usually approximate the t-distribution with a normal distribution.

7.8 Cauchy Distribution
Notation: $X \sim \mathrm{Cauchy}(t, s)$
Experiment: -
Support: $x \in \mathbb{R}$
$f_X(x) = \frac{1}{\pi s \left(1 + \left(\frac{x - t}{s}\right)^2\right)}$
$F_X(x) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{x - t}{s}\right)$
$E[X]$: undefined ($t$ is the location, i.e. the median)
$\mathrm{Var}[X]$: undefined
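$\Phi$ has no closed form, but it can be written with `math.erf`; a sketch verifying the 68.3% / 95.4% / 99.7% coverage of the normal distribution at $\mu \pm k\sigma$:

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """F_X(x) = Phi((x - mu) / sigma), written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P[mu - k*sigma < X <= mu + k*sigma] for k = 1, 2, 3
for k in (1, 2, 3):
    coverage = norm_cdf(k) - norm_cdf(-k)
    print(k, round(coverage, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```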

