Probability and Statistics: Cheat Sheet
Flavio Schneider
Def. 3.21: Distribution Convergence
Let X1, X2, . . . and Y be RVs with CDFs FX1, FX2, . . . and FY; then X1, X2, . . . converges to Y in distribution if:
∀x ∈ R: lim_{n→∞} FXn(x) = FY(x)
A practical application is that for n big:
(i) P[Zn ≤ z] ≈ Φ(z)
(ii) Zn ≈ N(0, 1)
(iii) Sn ≈ N(nµ, nσ²)
(iv) X̄n ≈ N(µ, σ²/n)
Note
· The idea is that any (normalized) sum or average of RVs approaches a (standard) normal distribution as n gets bigger.
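A quick numeric check of point (iv), as a minimal sketch in Python; the Exp(1) source distribution, the sample size and the threshold 1.05 are assumptions made for illustration:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n, reps = 1000, 10000

    # Exp(1) has mu = 1 and sigma^2 = 1, and is itself far from normal.
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    # Empirical P[X_bar_n <= 1.05] vs. the N(mu, sigma^2/n) approximation.
    print((means <= 1.05).mean())                           # ~0.94
    print(norm.cdf(1.05, loc=1.0, scale=np.sqrt(1.0 / n)))  # ~0.94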
4.1 Basics

Let X1, . . . , Xn be i.i.d. RVs, drawn according to some distribution Pθ parametrized by θ = (θ1, . . . , θm) ∈ Θ, where Θ is the set of all possible parameters for the selected distribution. The goal is then to find the best estimator θ̂ ∈ Θ such that θ̂ ≈ θ, since the real θ cannot be known exactly from a finite sample.

Def. 4.1: Estimator
An estimator θ̂j for a parameter θj is a RV θ̂j(X1, . . . , Xn) that is written as a function of the observed data.

Def. 4.2: Estimate
An estimate θ̂j(x1, . . . , xn) is a realization of the estimator RV: its real value for the estimated parameter.

Def. 4.3: Bias
The bias of an estimator θ̂ is defined as:
Biasθ[θ̂] := Eθ[θ̂] − θ = Eθ[θ̂ − θ]
We say that an estimator is unbiased if:
Biasθ[θ̂] = 0 or Eθ[θ̂] = θ

Def. 4.4: Mean Squared Error
The mean squared error (MSE) of an estimator θ̂ is defined as:
MSEθ[θ̂] := E[(θ̂ − θ)²] = Varθ[θ̂] + (Eθ[θ̂] − θ)²

Def. 4.5: Consistent
A sequence of estimators θ̂(n) of the parameter θ is called consistent if for any ε > 0:
Pθ[|θ̂(n) − θ| > ε] → 0 as n → ∞
Note
· The idea is that an estimator is consistent only if, as the sample data increases, the estimator approaches the real parameter.
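To make Defs. 4.3-4.5 concrete, a minimal simulation sketch (the normal model and all constants are assumptions): the sample mean X̄n is unbiased for µ, and its MSE behaves like σ²/n, shrinking as n grows, which is consistency in action.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0  # "true" parameters, known only to the simulation

    for n in (10, 100, 1000):
        # 20000 independent realizations of the estimator X_bar_n
        est = rng.normal(mu, sigma, size=(20000, n)).mean(axis=1)
        bias = est.mean() - mu           # ~0: X_bar_n is unbiased
        mse = ((est - mu) ** 2).mean()   # ~sigma^2/n: shrinks with n
        print(n, round(bias, 4), round(mse, 4))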
Def. 4.6: Likelihood Function
The likelihood function L is defined as:
L(x1, . . . , xn; θ) = p(x1, . . . , xn; θ) if discrete, f(x1, . . . , xn; θ) if continuous

Def. 4.7: MLE
The maximum likelihood estimator θ̂ for θ is defined as:
θ̂ ∈ argmax_{θ∈Θ} L(X1, . . . , Xn; θ)

Properties
Useful to simplify the MLE:
· Π_{i=1}^n (a · xi) = a^n · Π_{i=1}^n xi
· log Π_{i=1}^n xi = Σ_{i=1}^n log(xi)
· log Π_{i=1}^n e^{a·xi} = a · Σ_{i=1}^n xi

Guide 4.1: Evaluation
Given an i.i.d. sample of data x1, . . . , xn and a distribution Pθ:
(i) Identify the parameters θ = (θ1, . . . , θm) for the given distribution (e.g. if normal, θ = (θ1 = µ, θ2 = σ²)).
(ii) Find the log likelihood; we use the log of the likelihood since it is much easier to differentiate afterwards, and the maximum of L is preserved (∀θj):
g(θj) := log L(x1, . . . , xn; θj) = log Π_{i=1}^n f(xi; θj)
The goal here is to split f into as many sums as possible using the log properties above (easier to differentiate).
(iii) Find the maximum of the log likelihood; note that if the distribution is simple it might be easier to use the plain likelihood function and find the max manually, and if the distribution is hard we might have to use iterative methods instead of differentiation. Then for each parameter θj, set:
dg/dθj = 0
Note: often we want to isolate a sum or average (Sn, X̄n) inside the derivative set to 0.
(iv) State the final MLE, where each parameter estimator is the maximizer found for θj (a worked sketch follows below):
θ̂MLE = (θ̂1, . . . , θ̂m)
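A worked sketch of Guide 4.1 for the exponential density f(x; λ) = λe^{−λx}, an example assumed for illustration: step (ii) gives g(λ) = n log λ − λ Σ xi, step (iii) gives dg/dλ = n/λ − Σ xi = 0, hence λ̂ = 1/x̄n. The code checks the closed form against a grid search.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1 / 2.5, size=500)  # i.i.d. Exp(lambda = 2.5)

    # Log likelihood g(lambda) = n*log(lambda) - lambda*sum(x), product split by log.
    def g(lam):
        return len(x) * np.log(lam) - lam * x.sum()

    lam_hat = 1 / x.mean()               # closed form from dg/dlambda = 0
    grid = np.linspace(0.1, 10, 10_000)  # brute-force check of the maximum
    print(lam_hat, grid[np.argmax(g(grid))])  # both ~2.5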
Def. 4.8: Theoretical Moments
Let X be a RV, then:
(i) The kth moment of X is: µk := mk = E[X^k]
(ii) The kth central moment of X is: µ∗k := m∗k = E[(X − µ)^k]
(iii) The kth absolute moment of X is: Mk := E[|X|^k] (not used for MOM)

Def. 4.9: Sample Moments
Let X be a RV; then, given a sample x1, . . . , xn and using the Law of Large Numbers:
(i) The kth moment is evaluated as:
µ̂k(x1, . . . , xn) = (1/n) Σ_{i=1}^n xi^k
(ii) The kth central moment is evaluated as:
µ̂∗k(x1, . . . , xn) = (1/n) Σ_{i=1}^n (xi − µ̂1)^k

Note
· The first moment is the expected value, estimated with µ̂1(x1, . . . , xn) = x̄n (the average), and the second central moment is the variance, estimated with µ̂∗2(x1, . . . , xn) = (1/n) Σ_{i=1}^n (xi − x̄n)². Note that we always use the central moments for i > 1.
· If we are given only the PDF of a distribution we can still evaluate the theoretical moments by solving the expected value integral (or summation if discrete).
· To check whether θ̂i is unbiased we solve Eθ[θ̂i] (the parametrization by θ is important) and check whether it equals θ.

Guide 4.2: Evaluation
Given an i.i.d. sample of data x1, . . . , xn and a distribution Pθ:
(i) Identify the parameters θ = (θ1, . . . , θm) for the given distribution.
(ii) Since the distribution is given, the expected value Eθ[X] = g1(θ1, . . . , θm) and the variance Varθ[X] = g2(θ1, . . . , θm) are known. The functions gi with 1 ≤ i ≤ m are parametrized by θ and each of them is equal to a theoretical moment.
(iii) Since we also have the sample data to work with, we can equate the theoretical moments to the moment estimators:
g1(θ1, . . . , θm) = µ̂1(x1, . . . , xn)
g2(θ1, . . . , θm) = µ̂∗2(x1, . . . , xn)
⋮
gm(θ1, . . . , θm) = µ̂∗m(x1, . . . , xn)
(iv) Now, since there are m equations and m unknown thetas, we can solve for each θ and set it as the estimator (a worked sketch follows below):
θ̂MOM = (θ̂1, . . . , θ̂m)
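A worked sketch of Guide 4.2 for Gamma(α, β) data, an example assumed for illustration: here g1 = E[X] = α/β and g2 = Var[X] = α/β², so equating with µ̂1 and µ̂∗2 and solving gives α̂ = µ̂1²/µ̂∗2 and β̂ = µ̂1/µ̂∗2.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=3.0, scale=1 / 2.0, size=100_000)  # Gamma(alpha=3, rate beta=2)

    m1 = x.mean()                  # first sample moment
    m2c = ((x - m1) ** 2).mean()   # second central sample moment

    print(m1 ** 2 / m2c, m1 / m2c)  # alpha_hat ~ 3, beta_hat ~ 2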
Let X1, . . . , Xn be i.i.d. RVs, distributed according to some distribution Pθ parametrized by θ = (θ1, . . . , θm) ∈ Θ, where Θ = Θ0 ∪ ΘA is the set of all possible parameters for the selected distribution, divided into two disjoint subsets (Θ0 ∩ ΘA = ∅). The goal is then to test whether the unknown θ lies inside Θ0 or ΘA; this decision system is written as H0: θ ∈ Θ0 (null hypothesis) and HA: θ ∈ ΘA (alternative hypothesis).
To test a hypothesis we must establish the null H0 and alternative HA hypotheses. The null hypothesis is the default set of parameters θ, or what we expect to happen if our experiment fails and the alternative hypothesis is rejected.
Def. 5.1: Test
Concretely, a test is composed of a function of the sample t(x1, . . . , xn) = t and a rejection region K ⊆ R. The decision of the test is then written as a RV:
I_{t∈K} = 1 if t ∈ K (reject H0), 0 if t ∉ K (do not reject H0)
Def. 5.2: Test Statistic
The test statistic T(X1, . . . , Xn) is a RV; it is distributed according to some standard statistic (z, t, χ²).
5.1 Steps
(i) Model: identify the model Pθ, i.e. which distribution Xi i.i.d. ∼ Pθ follows and what the known and unknown parameters of θ are.
(ii) Hypothesis: identify the null and alternative hypotheses; in the null hypothesis we should explicitly state the given parameter values.
(iii) Statistic: identify the test statistic T of H0 and HA based on the sample size n and the number of known parameters of Pθ.
(iv) H0 Statistic: state the distribution of the test statistic under H0.
(v) Rejection Region: based on the test statistic and the significance level α, evaluate the rejection region K.
(vi) Result: based on the observed data and the rejection region, reject H0 or don't reject H0.
(vii) Errors (optional): compute the probability of error, the significance and the power to decide how reliable the test result is.
Right-Tailed (RT)
H0: θ = θ0, HA: θ > θ0
[Figure: densities of T under H0 and HA; accept H0 left of the critical value c, reject to its right; areas 1−α and β left of c, α and 1−β right of c.]
· K = [z_{1−α}, ∞)

Left-Tailed (LT)
H0: θ = θ0, HA: θ < θ0
[Figure: mirrored; reject H0 left of c, accept to its right; areas 1−β and α left of c, β and 1−α right of c.]
· K = (−∞, z_α]

Two-Tailed (TT)
H0: θ = θ0, HA: θ ≠ θ0
[Figure: reject H0 in both tails; each tail has area α/2 under H0, and the central acceptance region has mass 1−α.]
· K = (−∞, z_{α/2}] ∪ [z_{1−α/2}, ∞)

Note
· The significance level should be small (near 0) and the power large (near 1).
· Smaller α ⇒ smaller power.
Xi          n      σ²       Statistic
N(µ, σ²)    any    known    z-Test
N(µ, σ²)    small  unknown  t-Test
any         any    any      LR-Test

LR-Test

Def. 5.3: Likelihood-Ratio
Let L(x1, . . . , xn; θ) be the likelihood function, with θ0 ∈ Θ0 and θA ∈ ΘA; then the Likelihood-Ratio is defined as:
R(x1, . . . , xn; θ0, θA) := L(x1, . . . , xn; θ0) / L(x1, . . . , xn; θA)
Note
· The intuition is that the likelihood function will tend to be highest near the true value of θ; thus, by evaluating the Likelihood-Ratio R between θ0 and θA, we can conclude that if R < 1 the probability of getting the observed data is higher under HA, whereas if R > 1 the probability of getting the observed data is higher under H0.

Theorem 14: Neyman-Pearson
Let T := R(x1, . . . , xn; θ0, θA) be the test statistic, K := [0, c) the rejection region and α∗ := Pθ0[T ∈ K] = Pθ0[T < c]. Then for any other test (T′, K′) with Pθ0[T′ ∈ K′] ≤ α∗ we have:
PθA[T′ ∈ K′] ≤ PθA[T ∈ K]
Note
· The idea of the lemma is that making a decision based on the Likelihood-Ratio Test with T and K will maximise the power of the test; any other test at the same (or smaller) significance level will have a smaller power. Thus, given a fixed α∗, this is the best way to do hypothesis testing.
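A minimal Likelihood-Ratio sketch for two simple hypotheses on N(θ, 1) data; θ0 = 0, θA = 1 and the sample itself are assumptions for illustration:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 1.0, size=20)   # data generated under HA here

    def L(theta):
        # Likelihood of the whole i.i.d. sample under N(theta, 1)
        return np.prod(norm.pdf(x, loc=theta, scale=1.0))

    R = L(0.0) / L(1.0)
    print(R)  # R < 1: the observed data is more probable under HA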
Def. 5.4: z-Test
The z-test is used when the data follows a normal distribution and σ² is known.
(i) Statistic Under H0:
T = (X̄n − µ0) / (σ/√n) ∼ N(0, 1)
(ii) Rejection Region:
RT · K = [z_{1−α}, ∞)
LT · K = (−∞, z_α]
TT · K = (−∞, z_{α/2}] ∪ [z_{1−α/2}, ∞)
Properties
· Φ⁻¹(α) = z_α = −z_{1−α}
· z_{0.95} = 1.645, z_{0.975} = 1.960
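A right-tailed z-test sketch following the steps in 5.1; µ0, σ, α and the data are assumptions for illustration:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu0, sigma, alpha = 5.0, 2.0, 0.05   # H0: mu = 5, sigma known
    x = rng.normal(5.8, sigma, size=40)  # observed sample

    T = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))  # ~ N(0,1) under H0
    print(T, T >= norm.ppf(1 - alpha))   # reject H0 if T is in [z_{1-alpha}, inf)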
t-Test

Def. 5.5: t-Test
The t-test is used when the data follows a normal distribution, n is small (usually n < 30) and σ² is unknown.
(i) Statistic Under H0:
T = (X̄n − µ0) / (S/√n) ∼ t(n − 1)
where S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄n)²
(ii) Rejection Region:
RT · K = [t_{n−1,1−α}, ∞)
LT · K = (−∞, t_{n−1,α}]
TT · K = (−∞, t_{n−1,α/2}] ∪ [t_{n−1,1−α/2}, ∞)
Properties
· t_{m,α} = −t_{m,1−α}
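A right-tailed t-test sketch (all numbers assumed); the last line cross-checks against scipy's ttest_1samp, whose alternative keyword needs a reasonably recent scipy:

    import numpy as np
    from scipy.stats import t, ttest_1samp

    rng = np.random.default_rng(0)
    mu0, alpha = 5.0, 0.05
    x = rng.normal(6.0, 2.0, size=12)    # small n, sigma^2 unknown

    S = x.std(ddof=1)                    # sample std with the (n-1) denominator
    T = (x.mean() - mu0) / (S / np.sqrt(len(x)))  # ~ t(n-1) under H0
    print(T, T >= t.ppf(1 - alpha, df=len(x) - 1))

    print(ttest_1samp(x, popmean=mu0, alternative='greater'))  # same statistic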
For unknown σX = σY > 0:
(i) Hypothesis: H0: µX − µY = µ0
(ii) Statistic Under H0:
T = (X̄n − Ȳm − µ0) / (S · √(1/n + 1/m)) ∼ t(n + m − 2)
(iii) Rejection Region (d := n + m − 2):
RT · K = [t_{d,1−α}, ∞)
LT · K = (−∞, t_{d,α}]
TT · K = (−∞, t_{d,α/2}] ∪ [t_{d,1−α/2}, ∞)
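A sketch of this two-sample statistic; the pooled form of S² below is an assumption (the sheet does not spell it out here), and the samples and α are made up:

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=15)
    y = rng.normal(4.0, 2.0, size=12)   # same unknown sigma as x
    mu0, alpha = 0.0, 0.05              # H0: mu_X - mu_Y = 0
    n, m = len(x), len(y)

    # Pooled variance combining both samples (assumed definition of S^2).
    S2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    T = (x.mean() - y.mean() - mu0) / (np.sqrt(S2) * np.sqrt(1 / n + 1 / m))

    print(T, T >= t.ppf(1 - alpha, df=n + m - 2))  # right-tailed decision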
5.5 P-Value

Def. 5.8: P-Value
The p-value is the probability of getting the observed value of the test statistic T(ω) = t(x1, . . . , xn), or a value with even greater evidence against H0, if the null hypothesis is actually true.
RT · p-value = Pθ0[T ≥ T(ω)]
LT · p-value = Pθ0[T ≤ T(ω)]
TT · p-value = Pθ0[|T| ≥ |T(ω)|]
Note
· We can then still decide the test and reject H0 if p-value < α (p-value ≈ 0.01: very strong evidence against H0; p-value ≈ 0.05: strong evidence; p-value > 0.1: weak evidence).
· The p-value can also be viewed as the smallest α∗ such that H0 is rejected given the observed value of the test statistic t(x1, . . . , xn).
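A sketch computing the three p-value variants for the z-test statistic; the numbers are assumed, as before:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu0, sigma = 5.0, 2.0
    x = rng.normal(5.6, sigma, size=40)

    T_obs = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))

    print(norm.sf(T_obs))           # RT: P[T >= T(omega)]
    print(norm.cdf(T_obs))          # LT: P[T <= T(omega)]
    print(2 * norm.sf(abs(T_obs)))  # TT: P[|T| >= |T(omega)|]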
6.2 Bernoulli Distribution

Notation     X ∼ Be(p)
Experiment   What is the probability of success or failure, if success has probability p?
Support      x ∈ {0, 1}
pX(x)        1 − p if x = 0; p if x = 1
FX(x)        0 if x < 0; 1 − p if 0 ≤ x < 1; 1 if x ≥ 1
E[X]         p
Var[X]       p(1 − p)

Properties (Binomial)
· Poisson Approximation: If X ∼ Bin(n, p), n ≫ 0 and np < 5, then X ≈ Poi(np).
· Normal Approximation: If X ∼ Bin(n, p), n ≫ 0, np > 5 and n(1 − p) > 5, then:
P[a < X ≤ b] ≈ Φ((b + ½ − np) / √(np(1−p))) − Φ((a + ½ − np) / √(np(1−p)))

6.4 Geometric Distribution

Notation     X ∼ Geo(p)
Experiment   What is the probability of one success in x trials, if a success has probability p?
Support      x ∈ {1, 2, . . . }
pX(x)        (1 − p)^{x−1} · p
FX(x)        1 − (1 − p)^x
E[X]         1/p
Var[X]       (1 − p)/p²
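A quick simulation of the memoryless property listed below; p, m, n are assumed values, and numpy's geometric sampler counts trials starting at 1, matching the sheet's support:

    import numpy as np

    rng = np.random.default_rng(0)
    p, m, n = 0.3, 2, 3
    x = rng.geometric(p, size=1_000_000)  # support {1, 2, ...}

    print((x > m + n)[x > m].mean())  # P[X > m+n | X > m]
    print((x > n).mean())             # P[X > n]; both ~ (1-p)^n ~ 0.343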
Properties
· Memoryless: P[X > m + n | X > m] = P[X > n]
· Sum: If X1, . . . , Xn are i.i.d. ∼ Geo(p), then Σ_{i=1}^n Xi ∼ NB(n, p).

Properties (Poisson)
· Let X = Σ_{i=1}^n Xi where the Xi ∼ Poi(λi) are independent; then X ∼ Poi(Σ_{i=1}^n λi).
· If X = c + Y with a constant c ≠ 0 and Y ∼ Poi(λ), then X is not Poisson distributed (a shifted Poisson is not Poisson).

6.6 Hypergeometric Distribution

Notation     X ∼ HGeom(n, m, r)
Experiment   What is the probability of picking x elements of type 1 out of m draws, if there are r elements of type 1 and n − r elements of type 2?
Support      x ∈ {0, 1, . . . , min(m, r)}
pX(x)        (r choose x) · (n−r choose m−x) / (n choose m)
FX(x)        Σ_{i=0}^x pX(i)
E[X]         rm/n
Var[X]       rm(n − r)(n − m) / (n²(n − 1))
Note
· The items are picked without replacement.
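A pmf and mean check for the sheet's HGeom(n, m, r) parametrization; the numbers are assumed, and note that scipy's hypergeom takes its shape parameters in the order (population size, number of type-1 elements, number of draws):

    from math import comb
    from scipy.stats import hypergeom

    n, m, r = 20, 5, 7   # population, draws, type-1 elements
    x = 2

    manual = comb(r, x) * comb(n - r, m - x) / comb(n, m)
    print(manual)                     # ~0.3874
    print(hypergeom(n, r, m).pmf(x))  # scipy order: (population, type-1, draws)
    print(r * m / n)                  # E[X] = rm/n = 1.75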