Basic Definitions
Bayes’ Rule
Concentration Inequalities
Bibliography
2/36 Jörg Schäfer | Learning From Data | c b n a -1: Probability Theory: Recap and Prerequisites
Probability Theory
σ-Algebras
Definition 1
A system A of subsets of a set Ω is called a σ-algebra iff
1. Ω ∈ A
2. A ∈ A ⇒ Aᶜ ∈ A
3. Aᵢ ∈ A for all i ∈ N ⇒ ⋃_{i=1}^∞ Aᵢ ∈ A
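For a finite Ω the axioms can be checked mechanically: countable unions reduce to finite ones, so closure under complement and pairwise union suffices. A small illustrative Python sketch (not part of the lecture; the helper name `is_sigma_algebra` is ours):

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a finite family of subsets of omega.

    For finite omega, countable unions reduce to finite unions, so it is
    enough to check closure under complement and pairwise union.
    """
    fam = {frozenset(a) for a in family}
    if frozenset(omega) not in fam:                 # axiom 1: Omega in A
        return False
    for a in fam:                                   # axiom 2: closed under complement
        if frozenset(omega - a) not in fam:
            return False
    for a in fam:                                   # axiom 3 (finite case): unions stay in A
        for b in fam:
            if frozenset(a | b) not in fam:
                return False
    return True

omega = {1, 2, 3, 4}
power_set = [set(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

print(is_sigma_algebra(omega, power_set))                       # True
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))  # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))             # False: {2,3,4} is missing
```

The second family shows that σ-algebras strictly between {∅, Ω} and the power set exist.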
Generating σ-Algebras
Definition 2
Let C denote an arbitrary collection of subsets of Ω. Then we denote by A(C) the smallest σ-algebra containing C.
Theorem 3
A(C) as defined above is well-defined.
Measures
Definition 4
Let A denote a σ-algebra. A function µ : A → R ∪ {+∞} with
1. µ(∅) = 0
2. µ ≥ 0
3. if the Aᵢ ∈ A are pairwise disjoint, then
µ(⋃_{i=1}^∞ Aᵢ) = ∑_{i=1}^∞ µ(Aᵢ)
is called a measure on A.
Probability Spaces
Definition 5
A set Ω with a σ-algebra A including Ω and a measure P defined on A
such that P(Ω) = 1 is called a probability space with a probability
measure P. We usually use the notation (Ω, A, P).
Lemma 6
For any probability space, the following holds true:
P(∅) = 0
P(Aᶜ) = 1 − P(A)
P(A \ B) = P(A) − P(A ∩ B)
P(⋃ᵢ Aᵢ) ≤ ∑ᵢ P(Aᵢ).
Probability Space Examples
1. Any finite set Ω with the set of all subsets as a σ-algebra can be made into a probability space by choosing an arbitrary function g : Ω → R⁺ such that ∑_{x∈Ω} g(x) = 1 and specifying a probability measure as
P(A) := ∑_{x∈A} g(x)
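A minimal Python sketch of this construction (illustrative; the weight function g below is an arbitrary choice of ours, with dyadic weights so floating-point sums are exact):

```python
def make_prob_measure(weights):
    """Build P(A) = sum of g(x) over x in A from non-negative weights g summing to 1."""
    total = sum(weights.values())
    assert abs(total - 1.0) < 1e-9, "g must sum to 1"
    def P(A):
        return sum(weights[x] for x in A)
    return P

# a loaded four-sided die: g sums to 1 exactly (dyadic weights)
g = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
P = make_prob_measure(g)

print(P({1, 2, 3, 4}))  # 1.0, i.e. P(Omega) = 1
print(P({3, 4}))        # 0.25
print(P(set()))         # 0, i.e. P of the empty set is 0
```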
Random Variables
Definition 7
Let Ω be a measurable space, i.e. a space equipped with a σ-algebra A on Ω. Let B denote a measurable space with σ-algebra B. Then a map f : Ω → B is called measurable iff
∀B ∈ B : f⁻¹(B) ∈ A.
Definition 8
Let X be a measurable map from a probability space (Ω, A, P) to a
measurable space B. Then X is called a random variable.
Special cases:
1. Real-valued random variables with B = R (equipped with the usual σ-algebra of Lebesgue measurable sets).
2. Discrete random variables with B = M ⊆ Z
Push-forward Probability Measures
A random variable X provides a way to push forward a probability measure, by defining PX(B) := P(X⁻¹(B)) for all B ∈ B. The resulting probability measure PX on B is called the distribution (or law) of X.
Stochastic Processes
Definition 9
Let I be an arbitrary index set. Then any quadruple (Ω, A, P, (Xt)t∈I) is called a stochastic process, where (Ω, A, P) is a probability space and (Xt)t∈I is a collection of random variables with values in a common measure space E. For fixed ω ∈ Ω, the map I → E defined by t ↦ Xt(ω) is called a path of the process.
Common examples of index sets are the real line R or [0, 1].
(We ignore the technical details of how to construct such spaces and the associated σ-algebras, proper filtrations, etc.)
Brownian Motion
Definition 10
Let (Ω, A, P, (Xt)t∈R⁺) be a stochastic process with X₀ = 0, independent increments, continuous paths, and such that Xt − Xs has distribution N(0, t − s) for s ≤ t. Then we call Xt Brownian Motion or Wiener Process.
[Figure: a simulated sample path of a Wiener process.]
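A path of the process can be simulated by summing independent Gaussian increments, since Xt − Xs ∼ N(0, t − s). A short Python sketch (ours; parameter choices are arbitrary), producing data for a figure like the one above:

```python
import math
import random

def brownian_path(T=1.0, n=1000, seed=0):
    """Simulate a Wiener process on [0, T] by summing n independent
    N(0, dt) increments; X_t - X_s ~ N(0, t - s) holds by construction."""
    rng = random.Random(seed)
    dt = T / n
    path = [0.0]                      # X_0 = 0
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

path = brownian_path()
print(len(path))   # 1001 points on the grid 0, dt, 2*dt, ..., T
print(path[0])     # 0.0, the process starts at the origin
```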
Distribution Functions
Definition 11
Let X be a real-valued random variable on (Ω, A, P). Then
F : R → [0, 1], F (x ) := P(X ≤ x ) is called the distribution function of X .
Lemma 12
Properties:
lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1
F is monotonically increasing
F is continuous from the right
P(a < X ≤ b) = F(b) − F(a)
P(X > a) = 1 − F(a)
P(X = a) = F(a) − lim_{x↓0} F(a − x)
Distribution Functions: Example
Normal Distribution
[Figure: density of the standard normal distribution N(0, 1).]
F(x) := (1/√(2πσ²)) ∫_{−∞}^x e^{−t²/(2σ²)} dt
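Numerically, this distribution function is usually evaluated via the error function, Φ(x) = (1 + erf(x/(σ√2)))/2 for mean zero. An illustrative Python sketch (the function name is ours):

```python
import math

def normal_cdf(x, sigma=1.0):
    """F(x) = (1/sqrt(2*pi*sigma^2)) * integral_{-inf}^{x} exp(-t^2/(2*sigma^2)) dt,
    expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))             # 0.5 by symmetry
print(round(normal_cdf(1.96), 3))  # 0.975, the familiar two-sided 95% quantile
```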
Expectations
Definition 13
Let X be a real-valued random variable on (Ω, A, P). Then
E[X] := ∫ X(ω) dP(ω), if it exists, is called the Expectation (or Mean) of X.
Lemma 14
E[X] = ∫ x dF_X(x)
Moments
Definition 15
Let X be a real-valued random variable on (Ω, A, P). Then
Var[X] := E[(X − E[X])²], if it exists, is called the Variance of X. In general, we define the n-th central moment as µn[X] := E[(X − E[X])ⁿ], wherever the integrals exist.
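Empirical central moments can be estimated from a sample. A Python sketch (ours; the sample parameters are an arbitrary choice):

```python
import random

def central_moment(sample, n):
    """Empirical n-th central moment: mean of (x - sample mean)^n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** n for x in sample) / len(sample)

rng = random.Random(42)
sample = [rng.gauss(0.0, 2.0) for _ in range(100_000)]  # N(0, sigma=2)

print(central_moment(sample, 2))  # close to Var = sigma^2 = 4
print(central_moment(sample, 3))  # close to 0, since the normal is symmetric
```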
Independence of Events
Definition 16
Two events A and B are called independent (often written as A ⊥ B) iff
their joint probability equals the product of their probabilities:
P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Definition 17
Two random variables X and Y are independent iff the elements of the σ-algebras generated by them are independent; i.e. for every a and b, the events {X ≤ a} and {Y ≤ b} are independent (in the sense defined above).
Lemma 18
Let X and Y be independent. Then the (joint) distribution function of
X , Y factors as follows:
FX ,Y (x , y ) = FX (x )FY (y )
Covariance and Correlations
Covariance is a measure of the joint variability of two random variables:
Definition 19
Let X and Y be real-valued random variables on (Ω, A, P). Then
Cov[X, Y] := E[(X − E[X])(Y − E[Y])], if it exists, is called the covariance.
Definition 20
Let X and Y be real-valued random variables on (Ω, A, P). Then
ρ(X, Y) := Cov(X, Y)/√(Var(X) Var(Y)) is called the correlation.
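Sample covariance and correlation, sketched in Python (illustrative; the linear model Y = X + 0.5 Z below is an arbitrary choice of ours, for which Cov(X, Y) = Var(X) = 1):

```python
import math
import random

def cov(xs, ys):
    """Empirical covariance of two equally long samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    """Empirical correlation: cov normalized by the standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50_000)]
zs = [rng.gauss(0, 1) for _ in range(50_000)]
ys = [x + 0.5 * z for x, z in zip(xs, zs)]   # Y = X + 0.5 Z

print(cov(xs, ys))    # close to 1 = Var(X)
print(corr(xs, ys))   # close to 1/sqrt(1.25), about 0.89
print(corr(xs, zs))   # close to 0: X and Z are independent
```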
Conditional Probabilities
Definition 22
Let (Ω, A, P) be a probability space and A, B ∈ A with P(B) > 0. Then
the conditional probability of A with respect to B (or given B),
denoted by P(A|B) is defined as
P(A|B) := P(A ∩ B)/P(B).
Lemma 23
For any A and B with 0 < P(B) < 1 we have P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).
Conditional Probabilities – Example
Three tosses of a fair coin:
Let Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt} with the uniform probability measure
2 or more heads in 3 tosses: A = {hhh, hht, hth, thh} ⇒ P(A) = 1/2
What is the probability of A, given that the first toss lands head
(event B)?
By counting, we conclude P(A|B) = 3/4
Indeed, using the definition of conditional probability we also get
P(A ∩ B) = 3/8, P(B) = 1/2, thus P(A|B) = P(A ∩ B)/P(B) = 3/4
Similarly, what is the probability of A, given that the first toss lands
tail?
P(A|B c ) = 1/4
Indeed, P(A) = P(A|B)P(B) + P(A|B c )P(B c ) =
3/4 × 1/2 + 1/4 × 1/2 = 1/2.
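The counting argument above can be reproduced by enumerating Ω. An illustrative Python sketch:

```python
from itertools import product

omega = [''.join(t) for t in product('ht', repeat=3)]   # 8 equally likely outcomes

A = {w for w in omega if w.count('h') >= 2}   # two or more heads
B = {w for w in omega if w[0] == 'h'}         # first toss lands heads

P = lambda E: len(E) / len(omega)             # uniform measure: P(E) = |E| / |Omega|
cond = lambda E, F: P(E & F) / P(F)           # P(E|F) = P(E and F) / P(F)

Bc = set(omega) - B
print(P(A))          # 0.5
print(cond(A, B))    # 0.75
print(cond(A, Bc))   # 0.25
print(cond(A, B) * P(B) + cond(A, Bc) * P(Bc))  # 0.5 = P(A), law of total probability
```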
Conditional Expectations
Definition 24
Let (Ω, A, P) be a probability space, let X denote a random variable
defined on Ω, and B ∈ A an event with P(B) > 0. Then
E[X|B] := E[X·1_B]/P(B) = ∫ x dP(x|B)
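For the coin-toss space from the previous example, E[X|B] = E[X·1_B]/P(B) can be evaluated directly. A Python sketch (ours):

```python
from itertools import product

omega = [''.join(t) for t in product('ht', repeat=3)]   # uniform: P({w}) = 1/8

X = lambda w: w.count('h')             # random variable: number of heads
B = {w for w in omega if w[0] == 'h'}  # event: first toss lands heads

# E[X|B] = E[X * 1_B] / P(B), where 1_B is the indicator of B
E_X_1B = sum(X(w) for w in B) / len(omega)
P_B = len(B) / len(omega)
print(E_X_1B / P_B)   # 2.0: one guaranteed head plus one expected head from two fair tosses
```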
Bayes’ Rule
Theorem 25
Let (Ω, A, P) be a probability space, let A and B be random events with P(A) > 0; then
P(B|A) = P(A|B)P(B)/P(A)
Proof.
Trivial exercise.
Bayes – Example
Suppose you are tested positive for HIV. Should you be worried?
Here’s the math:
Prevalence of HIV in German population: 0.05% (Source:
Robert-Koch-Institut)
If a person carries HIV, the test is positive (T⁺) with probability 99.9% (sensitivity)
If a person does not carry HIV, the test is negative with probability 99.7% (specificity)
Thus, according to Bayes
P(HIV|T⁺) = P(T⁺|HIV)P(HIV)/P(T⁺)
= P(T⁺|HIV)P(HIV) / (P(T⁺|HIV)P(HIV) + P(T⁺|HIVᶜ)P(HIVᶜ))
= (99.9% × 0.05%) / (99.9% × 0.05% + 0.3% × 99.95%)
≈ 14%
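The computation as a Python sketch (illustrative; the function name is ours):

```python
def posterior(prior, sensitivity, specificity):
    """P(HIV | T+) via Bayes' rule, expanding P(T+) by the law of total probability."""
    p_pos_given_hiv = sensitivity
    p_pos_given_not = 1.0 - specificity          # false-positive rate
    p_pos = p_pos_given_hiv * prior + p_pos_given_not * (1.0 - prior)
    return p_pos_given_hiv * prior / p_pos

p = posterior(prior=0.0005, sensitivity=0.999, specificity=0.997)
print(round(p, 3))   # 0.143: roughly a 14% chance of infection despite the positive test
```

The low prevalence dominates: most positives come from the large uninfected group.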
Bayes and Machine Learning
In machine learning, we often have the following interpretation of Bayes’
rule and nomenclature:
Let D denote the data and θ denote the parameters of our model.
Then according to Bayes we have
P(θ|D) = P(D|θ)P(θ)/P(D),
i.e. posterior = likelihood × prior / evidence.
Convergence
Definition 26
Let Xn be a sequence of random variables. Then
1. Xn converges to X almost surely (notation Xn →a.s. X) iff P(lim_{n→∞} |Xn − X| = 0) = 1.
2. Xn converges to X in probability (notation Xn →i.p. X) iff ∀ε > 0: lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
3. Xn converges to X in distribution (notation Xn →d X) iff lim_{n→∞} Fn(x) = F(x) for all x where F(x) is continuous.
4. Xn converges to X in r-th mean (notation Xn →Lr X) iff lim_{n→∞} E(|Xn − X|^r) = 0.
Convergence Relations
Xn →a.s. X ⇒ Xn →i.p. X ⇒ Xn →d X
Xn →Lr X ⇒ Xn →i.p. X
Counterexample
None of the reverse implications holds in general.
Proof.
Homework
Law of Large Numbers
Theorem 29
Strong Law of Large Numbers: Let the Xn be i.i.d. with E[|Xi|] < ∞ and µ := E[Xi]. Then the sample average X̄n := (1/n) ∑_{i=1}^n Xi converges almost surely to µ, i.e. X̄n →a.s. µ
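A quick simulation illustrating the statement (Python sketch, ours; Bernoulli(0.3) increments are an arbitrary choice):

```python
import random

def sample_average(n, seed=0):
    """Average of n i.i.d. Bernoulli(0.3) draws; by the SLLN it tends to mu = 0.3."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.3 for _ in range(n)) / n

for n in (10, 1000, 100_000):
    print(n, sample_average(n))   # the averages approach mu = 0.3 as n grows
```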
Central Limit Theorem
Let Xn denote a sequence of independent, identically distributed random
variables on a probability space (Ω, A, P). We already know that the
sample average converges under certain assumptions to the mean. But
how fast?
It turns out that we cannot just describe how fast, but we can even
calculate the limit distribution of the sample average:
Theorem 30
Central Limit Theorem: Let the Xn be i.i.d. with mean µ and finite variance σ², and let Sn := ∑_{i=1}^n (Xi − µ). Then E[Sn] = 0 and
Sn/(√n σ) →d N(0, 1).
The key point of the theorem is that the limiting distribution of the standardized sum is normal no matter the distribution of the Xn (universality property of the normal distribution).
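A Python sketch (ours) illustrating the theorem for uniform(0,1) increments, checking the mean, variance, and two-sided 95% interval of the standardized sums:

```python
import math
import random

def standardized_sum(n, rng):
    """S_n / (sqrt(n) * sigma) for n i.i.d. uniform(0,1) draws
    (here mu = 1/2 and sigma^2 = 1/12)."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s = sum(rng.random() - mu for _ in range(n))
    return s / (math.sqrt(n) * sigma)

rng = random.Random(1)
vals = [standardized_sum(30, rng) for _ in range(20_000)]

mean = sum(vals) / len(vals)
var = sum(v * v for v in vals) / len(vals) - mean ** 2
inside = sum(abs(v) <= 1.96 for v in vals) / len(vals)

print(mean)    # close to 0
print(var)     # close to 1
print(inside)  # close to 0.95, the N(0,1) two-sided 95% probability
```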
Speed of Convergence
In the context of the Central Limit Theorem we have for arbitrary λ ∈ R⁺:
P(|Sn − nµ|/(√n σ) ≤ λ) → (1/√(2π)) ∫_{−λ}^{λ} e^{−t²/2} dt
Often we are interested in the following quantity:
P(|Sn − nµ|/(√n σ) > λ)
Theorem 31 (Chebyshev's inequality)
For any random variable X with finite expectation µ and finite non-zero variance σ², and for any real number k > 0, we have
P(|X − µ| ≥ kσ) ≤ 1/k².
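Chebyshev's bound can be checked against a simulated sample; a Python sketch (ours, using the Exp(1) distribution, for which µ = σ = 1):

```python
import random

rng = random.Random(7)
sample = [rng.expovariate(1.0) for _ in range(200_000)]   # Exp(1): mu = 1, sigma = 1
n = len(sample)

for k in (2.0, 3.0):
    freq = sum(abs(x - 1.0) >= k for x in sample) / n     # empirical P(|X - mu| >= k*sigma)
    bound = 1.0 / k ** 2                                  # Chebyshev's bound
    print(k, freq, "<=", bound)
```

The bound holds with a lot of slack here: Chebyshev is distribution-free, so it must cover far heavier tails than the exponential.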
Concentration Inequalities
Fortunately, there are so-called concentration of measure inequalities providing exponential-type bounds for quantities as above. This allows us to quantify how much the sum Sn deviates from its expectation (keep this in mind!).
For instance, the famous Hoeffding inequality states:
Theorem 32
Let the Xn be bounded in an interval; then
P(|Sn − nµ|/(√n σ) ≥ λ) ≤ 2 exp(−cλ²)
for a constant c > 0 depending on the length of the interval.
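A Python sketch (ours) comparing empirical tail frequencies with a Hoeffding-type bound for uniform(0,1) increments; here c = 1/6, which follows from the standard Hoeffding bound 2 exp(−2t²/n) for variables in [0, 1] with t = λ√n σ:

```python
import math
import random

rng = random.Random(3)
n, trials = 100, 20_000
sigma = math.sqrt(1.0 / 12.0)   # standard deviation of uniform(0,1)

devs = []
for _ in range(trials):
    s = sum(rng.random() for _ in range(n))
    devs.append(abs(s - n * 0.5) / (math.sqrt(n) * sigma))

for lam in (1.0, 2.0, 3.0):
    freq = sum(d >= lam for d in devs) / trials
    # Hoeffding for X_i in [0, 1]: P(|S_n - n*mu| >= t) <= 2 exp(-2 t^2 / n);
    # substituting t = lam * sqrt(n) * sigma gives the bound 2 exp(-lam^2 / 6)
    bound = 2.0 * math.exp(-lam ** 2 / 6.0)
    print(lam, freq, "<=", bound)
```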
Example
[Two panels: the running sample mean against the sample size m, and a histogram (frequency) of standardized sums.]
Figure: Law of Large Numbers and Central Limit for Binomial Distributions
Why Probability?
References I
[Par60] E. Parzen, Modern Probability Theory and its Applications. John Wiley
and Sons, Inc., New York and London, 1960.