Basic Definitions
Bayes’ Rule
Concentration Inequalities
Bibliography
2/36 Jörg Schäfer | Learning From Data | c b n a -1: Probability Theory: Recap and Prerequisites
Probability Theory
σ-Algebras
Definition 1
A system A of subsets of a set Ω is called a σ-algebra iff
1. Ω ∈ A
2. A ∈ A ⇒ Aᶜ ∈ A
3. Aᵢ ∈ A for all i ∈ N ⇒ ⋃_{i=1}^∞ Aᵢ ∈ A
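For a finite Ω the axioms can be checked mechanically: countable unions reduce to finite ones, so closure under complement and pairwise union suffices. A small illustrative Python sketch (not part of the lecture; the helper name `is_sigma_algebra` is ours):

```python
from itertools import chain, combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a finite family of subsets of omega.

    For finite omega, countable unions reduce to finite unions, so it is
    enough to check closure under complement and pairwise union.
    """
    fam = {frozenset(a) for a in family}
    if frozenset(omega) not in fam:                 # axiom 1: Omega in A
        return False
    for a in fam:                                   # axiom 2: closed under complement
        if frozenset(omega - a) not in fam:
            return False
    for a in fam:                                   # axiom 3 (finite case): unions stay in A
        for b in fam:
            if frozenset(a | b) not in fam:
                return False
    return True

omega = {1, 2, 3, 4}
power_set = [set(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

print(is_sigma_algebra(omega, power_set))                       # True
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))  # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))             # False: {2,3,4} is missing
```

The second family shows that σ-algebras strictly between {∅, Ω} and the power set exist.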
Generating σ-Algebras
Definition 2
Let C denote an arbitrary collection of subsets of Ω. Then we denote by A(C) the smallest σ-algebra containing C.
Theorem 3
A(C) as defined above is well-defined.
Measures
Definition 4
Let A denote a σ-algebra. A function µ : A → R ∪ {+∞} with
1. µ(∅) = 0
2. µ ≥ 0
3. if the Aᵢ ∈ A are pairwise disjoint, then
µ(⋃_{i=1}^∞ Aᵢ) = ∑_{i=1}^∞ µ(Aᵢ)
is called a measure on A.
Probability Spaces
Definition 5
A set Ω with a σ-algebra A including Ω and a measure P defined on A
such that P(Ω) = 1 is called a probability space with a probability
measure P. We usually use the notation (Ω, A, P).
Lemma 6
For any probability space, the following holds true:
P(∅) = 0
P(Aᶜ) = 1 − P(A)
P(A \ B) = P(A) − P(A ∩ B)
P(⋃ᵢ Aᵢ) ≤ ∑ᵢ P(Aᵢ).
Probability Space Examples
1. Any finite set Ω with the set of all subsets as a σ-algebra can be made into a probability space by choosing an arbitrary function g : Ω → R⁺ such that ∑_{x∈Ω} g(x) = 1 and specifying a probability measure as
P(A) := ∑_{x∈A} g(x)
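A minimal Python sketch of this construction (illustrative; the weight function g below is an arbitrary choice of ours, with dyadic weights so floating-point sums are exact):

```python
def make_prob_measure(weights):
    """Build P(A) = sum of g(x) over x in A from non-negative weights g summing to 1."""
    total = sum(weights.values())
    assert abs(total - 1.0) < 1e-9, "g must sum to 1"
    def P(A):
        return sum(weights[x] for x in A)
    return P

# a loaded four-sided die: g sums to 1 exactly (dyadic weights)
g = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
P = make_prob_measure(g)

print(P({1, 2, 3, 4}))  # 1.0, i.e. P(Omega) = 1
print(P({3, 4}))        # 0.25
print(P(set()))         # 0, i.e. P of the empty set is 0
```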
Random Variables
Definition 7
Let Ω be a measurable space, i.e. a space equipped with a σ-algebra A on Ω. Let B denote a measurable space with σ-algebra B. Then a map f : Ω → B is called measurable iff
∀B ∈ B : f⁻¹(B) ∈ A.
Definition 8
Let X be a measurable map from a probability space (Ω, A, P) to a
measurable space B. Then X is called a random variable.
Special cases:
1. Real-valued random variables with B = R (equipped with the usual σ-algebra of Lebesgue measurable sets).
2. Discrete random variables with B = M ⊆ Z
Push-forward Probability Measures
A random variable X provides a way to push forward a probability measure, by defining PX(B) := P(X⁻¹(B)) for all B ∈ B. The resulting probability measure PX on B is called the distribution (or law) of X.
Stochastic Processes
Definition 9
Let I be an arbitrary index set. Then any quadruple (Ω, A, P, (Xt)t∈I) is called a stochastic process, where (Ω, A, P) is a probability space and (Xt)t∈I is a collection of random variables with values in a common measure space E. For fixed ω ∈ Ω, the map I → E defined by t ↦ Xt(ω) is called a path of the process.
Common examples of index sets are the real line R or [0, 1].
(We ignore the technical details of how to construct such spaces and the associated σ-algebras, proper filtrations, etc.)
Brownian Motion
Definition 10
Let (Ω, A, P, (Xt)t∈R⁺) be a stochastic process with X₀ = 0, independent increments, continuous paths, and such that Xt − Xs has distribution N(0, t − s) for s ≤ t. Then we call Xt Brownian Motion or Wiener Process.
[Figure: a simulated sample path of a Wiener process.]
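A path of the process can be simulated by summing independent Gaussian increments, since Xt − Xs ∼ N(0, t − s). A short Python sketch (ours; parameter choices are arbitrary), producing data for a figure like the one above:

```python
import math
import random

def brownian_path(T=1.0, n=1000, seed=0):
    """Simulate a Wiener process on [0, T] by summing n independent
    N(0, dt) increments; X_t - X_s ~ N(0, t - s) holds by construction."""
    rng = random.Random(seed)
    dt = T / n
    path = [0.0]                      # X_0 = 0
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

path = brownian_path()
print(len(path))   # 1001 points on the grid 0, dt, 2*dt, ..., T
print(path[0])     # 0.0, the process starts at the origin
```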
Distribution Functions
Definition 11
Let X be a real-valued random variable on (Ω, A, P). Then
F : R → [0, 1], F (x ) := P(X ≤ x ) is called the distribution function of X .
Lemma 12
Properties:
lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1
F is monotonically increasing
F is continuous from the right
P(a < X ≤ b) = F(b) − F(a)
P(X > a) = 1 − F(a)
P(X = a) = F(a) − lim_{x↓0} F(a − x)
Distribution Functions: Example
Normal Distribution
[Figure: density of the standard normal distribution N(0, 1).]
F(x) := (1/√(2πσ²)) ∫_{−∞}^x e^{−t²/(2σ²)} dt
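Numerically, this distribution function is usually evaluated via the error function, Φ(x) = (1 + erf(x/(σ√2)))/2 for mean zero. An illustrative Python sketch (the function name is ours):

```python
import math

def normal_cdf(x, sigma=1.0):
    """F(x) = (1/sqrt(2*pi*sigma^2)) * integral_{-inf}^{x} exp(-t^2/(2*sigma^2)) dt,
    expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

print(normal_cdf(0.0))             # 0.5 by symmetry
print(round(normal_cdf(1.96), 3))  # 0.975, the familiar two-sided 95% quantile
```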
Expectations
Definition 13
Let X be a real-valued random variable on (Ω, A, P). Then
E[X] := ∫ X(ω) dP(ω), if it exists, is called the Expectation (or Mean) of X.
Lemma 14
E[X] = ∫ x dF_X(x)
Moments
Definition 15
Let X be a real-valued random variable on (Ω, A, P). Then
Var[X] := E[(X − E[X])²], if it exists, is called the Variance of X. In general, we define the n-th central moment as µn[X] := E[(X − E[X])ⁿ], wherever the integrals exist.
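Empirical central moments can be estimated from a sample. A Python sketch (ours; the sample parameters are an arbitrary choice):

```python
import random

def central_moment(sample, n):
    """Empirical n-th central moment: mean of (x - sample mean)^n."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** n for x in sample) / len(sample)

rng = random.Random(42)
sample = [rng.gauss(0.0, 2.0) for _ in range(100_000)]  # N(0, sigma=2)

print(central_moment(sample, 2))  # close to Var = sigma^2 = 4
print(central_moment(sample, 3))  # close to 0, since the normal is symmetric
```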
Independence of Events
Definition 16
Two events A and B are called independent (often written as A ⊥ B) iff
their joint probability equals the product of their probabilities:
P(A ∩ B) = P(A)P(B)
Independence of Random Variables
Definition 17
Two random variables X and Y are independent iff the elements of the σ-algebras generated by them are independent; i.e. for every a and b, the events {X ≤ a} and {Y ≤ b} are independent (in the sense defined above).
Lemma 18
Let X and Y be independent. Then the (joint) distribution function of
X , Y factors as follows:
FX ,Y (x , y ) = FX (x )FY (y )
Covariance and Correlations
Covariance is a measure of the joint variability of two random variables:
Definition 19
Let X and Y be real-valued random variables on (Ω, A, P). Then
Cov[X, Y] := E[(X − E[X])(Y − E[Y])], if it exists, is called the covariance.
Definition 20
Let X and Y be real-valued random variables on (Ω, A, P). Then
ρ(X, Y) := Cov(X, Y)/√(Var(X) Var(Y)) is called the correlation.
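Sample covariance and correlation, sketched in Python (illustrative; the linear model Y = X + 0.5 Z below is an arbitrary choice of ours, for which Cov(X, Y) = Var(X) = 1):

```python
import math
import random

def cov(xs, ys):
    """Empirical covariance of two equally long samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    """Empirical correlation: cov normalized by the standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50_000)]
zs = [rng.gauss(0, 1) for _ in range(50_000)]
ys = [x + 0.5 * z for x, z in zip(xs, zs)]   # Y = X + 0.5 Z

print(cov(xs, ys))    # close to 1 = Var(X)
print(corr(xs, ys))   # close to 1/sqrt(1.25), about 0.89
print(corr(xs, zs))   # close to 0: X and Z are independent
```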
Conditional Probabilities
Definition 22
Let (Ω, A, P) be a probability space and A, B ∈ A with P(B) > 0. Then
the conditional probability of A with respect to B (or given B),
denoted by P(A|B) is defined as
P(A|B) := P(A ∩ B)/P(B).
Lemma 23
For any A and B with 0 < P(B) < 1 we have P(A) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).
Conditional Probabilities – Example
Three tosses of a fair coin:
Let Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt} with the uniform probability measure
2 or more heads in 3 tosses: A = {hhh, hht, hth, thh} ⇒ P(A) = 1/2
What is the probability of A, given that the first toss lands head
(event B)?
By counting, we conclude P(A|B) = 3/4
Indeed, using the definition of conditional probability we also get
P(A ∩ B) = 3/8, P(B) = 1/2, thus P(A|B) = P(A ∩ B)/P(B) = 3/4
Similarly, what is the probability of A, given that the first toss lands
tail?
P(A|B c ) = 1/4
Indeed, P(A) = P(A|B)P(B) + P(A|B c )P(B c ) =
3/4 × 1/2 + 1/4 × 1/2 = 1/2.
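The counting argument above can be reproduced by enumerating Ω. An illustrative Python sketch:

```python
from itertools import product

omega = [''.join(t) for t in product('ht', repeat=3)]   # 8 equally likely outcomes

A = {w for w in omega if w.count('h') >= 2}   # two or more heads
B = {w for w in omega if w[0] == 'h'}         # first toss lands heads

P = lambda E: len(E) / len(omega)             # uniform measure: P(E) = |E| / |Omega|
cond = lambda E, F: P(E & F) / P(F)           # P(E|F) = P(E and F) / P(F)

Bc = set(omega) - B
print(P(A))          # 0.5
print(cond(A, B))    # 0.75
print(cond(A, Bc))   # 0.25
print(cond(A, B) * P(B) + cond(A, Bc) * P(Bc))  # 0.5 = P(A), law of total probability
```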
Conditional Expectations
Definition 24
Let (Ω, A, P) be a probability space, let X denote a random variable
defined on Ω, and B ∈ A an event with P(B) > 0. Then
E[X|B] := E[X·1_B]/P(B) = ∫ x dP(x|B)
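For the coin-toss space from the previous example, E[X|B] = E[X·1_B]/P(B) can be evaluated directly. A Python sketch (ours):

```python
from itertools import product

omega = [''.join(t) for t in product('ht', repeat=3)]   # uniform: P({w}) = 1/8

X = lambda w: w.count('h')             # random variable: number of heads
B = {w for w in omega if w[0] == 'h'}  # event: first toss lands heads

# E[X|B] = E[X * 1_B] / P(B), where 1_B is the indicator of B
E_X_1B = sum(X(w) for w in B) / len(omega)
P_B = len(B) / len(omega)
print(E_X_1B / P_B)   # 2.0: one guaranteed head plus one expected head from two fair tosses
```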
Bayes’ Rule
Theorem 25
Let (Ω, A, P) be a probability space, let A and B be random events with P(A) > 0; then
P(B|A) = P(A|B)P(B)/P(A)
Proof.
Trivial exercise.
Bayes – Example
Suppose you are tested positive for HIV. Should you be worried?
Here’s the math:
Prevalence of HIV in German population: 0.05% (Source:
Robert-Koch-Institut)
If a person carries HIV, the test is positive (T⁺) with probability 99.9% (sensitivity)
If a person does not carry HIV, the test is negative with probability 99.7% (specificity)
Thus, according to Bayes
P(HIV|T⁺) = P(T⁺|HIV)P(HIV)/P(T⁺)
= P(T⁺|HIV)P(HIV) / (P(T⁺|HIV)P(HIV) + P(T⁺|HIVᶜ)P(HIVᶜ))
= (99.9% × 0.05%) / (99.9% × 0.05% + 0.3% × 99.95%)
≈ 14%
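The computation as a Python sketch (illustrative; the function name is ours):

```python
def posterior(prior, sensitivity, specificity):
    """P(HIV | T+) via Bayes' rule, expanding P(T+) by the law of total probability."""
    p_pos_given_hiv = sensitivity
    p_pos_given_not = 1.0 - specificity          # false-positive rate
    p_pos = p_pos_given_hiv * prior + p_pos_given_not * (1.0 - prior)
    return p_pos_given_hiv * prior / p_pos

p = posterior(prior=0.0005, sensitivity=0.999, specificity=0.997)
print(round(p, 3))   # 0.143: roughly a 14% chance of infection despite the positive test
```

The low prevalence dominates: most positives come from the large uninfected group.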
Bayes and Machine Learning
In machine learning, we often have the following interpretation of Bayes’
rule and nomenclature:
Let D denote the data and θ denote the parameters of our model.
Then according to Bayes we have
P(θ|D) = P(D|θ)P(θ)/P(D),
i.e. posterior = likelihood × prior / evidence.
Convergence
Definition 26
Let Xn be a sequence of random variables. Then
1. Xn converges to X almost surely (notation Xn →a.s. X) iff P(lim_{n→∞} |Xn − X| = 0) = 1.
2. Xn converges to X in probability (notation Xn →i.p. X) iff ∀ε > 0: lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
3. Xn converges to X in distribution (notation Xn →d X) iff lim_{n→∞} Fn(x) = F(x) for all x where F(x) is continuous.
4. Xn converges to X in r-th mean (notation Xn →Lr X) iff lim_{n→∞} E(|Xn − X|^r) = 0.
Convergence Relations
Xn →a.s. X ⇒ Xn →i.p. X ⇒ Xn →d X
Xn →Lr X ⇒ Xn →i.p. X
Counterexample
None of the reverse implications holds in general.
Proof.
Homework
Law of Large Numbers
Theorem 29
Strong Law of Large Numbers: Let the Xn be i.i.d. with E[|Xi|] < ∞ and µ := E[Xi]. Then the sample average X̄n := (1/n) ∑_{i=1}^n Xi converges almost surely to µ, i.e. X̄n →a.s. µ
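A quick simulation illustrating the statement (Python sketch, ours; Bernoulli(0.3) increments are an arbitrary choice):

```python
import random

def sample_average(n, seed=0):
    """Average of n i.i.d. Bernoulli(0.3) draws; by the SLLN it tends to mu = 0.3."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.3 for _ in range(n)) / n

for n in (10, 1000, 100_000):
    print(n, sample_average(n))   # the averages approach mu = 0.3 as n grows
```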
Central Limit Theorem
Let Xn denote a sequence of independent, identically distributed random
variables on a probability space (Ω, A, P). We already know that the
sample average converges under certain assumptions to the mean. But
how fast?
It turns out that we cannot just describe how fast, but we can even
calculate the limit distribution of the sample average:
Theorem 30
Central Limit Theorem: Let the Xn be i.i.d. with mean µ and finite variance σ², and let Sn := ∑_{i=1}^n (Xi − µ). Then E[Sn] = 0 and
Sn/(√n σ) →d N(0, 1).
The key point of the theorem is that the limiting distribution of the standardized sum is normal no matter the distribution of the Xn (universality property of the normal distribution).
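A Python sketch (ours) illustrating the theorem for uniform(0,1) increments, checking the mean, variance, and two-sided 95% interval of the standardized sums:

```python
import math
import random

def standardized_sum(n, rng):
    """S_n / (sqrt(n) * sigma) for n i.i.d. uniform(0,1) draws
    (here mu = 1/2 and sigma^2 = 1/12)."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s = sum(rng.random() - mu for _ in range(n))
    return s / (math.sqrt(n) * sigma)

rng = random.Random(1)
vals = [standardized_sum(30, rng) for _ in range(20_000)]

mean = sum(vals) / len(vals)
var = sum(v * v for v in vals) / len(vals) - mean ** 2
inside = sum(abs(v) <= 1.96 for v in vals) / len(vals)

print(mean)    # close to 0
print(var)     # close to 1
print(inside)  # close to 0.95, the N(0,1) two-sided 95% probability
```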
Speed of Convergence
In the context of the Central Limit Theorem we have for arbitrary λ ∈ R⁺:
P(|Sn − nµ|/(√n σ) ≤ λ) → (1/√(2π)) ∫_{−λ}^{λ} e^{−t²/2} dt
Often we are interested in the following quantity:
P(|Sn − nµ|/(√n σ) > λ)
Theorem 31 (Chebyshev's inequality)
For any random variable X with finite expectation µ and finite non-zero variance σ², and for any real number k > 0, we have
P(|X − µ| ≥ kσ) ≤ 1/k².
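Chebyshev's bound can be checked against a simulated sample; a Python sketch (ours, using the Exp(1) distribution, for which µ = σ = 1):

```python
import random

rng = random.Random(7)
sample = [rng.expovariate(1.0) for _ in range(200_000)]   # Exp(1): mu = 1, sigma = 1
n = len(sample)

for k in (2.0, 3.0):
    freq = sum(abs(x - 1.0) >= k for x in sample) / n     # empirical P(|X - mu| >= k*sigma)
    bound = 1.0 / k ** 2                                  # Chebyshev's bound
    print(k, freq, "<=", bound)
```

The bound holds with a lot of slack here: Chebyshev is distribution-free, so it must cover far heavier tails than the exponential.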
Concentration Inequalities
Fortunately, there are so-called concentration of measure inequalities providing exponential-type bounds for quantities as above. This allows us to quantify how much the sum Sn deviates from its expectation (keep this in mind!).
For instance, the famous Hoeffding inequality states:
Theorem 32
Let the Xn be bounded in an interval; then
P(|Sn − nµ|/(√n σ) ≥ λ) ≤ 2 exp(−cλ²)
for a constant c > 0 depending on the length of the interval.
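A Python sketch (ours) comparing empirical tail frequencies with a Hoeffding-type bound for uniform(0,1) increments; here c = 1/6, which follows from the standard Hoeffding bound 2 exp(−2t²/n) for variables in [0, 1] with t = λ√n σ:

```python
import math
import random

rng = random.Random(3)
n, trials = 100, 20_000
sigma = math.sqrt(1.0 / 12.0)   # standard deviation of uniform(0,1)

devs = []
for _ in range(trials):
    s = sum(rng.random() for _ in range(n))
    devs.append(abs(s - n * 0.5) / (math.sqrt(n) * sigma))

for lam in (1.0, 2.0, 3.0):
    freq = sum(d >= lam for d in devs) / trials
    # Hoeffding for X_i in [0, 1]: P(|S_n - n*mu| >= t) <= 2 exp(-2 t^2 / n);
    # substituting t = lam * sqrt(n) * sigma gives the bound 2 exp(-lam^2 / 6)
    bound = 2.0 * math.exp(-lam ** 2 / 6.0)
    print(lam, freq, "<=", bound)
```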
Example
[Two panels: the running sample mean against the sample size m, and a histogram (frequency) of standardized sums.]
Figure: Law of Large Numbers and Central Limit for Binomial Distributions
Why Probability?
References I
[Par60] E. Parzen, Modern Probability Theory and its Applications. John Wiley
and Sons, Inc., New York and London, 1960.