
Uncertainty Modelling and Risk Analysis

Xavier Romão
Assistant Professor
Civil Engineering Department – FEUP

March 2022
Expressing uncertainty

Quantitative methods
The usual approach in traditional science fields where sufficient hard data is available for numerical treatment:
- descriptive statistics
- probabilistic models

Review of probability theory and related basic concepts

3
Review of probability theory and related basic concepts

• Set Theory
• Sample Space and Probability
• Axioms of Probability
• Conditional Probability
• Total Probability Theorem
• Bayes Theorem
• Independence
• Discrete & Continuous Distributions of Random Variables
• Moments and other Descriptors of Random Variables
• Common Probability Distribution Models

4
Review of probability theory and related basic concepts

• Return Period
• Confidence Intervals
• Building Probabilistic Models (distribution fitting and parameter
estimation)

5
Set theory
Review of probability theory and related basic concepts

Set Theory: some concepts

A set is a collection of well-defined objects


- Example 1: a group of large numbers → that’s not well defined
- Example 2: a group of numbers larger than 50 → that’s well defined

The objects of a set are called elements

Usually, sets are denoted by upper-case letters and elements by lower-case letters:
x ∈ A → x is an element of set A
x ∉ A → x is not an element of set A
7
Review of probability theory and related basic concepts

Set Theory: some concepts

A set with no elements is a Null Set or an Empty Set. This set is represented by ∅.

A finite set is a set with a finite number of elements.

An infinite set is a set with an infinite number of elements:

- An infinite set is said to be countable if its elements can be counted, and uncountable otherwise (the set of points in a line segment is uncountable)

If all the elements of set A are also elements of set B, then A is a subset of B, which is represented by

A ⊂ B
Review of probability theory and related basic concepts

Set Theory: some concepts

If A ⊂ B and B ⊂ A, this means that A = B


The set containing all the elements of all the sets being considered is
the space and it is denoted by S

If A is a set in S, the set of the elements that are in S but are not in A is the complement of A (also called "not A") and it is denoted by Ā.

Other relations involving the complement of sets:

$\bar{S} = \emptyset$    $\bar{\emptyset} = S$    $\bar{\bar{A}} = A$
Review of probability theory and related basic concepts

Set Theory: some concepts

The union of sets A and B represents all the elements that belong to A or B or both, and is represented by A ∪ B.
The intersection of sets A and B represents all the elements that belong to both A and B, and is represented by A ∩ B.

(Venn diagrams of A ∪ B and A ∩ B.)
Review of probability theory and related basic concepts

Set Theory: some concepts

Two sets A and B are said to be exhaustive of another set C if A ∪ B = C

Two sets A and B are called disjoint or mutually exclusive if A ∩ B = ∅

(Venn diagrams: exhaustive sets with A ∪ B = C; disjoint sets with A ∩ B = ∅.)
Review of probability theory and related basic concepts

Set Theory: some concepts


The union and intersection of sets can be generalised to n sets:

$A_1 \cup \ldots \cup A_n = \bigcup_{i=1}^{n} A_i$    (all the elements belonging to one or more of the sets $A_i$)

$A_1 \cap \ldots \cap A_n = \bigcap_{i=1}^{n} A_i$    (all the elements common to all the sets $A_i$)

The difference of sets A − B contains the elements of A that do not belong to B → $A - B = A \cap \bar{B}$
Review of probability theory and related basic concepts

Set Theory: some concepts

Other relations: the distributive laws $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$ and $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$

De Morgan's laws:

$\overline{A \cup B} = \bar{A} \cap \bar{B}$
$\overline{A \cap B} = \bar{A} \cup \bar{B}$

$\overline{\bigcup_{i=1}^{n} A_i} = \bigcap_{i=1}^{n} \bar{A}_i$    $\overline{\bigcap_{i=1}^{n} A_i} = \bigcup_{i=1}^{n} \bar{A}_i$
Review of probability theory and related basic concepts

Set Theory: some examples


A ∪ ∅ = ? → A
A ∪ Ā = ? → S
A ∩ ∅ = ? → ∅
A − Ā = ? → A
A ∩ Ā = ? → ∅
A ∪ S = ? → S
A ∩ S = ? → A
if A ⊂ B then A ∩ B = ? → A
if A ⊂ B then A ∪ B = ? → B
Sample Space and Probability
Review of probability theory and related basic concepts

Sample Space and Probability: some concepts

Random variable - a numerical quantity that is the result of an experiment and


takes on different values depending on chance
- Continuous random variables can take any value in a given range (data that
you measure)
- Discrete random variables can take only discrete values (data that you count)
Sample Point or Realization - an outcome of the random variable
Sample – a set of outcomes for a random variable
Population or Sample Space - the set of all possible outcomes for a random variable
Event - a statement about the possible outcome(s) of a random variable

16
Review of probability theory and related basic concepts

Sample Space and Probability: some concepts

Probability - there are several definitions


The classical or Laplace definition
The ratio between the number of outcomes favourable to the event and the total number of possible outcomes of the process (i.e. the population), assuming all outcomes are equally probable
The frequentist definition
The relative frequency of an event (i.e. a favourable outcome) after a large number of repetitions of the process, which may not cover the whole population

Sample Probability - the same as Probability but relative to the sample instead of
the population

17
Review of probability theory and related basic concepts

Sample Space and Probability: some concepts

Example: rolling of a die

Random variable - the numerical value of the die


Population or Sample Space - values 1 to 6
Sample Point or Realization - the value of the die when it is rolled once
Sample - the value of the die after ten rolls: 3, 5, 4, 3, 2, 6, 3, 5, 2, 3… for example
Event - the value of the die is 3
Probability - the (classical) probability of the outcome 3 is 1/6 ≈ 17%
Sample Probability - the relative frequency of the outcome 3 is 4/10 ≈ 40% … this
means the sample is not representative enough (the size is too small)!!!
18
Review of probability theory and related basic concepts

Sample Space and Probability: correspondence with set theory

Set theory → Probability theory

space S → sample space, certain event
empty set ∅ → impossible or null event
elements → sample points
sets → events
A → event A occurs
Ā → event A does not occur
A ∪ B → at least one of A and B occurs
A ∩ B → both A and B occur
A ⊂ B → A is a subevent of B, i.e. the occurrence of A implies that of B
A ∩ B = ∅ → A and B are mutually exclusive
19
Axioms of Probability
Review of probability theory and related basic concepts

Axioms of Probability
The probability function associated to the occurrence of event A, P(A),
is a number assigned to this event that represents its likelihood and is
called the probability of A.

The probability has the following properties (axioms):

1. $P(A) \ge 0$ (the probability is non-negative)

2. $P(S) = 1$

3. For a group of mutually exclusive events $A_1, A_2, \ldots$

$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$
Review of probability theory and related basic concepts

Axioms of Probability: some consequences

$P(\emptyset) = 0$    $P(\bar{A}) = 1 - P(A)$ (for any event A)

$0 \le P(A) \le 1$ (for any event A)    If $A \subseteq B$ then $P(A) \le P(B)$

For any events A and B:

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

For any events A, B and C:

$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$

For any number n of events $A_i$ (inclusion-exclusion):

$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{i=1}^{n}\sum_{j>i} P(A_i \cap A_j) + \sum_{i=1}^{n}\sum_{j>i}\sum_{k>j} P(A_i \cap A_j \cap A_k) - \ldots + (-1)^{n+1} P\left(\bigcap_{i=1}^{n} A_i\right)$
Review of probability theory and related basic concepts

Axioms of Probability: some consequences

For any disjoint (or mutually exclusive) events A and B:

Since $A \cap B = \emptyset$ (so $P(A \cap B) = 0$):    $P(A \cup B) = P(A) + P(B)$

For any disjoint (or mutually exclusive) events A, B and C:

Since $A \cap B = A \cap C = B \cap C = A \cap B \cap C = \emptyset$:    $P(A \cup B \cup C) = P(A) + P(B) + P(C)$

For any number n of disjoint (or mutually exclusive) events $A_i$:

$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$
Review of probability theory and related basic concepts

Axioms of Probability: some consequences

Example: television sports viewing habits of people

A survey of people's television viewing habits of car racing (A), golf (B) and football (C) revealed that:
- 28% watched A, 29% watched C, 19% watched B
- 14% watched A and C
- 12% watched C and B
- 10% watched A and B
- and 8% watched all three sports

What percentage of the people watched none of the three sports?

The event representing none of the three sports being watched is $\overline{A \cup B \cup C}$
Review of probability theory and related basic concepts

Axioms of Probability: some consequences

Example: television sports viewing habits of people

$P\left(\overline{A \cup B \cup C}\right) = 1 - P(A \cup B \cup C)$

(Venn diagram with the percentages of each region.)

$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$

$P(A \cup B \cup C) = 0.28 + 0.19 + 0.29 - 0.10 - 0.14 - 0.12 + 0.08 = 0.48$

$P\left(\overline{A \cup B \cup C}\right) = 1 - 0.48 = 0.52$
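A minimal sketch (plain Python, not part of the original slides) that reproduces the inclusion-exclusion calculation of this example; the variable names are just illustrative:

```python
# Inclusion-exclusion for the sports-viewing example
pA, pB, pC = 0.28, 0.19, 0.29          # car racing, golf, football
pAB, pAC, pBC = 0.10, 0.14, 0.12       # pairwise intersections
pABC = 0.08                            # all three sports

# P(A U B U C) by inclusion-exclusion
p_union = pA + pB + pC - pAB - pAC - pBC + pABC
p_none = 1.0 - p_union                 # complement: none of the three sports

print(round(p_union, 2))  # 0.48
print(round(p_none, 2))   # 0.52
```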
Conditional Probability
Review of probability theory and related basic concepts

Conditional Probability
Given two arbitrary events A and B, the probability P(A|B) is defined as
the conditional probability of event A given that event B has occurred.

$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}, \quad P(B) \neq 0$

Note that if event B is S:

$P(A \mid S) = \dfrac{P(A \cap S)}{P(S)} = \dfrac{P(A)}{1} \;\Leftrightarrow\; P(A \mid S) = P(A)$, which is obvious.

It helps to see the conditional probability P(A|B) as the probability of A with respect to a reduced sample space defined by the outcomes of event B.
Review of probability theory and related basic concepts

Conditional Probability

Consider a 1×1 square (sample space S) and events A and B. The areas of squares A and B are P(A) = 0.25 and P(B) = 0.375.

We can see that P(A ∩ B) = 0.25/4 = 0.0625

$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} = \dfrac{0.0625}{0.375} = \dfrac{1}{6} \approx 0.17$

which means that the area (probability) of events A and B occurring at the same time is about 17% of the area (probability) of event B.
Review of probability theory and related basic concepts

Conditional Probability

Example: picking 2 aces out of a deck of cards

Two cards are drawn in succession without replacement from an ordinary deck of
cards (52 cards). What is the probability that both cards are aces?
A is the event corresponding to the first card being an ace
B is the event corresponding to the second card being an ace
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} \;\Rightarrow\; P(A \cap B) = P(B) \times P(A \mid B)$

$P(A \cap B) = P(B) \times P(A \mid B) = P(A) \times P(B \mid A)$    ← let's focus on this

$P(A \cap B) = P(A) \times P(B \mid A) = \dfrac{4}{52} \times \dfrac{3}{51} = 0.0045 = 0.45\%$
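A minimal sketch (plain Python, not part of the original slides) of the multiplication rule used in this example, drawing two aces without replacement:

```python
# P(A ∩ B) = P(A) * P(B | A) for two aces drawn in succession
p_first_ace = 4 / 52           # P(A): 4 aces among 52 cards
p_second_given_first = 3 / 51  # P(B | A): 3 aces left among 51 cards

p_both_aces = p_first_ace * p_second_given_first
print(round(p_both_aces, 4))   # ≈ 0.0045, i.e. about 0.45%
```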
Review of probability theory and related basic concepts

Conditional Probability: multiplication rules

General Multiplication Rule – For arbitrary events A and B

$P(A \cap B) = P(B) \times P(A \mid B) = P(A) \times P(B \mid A)$

General Multiplication Rule – For arbitrary events A, B and C (treating A ∩ B as a single event D)

$P(A \cap B \cap C) = P(C \mid A \cap B) \times P(A \cap B) = P(C \mid A \cap B) \times P(B \mid A) \times P(A)$

General Multiplication Rule – For n arbitrary events

$P(A_1 \cap A_2 \cap \ldots \cap A_n) = P(A_1 \cap A_2 \cap \ldots \cap A_{n-1}) \times P(A_n \mid A_1 \cap A_2 \cap \ldots \cap A_{n-1})$

which then turns into

$P\left(\bigcap_{i=1}^{n} A_i\right) = \prod_{i=1}^{n} P\left(A_i \,\middle|\, \bigcap_{j=1}^{i-1} A_j\right)$
Total Probability Theorem
Review of probability theory and related basic concepts

Total Probability Theorem

From the General Multiplication Rule we define the Total Probability Theorem → considering n disjoint and collectively exhaustive events A1, …, An whose probabilities sum to 1.0, the probability of an event B is:

$P(B) = P(B \cap A_1) + \ldots + P(B \cap A_n) = P(B \mid A_1) \times P(A_1) + \ldots + P(B \mid A_n) \times P(A_n) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i)$

where $P(A_i)$ is the probability of event $A_i$ and $P(B \mid A_i)$ is the probability of event B given that event $A_i$ occurred.

Example (S partitioned into four equally likely events A1, …, A4, with B overlapping each of them so that $P(B \mid A_i) = 0.25$):

$P(A_1) = P(A_2) = P(A_3) = P(A_4) = 0.25$

$P(B \cap A_1) = P(B \mid A_1) \times P(A_1) = 0.25 \times 0.25 = 0.0625$

$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \times P(A_i) = 4 \times (0.25 \times 0.25) = 0.25$
Bayes Theorem
Review of probability theory and related basic concepts

Bayes Theorem

General Multiplication Rule – For arbitrary events A and B

$P(A \cap B) = P(B) \times P(A \mid B) = P(A) \times P(B \mid A)$

Rearranging gives Bayes' Theorem:

$P(A \mid B) = \dfrac{P(A) \times P(B \mid A)}{P(B)}$

where P(A) is the prior probability (before the new data), P(B | A) is the support that B provides for A, and P(A | B) is the posterior probability (after getting the new data).

Bayes' Theorem reflects how the probability of an event (in this case A) is affected by new data (in this case the occurrence of event B).
Review of probability theory and related basic concepts

Bayes Theorem
Considering a more complex case with n disjoint and collectively exhaustive events A1, …, An:

$P(A_i \mid B) = \dfrac{P(A_i) \times P(B \mid A_i)}{P(B)}$

The probability P(B) can be expanded using the Total Probability Theorem:

$P(A_i \mid B) = \dfrac{P(A_i) \times P(B \mid A_i)}{\sum_{j=1}^{n} P(B \mid A_j) \times P(A_j)}$
Review of probability theory and related basic concepts

Bayes Theorem

Example: rain on a wedding day

Marie is getting married tomorrow at an outdoor ceremony in the desert.


In recent years, it has rained only 5 days each year. Unfortunately, the weatherman
has predicted rain for tomorrow.
When it actually rains, the weatherman correctly forecasts rain 90% of the time.
When it doesn't rain, he incorrectly forecasts rain 10% of the time.
What is the probability that it will rain on the day of Marie's wedding?
The sample space is defined by two mutually-exclusive events - it rains on Marie's
wedding (event A1) or it does not rain on Marie's wedding (event A2). Additionally,
the third event is the weatherman predicting rain (event B).

36
Review of probability theory and related basic concepts

Bayes Theorem

Example: rain on a wedding day

What we know:

$P(A_1) = 5/365 = 0.0137$    (it rains 5 days during the year)

$P(A_2) = 360/365 = 0.9863$    (it does not rain 360 days during the year)

$P(B \mid A_1) = 0.9$    (when it rains, the weatherman predicts rain 90% of the time)

$P(B \mid A_2) = 0.1$    (when it does not rain, the weatherman predicts rain 10% of the time)

We want to know P(A1|B), the probability that it will rain on the day of
Marie's wedding, given a forecast for rain by the weatherman. 37
Review of probability theory and related basic concepts

Bayes Theorem

Example: rain on a wedding day

$P(A_1 \mid B) = \dfrac{P(A_1) \times P(B \mid A_1)}{P(A_1) \times P(B \mid A_1) + P(A_2) \times P(B \mid A_2)}$

$P(A_1 \mid B) = \dfrac{0.0137 \times 0.9}{0.0137 \times 0.9 + 0.9863 \times 0.1} = 0.111$

This result may seem somewhat contradictory!!
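A minimal sketch (plain Python, not part of the original slides) of this Bayes Theorem calculation, with the total probability theorem used in the denominator:

```python
# Posterior probability of rain given a rain forecast
p_rain = 5 / 365                  # P(A1): prior probability of rain
p_dry = 360 / 365                 # P(A2)
p_forecast_given_rain = 0.9       # P(B | A1)
p_forecast_given_dry = 0.1        # P(B | A2)

# Total probability theorem for P(B), then Bayes
p_forecast = p_forecast_given_rain * p_rain + p_forecast_given_dry * p_dry
p_rain_given_forecast = p_forecast_given_rain * p_rain / p_forecast

print(round(p_rain_given_forecast, 3))  # ≈ 0.111
```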
Statistical independent events
Review of probability theory and related basic concepts

Statistical independent events


Two events A and B are independent if the occurrence or non-occurrence
of one of the events does not change the probability of the other event

$P(A \mid B) = P(A)$  and  $P(B \mid A) = P(B)$

Based on the General Multiplication Rule

$P(A \cap B) = P(B) \times P(A \mid B) = P(A) \times P(B \mid A)$

we get

$P(A \cap B) = P(A) \times P(B)$

In a more general form we get

$P(A_1 \cap A_2 \cap \ldots \cap A_n) = \prod_{i=1}^{n} P(A_i)$
Review of probability theory and related basic concepts

Statistical independent events

If two events are mutually exclusive (or disjoint), this does not imply that they are independent events.
If two events are independent, this does not imply that they are mutually exclusive (or disjoint) events.

$P(A \cap B) = P(A) \times P(B)$ for independent events;    $A \cap B = \emptyset$, i.e. $P(A \cap B) = 0$, for disjoint events.

If two events are uncorrelated, this does not imply that they are independent events**.
If two events are independent, this implies that they are uncorrelated events.

(** except in one case we'll see later)
Discrete & Continuous Distributions
of Random Variables
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Random variable (RV) - a numerical quantity that is the result of an experiment and takes on different values depending on chance
- Continuous random variables can take any value in a given range (data that you measure) (the data can take an infinite number of values that cannot be counted)
- Discrete random variables can take only discrete values (data that you count) (the data can take a finite or an infinite number of values that can be counted)

The behaviour of a random variable is defined by its probability distribution (it defines how probabilities are distributed across the different values of the random variable)

Probability distributions are defined by 2 functions:
- the probability distribution function FX(x) (usually called the cumulative distribution function)
- the probability density function fX(x) for continuous RVs, or the probability mass function pX(x) for discrete RVs
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


The cumulative distribution function (cdf) is defined as $F_X: \mathbb{R} \to [0, 1]$ such that

$F_X(x) = P(X \le x)$

The cdf gives the probability of the RV taking any value between its lowest possible value (which could be $-\infty$) and x.
The cdf has the following properties:

• $0 \le F_X(x) \le 1$
• $\lim_{x \to -\infty} F_X(x) = 0$
• $\lim_{x \to +\infty} F_X(x) = 1$
• $x \le y \Rightarrow F_X(x) \le F_X(y)$
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


(Figure: example cdfs of a continuous RV and of a discrete RV.)

The cdf of a continuous RV is a smooth function; the cdf of a discrete RV is a stepwise function.

For continuous RVs we have $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$ and $f_X(x) = \dfrac{dF_X(x)}{dx}$

where $f_X(x)$ is the probability density function (pdf)
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


(Figure: example cdfs of a continuous RV and of a discrete RV.)

The cdf of a continuous RV is a smooth function; the cdf of a discrete RV is a stepwise function.

For discrete RVs we have $F_X(x) = \sum_{x_j \le x} f_X(x_j)$ and $f_X(x_j) = F_X(x_j) - F_X(x_{j-1})$

where $f_X(x)$ is the probability mass function (pmf)… usually also called pdf
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


(Figure: examples of discrete (stepwise) pdfs (pmfs) and of continuous (smooth) pdfs.)
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Important things to remember

Consistency condition:

$\int_{-\infty}^{+\infty} f_X(x)\,dx = 1$    for continuous RVs

$\sum_{\text{all } x_j} f_X(x_j) = 1$    for discrete RVs

From the cdf it is possible to obtain the probability of the RV taking values between a lower bound a and an upper bound b:

$P(a < X \le b) = F_X(b) - F_X(a)$
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Important things to remember

$f_X(x_j) = F_X(x_j) - F_X(x_{j-1})$ → for a discrete RV, the pmf defines the probability of occurrence of each value of the RV!!

$P(a < X \le b) = F_X(b) - F_X(a)$

For a continuous RV, the values of the pdf are not probabilities and have no meaning on their own!!

For a continuous RV, the probability of occurrence of a specific value of the RV is zero!!
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


(Figure: a pdf $f_X$ and the corresponding cdf $F_X = P(X \le x)$, with the area between x = 3 and x = 4 highlighted and the values $F_X(3)$ and $F_X(4)$ marked.)

$P(3 < X \le 4) = \text{area} = \int_{3}^{4} f_X(x)\,dx = \int_{0}^{4} f_X(x)\,dx - \int_{0}^{3} f_X(x)\,dx = F_X(4) - F_X(3)$

$P(X = 4) = \text{area?} = \int_{3.9999999999999999999}^{4} f_X(x)\,dx = F_X(4) - F_X(3.9999999999999999999) \approx 0$
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Important things to remember

A histogram and a pmf may look similar, but they're not the same:
- A histogram is a discrete version of the pdf of a continuous RV (because usually we don't have the full population, just a representative sample). The pmf is the pdf of a discrete RV
- The vertical axis of the histogram represents the number of times a value of the RV falls within a certain interval (bin). The vertical axis of the pmf represents the probability of each value of the discrete RV
(Figure: example histogram of a continuous RV.)
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Example: levels of earthquake damage in 10.000 buildings

• X is the (variable) level of damage of buildings hit by an earthquake
• X is a discrete random variable that can be either 0, 1, 2, 3, or 4
• The distribution of the probability of the buildings being damaged by a certain level X is defined by the probability mass function (pmf)

X: 0 = None, 1 = Light, 2 = Moderate, 3 = High, 4 = Collapse

x: 0 / 1 / 2 / 3 / 4
P(X=x): 0.0039 / 0.0469 / 0.2109 / 0.4219 / 0.3164
nº of buildings: 39 / 469 / 2109 / 4219 / 3164
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Example: levels of earthquake damage in 10.000 buildings


x: 0 / 1 / 2 / 3 / 4
P(X=x): 0.0039 / 0.0469 / 0.2109 / 0.4219 / 0.3164

Graphically: (bar chart of the pmf, probability vs. level of damage.)
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Example: levels of earthquake damage in 10.000 buildings

What was the probability of a building not collapsing?

Let A be the event corresponding to buildings damaged by level 4 (collapse). Then, Ā = "damage level 3 or less".

Since A and Ā are disjoint and collectively exhaustive events:

P(Ā) = 1 − P(A) = 1 − 0.3164 = 0.6836

68.36% is the probability of a building not collapsing
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables

Example: levels of earthquake damage in 10.000 buildings

What was the probability of a building having a damage level 3 or 4?

Let A be the event corresponding to having a damage level 4 and B the event corresponding to having a damage level 3.

Since A and B are disjoint events:

P(A ∪ B) = P(A) + P(B) = 0.3164 + 0.4219 = 0.7383

73.83% is the probability of a building having a damage level of 3 or 4
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


Example: random numbers between 0 and 1 from a spinner

Random variable - the numerical value of the spinner


(continuous random variable)
Sample Space - any number between 0 and 1
(there is an infinite number of outcomes)

• The probabilities for continuous random variables


are assigned by a probability density function (pdf)
• Since no value has a larger chance of occurring than
another, the distribution of values is uniform
• The probability of any exact value is 0!!
• What we can determine is the probability of a range
of values of the random variable
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


Example: random numbers between 0 and 1 from a spinner

• The probability of a range of values of the random


variable is the area under the pdf function
• For example:
The probability of a value between 0 and 0.5
P(0 ≤ X ≤ 0.5)
= shaded rectangle
= height × base
= 1 × 0.5 = 0.5
• Areas = Probabilities!!
• Areas = Integral of the pdf function!!
58
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


Example: random numbers between 0 and 1 from a spinner

• Let’s try to deduce the cdf and the pdf of the RV

Let the RV be X: the number obtained by the spinner


Considering that any number between 0 and 1 has the
same likelihood of occurring
$F_X(x) = P(X \le x) = \dfrac{x}{1 - 0} = \dfrac{\text{any possible number}}{\text{sample space}} = x$

$f_X(x) = \dfrac{dF_X(x)}{dx} = 1$    (only valid for values of X between 0 and 1)
Review of probability theory and related basic concepts

Discrete & Continuous Distributions of Random Variables


Example: random numbers between 0 and 1 from a spinner

• Let’s compare it with the cdf and the pdf


of the uniform distribution

$F_X(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x-a}{b-a} & \text{for } a \le x < b \\ 1 & \text{for } x \ge b \end{cases}$    (where a = 0 and b = 1 in our case)

$f_X(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases}$
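A minimal sketch (not part of the original slides, assuming SciPy is available) checking the hand-derived spinner cdf/pdf against the library's uniform distribution; loc and scale are SciPy's parametrisation of the [a, b] interval:

```python
from scipy.stats import uniform

a, b = 0.0, 1.0
U = uniform(loc=a, scale=b - a)   # uniform distribution on [0, 1]

print(U.cdf(0.5))                 # 0.5 -> F_X(x) = x on [0, 1]
print(U.pdf(0.3))                 # 1.0 -> f_X(x) = 1/(b - a) = 1 on [0, 1]
print(U.cdf(0.5) - U.cdf(0.0))    # P(0 <= X <= 0.5) = 0.5
```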
Moments as Descriptors of Random Variables
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables


The probabilistic model of a RV can be completely described if the cdf or
pdf are specified along with their parameters.

In practice, the exact form of the probabilistic model may not be known.
In those cases, a RV can be defined using its moments.

A moment is a quantitative measure of the shape of a set of points, used in


both mechanics and statistics. Moments are defined using the expectation
operator.
Considering a real function g(X), with X being a RV, the expectation E[g(X)] is

$E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x)\,dx$    for a continuous RV

$E[g(X)] = \sum_i g(x_i) f_X(x_i)$    for a discrete RV
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables


The function g(X) usually takes the general form:

$g(x) = (x - c)^k$

where k is an integer defining the kth order of the moment:

$E\left[(X - c)^k\right] = \int_{-\infty}^{+\infty} (x - c)^k f_X(x)\,dx$

When k = 0, the zeroth order moment gives the consistency condition:

$E\left[(X - c)^0\right] = \int_{-\infty}^{+\infty} (x - c)^0 f_X(x)\,dx = 1$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables


When c = 0, the moments are called raw moments. The 1st order moment is a raw moment called the expected value or mean value:

$\mu_X = E[X] = \int_{-\infty}^{+\infty} x \cdot f_X(x)\,dx$

which corresponds to the centroid of the area under the curve of $f_X(x)$.

In statistics, the 2nd order moment, the variance, is called a central moment since c = μX, and measures the dispersion of the RV X around its mean:

$\sigma_X^2 = V(X) = E\left[(X - \mu_X)^2\right] = \int_{-\infty}^{+\infty} (x - \mu_X)^2 \cdot f_X(x)\,dx$

This can be re-written as $V(X) = E\left[X^2\right] - \left(E[X]\right)^2$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables


From the variance, we can define alternative measures of dispersion such as the standard deviation σ, which has the same units as X, and the coefficient of variation CoV, which is unitless:

$\sigma_X = \sqrt{\sigma_X^2}$    $CoV_X = \sigma_X / \mu_X$

The 3rd order moment is also a central moment that, when divided by σ³, is called the skewness coefficient and measures the asymmetry of the RV X with respect to its mean:

$E\left[(X - \mu_X)^3\right] = \int_{-\infty}^{+\infty} (x - \mu_X)^3 \cdot f_X(x)\,dx$

$\gamma_1 = \dfrac{E\left[(X - \mu_X)^3\right]}{\sigma_X^3}$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables


The 4th order moment is also a central moment that, when divided by σ⁴, is called the kurtosis coefficient and measures the flatness/peakedness of the RV X around its mean:

$E\left[(X - \mu_X)^4\right] = \int_{-\infty}^{+\infty} (x - \mu_X)^4 \cdot f_X(x)\,dx$

$\gamma_2 = \dfrac{E\left[(X - \mu_X)^4\right]}{\sigma_X^4}$

The kurtosis coefficient is usually compared to the value 3 (the kurtosis coefficient of a RV that follows a Normal distribution).
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Kurtosis above (or below) the normal value of 3 is called "excess kurtosis".
Review of probability theory and related basic concepts

Other Descriptors of Random Variables: quantiles (or fractiles or


percentiles)
The pth level quantile of a RV X that has a cdf $F_X(x)$ is the value $x_p$ such that:

$F_X(x_p) = p$, with $0 \le p \le 1$

For example, the median is the quantile for level p = 0.50. Quantiles are often used in civil engineering to set the value of loads and material properties.

The pth level quantile $x_p$ of a RV is the value of the RV that has a probability 1 − p of being exceeded:

$P(X > x_p) = 1 - F_X(x_p) = 1 - p$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example: consider two discrete RVs, (1) a RV with pmf P(X=2) = 0.1, P(X=3) = 0.4, P(X=4) = 0.4, P(X=5) = 0.1, and (2) a fair die with P(X=x) = 1/6 for x = 1, …, 6.

The mean of X is defined by:

(1) $\mu_X = E[X] = \sum x \cdot f_X(x) = 2 \times 0.1 + 3 \times 0.4 + 4 \times 0.4 + 5 \times 0.1 = 3.5$

(2) $\mu_X = E[X] = \sum x \cdot f_X(x) = (1 + 2 + 3 + 4 + 5 + 6) \times 1/6 = 3.5$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example (continued): the same two discrete RVs.

The variance of X is defined by:

(1) $V(X) = E\left[(X - \mu_X)^2\right] = \sum (x - \mu_X)^2 \cdot f_X(x) = (2-3.5)^2 \times 0.1 + (3-3.5)^2 \times 0.4 + (4-3.5)^2 \times 0.4 + (5-3.5)^2 \times 0.1 = 0.65$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example (continued): the same two discrete RVs.

The variance of X is defined by:

(2) $V(X) = E\left[(X - \mu_X)^2\right] = \sum (x - \mu_X)^2 \cdot f_X(x) = \dfrac{(1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2}{6} \approx 2.9$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example (continued): the same two discrete RVs.

The standard deviation and CoV of X are defined by:

(1) $\sigma_X = \sqrt{0.65} = 0.81$    $CoV_X = \sigma_X / \mu_X = 0.81 / 3.5 = 0.23$

(2) $\sigma_X = \sqrt{2.9} = 1.7$    $CoV_X = \sigma_X / \mu_X = 1.7 / 3.5 = 0.49$
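A minimal sketch (not part of the original slides, assuming NumPy is available) computing the mean, variance, standard deviation and CoV of the first discrete RV directly from its pmf:

```python
import numpy as np

x = np.array([2, 3, 4, 5])
p = np.array([0.1, 0.4, 0.4, 0.1])    # pmf values, sum to 1

mean = np.sum(x * p)                  # E[X]
var = np.sum((x - mean) ** 2 * p)     # E[(X - mu)^2]
std = np.sqrt(var)
cov = std / mean                      # coefficient of variation

print(mean, var, round(std, 2), round(cov, 2))  # 3.5 0.65 0.81 0.23
```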
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example: lifespan of a structural component

The lifespan of a structural component X is a RV described by the pdf:

$f_X(x) = a \cdot e^{-ax}$, with $x \ge 0$, $a \ge 0$

The mean of X is obtained by integration by parts and using limits:

$\mu_X = E[X] = \int_0^{\infty} x \cdot f_X(x)\,dx = \int_0^{\infty} x \cdot a \cdot e^{-ax}\,dx = \left[x \cdot a \cdot \dfrac{e^{-ax}}{-a}\right]_0^{\infty} - \int_0^{\infty} -e^{-ax}\,dx$

$= \left[-x\,e^{-ax}\right]_0^{\infty} + \left[\dfrac{e^{-ax}}{-a}\right]_0^{\infty} = 0 - \left(-\dfrac{1}{a}\right) = \dfrac{1}{a}$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example: lifespan of a structural component

The variance of X is defined by:

$V(X) = E\left[(X - \mu_X)^2\right] = \int_0^{\infty} (x - \mu_X)^2 \cdot f_X(x)\,dx = \int_0^{\infty} \left(x - \dfrac{1}{a}\right)^2 a\,e^{-ax}\,dx$

Integration by parts (using limits) gives:

$= \left[-\left(x - \dfrac{1}{a}\right)^2 e^{-ax}\right]_0^{\infty} + \int_0^{\infty} 2\left(x - \dfrac{1}{a}\right) e^{-ax}\,dx = \left(0 + \dfrac{1}{a^2}\right) + \dfrac{2}{a}\int_0^{\infty} a\left(x - \dfrac{1}{a}\right) e^{-ax}\,dx$

Integrating by parts once more (and using limits):

$= \dfrac{1}{a^2} + \dfrac{2}{a}\left\{\left[-\left(x - \dfrac{1}{a}\right) e^{-ax}\right]_0^{\infty} + \int_0^{\infty} e^{-ax}\,dx\right\} = \dfrac{1}{a^2} + \dfrac{2}{a}\left\{-\dfrac{1}{a} + \left[\dfrac{e^{-ax}}{-a}\right]_0^{\infty}\right\} = \ldots$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example: lifespan of a structural component

The variance of X (continued):

$\ldots = \dfrac{1}{a^2} + \dfrac{2}{a}\left(-\dfrac{1}{a} + \left(0 + \dfrac{1}{a}\right)\right) = \dfrac{1}{a^2}$

To obtain the quantiles we need to define the cdf:

$f_X(x) = a \cdot e^{-ax} \;\Rightarrow\; F_X(x) = \int_0^x f_X(u)\,du = \int_0^x a \cdot e^{-au}\,du = 1 - e^{-ax}$
Review of probability theory and related basic concepts

Moments as Descriptors of Random Variables

Example: lifespan of a structural component

The pth level quantiles are then obtained by:

$F_X(x_p) = p \;\Leftrightarrow\; 1 - e^{-a x_p} = p$

$x_p = \dfrac{-1}{a}\ln(1 - p)$
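A minimal sketch (not part of the original slides, assuming SciPy is available) checking the lifespan results with the library's exponential distribution, whose scale parameter corresponds to 1/a; the rate value is purely illustrative:

```python
from scipy.stats import expon

a = 0.5                  # hypothetical rate, just for illustration
X = expon(scale=1 / a)   # exponential RV with pdf a*exp(-a*x)

print(X.mean())          # 1/a = 2.0
print(X.var())           # 1/a^2 = 4.0
print(X.ppf(0.5))        # median x_0.5 = -ln(1 - 0.5)/a ≈ 1.386
```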
Common Probability Distribution Models
(including the Return Period and the
Central Limit Theorem)
Review of probability theory and related basic concepts

Common Probability Distribution Models


There are dozens of probability distribution models. Many of them were
developed for specific applications or to model very specific phenomena.

There are some probability distribution models that are more general and that appear in many situations. We'll focus on models representing the behaviour of RVs that are:
- the result of independent events
- the sum of different effects
- the product of different effects
- the extremes of different effects

… and we’ll also address a few other useful probability distribution models

78
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
• Bernoulli Process is a sequence of binary RVs, so it is a discrete-time
stochastic (i.e. random) process that takes only two values. The Bernoulli
variables Xi are identical and independent. Every variable Xi in the
sequence is associated with a Bernoulli trial (a random experiment with
exactly two possible outcomes, "success" and "failure", in which the
probability of success is the same every time the experiment is
conducted)... So it’s basically a repeated coin toss

• Poisson Process is a continuous-time stochastic process in which events


occur continuously and independently of one another. It is a collection of
RVs that represent the number of events and the time points at which they
occur in a given time interval (starting from time 0)... So let’s imagine the
number and time of page view requests of a website… or the number and
time of earthquake occurrences in a given region 79
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events

Question → Bernoulli Process / Poisson Process

How many successes or events in a fixed time interval or number of trials? → Binomial distribution B(n,p) / Poisson distribution P(ν)

How long (time or trials) until the first success/event? → Geometric distribution G(p) / Exponential distribution Exp(ν)

How long (time or trials) until the kth success/event? → Negative Binomial distribution NB(r,p) / Gamma distribution Γ(k, 1/ν)
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Binomial distribution B(n,p) is a discrete probability distribution that gives the
probability of k successes in n independent yes/no experiments, each having a
probability p of success.

n k
f B ( k )   p (1 − p )
n−k
=
k 
k
n j
FB ( k ) ∑   p (1 − p )
n− j

j =1  j 

n n!
with   = Mean: µ K = np
 k  k !( n − k )! Variance: V (=
K ) np (1 − p )
Binomial coefficient
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: operative bulldozers
A contractor has 3 independent bulldozers, each with a probability of being
operative of 0.90. What is the probability of 2 bulldozers being inoperative?

Possible combinations of events Let X = number of operative bulldozers


(G: good bulldozer; B: bad bulldozer) (operative = “success”)

GGG → X = 3:    $P(X=3) = p \times p \times p$
GGB, GBG, BGG → X = 2:    $P(X=2) = 3\left[p \times p \times (1-p)\right]$
BBG, BGB, GBB → X = 1:    $P(X=1) = 3\left[p \times (1-p) \times (1-p)\right]$
BBB → X = 0:    $P(X=0) = (1-p) \times (1-p) \times (1-p)$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: operative bulldozers
A contractor has 3 independent bulldozers, each with a probability of being
operative of 0.90. What is the probability of 2 bulldozers being inoperative?

Possible combinations of events Let X = number of operative bulldozers


(G: good bulldozer; B: bad bulldozer) (operative = “success”)
Generalised to $P(X=x) = \dbinom{3}{x} p^x (1-p)^{3-x}$

Event A = "2 bulldozers are inoperative" ≡ Event B = "1 bulldozer is operative"

$P(B) = \dfrac{3!}{1!\,(3-1)!} \times 0.9^1 \times (1-0.9)^{3-1} = 0.027$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: operative bulldozers
A contractor has 3 independent bulldozers, each with a probability of being
operative of 0.90. What is the probability of 2 bulldozers being inoperative?

Alternatively, if we consider the RV Y = number of inoperative bulldozers (inoperative = "success") and the probability of success p = 0.10, we have event A = "2 bulldozers are inoperative":

$P(A) = \dfrac{3!}{2!\,(3-2)!} \times 0.1^2 \times (1-0.1)^{3-2} = 0.027$
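A minimal sketch (not part of the original slides, assuming SciPy is available) of the bulldozer example using the library's binomial pmf, with both equivalent formulations:

```python
from scipy.stats import binom

# 3 machines, p = 0.10 probability of a machine being inoperative ("success")
print(binom.pmf(2, 3, 0.10))   # P(exactly 2 inoperative) = 0.027

# Equivalent view: exactly 1 operative machine when P(operative) = 0.90
print(binom.pmf(1, 3, 0.90))   # also 0.027
```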
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Geometric distribution G(p) is a discrete probability distribution that gives the
probability of the number x of experiments, each having a probability p of
success, that are needed to get the first success.

$f_G(x) = p(1-p)^{x-1}$

$F_G(x) = 1 - (1-p)^x$

Mean: $\mu_X = \dfrac{1}{p}$    Variance: $V(X) = \dfrac{1-p}{p^2}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
the Return Period
The number x of experiments until the first success can be seen as discrete and
independent time intervals. In this case, the number of time intervals x (e.g. the
number of years, of days, of weeks, etc) until the first success is the first
occurrence time. Since the time intervals are independent, the time until the first
success must be the same as the time between 2 consecutive successes. The mean
of x, μX, can then be seen as the mean time between 2 consecutive successes (or
events) which is usually called Return Period.

(Figure: timeline of tornado occurrences; x = time to the first tornado = time between two consecutive tornadoes = Return Period.)
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Geometric distribution G(p) is a discrete probability distribution that gives the
probability of the number x of experiments, each having a probability p of
success, that are needed to get the first success.

$f_G(x) = p(1-p)^{x-1}$

$F_G(x) = 1 - (1-p)^x$

Mean: $\mu_X = \dfrac{1}{p}$ → the Return Period    Variance: $V(X) = \dfrac{1-p}{p^2}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: excessive wind speed

A wind tower was designed to be operational for wind speeds up to 100km/h. This
wind speed has a 5% annual probability of exceedance. What is the probability of
exceeding this wind speed during the lifetime of the tower which is 100 years?

Consider the event X “wind speed > 100km/h” = “success”


The Return Period of this event = 1/0.05 = 20 years
The probability of having one success during a period up to 100 years is

FG (100 ) =P ( X ≤ 100 ) =1 − (1 − 0.05 )


100
=1 − 0.95100 =0.994

88
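A minimal sketch (not part of the original slides, assuming SciPy is available) of the wind-speed example with the library's geometric distribution, i.e. the number of yearly trials until the first exceedance:

```python
from scipy.stats import geom

p = 0.05                   # annual probability of exceedance
print(1 / p)               # return period = 20 years
print(geom.cdf(100, p))    # P(first exceedance within 100 years) ≈ 0.994
```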
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Negative Binomial distribution NB(r,p) is a discrete probability distribution that
gives the probability of getting the rth success in a sequence of experiments,
each having a probability p of success.

$f_{NB}(x) = \dbinom{x-1}{r-1} p^r (1-p)^{x-r}$, with $x \ge r$

$F_{NB}(x) = \sum_{j=r}^{x} \dbinom{j-1}{r-1} p^r (1-p)^{j-r}$

Mean: $\mu_X = \dfrac{r}{p}$    Variance: $V(X) = \dfrac{r(1-p)}{p^2}$

There are alternative definitions for this distribution: Y = number of failures before the rth success. This formulation is equivalent to the one in terms of X = trial at which the rth success occurs, since Y = X − r.
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Poisson distribution P(ν) is a discrete probability distribution that gives the
probability of k events occurring in a fixed interval of time and/or space if
these events occur with a known mean rate of occurrence ν and independently
of the time since the last event.

$f_P(k) = \dfrac{(\nu t)^k}{k!}\,e^{-\nu t}$

$F_P(k) = \sum_{j=0}^{k} \dfrac{(\nu t)^j}{j!}\,e^{-\nu t}$

Mean: $\mu_K = \nu t$    Variance: $V(K) = \nu t$

($\lambda = \nu t$ is the mean number of events that occur in a specified time t)
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Exponential distribution Exp(ν) is a continuous probability distribution that
gives the probability of the time t between events in a Poisson process with a
known mean rate of occurrence ν

$f_{Exp}(t) = \nu e^{-\nu t}$, with $t \ge 0$

$F_{Exp}(t) = 1 - e^{-\nu t}$

Mean: $\mu_T = \dfrac{1}{\nu}$    Variance: $V(T) = \dfrac{1}{\nu^2}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Revisiting the Return Period
In the Poisson distribution, νt is the mean number of events that occur in a
specified time (or reference period) t. If t is set to 1 unit (year, day, etc), 1/ν is
the mean time between events, which coincides with the mean value of the
exponential distribution and corresponds also to the Return Period.

What’s the return period


of a boomerang?
92
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Alternative explanation for the Return Period
… still considering the Poisson distribution, the return period (RP) can also be
established by the following reasoning

Consider the particular case where we need the probability Pt of any non-zero
number of events occurring during a reference period of time t (in years)
$P_t = p(1) + p(2) + \ldots = 1 - p(0) = 1 - f_P(0) = 1 - \dfrac{(\nu t)^0}{0!}\,e^{-\nu t} = 1 - e^{-\nu t} = 1 - e^{-t/RP}$

$RP = \dfrac{-t}{\ln(1 - P_t)}$ → for low probability events → $RP \approx \dfrac{t}{P_t}$

In earthquake engineering, $P_t$ is called the seismic hazard $H_t$. Hence $RP \approx \dfrac{1}{H_1}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: earthquake occurrences #1

What is the return period (RP) of an earthquake with a 2% probability of occurrence


in 50 years?
$RP = \dfrac{-50}{\ln(1 - 0.02)} = 2474.9$ years

$RP \approx \dfrac{50}{0.02} = 2500$ years

The annual seismic hazard can then be seen to be

$H_1 \approx \dfrac{1}{RP} = \dfrac{1}{2500} = 0.04\%$

$H_1 = 1 - e^{-T/RP} = 1 - e^{-1/2474.9} = 0.0404\%$
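A minimal sketch (plain Python, not part of the original slides) of the return period calculation for an event with a 2% probability of occurrence in 50 years:

```python
import math

t, Pt = 50.0, 0.02
rp_exact = -t / math.log(1 - Pt)    # Poisson-based definition
rp_approx = t / Pt                  # low-probability approximation

print(round(rp_exact, 1))           # ≈ 2474.9 years
print(rp_approx)                    # 2500.0 years
print(1 - math.exp(-1 / rp_exact))  # annual hazard H1 ≈ 0.000404 (0.0404%)
```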
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: earthquake occurrences #2

Consider that in the last 50 years 2 large earthquakes (MW > 6) occurred in a given
region and that such occurrences can be modelled by a Poisson process. What is
the probability of occurrence of such earthquakes within the next 2 years?
Consider the event “occurrence of a MW > 6 earthquake”
The mean rate of occurrence of this event = 2/50 = 0.04/year and the return
period is 1/0.04 = 25 years

The probability of having an event within the next 2 years is

$F_{Exp}(2) = P(t \le 2) = 1 - e^{-0.04 \times 2} = 0.077$

We can also determine this probability using the Poisson distribution
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Example: earthquake occurrences #2

For the case where we use the probability Pt of any non-zero number of events
occurring during a reference period of time t (in years)
$P_t = p(1) + p(2) + \ldots = 1 - p(0) = 1 - f_P(0) = 1 - \dfrac{(\nu t)^0}{0!}\,e^{-\nu t} = 1 - e^{-\nu t} = 1 - e^{-0.04 \times 2} = 0.077$
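A minimal sketch (not part of the original slides, assuming SciPy is available) checking the 2-year probability both with the exponential time-between-events model and with the Poisson count model:

```python
from scipy.stats import expon, poisson

nu = 2 / 50   # mean rate of occurrence = 0.04 events / year
t = 2         # reference period in years

print(expon(scale=1 / nu).cdf(t))      # P(time to next event <= 2) ≈ 0.077
print(1 - poisson(mu=nu * t).pmf(0))   # P(at least one event in 2 years) ≈ 0.077
```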
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from


independent events
Gamma distribution Γ(k, 1/ν) is a continuous probability distribution that gives the probability distribution of the time t needed for k events to occur, when the events occur with a known mean rate of occurrence ν (here ν = 1/θ):

$f_\Gamma(t) = \dfrac{\nu\,(\nu t)^{k-1}}{\Gamma(k)}\,e^{-\nu t}$, with $t \ge 0$

$F_\Gamma(t) = \int_0^{\nu t} \dfrac{y^{k-1}}{\Gamma(k)}\,e^{-y}\,dy$

with the Gamma function $\Gamma(k) = \int_0^{\infty} e^{-y}\,y^{k-1}\,dy$

Mean: $\mu_T = \dfrac{k}{\nu}$    Variance: $V(T) = \dfrac{k}{\nu^2}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from the sum of
different effects
• Central Limit Theorem: this theorem states that if we consider Sn to be the
sum (or average) of n independent RVs, each with an arbitrary probability
distribution, under certain conditions (the Lindeberg condition: the variance
of each RV divided by the sum of the variances of all the RVs tends to zero as
n tends to ∞), the distribution of Sn is well-approximated by a certain type of
continuous function known as a normal density function.

(Figure: the distribution of Sn approaching a normal density function.)
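A rough illustration of the theorem above (not part of the original slides, assuming NumPy is available): summing n = 30 independent uniform RVs and comparing the simulated mean and standard deviation of Sn with the values predicted by the normal approximation.

```python
import numpy as np

rng = np.random.default_rng(0)    # hypothetical seed, only for reproducibility
n, n_samples = 30, 100_000

# S_n for each of the simulated samples
sums = rng.uniform(0, 1, size=(n_samples, n)).sum(axis=1)

# Compare the simulated moments with the theoretical values
print(sums.mean(), n * 0.5)            # ~15.0 vs 15.0
print(sums.std(), np.sqrt(n / 12.0))   # ~1.58 vs 1.58 (uniform variance = 1/12)
```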
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from the sum of
different effects
Normal (or Gaussian) distribution N(μ,σ) is the continuous probability
distribution that is most used in statistics and probability analysis across all areas
of research (especially due to the Central Limit Theorem and its implications)
$f_N(x) = \varphi(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

$F_N(x) = \Phi(x) = \int_{-\infty}^{x} \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2}\,dy$

Mean: $\mu_X = \mu$    Variance: $V(X) = \sigma^2$

(when μ = 0 and σ = 1, we have a standard normal distribution)
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from the product
of different effects
Lognormal distribution LN(λ,β) is the continuous probability distribution of a RV
X when its natural logarithm ln(X) = Y follows a normal distribution
$f_{LN}(x) = \dfrac{1}{\beta x \sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{\ln(x)-\lambda}{\beta}\right)^2}$

$F_{LN}(x) = \int_{0}^{x} \dfrac{1}{\beta z \sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{\ln(z)-\lambda}{\beta}\right)^2}\,dz$

where $\lambda = \mu_{\ln(X)}$ is the mean value of the log of the data and $\beta = \sigma_{\ln(X)}$ is the standard deviation of the log of the data

Mean: $\mu_X = e^{\lambda + \frac{\beta^2}{2}}$    Variance: $V(X) = \mu_X^2\left(e^{\beta^2} - 1\right)$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from extremes of


different effects
• Extreme distributions: Let Y1, Y2, …, Yn be a series of identically distributed independent RVs. Considering that Xmax and Xmin represent the maximum and minimum values among the variables Yi, it can be proven that the distributions of Xmax and Xmin must be one of the following models:

type I or Gumbel distribution function for maxima

Xmax type II or Fréchet distribution function for maxima


type III or Weibull distribution function for maxima

type I or Gumbel distribution function for minima


Xmin type II or Fréchet distribution function for minima
type III or Weibull distribution function for minima
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from extremes of


different effects
Gumbel distribution for maximum values (the "Extreme Value" distribution in Matlab):

$F_I(x) = e^{-e^{-\alpha(x-u)}}$, with $-\infty \le x \le \infty$

Mean: $\mu_X = u + \dfrac{\gamma}{\alpha}$    Variance: $V(X) = \dfrac{\pi^2}{6\alpha^2}$

γ is the Euler constant (0.57721566490…)

Fréchet distribution for maximum values (in Matlab, define 1/x and use the Weibull distribution):

$F_{II}(x) = e^{-\left(\frac{u}{x}\right)^k}$, with $0 \le x \le \infty$

Mean: $\mu_X = u\,\Gamma\left(1 - \dfrac{1}{k}\right)$    Variance: $V(X) = u^2\left[\Gamma\left(1 - \dfrac{2}{k}\right) - \Gamma^2\left(1 - \dfrac{1}{k}\right)\right]$
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from extremes of


different effects
Weibull distribution for maximum values:

$F_{III}(x) = e^{-\left(\frac{\varepsilon - x}{\varepsilon - u}\right)^k}$, with $x \le \varepsilon$

Mean: $\mu_X = \varepsilon + (u - \varepsilon)\,\Gamma\left(1 + \dfrac{1}{k}\right)$

Variance: $V(X) = (u - \varepsilon)^2\left[\Gamma\left(1 + \dfrac{2}{k}\right) - \Gamma^2\left(1 + \dfrac{1}{k}\right)\right]$

There are alternative definitions for this distribution (e.g. in many cases the distribution is presented for ε = 0)

The distributions for the minimum values can be derived by noting that min(Yi) = −max(−Yi)
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from extremes of


different effects

How to determine which extreme model is best? Although data fitting


techniques are usually used for this, there are a few other aspects to
also account for:

- If the tail decay of the data is exponential (e.g. as in the Gamma,


Gaussian and Exponential models), the maxima distribution converges
towards the Type I (Gumbel)

- If the tail decay of the data is proportional to x-k as x → ∞, the maxima


distribution converges towards the Type II (Fréchet)

- If the tail decay of the data is proportional to xk as x → 0-, the maxima


distribution converges towards the Type III (Weibull)
104
Review of probability theory and related basic concepts

Common Probability Distribution Models: RVs that come from extremes of


different effects
Generalized extreme value distribution: it is a family of continuous probability
distributions that combines the Gumbel, Fréchet and Weibull families
$f_{GExt}(x) = \dfrac{1}{s}\left[1 + \zeta\left(\dfrac{x-u}{s}\right)\right]^{-1/\zeta - 1} \cdot e^{-\left[1 + \zeta\left(\frac{x-u}{s}\right)\right]^{-1/\zeta}}$

$F_{GExt}(x) = e^{-\left[1 + \zeta\left(\frac{x-u}{s}\right)\right]^{-1/\zeta}}$

with $1 + \zeta\left(\dfrac{x-u}{s}\right) > 0$, where ζ, u, s are the parameters

by setting ζ = 0, > 0 or < 0, the Gumbel, Fréchet and Weibull families are obtained, respectively
Review of probability theory and related basic concepts

Common Probability Distribution Models: other distributions

Uniform distribution: useful to model data with values that are equally probable

 1
 for a ≤ x ≤ b
fU ( x ) =  b − a
0 for x < a or x > b

0 for x < a
 x − a
FU ( x ) 
= for a ≤ x < b
b − a
1 for x ≥ b
( b − a)
2
a+b
Mean: µ X = Variance: V ( X ) =
2 12
Review of probability theory and related basic concepts

Common Probability Distribution Models: other distributions

χ² distribution with k degrees of freedom χ²(k) is the distribution of a sum of the squares of k independent standard normal random variables, $X_1^2 + X_2^2 + \ldots + X_k^2$

$f_{\chi^2}(y) = \dfrac{y^{k/2-1}\,e^{-y/2}}{2^{k/2}\,\Gamma\left(\frac{k}{2}\right)}$

$F_{\chi^2}(y) = \dfrac{\gamma\left(\frac{k}{2}, \frac{y}{2}\right)}{\Gamma\left(\frac{k}{2}\right)}$    (γ is the lower incomplete Gamma function)

Mean: $\mu_Y = k$    Variance: $V(Y) = 2k$
Review of probability theory and related basic concepts

Common Probability Distribution Models: other distributions

Student's t distribution with ν degrees of freedom t(ν): if we have a random sample of size n drawn from a Normal distribution N(μ,σ) and denote the sample mean $\bar{x}$ and the sample standard deviation s, the quantity

$z = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$    (don't forget that $s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$)

has a t distribution with ν = n − 1 degrees of freedom

$f_t(z) = \Gamma\left(\dfrac{\nu+1}{2}\right)\left[\sqrt{\nu\pi}\,\Gamma\left(\dfrac{\nu}{2}\right)\right]^{-1}\left(1 + \dfrac{z^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

Mean: $\mu_Z = 0$ (for ν > 1); for ν = 1 → Cauchy distribution, which has no mean and no variance

Variance: $V(Z) = \dfrac{\nu}{\nu - 2}$; for 1 < ν ≤ 2 it is ∞
Review of probability theory and related basic concepts

Common Probability Distribution Models: other distributions

Beta distribution Beta(α,β) is a family of continuous distributions defined over the interval [a, b]:

$f_{Beta}(x) = \dfrac{(x-a)^{\alpha-1}(b-x)^{\beta-1}}{B(\alpha,\beta)\,(b-a)^{\alpha+\beta-1}}$, with $\alpha > 0$, $\beta > 0$

$B(\alpha,\beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx$

With $Y = \dfrac{X-a}{b-a}$:    $F_{Beta}(y) = I_y(\alpha,\beta)$, the regularized incomplete beta function

Mean: $\mu_X = a + \dfrac{\alpha(b-a)}{\alpha+\beta}$    Variance: $V(X) = \dfrac{\alpha\beta(b-a)^2}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Review of probability theory and related basic concepts

Common Probability Distribution Models: other distributions

Triangular distribution T(a,b,c) is a simple continuous probability distribution defined over the interval [a, b], where c is the mode (c = (a+b)/2 gives a symmetric distribution):

$f_T(x) = \begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)} & \text{for } a \le x \le c \\ \dfrac{2(b-x)}{(b-a)(b-c)} & \text{for } c \le x \le b \end{cases}$

$F_T(x) = \begin{cases} \dfrac{(x-a)^2}{(b-a)(c-a)} & \text{for } a \le x \le c \\ 1 - \dfrac{(b-x)^2}{(b-a)(b-c)} & \text{for } c \le x \le b \end{cases}$

Mean: $\mu_X = \dfrac{a+b+c}{3}$    Variance: $V(X) = \dfrac{a^2+b^2+c^2-ab-ac-bc}{18}$
Confidence Intervals
Review of probability theory and related basic concepts

Confidence Intervals
Remember this…
• An estimator of a Population Parameter is a Sample Statistic used to estimate or
predict that Population Parameter.
• An estimate is a particular numerical value of a Sample Statistic obtained through
sampling.

Considering a single sample of data and an estimate obtained from that sample
to establish a population parameter, a confidence interval for that population
parameter corresponds to a range of values bounding the estimate with a
certain probability of containing the true value of the population parameter

This is the single sample interpretation for what is a confidence interval... There is also the
repeated sample interpretation.
112
Review of probability theory and related basic concepts

Confidence Intervals

More formally, consider that θ is a population parameter and that $\hat{\theta}$ is used as an estimate for θ (and was obtained using an estimator T). An interval estimate for θ has the form

$\hat{\theta}_L \le \theta \le \hat{\theta}_U$

where $\hat{\theta}_L$ and $\hat{\theta}_U$ are the lower and upper bounds of the interval, and are computed from the sample data (so the bounds $\hat{\theta}_L$ and $\hat{\theta}_U$ are a function of $\hat{\theta}$, which means that different samples will produce different values for $\hat{\theta}_L$ and $\hat{\theta}_U$).

If $\hat{\theta}_L$ is set to be $-\infty$ or if $\hat{\theta}_U$ is set to be $+\infty$, a one-sided bound (or interval) is obtained.
Review of probability theory and related basic concepts

Confidence Intervals

If the probabilistic distribution of T is known, it is possible to define values for $\hat{\theta}_L$ and $\hat{\theta}_U$ such that

$P\left(\hat{\theta}_L \le \theta \le \hat{\theta}_U\right) = 1 - \alpha$    (with $0 \le \alpha \le 1$)

This expression states that the interval defined by the bounds $\hat{\theta}_L$ and $\hat{\theta}_U$ has a (1 − α) probability of containing θ.

(1 − α) is called the confidence level and α is called the significance level.

So we need the probabilistic distribution of the estimator T to be able to construct a confidence interval...
Review of probability theory and related basic concepts

Confidence Intervals
Remember this…
• Central Limit Theorem: if we consider Sm to be the sum (or average) of m independent
RVs, … , the distribution of Sm is well-approximated by … a normal density function

If we extract m independent samples from a certain population and calculate the mean $\bar{X}$ for each sample, the data $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_m$ will follow a normal distribution.

In particular, it can be proven that if we extract m independent samples of size n from a certain distribution with a true mean μ and a true standard deviation σ, the distribution of the sample mean $\bar{X}$ follows a normal distribution $N(\mu, \sigma/\sqrt{n})$.

So we can construct a confidence interval for the mean of a distribution. Now we need to define $\hat{\theta}_L$ and $\hat{\theta}_U$ based on this information...
Review of probability theory and related basic concepts

Confidence Intervals

If the sample mean $\bar{X}$ follows a normal distribution $N(\mu, \sigma/\sqrt{n})$ we can define the standard normal variable Z by

$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Variable Z follows a standard normal distribution N(0,1).
Considering that we want to construct a confidence interval with a (1 − α) confidence level of (1 − 0.05) = 0.95 (i.e. an interval that has a 95% probability of containing μ), what is the value of Z that has a:

• (1 − α/2) probability of not being exceeded → P(Z ≤ ?) = 1 − α/2 = 97.5% → Z = 1.96
• (α/2) probability of not being exceeded → P(Z ≤ ?) = α/2 = 2.5% → Z = −1.96
(Standard normal distribution table: cumulative probabilities Φ(x), with rows indexed by the first digits of x and columns by the second digit of x.)
Review of probability theory and related basic concepts

Confidence Intervals

(Figure: standard normal density with the central area 1 − α = 0.95 and α/2 = 0.025 in each tail; z = −1.96 and z = 1.96 in Z units; lower confidence limit, point estimate and upper confidence limit in X units.)

The values of $Z = z_{\alpha/2} = -1.96$ and $Z = z_{1-\alpha/2} = 1.96$ are found using the cdf of a standard normal distribution and looking for the values of Z corresponding to the required probabilities.
Review of probability theory and related basic concepts

Confidence Intervals
Using the values of $Z = z_{\alpha/2} = -1.96$ and $Z = z_{1-\alpha/2} = 1.96$, we can say that

$P(-1.96 \le Z \le 1.96) = P\left(-1.96 \le \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$

which states that there is a 95% probability that the value of Z is between −1.96 and 1.96.
We can rewrite this expression as:

$P\left(z_{\alpha/2} \le Z \le z_{1-\alpha/2}\right) = P\left(z_{\alpha/2} \le \dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right) = P\left(\bar{X} - z_{1-\alpha/2}\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} - z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$
Review of probability theory and related basic concepts

Confidence Intervals
Since $z_{\alpha/2} = -z_{1-\alpha/2}$, we get the more general form of the confidence interval of the mean:

$\bar{X} - z_{1-\alpha/2}\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\dfrac{\sigma}{\sqrt{n}}$

where $\bar{X}$ is the sample mean, μ is the true value of the mean, σ is the true value of the standard deviation, n is the sample size, $z_{1-\alpha/2}$ is the value of a standard normal variable with a 1 − α/2 probability of not being exceeded, and $z_{1-\alpha/2}\,\sigma/\sqrt{n}$ is the margin of error.
Review of probability theory and related basic concepts

Confidence Intervals
The one-sided versions of this interval are then:

$\mu \le \bar{X} + z_{1-\alpha}\dfrac{\sigma}{\sqrt{n}}$    (upper-bounded one-sided interval)

$\bar{X} - z_{1-\alpha}\dfrac{\sigma}{\sqrt{n}} \le \mu$    (lower-bounded one-sided interval)
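A minimal sketch (not part of the original slides, assuming NumPy and SciPy are available) of the two-sided confidence interval for the mean when σ is known; the data values and the assumed σ are hypothetical, purely for illustration:

```python
import numpy as np
from scipy.stats import norm

sample = np.array([9.8, 10.2, 10.1, 9.7, 10.4, 10.0, 9.9, 10.3])  # hypothetical data
sigma = 0.25          # assumed known true standard deviation (illustrative)
alpha = 0.05          # 95% confidence level

x_bar = sample.mean()
n = sample.size
z = norm.ppf(1 - alpha / 2)          # z_{1-alpha/2} = 1.96
margin = z * sigma / np.sqrt(n)      # margin of error

print(x_bar - margin, x_bar + margin)  # lower and upper confidence limits
```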
Review of probability theory and related basic concepts

Confidence Intervals
How large should n be? (According to the Central Limit Theorem, the larger the sample size, the better the normal approximation to the sampling distribution of $\bar{X}$.)
For practical applications, n = 30 is usually seen as a minimum

How to reduce the margin of error?


Decreasing the margin of error = a narrower confidence interval
The margin of error can be reduced if
• the standard deviation is lower: σ ↓
• The sample size is increased: n ↑
• The confidence level is decreased: (1 – α) ↓

A lower margin of error = less uncertainty!! 122


Review of probability theory and related basic concepts

Confidence Intervals
How to choose or define the confidence level?
There are no specific answers for this… 95% is perhaps the most common
value… other common values are 99% and 90%. Values lower than 75% are
essentially never used. The higher the desired confidence level, the wider the
confidence interval will have to be.

Confidence Level    z_{1−α/2} value
80%                 1.28
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58
99.8%               3.09
99.9%               3.29 123
Review of probability theory and related basic concepts

Confidence Intervals
The confidence interval assumes that σ is known, but this parameter might
not be known for most cases. We usually only have its sample estimate s.

In this case, the normal distribution can no longer be used and it can be
proven that the variable T defined by

    T = (X̄ − μ) / (s/√n)

follows a t distribution with n − 1 degrees of freedom, t(n−1).

124
Review of probability theory and related basic concepts

Confidence Intervals
The t distribution looks like a normal distribution, but has “thicker” tails. The
tail thickness is controlled by the degrees of freedom
[Figure: densities of the standard normal distribution and of t distributions with df = 5 and df = 1, showing progressively thicker tails]

• The smaller the degrees of freedom, the thicker the tails of the t
distribution
• If the degrees of freedom is large (if we have a large sample size),
then the t distribution approaches the standard normal distribution
Review of probability theory and related basic concepts

Confidence Intervals
Variable T follows a t distribution t(n−1).
Considering that we want to construct a confidence interval with a (1 − α)
confidence level of (1 − 0.05) = 0.95 (i.e. an interval that has a 95% probability
of containing μ), what is the value of T that has a:

• (1 − α/2) probability of occurrence for a given sample size n: t_{n−1,1−α/2}

• (α/2) probability of occurrence for a given sample size n: t_{n−1,α/2}

The values of T = t_{n−1,1−α/2} and T = t_{n−1,α/2} are found using the cdf of a t
distribution with n − 1 degrees of freedom and looking for the values of T
corresponding to the required probabilities.

126
t-Student distribution table:

    1 − F_X(t_a) = 1 − P(X ≤ t_a) = P(X > t_a) = a

    table values = t_a
127
Review of probability theory and related basic concepts

Confidence Intervals
Using the values of T = t_{n−1,1−α/2} and T = t_{n−1,α/2}, we can say that

    P(t_{n−1,α/2} ≤ T ≤ t_{n−1,1−α/2}) = P(t_{n−1,α/2} ≤ (X̄ − μ)/(s/√n) ≤ t_{n−1,1−α/2}) = 1 − α

We can rewrite this expression as:

    P(X̄ − t_{n−1,1−α/2}·s/√n ≤ μ ≤ X̄ − t_{n−1,α/2}·s/√n) = 1 − α

or

    P(X̄ − t_{n−1,1−α/2}·s/√n ≤ μ ≤ X̄ + t_{n−1,1−α/2}·s/√n) = 1 − α

to get the confidence interval

    X̄ − t_{n−1,1−α/2}·s/√n ≤ μ ≤ X̄ + t_{n−1,1−α/2}·s/√n 128
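As an illustrative sketch (not part of the original slides), this t-based interval can be computed with a few lines of Python/scipy; the function name is arbitrary and the data reuses the concrete compressive strength sample introduced later in these notes.

import numpy as np
from scipy import stats

def t_confidence_interval(x, confidence=0.95):
    # Two-sided confidence interval for the mean when sigma is unknown
    x = np.asarray(x, dtype=float)
    n = x.size
    x_bar = x.mean()
    s = x.std(ddof=1)                                         # sample standard deviation
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # t_{n-1,1-alpha/2}
    margin = t_crit * s / np.sqrt(n)                          # margin of error
    return x_bar - margin, x_bar + margin

strengths = [24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
             33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7]
print(t_confidence_interval(strengths))   # roughly (30.7, 34.6)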
Review of probability theory and related basic concepts

Confidence Intervals
A correction for the case of finite populations

Normally the size N of the population is assumed to be ∞:

    X̄ − t_{n−1,1−α/2}·s/√n ≤ μ ≤ X̄ + t_{n−1,1−α/2}·s/√n

When the size N of the population is assumed to be a finite number:

    X̄ − t_{n−1,1−α/2}·(s/√n)·√((N−n)/(N−1)) ≤ μ ≤ X̄ + t_{n−1,1−α/2}·(s/√n)·√((N−n)/(N−1))

129
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the mean within a certain
margin of error and with a certain confidence level
Starting from

    X̄ − t_{n−1,1−α/2}·s/√n ≤ μ ≤ X̄ + t_{n−1,1−α/2}·s/√n

Dividing by X̄ (with cov = s/X̄, the sample coefficient of variation)

    1 − t_{n−1,1−α/2}·cov/√n ≤ μ/X̄ ≤ 1 + t_{n−1,1−α/2}·cov/√n

Setting a certain margin of error ME (e.g. ±10%)

    1 − t_{n−1,1−α/2}·cov/√n ≤ X̄·(1 ± ME)/X̄ ≤ 1 + t_{n−1,1−α/2}·cov/√n 130
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the mean within a certain
margin of error and with a certain confidence level
Separating into 2 parts

    1 − t_{n−1,1−α/2}·cov/√n ≤ 1 − ME            1 + ME ≤ 1 + t_{n−1,1−α/2}·cov/√n

    t_{n−1,1−α/2}·cov/√n ≥ ME                    ME ≤ t_{n−1,1−α/2}·cov/√n

leads to only one expression

    t_{n−1,1−α/2}·cov/√n = ME
131
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the mean within a certain
margin of error and with a certain confidence level
Replacing the value of the t distribution (which depends on the sample size)
by its standard normal approximation

    t_{n−1,1−α/2}·cov/√n = ME   ⇒   z_{1−α/2}·cov/√n ≈ ME

which leads to

    n = (z_{1−α/2}·cov / ME)²

Setting ME, defining z_{1−α/2} and "guessing" cov leads to the value of n
132
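A minimal Python sketch of this sample-size estimate (my own illustration; the 10% margin of error and the guessed cov of 0.15 are assumed values, not values from the slides):

import math
from scipy import stats

def sample_size_for_mean(cov, margin_of_error, confidence=0.95):
    # n such that z_{1-alpha/2} * cov / sqrt(n) = ME (rounded up)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * cov / margin_of_error) ** 2)

print(sample_size_for_mean(cov=0.15, margin_of_error=0.10))   # 9 for these assumed inputs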
Review of probability theory and related basic concepts

Confidence Intervals
It is possible to construct confidence intervals for other parameters

For the confidence interval for the variance σ² of a normal distribution it is
possible to define the following relation

    P( (n−1)·s²/χ²_{n−1,1−α/2} ≤ σ² ≤ (n−1)·s²/χ²_{n−1,α/2} ) = 1 − α

and the following confidence interval (which is asymmetric!!)

    (n−1)·s²/χ²_{n−1,1−α/2} ≤ σ² ≤ (n−1)·s²/χ²_{n−1,α/2}
This interval can also provide good estimates for the variance of other distributions 133
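A short Python sketch of this chi-square-based interval (an illustration added here, not part of the slides), again using the concrete strength data from the later example:

import numpy as np
from scipy import stats

def variance_confidence_interval(x, confidence=0.95):
    # Two-sided CI for the variance of a normal sample
    x = np.asarray(x, dtype=float)
    n = x.size
    s2 = x.var(ddof=1)
    alpha = 1 - confidence
    chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)   # larger quantile
    chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)       # smaller quantile
    return (n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo

strengths = [24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
             33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7]
print(variance_confidence_interval(strengths))   # roughly (10, 37) MPa²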
Review of probability theory and related basic concepts

Confidence Intervals
A correction for the case of finite populations¹

    [ (n−1)/(N−1) + (N−n)/(N−1) × 1/F_{1−α/2,n−1,N−n} ]·s²  ≤  σ²  ≤  [ (n−1)/(N−1) + (N−n)/(N−1) × 1/F_{α/2,n−1,N−n} ]·s²

where F_{1−α/2,n−1,N−n} is the value with a 1 − α/2 probability of occurrence of a variable
that follows an F distribution with parameters n−1 and N−n.

This interval can also provide good estimates for the variance of other distributions

¹ O'Neill, B. (2014) Some useful moment results in sampling problems. The American Statistician, 68(4), 282-296. 134
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the variance within a certain
margin of error and with a certain confidence level
Starting from

    (n−1)·s²/χ²_{n−1,1−α/2} ≤ σ² ≤ (n−1)·s²/χ²_{n−1,α/2}

Dividing by s²

    (n−1)/χ²_{n−1,1−α/2} ≤ σ²/s² ≤ (n−1)/χ²_{n−1,α/2}

Setting a certain margin of error ME (e.g. 1.10 or 0.90)

    (n−1)/χ²_{n−1,1−α/2} ≤ s²·ME/s² ≤ (n−1)/χ²_{n−1,α/2} 135
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the variance within a certain
margin of error and with a certain confidence level
That simplifies to

    (n−1)/χ²_{n−1,1−α/2} ≤ ME ≤ (n−1)/χ²_{n−1,α/2}

Since this interval is asymmetric, both sides need to be analysed (a short
numerical sketch of this search is given below):

- Select a confidence level
- Define multiple values of n and compute χ²_{n−1,α/2} and χ²_{n−1,1−α/2}
- Compute the bounds of the interval until they match the desired value of ME

136
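A possible numerical implementation of this search in Python (an added sketch; the 90% confidence level and the target ME of 1.5 for the upper bound are assumed example values):

from scipy import stats

def variance_ratio_bounds(n, confidence=0.90):
    # Bounds of sigma^2 / s^2 for a sample of size n (chi-square based)
    alpha = 1 - confidence
    lower = (n - 1) / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = (n - 1) / stats.chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper

# Scan sample sizes until the upper bound drops below the target ME
for n in range(5, 201):
    lower, upper = variance_ratio_bounds(n)
    if upper <= 1.5:
        print(n, round(lower, 3), round(upper, 3))
        break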
Review of probability theory and related basic concepts

Confidence Intervals
Assessing the sample size needed to estimate the variance within a certain
margin of error and with a certain confidence level
Considering a significance level of 10% (i.e. a 90% confidence level):

[Figure: upper and lower bounds of ME as a function of the sample size (0 to 200); the interval is widest for small samples and narrows towards 1 as the sample size increases] 137
Building Probabilistic Models (distribution
fitting and parameter estimation)
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)


Let’s consider a set of (variable) data that was generated by a certain process
and let’s assume that a RV following a certain (unknown) statistical distribution
is able to represent the variability of this data.

Based on the available data, we want to fully characterize the statistical


distribution of this RV

    INPUT: X = {X₁, X₂, ..., Xₙ}   →   [MAGICAL PROCESS]   →   PERFECT OUTPUT: f(x)
139
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Real (continuous) data will hardly ever follow an exact theoretical statistical
model

Usually, available samples of real data are not entirely representative of the
true population

    INCOMPLETE INPUT: X = {X₁, X₂, ..., Xₙ}   →   [MAGICAL PROCESS]   →   IMPERFECT/APPROXIMATED OUTPUT: f(x)
140
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Let’s focus on the MAGICAL PROCESS

First, analyse the shape of the data by visualising the data


• Use histograms to check the symmetry/asymmetry of the data

141
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Let’s focus on the MAGICAL PROCESS

First, analyse the shape of the data by visualising the data


• Use boxplots to check for outliers in the data (and to see if the use of
robust statistics is needed)

142
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Second, analyse the distribution type using a probability plot

A probability plot is not a p-p plot or a q-q plot!!

For a probability plot, we don’t need to estimate parameters for the


distribution type we are testing (unlike for the p-p plot or q-q plot)

The probability plot assumes the data is from a location-scale family of
distributions formed by a translation and rescaling of a standard
distribution in that family. So, if X is an RV with a distribution of that family,
the RV Z will also have a distribution of that family if it can be written as:

    Z = (X − m)/s

where m is a location parameter (not necessarily the mean) and s is a scale
parameter (not necessarily the standard deviation). 143
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Based on this assumption, it can also be seen that

    F(x) = G((x − m)/s) = G(z)

where F is the cdf of the data and G is the cdf of the standardized variable.

By inverting this relation we get

    z = G⁻¹[F(x)] = (x − m)/s = x/s − m/s

We see there is a linear relation between x and G⁻¹[F(x)] (note that G⁻¹[F(x)]
are the percentile values of x) 144
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Since the true cdf F(x) is not known, we have to define an empirical cdf Fn(x)
based on the number of points in the data.

A simple empirical cdf Fn(x) (ecdf) can be defined by:

    Fn(x(i)) = i/n

where x(i) are the ordered values of X (usually called order statistics or ranks).

Instead of this simple expression, the ecdf is usually defined by an estimate
of the median of the order statistics (for each value x(i), we are trying to estimate
a value for its probability that has a 50% chance of being the true percentile,
given that we have a small sample size).

Several empirical expressions have been proposed for this… 145


Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

A few examples (there are others)…

    Benard:              Fn(x(i)) = (i − 0.3) / (n + 0.4)
    Filliben:            Fn(x(i)) = (i − 0.3175) / (n + 0.365)
    Hosking and Wallis:  Fn(x(i)) = (i − 0.35) / n
    Blom:                Fn(x(i)) = (i − 0.375) / (n + 0.25)
    Hazen:               Fn(x(i)) = (i − 0.5) / n
    Weibull:             Fn(x(i)) = i / (n + 1)
    Gringorten:          Fn(x(i)) = (i − 0.44) / (n + 0.12)
    Cunnane:             Fn(x(i)) = (i − 0.4) / (n + 0.2)
146
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

But which one to choose?

If the goodness of the linear relation given by the probability plot


depends on the selected proposal for the ecdf, this means we are
probably choosing the wrong family of distributions

The Benard proposal is one of the most popular choices

    Fn(x(i)) = (i − 0.3) / (n + 0.4)

147
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

And which family of distributions to choose?

Based on the visual assessment of symmetry/asymmetry, start by the


most common families. If the data is symmetric, start by the normal
distribution. If the data is asymmetric, start by the lognormal,
exponential and other extreme type families of distributions

If the data comes from a phenomenon that has been previously


analysed, check the literature for suggested families of distributions

The next issue is how to obtain G-1[F(x(i))] since there is no analytical


expression for F(x(i)), just a pointwise representation

148
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

For the case of the normal distribution family, it can be seen that when

    F(x) = G((x − m)/s) = G(z)

z follows a standard normal distribution N(0,1) and it is possible to obtain
numerical values for its inverse

    Fn(x(i)) = (i − 0.3)/(n + 0.4)   ⇒   Φ⁻¹[Fn(x(i))] = z(i)

The probability plot is then a plot of x(i) versus z(i)

149
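A minimal Python sketch of this construction (added for illustration; it uses the Benard plotting position and, as example data, the concrete strength sample from the later slides):

import numpy as np
from scipy import stats

def normal_probability_plot_points(x):
    # Return the pairs (x_(i), z_(i)) of a normal probability plot
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = x_sorted.size
    i = np.arange(1, n + 1)
    Fn = (i - 0.3) / (n + 0.4)        # empirical cdf (Benard plotting position)
    z = stats.norm.ppf(Fn)            # standard normal percentiles
    return x_sorted, z

strengths = [24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
             33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7]
x_i, z_i = normal_probability_plot_points(strengths)

# If the points fall close to a straight line, the normal family is plausible;
# since z = x/s - m/s, the fitted slope estimates 1/s and the intercept estimates -m/s.
slope, intercept = np.polyfit(x_i, z_i, 1)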
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

[Figure: two normal probability plots of z(i) versus x(i); the left one is nearly a straight line (close to normal), the right one is clearly curved (not normal at all)]
150
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Other information we get from a probability plot for the normal distribution:

[Figure: two normal probability plots of z(i) versus x(i). Left: data skewed to the right (there is always data below one potential straight line). Right: data skewed to the left (there is always data above one potential straight line)]
151
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Other information we get from a probability plot for the normal distribution:

[Figure: two normal probability plots of z(i) versus x(i). Left: data is symmetric and heavy-tailed (or fat-tailed). Right: data is symmetric and light-tailed (or thin-tailed)]
152
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

For the case of the lognormal distribution family, given the relation between
the normal and the lognormal distributions, we just have to set

y = ln ( x )
and do the plot for ln(x) that now follows a normal distribution

The probability plot is then a plot of ln(x(i)) versus z(i)

153
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

For the case of the Weibull distribution family there is an alternative and
even simpler process because the cdf has an analytical expression

    F(x) = 1 − exp( −((ε − x)/(ε − u))ᵏ ), with x ≤ ε

Assuming ε = 0:

    F(x) = 1 − exp( −(x/u)ᵏ )

From which we can get

    1 − F = exp( −(x/u)ᵏ )   ⇔   ln(−ln(1 − F)) = k·ln(x) − k·ln(u)

To get the probability plot, we just plot ln(x(i)) versus ln(−ln(1 − F(x(i))))

In this case we can even get the distribution parameters by fitting the
equation of a straight line 154
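A hedged Python sketch of that straight-line fit (my own illustration; the synthetic Weibull sample with shape 2 and scale 10 is only there to exercise the function):

import numpy as np

def weibull_plot_fit(x):
    # Fit k and u from the linearized cdf: ln(-ln(1 - F)) = k*ln(x) - k*ln(u)
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = x_sorted.size
    i = np.arange(1, n + 1)
    Fn = (i - 0.3) / (n + 0.4)                   # Benard plotting position
    y = np.log(-np.log(1.0 - Fn))
    k, c = np.polyfit(np.log(x_sorted), y, 1)    # slope = k, intercept c = -k*ln(u)
    u = np.exp(-c / k)
    return k, u

rng = np.random.default_rng(1)
data = 10.0 * rng.weibull(2.0, size=200)         # synthetic sample, shape 2 and scale 10
print(weibull_plot_fit(data))                    # should be roughly (2, 10)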
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

[Figure: two Weibull probability plots of ln(−ln(1 − Fn)) versus ln(x(i)); the left one is nearly a straight line (close to Weibull), the right one is clearly curved (not Weibull at all)]
155
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Third, analyse the fitting of the distribution using a q-q plot or a p-p plot
A p-p plot compares the empirical cumulative distribution function of the
data with a specified theoretical cumulative distribution function:

• fit the parameters of the selected distribution and determine F(x(i))
• select the expression for the empirical cdf, e.g. Fn(x(i)) = (i − 0.3)/(n + 0.4)
• plot Fn(x(i)) versus F(x(i))
156
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Third, analyse the fitting of the distribution using a q-q plot or a p-p plot
A q-q plot compares the quantiles of the empirical data with the quantiles of
a theoretical distribution (a short sketch of both constructions follows below):

• select the expression for the empirical cdf, e.g. y = Fn(x(i)) = (i − 0.3)/(n + 0.4)
• fit the parameters of the selected distribution and determine F⁻¹(y)
• plot x(i) versus F⁻¹(y)
157
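A possible Python sketch of both constructions (added here as an illustration; the choice of the normal family and the Benard plotting position are assumptions of the example):

import numpy as np
from scipy import stats

def pp_qq_points(x, dist=stats.norm):
    # Build p-p and q-q plot coordinates against a fitted distribution
    x_sorted = np.sort(np.asarray(x, dtype=float))
    n = x_sorted.size
    Fn = (np.arange(1, n + 1) - 0.3) / (n + 0.4)     # empirical cdf (Benard)
    params = dist.fit(x_sorted)                      # fitted distribution parameters
    pp = (Fn, dist.cdf(x_sorted, *params))           # p-p: Fn(x_(i)) versus F(x_(i))
    qq = (dist.ppf(Fn, *params), x_sorted)           # q-q: F^-1(Fn) versus x_(i)
    return pp, qq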
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

When to use one or the other?

A p-p plot tends to magnify deviations between the data and the
selected theoretical distribution in the middle range of the
distribution.
A q-q plot tends to magnify deviations between the data and the
selected theoretical distribution in the tail range of the distribution.

The more linear the plot looks, the better the fit between the data
and the theoretical distribution

158
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

When to use one or the other?

[Figure: left, a p-p plot for a normal distribution trying to fit Weibull data; right, a q-q plot for a normal distribution trying to fit the same Weibull data]
159
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

What to do when nothing seems to fit the data??

[Figure: left, q-q plot of a lognormal fit to the original data; right, q-q plot of a lognormal fit to the log of the data (theoretical quantiles versus quantiles of the empirical data)]
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

But how to determine the parameters of the selected distribution?


• Method of Moments
• Maximum Likelihood Method
• Bayesian analysis

161
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Method of Moments
After selecting the distribution type, the number of parameters that need
to be determined is known and it is assumed that the available data is
sufficient to estimate their values

The method of moments defines the distribution parameters by


assuming that the sample moments (i.e. from the data) and the
theoretical moments (i.e. from the selected distribution) are identical

    m_j = (1/n)·Σᵢ (x̂ᵢ − c)ʲ  (sample moments)          λ_j = ∫_{−∞}^{+∞} (x − c)ʲ·f_X(x) dx  (theoretical moments)
162
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Method of Moments
If we need to estimate n parameters, we then need n equations:

    m_j = λ_j , with j = 1, ..., n

    (1/n)·Σᵢ (x̂ᵢ − c)ʲ = ∫_{−∞}^{+∞} (x − c)ʲ·f_X(x) dx , with j = 1, ..., n

The method is usually applied considering raw moments, so c = 0

163
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Method of Moments Example: parameters of the distribution of


concrete compressive strength
Available data (MPa):
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x 24.4 27.6 27.8 27.9 28.5 30.1 30.3 31.7 32.2 32.8 33.3 33.5 34.1 34.6 35.8 35.9 36.8 37.1 39.2 39.7

It is assumed that concrete compressive strength follows a normal


distribution. Since this distribution has 2 parameters, we need 2 equations.

    m₁ = (1/n)·Σᵢ x̂ᵢ                λ₁ = ∫_{−∞}^{+∞} x·f_X(x) dx

    m₂ = (1/n)·Σᵢ x̂ᵢ²               λ₂ = ∫_{−∞}^{+∞} x²·f_X(x) dx
164
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Method of Moments Example: parameters of the distribution of


concrete compressive strength

The sample moments are then

    m₁ = (1/20)·Σᵢ x̂ᵢ = 32.67          m₂ = (1/20)·Σᵢ x̂ᵢ² = 1083.36

The theoretical moments are established as functions of the parameters

    λ₁ = ∫_{−∞}^{+∞} x·(1/(σ√(2π)))·exp(−½((x−μ)/σ)²) dx = μ

    λ₂ = ∫_{−∞}^{+∞} x²·(1/(σ√(2π)))·exp(−½((x−μ)/σ)²) dx = μ² + σ²
165
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Method of Moments Example: parameters of the distribution of


concrete compressive strength

By formulating the following function

    g(μ,σ) = (λ₁(μ,σ) − m₁)² + (λ₂(μ,σ) − m₂)²

    g(μ,σ) = (μ − m₁)² + (μ² + σ² − m₂)²

the parameters can be obtained numerically by finding the solution of μ
and σ that minimizes the function g (e.g. using the least-squares method)

The solution is: μ = 32.67 and σ = 4.04 166
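A minimal Python sketch of this numerical minimization (an added illustration; any general-purpose optimizer would do):

import numpy as np
from scipy.optimize import minimize

data = np.array([24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
                 33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7])
m1 = data.mean()              # first sample raw moment
m2 = (data ** 2).mean()       # second sample raw moment

def g(params):
    mu, sigma = params
    return (mu - m1) ** 2 + (mu ** 2 + sigma ** 2 - m2) ** 2

result = minimize(g, x0=[30.0, 5.0], method="Nelder-Mead")
print(result.x)               # approximately [32.67, 4.04]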
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


What is the likelihood?
The likelihood can be understood as a measure of the extent to which a
sample provides support for particular values of a parameter in a parametric
model, i.e. the chance (probability) of occurrence of the observed data
conditional on the model.

If f(x₁, x₂, …, xₙ, θ) is the joint probability distribution of n independent RVs
X₁, X₂, …, Xₙ that follow the same distribution (that has parameter θ) and
have sample values x₁, x₂, …, xₙ, the likelihood function of the sample is

    L(θ | x₁, x₂, ..., xₙ) = L(θ)
167
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


Let’s assume that a given data follows a normal distribution
2
1  x−µ 
1 −  
fX ( x) = e 2 σ 

σ 2π
the likelihood of one of the elements in the data x1 is
2
1  x −µ 
1 −  1 
L ( µ , σ | x1 ) = e 2 σ 

σ 2π
the likelihood of two of the elements in the data x1 and x2 is
2 2
1  x −µ  1  x −µ 
1 −  1  1 −  2 
L ( µ , σ | x1 , x2 )
= e 2 σ 
× e 2 σ 

σ 2π σ 2π
168
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


By considering the n elements of the data, their likelihood is:

    L(θ | x̂) = ∏ᵢ f_X(x̂ᵢ) = ∏ᵢ (1/(σ√(2π)))·exp(−½((x̂ᵢ−μ)/σ)²)

where θ = {μ, σ} is the vector of the distribution parameters and
x̂ = {x̂₁, x̂₂, ..., x̂ₙ} is the vector of the data.

169
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


The maximum likelihood method defines the value of the distribution
parameters by maximizing the likelihood of the observed data

    L(θ | x̂) = ∏ᵢ f_X(x̂ᵢ) = ∏ᵢ (1/(σ√(2π)))·exp(−½((x̂ᵢ−μ)/σ)²)

Parameters θ are estimated as those maximizing the likelihood
function or, equivalently, those minimizing the negative likelihood function:

    min( −L(θ | x̂) )
170
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


Due to mathematical convenience, there are advantages in considering
the log-likelihood function l instead

    l(θ | x̂) = log[ L(θ | x̂) ] = log[ ∏ᵢ f_X(x̂ᵢ) ]

    l(θ | x̂) = Σᵢ log[ f_X(x̂ᵢ) ]

This simplifies the problem, since now we have to maximize a sum
of terms rather than a long product of terms
171
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method


If parameters θ are estimated as those minimizing the negative log-likelihood
function

    min( −l(θ | x̂) )

it can be shown that the parameter estimates are RVs that follow a
joint normal distribution with

    Mean values:        μ_Θ = (θ₁*, θ₂*, ..., θₙ*)

    Covariance matrix:  C_ΘΘ = H⁻¹,  where  H_ij = −∂²l(θ | x̂)/(∂θᵢ∂θⱼ) evaluated at θ = θ*
172
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method Example: parameters of the distribution of
concrete compressive strength
The log-likelihood function can be written as

    l(θ | x̂) = Σᵢ ln[ (1/(θ₁√(2π)))·exp(−½((x̂ᵢ−θ₂)/θ₁)²) ]
             = n·ln(1/(θ₁√(2π))) − ½·Σᵢ ((x̂ᵢ−θ₂)/θ₁)²

The solution can be obtained by setting the partial derivatives to zero:

    ∂l/∂θ₁ = −n/θ₁ + (1/θ₁³)·Σᵢ (x̂ᵢ−θ₂)² = 0            ∂l/∂θ₂ = (1/θ₁²)·Σᵢ (x̂ᵢ−θ₂) = 0
173
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method Example: parameters of the distribution of
concrete compressive strength
which then leads to:

    θ₁ = √( Σᵢ (x̂ᵢ − θ₂)² / n )          θ₂ = (1/n)·Σᵢ x̂ᵢ

For the case of the normal distribution, the sample mean and the
sample standard deviation are the Maximum Likelihood estimators!
Finally, we obtain again:

    θ₁ = 4.04        θ₂ = 32.67 174
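The same estimates can be reproduced by minimizing the negative log-likelihood numerically; this Python sketch is an added illustration, not the method as written on the slides:

import numpy as np
from scipy import stats, optimize

data = np.array([24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
                 33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7])

def neg_log_likelihood(theta, x):
    sigma, mu = theta                          # theta_1 = sigma, theta_2 = mu
    if sigma <= 0:
        return np.inf
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_likelihood, x0=[5.0, 30.0], args=(data,),
                        method="Nelder-Mead")
print(res.x)                                   # approximately [4.04, 32.67]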
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method Example: parameters of the distribution of
concrete compressive strength
We can also obtain the covariance matrix of these estimators:

    H = [ −n/θ₁² + (3/θ₁⁴)·Σᵢ(x̂ᵢ−θ₂)²      (2/θ₁³)·Σᵢ(x̂ᵢ−θ₂) ]
        [ (2/θ₁³)·Σᵢ(x̂ᵢ−θ₂)                n/θ₁²              ]

    C_ΘΘ = H⁻¹ = [ 0.836    0     ]      0.836 = variance of the standard deviation
                 [ 0        0.165 ]      0.165 = variance of the mean value
175
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method

How good are these estimates of the parameters?

We have 2 measures to assess the goodness of these parameter estimates:

- BIAS: how close is the estimate to the true value?


- VARIANCE: how much does it change for different datasets?

176
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method

How good are these estimates of the parameters?

The bias-variance tradeoff: in most cases, we can only decrease one of


them at the expense of the other

177
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method

How good are these estimates of the parameters?

For the case of the mean and standard deviation, it can be proven that the
mean estimate is not biased, while the estimate of the standard deviation
is biased. For data samples of larger size, the bias becomes very small.
However, for samples with “common” sizes, a correction to the estimator is
used to correct the bias:
    s = √( Σᵢ (x̂ᵢ − x̄)² / n )   (biased estimator)          s = √( Σᵢ (x̂ᵢ − x̄)² / (n − 1) )   (bias-corrected estimator)
178
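In numpy this is simply the choice of the "delta degrees of freedom" (an added note, using the concrete strength data as an example):

import numpy as np

x = np.array([24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
              33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7])
print(x.std(ddof=0))   # divides by n     -> biased / ML estimate (about 4.04)
print(x.std(ddof=1))   # divides by n - 1 -> bias-corrected sample estimate (about 4.15)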
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method

How good are these estimates of the parameters?

For skewness and kurtosis estimators, similar "small-sample" correction
factors can be defined:

    β₁ = [ n / ((n−1)(n−2)) ]·Σᵢ ((xᵢ − x̄)/s)³

    β₂ = [ n(n+1) / ((n−1)(n−2)(n−3)) ]·Σᵢ ((xᵢ − x̄)/s)⁴
179
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Maximum Likelihood Method

How good are these estimates of the parameters?

For skewness and kurtosis estimators, similar "small-sample" correction
factors can be defined:

    β₁ = [ n / ((n−1)(n−2)) ]·Σᵢ ((xᵢ − x̄)/s)³

    β₂ = [ n(n+1) / ((n−1)(n−2)(n−3)) ]·Σᵢ ((xᵢ − x̄)/s)⁴ − 3(n−1)² / ((n−2)(n−3))
180
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation
Bayesian Estimation assumes that the parameters θ are random variables that
have a known prior distribution f(θ). This distribution is typically very
broad or vague to reflect the fact that we know little about the true value.
Once we obtain data X, we use the Bayes theorem to find the posterior
distribution f*(θ). Ideally, we want this data to reduce our uncertainty
about the parameters.

181
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation
By recalling the following relation from the Bayes Theorem:

    P(Aᵢ | B) = P(Aᵢ)·P(B | Aᵢ) / Σᵢ [ P(B | Aᵢ)·P(Aᵢ) ]

we can obtain the following equation for a continuous distribution:

    f*(θ) = f(θ)·P(X | θ) / ∫_{−∞}^{+∞} f(θ)·P(X | θ) dθ

where f(θ) is the prior distribution, P(X | θ) is the conditional probability or
likelihood of observing X assuming that the parameters are θ, f*(θ) is the
posterior distribution (after getting the data in X), and the integral in the
denominator is the normalizing constant.
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation
Considering that:

    k = [ ∫_{−∞}^{+∞} f(θ)·P(X | θ) dθ ]⁻¹        and        L(θ | X) = P(X | θ)

we get:

    f*(θ) = k·f(θ)·L(θ | X)        (distribution of parameter θ)

But we need an estimate of parameter θ, usually the expected value:

    θ* = ∫_{−∞}^{+∞} θ·f*(θ) dθ
183
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: defective reinforced concrete piles

Consider that the reinforced concrete piles of a building foundation could be


defective due to poor construction quality. We want to know the proportion
p of defective piles in a given project that may have hundreds of piles.
We assume that p is a continuous RV and we assume that there is no
adequate prior information about p. Therefore, the prior distribution of p
will be a uniform distribution (also called a diffuse prior):

    f(p) = 1, for 0 ≤ p ≤ 1
On the basis of the inspection of one pile, revealing that it is defective,
the likelihood is the probability of the event X = one pile selected for
inspection is defective, which is p.
184
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation
Therefore

    f*(p) = k·f(p)·L(p | X) = k·1.0·p, for 0 ≤ p ≤ 1

and the normalizing constant k is

    k = [ ∫₀¹ p dp ]⁻¹ = 2

The posterior distribution of p is then:

    f*(p) = 2p, for 0 ≤ p ≤ 1

The point estimate of p is then:

    p* = ∫_{−∞}^{+∞} θ·f*(θ) dθ = ∫₀¹ p·2p dp = 0.667 185
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: distribution of wind speed

Since parameters s and u are RVs, their (best) estimates (i.e. their average
values) may change when new data is obtained. Assume the distribution of
parameter u is the following exponential distribution with a mean value of
6 m/s:

    f(u) = (1/6)·e^(−u/6)

Considering that 1 new value of wind speed data is obtained (x̂ = 18 m/s),
what is the updated value of parameter u?
186
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: distribution of wind speed

Prior distribution of u:

    f(u) = (1/6)·e^(−u/6)

Likelihood function of the wind speed given the current value of u:

    L(u | x) = (2/u)·(x/u)^(2−1)·e^(−(x/u)²) = (2x/u²)·e^(−(x/u)²)

Likelihood function for the new wind speed data:

    L(u | x̂) = L(u | 18) = (2×18/u²)·e^(−(18/u)²) = (36/u²)·e^(−324/u²) 187
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: distribution of wind speed

Considering that:

    k = [ ∫_{−∞}^{+∞} f(u)·L(u | x̂) du ]⁻¹

we get:

    f*(u) = k·f(u)·L(u | x̂) = k·(1/6)·e^(−u/6)·(36/u²)·e^(−324/u²) = k·(6/u²)·e^(−u/6 − 324/u²)

and the normalizing constant k is:

    k = [ ∫₀^∞ (6/u²)·e^(−u/6 − 324/u²) du ]⁻¹ = 151.987 188
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: distribution of wind speed

We then get the posterior distribution of parameter u:

    f*(u) = (911.922/u²)·e^(−u/6 − 324/u²)

The new estimate of parameter u is then:

    u* = ∫_{−∞}^{+∞} u·f*(u) du = ∫₀^∞ (911.922/u)·e^(−u/6 − 324/u²) du = 15.67

189
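The same update can be checked numerically by quadrature; this Python sketch is an added illustration of the computation above (the small positive lower integration limit only avoids the division by zero at u = 0):

import numpy as np
from scipy import integrate

prior = lambda u: np.exp(-u / 6.0) / 6.0                              # exponential prior, mean 6 m/s
likelihood = lambda u: (36.0 / u**2) * np.exp(-(18.0 / u) ** 2)       # one observation, x = 18 m/s
unnorm = lambda u: prior(u) * likelihood(u)                           # unnormalized posterior

k = 1.0 / integrate.quad(unnorm, 1e-9, np.inf)[0]                     # normalizing constant, about 152
u_star = integrate.quad(lambda u: u * k * unnorm(u), 1e-9, np.inf)[0] # posterior mean, about 15.7
print(k, u_star)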
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation Example: distribution of wind speed

[Figure: prior distribution, likelihood and posterior distribution of u plotted for u between 0 and 60 m/s; the posterior is shifted towards the observed value relative to the prior]
190
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Bayesian Estimation
There are advantages when we "know" the prior distribution of the
parameter we want to estimate and the likelihood function of the data
that is used to estimate the parameter: the posterior distribution may
already be known from existing theoretical results and is often of the same
family

    f*(θ) = k·f(θ)·L(θ | X)

The cases where there is a theoretical connection between the prior
distribution, the likelihood function and the posterior distribution, with the
posterior being of the same family as the prior distribution, are known as
conjugate distributions
191
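As an added sketch: the defective-piles example is itself a conjugate case, since a uniform prior on p is a Beta(1,1) distribution and a Bernoulli (defective / not defective) likelihood keeps the posterior in the Beta family:

from scipy import stats

a, b = 1, 1                           # uniform prior on p  ->  Beta(1, 1)
defective, inspected = 1, 1           # one pile inspected and found defective
posterior = stats.beta(a + defective, b + inspected - defective)   # Beta(2, 1), i.e. f*(p) = 2p
print(posterior.mean())               # 0.667, the point estimate obtained earlier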
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

192
https://en.wikipedia.org/wiki/Conjugate_prior
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

• Method of Moments
The simplest approach to obtain the parameters, but the estimates are usually
not the best (it is rarely used in practice).
• Maximum Likelihood Method
An approach that is a bit more complicated (used by most statistical analysis
software packages). We also obtain information about the distribution of the
parameters.
• Bayesian analysis
The most complex approach of the three. It leads directly to the distribution
of the parameters and any prior assumption made about their distribution
may be corrected by the posterior distribution.

193
Building Probabilistic Models (distribution
fitting and parameter estimation)
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test

When using these techniques, you need to know what you’re doing!!!!

195
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


A goodness-of-fit test is a statistical hypothesis test that is designed to assess
formally if the sample of data comes from a certain statistical distribution

Hypothesis testing is a class of statistical techniques designed to


extrapolate information from samples of data to make inferences
about populations for the purpose of decision making.

Hypothesis testing can be used to test if


• a certain RV follows some specific distribution
• a population parameter (e.g. the mean) has some specific value
• two population parameters are the same
196
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

For an overview of other types of statistical tests, see for example:

197
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


The basic components of a statistical hypothesis test are:
• Null hypothesis H0 – A statement regarding what we want to test
formulated as an equality (e.g. the RV X follows a normal distribution)
• Alternative hypothesis H1 or HA – A statement contradictory to the null
hypothesis
• Test statistic – A quantity that reflects the hypothesis we want to test and
that is computed using the sample data. Since the test statistic value
changes from sample to sample, it is a RV and has a sampling distribution
(that may be known or not)
• Rejection region – Values of the test statistic for which we reject the null
hypothesis in favour of the alternative hypothesis 198
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


In hypothesis testing, no matter the outcome of the test, WE ARE
NEVER ABLE TO PROVE THAT WE CAN ACCEPT THE NULL HYPOTHESIS!!!
We can only prove that we have to REJECT or that we FAIL TO REJECT
the null hypothesis!!!
So what’s the difference between ACCEPTING and FAILING TO REJECT
the null hypothesis?
In the first case, there are facts and arguments that lead to an
acceptance of the null hypothesis (we have enough evidence to accept).
In the second case, there are only facts and arguments stating that
rejecting is not possible (we don’t have enough evidence to reject, so
we must accept the null hypothesis)
199
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


So when to REJECT or when to FAIL TO REJECT?

Managing the Level of Significance, the Power of the test and the Type of Errors

The Type of Errors in hypothesis testing are:

• Type I Error – A Type I Error occurs when we reject a true null hypothesis
• Type II Error – A Type II error occurs when we fail to reject a false null
hypothesis

So what are the possible results of the test?


200
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test

                                          True Nature of the Hypothesis
                                          The null hypothesis is true           The null hypothesis is false
Result of    The test rejects             Type I Error (rejecting a true        correct decision
the test     the null hypothesis          null hypothesis) - α
             The test fails to reject     correct decision                      Type II Error (failure to reject a
             the null hypothesis                                                false null hypothesis) - β
201
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


• α is the Level of Significance which is the probability of having a Type I
Error. α is set by the person that performs the test (usually 0.05)
• β is the probability of having a Type II Error and it is related to the Power
of the test. The Power of the test is the probability of the test to reject
the null hypothesis when it is false, i.e. 1 - β
In practice, we want both α and β to be as small as possible!
α and β are not independent!! When α is reduced, β increases, but the
actual relation between α and β is unknown!! Only α can be set by the
person performing the test. The value of β is a property of the test
(which we usually don’t know and is actually more important than α!!).
For a given test, the only way to reduce β is by increasing the sample
size of the data under analysis 202
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


And what’s the test statistic and how to use it?
The test statistic is a parameter that is a function of the data and addresses a
specific feature of the data in order to reflect the hypothesis we want to test.

The value of the test statistic based on the sample is compared to a critical
value. The critical value is a specific value of the test statistic defining the
boundary of the rejection region above or below which (depending on the
test) we reject the null hypothesis.
The critical value depends on the statistical distribution of the test statistic
and on the selected level of significance

So, how does it all work? 203


Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


Let’s assume that a statistic δ is able to reflect a unique feature of a sample
of data that enables us to determine if the sample comes from a normal
distribution. Let’s also consider that δ exhibits positive low values (e.g. closer
to zero) when the data being tested comes from a true normal distribution.

Let’s assume also that the distribution of the statistic δ is known. This
distribution defines how likely is each value of the statistic δ.

δ
204
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


Given this distribution, the occurrence of higher values of the statistic δ is less
likely if we are testing a sample of data that comes from a normal distribution.

Let's consider that δ* is the value of δ with a probability of being exceeded
of 5%.

[Figure: density of δ with the upper tail beyond δ* shaded; P(δ > δ*) = 0.05 (the rejection region: the most unlikely values of δ)]
205
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


By considering that δ* is the critical value of δ, δcrit, we can conduct a
hypothesis test with a level of significance of 5%. Therefore, if a certain
sample of data X# has a value of the test statistic δ# > δcrit, we reject the null
hypothesis with a level of significance of 5% (i.e. with a 5% probability of
having a Type I Error). If δ# ≤ δcrit, we can’t reject the null hypothesis.


[Figure: density of δ with the region δ > δcrit shaded; P(δ > δcrit) = 0.05 (the rejection region: the most unlikely values of δ)]
206
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


Since we are rejecting the null hypothesis when the data presents a high
value of δ, this type of test is called a one-sided upper test.
If the rejection condition was “reject the null hypothesis when the data
sample presents a low value of δ”, the test would be a one-sided lower test.

[Figure: density of δ with the lower tail below δcrit shaded; P(δ < δcrit) = 0.05]
207
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


If the rejection condition was “reject the null hypothesis when the data
sample presents either a low value or a high value of δ”, the test would be a
two-sided test.

[Figure: density of δ with both tails shaded; P(δ < δcrit,low) = 0.025 and P(δ > δcrit,up) = 0.025]

208
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


Sometimes, the outcome of a test is not a "reject/failure to reject" decision
based on the critical value of the test statistic (e.g. when using statistical
analysis software).
Instead the result is a numerical value known as the p-value. The p-value
does not provide a “reject/failure to reject” answer but “helps” you
determine the significance of the test result.

But what does a p-value represent?

Technically, a p-value is the probability of a value of the test statistic at


least as extreme as the one obtained from the sample under analysis,
assuming that the null hypothesis is true.
Let’s see what this means really… 209
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


Let’s consider that a certain sample of data X# has a value of the test statistic
δ#. Let’s also consider that, according to the distribution of the test statistic,
the probability of having a test statistic with a value of δ# is 0.005 (0.5%).
Therefore, the p-value is 0.005.

How significant is this result? (We have a non significant result


fδ when the probability of δ# is so low so that we reject H0 by saying
that δ# is too unlikely to occur in a situation where H0 is true)

So, is a probability of
P(δ > δ#) = 0.005
0.5% low enough?

δ 210
δcrit δ#
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


A few pointers on how to interpret p-values:
The smaller the p-value, the more statistical evidence there is to support the
alternative hypothesis (i.e. to reject the null hypothesis):
• If the p-value is less than 1%, there is overwhelming evidence that
supports the alternative hypothesis.
• If the p-value is between 1% and 5%, there is strong evidence that
supports the alternative hypothesis.
• If the p-value is between 5% and 10%, there is weak evidence that
supports the alternative hypothesis.
• If the p-value exceeds 10%, there is no evidence that supports the
alternative hypothesis.
211
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


What are the available goodness-of-fit tests?
There are 2 very popular goodness-of-fit tests that can be found in most
statistical analysis software packages:
• The χ² test
• The Kolmogorov-Smirnov test

These tests are popular not because they are good!

They are popular because they can be found in most
statistical analysis software packages and because they can be used
to assess goodness-of-fit for any statistical distribution!

212
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


What are the available goodness-of-fit tests?
There are other goodness-of-fit tests that can be used to assess goodness-of-
fit for any statistical distribution. These are 2 others that are easy to
implement:

• The Cramér-von Mises test


• The Anderson-Darling test

213
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


The χ² test can be used to determine if a given sample of data follows a
certain distribution. The parameters of the target distribution must be
estimated first. The test involves the following steps:
Consider a sample of data X of size n. After computing estimates of the m
parameters of the target distribution fX:
• Divide the data X into M cells (e.g. like for constructing the bins of a
histogram). The value of M is selected between M1 and M2 given by:

    M1 = 4·(2n²/qα²)^(1/5)        M2 = 0.5 × M1

where qα is the (1 − α) percentile of the standard normal distribution and α is
the level of significance of the test.
214
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


• Count how many values of X are within each cell i → Oi
• For the target distribution, determine the expected fraction of data pi
that would be in each cell i (i.e. the probability of a given data value
being within cell i):

    pi = ∫ f_X(x) dx over cell i (from its lower bound Mᵢ₋₁ to its upper bound Mᵢ)

• Compute the χ² test statistic:

    χ² = Σᵢ (Oi − n·pi)² / (n·pi)        (Oi and n·pi should be > 5; reduce M if needed)
215
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


• Since larger values of the χ² test statistic mean the target distribution fits
poorly to the data, and since the χ² test statistic follows a χ² distribution
with (M − m − 1) degrees of freedom, the null hypothesis is rejected if the
test statistic exceeds the critical value defined by the (1 − α) percentile of
the χ² distribution with (M − m − 1) degrees of freedom

    If χ² > χ²_{M−m−1,1−α}  ⇒  reject H0
    If χ² ≤ χ²_{M−m−1,1−α}  ⇒  do not reject H0

216
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Severe thunderstorms have been recorded at a given station over a period of 66
years. During this period, the frequencies of severe thunderstorms observed are
as follows:
- 20 years with zero storms
- 23 years with one storm
- 15 years with two storms
- 6 years with three storms
- 2 years with four storms

The histogram of the annual number of thunderstorms recorded is:

[Figure: histogram of the probability of the annual number of occurrences (0 to 4 storms per year)]
217
Review of probability theory and related basic concepts

Example of application of the χ2 test:


We want to fit a Poisson distribution to the yearly occurrence of thunderstorms
and test the goodness-of-fit of that distribution to the data.

The Poisson distribution P(ν) is a discrete probability distribution that gives the probability of k
events occurring in a fixed interval of time and/or space if these events occur with a known
mean rate of occurrence ν and independently of the time since the last event.

    f_P(k) = (νt)ᵏ·e^(−νt)/k!        with t = 1 year:    f_P(k) = νᵏ·e^(−ν)/k!

    Mean: μ_K = νt                   with t = 1 year:    μ_K = ν
218
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Case #1: fit the Poisson distribution to the data using the method of moments
Since this distribution has 1 parameter, we need 1 equation:

    m₁ = (1/n)·Σᵢ x̂ᵢ          λ₁ = Σₓ x·f_X(x)

    m₁ = (1/n)·Σᵢ x̂ᵢ = (20×0 + 23×1 + 15×2 + 6×3 + 2×4) / 66 = 1.197

    m₁ = λ₁ = Σ_{x=0}^{∞} x·(νˣ/x!)·e^(−ν) = μ = ν = 1.197
219
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Case #2: fit the Poisson distribution to the data using the maximum likelihood
method

    L(θ | x̂) = ∏ᵢ f_X(x̂ᵢ)          l(θ | x̂) = log[ L(θ | x̂) ] = log[ ∏ᵢ f_X(x̂ᵢ) ]

    l(θ | x̂) = Σᵢ log[ f_X(x̂ᵢ) ]         min( −l(θ | x̂) )

For the Poisson distribution:

    f_X(x) = νˣ·e^(−ν)/x!          L(ν | x̂) = ∏ᵢ ν^(x̂ᵢ)·e^(−ν)/x̂ᵢ!   …
220
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Case #2: fit the Poisson distribution to the data using the maximum likelihood
method

    … min( −l(ν | x̂) )  ⇒  ν = (1/n)·Σᵢ x̂ᵢ = 1.197

[Figure: histogram of the data together with the fitted Poisson probabilities for 0 to 4 occurrences per year]

221
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Applying the χ² test to the fit:

    f_X(x) = (1.197)ˣ·e^(−1.197)/x!          χ² = Σᵢ (Oi − n·pi)² / (n·pi)

Nº of storms per year    Observed frequencies Oi    Theoretical frequencies n·pi    (Oi − n·pi)²    (Oi − n·pi)²/(n·pi)
0                        20
1                        23
2                        15
3                        6
4                        2

With M = 5, we have a cell with an observed frequency that is lower than 5, so we
need to reduce M: aggregate the cases with 3 and 4 storms per year. 222
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Applying the χ² test to the fit:

    f_X(x) = (1.197)ˣ·e^(−1.197)/x!          χ² = Σᵢ (Oi − n·pi)² / (n·pi)

Nº of storms per year    Observed frequencies Oi    Theoretical frequencies n·pi    (Oi − n·pi)²    (Oi − n·pi)²/(n·pi)
0                        20                         19.94                           0.0036          0.0002
1                        23                         23.87                           0.7569          0.0317
2                        15                         14.29                           0.5041          0.0353
≥3                       8 = 6+2                    7.90                            0.0100          0.0013
Total                    66                         66                                              0.0685
223
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Applying the χ² test to the fit (example of the calculation of one theoretical frequency):

    n × f_X(1) = 66 × (1.197)¹·e^(−1.197)/1! = 23.87

    f_X(x) = (1.197)ˣ·e^(−1.197)/x!          χ² = Σᵢ (Oi − n·pi)² / (n·pi)

Nº of storms per year    Observed frequencies Oi    Theoretical frequencies n·pi    (Oi − n·pi)²    (Oi − n·pi)²/(n·pi)
0                        20                         19.94                           0.0036          0.0002
1                        23                         23.87                           0.7569          0.0317
2                        15                         14.29                           0.5041          0.0353
≥3                       8 = 6+2                    7.90                            0.0100          0.0013
Total                    66                         66                                              0.0685
224
Review of probability theory and related basic concepts

Example of application of the χ2 test:


Applying the χ² test to the fit:

    χ² = Σᵢ (Oi − n·pi)² / (n·pi) = 0.0685

H0 is "the data follows a Poisson distribution"

    If χ² > χ²_{M−m−1,1−α}  ⇒  reject H0
    If χ² ≤ χ²_{M−m−1,1−α}  ⇒  do not reject H0

For a significance level of 5%, we get:

    χ²_{M−m−1,1−α} = χ²_{4−1−1,1−0.05} = χ²_{2,0.95}   ⇒   P(X ≤ x_p) = P(Χ²₂ ≤ χ²_{2,0.95}) = 0.95
225
Review of probability theory and related basic concepts

χ² distribution table:

    1 − F_X(x_p) = p = 1 − P(X ≤ x_p) = P(X > x_p)        table values = x_p

    χ²_{M−m−1,1−α} = χ²_{4−1−1,1−0.05} = χ²_{2,0.95}   ⇒   P(X ≤ x_p) = P(Χ²₂ ≤ χ²_{2,0.95}) = 0.95

    1 − P(Χ²₂ ≤ χ²_{2,0.95}) = P(Χ²₂ > χ²_{2,0.95}) = 0.05   ⇒   χ²_{2,0.95} = 5.991

    χ² ≤ χ²_{M−m−1,1−α}        0.0685 ≤ 5.991        do not reject H0 226
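The whole worked example can be reproduced with a few lines of Python (an added sketch, not part of the slides):

import numpy as np
from scipy import stats

observed = np.array([20, 23, 15, 8])             # 0, 1, 2 and >=3 storms per year (3 and 4 merged)
n = observed.sum()                               # 66 years
nu = 1.197                                       # fitted Poisson mean rate

p = stats.poisson.pmf([0, 1, 2], nu)
p = np.append(p, 1.0 - p.sum())                  # P(X >= 3) closes the last cell
expected = n * p

chi2_stat = np.sum((observed - expected) ** 2 / expected)    # about 0.0685
dof = len(observed) - 1 - 1                                   # M - m - 1 = 4 - 1 - 1 = 2
chi2_crit = stats.chi2.ppf(0.95, df=dof)                      # about 5.991
print(chi2_stat, chi2_crit, chi2_stat <= chi2_crit)           # do not reject H0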


Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


The Kolmogorov-Smirnov test can be used to determine if a given sample of
data follows a certain distribution. The target distribution must be
completely defined first and its parameters must not be assessed from the
sample of data being tested. The test involves the following steps:
Consider a sample of data X of size n. After defining the target distribution fX:
• Arrange the data in ascending order (i.e. define the order statistics x(i))
• Compute the distances D+ and D−:

    D− = max over 1 ≤ i ≤ n of [ F_X(x(i)) − (i − 1)/n ]        D+ = max over 1 ≤ i ≤ n of [ i/n − F_X(x(i)) ]

D− is the maximum vertical distance between the cdf of the target distribution F_X
and the empirical cdf F_nX when F_X > F_nX, and D+ is the maximum vertical distance
between them when F_X < F_nX.
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


• Compute the test statistic D:

    D = √n × max(D+, D−)

Large values of D mean the target distribution fits poorly to the data.
• It can be shown that, when n → ∞, D has the following distribution

    F_D(x) = 1 − 2·Σ_{i=1}^{∞} (−1)^(i−1)·e^(−2i²x²)

but for low sample sizes this asymptotic distribution is not adequate.


228
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


There are critical values of D (or for D/n0.5) widely available for various
significance levels, with asymptotic formulae for samples with size n > 30
and tabulated values for small samples.
• The null hypothesis is rejected if the test statistic exceeds the critical
value defined by the (1 - α) percentile value

229
Review of probability theory and related basic concepts

Building Probabilistic Models (distribution fitting and parameter estimation)

Fourth, analyse the fitting of the distribution using a goodness-of-fit test


The results of the Kolmogorov-Smirnov test are not valid if the distribution
parameters are determined from the sample of data being tested.
For a procedure to apply the test when the distribution parameters are
determined from the sample of data being tested see:
Capasso, M., Alessi, L., Barigozzi, M., Fagiolo, G. (2009) On approximating the distributions of goodness-of-fit
test statistics based on the empirical distribution function: The case of unknown parameters. Advances in
complex systems, 12(02), 157-167.

If the target distribution is the normal distribution, an alternative version


of this test called the Lilliefors test can be used.
This version of the test allows for the parameters of the distribution to be
determined using the sample of data being tested.
This test is the same as the Kolmogorov-Smirnov test. The difference is
that it uses different critical values of the statistic 230
Review of probability theory and related basic concepts

Example of application of the KS test:

Assume the data follows a normal distribution N(30,4.5) and test this
hypothesis using the KS test and considering a 5% significance level 231
Review of probability theory and related basic concepts

Example of application of the KS test:


No. of sample (i)    Compressive strength (MPa)    i/n     (i−1)/n    Φ(30,4.5)    |D−|      |D+|
1                    24.4                          0.05    0          0.1067       0.1067    0.0567
2                    27.6                          0.1     0.05       0.2969       0.2469    0.1969
3                    27.8                          0.15    0.1        0.3125       0.2125    0.1625
4                    27.9                          0.2     0.15       0.3204       0.1704    0.1204
5                    28.5                          0.25    0.2        0.3694       0.1694    0.1194
6                    30.1                          0.3     0.25       0.5089       0.2589    0.2089
7                    30.3                          0.35    0.3        0.5266       0.2266    0.1766
8                    31.7                          0.4     0.35       0.6472       0.2972    0.2472
9                    32.2                          0.45    0.4        0.6875       0.2875    0.2375
10                   32.8                          0.5     0.45       0.7331       0.2831    0.2331
11                   33.3                          0.55    0.5        0.7683       0.2683    0.2183
12                   33.5                          0.6     0.55       0.7816       0.2316    0.1816
13                   34.1                          0.65    0.6        0.8189       0.2189    0.1689
14                   34.6                          0.7     0.65       0.8467       0.1967    0.1467
15                   35.8                          0.75    0.7        0.9013       0.2013    0.1513
16                   35.9                          0.8     0.75       0.9051       0.1551    0.1051
17                   36.8                          0.85    0.8        0.9346       0.1346    0.0846
18                   37.1                          0.9     0.85       0.9427       0.0927    0.0427
19                   39.2                          0.95    0.9        0.9795       0.0795    0.0295
20                   39.7                          1       0.95       0.9844       0.0344    0.0156

D = max(|D−|, |D+|) = 0.2972
232
Review of probability theory and related basic concepts

Example of application of the KS test:

    D > Dcrit,1−α        0.2972 > 0.2941        reject H0

233
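The same result can be obtained directly with scipy's Kolmogorov-Smirnov test (an added sketch; the target distribution N(30, 4.5) is fully specified, as the test requires):

import numpy as np
from scipy import stats

strengths = np.array([24.4, 27.6, 27.8, 27.9, 28.5, 30.1, 30.3, 31.7, 32.2, 32.8,
                      33.3, 33.5, 34.1, 34.6, 35.8, 35.9, 36.8, 37.1, 39.2, 39.7])

result = stats.kstest(strengths, "norm", args=(30.0, 4.5))
print(result.statistic)   # about 0.2972, the D of the table above
print(result.pvalue)      # below 0.05, so H0 is rejected at the 5% significance level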
Review of probability theory and related basic concepts

Example of application of the KS test:


If we use the best estimates of the normal distribution parameters, Φ(32.7, 4.04):

No. of sample (i)    Compressive strength (MPa)    i/n     (i−1)/n    Φ(32.7,4.04)    |D−|      |D+|
1                    24.4                          0.05    0          0.0200          0.0200    0.0300
2                    27.6                          0.1     0.05       0.1034          0.0534    0.0034
3                    27.8                          0.15    0.1        0.1126          0.0126    0.0374
4                    27.9                          0.2     0.15       0.1174          0.0326    0.0826
5                    28.5                          0.25    0.2        0.1493          0.0507    0.1007
6                    30.1                          0.3     0.25       0.2599          0.0099    0.0401
7                    30.3                          0.35    0.3        0.2762          0.0238    0.0738
8                    31.7                          0.4     0.35       0.4023          0.0523    0.0023
9                    32.2                          0.45    0.4        0.4508          0.0508    0.0008
10                   32.8                          0.5     0.45       0.5099          0.0599    0.0099
11                   33.3                          0.55    0.5        0.5590          0.0590    0.0090
12                   33.5                          0.6     0.55       0.5785          0.0285    0.0215
13                   34.1                          0.65    0.6        0.6355          0.0355    0.0145
14                   34.6                          0.7     0.65       0.6809          0.0309    0.0191
15                   35.8                          0.75    0.7        0.7786          0.0786    0.0286
16                   35.9                          0.8     0.75       0.7858          0.0358    0.0142
17                   36.8                          0.85    0.8        0.8449          0.0449    0.0051
18                   37.1                          0.9     0.85       0.8619          0.0119    0.0381
19                   39.2                          0.95    0.9        0.9462          0.0462    0.0038
20                   39.7                          1       0.95       0.9584          0.0084    0.0416

D = 0.1007

But we need to use the Lilliefors critical value instead: 0.173

    D < Dcrit,1−α        0.1007 < 0.173        do not reject H0
234
