
Foundations of Data Science

7COM1073

Introduction to Probability and Statistics (I)


Learning Outcomes
At the end of this unit, you should have

• A good understanding of the standard univariate distributions and their properties
• A good understanding of the use of conditional probability
Probability
• Probability – deducing what is likely to happen when an ‘experiment’ is
performed.

• Some terms people use for probability: chance, percentage, likelihood, odds, proportion

• A number between zero and one


➢ The classical approach
▪ It is a mathematical approach using counting rules.
▪ It applies to random processes under certain assumptions, such as equally likely outcomes.
➢ The relative frequency approach
▪ It is based on collecting data and finding the percentage of time that an event 𝐸 occurred on that data.
➢ The subjective approach
➢ The logical approach
Axiomatic Probability Theory
Let (𝛺, 𝜮, 𝑃) denote a probability space, where
• 𝛺 is the set of all possible outcomes, known as the sample space
➢For example: A coin is flipped three times in succession.
• 𝜮 is a collection of subsets of 𝛺, each subset being called an event.
• 𝑃 is a probability measure defined as a real valued function of the elements of 𝜮
satisfying the following axioms of probability:
➢Axiom 1: $0 \le P(A) \le 1$ for all $A \in \Sigma$.
➢Axiom 2: $P(\Omega) = 1$.
➢Axiom 3: If two events A and B are mutually exclusive (that is, they have no elements in common), then the probability of either A or B occurring is the probability of A occurring plus the probability of B occurring:
$P(A \cup B) = P(A) + P(B)$
Probability
If 𝐸 is an event, 𝑃(𝐸) is the probability of the occurrence of the event

When all outcomes in the sample space are equally likely:

$P(E) = \dfrac{\text{number of elements in event } E}{\text{size of the sample space}}$

The maximum probability of any event is 1.


[Figure: Venn diagram of the event E inside the sample space]
Example
Rolling a fair six-sided die.
• Let $\Omega = \{1, 2, 3, 4, 5, 6\}$
• $A = \{2, 4, 6\}$
• $B = \{3, 5\}$
• $C = \{2, 3, 5\}$
• $P(A \cup B) = ?$
• $P(A \cup C) = ?$
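One way to check the two questions above is to count outcomes directly. The sketch below assumes equally likely outcomes (a fair die); the helper name `prob` is illustrative and not part of the slides.

```python
from fractions import Fraction

# Sample space and events for one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 5}
C = {2, 3, 5}

def prob(event):
    """Classical probability: |event| / |omega|, assuming equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

p_a_or_b = prob(A | B)  # A and B are disjoint, so this equals P(A) + P(B)
p_a_or_c = prob(A | C)  # A and C share the outcome 2, so P(A) + P(C) overcounts

print(p_a_or_b, prob(A) + prob(B))
print(p_a_or_c, prob(A) + prob(C))
```

Both unions contain five of the six outcomes, so both probabilities are 5/6. Only for the disjoint pair A, B does the plain sum of probabilities give the same value, which is what Axiom 3 states.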
Example
Rolling two fair six-sided dice. What is the probability of getting two
even numbers?
1 2 3 4 5 6
1 (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
2 (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
3 (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
4 (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
5 (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
6 (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Event: getting two even numbers


The size of sample space = 36
The number of elements in event = 9
P(getting two even numbers) = $\frac{9}{36} = \frac{1}{4}$
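The count in the table can be reproduced by enumerating all ordered pairs. This is a minimal sketch, assuming two fair dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (first die, second die) of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Event: both numbers are even.
both_even = [(a, b) for a, b in outcomes if a % 2 == 0 and b % 2 == 0]

p_both_even = Fraction(len(both_even), len(outcomes))
print(len(outcomes), len(both_even), p_both_even)
```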
Exercise
• A coin is flipped three times in succession. What is the probability of
getting two heads?

➢Sample space:
➢E: getting two heads
➢Size of the sample space
➢Number of times E occurs
➢$P(E) = \dfrac{\text{number of times } E \text{ occurs}}{\text{size of the sample space}}$
Discrete Random Variables
Informally, a random variable is a map from the outcome space (𝛺) to
the real numbers.

Example: Consider throwing two fair dice.


➢The sample space is $\Omega = \{(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), \dots, (6,6)\}$.

➢To each element $\omega \in \Omega$, we assign the real number $X(\omega)$. For example:

▪ We are interested in 'the total score obtained', which is a random variable and may be denoted by $X$.

▪ $X((1,1)) = 2$, $X((2,3)) = 5$
Discrete Random Variables
Event defined by random variables

➢If $X$ is a random variable and $x$, $x_1$ and $x_2$ are fixed real numbers, we may have the following events:

$(X = x)$, $(X \le x)$, $(X > x)$ or $(x_1 < X \le x_2)$.

➢These events have probabilities which are denoted by

$P(X = x)$, $P(X \le x)$, $P(X > x)$ or $P(x_1 < X \le x_2)$.


Discrete Random Variables
• A random variable 𝑋 is called discrete if it only takes values in the integers
or (possibly) some other countable set of real numbers.

• Probability mass function (or probability density function): the probability that $X$ takes on a certain value:

$f_X(x_k) = P(X = x_k)$

• Cumulative distribution function:

$F_X(x) = P(X \le x) = \sum_{x_k \le x} f_X(x_k)$

➢$F_X(x)$ is a staircase function.


Example
Flipping a fair coin three times. Let 𝑋 be the number of heads in three
tosses of a fair coin.
• Probability mass function:
$f_X(0) = P(X = 0) = 1/8$; $f_X(1) = P(X = 1) = 3/8$;
$f_X(2) = P(X = 2) = 3/8$; $f_X(3) = P(X = 3) = 1/8$.
• Cumulative distribution function:
x     (X ≤ x)                                          F_X(x)
-1    ∅                                                0
0     {TTT}                                            1/8
1     {TTT, HTT, THT, TTH}                             4/8
2     {TTT, HTT, THT, TTH, HHT, HTH, THH}              7/8
3     {TTT, HTT, THT, TTH, HHT, HTH, THH, HHH}         1
4     {TTT, HTT, THT, TTH, HHT, HTH, THH, HHH}         1
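The probability mass function and the staircase cumulative distribution function of this example can be rebuilt by enumeration. The following is a sketch under the assumption of a fair coin; the function names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import product

# The eight equally likely outcomes of three tosses, e.g. "HTT".
outcomes = ["".join(t) for t in product("HT", repeat=3)]

def pmf(k):
    """f_X(k) = P(X = k), where X is the number of heads."""
    favourable = sum(1 for o in outcomes if o.count("H") == k)
    return Fraction(favourable, len(outcomes))

def cdf(x):
    """F_X(x) = P(X <= x): sum of the pmf over all values k <= x."""
    return sum((pmf(k) for k in range(4) if k <= x), Fraction(0))

for x in (-1, 0, 1, 2, 3, 4):
    print(x, cdf(x))
```

The printed values match the table (with 4/8 shown in lowest terms as 1/2): the function is 0 below the smallest value, jumps at 0, 1, 2 and 3, and stays at 1 afterwards.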
Continuous Random Variables [4]
𝑋 is a continuous random variable if there is a real-valued function 𝑓𝑋 ,
called the probability density function of 𝑋, which satisfies
• $f_X$ is piecewise continuous;
• $f_X(x) \ge 0$;
• $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
Credit: https://wildart.github.io/CSC21700/CSC_21700_L03.html

Cumulative distribution function (cdf)

• $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$; $0 \le F_X(x) \le 1$; $F_X(x)$ is a nondecreasing function
• $P(a < X < b) = \int_{a}^{b} f_X(x)\,dx$
• $P(a < X \le b) = F_X(b) - F_X(a)$
• $P(X > a) = 1 - F_X(a)$
Mean
• The mean (or expected value) of a random variable $X$, denoted by $\mu_X$ or $E(X)$, is defined by [4]
$\mu_X = E(X) = \begin{cases} \sum_k x_k f_X(x_k) & X \text{: discrete} \\ \int_{-\infty}^{\infty} x f_X(x)\,dx & X \text{: continuous} \end{cases}$

The expected value can be regarded as the long-run average value of $X$ over many repetitions of the experiment.


Variance
• The variance of a random variable $X$, denoted by $\sigma_X^2$ or $Var(X)$, is defined by [4]
$\sigma_X^2 = Var(X) = E\{[X - E(X)]^2\}$

$\sigma_X^2 = \begin{cases} \sum_k (x_k - \mu_X)^2 f_X(x_k) & X \text{: discrete} \\ \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x)\,dx & X \text{: continuous} \end{cases}$

The variance can be regarded as the average of the squared differences between the actual values and the mean.
Example
Three products are selected at random from 9 products, of which 2 are
defective. The sample space consists of the distinct, equally likely
samples of size 3. Let 𝑋 be the random variable which counts the
number of defective items in a sample. The values of 𝑋 are 0, 1, and 2.
What is the expected number of defective products in a sample of size 3?

$E(X) = \sum_k x_k f_X(x_k)$ ($X$: discrete)

$C(n, k) = \binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$
• Solution:
➢ The number of ways of choosing $x_i$ defectives from 2 defectives and choosing $3 - x_i$ nondefectives from 7 nondefectives is $\binom{2}{x_i}\binom{7}{3 - x_i}$
➢ The total number of possible outcomes is $\binom{9}{3}$
➢ The probability of the value $x_i$ of $X$ is
$p_i = \binom{2}{x_i}\binom{7}{3 - x_i} \Big/ \binom{9}{3}$, $(x_i = 0, 1, 2)$

➢ $E(X) = \sum_i x_i p_i$ ($X$: discrete)
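The solution steps above can be carried through numerically. This sketch uses Python's `math.comb` for the binomial coefficients and exact fractions; the variable names are illustrative, not from the slides.

```python
from fractions import Fraction
from math import comb

# Total number of equally likely samples of size 3 drawn from 9 products.
total = comb(9, 3)

# p[x] = P(X = x): choose x of the 2 defectives and 3 - x of the 7 nondefectives.
p = {x: Fraction(comb(2, x) * comb(7, 3 - x), total) for x in (0, 1, 2)}

# E(X) = sum over i of x_i * p_i
expected = sum(x * px for x, px in p.items())

print(total, p, expected)
```

With these counts the probabilities are 35/84, 42/84 and 7/84, which sum to 1, and the expected number of defective items in a sample works out to 2/3.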
Example
Let $X$ be a random variable whose distribution on the interval [0, 1] has the probability density function

$f_X(x) = \begin{cases} 0 & \text{if } x < 0 \text{ or } x > 1 \\ 1 & \text{if } 0 \le x \le 1 \end{cases}$

Compute $E(X)$.
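Solution

• $E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$
$= \int_{-\infty}^{0} x \times 0\,dx + \int_{0}^{1} x \times 1\,dx + \int_{1}^{\infty} x \times 0\,dx$
$= 0 + \frac{1}{2} x^2 \Big|_0^1 + 0$
$= \frac{1}{2}$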
The Discrete Uniform Distribution
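Interval $[a, b]$ (integer values)

$n = b - a + 1$

$f_X(x) = \begin{cases} \frac{1}{n} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$

$\mu_X = E(X) = \frac{a + b}{2}$

$\sigma_X^2 = Var(X) = \frac{n^2 - 1}{12}$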
Example: Suppose we throw a die. Let $X$ be the random variable denoting the obtained number.

Note: For the special distributions, you just need to know how to use the formulas to get the mean and the variance. You do not need to derive them yourself.
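For the die example, the discrete uniform formulas (mean $(a+b)/2$ and variance $(n^2-1)/12$ with $n = b - a + 1$) can be compared with the definitions of mean and variance. A minimal sketch with $a = 1$, $b = 6$; the variable names are illustrative.

```python
from fractions import Fraction

a, b = 1, 6                # a fair die takes the integer values 1..6
n = b - a + 1
values = range(a, b + 1)
f = Fraction(1, n)         # f_X(x) = 1/n for every value in [a, b]

# Directly from the definitions of mean and variance.
mean_direct = sum(x * f for x in values)
var_direct = sum((x - mean_direct) ** 2 * f for x in values)

# From the closed-form formulas for the discrete uniform distribution.
mean_formula = Fraction(a + b, 2)
var_formula = Fraction(n**2 - 1, 12)

print(mean_direct, mean_formula)
print(var_direct, var_formula)
```

Both routes give a mean of 7/2 and a variance of 35/12.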
The Bernoulli Distribution
• If $X$ is a random variable with this distribution, then

$P(X = 1) = p$
$P(X = 0) = 1 - p$
The probability mass function $f$ of this distribution is:
$f(X; p) = \begin{cases} p & \text{if } X = 1 \\ q = 1 - p & \text{if } X = 0 \end{cases}$

Generate 5000 Bernoulli random numbers with probability of success $p = 0.2$.

Credit: http://cmdlinetips.com/2018/03/probability-distributions-in-python/
The Bernoulli Distribution

$\mu_X = E(X) = p$
$\sigma_X^2 = Var(X) = p(1 - p)$

Examples:
➢Coin tosses: a coin lands heads up or tails up.
➢Birth: boy or girl.
➢Use in epidemiology: an outcome such as death or survival.
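The experiment of generating 5000 Bernoulli random numbers with $p = 0.2$ can be sketched with the standard library alone. This is not the code from the credited page; it is an assumed minimal version using `random.random()`, with a fixed seed chosen only for reproducibility.

```python
import random

random.seed(0)             # assumption: fixed seed so the run is repeatable
p = 0.2                    # probability of success
n_samples = 5000

# A Bernoulli(p) draw is 1 with probability p and 0 otherwise.
samples = [1 if random.random() < p else 0 for _ in range(n_samples)]

sample_mean = sum(samples) / n_samples
sample_var = sum((s - sample_mean) ** 2 for s in samples) / n_samples

print(sample_mean, p)              # sample mean vs. E(X) = p
print(sample_var, p * (1 - p))     # sample variance vs. Var(X) = p(1 - p)
```

The sample mean and variance should land close to $p = 0.2$ and $p(1 - p) = 0.16$, though the exact values depend on the seed.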
The Binomial Distribution

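Let $X = X_1 + \dots + X_n$ be the number of successes in $n$ independent Bernoulli trials, each with success probability $p$.

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \dots, n$

$\mu_X = E(X) = E(X_1 + \dots + X_n) = p + \dots + p = np$

$\sigma_X^2 = Var(X) = Var(X_1 + \dots + X_n) = p(1 - p) + \dots + p(1 - p) = np(1 - p)$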

Examples:
• The number of defective/non-defective
products in a production run.
• Yes/No survey
• The number of successful sales calls.
The Binomial Distribution
For example: run 20 independent experiments, each
having a Bernoulli distribution with parameter p=0.6.
Repeat the whole procedure 10000 times.

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$

$\mu_X = E(X) = np$

$\sigma_X^2 = Var(X) = np(1 - p)$

For example, for $X = 12$, that is, with $k = 12$, we have:

$P(X = 12) = \dfrac{20!}{12!\,(20 - 12)!}\, 0.6^{12} (1 - 0.6)^{20-12} \approx 0.1797$
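The procedure described on this slide (20 independent Bernoulli trials with $p = 0.6$, repeated 10000 times) can be sketched as follows, together with the exact value of $P(X = 12)$ from the binomial formula $\binom{n}{k} p^k (1-p)^{n-k}$. The function name `binomial_pmf` and the seed are assumptions made for illustration.

```python
import random
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.6
exact = binomial_pmf(12, n, p)

random.seed(1)             # assumption: fixed seed for repeatability
repeats = 10_000

# Each repetition counts the successes among n independent Bernoulli(p) trials.
counts = [sum(1 for _ in range(n) if random.random() < p) for _ in range(repeats)]

estimate = counts.count(12) / repeats   # relative frequency of X = 12
sample_mean = sum(counts) / repeats     # should be near np = 12

print(round(exact, 4), estimate, sample_mean)
```

The exact value rounds to 0.1797, and the simulated relative frequency should be close to it.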
Example
• 90% of all students pass the module
• A sample of 10 new students is selected
• Find the probability that exactly seven will pass

Do we satisfy the conditions of the binomial distribution model?


➢There are only two possible mutually exclusive outcomes: pass or fail
➢There is a fixed number of trials (students): 10
➢It is reasonable to assume that the students' results are independent
➢The probability of passing is 0.9 for each student.
➢We have 𝑛 = 10, 𝑝 = 0.90, 𝑞 = 0.10, 𝑘 = 7

$P(X = 7) = \dfrac{10!}{7!\,(10 - 7)!}\, 0.90^{7} (1 - 0.90)^{10-7} \approx 0.0574 = 5.74\%$
The Poisson Distribution
$P(X = k \text{ events in an interval}) = e^{-\lambda} \dfrac{\lambda^k}{k!}, \quad k = 0, 1, \dots$
$\lambda > 0$ is the average number of events per interval
$e \approx 2.71828$ is Euler's number

$\mu_X = E(X) = \lambda$
$\sigma_X^2 = Var(X) = \lambda$

Applications:
➢The number of customers entering a supermarket during various intervals of time.
➢The number of misprints on a page of a document.
The Poisson Distribution

Generate 20,000 random numbers following the Poisson distribution with 𝜆= 0.4

For example, $P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-0.4} \dfrac{0.4^0}{0!} \approx 0.3297$
Example
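A doctor sees 3 patients an hour on average. Find the probability that she sees exactly 5 patients in the next hour, assuming the number of patients per hour follows a Poisson distribution with $\lambda = 3$.

$P(X = 5) = e^{-3} \dfrac{3^5}{5!} \approx 0.1008$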
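Both Poisson calculations (the doctor with $\lambda = 3$ and the earlier case $\lambda = 0.4$) follow from the same formula $e^{-\lambda} \lambda^k / k!$. A small sketch using only the `math` module; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): exp(-lam) * lam**k / k!"""
    return exp(-lam) * lam**k / factorial(k)

# Doctor example: on average 3 patients per hour, exactly 5 in the next hour.
p_five_patients = poisson_pmf(5, 3)

# lambda = 0.4: probability of at least one event in the interval.
p_at_least_one = 1 - poisson_pmf(0, 0.4)

print(round(p_five_patients, 4))
print(round(p_at_least_one, 4))
```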
The Law of Large Numbers
The law of large numbers states that if we repeat a procedure over and
over, the relative frequency probability will approach the actual
probability.

The frequentist approach is to calculate the following:

$P(E) = \dfrac{\text{number of times } E \text{ occurred}}{\text{number of times the procedure was repeated}}$
The Law of Large Numbers

A coin is flipped three times in succession. What is the probability of getting two heads?

➢Sample space:
➢E: getting two heads
➢Size of the sample space
➢Number of times E occurs
➢$P(E) = \dfrac{\text{number of times } E \text{ occurs}}{\text{size of the sample space}}$

Credit:
http://cmdlinetips.com/2018/03/probability-distributions-in-python/
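The law of large numbers can be illustrated by simulating the three-flip experiment many times and watching the relative frequency settle. The sketch below is not the code from the credited page; it assumes a fair coin, reads "two heads" as exactly two heads (classical probability 3/8), and uses a fixed seed for repeatability.

```python
import random

random.seed(42)  # assumption: fixed seed so the run is repeatable

def exactly_two_heads():
    """Flip a fair coin three times; return True if exactly two flips are heads."""
    flips = [random.choice("HT") for _ in range(3)]
    return flips.count("H") == 2

def relative_frequency(n_repeats):
    """Fraction of n_repeats experiments in which the event occurred."""
    hits = sum(exactly_two_heads() for _ in range(n_repeats))
    return hits / n_repeats

# Relative frequency for an increasing number of repetitions.
estimates = {n: relative_frequency(n) for n in (10, 100, 1_000, 10_000, 100_000)}
for n, estimate in estimates.items():
    print(n, estimate)
```

For small numbers of repetitions the estimate can be far from 3/8 = 0.375; as the number of repetitions grows it tends to move closer to that value.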
The Continuous Uniform Distribution

Example: Student's height can take any value within a range.

Credit: http://resources.esri.com/help/9.3/arcgisengine/java/gp_toolref/process_simulations_sensitivity_analysis_and_error_analysis_modeling/distributions_for_assigning_random_values.htm
The Gaussian or Normal Distribution
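A random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$ if its density is

$f_X(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

We write $X \sim N(\mu, \sigma^2)$.

$\mu_X = E(X) = \mu$

$\sigma_X^2 = Var(X) = \sigma^2$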

Examples: https://studiousguy.com/real-life-examples-normal-distribution/
The Gaussian or Normal Distribution

Credit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html
The Normal Distribution

Credit: https://www.mathsisfun.com/data/standard-normal-distribution.html
Z-scores
• Measure how far away a single data point is from the mean, in units of standard deviations:
$z = \dfrac{x - \bar{x}}{s}$ or $z = \dfrac{x - \mu}{\sigma}$
➢$x$ is the data point
➢$\bar{x}$ is the sample mean; $\mu$ is the population mean
➢$s$ is the sample standard deviation
➢$\sigma$ is the population standard deviation
• Table: areas under the standard normal curve to the left of Z.
Credit: http://www.z-table.com/
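A z-score and the corresponding table value (the area to the left of z under the standard normal curve) can be computed without a printed table. The sketch below uses the identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}(z/\sqrt{2})\right)$; the height data are hypothetical and the function names are illustrative.

```python
from math import erf, sqrt
from statistics import mean, stdev

def z_score(x, centre, spread):
    """z = (x - mean) / standard deviation."""
    return (x - centre) / spread

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

heights_cm = [160, 165, 170, 175, 180]   # hypothetical sample, for illustration only
x_bar = mean(heights_cm)                 # sample mean
s = stdev(heights_cm)                    # sample standard deviation (n - 1 denominator)

z = z_score(180, x_bar, s)
print(z, standard_normal_cdf(z))
```

Here `standard_normal_cdf` plays the role of the z-table lookup, for example `standard_normal_cdf(1.96)` is about 0.975.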
Example

Cumulative distribution function: gives the area under the probability density function over the interval from negative infinity to x.

We can see: 1) the probability of X taking on values less than -2.5 is nearly 0;
2) the values sampled from X will mostly be less than 2.5.
Central Limit Theorem [3]
Let $X_1, \dots, X_n$ be $n$ independent random variables, each of which has mean $\mu$ and standard deviation $\sigma$. Let $Y = (X_1 + \dots + X_n)/n$ be the average; thus, $Y$ has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. If $n$ is large, then the cumulative distribution of $Y$ is very nearly equal to the cumulative distribution of the Gaussian with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

Applications:
https://en.wikipedia.org/wiki/Central_limit_theorem#Applications_and_examples
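A small simulation can make the theorem concrete. The sketch below averages $n = 30$ draws from the uniform distribution on [0, 1] (mean 0.5, standard deviation $\sqrt{1/12}$) and checks that the averages have mean close to $\mu$ and standard deviation close to $\sigma/\sqrt{n}$. The choice of distribution, $n$, number of trials and seed are all assumptions made for illustration.

```python
import random
from math import sqrt

random.seed(7)           # assumption: fixed seed for repeatability
n = 30                   # number of variables averaged in each trial
trials = 5_000           # number of averages generated

mu = 0.5                 # mean of Uniform(0, 1)
sigma = sqrt(1 / 12)     # standard deviation of Uniform(0, 1)

# Each entry is one realisation of Y = (X_1 + ... + X_n) / n.
averages = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

mean_y = sum(averages) / trials
sd_y = sqrt(sum((y - mean_y) ** 2 for y in averages) / trials)

# For a Gaussian, about 68% of values lie within one standard deviation of the mean.
within_one_sd = sum(abs(y - mu) <= sigma / sqrt(n) for y in averages) / trials

print(mean_y, mu)
print(sd_y, sigma / sqrt(n))
print(within_one_sd)
```

Although each individual draw is uniform rather than bell-shaped, the fraction of averages within one standard deviation of the mean should be close to the Gaussian value of roughly 0.68.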
Compound Events
• The probability that either $A$ or $B$ occurs:
$P(A \cup B) = P(A \text{ or } B)$

• The probability that both $A$ and $B$ occur:
$P(A \cap B) = P(A \text{ and } B)$
Example [1]
• 100 people who showed up for a new test for cancer
• Event A: people actually have cancer
• Event B: people’s test result was positive (it claimed that they had cancer)
• The number of people in event A: 25
• The number of people in event B: 30
• Among the 30 people whose test results were positive, 20 actually have cancer
[Figure: Venn diagram of events A and B inside the sample space, overlapping in A ∩ B]
• $P(A \cap B) = ?$
• $P(A \cup B) = ?$
Solution
• $P(A \cap B) = \dfrac{20}{100} = 20\%$
• There are $25 + 30 - 20 = 35$ people in A or B, so $P(A \cup B) = \dfrac{35}{100} = 35\%$
The Rules of Probability - The Addition Rule

• The addition rule

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

• The probability that someone either has cancer or has a positive test result:

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.25 + 0.30 - 0.20 = 0.35$
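The addition rule can be checked on the cancer-test example with Python sets. The sketch labels the 100 people with the integers 0 to 99; which labels belong to which event is an arbitrary assumption, chosen only so that A has 25 members, B has 30, and they share 20.

```python
from fractions import Fraction

people = set(range(100))     # the 100 people who took the test
A = set(range(0, 25))        # 25 people who actually have cancer (labels arbitrary)
B = set(range(5, 35))        # 30 people who tested positive; 20 of them are also in A

def prob(event):
    """Classical probability over the 100 equally weighted people."""
    return Fraction(len(event), len(people))

lhs = prob(A | B)                          # P(A or B), counted directly
rhs = prob(A) + prob(B) - prob(A & B)      # addition rule

print(prob(A & B), lhs, rhs)
```

Subtracting $P(A \cap B)$ removes the 20 people who would otherwise be counted twice, once in A and once in B.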


The Rules of Probability - Mutual exclusivity

Mutual exclusivity: two events are mutually exclusive if they cannot occur at the same time, that is, $A \cap B = \emptyset$, and $P(A \cap B) = 0$.

$P(A \cup B) = P(A) + P(B) - P(A \cap B) = P(A) + P(B)$
• Example:
For a given module, the events "a student passes the module" and "the student fails the module" are mutually exclusive.
Conditional Probability
$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}, \quad P(A) > 0$

Problem: A math teacher gave her class two tests. 25% of the
class passed both tests and 42% of the class passed the first
test. What percent of those who passed the first test also
passed the second test?

Credit: https://www.mathgoodies.com/lessons/vol6/conditional

See: http://setosa.io/ev/conditional-probability/
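The test problem above is a direct application of $P(B \mid A) = P(A \cap B) / P(A)$, with A = "passed the first test" and B = "passed the second test". A minimal sketch; the variable names are illustrative.

```python
p_both = 0.25    # P(A and B): passed both tests
p_first = 0.42   # P(A): passed the first test

# P(B | A) = P(A and B) / P(A)
p_second_given_first = p_both / p_first

print(f"{p_second_given_first:.1%}")
```

So about 60% of those who passed the first test also passed the second.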
Exercise
• Flipping a fair coin three times.
• Let 𝐴 be the event that the first flip is a head.
• Let 𝐵 be the event of getting exactly two heads.

Compute 𝑃(𝐵|𝐴)
Solution

Let $\Omega$ be the sample space of all eight outcomes of flipping a coin three times, and let each outcome have probability $\frac{1}{8}$.

$A = \{HHH, HHT, HTH, HTT\}$; $B = \{HHT, HTH, THH\}$

$A \cap B = \{HHT, HTH\}$

$P(B \mid A) = \dfrac{2/8}{4/8} = \dfrac{1}{2}$
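The same answer can be obtained by enumerating the sample space and applying $P(B \mid A) = P(A \cap B) / P(A)$. A sketch assuming a fair coin; the helper `prob` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

# All eight equally likely outcomes of three flips, e.g. "HTH".
omega = ["".join(t) for t in product("HT", repeat=3)]

A = {o for o in omega if o[0] == "H"}          # first flip is a head
B = {o for o in omega if o.count("H") == 2}    # exactly two heads

def prob(event):
    return Fraction(len(event), len(omega))

p_b_given_a = prob(A & B) / prob(A)

print(sorted(A & B), p_b_given_a)
```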
Exercise
Sarah has 2 children. You learn that she has a son, Mark. What is the
probability that Mark’s sibling is a brother?

Conditional probability: $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$
➢What is the sample space?
➢What is the event A?
➢What is the event B?
References
[1] Sinan Ozdemir, Principles of Data Science, Chapters 5-6, Packt Publishing, 2016.
[2] William Fleshman, Probability Theory, Fundamentals of Machine Learning (Part 1).
[3] Ernest Davis, Linear Algebra and Probability for Computer Science Applications, 2012.
[4] Hwei Hsu, Schaum's Outlines: Probability, Random Variables, & Random Processes, 1997.
