
Biostatistics

Lecture 4
Theoretical Probability Distributions
2022-2 Fall Semester

Instructor: Min Jin Ha


Department of Health Informatics and Biostatistics
Graduate School of Public Health
Yonsei University
Reading
• Pagano and Gauvreau, Chapter 7.1, 7.2 and 7.4
• Credits:
• CMU Open Learning Initiative: Probability & Statistics v5.0
Random Variable
• Any characteristic that can be measured or categorized is called a variable
• If a variable can assume a number of different values such that any
particular outcome is determined by chance, it is a random variable
• Ex. Serum cholesterol level of a 25- to 34-year-old male in the US
• Random variables are typically represented by uppercase letters such as X, Y,
and Z.
• A discrete random variable can assume only a finite or countable number
of outcomes (e.g., X takes one of the values x1, ..., x100)
• Ex. Marital status, gender, the number of pregnancies
• A continuous random variable can take on any value within a specified
interval
• Ex. Weight, Height, Serum cholesterol level
Probability Distribution
• Every random variable has a corresponding probability distribution to
describe the behavior of the random variable
• We need a probability distribution to make statements about how likely
an event is
• Empirical Distributions: Revisit Lecture 2
Discrete Case
• Research Question: Perception of their own body among college
students in Korea
• Survey Question: Do you feel you are overweight, underweight, or
about right?
• From a random sample of the target population (i.e., college students
in Korea), we obtain categorical responses.
Category       Frequency   Relative Frequency
About right    855         (855/1200) × 100 = 71.3%
Overweight     235         (235/1200) × 100 = 19.6%
Underweight    110         (110/1200) × 100 = 9.2%
Total          n = 1200    100%
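The relative frequencies in the table can be reproduced with a few lines of R. This is a minimal sketch: the counts come from the table, while the variable names (`counts`, `rel_freq`, `pct`) are only illustrative. Percentages are shown to two decimals; the table rounds them to one.

```r
# Frequencies from the survey table
counts <- c("About right" = 855, "Overweight" = 235, "Underweight" = 110)

n <- sum(counts)                   # total sample size
rel_freq <- counts / n             # relative frequencies as proportions
pct <- round(100 * rel_freq, 2)    # relative frequencies as percentages

print(data.frame(Frequency = counts, Percent = pct))
```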
Continuous Case: Empirical Distribution
• Pima Indian Women
• A population of women who were at least 21 years old, of Pima Indian
heritage, and living near Phoenix, Arizona
• Interested in BMI

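R code:
library(MASS)  # provides the Pima.tr data set
# Density-scaled histogram of BMI with a kernel density estimate drawn on top
hist(Pima.tr$bmi, col = "coral", main = "BMI of Pima Indian Women", xlab = "BMI", border = "burlywood", breaks = 30, freq = FALSE)
lines(density(Pima.tr$bmi, adjust = 0.7), col = "burlywood", lwd = 3)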
Theoretical Distributions
• Probabilities that are calculated from a finite amount of data are
called empirical probabilities
• Probability distributions can also be derived from theoretical
considerations; these are called theoretical probability distributions
Bernoulli Distribution
• Consider a dichotomous (two-level) random variable Y.
• By definition, Y must assume one of two possible values:
• Failure or success
• Dead or alive
• Male or Female
• Current smoker or not
• Heads or tails (coin flip)
• A random variable of this type is known as a Bernoulli random
variable, and we describe the probability of response using the
parameter 𝜋
Bernoulli Random Variable
• Often coded so that 𝑌 = 1 is called an event or success, and 𝑌 = 0 is
called a failure
• 𝜋 is defined as the probability of success, 𝜋 = 𝑃(𝑌 = 1)
• Coin flip: let 𝑌 = 1 if heads and 𝑌 = 0 if tails; then 𝜋 = 0.5 = 𝑃(𝑌 = 0)
• Gender at birth in the US: let 𝑌 = 1 if male and 𝑌 = 0 if female; then 𝜋 = 0.512
and 𝑃(𝑌 = 0) = 1 − 𝜋 = 0.488
Bernoulli Distribution
• Y takes value 1 with probability 𝜋 and 0 with probability 1 − 𝜋
• $P(Y = y) = \pi^{y}(1 - \pi)^{1-y}$; use it to calculate $P(Y = 0)$ and $P(Y = 1)$

• We want to extend this to a more complex setting: in a randomly
selected group of 3 students, how surprising would it be to get 2
smokers?
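As a quick check of the Bernoulli formula, the sketch below evaluates it for the gender-at-birth example (𝜋 = 0.512) and compares the result with R's built-in `dbinom` using a single trial. The helper name `bernoulli_pmf` is hypothetical and not part of the lecture.

```r
# Bernoulli pmf: P(Y = y) = pi^y * (1 - pi)^(1 - y), for y in {0, 1}
# (bernoulli_pmf is an illustrative helper, not a base R function)
bernoulli_pmf <- function(y, prob) prob^y * (1 - prob)^(1 - y)

prob_male <- 0.512                     # pi from the gender-at-birth example
p1 <- bernoulli_pmf(1, prob_male)      # P(Y = 1) = pi
p0 <- bernoulli_pmf(0, prob_male)      # P(Y = 0) = 1 - pi
print(c(p0 = p0, p1 = p1))

# A Bernoulli variable is a binomial with one trial, so dbinom agrees
builtin <- dbinom(c(0, 1), size = 1, prob = prob_male)
print(builtin)
```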
Case Study: Smoking
• Suppose it is reported that roughly 20% of Korean adults are smokers
• 𝜋 = 𝑃(𝑌 = 1) = 0.2
• Now suppose we randomly select two adults from the Korean adult population and let
a new random variable 𝑋 represent the number of smokers: the
possible values of 𝑋 are 0, 1, or 2. Assume the two people are
independent (so we can use the multiplicative rule)
Case Study: Smoking
• The probability distribution of the number of smokers out of two people
is given by
$P(X = 0) = 0.64$, $P(X = 1) = 0.32$, $P(X = 2) = 0.04$
• Interpretation: if we randomly sample two people from the Korean
population, the probability that both are smokers is 4%, the
probability that both are nonsmokers is 64%, and the probability that
exactly one is a smoker is 32%.
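These three values follow from the multiplicative rule for independent events with 𝜋 = 0.2. The worked steps below are a sketch of that arithmetic; the factor 2 for one smoker counts the two possible orders (smoker first or smoker second).

```latex
\begin{align*}
P(X = 0) &= (1 - \pi)(1 - \pi) = 0.8 \times 0.8 = 0.64 \\
P(X = 1) &= \pi(1 - \pi) + (1 - \pi)\pi = 2 \times 0.2 \times 0.8 = 0.32 \\
P(X = 2) &= \pi \cdot \pi = 0.2 \times 0.2 = 0.04
\end{align*}
```

The three probabilities sum to 1, as they must for a probability distribution.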
Case Study: Smoking
• If we randomly sample 3 people, what is the chance all 3 are smokers?
Case Study: Smoking
• The probability distribution of the number of smokers out of three people
is given by
$P(X = 0) = 0.512$, $P(X = 1) = 0.384$, $P(X = 2) = 0.096$, $P(X = 3) = 0.008$
• If we randomly sample 4 people, what is the chance all 4 are smokers?
• This is getting ridiculous, now we need a formula!
• We can use the binomial distribution to help determine this probability
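Before turning to the formula, the three-person distribution can be obtained by brute force. The sketch below lists all 2^3 outcomes, applies the multiplicative rule to each, and adds up outcomes with the same number of smokers. Variable names are illustrative only.

```r
prob_smoker <- 0.2   # pi, from the case study

# Enumerate all 2^3 outcomes for three people (1 = smoker, 0 = nonsmoker)
outcomes <- expand.grid(person1 = 0:1, person2 = 0:1, person3 = 0:1)
x <- rowSums(outcomes)                                      # smokers in each outcome
outcome_prob <- prob_smoker^x * (1 - prob_smoker)^(3 - x)   # multiplicative rule
dist3 <- tapply(outcome_prob, x, sum)                       # combine outcomes with equal X
print(dist3)

# Probability that all four of four sampled people are smokers
p_all4 <- prob_smoker^4
print(p_all4)
```

Enumeration doubles in size with every extra person, which is the practical reason a closed-form formula is needed.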
Binomial Distribution
• The binomial distribution is used to give us the probability of 𝑋 ‘successes’
from a sequence of 𝑛 independent Bernoulli trials.
• In our example, each person would be an independent Bernoulli trial
(either a smoker or not)
• This distribution involves three assumptions
• There is a fixed number of Bernoulli trials, 𝑛, each of which results in one of two
mutually exclusive outcomes
• The outcomes of the 𝑛 trials are independent
• The probability of success 𝜋 is the same for each trial
• The distribution is
$$P(X = x) = \binom{n}{x}\,\pi^{x}(1 - \pi)^{n-x}, \quad x = 0, 1, \ldots, n,$$
which has mean $n\pi$ and variance $n\pi(1 - \pi)$
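The formula, mean, and variance can be checked numerically. This is a minimal sketch using the three-person smoking example (n = 3, 𝜋 = 0.2); `binom_pmf` is a hypothetical helper written for illustration, compared against R's built-in `dbinom`.

```r
# Binomial pmf written out from the formula (illustrative helper)
binom_pmf <- function(x, n, prob) choose(n, x) * prob^x * (1 - prob)^(n - x)

n <- 3
prob <- 0.2
x <- 0:n
pmf <- binom_pmf(x, n, prob)
print(pmf)

# Agreement with the built-in binomial density
matches_builtin <- isTRUE(all.equal(pmf, dbinom(x, size = n, prob = prob)))

# Mean and variance computed directly from the distribution
mean_x <- sum(x * pmf)                 # compare with n * pi
var_x <- sum((x - mean_x)^2 * pmf)     # compare with n * pi * (1 - pi)
print(c(mean = mean_x, variance = var_x))
```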
Math
• $n! = n(n-1)(n-2)\cdots(3)(2)(1)$ is $n$ factorial; it gives the
number of ways in which $n$ individuals can be
ordered ($n$ choices for the 1st, $n-1$ choices for the 2nd, ...)
• By definition, $0! = 1$
• $\binom{n}{x} = \dfrac{n!}{x!\,(n-x)!}$ is the combination of $n$ objects chosen $x$ at a time: the
number of ways in which $x$ objects can be selected from a total of $n$
objects regardless of order
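In R, `factorial()` and `choose()` compute these quantities directly. The sketch below uses n = 3 and x = 2 (two smokers among three people) as an illustrative case and lists the selections with `combn()`.

```r
n <- 3
x <- 2

orderings <- factorial(n)                                        # n! = 3 * 2 * 1
by_formula <- factorial(n) / (factorial(x) * factorial(n - x))   # n! / (x! (n - x)!)
builtin <- choose(n, x)                                          # same value, built in
zero_fact <- factorial(0)                                        # defined as 1

# Each column is one way to pick 2 of the 3 people, ignoring order
print(combn(n, x))
```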
Binomial Distribution
• $X \sim \text{Binomial}(n, \pi) \Leftrightarrow P(X = x) = \binom{n}{x}\,\pi^{x}(1 - \pi)^{n-x}$
• $\pi^{x}(1 - \pi)^{n-x}$ accounts for the probability of two smokers in order
• $\binom{n}{x}$ accounts for all the possible ways in which we have two smokers regardless of order
[Figure: Binomial(10, 0.2) distribution; x-axis: No. of successes (0 to 10); y-axis: Density (0.00 to 0.30)]
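A plot like the Binomial(10, 0.2) figure can be drawn as follows. This is only a sketch; the plotting choices (a bar plot with default colors) are assumptions and need not match the original slide.

```r
x <- 0:10
pmf <- dbinom(x, size = 10, prob = 0.2)   # P(X = x) for Binomial(10, 0.2)

barplot(pmf, names.arg = x,
        main = "Binomial(10, 0.2)",
        xlab = "No. of successes", ylab = "Probability")
```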
Exercise
R Workshop
• Random number generation
• Calculate a density given a value
• Calculate right tail and left tail areas
• Calculate quantiles
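The workshop material itself is not reproduced here. As a hedged sketch of the four tasks listed, R pairs each distribution with `r`, `d`, `p`, and `q` functions; the example assumes Binomial(10, 0.2) and an arbitrary seed.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

draws <- rbinom(5, size = 10, prob = 0.2)   # random number generation
d2 <- dbinom(2, size = 10, prob = 0.2)      # density at a value: P(X = 2)
left <- pbinom(2, size = 10, prob = 0.2)    # left tail: P(X <= 2)
right <- pbinom(2, size = 10, prob = 0.2,
                lower.tail = FALSE)         # right tail: P(X > 2)
q50 <- qbinom(0.5, size = 10, prob = 0.2)   # quantile: smallest x with P(X <= x) >= 0.5

print(list(draws = draws, density = d2, left = left, right = right, median = q50))
```

Note that for a discrete distribution the left tail from `pbinom` includes the value itself, while the right tail with `lower.tail = FALSE` excludes it.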
Continuous Distribution
• So far we have discussed discrete random variables
• We now turn to the distribution of a continuous random variable 𝑋
• Suppose 𝑋 represents height
• An individual who is exactly 163 cm tall is rare
• Theoretically, 𝑋 can assume an infinite number of intermediate values,
such as 163.0001 cm or 163.01 cm
• In reality we measure only discrete values due to the limitations of our
measuring instruments
• As a result, the distribution of a continuous random variable is represented
by a smooth curve, called a density function
Continuous Distributions
• A continuous distribution describes the probabilities of possible
values of a continuous random variable (infinite and uncountable)
• Density functions/curves, like histograms, can have any shape. The
area under the density curve is always 1.
• How do you find the area of interest in the curves?
• Integration!

Empirically observed frequency (count the number of values observed)
Analytical probability density (area under the curve)
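Areas under a density curve can be computed numerically with `integrate()`. The sketch below uses the standard normal density `dnorm` as an example, which is an assumption made here for illustration; any valid density function would work.

```r
# Total area under the density curve
total <- integrate(dnorm, lower = -Inf, upper = Inf)

# Area to the left of 0, i.e. P(X < 0) for a standard normal X
lower_half <- integrate(dnorm, lower = -Inf, upper = 0)

print(total$value)
print(lower_half$value)
```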
Normal Distributions

• For the normal distribution, the parameters are:
- μ = mean of X
- σ = standard deviation of X
• Also called the Gaussian distribution
• Symmetric bell curve
[Figure: normal density curves for σ = 1, σ = 2, and σ = 3. The normal is symmetric and centered on μ; μ affects the position of the center of the curve; σ affects the width of the curve.]
• The SD measures the distance
from the mean to the point of
inflection
• About 68% of the data fall
within 1 SD of the mean.
• About 5% of the data are more
than 2 SDs from the mean (about 2.5% in each
tail)
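The normal density is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where π here is the constant 3.14159..., not the Bernoulli parameter. The sketch below codes this formula in a hypothetical helper `normal_density`, compares it with `dnorm`, and checks the 1-SD and 2-SD rules. The values μ = 0 and σ = 2 are arbitrary.

```r
# Normal density written out from the formula (illustrative helper)
normal_density <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

mu <- 0
sigma <- 2   # arbitrary illustrative values
x <- seq(-6, 6, by = 0.5)
same <- isTRUE(all.equal(normal_density(x, mu, sigma), dnorm(x, mean = mu, sd = sigma)))

# Probability within 1 SD of the mean (about 0.68)
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
# Probability more than 2 SDs from the mean, both tails combined (about 0.05)
beyond_2sd <- 2 * pnorm(mu - 2 * sigma, mu, sigma)

print(c(within_1sd = within_1sd, beyond_2sd = beyond_2sd))
```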
Finding tail probabilities (P given x)
P(X < x1) = …    P(X < x2) = …
In R: pnorm(q = x1, mean = …, sd = …)
Input: quantile and parameters
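A concrete call might look like the sketch below. All numbers (μ = 100, σ = 15, x1 = 85, x2 = 130) are hypothetical values chosen for illustration; the first argument of `pnorm` is named `q`.

```r
mu <- 100
sigma <- 15   # hypothetical parameters
x1 <- 85
x2 <- 130     # hypothetical cut-offs

p_x1 <- pnorm(q = x1, mean = mu, sd = sigma)      # left tail: P(X < x1)
p_x2 <- pnorm(q = x2, mean = mu, sd = sigma)      # left tail: P(X < x2)
upper_x2 <- pnorm(q = x2, mean = mu, sd = sigma,
                  lower.tail = FALSE)             # right tail: P(X > x2)

print(c(p_x1 = p_x1, p_x2 = p_x2, upper_x2 = upper_x2))
```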
Finding quantiles (x given P)
P(X < …) = P1 (area to the left = P1)
In R: qnorm(p = P1, mean = …, sd = …)
Input: left tail probability and parameters
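`qnorm` is the inverse of `pnorm`: given a left tail probability it returns the cut-off value. A small sketch with hypothetical values (μ = 100, σ = 15, P1 = 0.975):

```r
mu <- 100
sigma <- 15    # hypothetical parameters
P1 <- 0.975    # hypothetical left tail probability

x_q <- qnorm(p = P1, mean = mu, sd = sigma)    # value with area P1 to its left
back <- pnorm(q = x_q, mean = mu, sd = sigma)  # feeding it to pnorm recovers P1

print(c(quantile = x_q, check = back))
```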
Use the property of symmetry

P(X < μ − k) = Pk
P(X > μ + k) = Pk
P(X < μ + k) = 1 − Pk


Probabilities between two values
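The probability between two values is the difference of two left tail areas, and for limits placed symmetrically around μ it equals 1 − 2Pk. The sketch below checks this numerically; μ = 100, σ = 15, and k = 20 are hypothetical values.

```r
mu <- 100
sigma <- 15
k <- 20   # hypothetical values

Pk <- pnorm(mu - k, mean = mu, sd = sigma)            # P(X < mu - k)
upper <- pnorm(mu + k, mean = mu, sd = sigma,
               lower.tail = FALSE)                    # P(X > mu + k), equal to Pk by symmetry
between <- pnorm(mu + k, mean = mu, sd = sigma) -
  pnorm(mu - k, mean = mu, sd = sigma)                # P(mu - k < X < mu + k)

print(c(Pk = Pk, upper = upper, between = between, one_minus_2Pk = 1 - 2 * Pk))
```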
Z-scores
• The z-score is a standard normal variable: it follows a normal distribution
with mean zero and unit standard deviation
• The z-score transforms a normally distributed variable with mean μ
and SD σ into a variable that follows the standard normal distribution

• Z ~ N(0,1)
• When we standardize by finding z-scores, we change the normal
distribution by moving the location (mean moves to zero) and changing the
scale (SD moves to 1)
• Check Workshop!
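The standardization is $Z = (X - \mu)/\sigma$. As a sketch with hypothetical values (μ = 100, σ = 15), the code below computes z-scores and shows that tail probabilities are the same on the original and the standardized scale.

```r
mu <- 100
sigma <- 15            # hypothetical parameters
x <- c(70, 100, 130)   # hypothetical observations

z <- (x - mu) / sigma                          # z-scores
p_original <- pnorm(x, mean = mu, sd = sigma)  # P(X < x) on the original scale
p_standard <- pnorm(z)                         # P(Z < z) on the standard normal scale

print(data.frame(x = x, z = z, p_original = p_original, p_standard = p_standard))
```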
Quantile-Quantile plot (Q-Q plot)
• A Q-Q plot compares two probability distributions by
plotting their quantiles against each other.
• Many statistical methods are developed under a normality assumption
• A Q-Q plot used to check normality is called a normal Q-Q plot
• Suppose we obtain data and plan to use a statistical method that assumes normality
• We need to check whether the method can be applied to our data.
• Try a normal Q-Q plot, which is a scatterplot of the quantiles of the data against those of the normal
distribution (theoretical)
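In R, `qqnorm()` draws the normal Q-Q plot and `qqline()` adds a reference line. The sketch below uses simulated data (an assumption, not the lecture's data) to contrast a normal sample, whose points lie close to the line, with a right-skewed sample, whose points curve away from it.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

normal_sample <- rnorm(200, mean = 50, sd = 10)   # simulated normal data
skewed_sample <- rexp(200, rate = 1)              # simulated right-skewed data

op <- par(mfrow = c(1, 2))                        # two plots side by side
qq1 <- qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)
qq2 <- qqnorm(skewed_sample, main = "Skewed sample")
qqline(skewed_sample)
par(op)                                           # restore plotting settings
```

`qqnorm()` also returns the plotted coordinates (theoretical quantiles in `x`, sample quantiles in `y`), which is handy for checking the pattern numerically.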
Example: Annual Precipitation in US Cities
The average amount of
rainfall in inches for
each of 70 US cities
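The description matches R's built-in `precip` data set (average precipitation for 70 US cities). Assuming that is the data meant here, which the slide does not confirm, a normal Q-Q plot can be sketched as:

```r
# Assumption: the example refers to the built-in `precip` data set
n_cities <- length(precip)   # number of cities in the data set
print(n_cities)
print(summary(precip))

qqnorm(precip, main = "Normal Q-Q plot: annual precipitation")
qqline(precip)
```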
