W4 Lecture4

Biostatistics
Lecture 4
Theoretical Probability Distributions
2022-2 Fall Semester
Instructor: Min Jin Ha

Department of Health Informatics and Biostatistics
Graduate School of Public Health
Yonsei University
Reading
• Pagano and Gauvreau, Chapter 7.1, 7.2 and 7.4
• Credits:
• CMU Open Learning Initiative: Probability & Statistics v5.0
Random Variable
• Any characteristic that can be measured or categorized is called a variable
• If a variable can assume a number of different values such that any
particular outcome is determined by chance, it is a random variable
• Ex. Serum cholesterol level of a 25- to 34-year-old male in the US
• Random variables are typically represented by upper case letters such as X, Y
and Z.
• A discrete random variable can assume only a finite or countable number
of outcomes (e.g., X takes one of the values x1, ..., x100)
• Ex. Marital status, gender, the number of pregnancies
• A continuous random variable can take on any value within a specified
interval
• Ex. Weight, Height, Serum cholesterol level
Probability Distribution
• Every random variable has a corresponding probability distribution to
describe the behavior of the random variable
• We need a probability distribution to make statements about how likely
an event may be
• Empirical Distributions: Revisit Lecture 2
Discrete Case
• Research Question: Perception of their own body among college
students in Korea
• Survey Question: Do you feel you are overweight, underweight, or
about right?
• From a random sample of the target population (i.e., college students
in Korea), we obtain categorical responses.
Category      Frequency   Relative Frequency
About right   855         (855/1200) × 100 = 71.3%
Overweight    235         (235/1200) × 100 = 19.6%
Underweight   110         (110/1200) × 100 = 9.2%
Total         n = 1200    100%
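The relative frequencies in the table can be reproduced with a few lines of R. This is a minimal sketch; the variable names (counts, rel_freq) are illustrative and not part of the lecture.

```r
# Counts taken from the survey table above
counts <- c("About right" = 855, "Overweight" = 235, "Underweight" = 110)

n <- sum(counts)          # total sample size
rel_freq <- counts / n    # empirical probability of each category

# Show the relative frequencies as percentages with two decimals
round(100 * rel_freq, 2)
```

The relative frequencies sum to 1, which is what lets them serve as an empirical probability distribution.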
Continuous Case empirical distribution
• Pima Indian Women

• A population of women who were at least 21 years old, of Pima Indian
heritage and living near Phoenix, Arizona
• Interested in BMI
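R code:
library(MASS)  # provides the Pima.tr data set
# Density-scaled histogram of BMI with a kernel density estimate overlaid
hist(Pima.tr$bmi, col = "coral", main = "BMI of Pima Indian Women", xlab = "BMI", border = "burlywood", breaks = 30, freq = FALSE)
lines(density(Pima.tr$bmi, adjust = 0.7), col = "burlywood", lwd = 3)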
Theoretical Distributions
• Probabilities that are calculated from a finite amount of data are
called empirical probabilities
• Probability distributions can also be derived from theoretical
considerations; these are called theoretical probability distributions
Bernoulli Distribution
• Consider a dichotomous (two-level) random variable Y.
• By definition, Y must assume one of two possible values:
• Failure or success
• Dead or alive
• Male or Female
• Current smoker or not
• Heads or tails (coin flip)
• A random variable of this type is known as a Bernoulli random
variable, and we describe the probability of response using the
parameter 𝜋
Bernoulli Random Variable
• Often coded so that 𝑌 = 1 is called an event or success, and 𝑌 = 0 is
called a failure
• 𝜋 is defined as the probability of success, 𝜋 = 𝑃(𝑌 = 1)
• Coin flip: let 𝑌 = 1 if heads and 𝑌 = 0 if tails, then 𝜋 = 0.5 = 𝑃(𝑌 = 0)
• Gender at birth in US: let 𝑌 = 1 if male and 𝑌 = 0 if female, then 𝜋 = 0.512
and 𝑃(𝑌 = 0) = 1 − 𝜋 = 0.488
Bernoulli Distribution
• Y takes value 1 with probability 𝜋 and 0 with probability 1 − 𝜋
• $P(Y = y) = \pi^{y} (1 - \pi)^{1 - y}$
• Calculate 𝑃(𝑌 = 0) and 𝑃(𝑌 = 1)
• We want to extend this to a more complex setting: in a randomly
selected group of 3 students, how surprising would it be to get 2
smokers?
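As a small worked example of the Bernoulli formula, the sketch below evaluates P(Y = 0) and P(Y = 1) using 𝜋 = 0.512 from the gender-at-birth example. The helper bernoulli_pmf is a hypothetical name introduced here for illustration only.

```r
# Bernoulli pmf: P(Y = y) = pi^y * (1 - pi)^(1 - y), for y in {0, 1}
# bernoulli_pmf is a hypothetical helper, not a base R function
bernoulli_pmf <- function(y, prob) prob^y * (1 - prob)^(1 - y)

prob <- 0.512                  # pi from the gender-at-birth example
p0 <- bernoulli_pmf(0, prob)   # P(Y = 0) = 1 - pi
p1 <- bernoulli_pmf(1, prob)   # P(Y = 1) = pi
c(p0, p1)

# A Bernoulli variable is a binomial with a single trial, so dbinom agrees
dbinom(c(0, 1), size = 1, prob = prob)
```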
Case Study: Smoking
• Suppose: it is reported that roughly 20% of Korean adults are smokers
• 𝜋 = 𝑃 𝑌 = 1 = 0.2
• Now suppose we randomly select two adults from the Korean adult population and let
a new random variable 𝑋 represent the number of smokers: the
possible values of 𝑋 are 0, 1, or 2. Assume the two people are
independent (so we can use the multiplicative rule)
Case Study: Smoking
• The probability distribution of number of smokers out of two people
is given by
𝑃(𝑋 = 0) = 0.64, 𝑃(𝑋 = 1) = 0.32, 𝑃(𝑋 = 2) = 0.04
• Interpretation: if we randomly sample two people from the Korean
population, the probability that both are smokers is 4%. The
probability that both are nonsmokers is 64%. The probability that
exactly one is a smoker is 32%.
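These three probabilities follow from the multiplicative rule for independent events. A minimal sketch of the arithmetic, with illustrative variable names:

```r
prob <- 0.2   # P(smoker), from the case study

# Multiplicative rule for two independent people
p_none <- (1 - prob) * (1 - prob)                # nonsmoker, nonsmoker
p_one  <- prob * (1 - prob) + (1 - prob) * prob  # smoker first, or smoker second
p_both <- prob * prob                            # smoker, smoker

dist2 <- c("0" = p_none, "1" = p_one, "2" = p_both)
dist2
```

Note that X = 1 can happen in two orders, which is why its probability is twice 0.2 × 0.8.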
Case Study: Smoking
• If we randomly sample 3 people, what is the chance all 3 are smokers?
Case Study: Smoking
• The probability distribution of number of smokers out of three people
is given by
𝑃(𝑋 = 0) = 0.512, 𝑃(𝑋 = 1) = 0.384, 𝑃(𝑋 = 2) = 0.096, 𝑃(𝑋 = 3) = 0.008
• If we randomly sample 4 people, what is the chance all 4 are smokers?
• This is getting ridiculous, now we need a formula!
• We can use the binomial distribution to help determine this probability
Binomial Distribution
• The binomial distribution is used to give us the probability of 𝑋 ‘successes’
from a sequence of 𝑛 independent Bernoulli trials.
• In our example, each person would be an independent Bernoulli trial
(either a smoker or not)
• This distribution involves three assumptions
• There is a fixed number of Bernoulli trials, 𝑛, each of which results in one of two
mutually exclusive outcomes
• The outcomes of the 𝑛 trials are independent
• The probability of success 𝜋 is the same for each trial
• The distribution is
$P(X = x) = \binom{n}{x} \pi^{x} (1 - \pi)^{n - x}, \quad x = 0, 1, \ldots, n,$
and has mean 𝑛𝜋 and variance 𝑛𝜋(1 − 𝜋)
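The binomial formula can be checked against the smoking case study. The sketch below, with illustrative variable names, evaluates the formula for n = 3 and 𝜋 = 0.2, compares it with R's built-in dbinom, verifies the mean and variance numerically, and answers the four-person question.

```r
n <- 3        # number of people sampled
prob <- 0.2   # P(smoker), from the case study
x <- 0:n      # possible numbers of smokers

# Binomial pmf written out from the formula
pmf_formula <- choose(n, x) * prob^x * (1 - prob)^(n - x)

# Same probabilities from the built-in density function
pmf_builtin <- dbinom(x, size = n, prob = prob)

# Mean and variance computed directly from the pmf
pmf_mean <- sum(x * pmf_formula)                  # should equal n * prob
pmf_var  <- sum((x - pmf_mean)^2 * pmf_formula)   # should equal n * prob * (1 - prob)

# Chance that all 4 of 4 sampled people are smokers
p_all_four <- dbinom(4, size = 4, prob = prob)
```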
Math
• $n! = n(n-1)(n-2)\cdots(3)(2)(1)$ is 𝑛 factorial; it gives the
number of ways in which 𝑛 individuals can be
ordered (𝑛 choices for the 1st, 𝑛 − 1 choices for the 2nd, ...)
• By definition, 0! = 1
• $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ is the combination of 𝑛 objects chosen 𝑥 at a time: the
number of ways in which 𝑥 objects can be selected from a total of 𝑛
objects regardless of order
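Base R provides factorial() and choose() for these quantities, and combn() lists the selections themselves. A brief sketch using n = 3 and x = 2 (two smokers among three people); the variable names are illustrative.

```r
n <- 3   # total number of objects
x <- 2   # number selected

n_fact <- factorial(n)    # number of orderings of n objects
zero_fact <- factorial(0) # defined as 1

# Combination from the factorial formula, and from the built-in
ways_formula <- factorial(n) / (factorial(x) * factorial(n - x))
ways_builtin <- choose(n, x)

# Enumerate the selections: each column is one way to pick x of the n objects
selections <- combn(n, x)
selections
```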
Binomial Distribution
• $X \sim \mathrm{Binomial}(n, \pi) \Leftrightarrow P(X = x) = \binom{n}{x} \pi^{x} (1 - \pi)^{n - x}$
• $\pi^{x} (1 - \pi)^{n - x}$ accounts for the probability of two smokers (𝑥 = 2) in one particular order
• $\binom{n}{x}$ accounts for all the possible ways in which we have two smokers regardless of order
[Figure: probability distribution of Binomial(10, 0.2); x-axis: No. of successes (0 to 10); y-axis: Density (0.00 to 0.30)]
Exercise
R Workshop
• Random number generation
• Calculate a density given a value
• Calculate right tail and left tail areas
• Calculate quantiles
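The four workshop tasks map onto R's r/d/p/q function family. The following is a sketch for Binomial(10, 0.2), not the workshop's own code; the seed and the specific values queried are arbitrary choices made for illustration.

```r
set.seed(1)   # arbitrary seed so the random draws are reproducible
n <- 10
prob <- 0.2

# Random number generation: five draws from Binomial(10, 0.2)
draws <- rbinom(5, size = n, prob = prob)

# Density at a given value: P(X = 2)
dens_2 <- dbinom(2, size = n, prob = prob)

# Left tail P(X <= 2) and right tail P(X > 2)
left_2  <- pbinom(2, size = n, prob = prob)
right_2 <- pbinom(2, size = n, prob = prob, lower.tail = FALSE)

# Quantile: smallest x such that P(X <= x) >= 0.5
med <- qbinom(0.5, size = n, prob = prob)
```

For a discrete distribution the left tail from pbinom includes the value itself, so the right tail with lower.tail = FALSE is strictly greater than that value.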
Continuous Distribution
• We discussed discrete random variables
• As we move to discussion of continuous random variables, we will consider
the distribution of a continuous random variable 𝑋
• Suppose 𝑋 represents height
• An individual exactly 163 cm tall is rare
• Theoretically, 𝑋 can assume an infinite number of intermediate values,
such as 163.0001 cm or 163.01 cm
• In reality we measure only discrete values due to the limitations of our
measuring instruments
• As a result, the distribution of a continuous random variable is represented
by a smooth curve, called a density function
Continuous Distributions
• A continuous distribution describes the probabilities of possible
values of a continuous random variable (infinite and uncountable)
• Density functions/curves, like histograms, can have any shape. The
area under the density curve is always 1.
• How do you find the area of interest in the curves?
• Integration!
• Empirically observed frequency (count the number of values observed) vs. analytical probability density (area under the curve)
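Areas under a density curve can be computed numerically with R's integrate(). The sketch below uses the standard normal density dnorm as one example of a density curve; that choice is an assumption made for illustration, and any valid density function would work the same way.

```r
# Total area under the density curve (should be 1 up to numerical error)
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Area of interest: probability that the variable falls between 0 and 1
area_0_1 <- integrate(dnorm, lower = 0, upper = 1)$value

# The same area from the cumulative distribution function
area_0_1_cdf <- pnorm(1) - pnorm(0)
```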
Normal Distributions
• For the normal distribution, the parameters are:
• μ = mean of X
• σ = standard deviation of X
• Also called the Gaussian distribution
• Symmetric bell curve
• The normal is symmetric and centered on μ
• μ affects the position of the center of the curve
• σ affects the width of the curve
[Figure: normal density curves centered at μ with σ = 1, σ = 2, and σ = 3]
• The SD measures the distance
from the mean to the point of
inflection
• About 68% of the data fall
within 1 SD of the mean.
• About 5% of the data are further
than 2 SD from the mean (about
2.5% in each tail)
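These percentages can be checked with pnorm() on the standard normal distribution; because they are stated in SD units, the same values hold for any μ and σ. A short sketch with illustrative variable names:

```r
# Probability within 1 SD of the mean
within_1sd <- pnorm(1) - pnorm(-1)

# Probability within 2 SD of the mean
within_2sd <- pnorm(2) - pnorm(-2)

# Probability further than 2 SD from the mean, both tails combined
beyond_2sd <- 1 - within_2sd

# Probability in a single tail beyond 2 SD
each_tail <- pnorm(-2)
```

The combined two-tail probability beyond 2 SD is roughly 5%, so each tail separately holds roughly half of that.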
Finding tail probabilities (P given x)
P(X < x1) = …    P(X < x2) = …
In R, pnorm(q = x1, mean = …, sd = …)
Input: quantile and parameters
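A concrete call may help. The mean of 170 and SD of 10 below are hypothetical values chosen only for illustration (they are not from the lecture), as is the cut-off x1 = 160.

```r
mu <- 170     # hypothetical mean
sigma <- 10   # hypothetical standard deviation
x1 <- 160     # hypothetical quantile of interest

# Left tail: P(X < x1)
left_tail <- pnorm(q = x1, mean = mu, sd = sigma)

# Right tail: P(X > x1), requested with lower.tail = FALSE
right_tail <- pnorm(q = x1, mean = mu, sd = sigma, lower.tail = FALSE)
```

Note that the first argument of pnorm is named q, so the quantile is passed as q = x1.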
Finding quantiles (x given P)
P(X < …) = P1, where P1 is the area to the left of the unknown quantile
In R, qnorm(p = P1, mean = …, sd = …)
Input: left tail probability and parameters
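The sketch below finds the quantile for a left tail probability of 0.975 and confirms that pnorm() inverts it. The mean, SD, and probability are hypothetical values chosen for illustration.

```r
mu <- 170     # hypothetical mean
sigma <- 10   # hypothetical standard deviation
P1 <- 0.975   # hypothetical left tail probability

# Quantile x such that P(X < x) = P1
x_q <- qnorm(p = P1, mean = mu, sd = sigma)

# pnorm is the inverse operation: it recovers P1 from the quantile
p_back <- pnorm(q = x_q, mean = mu, sd = sigma)
```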
Use the property of symmetry
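P(X < μ − k) = Pk
P(X > μ + k) = Pk
P(X < μ + k) = 1 − Pk
[Figure: three normal curves centered at μ, shading the left tail below μ − k (area Pk), the right tail above μ + k (area Pk), and the area below μ + k (area 1 − Pk)]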
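The symmetry relations can be verified numerically. The values of μ, σ, and k below are hypothetical, chosen only to make the check concrete.

```r
mu <- 170     # hypothetical mean
sigma <- 10   # hypothetical standard deviation
k <- 15       # hypothetical distance from the mean

# Left tail below mu - k
Pk <- pnorm(mu - k, mean = mu, sd = sigma)

# Right tail above mu + k (equal to Pk by symmetry)
upper_tail <- pnorm(mu + k, mean = mu, sd = sigma, lower.tail = FALSE)

# Area below mu + k (equal to 1 - Pk)
below_upper <- pnorm(mu + k, mean = mu, sd = sigma)
```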

Probabilities between two values
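The probability between two values a and b is the difference of two left tail areas, P(a < X < b) = P(X < b) − P(X < a). A minimal sketch, with hypothetical values for μ, σ, a, and b:

```r
mu <- 170     # hypothetical mean
sigma <- 10   # hypothetical standard deviation
a <- 160      # hypothetical lower bound
b <- 185      # hypothetical upper bound

# P(a < X < b) as a difference of two left tail probabilities
p_between <- pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
```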
Z-scores
• The z-score is a standard normal variable, following a normal distribution
with mean zero and unit standard deviation
• The z-score is used to transform normally distributed variables with mean μ
and SD σ into a variable that follows the standard normal distribution
• Z ~ N(0,1)
• When we standardize by finding z-scores, we change the normal
distribution by moving the location (mean moves to zero) and changing the
scale (SD moves to 1)
• Check Workshop!
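The standardization is Z = (X − μ)/σ. The sketch below is not the workshop's code; it uses hypothetical values of μ and σ, shows that a probability is unchanged by standardizing, and standardizes a simulated sample with an arbitrary seed.

```r
mu <- 170     # hypothetical mean
sigma <- 10   # hypothetical standard deviation
x <- 185      # hypothetical observed value

# z-score: number of SDs that x lies above the mean
z <- (x - mu) / sigma

# The left tail probability is the same on either scale
p_original <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)   # standard normal: mean 0, sd 1

# Standardizing a simulated sample moves its mean to 0 and its SD to 1
set.seed(1)   # arbitrary seed for reproducibility
sample_x <- rnorm(1000, mean = mu, sd = sigma)
sample_z <- (sample_x - mean(sample_x)) / sd(sample_x)
```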
Quantile-Quantile plot (Q-Q plot)
• A Q-Q plot is designed to compare two probability distributions by
plotting their quantiles against each other.
• Many statistical methods are developed under a normality assumption
• A Q-Q plot used to check normality is called a normal Q-Q plot
• Suppose we obtain data and plan to use a statistical method that assumes
normality
• We need to check whether the method can be applied to our data.
• Try a normal Q-Q plot, which is a scatterplot of the quantiles of the data against the
theoretical quantiles of the normal distribution
Example: Annual Precipitation in US Cities
The average amount of
rainfall in inches for
each of the 70 cities
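A normal Q-Q plot for this example might be drawn as follows. The sketch assumes the slide refers to the precip data set that ships with R's datasets package (70 average annual precipitation values); the lecture does not name the data set, so treat that as an assumption. The plot title and colors are arbitrary.

```r
# 'precip' is assumed to be the data behind this example (see lead-in)
n_obs <- length(precip)

# Sample quantiles (y) against theoretical normal quantiles (x)
qq <- qqnorm(precip, main = "Normal Q-Q plot: annual precipitation")

# Reference line through the first and third quartiles
qqline(precip, col = "coral", lwd = 2)
```

Points that follow the reference line closely suggest the normality assumption is reasonable; systematic curvature at the ends suggests skewness or heavy tails.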
