
Unit 1: Random variables and probability distributions

PUBHLTH 490Z (Thanks to Professor David Harrington and Julie Vu for course materials)

August 25, 2020

Random Variables

The Normal distribution

The Binomial distribution

The Poisson distribution

Random Variables

Reading

• Definition of random variables (Section 3.1 in OI Biostat)
• Binomial, Normal and Poisson random variables (Sections 3.2-3.4)
• Distributions related to Bernoulli trials (Section 3.5)
• Distributions of pairs of random variables (Section 3.6)

Types of random variables

• Discrete (examples include Binomial, Poisson)
• Continuous (examples include Normal, t)

Discrete random variables

A random variable assigns numerical values to the outcome of a random phenomenon, and is usually written with a capital letter such as X, Y, or Z.
A discrete random variable takes on a finite or countably infinite number of values.
Suppose X is the number of heads in 3 tosses of a coin.

Distribution of discrete random variables

The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.
Usually recorded in a table and/or in a graph.

i            1     2     3     4     Total
x_i          0     1     2     3     –
P(X = x_i)   1/8   3/8   3/8   1/8   8/8 = 1.00

Bar graph showing a distribution
[Figure: bar graph of the distribution; horizontal axis shows the values 0 to 3, vertical axis shows Probabilities from 0.0 to 0.4]

Distributions

For a discrete random variable, a probability distribution is a table (often shown as a graph) of all disjoint outcomes and their associated probabilities.
For example, counting the number of heads, X, in 4 tosses of a fair coin:

• S = {HHHH, HHHT, HHTH, ..., TTTT}
• P(HHHH) = P(HHHT) = ... = P(TTTT) = 1/16
• X is a discrete random variable with 5 possible values {0, 1, 2, 3, 4}
• Its probability distribution is:

Value         0      1      2      3      4
Probability   1/16   4/16   6/16   4/16   1/16

If the variable X records the number of heads, then X is a random variable and the graph shows its distribution.
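
The table for 4 tosses can be checked by listing the sample space directly. The following is a minimal sketch, not part of the original slides; the object names (`tosses`, `heads`, `dist_x`) are chosen here for illustration.

```r
# Enumerate all 2^4 = 16 equally likely outcomes of 4 tosses of a fair coin
tosses <- expand.grid(t1 = c("H", "T"), t2 = c("H", "T"),
                      t3 = c("H", "T"), t4 = c("H", "T"))

# X = number of heads in each outcome
heads <- rowSums(tosses == "H")

# Each outcome has probability 1/16, so P(X = x) = (number of outcomes with x heads) / 16
dist_x <- table(heads) / nrow(tosses)
print(dist_x)
```

The resulting proportions should match the row of probabilities 1/16, 4/16, 6/16, 4/16, 1/16.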

Using a simulation to construct a probability distribution

Distributions of random variables that arise in science can be more complex.
In lecture, we will explore the R simulation used to create Figure 3.4 in OI Biostat.
The graph shows the distribution of the number of good responses in a clinical trial with 20 participants, under the assumption that the probability of a good response is 0.20.
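
A simulation of this kind might look like the sketch below. It is not the code behind Figure 3.4 in OI Biostat; the seed, the number of simulated trials, and all object names are assumptions made here for illustration.

```r
set.seed(2020)          # arbitrary seed, only so the run is reproducible
n_participants <- 20    # participants per simulated trial
p_good <- 0.20          # assumed probability of a good response
n_sims <- 10000         # number of simulated trials (an arbitrary choice)

# For each simulated trial, draw 20 responses (1 = good, 0 = not good) and count the 1s
good_counts <- replicate(n_sims,
  sum(sample(c(1, 0), size = n_participants, replace = TRUE,
             prob = c(p_good, 1 - p_good))))

# Relative frequencies approximate P(X = x)
sim_dist <- table(good_counts) / n_sims
print(round(sim_dist, 3))
```

Calling `barplot(sim_dist)` afterwards would display the simulated distribution as a bar graph.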

Expectation of a discrete random variable

If X has outcomes $x_1, \dots, x_k$ with probabilities $P(X = x_1), \dots, P(X = x_k)$, the expected value of X is the sum of each outcome multiplied by its corresponding probability:

$$E(X) = x_1 P(X = x_1) + \cdots + x_k P(X = x_k) = \sum_{i=1}^{k} x_i P(X = x_i)$$

The Greek letter µ may be used in place of the notation E(X) and is sometimes written $\mu_X$.

Simple example

Let X be the number of heads in 3 tosses of a coin. Then

$$\begin{aligned}
E(X) &= 0\,P(X = 0) + 1\,P(X = 1) + 2\,P(X = 2) + 3\,P(X = 3) \\
     &= (0)(1/8) + (1)(3/8) + (2)(3/8) + (3)(1/8) \\
     &= 12/8 \\
     &= 1.5
\end{aligned}$$
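
The same sum can be computed in R. This is a small illustrative sketch; the vector names `x` and `p` are chosen here and do not come from the slides.

```r
x <- 0:3                  # possible numbers of heads in 3 tosses
p <- c(1, 3, 3, 1) / 8    # P(X = x) for a fair coin

# Expected value: each outcome multiplied by its probability, then summed
mu <- sum(x * p)
mu
```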

Variance and SD of a discrete random variable

If X takes on outcomes $x_1, \dots, x_k$ with probabilities $P(X = x_1), \dots, P(X = x_k)$ and expected value $\mu = E(X)$, then the variance of X, denoted by Var(X) or the symbol $\sigma^2$, is

$$\sigma^2 = (x_1 - \mu)^2 P(X = x_1) + \cdots + (x_k - \mu)^2 P(X = x_k) = \sum_{j=1}^{k} (x_j - \mu)^2 P(X = x_j)$$

The standard deviation (sd) of X, labeled σ, is the square root of the variance. It is sometimes written $\sigma_X$.

Coin tossing again

Let X be the number of heads in 3 tosses of a coin. Then

$$\begin{aligned}
\sigma_X^2 &= (x_1 - \mu_X)^2 P(X = x_1) + \cdots + (x_4 - \mu_X)^2 P(X = x_4) \\
           &= (0 - 1.5)^2 (1/8) + (1 - 1.5)^2 (3/8) + (2 - 1.5)^2 (3/8) + (3 - 1.5)^2 (1/8) \\
           &= 3/4 = 0.75
\end{aligned}$$

The standard deviation is $\sqrt{3/4} = \sqrt{3}/2 \approx 0.866$.
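
As a check on the arithmetic, the variance and standard deviation can be computed from the definition. This is an illustrative sketch with variable names chosen here.

```r
x <- 0:3                  # possible numbers of heads in 3 tosses
p <- c(1, 3, 3, 1) / 8    # P(X = x) for a fair coin

mu <- sum(x * p)                  # expected value
sigma2 <- sum((x - mu)^2 * p)     # variance: squared deviations weighted by probability
sigma <- sqrt(sigma2)             # standard deviation

c(variance = sigma2, sd = sigma)
```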

Continuous random variables

• Continuous random variables are random variables that can take infinitely many values in an interval.
• Examples: Height, Weight, BMI, etc.
• Continuous random variables are characterized by a probability density function (pdf), that gives the relative likelihood of any outcome in a continuum occurring.
• Analogous definitions for the mean (or expectation, expected value), variance and standard deviation exist.

The Normal distribution

Continuous distributions

We will discuss specific discrete distributions that arise in medicine and biology later.
Begin with

• Concept of continuous distribution: the distribution for a variable that can take on all values in a specified range.
• The normal distribution, perhaps the most important continuous distribution in statistics.

Probabilities for continuous distributions
Two important features of continuous distributions

• The total area under the density curve is 1
• The probability that a variable has a value within a specified interval is the area under the curve over that interval

[Figure: density curve for height (cm); horizontal axis from 140 to 200]
The normal distribution
• About 68% of the data are within 1 SD of the mean
• About 95% of the data are within 2 SDs of the mean
• About 99.7% of the data are within 3 SDs of the mean

[Figure: normal density curve with the central 68%, 95%, and 99.7% regions marked; horizontal axis from µ − 3σ to µ + 3σ]
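
These percentages are rounded areas under the standard normal curve and can be recovered with `pnorm()`. A minimal sketch, with object names chosen here:

```r
k <- 1:3   # number of standard deviations from the mean

# Area within k SDs of the mean: P(-k <= Z <= k) for a standard normal Z
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
```

The results are approximately 0.6827, 0.9545, and 0.9973, which round to the 68%, 95%, and 99.7% figures.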

A normal example
The distributions of test scores on the SAT and the ACT are both nearly normal.
Suppose that one student scores an 1800 on the SAT (Student A) and another student scores a 24 on the ACT (Student B). Which student performed better?

[Figure: two normal curves. SAT scores, horizontal axis from 900 to 2100, labeled Student A; ACT scores, horizontal axis from 11 to 31, labeled Student B]
A normal example . . .

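• SAT scores are N(1500, 300). ACT scores are N(21, 5). Here N(µ, σ) lists the mean and the standard deviation.
• $x_A$ represents the score of Student A; $x_B$ represents the score of Student B.

$$Z_A = \frac{x_A - \mu_{SAT}}{\sigma_{SAT}} = \frac{1800 - 1500}{300} = 1$$

$$Z_B = \frac{x_B - \mu_{ACT}}{\sigma_{ACT}} = \frac{24 - 21}{5} = 0.6$$

Since $Z_A > Z_B$, Student A scored higher relative to other test takers.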
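
The two Z-scores can be computed with a short helper. The function `z_score()` below is a hypothetical name introduced for this sketch, not a built-in R function.

```r
# Hypothetical helper: number of standard deviations a score lies above the mean
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_a <- z_score(1800, mu = 1500, sigma = 300)   # Student A, SAT
z_b <- z_score(24, mu = 21, sigma = 5)         # Student B, ACT

c(Student_A = z_a, Student_B = z_b)
```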

Calculating normal probabilities (I)
What is the percentile rank for a student who scores an 1800 on the
SAT for a year in which the scores are N(1500, 300)?

1. Calculate a Z-score. If X is a normal random variable with mean µ and standard deviation σ,

$$Z = \frac{X - \mu}{\sigma}$$

is a standard normal (mean µ = 0, standard deviation σ = 1). Here Z = (1800 − 1500)/300 = 1.
2. Calculate the normal probability.
• pnorm(z) calculates the area (i.e., probability) to the left of z

pnorm(1)

## [1] 0.8413447
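
A probability of about 0.841 places the student at roughly the 84th percentile. As an alternative sketch, `pnorm()` also accepts `mean` and `sd` arguments, so the standardization step can be skipped; the object names below are chosen for illustration.

```r
# Area to the left of 1800 under a normal curve with mean 1500 and SD 300
p_direct <- pnorm(1800, mean = 1500, sd = 300)

# Same area via the Z-score
p_via_z <- pnorm((1800 - 1500) / 300)

c(direct = p_direct, via_z = p_via_z)
```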

Calculating normal probabilities (II)

What score on the SAT would put a student in the 99th percentile?

1. Identify the Z-value. qnorm(p) calculates the value z such that for a standard normal variable Z, p = P(Z ≤ z).

qnorm(0.99)

## [1] 2.326348

Calculating normal probabilities (II). . .

2. Calculate the score, X. If Z has a standard normal distribution, then

$$X = \sigma Z + \mu$$

is normal with mean µ and standard deviation σ.

$$X = \sigma Z + \mu = 300(2.33) + 1500 = 2199$$
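
The back-transformation can also be done in R, either by hand or by passing `mean` and `sd` to `qnorm()`. This is an illustrative sketch; the object names are chosen here.

```r
z_99 <- qnorm(0.99)                 # 99th percentile of the standard normal

# Convert back to the SAT scale: X = sigma * Z + mu
score_manual <- 300 * z_99 + 1500

# Equivalent one-step calculation
score_direct <- qnorm(0.99, mean = 1500, sd = 300)

c(manual = score_manual, direct = score_direct)
```

Both give a score of about 2198; the value 2199 above comes from rounding the Z-value to 2.33 before multiplying.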

The Binomial distribution

Binomial random variables (3.2 in OI Biostat)

A binomial random variable, X, is the number of successes in n independent replications of an experiment, where

• Each replication has outcome either success or failure (1 or 0)
• The probability of success in each replicate is a constant p

A binomial variable takes on values (0, 1, 2, . . . , n)

The number of heads in 3 tosses of a coin is a binomial variable with n = 3 and p = 0.5

The binomial coefficient

The binomial coefficient $\binom{n}{x}$ is the number of ways to choose x items from a set of size n, where the order of the choice is ignored. Mathematically,

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

• n = 1, 2, . . .
• x = 0, 1, 2, . . . , n
• For any positive integer m, m! = (m)(m − 1)(m − 2) · · · (1); by convention, 0! = 1
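
In R, `choose(n, x)` returns the binomial coefficient and `factorial(m)` returns m!. The sketch below compares the two for n = 3 and x = 1, values picked here to match the coin example.

```r
n <- 3
x <- 1

# Built-in binomial coefficient
coef_builtin <- choose(n, x)

# Same quantity from the factorial formula n! / (x! (n - x)!)
coef_formula <- factorial(n) / (factorial(x) * factorial(n - x))

c(builtin = coef_builtin, formula = coef_formula)
```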

Formula for the binomial distribution

Let x = number of successes in n trials

$$P(x \text{ successes}) = \binom{\#\text{ of trials}}{\#\text{ of successes}}\, p^{\#\text{ of successes}}\, (1 - p)^{\#\text{ of trials} \,-\, \#\text{ of successes}}$$

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \dots, n$$

Parameters of the distribution:

• n = number of trials
• p = probability of success
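
To see the formula at work, it can be coded directly and compared with R's `dbinom()`. The function `binom_pmf()` is a hypothetical helper written for this sketch.

```r
# Hypothetical helper implementing P(X = x) = choose(n, x) * p^x * (1 - p)^(n - x)
binom_pmf <- function(x, n, p) choose(n, x) * p^x * (1 - p)^(n - x)

x <- 0:3
manual <- binom_pmf(x, n = 3, p = 0.5)     # formula by hand
builtin <- dbinom(x, size = 3, prob = 0.5) # R's binomial probabilities

rbind(manual, builtin)
```

For n = 3 and p = 0.5 both rows reproduce the coin-toss probabilities 1/8, 3/8, 3/8, 1/8.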

Mean and standard deviation for binomial random variable

For the binomial distribution with parameters n and p, it can be shown that:

• Mean = np
• Standard Deviation = $\sqrt{np(1 - p)}$

Not derived here or in the text; derivation will not be asked on a p-set or exam
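
Although the derivation is skipped, the formulas can be checked numerically against the general definitions of mean and variance. The sketch below uses n = 20 and p = 0.20, the clinical trial values from earlier; the object names are chosen here.

```r
n <- 20
p <- 0.20

x <- 0:n
probs <- dbinom(x, size = n, prob = p)    # full binomial distribution

# Mean and SD from the general definitions for a discrete random variable
mean_def <- sum(x * probs)
sd_def <- sqrt(sum((x - mean_def)^2 * probs))

# Mean and SD from the binomial shortcut formulas
mean_formula <- n * p
sd_formula <- sqrt(n * p * (1 - p))

c(mean_def = mean_def, mean_formula = mean_formula,
  sd_def = sd_def, sd_formula = sd_formula)
```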

Calculating binomial probabilities in R

Suppose X is a binomial random variable with n = the number of trials and p = the success probability for each trial.

• dbinom(a, n, p) = P(X = a)
• pbinom(a, n, p) = P(X ≤ a)
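
As a short usage sketch, the two functions can be applied to the coin example (n = 3, p = 0.5); the object names are chosen here for illustration.

```r
# P(X = 1): exactly one head in 3 tosses of a fair coin
p_exactly_1 <- dbinom(1, 3, 0.5)

# P(X <= 1): at most one head
p_at_most_1 <- pbinom(1, 3, 0.5)

# P(X >= 2): at least two heads, using the complement of P(X <= 1)
p_at_least_2 <- 1 - pbinom(1, 3, 0.5)

c(p_exactly_1, p_at_most_1, p_at_least_2)
```

These evaluate to 0.375, 0.5, and 0.5, in agreement with the table of probabilities 1/8, 3/8, 3/8, 1/8.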

Normal approximation to the binomial

Before computing was inexpensive and widely available, the normal distribution was sometimes used to approximate binomial probabilities.

• We will not cover the formula since you will always be able to
use R to calculate binomial probabilities.
• Will not be on an exam or quiz
• Read section 3.3.6, OI Biostat if you wish to know more details.

The Poisson distribution

Introduction to the Poisson distribution (Section 3.4 in OI Biostat)

The Poisson distribution is used to calculate probabilities for rare events that accumulate over time.
It is used most often in settings where events happen at a rate λ per unit of population and per unit time, such as the annual incidence of a disease in a population.

• Typical example: for children ages 0 - 14, the incidence rate of acute lymphocytic leukemia (ALL) was approximately 30 diagnosed cases per million children per year in 2010.
• Always take care to note and understand the units.

Example: Outbreaks of childhood leukemia

Fortunately, childhood cancers are rare.
For children ages 0 - 14, the incidence rate of acute lymphocytic leukemia (ALL) was approximately 30 diagnosed cases per million children per year in the decade from 2000 - 2010. Approximately 20% of the US population are in this age range.

• What is the incidence rate over a 5 year period?
• In a small city of 75,000 people, what is the probability of observing exactly 8 cases of ALL over a 5 year period?
• In the small city, what is the probability of observing 8 or more cases over a 5 year period?

Poisson Distribution
Suppose events occur over time in such a way that

1. The probability an event occurs in an interval is proportional to the length of the interval.
2. Events occur independently at a rate λ per unit of time.

Then the probability of exactly x events in one unit of time is

$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \dots$$

The probability of exactly x events in t units of time is

$$P(X = x) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}, \quad x = 0, 1, 2, \dots$$

Derivation given in more theoretical courses, such as Stat 110.
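
The two formulas can be coded directly and compared with R's `dpois()`. The function `pois_pmf()` is a hypothetical helper written for this sketch, and the rate and time values are arbitrary choices for illustration.

```r
# Hypothetical helper: P(X = x) for events at rate lambda observed over t units of time
pois_pmf <- function(x, lambda, t = 1) {
  exp(-lambda * t) * (lambda * t)^x / factorial(x)
}

x <- 0:10
lambda <- 0.45   # illustrative rate per unit of time
t <- 5           # illustrative number of time units

manual <- pois_pmf(x, lambda, t)
builtin <- dpois(x, lambda = lambda * t)   # dpois takes the mean count, here lambda * t

round(rbind(manual, builtin), 4)
```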


Poisson distribution with λ = 2.25

[Figure: bar graph titled "Poisson Distribution"; horizontal axis x from 0 to 10, vertical axis Probability from 0.00 to 0.25]

Poisson mean and standard deviation

For the Poisson distribution modeling the number of events in one unit of time:

• The mean is λ.
• The standard deviation is $\sqrt{\lambda}$.

In t units of time, the mean and standard deviation are, respectively, λt and $\sqrt{\lambda t}$.
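
These values can be checked numerically from the general definitions. The sketch below uses λ = 2.25 and truncates the infinite sum at x = 100, an assumption that is harmless here because the remaining probabilities are negligible.

```r
lambda <- 2.25
x <- 0:100                      # truncation point chosen for this sketch
probs <- dpois(x, lambda)

# Mean and SD from the definitions for a discrete random variable
mean_def <- sum(x * probs)
sd_def <- sqrt(sum((x - mean_def)^2 * probs))

c(mean_def = mean_def, lambda = lambda,
  sd_def = sd_def, sqrt_lambda = sqrt(lambda))
```

With λ = 2.25 the standard deviation is 1.5.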

Childhood leukemia cases. Example 3.37, OI Biostat

The incidence rate of ALL in a year is 30 per 1,000,000 children:

• $30/1{,}000{,}000 = 0.00003 = 3 \times 10^{-5}$.

The incidence rate over a 5-year period is (5)(30) per 1,000,000 children:

• $150/1{,}000{,}000 = 0.00015 = 1.5 \times 10^{-4}$.

What about a city of size 75,000?

In a city of 75,000 people, approximately (75,000)(0.20) = 15,000 children will be age 0-14 (using the 20% figure from the earlier leukemia example).
The five-year rate of new cases for the city, which is the Poisson parameter λ, would be:

$$(1.5 \times 10^{-4})(15{,}000) = 2.25$$
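
The unit conversions leading to λ = 2.25 can be laid out step by step. This is an illustrative sketch; all object names are chosen here.

```r
rate_per_child_year <- 30 / 1e6       # 30 cases per million children per year
years <- 5
city_population <- 75000
prop_children <- 0.20                 # share of the population aged 0-14

rate_per_child_5yr <- rate_per_child_year * years   # 1.5e-4 per child over 5 years
n_children <- city_population * prop_children       # 15,000 children

# Expected number of cases in the city over 5 years
lambda <- rate_per_child_5yr * n_children
lambda
```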

What is the probability of 8 cases over 5 years?

$$P(X = 8) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-2.25} (2.25)^8}{8!}$$

Easiest to calculate this in R using the function dpois() rather than by hand. . .
Suppose X has a Poisson distribution with parameter λ.

• dpois(k, lambda) = P(X = k)

dpois(8, lambda = 2.25)

## [1] 0.001717027

What is the probability of 8 or more cases?
Would 8 or more cases be a rare event?
Suppose X has a Poisson distribution with parameter λ.

• ppois(k, lambda) = P(X ≤ k)


• ppois(k, lambda, lower.tail = FALSE) = P(X > k)

1 - ppois(7, lambda = 2.25)

## [1] 0.002267088

ppois(7, lambda = 2.25, lower.tail = FALSE)

## [1] 0.002267088
