Faculty of Applied Sciences Department of Mathematics and Physics
2021
Contents
1 Markov Chains with Biological Applications
1.1 Definition and Types of Stochastic Process
1.2 Definition and Properties of a Discrete-Time Markov Chain
1.3 Long Run Behaviour of a Markov Chain
1.4 Absorbing States and First Step Analysis
1.5 Biological Applications of Markov Chains
Textbooks
• The following sources were used in preparing these notes: Pinsky and Karlin (2011),
Allen (2008), Myers (2010), Fox (2016), Givens and Hoeting (2005)
3. Statistics 2A (linear regression, logistic regression)
◦ E.g. we flip a coin repeatedly and record 1 if we get Heads and 0 if we get Tails
◦ Stochastic is just another word for random
• The range of possible values for the random variables is known as the state space
(similar to sample space)
                               Random Variable
                    Discrete                        Continuous
Time   Discrete     Xt (e.g., Markov Chain)         Xt (e.g., time series)
       Continuous   X(t) (e.g., Poisson Process)    X(t) (e.g., Brownian Motion)
• Let A be an event.
• The event that A does not occur is called the complement of A and is written mathematically as Aᶜ (see Figure 1.1)
Figure 1.1: Illustration of Event Complements
• The event that at least one of A or B occurs is called the union of A and B and is
written mathematically as A ∪ B (see Figure 1.2)
• The event that both A and B occur is called the intersection of A and B and is
written mathematically as A ∩ B (see Figure 1.3)
Figure 1.3: Illustration of Intersection of Events
Pr (Aᶜ) = 1 − Pr (A)
• This is simply because either event A will happen or it will not happen; there are
no other possible outcomes
Pr (A) + Pr (Aᶜ) = 1
Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)
Pr (A ∪ B) = Pr (A) + Pr (B)   (if A and B are disjoint)
Figure 1.4: Illustration of Disjoint Events
Pr (A ∩ B) = Pr (A) × Pr (B)   (if A and B are independent)
Pr (A ∩ B) = Pr (A) × Pr (B|A)
• Or,
Pr (A ∩ B) = Pr (B) × Pr (A|B)
• The two ways of expressing the multiplicative rule for dependent events can be
rearranged as conditional probability formulas:
Pr (B|A) = Pr (A ∩ B) / Pr (A)

Pr (A|B) = Pr (A ∩ B) / Pr (B)
• The two ways of expressing the multiplicative rule for dependent events can also be set equal to each other to give Bayes' Theorem:

Pr (A|B) = Pr (A) Pr (B|A) / Pr (B)
Review of Probability Results: Probability Mass Functions
• If X is a discrete random variable and S is the set of possible values that X can
take on (with nonzero probability), then the function f (x), defined over the set of
all integers, is called the probability mass function of X if and only if
Pr (X = x) = f (x) if x ∈ S, and Pr (X = x) = 0 otherwise
Law of Total Probability
• The Law of Total Probability states that, for event A and for disjoint (mutually exclusive) events B1 , B2 , . . . , Bb that partition the whole sample space (i.e., Σ_{k=1}^{b} Pr (Bk ) = 1), we have

Pr (A) = Σ_{k=1}^{b} Pr (A ∩ Bk ) = Σ_{k=1}^{b} Pr (A|Bk ) Pr (Bk )
• The law of total probability also gives us another way of expressing Bayes' Theorem:
Pr (B1 |A) = Pr (B1 ) Pr (A|B1 ) / Pr (A)
           = Pr (B1 ) Pr (A|B1 ) / [Σ_{k=1}^{b} Pr (A|Bk ) Pr (Bk )]
• Suppose that you have three bags that each contain 10 balls. Bag 1 has 3 blue balls and 7 green balls. Bag 2 has 5 blue balls and 5 green balls. Bag 3 has 8 blue balls and 2 green balls. You choose a bag at random and then choose a ball from this bag at random. There is a 1/3 chance that you choose Bag 1, a 1/2 chance that you choose Bag 2, and a 1/6 chance that you choose Bag 3. What is the probability that
the chosen ball is blue?
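This calculation is easy to check numerically. A minimal sketch applying the law of total probability directly (the variable names are our own):

```python
import numpy as np

# Pr(blue) = sum over bags of Pr(blue | bag) * Pr(bag)
pr_bag = np.array([1/3, 1/2, 1/6])             # probability of choosing each bag
pr_blue_given_bag = np.array([0.3, 0.5, 0.8])  # blue balls / 10 balls per bag

pr_blue = float(np.sum(pr_blue_given_bag * pr_bag))
print(pr_blue)  # 0.48333... = 29/60
```

Each term of the sum is one branch of the tree of outcomes; adding them exhausts the partition {Bag 1, Bag 2, Bag 3}.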
• A Markov process {Xt } is a stochastic process with the property that, given the value of Xt , the values of Xs for s > t are not influenced by the values of Xu for u < t.
• In words: given the present state of the process, the future state of the process does
not depend on the past states of the process
• The state space of a Markov Chain, denoted S, is the set of all possible values that
the process random variable can take on
Transition Probabilities
• The probability of Xn+1 being in state j given that Xn is in state i is called the one-step transition probability and is denoted by Pij^(n,n+1) . That is:

Pij^(n,n+1) = Pr (Xn+1 = j|Xn = i)
• In general, a transition probability depends not only on the initial state (i) and final state (j) but also on the current time (n)
• If the one-step transition probabilities are independent of the current time n (i.e. are the same regardless of the value of n), we refer to the Markov Chain as homogeneous and denote the transition probability from state i to state j by Pij
• In this module, all Markov Chains will be assumed to be homogeneous unless otherwise stated
State Diagrams
• If the state space of a Markov chain is S = {0, 1, 2} then the transition probability
matrix would look like this:
0 1 2
0 P00 P01 P02
P = 1 P10 P11 P12
2 P20 P21 P22
• Here, for instance, P10 is the probability of the process being in state 0 given that it
was in state 1 at the previous time step
• In other words, it is the probability that a process in state 1 proceeds next to state 0
• If the state space of a Markov chain is S = {0, 1, 2, . . .} then the transition probability matrix would look like this:

         0    1    2   ···   j   ···
    0 [ P00  P01  P02  ···  P0j  ··· ]
    1 [ P10  P11  P12  ···  P1j  ··· ]
P = 2 [ P20  P21  P22  ···  P2j  ··· ]
    ⋮ [  ⋮    ⋮    ⋮    ⋱    ⋮       ]
    i [ Pi0  Pi1  Pi2  ···  Pij  ··· ]
    ⋮ [  ⋮    ⋮    ⋮    ⋱    ⋮       ]
• Transition probabilities must satisfy the following properties (similar to those for any probability):
1. 0 ≤ Pij ≤ 1 for all i, j ∈ S
2. Σ_{j∈S} Pij = 1 for all i ∈ S
• The second property implies that a process in state i at time n must be in some state at time n + 1
Exercise
• A frog can jump between four lily pads, labeled A, B, C and D. If the frog is on lily pad A, it will jump to B with a probability of 3/4 and to C with a probability of 1/4. If the frog is on B, it will jump to A, C, or D each with a probability of 1/3. If the frog is on C it will jump to A with probability 1/2 or stay on C with probability 1/2. If the frog is on D it will jump to B with probability 1. Let Xn denote the frog's position after the nth jump (Xn = 0 if the frog is at A, Xn = 1 if at B, etc.)
• Write down the transition probability matrix for this Markov Chain.
n-step Transition Probabilities
• What if we are interested in the n-step transition probabilities, i.e. the probability
that the chain goes from state i to state j in n transitions?
Pij^(n) = Pr (Xm+n = j|Xm = i)
◦ Note: our assumption that the Markov Chain is homogeneous means that m
can be any non-negative integer
• The Chapman-Kolmogorov equations give us a means to work out the n-step transition probabilities of any Markov Chain from the one-step transition probabilities
• The Chapman-Kolmogorov equations are stated as follows: for any states i and j in
the state space S, and any positive integers m and n,
Pij^(m+n) = Σ_{k∈S} Pik^(m) Pkj^(n)
Pij^(m+n) = Pr (Xm+n = j|X0 = i)
= Σ_{k∈S} Pr (Xm+n = j, Xm = k|X0 = i)   (by law of total probability)
= Σ_{k∈S} Pr (Xm+n = j|Xm = k, X0 = i) Pr (Xm = k|X0 = i)   (by conditional probability)
= Σ_{k∈S} Pr (Xm+n = j|Xm = k) Pr (Xm = k|X0 = i)   (by Markov property)
= Σ_{k∈S} Pik^(m) Pkj^(n)
• From linear algebra we recognise this as the formula for matrix multiplication
• And, by iterating:

P^(n) = P × P × · · · × P (n factors) = P^n
Example
• Consider a Markov chain {Xn } with states 0, 1 and 2 which has the following transition probability matrix:
0 1 2
0 0.1 0.2 0.7
P = 1 0.2 0.2 0.6
2 0.6 0.1 0.3
P^2 = P × P =
            0     1     2
      0 [ 0.47  0.13  0.40 ]
      1 [ 0.42  0.14  0.44 ]
      2 [ 0.26  0.17  0.57 ]
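The two-step matrix above can be verified with a quick NumPy computation (a sketch of our own, not part of the original notes):

```python
import numpy as np

P = np.array([[0.1, 0.2, 0.7],
              [0.2, 0.2, 0.6],
              [0.6, 0.1, 0.3]])

# Chapman-Kolmogorov in matrix form: the two-step matrix is P squared
P2 = P @ P
print(np.round(P2, 2))
# [[0.47 0.13 0.4 ]
#  [0.42 0.14 0.44]
#  [0.26 0.17 0.57]]
```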
Exercise
• A particle moves among the states 0, 1 and 2 according to a Markov chain whose
transition probability matrix is:
        0    1    2
    0 [  0   1/2  1/2 ]
P = 1 [ 1/2   0   1/2 ]
    2 [ 1/2  1/2   0  ]
• Hint: it is not necessary to calculate the whole matrix each time, though that would
work.
• Suppose we have a Markov Chain with the following transition probability matrix:
        0     1
    0 [ 0.33  0.67 ]
P = 1 [ 0.75  0.25 ]

• Calculate the first few powers of P by hand. What do you see?

           0       1
      0 [ 0.5286  0.4714 ]
P^8 = 1 [ 0.5277  0.4723 ]
• It appears that after many steps, the probabilities of being in state j converge to the
same values regardless of the previous state i
• Thus, by computing a high power of P we can find out approximately the long-run
or limiting distribution of the Markov Chain
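A sketch of this computation with `numpy.linalg.matrix_power` (our own code, not from the notes):

```python
import numpy as np

P = np.array([[0.33, 0.67],
              [0.75, 0.25]])

# Raise P to successively higher powers: every row converges
# to the same limiting distribution
for n in (2, 4, 8, 64):
    print(n, np.round(np.linalg.matrix_power(P, n), 4))
# By n = 8 both rows are already close to (0.5282..., 0.4718...),
# the limiting distribution (0.75/1.42, 0.67/1.42) of this two-state chain
```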
Exercise
• A sociologist develops a Markov Chain model of social mobility, i.e. the relationship between the economic status of one generation and the next, with transition probability matrix:

         0     1     2
    0 [ 0.40  0.50  0.10 ]
P = 1 [ 0.05  0.70  0.25 ]
    2 [ 0.05  0.50  0.45 ]
• Determine what proportion of the population will be in each economic class in the
long run by computing a high power of this matrix
• Hint: you can save time by using the fact that P4 = P2 × P2 and P8 = P4 × P4
• A finite Markov Chain has a limiting distribution provided that:
1. For every pair of states i, j it is possible after some number of steps to go from i to j
2. There is at least one state i for which Pii > 0
• Or, in matrix form:
1. π = Pᵀ π
2. 1ᵀ π = 1, where 1 is an (N + 1)-vector of ones
• The first set of equations above follows from the property that the limiting distribution is stationary with respect to time and independent of the initial state
• To see this, consider writing the Chapman-Kolmogorov Equations as follows:
Pij^(n) = Σ_{k∈S} Pik^(n−1) Pkj ,   i ∈ S
• The limiting distribution is interpreted as the probability that, after a long duration,
the process {Xn } will be found in state j, regardless of the value of X0
• It can also be interpreted as the long run average fraction of time that the process
{Xn } is in state j
• In practice we can determine the limiting distribution by solving the system of equations given by (1) and (2) above
◦ Note: this can be done by substitution, or using matrix row operations (Gaussian
Elimination)
• IMPORTANT NOTE: the system in (1) has one redundant equation because of the restriction Σ_{k∈S} Pik = 1. Thus one of the equations in (1) must be replaced with (2)
Limiting Distribution of a Markov Chain: Example
• Consider the social mobility Markov Chain discussed above
• From the transition probability matrix we can get the following system of equations:
• Remember: one of the first three equations is redundant; we can choose which one
to get rid of (in this case, (3))
• Thus our system is:

0.40π0 + 0.05π1 + 0.05π2 = π0   (1)
0.50π0 + 0.70π1 + 0.50π2 = π1   (2)
π0 + π1 + π2 = 1   (4)

• Substituting π2 = 1 − π0 − π1 from (4) into (1) gives 0.65π0 = 0.05, i.e. 65π0 = 5
• Thus π0 = 5/65 = 1/13; substituting (4) into (2) gives 0.80π1 = 0.50, so π1 = 5/8, and π2 = 1 − π0 − π1 = 31/104
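The same system can be solved mechanically. A sketch in NumPy (our own code, using the transition matrix implied by equations (1)–(2) and the row sums), replacing the redundant stationary equation with the normalisation as described above:

```python
import numpy as np

P = np.array([[0.40, 0.50, 0.10],
              [0.05, 0.70, 0.25],
              [0.05, 0.50, 0.45]])

# Stationary equations (P^T - I) pi = 0, with the redundant third
# equation replaced by the normalisation pi0 + pi1 + pi2 = 1
A = P.T - np.eye(3)
A[2, :] = 1.0
b = np.array([0.0, 0.0, 1.0])

pi = np.linalg.solve(A, b)
print(pi)  # [1/13, 5/8, 31/104] ≈ [0.0769 0.625  0.2981]
```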
• A state i in a Markov Chain is said to be absorbing if Pii = 1 (that is, the probability
on the main diagonal of the transition probability matrix is 1)
• That is, once the process reaches state i, it will remain there forever
• Thus, if a process has state space S = {0, 1, 2, 3}, and states 0 and 3 are absorbing, then the set of absorbing states is R = {0, 3}
• Let T = min{n ≥ 0; Xn ∈ R} be the ‘time of absorption’, i.e. the time at which the
process first reaches (and becomes stuck in) an absorbing state
• Two quantities of interest in Markov Chains involving absorbing states are the expected time of absorption and the probability of absorption to some particular absorbing state r
◦ Note that the second quantity is interesting only when we have two or more
absorbing states; if there is only one absorbing state then the probability of
being absorbed to this state is 1
• Both of these quantities can be expressed in terms of T , and they are generally
expressed in conditional terms since they depend on the initial value of the process:
vi = E (T |X0 = i)
ui,r = Pr (XT = r|X0 = i)
• Consider a Markov Chain {Xn } where Xn describes the condition of a machine part
at time n
0 1 2
0 0.9 0.1 0
P = 1 0 0.8 0.2
2 0 0 1
◦ v0 = E (T |X0 = 0) gives the mean number of times the machine can be used
before the part fails given that the part is currently working perfectly
◦ v1 = E (T |X0 = 1) gives the mean number of times the machine can be used
before the part fails given that the part is currently showing signs of wear
◦ v2 = E (T |X0 = 2) gives the mean number of times the machine can be used
before the part fails given that the part has already failed; clearly v2 = 0 by
inspection
Probability of Absorption to State r: Example
• Consider an ant moving at random between the five vertices of a pyramid, where the honey is at vertex 0 and the poison is at vertex 4, and the ant stays put once it reaches either. Let Xn be the vertex where the ant is located after the nth move. Clearly {Xn } is a Markov Chain with state space S = {0, 1, 2, 3, 4}
• The transition probability matrix for this Markov Chain would be as follows:
        0    1    2    3    4
    0 [  1    0    0    0    0  ]
    1 [ 1/3   0   1/3  1/3   0  ]
P = 2 [ 1/3  1/3   0    0   1/3 ]
    3 [ 1/3  1/3   0    0   1/3 ]
    4 [  0    0    0    0    1  ]
• What is the probability that the ant reaches the honey rather than dying, given that
it starts in state 1? This is precisely the probability u1,0 = Pr (XT = 0|X0 = 1)
• We can use an ingenious technique called First Step Analysis to determine our
quantities of interest for a particular Markov Chain with absorbing states
• This technique uses the law of total probability, which (you will recall) states that, for events A and B1 , B2 , . . . , Bb such that Σ_{k=1}^{b} Pr (Bk ) = 1,

Pr (A) = Σ_{k=1}^{b} Pr (A|Bk ) Pr (Bk )
• Let us begin by seeing how first step analysis is used to find the probability of
absorption to state r
• Suppose we have a Markov Chain with state space S = {0, 1, . . . , N} and at least
two absorbing states (e.g., R = {0, N})
• Suppose further that we are interested in finding ui,r = Pr (XT = r|X0 = i) for all
i ∈ S and for one particular absorbing state r ∈ R
• Event ‘A’ in the law of total probability is in this case the event XT = r (i.e., that the
process is eventually absorbed into state r), while events B1 , B2 , . . . , Bb are in this
case the events X1 = 0, X1 = 1, . . . , X1 = N
• The derivation proceeds as follows:
ui,r = Pr (XT = r|X0 = i)
ui,r = Σ_{k∈S} Pr (XT = r|X0 = i, X1 = k) Pr (X1 = k|X0 = i)   (by law of total probability)
ui,r = Σ_{k∈S} Pr (XT = r|X1 = k) Pr (X1 = k|X0 = i)   (by Markov property)
but since T is random, Pr (XT = r|X1 = k) = Pr (XT = r|X0 = k)
Thus ui,r = Σ_{k∈S} uk,r Pik = u0,r Pi0 + u1,r Pi1 + · · · + uN,r PiN
• The Pik values for all i and k are known from the transition probability matrix, so
we are left with N + 1 unknowns, u0,r , u1,r , . . . , uN,r and N + 1 equations
• If we solve this system of equations we will have our answer
• There is one other observation that simplifies our task considerably: ur,r = 1 and ui,r = 0 for all absorbing states other than r (i.e. for all i ∈ R with i ≠ r)
◦ ur,r = Pr (XT = r|X0 = r) = 1: if we begin in state r, we have already been
absorbed to this state, so absorption to this state is certain
◦ Similarly, if i is an absorbing state other than r, then ui,r = Pr (XT = r|X0 = i) =
0: if we begin in absorbing state i, we have already been absorbed to this state,
so absorption to state r is impossible
• Thus in practice we need to solve the system of equations

ui,r = Σ_{k∈S} uk,r Pik ,   i ∉ R

after substituting in 1 for ur,r and 0 for all u values of absorbing states other than r
First Step Analysis: Example 1
• Let us apply first step analysis to solve the pyramid ant problem discussed above:
what is the probability that the ant eventually reaches the honey rather than the
poison?
• In this case, the absorbing states are R = {0, 4}, and we want to find u1,0 (the probability of being absorbed to state 0 given that the ant starts in state 1)
• We know by inspection that u0,0 = 1 and u4,0 = 0, so the only unknowns remaining
are u1,0 , u2,0 , and u3,0
u1,0 = u0,0 P10 + u1,0 P11 + u2,0 P12 + u3,0 P13 + u4,0 P14
u2,0 = u0,0 P20 + u1,0 P21 + u2,0 P22 + u3,0 P23 + u4,0 P24
u3,0 = u0,0 P30 + u1,0 P31 + u2,0 P32 + u3,0 P33 + u4,0 P34
• Substituting in what we know (that u0,0 = 1 and u4,0 = 0, plus all the transition probabilities), we proceed to solve the system of three equations and three unknowns as follows:

u1,0 = (1)(1/3) + u1,0 (0) + u2,0 (1/3) + u3,0 (1/3) + (0)(0)
u2,0 = (1)(1/3) + u1,0 (1/3) + u2,0 (0) + u3,0 (0) + (0)(1/3)
u3,0 = (1)(1/3) + u1,0 (1/3) + u2,0 (0) + u3,0 (0) + (0)(1/3)

u1,0 = 1/3 + (1/3) u2,0 + (1/3) u3,0
u2,0 = 1/3 + (1/3) u1,0
u3,0 = 1/3 + (1/3) u1,0

Substituting the last two equations into the first:

u1,0 = 1/3 + (1/3)(1/3 + (1/3) u1,0 ) + (1/3)(1/3 + (1/3) u1,0 )
u1,0 = 1/3 + 1/9 + (1/9) u1,0 + 1/9 + (1/9) u1,0
(7/9) u1,0 = 5/9
u1,0 = 5/7
• Thus, if the ant starts at vertex 1 of the pyramid, the probability that the ant eventually reaches vertex 0 (the honey pot) is 5/7 (which also implies that the probability of the ant eventually reaching the poison instead [vertex 4] is 1 − 5/7 = 2/7)
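The three-equation system above can also be solved in one line of linear algebra. A sketch (our own rearrangement of the first step analysis equations):

```python
import numpy as np

# Transition matrix for the ant chain (states 0..4; 0 and 4 absorbing)
P = np.array([[1,   0,   0,   0,   0  ],
              [1/3, 0,   1/3, 1/3, 0  ],
              [1/3, 1/3, 0,   0,   1/3],
              [1/3, 1/3, 0,   0,   1/3],
              [0,   0,   0,   0,   1  ]])

trans = [1, 2, 3]                # transient states
Q = P[np.ix_(trans, trans)]      # transient-to-transient block
b = P[trans, 0]                  # one-step probabilities of jumping straight to 0

# u_i0 = sum_k u_k0 P_ik rearranges to (I - Q) u = b
u = np.linalg.solve(np.eye(3) - Q, b)
print(u)  # [5/7, 4/7, 4/7] ≈ [0.7143 0.5714 0.5714]
```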
First Step Analysis Technique for Finding Expected Time of Absorption
• We have seen how to use first step analysis to determine the probability of absorption to a particular absorbing state r in cases where there are two or more absorbing states in the Markov Chain
• A very similar first step analysis technique can be used to determine the expected
time of absorption, vi = E (T |X0 = i)
• This time, instead of using the law of total probability, our derivation uses the law
of total expectation
• This law is very similar to the law of total probability, but uses expected value rather
than probability only
E (Y) = Σ_{k=1}^{b} E (Y|Bk ) Pr (Bk )
• Suppose we have a Markov Chain with state space S = {0, 1, 2, . . . , N} which contains at least one absorbing state: R ≠ ∅
• In this case, T will be the random variable Y in our law of total expectation formula
and X1 = k will be our event Bk
vi = E (T |X0 = i)
vi = Σ_{k∈S} E (T |X1 = k, X0 = i) Pr (X1 = k|X0 = i)   (by law of total expectation)
vi = Σ_{k∈S} E (T |X1 = k) Pr (X1 = k|X0 = i)   (by Markov property)

But E (T |X1 = k) = 1 + E (T |X0 = k), thus

vi = Σ_{k∈S} (1 + vk ) Pik = Σ_{k∈S} Pik + Σ_{k∈S} vk Pik
vi = 1 + Σ_{k∈S} vk Pik   (since Σ_{k∈S} Pik = 1, a property of all transition prob. matrices)
vi = 1 + v0 Pi0 + v1 Pi1 + · · · + vN PiN
◦ For example, if state 0 is absorbing, then v0 = E (T |X0 = 0) = 0: if the process was already in an absorbing state at time 0, then the expected time of absorption is 0
• Returning to our machine part example, let us apply first step analysis to determine
the expected number of times a machine part can be used before failure, given that
the machine part is currently working fine
• Recall that this Markov Chain had states 0 (part working perfectly), 1 (part showing
signs of wear), and 2 (part failed), and transition probability matrix as follows:
0 1 2
0 0.9 0.1 0
P = 1 0 0.8 0.2
2 0 0 1
• By inspection we can say that v2 = 0; we then have a system of two equations and two unknowns, v0 and v1 :

v0 = 1 + 0.9v0 + 0.1v1
v1 = 1 + 0.8v1 + 0.2(0)

• The second equation gives 0.2v1 = 1, so v1 = 5; substituting into the first gives 0.1v0 = 1 + 0.1(5) = 1.5, so v0 = 15
• Thus the expected number of times that the machine can be used before a part that is working perfectly will fail is 15
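The two-equation system can be checked numerically. A sketch (our own code), writing v0 = 1 + 0.9 v0 + 0.1 v1 and v1 = 1 + 0.8 v1 in matrix form A v = b:

```python
import numpy as np

# Move the unknowns to the left-hand side:
# 0.1 v0 - 0.1 v1 = 1  and  0.2 v1 = 1 (using v2 = 0)
A = np.array([[0.1, -0.1],
              [0.0,  0.2]])
b = np.array([1.0, 1.0])

v = np.linalg.solve(A, b)
print(v)  # ≈ [15. 5.]
```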
First Step Analysis Exercises
• Consider the Markov Chain {Xn } whose transition probability matrix is given by:
        0    1    2    3
    0 [  1    0    0    0  ]
P = 1 [ 0.1  0.4  0.1  0.4 ]
    2 [ 0.2  0.1  0.6  0.1 ]
    3 [  0    0    0    1  ]
• The general equations that can be used for first step analysis of any Markov Chain
with states 0, 1, 2, . . . , N are:
ui,r = Σ_{k∈S} uk,r Pik

vi = 1 + Σ_{k∈S} vk Pik
• The equation should be written out for each state i that is NOT an absorbing state
• The uk,r or vk values for the absorbing states can be determined trivially (always 0
or 1) and substituted into the equations
• We then have a system of an equal number of equations and unknowns that can be
solved
• The quantities vi (expected time of absorption given initial state i) and ui,r (probability of absorption to state r given initial state i) can also be solved using a matrix approach, described as follows
• Suppose that our Markov Chain consists of a absorbing states and b transient states
(a + b states in total)
• We first need to reorganise the transition probability matrix into the matrix P*, consisting of four sub-matrices, as follows:

P* = [  Q     Rᵀ ]
     [ 0a×b   Ia ]

where
◦ Q is a b × b matrix of transition probabilities between transient states only
◦ R is an a × b matrix whose (r, i)th entry is the transition probability from transient state i to absorbing state r
◦ 0a×b is an a × b matrix of zeroes
◦ Ia is the a × a identity matrix
• We further define the fundamental matrix N = (Ib − Q)⁻¹ (computing this inverse by hand can be time-consuming if the matrix is 3 × 3 or larger)
• This matrix allows us to compute v, the vector of expected absorption times per
initial state, as the row sums of N
• Furthermore, U, the matrix whose (i, r)th element is ui,r , the probability of absorption into state r given initial state i, can be computed as

U = N Rᵀ
• Here, Q = [ 0.9  0.1 ]  and  R = [ 0  0.2 ]
            [  0   0.8 ]

• Thus, I2 − Q = [ 0.1  −0.1 ]  and  N = (I2 − Q)⁻¹ = [ 10  5 ]
                 [  0    0.2 ]                         [  0  5 ]

• Summing the rows of N, we obtain v = (15, 5), which matches our answers v0 = 15 and v1 = 5 obtained earlier
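A sketch of the whole matrix method in NumPy (the slicing conventions are our own; R is stored as the a × b matrix used above):

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

Q = P[:2, :2]                     # b x b transient-to-transient block
R = P[:2, 2:].T                   # a x b: rows index absorbing states

N = np.linalg.inv(np.eye(2) - Q)  # fundamental matrix, ≈ [[10, 5], [0, 5]]
v = N.sum(axis=1)                 # row sums: expected absorption times ≈ [15, 5]
U = N @ R.T                       # absorption probabilities ≈ [[1], [1]]
print(N, v, U, sep="\n")
```

Since there is only one absorbing state here, every entry of U is 1, as expected.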
• For the pyramid ant problem, the transient states are 1, 2, 3 and the absorbing states are 0 and 4, so that:

    [  0   1/3  1/3 ]            [ 1/3  1/3  1/3 ]
Q = [ 1/3   0    0  ]   and  R = [  0   1/3  1/3 ]
    [ 1/3   0    0  ]

• Computing U = N Rᵀ, its (1, 0) entry gives u1,0 = 5/7, just as we obtained previously
• There are many systems in science that can be described by Markov Chains
• We will now look at examples: the Ehrenfest model for diffusion across a membrane, genetics models, epidemiological models, and ecological models
• Imagine two containers (urns) containing a total of 2a balls (which represent molecules).
• Suppose the first container, labelled A, holds k balls while the second container, B,
holds the remaining 2a − k balls
• A ball is selected at random and moved to the other container (as in the diagram in
Figure 1.8)
Figure 1.8: Illustration of Ehrenfest Urn Process
• What will be the long-term behaviour of such a system? A Markov Chain model
can help us to answer this question
• Let Yn be the number of balls in urn A after n time steps, and let
Xn = Yn − a
• Xn represents the excess balls in urn A over and above half the total (thus Xn = 0 means there is an equal number of balls in each urn, while a positive value of Xn means there are more balls in urn A than in urn B, and a negative value of Xn means there are fewer balls in urn A than in urn B)
Ehrenfest Urn Model: Transition Probabilities
• Consider an Ehrenfest Urn Model with a = 3, meaning that there are a total of
2a = 6 balls in the system
• Suppose that Xn = 0, meaning there is an equal number of balls in each urn, i.e. 3
balls in urn A and 3 balls in urn B
◦ When one of the six balls is chosen and moved at random, there is a 3/6 = 1/2 chance that it will be a ball from urn A (moved to urn B) and a 3/6 = 1/2 chance that it will be a ball from urn B (moved to urn A)
◦ Thus, Xn+1 can either be −1 (if a ball is moved from urn A to urn B) or 1 (if a ball is moved from urn B to urn A), each with probability 1/2
• Suppose that Xn = 1, which means one-more-than-half of the balls are in urn A (4
balls in this case)
◦ When one of the six balls is chosen and moved at random, there is a 4/6 = 2/3 chance that it will be a ball from urn A (moved to urn B) and a 2/6 = 1/3 chance that it will be a ball from urn B (moved to urn A)
◦ Thus, Xn+1 can either be 0 (if a ball is moved from urn A to urn B, which happens with probability 2/3) or 2 (if a ball is moved from urn B to urn A, which happens with probability 1/3)
• Do you see the pattern? In general, the probability that the number of balls in urn
A increases by 1 (Xn+1 = Xn + 1) is proportional to the current number of balls
in urn B, while the probability that the number of balls in urn A decreases by 1
(Xn+1 = Xn − 1) is proportional to the current number of balls in urn A
• Thus the transition probability matrix when there are 2a = 6 balls will be:

        −3   −2   −1    0    1    2    3
   −3 [  0    1    0    0    0    0    0  ]
   −2 [ 1/6   0   5/6   0    0    0    0  ]
   −1 [  0   2/6   0   4/6   0    0    0  ]
P = 0 [  0    0   3/6   0   3/6   0    0  ]
    1 [  0    0    0   4/6   0   2/6   0  ]
    2 [  0    0    0    0   5/6   0   1/6 ]
    3 [  0    0    0    0    0    1    0  ]
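For larger systems it is easier to build this matrix programmatically. A sketch (the function name `ehrenfest_matrix` is our own):

```python
import numpy as np

def ehrenfest_matrix(a):
    """Transition matrix of X_n = Y_n - a on states -a, ..., a (2a balls)."""
    states = np.arange(-a, a + 1)
    P = np.zeros((len(states), len(states)))
    for idx, x in enumerate(states):
        in_A = x + a                # Y_n, balls currently in urn A
        in_B = 2 * a - in_A         # balls currently in urn B
        if in_B > 0:
            P[idx, idx + 1] = in_B / (2 * a)  # a ball moves B -> A
        if in_A > 0:
            P[idx, idx - 1] = in_A / (2 * a)  # a ball moves A -> B
    return P

print(ehrenfest_matrix(3))  # reproduces the 7 x 7 matrix above
```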
Ehrenfest Urn Model: Exercise
1. Work out the transition probability matrix P for an Ehrenfest Urn Model with 4
balls in the system
• The simplest version of this model is the Random Reproduction Model, which disregards mutation pressures and selective forces
• Let Xn be the number of individuals of allele type a in the population at the nth
generation (which implies that there are 2N − Xn individuals of allele type A)
• For each trial (birth), the probability that this new individual is of type a is equal to
the proportion of the parent generation that is of type a, that is,
pi = i/(2N)
(where we assume that Xn = i is the number of type a individuals in generation n)
• The probability that a new individual is of type A is equal to the proportion of the
parent generation that is of type A, that is,
qi = 1 − pi = (2N − i)/(2N) = 1 − i/(2N)
• This implies that {Xn } is a Markov Chain with state space S = {0, 1, 2, . . . , 2N}
whose transition probabilities follow a binomial distribution:
Pij = Pr (Xn+1 = j|Xn = i) = (2N choose j) pi^j (1 − pi)^(2N−j)   for i, j = 0, 1, 2, . . . , 2N
• You will notice that the only parameter in this model is N (half the population
size); thus, once we know the population size we can fully specify the transition
probability matrix and analyse the long-term behaviour of the process
Random Reproduction Model: Example
• Consider the Wright-Fisher Random Reproduction Model in the case where the
population size is 4 (N = 2)
• Using the binomial transition probability formula above, we can easily determine
that the transition probability matrix is as follows:
0 1 2 3 4
0 1 0 0 0 0
1 0.3164 0.4219 0.2109 0.0469 0.0039
P = 2 0.0625 0.25 0.375 0.25 0.0625
3 0.0039 0.0469 0.2109 0.4219 0.3164
4 0 0 0 0 1
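The matrix can be generated for any N directly from the binomial formula. A sketch (the function name is our own), using `math.comb` for the binomial coefficient:

```python
import numpy as np
from math import comb

def wright_fisher_matrix(N):
    """Binomial transition matrix on states 0, ..., 2N (copies of allele a)."""
    size = 2 * N + 1
    P = np.zeros((size, size))
    for i in range(size):
        p = i / (2 * N)             # success probability p_i
        for j in range(size):
            P[i, j] = comb(2 * N, j) * p**j * (1 - p)**(2 * N - j)
    return P

print(np.round(wright_fisher_matrix(2), 4))  # matches the 5 x 5 example above
```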
• Since p0 = 0, we have P00 = (2N choose 0) 0^0 (1 − 0)^(2N) = 1 and P0j = 0 for any 0 < j ≤ 2N
• Thus, regardless of population size 2N, state 0 is always an absorbing state in this Markov Chain. Similarly,

P2N,2N = (2N choose 2N) 1^(2N) 0^0 = 1   and, for any 0 ≤ j < 2N,
P2N,j = (2N choose j) 1^j 0^(2N−j) = 0

• Thus, regardless of population size 2N, state 2N is always an absorbing state in this Markov Chain
• The practical interpretation is simply this: once the entire population consists of
allele type a genes, allele type A becomes extinct, disappearing forever (and vice
versa, if the entire population consists of A genes)
• The biological interpretation of this phenomenon is that the gene pool will tend to
become less diverse over time as some alleles become extinct
• This is certain to happen in the long run; the only uncertainties in the model are
◦ Which of the two alleles will survive and which will become extinct?
◦ How many generations will it take for one of the alleles to go extinct (i.e. for
the chain to reach an absorbing state)?
• These two questions can be answered probabilistically using first step analysis
Exercise
• For the Random Reproduction Model with N = 2, determine the probability that
allele a eventually goes extinct
• For the Random Reproduction Model with N = 2, determine the expected number
of generations until one of the alleles goes extinct
• Give the transition probability matrix for the Random Reproduction Model with
N=3
• A more realistic genetics model must take into account mutations. One way of doing this in our Random Reproduction Model is to assume that, prior to the formation of the new generation, each gene has the possibility to mutate, i.e., to change into a gene of the other type
• Specifically, we assume that for each gene the mutation a→A occurs with proba-
bility α, and A→a occurs with probability β
• Otherwise our assumptions are the same: we have a fixed population size of 2N,
and the birth of each new generation is a binomial experiment with 2N independent
trials
• Let BA be the event that a gene is born as allele A and let MA→a be the event that a
gene undergoes mutation A→a
• Let Ga be the event that a gene is of type a when it reproduces (i.e. after a possible
mutation)
• There are two ways that event Ga could happen: either the gene is born as type a
and does not mutate, or the gene is born as type A and does mutate. These two
situations are mutually exclusive; thus,
pi = Pr (Ga ) = Pr (Ba ∩ Mᶜa→A ) + Pr (BA ∩ MA→a )
   = Pr (Ba ) Pr (Mᶜa→A |Ba ) + Pr (BA ) Pr (MA→a |BA )
   = (i/(2N))(1 − α) + (1 − i/(2N))β
• Therefore, the transition probabilities for this Markov Chain are binomially distributed with number of trials 2N and success probabilities pi as given above, i.e.,
Pij = Pr (Xn+1 = j|Xn = i) = (2N choose j) pi^j (1 − pi)^(2N−j)   where pi = (i/(2N))(1 − α) + (1 − i/(2N))β
• Provided that both mutation probabilities are nonzero (i.e. αβ > 0), this Markov
Chain has no absorbing states
• Instead, it has a limiting distribution, which you can calculate using the methods
learnt in this chapter
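As an illustration (the parameter values below are our own choice, not from the notes), the limiting distribution can be approximated by raising the matrix to a high power:

```python
import numpy as np
from math import comb

# Wright-Fisher with mutation: N = 2, alpha = beta = 0.1 (illustrative values)
N, alpha, beta = 2, 0.1, 0.1
size = 2 * N + 1
p = [(i / (2 * N)) * (1 - alpha) + (1 - i / (2 * N)) * beta for i in range(size)]
P = np.array([[comb(2 * N, j) * p[i]**j * (1 - p[i])**(2 * N - j)
               for j in range(size)] for i in range(size)])

# All p_i lie strictly between 0 and 1, so no state is absorbing;
# every row of a high power of P converges to the limiting distribution
pi = np.linalg.matrix_power(P, 500)[0]
print(np.round(pi, 4))
```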
• Two other Markov Chain genetics models not covered in these lecture notes are:
• Two classical and simple compartmental epidemiological models are the SIS Model
and the SIR Model
• In the SIS model, individuals in the population are classified as either ‘susceptible’ (S), meaning that they are not currently infected but could become infected, or ‘infected’ (I), meaning that they are currently infected as well as infectious (can infect others)
• Within the SIS epidemic model, it is assumed that a susceptible individual can
become infected through contact with an infected individual; it is further assumed
that infected individuals eventually recover and return to the susceptible class
Figure 1.9: Compartmental Diagram for SIS Model
• The SIS Model has been applied to sexually transmitted infections (STIs)
• The compartmental diagram in Figure 1.9 shows how individuals move between the
S and I classes; it is not to be confused with a state diagram, since S and I do not
represent states in a Markov Chain, but disease status of an individual
• In the SIR model, besides the ‘susceptible’ (S) and ‘infected’ (I) classes, there is a third class called ‘removed’ (R), sometimes referred to as ‘immune’, which refers to individuals who were previously infected and are now neither infected nor susceptible
• Thus, unlike in the SIS Model, infected individuals do not, after recovering, return
to being susceptible; they become immune to the disease
• The SIR Model has been applied to ‘once-off’ childhood diseases such as chicken
pox, measles, and mumps
• The compartmental diagram in Figure 1.10 shows how individuals move between
the S, I, and R classes; again, it is not to be confused with a state diagram
• In your Biomathematics module, you have learned or will learn about deterministic
versions of the SIS and SIR Models
◦ This includes discrete-time (difference equations) and continuous-time (ordinary differential equations) versions of the models
• In a deterministic model, once we set the initial value(s) and the values of all parameters, the model tells us exactly what will happen, i.e. the number of susceptible and the number of infected individuals at any point in time
• In this module, we are studying only stochastic models, not deterministic models. With a stochastic model, once we set the initial value(s) and the values of all parameters, the model still cannot tell us exactly what will happen. It allows us to calculate probabilities and expectations about what will happen.
• To reduce the number of transitions and simplify the model, we assume that the
time step size (i.e. the amount of time that passes between time n and time n + 1) is
small enough that the number of infected individuals changes at most by one during
this interval
• The transition probability formula for this model can then be stated as follows:

Pij = Pr (In+1 = j|In = i) =
   β(i/N)((N − i)/N) = βi(N − i)/N²   if j = i + 1
   γi/N                               if j = i − 1
   1 − [βi(N − i)/N² + γi/N]          if j = i
   0                                  otherwise
• Here, i(N − i) in the Pi,i+1 transition probability reflects that the probability of a new
infection occurring is proportional to the number of potential interactions between
the current number of infected individuals i and the current number of susceptible
individuals N − i
• The i in the Pi,i−1 transition probability indicates that the probability of an infected person recovering and returning to susceptibility is proportional to the current number of infected individuals i
• Dividing by N twice in the Pi,i+1 transition probability and once in the Pi,i−1 transition probability is done just so that the i and N − i terms are expressed as a fraction of the total population N
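A sketch of the resulting tridiagonal matrix (the function name and parameter values are ours; the parameters must respect the restrictions on β and γ discussed in the notes — β = 0.9, γ = 0.4 are safely inside them):

```python
import numpy as np

def sis_matrix(N, beta, gamma):
    """Transition matrix for the number of infected, states i = 0, ..., N."""
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        up = beta * i * (N - i) / N**2   # probability of one new infection
        down = gamma * i / N             # probability of one recovery
        if i < N:
            P[i, i + 1] = up
        if i > 0:
            P[i, i - 1] = down
        P[i, i] = 1 - up - down
    return P

P = sis_matrix(5, beta=0.9, gamma=0.4)
print(np.round(P, 3))   # note that state 0 is absorbing (P[0, 0] = 1)
```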
• A basic restriction is that β > 0, γ > 0, and N > 0 (N must also be an integer);
however, the parameters must also be chosen in such a way that Pi j satisfies the
properties of transition probabilities
• We can see from the transition probability formula above that the property ∑_{j∈S} Pij = 1 is satisfied by definition:

    ∑_{j∈S} Pij = 0 + Pi,i−1 + Pii + Pi,i+1 = γi/N + (1 − [βi(N − i)/N^2 + γi/N]) + βi(N − i)/N^2 = 1
• However, we must also ensure that the property Pi j ≥ 0 is satisfied, and this depends
on the values of the parameters
• Let g(i) = βi(N − i)/N^2 + γi/N; let us use calculus to find the maximum value of g(i); we will call this g(i*), where i* is the value of i that maximises g(i)
• We will then set g(i*) ≤ 1 to ensure that Pii ≥ 0
    g(i) = (β + γ)i/N − βi^2/N^2

    g′(i) = (β + γ)/N − 2βi/N^2 = 0
    2βi/N^2 = (β + γ)/N
    i* = [(β + γ)/N] × [N^2/(2β)] = N(β + γ)/(2β)
• Thus, in order to ensure that Pii ≥ 0, the restriction on parameters must be that g(i*) = (β + γ)^2/(4β) ≤ 1 (if γ ≤ β, so that i* ≤ N) or that g(N) = γ ≤ 1 (if γ > β, in which case i* > N and the maximum over the state space occurs at i = N)
• The first case (γ ≤ β) can be simplified thus:

    (β + γ)^2 ≤ 4β
    β^2 + 2βγ + γ^2 − 4β ≤ 0
    β^2 + 2(γ − 2)β + γ^2 ≤ 0

Suppose we choose some fixed γ > 0. The roots of this quadratic inequation in β are:

    β = [−2(γ − 2) ± √(4(γ − 2)^2 − 4(1)(γ^2))] / 2
      = −(γ − 2) ± √((γ^2 − 4γ + 4) − γ^2)
      = −(γ − 2) ± √(−4γ + 4)
      = 2 − γ ± 2√(1 − γ)

For real roots we need γ ≤ 1, and the inequality holds for β between the two roots. Since the lower root 2 − γ − 2√(1 − γ) is never greater than γ (because 1 − γ ≤ √(1 − γ) whenever 0 ≤ 1 − γ ≤ 1), the restriction on the parameters becomes:

    0 < γ ≤ 1
    0 < β ≤ 2 − γ + 2√(1 − γ)
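As a quick numerical check of these restrictions, the full transition matrix can be built and its row sums verified. A minimal sketch; the values N = 10, β = 1.0, γ = 0.5 are illustrative (not from the notes) and satisfy the restriction above, since 0.5 ≤ 1 and 1.0 ≤ 2 − 0.5 + 2√0.5 ≈ 2.91:

```python
def sis_transition_matrix(N, beta, gamma):
    """Build the (N+1)x(N+1) transition matrix of the SIS Markov Chain.

    State i is the current number of infected individuals.
    """
    P = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        up = beta * i * (N - i) / N**2   # P_{i,i+1}: new infection
        down = gamma * i / N             # P_{i,i-1}: recovery
        if i + 1 <= N:
            P[i][i + 1] = up
        if i - 1 >= 0:
            P[i][i - 1] = down
        P[i][i] = 1.0 - up - down        # no change
    return P

P = sis_transition_matrix(N=10, beta=1.0, gamma=0.5)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)  # rows sum to 1
assert all(p >= 0 for row in P for p in row)          # all entries non-negative
```

Choosing parameters that violate the restriction (for example γ = 1.5) would make some diagonal entries negative, which is exactly what the derivation above rules out.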
Further Analysis of the Transition Probabilities
• We may also observe that it will be important to the behaviour of the Markov Chain
which of Pi,i+1 = βi(N − i)/N 2 and Pi,i−1 = γi/N is greater (which determines
whether the number of infected individuals will tend to increase or decrease)
• Let us consider the circumstances under which Pi,i+1 > Pi,i−1
    βi(N − i)/N^2 − γi/N > 0
    (β − γ)i/N − βi^2/N^2 > 0
    i[(β − γ)/N − βi/N^2] > 0

Roots (zeroes) of the inequation: i = 0 and i = (β − γ)N/β

• Since i will never be negative, we can see that Pi,i+1 > Pi,i−1 (for 0 < i < (β − γ)N/β) provided that (β − γ)N/β > 0, which will be true provided that β > γ
• Thus, as long as the intensity rate of infection exceeds the intensity rate of recovery,
the epidemic will tend to spread
Long-Term Behaviour
• It is evident that, when i = 0, Pii = 1 while Pi,i+1 = Pi,i−1 = 0, which means this is
an absorbing state
• It is further evident that, when i = N, Pi,i+1 = 0, Pi,i−1 = γ and Pii = 1 − γ; thus this
is not an absorbing state
• The only absorbing state is state 0, which represents 0 infections (the epidemic has
been eradicated)
• First step analysis can be used to determine the expected amount of time until the
epidemic is eradicated
• We could also ‘pretend’ that any other individual state is an absorbing state and use
first step to find the probability that the epidemic reaches that state (i.e. the number
of infected individuals reaches that number) before it is eradicated, for given initial
conditions
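The first step analysis described above amounts to solving the linear system (I − Q)v = 1, where Q is the transition matrix restricted to the transient states 1, …, N and v holds the expected absorption times. A sketch under illustrative parameter values (not from the notes):

```python
def sis_expected_eradication_time(N, beta, gamma):
    """Expected number of steps until absorption at state 0, from each
    starting state i = 1..N, via first step analysis: v = 1 + Q v."""
    n = N  # transient states 1..N
    A = [[0.0] * n for _ in range(n)]  # A = I - Q
    b = [1.0] * n
    for i in range(1, N + 1):
        up = beta * i * (N - i) / N**2
        down = gamma * i / N
        A[i - 1][i - 1] = up + down  # 1 - (stay probability)
        if i + 1 <= N:
            A[i - 1][i] = -up
        if i - 1 >= 1:
            A[i - 1][i - 2] = -down
    # Solve A v = b by Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (b[r] - sum(A[r][c] * v[c] for c in range(r + 1, n))) / A[r][r]
    return v  # v[i-1] = expected steps to eradication starting from i infected

times = sis_expected_eradication_time(N=5, beta=0.4, gamma=0.5)
assert all(t > 0 for t in times)
assert times[0] < times[-1]  # more infected -> longer expected time to eradication
```

The monotonicity check reflects the birth-death structure: starting with more infected individuals, the chain must pass through every lower state before reaching 0.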
Markov Chain SIR Model
• The Markov Chain SIR Model is more complicated in that it includes three random variables:
S n , the number of susceptible individuals at time n
In , the number of infected individuals at time n
Rn , the number of removed (immune) individuals at time n
• Since S n + In + Rn = N, one of the three random variables can be expressed in terms
of the other two. For instance, Rn |(S n , In ) = N − S n − In is a constant; the process is
fully specified by the two random variables S n and In
• Still, we have two random variables and so our state space is two-dimensional: we
are interested in the joint probability
Pr (S n = s ∩ In = i)
• As before, we assume that our time step size is small enough that at most one change
in state can occur in one time step
• This means that at each time step, either there is no change, or there is one new in-
fection (S n decreases by 1 and In increases by 1) or there is one new removal/recovery
(In decreases by 1).
• The transition probabilities can therefore be expressed as follows:

    Psi,kj = Pr(Sn+1 = k ∩ In+1 = j | Sn = s ∩ In = i) =
        βis/N^2                  if (k, j) = (s − 1, i + 1)   (new infection)
        γi/N                     if (k, j) = (s, i − 1)       (new recovery)
        1 − βis/N^2 − γi/N       if (k, j) = (s, i)           (no change)
        0                        otherwise
• The state space in this case is {0, 1, 2, . . . , N} for both S n and In with the restriction
that S n + In ≤ N
• It is not very convenient to represent the transition probabilities in a two-dimensional
matrix structure, but it can be done; for instance we make the first row (0, 0), the
second row (1, 0), the third row (2, 0), and so on up to the N + 1th row (N, 0); then
the N +2th row is (0, 1), followed by (1, 1), and so on up to (N −1, 1) [since the state
(N, 1) does not exist], then (0, 2), (1, 2), . . ., (N − 2, 2). The pattern would continue
until the final row, (0, N)
• In general, the number of rows (and columns) of the transition probability matrix,
that is, the total number of distinct states in the Markov Chain will be
    C(N + 2, 2) = (N + 1)(N + 2)/2
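The state count can be checked by enumerating the states directly; a small sketch:

```python
def sir_states(N):
    """Enumerate the states (s, i) of the SIR chain with s + i <= N,
    in the row ordering described above: all states with i = 0 first,
    then i = 1, and so on."""
    return [(s, i) for i in range(N + 1) for s in range(N - i + 1)]

# Total number of states matches C(N+2, 2) = (N+1)(N+2)/2
for N in (1, 5, 100):
    assert len(sir_states(N)) == (N + 1) * (N + 2) // 2
```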
• As with the SIS Markov Chain Model, the parameters β and γ represent the infec-
tion rate and recovery rate respectively
• We will not derive restrictions on the parameters γ and β in the notes but it is clear
that γ > 0 and β > 0; also we require that
g(i, s) = βis/N 2 + γi/N ≤ 1 to ensure that P si,si ≥ 0
Markov Chain SIR Model: Example
Long-Term Behaviour
• All of the states where the second element is 0 are absorbing since this represents
a disease-free state (no infected individuals; all individuals are either susceptible or
removed)
• For example, if the system with N = 100 is absorbed to the state (1, 0), this means
RT = 100 − 1 − 0 = 99 people contracted the disease and recovered, while S T = 1
person never got the disease. By contrast, if the system is absorbed to the state
(90, 0), this means RT = 100−90 = 10 people contracted the disease and recovered,
while S T = 90 people never got the disease
• It could also be of interest to determine the expected absorption time, i.e. how many
time steps it takes before the epidemic stops
• Here, x(t) is the population size of the prey species while y(t) is the population size
of the predator species
• In this section we consider a simple stochastic version of this model of predator-
prey dynamics
• Suppose that Xn is the population of the prey species at time n and Yn is the popula-
tion of the predator species at time n
• As in the SIR Markov Chain Model, we have to jointly consider the two random
variables (Xn , Yn ); the transition probabilities of interest are
Pr (Xn+1 = i ∩ Yn+1 = j|Xn = x ∩ Yn = y)
• In order to keep our transition probability matrix dimensions finite, we assume that
each species has some maximum population (carrying capacity): NX for the prey
species and NY for the predator species
◦ Thus population growth is according to a logistic curve rather than exponential
• The state space for Xn is therefore {0, 1, 2, . . . , NX } and the state space for Yn is
{0, 1, 2, . . . , NY }
◦ This time, the two random variables are not jointly constrained, so the number
of states will be
(NX + 1)(NY + 1)
• As in our epidemiological models, we make the simplifying assumption that the
time step size is only large enough to allow a change of one unit in the predator or
prey population size
• The transition probabilities are defined as follows:

    Pxy,ij =
        a10 x(NX − x)/NX^2              if (i, j) = (x + 1, y)   (prey population increases by 1)
        a12 xy/(NX NY)                  if (i, j) = (x − 1, y)   (prey population decreases by 1)
        a21 xy(NY − y)/(NX NY^2)        if (i, j) = (x, y + 1)   (predator population increases by 1)
        a20 y/NY                        if (i, j) = (x, y − 1)   (predator population decreases by 1)
        1 − a10 x(NX − x)/NX^2 − a12 xy/(NX NY)
          − a21 xy(NY − y)/(NX NY^2) − a20 y/NY
                                        if (i, j) = (x, y)       (no change)
        0                               otherwise
Restrictions on Parameters
• We require that a10, a12, a21, a20 > 0; further restrictions are required to ensure that Pxy,xy ≥ 0
Long-Term Behaviour
◦ If the predator species goes extinct first, the prey species can no longer de-
crease and will survive and inevitably reach its carrying capacity NX
◦ If the prey species goes extinct first, the predator species can no longer in-
crease and will also inevitably go extinct
• Consider a very simple case where the carrying capacity of the prey species is NX =
2 and the carrying capacity of the predator species is also NY = 2
• We can see that even in this case the dimensions of the transition probability matrix
will be 9 × 9
With the states ordered (0,0), (1,0), (2,0), (0,1), (1,1), (2,1), (0,2), (1,2), (2,2), the nonzero entries of each row of P are:

    From (0,0): stays at (0,0) with probability 1 (absorbing)
    From (1,0): to (2,0) with a10/4; stays with 1 − a10/4
    From (2,0): stays at (2,0) with probability 1 (absorbing)
    From (0,1): to (0,0) with a20/2; stays with 1 − a20/2
    From (1,1): to (1,0) with a20/2; to (0,1) with a12/4; to (2,1) with a10/4; to (1,2) with a21/8; stays with 1 − a10/4 − a12/4 − a20/2 − a21/8
    From (2,1): to (2,0) with a20/2; to (1,1) with a12/2; to (2,2) with a21/4; stays with 1 − a12/2 − a20/2 − a21/4
    From (0,2): to (0,1) with a20; stays with 1 − a20
    From (1,2): to (1,1) with a20; to (0,2) with a12/2; to (2,2) with a10/4; stays with 1 − a10/4 − a12/2 − a20
    From (2,2): to (2,1) with a20; to (1,2) with a12; stays with 1 − a20 − a12
• Similar Markov Chain models could be formulated for competing species or for
symbiotic species (species that help each other to survive)
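A simulation sketch of this predator-prey chain, using illustrative parameter values; each step draws one uniform variate and selects a transition according to the probabilities defined above:

```python
import random

def predator_prey_step(x, y, NX, NY, a10, a12, a21, a20, rng):
    """One step of the predator-prey Markov Chain defined above."""
    p_prey_up = a10 * x * (NX - x) / NX**2
    p_prey_down = a12 * x * y / (NX * NY)
    p_pred_up = a21 * x * y * (NY - y) / (NX * NY**2)
    p_pred_down = a20 * y / NY
    u = rng.random()
    if u < p_prey_up:
        return x + 1, y           # prey population increases by 1
    u -= p_prey_up
    if u < p_prey_down:
        return x - 1, y           # prey population decreases by 1
    u -= p_prey_down
    if u < p_pred_up:
        return x, y + 1           # predator population increases by 1
    u -= p_pred_up
    if u < p_pred_down:
        return x, y - 1           # predator population decreases by 1
    return x, y                   # no change

rng = random.Random(1)
x, y = 10, 5
for _ in range(10_000):
    x, y = predator_prey_step(x, y, NX=20, NY=20,
                              a10=0.5, a12=0.5, a21=0.5, a20=0.5, rng=rng)
# Carrying capacities are never exceeded and populations never go negative,
# because the corresponding transition probabilities vanish at the boundaries.
assert 0 <= x <= 20 and 0 <= y <= 20
```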
2 Stochastic Event Processes
Stochastic Event Processes
• In this section we will be looking at stochastic processes that are used to model
countable events occurring in continuous time
    Pr(Y = k) = e^{−λ} λ^k / k!   for k = 0, 1, 2, 3, ...
• The mean and variance of the Poisson Distribution are E (Y) = λ and Var (Y) = λ
• The following three properties define the (homogeneous) Poisson Process {X(t); t ≥ 0} with rate λ > 0:

1. For any time points t0 = 0 < t1 < t2 < ··· < tn, the process increments X(t1) − X(t0), X(t2) − X(t1), ..., X(tn) − X(tn−1) are independent random variables
2. For any s ≥ 0 and t > 0, the random variable X(s + t) − X(s) ∼ Poisson(λt)
3. X(0) = 0
• Note that time is now continuous, rather than discrete as in the Markov Chains we
studied
• Most quantities that can be modelled using a Poisson distribution can also be mod-
elled as a Poisson process, for example:
• Births at a certain hospital occur at random with an average rate of 0.4 per hour. It is currently 08:00 at the hospital. Model the births as a Poisson Process and answer the following questions:
(a) What is the probability that no births occur before 10:00?
(b) Given that no births occurred before 10:00, what is the probability that exactly one birth occurs between 10:00 and 11:00?
• To answer (a) we observe that 10:00 is 2 time units (hours) after 08:00. Thus we
are interested in Pr (X(2) − X(0) = 0), where X(2) − X(0) = X(2), since X(0) = 0
(property 3 above). By assumption, X(2) has a Poisson distribution with parameter
λt = 0.4(2) = 0.8. Thus
    Pr(X(2) = 0) = e^{−0.8} (0.8)^0 / 0! = 0.4493
• To answer (b) we observe that 11:00 is 3 time units (hours) after 08:00. The number
of births between 10:00 and 11:00 is therefore X(3) − X(2). We are interested
in Pr (X(3) − X(2) = 1|X(2) − X(0) = 0). However, X(3) − X(2) is independent of
X(2) − X(0) = X(2) (see property 1 above). Thus the conditional probability is
the same as the unconditional probability: Pr (X(3) − X(2) = 1|X(2) = 0) reduces
to Pr (X(3) − X(2) = 1). Also, the distribution of X(3) − X(2) is the same as the
distribution of X(1) − X(0) = X(1) (see property 2 above: letting t = 1 and s = 0 or
s = 2, we see that in both cases we get a random variable with Poisson(λ(1) = 0.4)
distribution). Thus:
    Pr(X(3) − X(2) = 1) = Pr(X(1) = 1) = e^{−0.4} (0.4)^1 / 1! = 0.2681
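The two answers above can be reproduced in a few lines (a sketch; poisson_pmf is a small helper defined here, not a library function):

```python
import math

def poisson_pmf(k, mu):
    """Pr(Y = k) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

rate = 0.4  # births per hour

# (a) no births between 08:00 and 10:00: X(2) ~ Poisson(0.4 * 2)
p_a = poisson_pmf(0, rate * 2)

# (b) exactly one birth between 10:00 and 11:00: by properties 1 and 2,
# this equals Pr(X(1) = 1) with X(1) ~ Poisson(0.4 * 1)
p_b = poisson_pmf(1, rate * 1)

assert abs(p_a - 0.4493) < 5e-4
assert abs(p_b - 0.2681) < 5e-4
```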
Poisson Process: Example 2
• Notice in this case that the λ parameter is measured in units of per minute. We must
either transform the units to be per hour (λ = 0.15(60) = 9) or we can keep λ = 0.15
and measure t in minutes in our calculations concerning Pr (X(t)). Let us use both
approaches to show that we get the same answer. In the first case (time measured
in hours) we proceed as follows:
    Pr[X(1/2) = 1 ∩ X(5/2) = 20]
      = Pr[X(1/2) = 1 ∩ X(5/2) − X(1/2) = 19]
      = Pr[X(1/2) = 1] × Pr[X(5/2) − X(1/2) = 19]   (by independence, from property 1)
      = Pr[X(1/2) = 1] × Pr[X(2) − X(0) = 19]       (by property 2)
      = Pr[X(1/2) = 1] × Pr[X(2) = 19]              (by property 3)
      = [e^{−9/2} (9/2)^1 / 1!] × [e^{−9×2} (9 × 2)^{19} / 19!]
      = (0.04999)(0.08867)
      = 0.004433
• Mutations occur in a gene according to a Poisson process with average rate of 1.4
per generation. Determine the following:
(a) The probability that two mutations occur in the first generation
(b) The probability that two mutations occurred in the first generation and that six
mutations occur in the first three generations
(c) The probability that two mutations occurred in the first generation, given that
six mutations occur in the first three generations
(d) The probability that six mutations occur in the first three generations, given
that two mutations occurred in the first generation
(a)
    Pr(X(1) = 2) = e^{−1.4(1)} (1.4 × 1)^2 / 2! = 0.2417
(b) By independent increments (property 1) and property 2:

    Pr(X(1) = 2 ∩ X(3) = 6) = Pr(X(1) = 2 ∩ X(3) − X(1) = 4)
                            = Pr(X(1) = 2) × Pr(X(2) = 4)
                            = 0.2417 × 0.1557
                            = 0.037637

    where Pr(X(2) = 4) = e^{−1.4(2)} (1.4 × 2)^4 / 4! = 0.1557
(c) There are actually two methods to calculate this probability. One method is to use conditional probability rules as follows:

    Pr(X(1) = 2 | X(3) = 6) = Pr(X(1) = 2 ∩ X(3) = 6) / Pr(X(3) = 6)
                            = 0.037637 / [e^{−1.4(3)} (1.4 × 3)^6 / 6!]
                            = 0.037637 / 0.11432
                            = 0.3292

The other method uses the binomial approach: given that X(3) = 6, the number of those six events falling in the first generation has a Binomial(6, 1/3) distribution, so

    Pr(X(1) = 2 | X(3) = 6) = C(6, 2) (1/3)^2 (2/3)^4 = 0.3292
(d) Note: in this case we cannot use the binomial approach, because k > n and u > t

    Pr(X(3) = 6 | X(1) = 2) = Pr(X(1) = 2 ∩ X(3) = 6) / Pr(X(1) = 2)
                            = Pr(X(1) = 2) × Pr(X(2) = 4) / Pr(X(1) = 2)   (see (b) above)
                            = Pr(X(2) = 4)
                            = 0.1557   (see (b) above)
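Both routes to part (c) can be checked numerically; the second calculation below uses the standard fact that, conditional on X(3) = 6, the number of events in the first generation is Binomial(6, 1/3):

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

rate = 1.4  # mutations per generation

# (c) via conditional probability rules:
joint = poisson_pmf(2, rate * 1) * poisson_pmf(4, rate * 2)  # X(1)=2 and X(3)-X(1)=4
p_c = joint / poisson_pmf(6, rate * 3)

# (c) via the binomial approach: given X(3) = 6, X(1) ~ Binomial(6, 1/3)
p_c_binom = binom_pmf(2, 6, 1 / 3)

assert abs(p_c - p_c_binom) < 1e-12  # the two methods agree
assert abs(p_c - 0.3292) < 5e-4
```

Note how the rate λ cancels entirely in the binomial method: the conditional answer depends only on the fraction of the interval, 1/3.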
• Let {X(t); t ≥ 0} be a Poisson process having rate parameter λ = 2.3. Determine the
following probabilities:
(a) Pr (X(1) ≤ 2)
(b) Pr (X(1) ≥ 2|X(1) ≥ 1)
(a)
    Pr(X(1) ≤ 2) = Pr(X(1) = 0) + Pr(X(1) = 1) + Pr(X(1) = 2)
                 = e^{−2.3} (1 + 2.3 + 2.3^2/2!)
                 = 0.5960
(b)
    Pr(X(1) ≥ 2 | X(1) ≥ 1) = Pr(X(1) ≥ 2 ∩ X(1) ≥ 1) / Pr(X(1) ≥ 1)
                            = Pr(X(1) ≥ 2) / Pr(X(1) ≥ 1)   since X(1) ≥ 1 is necessarily true if X(1) ≥ 2
                            = [1 − Pr(X(1) = 1) − Pr(X(1) = 0)] / [1 − Pr(X(1) = 0)]
                            = [1 − e^{−2.3(1)} (2.3 × 1)^1/1! − e^{−2.3(1)} (2.3 × 1)^0/0!] / [1 − e^{−2.3(1)} (2.3 × 1)^0/0!]
                            = 0.6691/0.8997
                            = 0.7437
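A numerical check of this example, reusing the same Poisson probabilities for both parts:

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

mu = 2.3  # lambda * t with t = 1

# (a) Pr(X(1) <= 2)
p_a = sum(poisson_pmf(k, mu) for k in range(3))

# (b) Pr(X(1) >= 2 | X(1) >= 1) = Pr(X(1) >= 2) / Pr(X(1) >= 1)
p_b = (1 - poisson_pmf(0, mu) - poisson_pmf(1, mu)) / (1 - poisson_pmf(0, mu))

assert abs(p_a - 0.5960) < 5e-4
assert abs(p_b - 0.7437) < 5e-4
```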
2.1.2 Non-Homogeneous Poisson Process
Non-homogeneous Poisson Process: Definition
• A non-homogeneous Poisson Process is one where the rate λ is not constant but
varies with time: λ = λ(t)
• Properties 1 and 3 of the homogeneous Poisson Process definition still hold (independence of increments over disjoint intervals; initial value 0), but Property 2 is different for a non-homogeneous Poisson process: the increment X(s + t) − X(s) now has a Poisson distribution with parameter Λ = ∫_s^{s+t} λ(u) du
• Therefore Pr(X(2) = 2) = e^{−3} 3^2 / 2! = 0.2240
• For the last two hours:
    Λ = ∫_2^4 (4 − t) dt
      = [4t − t^2/2]_2^4
      = (16 − 8) − (8 − 2) = 2
• Therefore Pr(X(4) − X(2) = 2) = e^{−2} 2^2 / 2! = 0.2707
2!
• Thus Pr (X (2) = 2 ∩ X (4) − X (2) = 2) = 0.2240 × 0.2707 = 0.0606
• A video camera is placed near a watering hole in the Kruger National Park to mon-
itor animals coming to drink. Within each 24 hour period, beginning at 00h00 and
ending at 24h00, the number of animals arriving to drink per hour follows a Poisson
process with the following rate function, where t is measured in hours from 00h00:
    λ(t) = (1/1000)(−t^4 + 48t^3 − 792t^2 + 5184t)   for 0 ≤ t ≤ 24
Determine the following quantities:
(a) The probability that at least three animals come to drink before 01h00.
(b) The number of animals that are expected to arrive between 06h00 and 09h30.
(a)
    Λ = ∫_0^1 λ(t) dt
      = ∫_0^1 (1/1000)(−t^4 + 48t^3 − 792t^2 + 5184t) dt
      = (1/1000) [−t^5/5 + 12t^4 − 264t^3 + 2592t^2]_0^1
      = (1/1000) [−1/5 + 12(1)^4 − 264(1)^3 + 2592(1)^2]
      = (1/1000)(2339.8)
      = 2.3398

    Pr(X(t) ≥ 3) = 1 − Pr(X(t) = 0) − Pr(X(t) = 1) − Pr(X(t) = 2)
                 = 1 − e^{−2.3398}(2.3398)^0/0! − e^{−2.3398}(2.3398)^1/1! − e^{−2.3398}(2.3398)^2/2!
                 = 0.4145
(b)
    Λ = ∫_6^{9.5} λ(t) dt
      = (1/1000) [−t^5/5 + 12t^4 − 264t^3 + 2592t^2]_6^{9.5}
      = (1/1000) ([−9.5^5/5 + 12(9.5)^4 − 264(9.5)^3 + 2592(9.5)^2] − [−6^5/5 + 12(6)^4 − 264(6)^3 + 2592(6)^2])
      = (1/1000)(89846.13 − 50284.8)
      = (1/1000)(39561.33)
      = 39.561

Thus E(X(9.5) − X(6)) = 39.561
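The two integrals above can be cross-checked by numerical integration; the sketch below uses a hand-rolled composite Simpson's rule rather than any particular library:

```python
import math

def lam(t):
    """Arrival rate (animals per hour), t in hours from 00h00."""
    return (-t**4 + 48 * t**3 - 792 * t**2 + 5184 * t) / 1000

def integrate(f, a, b, n=10_000):
    """Composite Simpson's rule (n must be even)."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# (a) expected arrivals before 01h00, then Pr(at least 3 arrivals)
mu = integrate(lam, 0, 1)
p = 1 - sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(3))
assert abs(mu - 2.3398) < 1e-3
assert abs(p - 0.4145) < 5e-4

# (b) expected number of arrivals between 06h00 and 09h30
assert abs(integrate(lam, 6, 9.5) - 39.561) < 1e-2
```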
• The mean and variance of this distribution are E(T) = 1/λ and Var(T) = 1/λ^2.
λ λ
• The exponential distribution is important in the theory of continuous-time stochastic
processes.
• Given that the event has not occurred by time t, it can be shown that the conditional
distribution of the remaining waiting time T − t does not depend on t. For any
t, s > 0,
Pr (T > t + s|T > t) = Pr (T > s)
• It means that, given that an event has not occurred up to time t, the random variable
of its remaining lifetime is statistically the same as if the process is just starting, i.e.
t=0
• For a practical example, suppose that you are standing at the side of the road waiting
for a taxi and that the waiting time for a taxi is exponentially distributed with a mean
of 2 minutes. You wait by the side of the road for two minutes and no taxi has come
yet. What is the expected waiting time now? Is it 0? No, it is still 2 minutes. That
is how the memoryless property works.
• The mean of a Gamma-distributed random variable is E(U) = α/λ and the variance is Var(U) = α/λ^2
• If n is small enough then the integral necessary to work out probabilities can be
done using integration by parts (see example below); if n is large, the probabilities
can be computed numerically in MatLab or other statistical software
Figure 2.1: Line Graph Showing State of Poisson Process over Time
• If we have a Poisson process {X(t); t ≥ 0} with rate λ > 0 then we can define the nth
waiting time, that is, the time of occurrence of the nth event, as Wn
• The differences S n = Wn+1 − Wn are called sojourn times; S n measures the duration
that the Poisson process spends in state n (see illustration in Figure 2.1)
• It can be shown that the waiting time to the first event, W1 , is a random variable
with an exponential distribution with rate λ; thus the probability density function of
W1 is
fW1 (t) = λe−λt for t ≥ 0
• It can also be shown that the sojourn times S 0 , S 1 , . . . , S n−1 are independent random
variables, each having the exponential distribution with rate λ; thus the probability
density function of S k is
fS k (s) = λe−λs for s ≥ 0
• Furthermore, because Wn = S0 + S1 + S2 + ··· + Sn−1, by the definition of the Gamma probability distribution described above, Wn has a Gamma distribution with parameters α = n and λ; thus the probability density function of Wn is

    fWn(t) = [λ/(n − 1)!] (λt)^{n−1} e^{−λt}   for t ≥ 0
= 0.002479 − 0.00004540
= 0.00243
• What is the expected waiting time before the third particle is emitted?
    E(W3) = n/λ = 3/2 = 1.5
It is expected that the third particle will be emitted after 1.5 minutes, that is, 1
minute 30 seconds.
• What is the probability that the waiting time before the third particle is emitted is
more than four minutes?
• Since our interval has no upper bound, we integrate the probability density function of W3 from 4 up to the maximum value for which the function is nonzero, which is ∞:

    Pr(W3 > 4) = ∫_4^∞ [λ/(n − 1)!] (λt)^{n−1} e^{−λt} dt
               = ∫_4^∞ [2/(2)!] (2t)^2 e^{−2t} dt
               = 4 ∫_4^∞ t^2 e^{−2t} dt

At this point we need to use integration by parts to go further. Remember that integration by parts breaks an integral down using the fact that ∫_a^b f(t)g′(t) dt = [f(t)g(t)]_a^b − ∫_a^b f′(t)g(t) dt. Here, f(t) = t^2 and g′(t) = e^{−2t}. Thus:

    Pr(W3 > 4) = 4([−(1/2)t^2 e^{−2t}]_4^∞ − ∫_4^∞ (−1/2)(2t) e^{−2t} dt)
               = 4([−(1/2)t^2 e^{−2t}]_4^∞ + ∫_4^∞ t e^{−2t} dt)

We must use integration by parts again, with f(t) = t and g′(t) = e^{−2t}:

               = 4([−(1/2)t^2 e^{−2t}]_4^∞ + [−(1/2)t e^{−2t}]_4^∞ − ∫_4^∞ (−1/2) e^{−2t} dt)
               = 4([−(1/2)t^2 e^{−2t}]_4^∞ + [−(1/2)t e^{−2t}]_4^∞ + (1/2)[−(1/2) e^{−2t}]_4^∞)
               = 4([0 − (−(1/2)(16)e^{−8})] + [0 − (−(1/2)(4)e^{−8})] + (1/2)[0 − (−(1/2)e^{−8})])
               = 4(8e^{−8} + 2e^{−8} + (1/4)e^{−8})
               = 4(10.25e^{−8})
               = 0.01375
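The integration-by-parts result can be cross-checked without any integration, using the standard identity that {Wn > t} is the same event as {X(t) ≤ n − 1}: the nth event has not yet occurred by time t if and only if at most n − 1 events occurred in (0, t]:

```python
import math

lam, n, t = 2, 3, 4

# Pr(W_3 > 4) = Pr(X(4) <= 2), a finite Poisson sum
p_tail = sum(math.exp(-lam * t) * (lam * t)**k / math.factorial(k)
             for k in range(n))

assert abs(p_tail - 41 * math.exp(-8)) < 1e-12  # matches 4(10.25 e^{-8})
assert abs(p_tail - 0.01375) < 5e-5
```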
2.1.4 Spatial Poisson Processes
Spatial Poisson Processes: Definition
• So far we have only talked about Poisson processes where the index variable is time
• However, the Poisson distribution can also describe the occurrence of objects in
space, such as:
• We define λ > 0 as the intensity of the process (the ‘rate’ at which objects occur in
space on average)
• We define |A| as the size of A (which could be length, area or volume depending on n)
    Pr(N(A) = k) = e^{−λ|A|} (λ|A|)^k / k!   for k = 0, 1, 2, ...
• The volume of a cylinder is V = πr^2 h where r is the radius and h is the height. Thus the volume of this cylinder is:
• Thus:
There is about a 98% chance that there are more than 2 bacteria in the fluid.
• Defects (air bubbles, chips, etc.) occur over the surface of a varnished table top
(1m by 2m in size) according to a Poisson process at a mean rate of one defect
per square metre. If an inspector checks two such tables for defects, what is the
probability that he will find at least one defect in both tables? (Assume that the
numbers of defects on the two table tops are independent.)
    Pr(N1(A) > 0 ∩ N2(A) > 0) = Pr(N1(A) > 0) × Pr(N2(A) > 0)
                              = (1 − Pr(N1(A) = 0)) × (1 − Pr(N2(A) = 0))
                              = (1 − e^{−1(2)} (1 × 2)^0 / 0!)^2
                              = 0.7476
There is about a 75% chance that at least one defect will be found in both tables.
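A one-line check of this calculation:

```python
import math

mu = 1 * 2  # intensity of 1 defect per square metre times |A| = 2 m^2
p_no_defect = math.exp(-mu)       # Pr(N(A) = 0)
p_both = (1 - p_no_defect)**2     # the two tables are independent

assert abs(p_both - 0.7476) < 5e-4
```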
• A pure birth process is a generalization of the Poisson process in which the prob-
ability of an event occurring at a given instant of time depends on the number of
events that have already occurred
• Now, instead of λ or λ(t) we have λk : the rate at which events occur depends on k,
the current state of the process
• The name comes from the fact that this type of process can be used to model births
in a population
• The process variable X(t) denotes the number of births in the time interval (0, t],
not necessarily the population size
• We define Pn (t) = Pr (X(t) = n|X(0) = 0) as the probability that there are n births by
time t, given that there were 0 births at time 0
• However, it can be shown that the times between births S k are exponentially dis-
tributed random variables:
S k ∼ Exponential(λk ) for k = 0, 1, 2, . . .
• Once we put in the restriction that ∑_{n=0}^∞ Pn(t) = 1 and if we assume that no two birth parameters λ0, λ1, ... are equal, then this integral equation solves as follows:

    P0(t) = e^{−λ0 t}

    Pn(t) = (∏_{ℓ=0}^{n−1} λℓ) ∑_{k=0}^{n} Bk,n e^{−λk t}
          = λ0 ··· λn−1 (B0,n e^{−λ0 t} + ··· + Bn,n e^{−λn t}),   n = 1, 2, ..., where

    Bk,n = ∏_{j=0; j≠k}^{n} (λj − λk)^{−1}
         = 1 / [(λ0 − λk) ··· (λk−1 − λk)(λk+1 − λk) ··· (λn − λk)],   k = 0, 1, ..., n
• We can use this formula to find any Pn (t), although obviously this formula gets very
complicated as n increases
• A pure birth process starting from X(0) = 0 and with time measured in years has
birth parameters λ0 = 1, λ1 = 3, λ2 = 2, λ3 = 5.
1.
2. The probability that there will be more than three births in the first two years
is:
    Pr(X(2) > 3) = ∑_{n=4}^∞ Pn(2)
                 = 1 − ∑_{n=0}^{3} Pn(2)
                 = 1 − P0(2) − P1(2) − P2(2) − P3(2)
                 = 1 − e^{−2} − ((1/2)e^{−2} − (1/2)e^{−6})
                   − 3((1/2)e^{−2} + (1/2)e^{−6} − e^{−4}) − 6((1/8)e^{−2} + (1/4)e^{−6} − (1/3)e^{−4} − (1/24)e^{−10})
                 = 1 − (15/4)e^{−2} − (5/2)e^{−6} + 5e^{−4} + (1/4)e^{−10}
                 = 0.5779
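The Bk,n formula can be implemented directly and used to verify this example; a sketch:

```python
import math

def pure_birth_pn(n, t, lams):
    """P_n(t) for a pure birth process with distinct birth parameters
    lams[0], lams[1], ... (the formula above requires all distinct)."""
    if n == 0:
        return math.exp(-lams[0] * t)
    prod = math.prod(lams[:n])  # lambda_0 * ... * lambda_{n-1}
    total = 0.0
    for k in range(n + 1):
        B = 1.0
        for j in range(n + 1):
            if j != k:
                B /= lams[j] - lams[k]
        total += B * math.exp(-lams[k] * t)
    return prod * total

lams = [1, 3, 2, 5]
p_more_than_3 = 1 - sum(pure_birth_pn(n, 2, lams) for n in range(4))
assert abs(p_more_than_3 - 0.5779) < 5e-4
# sanity check at t = 0: the process starts in state 0 with certainty
assert abs(sum(pure_birth_pn(n, 0.0, lams) for n in range(4)) - 1.0) < 1e-12
```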
2.2.2 Yule Process
The Yule Process
• The Yule Process is a special case of the Pure Birth Process in which λk = kβ
◦ This means that the birth rate is directly proportional to the population size,
with the proportionality constant being the individual birth rate β
◦ As such, the Yule Process is a stochastic analogue of the deterministic popu-
lation growth model represented by the ODE dy/dt = αy
• With a Yule Process we usually let X(0) = 1
• We can derive the following ODE:

    Pn′(t) = −β(nPn(t) − (n − 1)Pn−1(t))   for n = 1, 2, ...

• Under the initial conditions P1(0) = 1, Pn(0) = 0 for n = 2, 3, ... the solution is:

    Pn(t) = e^{−βt} (1 − e^{−βt})^{n−1}   for n ≥ 1
• This is the probability mass function of the geometric distribution with p = e−βt
• Remember that the negative binomial distribution is used to model the number of
independent binomial experiment trials required to achieve k successes where the
probability of success in each trial is p
• The geometric distribution is a special case of the negative binomial distribution
where k = 1
• Thus E(X(t)) = 1/e^{−βt} = e^{βt} and Var(X(t)) = (1 − e^{−βt})/(e^{−βt})^2 = e^{2βt} − e^{βt}
Yule Process: Example
1.
    P6(3/4) = e^{−3.2(3/4)} (1 − e^{−3.2(3/4)})^{6−1}
            = e^{−2.4} (1 − e^{−2.4})^5
            = 0.0564
The probability that the population size will be exactly 6 after 45 seconds is
about 6%
2. We will use the following formula for the first N terms of a geometric series:

    ∑_{n=1}^{N} ar^{n−1} = a(1 − r^N)/(1 − r)

    ∑_{n=9}^∞ Pn(3/4) = 1 − ∑_{n=0}^{8} Pn(3/4)
                      = 1 − ∑_{n=1}^{8} Pn(3/4)   since P0(t) = 0

we can compute this sum using the geometric series formula with a = e^{−βt}, r = 1 − e^{−βt} and N = 8:

                      = 1 − [1 − (1 − e^{−βt})^8]
                      = (1 − e^{−βt})^8
                      = (1 − e^{−(3.2)(3/4)})^8
                      = 0.4673
Thus the probability that the population size will be greater than 8 after 45
seconds is about 47%.
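Both parts of this example can be verified from the geometric form of Pn(t); a sketch:

```python
import math

def yule_pn(n, t, beta):
    """P_n(t) for a Yule process with X(0) = 1: geometric with p = e^{-beta t}."""
    p = math.exp(-beta * t)
    return p * (1 - p)**(n - 1)

beta, t = 3.2, 0.75  # 45 seconds = 3/4 of a minute

# part 1: Pr(X(3/4) = 6)
assert abs(yule_pn(6, t, beta) - 0.0564) < 5e-4

# part 2: Pr(X(3/4) > 8), summed directly and via the closed form
p_gt_8 = 1 - sum(yule_pn(n, t, beta) for n in range(1, 9))
assert abs(p_gt_8 - (1 - math.exp(-beta * t))**8) < 1e-12
assert abs(p_gt_8 - 0.4673) < 5e-4
```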
• A pure death process begins in state N and then moves successively through states
N − 1, N − 2, . . . , 2, 1 and finally is absorbed into state 0
• This parameter gives the average rate of deaths which depends on the current pop-
ulation
• By convention we set µ0 = 0
• Similarly to the pure birth process, as long as no two death parameters are equal
(µ j , µk if j , k) then we can state the transition probabilities of the process
explicitly
    PN(t) = e^{−μN t}

    Pn(t) = (∏_{ℓ=n+1}^{N} μℓ) ∑_{k=n}^{N} Ak,n e^{−μk t}
          = μn+1 μn+2 ··· μN (An,n e^{−μn t} + ··· + AN,n e^{−μN t}),   n = N − 1, N − 2, ..., 2, 1, 0,

    where Ak,n = ∏_{j=n; j≠k}^{N} (μj − μk)^{−1}
               = 1 / [(μn − μk) ··· (μk−1 − μk)(μk+1 − μk) ··· (μN − μk)]   for k = n, n + 1, ..., N
Since the process starts in state 3, the only possible values of X(t) are 0, 1, 2, and 3, and therefore P0(t) = 1 − [P1(t) + P2(t) + P3(t)].
As an exercise, use the general Pn (t) formula to find P0 (t) and see if you get
the same answer (you should!)
2. To find E[X(2)], we can use the basic definition of expected value for a discrete random variable Y, i.e. E(Y) = ∑_y y Pr(Y = y). In this case, our random variable is X(t)|X(0) = N, so:

    E[X(t)|X(0) = N] = ∑_{n=0}^{N} n Pn(t) = 0P0(t) + 1P1(t) + 2P2(t) + 3P3(t)
        = (−5e^{−3t} + (10/3)e^{−2t} + (5/3)e^{−5t}) + 2((5/3)e^{−2t} − (5/3)e^{−5t}) + 3e^{−5t}
        = −5e^{−3t} + (20/3)e^{−2t} + (4/3)e^{−5t}

    E[X(2)] = −5e^{−3(2)} + (20/3)e^{−2(2)} + (4/3)e^{−5(2)}
            = 0.1098
Figure 2.3: Probability Functions Pn (t) for a Pure Death Process, n = 0, 1, 2, 3
• We suppose that µk = kα, that is, the average death rate is proportional to the current
population
• α is the proportionality constant representing the individual death rate in the popu-
lation
• That is, X(t) is binomially distributed with the number of trials being N and the
probability of success p = e−αt
• This implies that E(X(t)) = Ne^{−αt} and Var(X(t)) = Ne^{−αt}(1 − e^{−αt})
• We can also define T as the time of extinction, that is, T = min{t ≥ 0; X(t) = 0}
• It can be shown that Pr(T < t) = (1 − e^{−αt})^N and E(T) = (1/α)(1 + 1/2 + 1/3 + ··· + 1/N)
• Note that, for large N, the harmonic series 1 + 1/2 + 1/3 + ··· + 1/N ≈ ln N + γ where γ is the Euler–Mascheroni constant ≈ 0.5772156649. Thus we can approximate

    E(T) ≈ (1/α)(ln N + γ)
• Consider a linear death process in which the initial population is 5 and the individual
average death rate is 2.
1.
    N = 5 and α = 2

    Pr(X(t) = 3 | X(0) = 5) = P3(t)
                            = C(5, 3) e^{−2(3)t} (1 − e^{−2t})^{5−3}
                            = 10e^{−6t} (1 − e^{−2t})^2
2.
    Pr(T < 2) = (1 − e^{−α(2)})^N
              = (1 − e^{−2(2)})^5
              = (1 − e^{−4})^5
              = 0.9117
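A numerical check of this example (and of the E(T) formula above):

```python
import math

N, alpha = 5, 2

def linear_death_pn(n, t):
    """P_n(t) = Pr(X(t) = n | X(0) = N): Binomial(N, e^{-alpha t})."""
    p = math.exp(-alpha * t)
    return math.comb(N, n) * p**n * (1 - p)**(N - n)

# part 1: P_3(t) matches 10 e^{-6t} (1 - e^{-2t})^2 at, say, t = 1
assert abs(linear_death_pn(3, 1.0) - 10 * math.exp(-6) * (1 - math.exp(-2))**2) < 1e-15

# part 2: Pr(T < 2) = Pr(X(2) = 0) = (1 - e^{-4})^5
assert abs(linear_death_pn(0, 2.0) - 0.9117) < 5e-4

# expected extinction time: E(T) = (1/alpha)(1 + 1/2 + ... + 1/N)
expected_T = sum(1 / k for k in range(1, N + 1)) / alpha
assert abs(expected_T - 1.1417) < 5e-4
```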
2.4 Birth-and-Death Process
Birth-and-Death Process: Definition
• A birth-and-death process combines the birth process and death process into a
single process
• We have a population in which births occur at rate λk and deaths occur with rate µk
both of which depend on the current population k
• We can define Pi j (t) = Pr (X(t + s) = j|X(s) = i) for all s ≥ 0 as the probability that
the population changes from i to j during a period of length t
• This leads to some stochastic differential equations that are beyond our scope, but
we consider one special case: linear growth with immigration
• If the initial population is X(0) = i, it can be shown that the expected population size at time t satisfies the equation:

    E[X(t)] = at + i   if λ = μ
    E[X(t)] = [a/(λ − μ)] [e^{(λ−μ)t} − 1] + i e^{(λ−μ)t}   if λ ≠ μ
• Expected population size after 10 weeks:

    E[X(t)] = [a/(λ − μ)] [e^{(λ−μ)t} − 1] + i e^{(λ−μ)t}
            = [3/(2 − 2.2)] [e^{(2−2.2)(10)} − 1] + 400e^{(2−2.2)(10)}
            = 67.104
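A sketch of the expected-population formula, checked against this example:

```python
import math

def expected_population(t, i, a, lam, mu):
    """E[X(t)] for linear growth with immigration, X(0) = i."""
    if lam == mu:
        return a * t + i
    r = math.exp((lam - mu) * t)
    return a / (lam - mu) * (r - 1) + i * r

# immigration rate a = 3, birth rate 2, death rate 2.2, X(0) = 400, t = 10
assert abs(expected_population(10, 400, a=3, lam=2, mu=2.2) - 67.104) < 5e-3
```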
3 Generalised Linear Models with Biological Applications
The Response Variable
• In Statistics 2A, you were introduced to the linear regression model (both sim-
ple and later multiple), which relates a dependent variable or response variable,
usually denoted yi , to one or more independent variables or explanatory vari-
ables x1 , x2 , . . . , xk and their coefficients β0 , β1 , β2 , . . . , βk (including the intercept
β0 ), which are treated as constant but unknown parameters
• Thus, we deal with the fact that our response variable does not have an exact linear relationship with our explanatory variable(s) by including εi, a random error or disturbance
• We make certain assumptions about the εi, namely that the εi are all independent of each other, and that they are normally distributed with a mean of 0 and a constant variance
• Since the normal distribution is a continuous distribution, this means that the yi
values are also continuous, and so if we want the linear regression model to fit our
data well, the yi values in our data should be continuous or at least approximately
continuous
• What happens if we want to model the relationship between a response variable that
is not continuous and some explanatory variables?
◦ A binary or dichotomous variable (e.g., ‘Yes’ vs. ‘No’, ‘Survived’ vs. ‘Died’, ‘Passed’ vs. ‘Failed’, etc.)
◦ A categorical variable with more than two categories (a polychotomous vari-
able), where the measurement scale could be nominal or ordinal
◦ A count variable that counts events or objects in time or space; the values
will therefore be integers
• In this section we will discuss a class of models called Generalized Linear Mod-
els (GLMs), and a couple of special cases for modelling relationships involving a
binary response variable or a count response variable
Expected Response in the Linear Regression Model
• In order to introduce GLMs, let us consider how to generalise some of the features
of the linear regression model
• The left side of the equation, as we know, consists of the response variable yi
• The right side consists of two parts: β0 + β1 xi1 + β2 xi2 + ··· + βk xik, which we refer to as the linear predictor (which we can call ηi), and εi, the random error
• Now, one quantity of great interest in the model is the expected response, E(yi), which we can call μi
• This can be derived easily as follows for the linear regression model:

    μi = E(yi) = E(ηi + εi)
       = E(ηi) + E(εi)
       = ηi + 0
       = ηi
• The linear predictor has an expected value equal to itself because it does not contain
any random variables; the β j are all assumed to be constants and the xi j are assumed
to be fixed as well
• The random error has an expected value of 0 because this is one of the model
assumptions
• Thus, in the linear regression model, the expected response µi is equal to the linear
predictor ηi :
µi = ηi
• GLMs are more general than the linear regression model in that, instead of assuming
that µi = ηi (that the linear predictor is equal to the expected response), they assume
that g(µi ) = ηi (that the linear predictor is equal to some function of the expected
response)
• The function g(·) is known as the link function because it links the linear predictor
to the expected response
• From the above equation we can see that the following is also true:
µi = E (yi ) = g−1 (ηi ) = g−1 (β0 +β1 xi1 +β2 xi2 +· · ·+βk xik ) where g−1 (·) is the inverse of g(·)
• Note the linear regression model is a GLM; it is just that in this case the link function
is g(µi ) = µi
• In this section we will be considering four other GLMs that differ from linear re-
gression in two ways:
◦ The probability distribution of the response variable (conditional on the linear
predictor): in the linear regression model the distribution is assumed to be
normal, but this is not a helpful assumption when we have a binary or count
response
◦ The link function will no longer be simply g(µi ) = µi
• Specifically, we will be considering four models:
1. Binary Logistic Regression
2. Probit Regression
3. Poisson Regression
4. Negative Binomial Regression
• The first two models above have a binary response variable (they can be extended
to deal with polychotomous response variables, whether nominal or ordinal, but we
will not cover that)
• The last two models above have a count response variable
• The name of each model comes either from the link function or from the conditional
distribution of the response variable
Table 3.1: Caesarean Section Data from a Sample of n = 30 Women
Would Linear Regression Work with a Binary Response?
• Although it is true that the predicted values ŷi = p̂i tend to be smaller when the
actual yi value is 0 than when the actual yi value is 1, the predicted values are not
actually 0’s and 1’s
• Hence we can instead interpret them as probabilities p̂i , but there is again a major
problem: some of the fitted p̂i values are below 0, and others are above 1, as can be
seen in Figure 3.1
• So, to summarise, a major difficulty with using a linear regression model with a
binary response variable is that we cannot ensure that our predicted probabilities
are between 0 and 1
• The problem we have just discussed motivates us to consider a different form of the
link function g(·) that can ensure that the predicted probabilities fall between 0 and
1
• When the input to a logit function is a probability (as it is in this case), we can also refer to the logit as the log-odds
◦ Notice that if Pr(A) = 1/2, the odds will be 1; if Pr(A) > 1/2, the odds will be > 1; and if Pr(A) < 1/2, the odds will be < 1
◦ Notice further that if Pr(A) = 0, the odds will be 0, and if Pr(A) = 1, the odds will be infinite; thus odds always fall in the interval [0, ∞)
◦ In the logit function, p is a probability and thus p/(1 − p) is an odds, so the logit is the log of the odds, or log-odds
• In order to see the usefulness of the logit function we need to find its inverse:

η = log [p/(1 − p)]
eη = p/(1 − p)
p = eη(1 − p) = eη − eη p
p(1 + eη) = eη
p = eη/(1 + eη)

• Thus g−1(η) = eη/(1 + eη)
• Note that this function can also be written as 1/(1 + e−η):

g−1(η) = [eη/(1 + eη)] × [e−η/e−η] = 1/(e−η + 1) = 1/(1 + e−η)
• We can readily see that the values of g−1(η) always fall in the interval (0, 1), so predicted probabilities generated from this model will also fall into this interval (see Figure 3.3)
• We can observe from Figure 3.2 (or from the vertical axis of Figure 3.3, which is
basically Figure 3.2 turned on its side) that the logit function is almost a straight
line for much of its domain, for approximately p ∈ [0.2, 0.8]
• This means it is only for extreme values that a linear model with logit link function
will give very different results from a linear model with identity link function (as in
linear regression)
Figure 3.3: Inverse of Logit Function
• Using the link function described above, we can specify the model equation for
logistic regression as follows:
log [pi/(1 − pi)] = ηi = β0 + β1xi1 + β2xi2 + · · · + βk xik, where pi = Pr(yi = 1|ηi)

• Alternatively, using the inverse of the logit function, we can specify the model equation as

pi = 1/(1 + exp{−(β0 + β1xi1 + β2xi2 + · · · + βk xik)})
• The parameters are estimated using the Method of Maximum Likelihood; since this method cannot be implemented easily by hand and is quite technical, we will omit the details and use statistical software to fit the model for us
• SAS output from fitting the Caesarean Section data to a logistic regression model
is shown in Table 3.2
Table 3.2: Parameter Estimates from Logistic Regression Model Fit to Caesarean Section
Data
• The β̂0, β̂1, and β̂2 estimates are shown in the Estimate column; thus the fitted model equation is

log [p̂i/(1 − p̂i)] = −56.7017 + 0.3061x1 + 1.1933x2, or p̂i = 1/(1 + exp{56.7017 − 0.3061x1 − 1.1933x2})

• (Note the change of signs inside the exp if we are using the second equation!)
• Note that in this case the significance test is not a t-test but an approximate χ2 test;
for further details of the test refer to Statistics 2A notes
◦ Let us start with the intercept β̂0. Recall that in a linear regression model, β̂0 is the expected value of the response variable y when all the explanatory variables are set to 0. In this case, as we can see from the fitted model equation, β̂0 is the value of the log-odds when all the explanatory variables are set to 0:

log [p̂i/(1 − p̂i)] = β̂0 + β̂1(0) + β̂2(0) = β̂0

◦ We can make this interpretation a little more friendly by taking exp{·} of both sides, to give us the odds rather than the log-odds:

p̂i/(1 − p̂i) = exp{β̂0}
◦ The interpretations of the gradient estimates β̂1 and β̂2 are more meaningful and useful. Recall that in a linear regression, we interpret the gradient β̂1 as follows: if x1 increases by 1 unit, the expected response ŷ increases by β̂1 units. In this case, we interpret as follows: if x1 increases by 1 unit, the expected log-odds of a Caesarean Section, log [p̂i/(1 − p̂i)], increases by β̂1 units. Or, taking e to the power of both sides, we can say: if x1 increases by 1 unit, the expected odds of a Caesarean Section, p̂i/(1 − p̂i), increase by a factor of exp{β̂1} (or, if β̂1 < 0, decrease by a factor of exp{−β̂1})
◦ Thus, in our Caesarean Section model, we interpret as follows: for every one-
year increase in a woman’s age, the odds of her having a Caesarean Section
are expected to increase by a factor of exp{0.3061} = 1.3581 (that is, by
35.81%). For every one-week increase in a woman’s gestation period, the
odds of her having a Caesarean Section are expected to increase by a factor of
exp{1.1933} = 3.2979 (that is, by more than three times, or by 329.79%)
• We can use a logistic regression model to make predictions: by substituting the xi1 and xi2 values of individual observations into the fitted model equation (usually the one in the form p̂i = 1/(1 + exp{−(β̂0 + β̂1x1 + β̂2x2 + · · · + β̂k xk)})), we obtain predicted probabilities that the response variable equals 1 (in this case, that the woman has a Caesarean Section)
• For example, the first woman in the data is 25 years old (xi1 = 25) and had a 40-week gestation (xi2 = 40), so

p̂i = 1/(1 + exp{56.7017 − 0.3061(25) − 1.1933(40)}) = 0.2113
• This probability is fairly low, so it is more likely that yi = 0 in this case (as, in fact,
it is)
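The calculation above is easy to verify; a quick Python check using the coefficient estimates from Table 3.2 (the helper function name is our own):

```python
import math

b0, b1, b2 = -56.7017, 0.3061, 1.1933  # estimates from Table 3.2

def predict_prob(x1, x2):
    # p-hat = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
    eta = b0 + b1 * x1 + b2 * x2
    return 1 / (1 + math.exp(-eta))

p = predict_prob(25, 40)   # first woman: age 25, gestation 40 weeks
print(round(p, 4))         # 0.2113
```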
• To make predictions, we need a decision rule consisting of a threshold value τ for p̂i, so that:

ŷ = 1 if p̂ ≥ τ, and ŷ = 0 if p̂ < τ
• An obvious value of τ to choose would be 0.5, since p̂ ≥ 0.5 means that the event y = 1 is more likely than the event y = 0 (since its probability is ≥ 0.5), while p̂ < 0.5 means that the event y = 1 is less likely than the event y = 0
• However, we may not always want to use a threshold of τ = 0.5; sometimes we are
modelling rare events (e.g., insurance fraud) where the probability that the response
is 1 is very low; using τ = 0.5 would lead us to predict y = 0 for all observations,
which means we would never successfully predict fraud
Evaluating the Predictive Ability of a Logistic Regression Model: Sensitivity and
Specificity
• How do we know how well a logistic regression model is predicting the response
variable?
• For starters, we should not really evaluate a predictive model using the same data
that we used to fit the model (which is called training data); we should actually
obtain additional data that was not used to fit the model and make predictions on
this data (which is called test data)
◦ If you continue your studies to the Advanced Diploma, you will learn about
a more advanced way of splitting the available data set into training and test
sets, known as cross-validation
◦ You will also learn many other classification methods that, like logistic re-
gression, are designed to ‘classify,’ that is, to predict the values of a binary
response
• However, for simplicity’s sake let us consider the problem of evaluating predictions
on our training data in the Caesarean Section example
◦ The Accuracy of a model is the overall proportion of cases that were correctly predicted, regardless of class:

Accuracy = (# of negatives predicted as negatives + # of positives predicted as positives)/(total # of cases predicted)
= Pr ((y = 0 ∩ ŷ = 0) ∪ (y = 1 ∩ ŷ = 1))
◦ This is done for the Caesarean Section data (using a threshold of τ = 0.5) in
Table 3.3
Table 3.3: Making and Classifying Predictions on Caesarean Section Data with Threshold
τ = 0.5
Observation xi1 xi2 yi p̂i ŷi Result
1 25 40 0 0.2116 0 True Negative
2 30 40 1 0.5536 1 True Positive
3 37 40 1 0.9136 1 True Positive
4 26 38 0 0.0324 0 True Negative
5 33 36 0 0.0256 0 True Negative
6 26 39 0 0.0995 0 True Negative
7 24 42 1 0.6825 1 True Positive
8 19 40 0 0.041 0 True Negative
9 26 41 1 0.5459 1 True Positive
10 18 42 0 0.2551 0 True Negative
11 32 40 1 0.6958 1 True Positive
12 23 40 1 0.127 0 False Negative
13 24 42 1 0.6825 1 True Positive
14 29 38 0 0.0774 0 True Negative
15 23 42 1 0.6128 1 True Positive
16 25 38 0 0.0241 0 True Negative
17 33 40 0 0.7565 1 False Positive
18 29 40 0 0.4773 0 True Negative
19 37 38 0 0.4928 0 True Negative
20 31 42 1 0.9482 1 True Positive
21 16 38 0 0.0016 0 True Negative
22 22 39 0 0.0315 0 True Negative
23 30 40 0 0.5536 1 False Positive
24 21 38 0 0.0072 0 True Negative
25 17 42 0 0.2014 0 True Negative
26 18 39 0 0.0095 0 True Negative
27 23 42 0 0.6128 1 False Positive
28 28 38 0 0.0582 0 True Negative
29 31 43 1 0.9837 1 True Positive
30 38 37 1 0.2858 0 False Negative
                               Actual Positive (yi = 1)   Actual Negative (yi = 0)   Total
Predicted Positive (ŷi = 1)    9 (TP)                     3 (FP)                     12
Predicted Negative (ŷi = 0)    2 (FN)                     16 (TN)                    18
Total                          11                         19                         30
◦ Specificity = (# of true negatives)/(# of actual negatives) = 16/19 = 0.8421 (16 out of 19 actual negatives were predicted as negatives)
◦ Note that sensitivity is equivalent to the concept of ‘power’ in hypothesis test-
ing
◦ Note that, in addition to sensitivity and specificity, we can calculate the empirical Type I error and Type II error rates as follows:

Type I Error Rate = (# of false positives)/(# of total negatives) = 1 − Specificity
Type II Error Rate = (# of false negatives)/(# of total positives) = 1 − Sensitivity
◦ Thus, in this case the type I error rate is 1 − 16/19 = 3/19 = 0.1579 and the type II error rate is 1 − 9/11 = 2/11 = 0.1818
◦ Similarly, we can compute the model accuracy:

Accuracy = (# of true positives + # of true negatives)/(# of cases predicted) = (9 + 16)/30 = 0.8333

And thus the misclassification rate is 1 − Accuracy = 1 − 0.8333 = 0.1667
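The metrics above follow directly from the four confusion-matrix counts; a short Python check:

```python
# Counts from the confusion matrix for the Caesarean Section data (tau = 0.5)
tp, fp, fn, tn = 9, 3, 2, 16

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + fn + tn)  # overall proportion correct

print(round(sensitivity, 4))  # 0.8182
print(round(specificity, 4))  # 0.8421
print(round(accuracy, 4))     # 0.8333
```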
• The sensitivity and specificity values above were calculated for one specific choice
of the threshold τ
• Sensitivity and specificity are inversely related (like type I and type II error), so that
when one increases the other decreases
• We can then create a graph with 1 − specificity (essentially type I error rate) on the
horizontal axis and sensitivity (essentially power) on the vertical axis
• We can also include a dashed line through the origin with gradient 1 on the graph;
this corresponds to a baseline model where our predictions are completely random
(e.g., we generate a random number between 0 and 1 and use this as our p̂i )
◦ Can you see why, in such cases, sensitivity and 1−specificity will tend to be
equal?
• This graph is called a Receiver Operating Characteristic (ROC) Curve, and
gives us an idea of how good our predictive model is overall
• Basically, the closer to the top left corner of the graph the curve is, the better the
predictive model
• The closer to the 45◦ line the curve is, the worse the predictive model (i.e., the more
similar it is to a completely random prediction)
• The ROC curve for the Caesarean Section logistic regression model is shown in Figure 3.4
Figure 3.4: Receiver Operating Characteristic Curve for Logistic Regression Model Fit to
Caesarean Section Data
• We can see that this particular model performs reasonably well; the ROC curve is
quite high above that of the baseline random prediction model
• The reason why the curve has steps instead of being smooth is that the sensitivity and specificity values only change when the threshold crosses one of the p̂i values for the observations in our data. If we had a very large and diverse data set, the ROC curve would be smoother
• While the response variable in binary logistic regression is always binary, the explanatory variables can also be binary rather than continuous
• Consider a medical researcher who is investigating the effectiveness of a certain
treatment for treating male and female patients with a certain skin condition
• Table 3.5 displays data collected on sixteen patients, half of whom were given a new
experimental treatment and the other half received the standard treatment. After 6
months it was recorded whether each patient’s skin condition had healed. We define
the following variables:
Let yi = 1 if the ith patient had recovered from the skin condition after 6 months, and yi = 0 if the ith patient had not recovered from the skin condition after 6 months
Let xi1 = 1 if the ith patient received the new experimental treatment, and xi1 = 0 if the ith patient received the standard treatment
Let xi2 = 1 if the ith patient was male, and xi2 = 0 if the ith patient was female
Table 3.5: Data on Patient Recovery after Treatment for Skin Condition
• A logistic regression model is fit to the data using Maximum Likelihood Estimation.
The SAS output is shown in Table 3.6
• Now, we can see that the β1 and β2 coefficients are not statistically significant in
this case at 5% significance level (their p-values are above 0.05), but let us proceed
anyway with interpreting the parameter estimates and making predictions
• The interpretation of the parameter estimates changes because it no longer makes
sense to speak of a ‘one-unit increase’ in x1 or x2 since they are categorical variables
Table 3.6: Parameter Estimates from Logistic Regression Model Fit to Skin Condition
Data
• Instead we make a comparison between the ‘1’ category and the ‘0’ category of the
variables
◦ In the case of the intercept estimate β̂0 = −1.3815, this corresponds to the
expected log-odds of a patient having recovered from the skin condition given
that the patient was on the standard treatment (x1 = 0) and that the patient was
female (x2 = 0)
◦ Taking exp{−1.3815} = 0.2512, we can say that the estimated odds of a female
patient who was on the standard treatment having recovered from the skin
condition are 0.2512
◦ Coming to the gradients, β̂1 = 2.7630 is the estimated difference in the log-odds of recovery between a patient on the new experimental treatment and a patient on the standard treatment. A simpler interpretation is possible if we take exp{β̂1} = exp{2.7630} = 15.8473. This is an odds ratio: the odds of recovery for a patient on the new experimental treatment are estimated to be 15.8473 times the odds of recovery for a patient on the standard treatment
◦ Similarly, exp{β̂2 } = exp{−1.5791} = 0.2062 means that the odds of recovery
for a male patient are estimated to be 0.2062 times the odds of recovery for
a female patient. Or, we can change the sign to exp{1.5791} = 4.8506 and
change the comparison around: the odds of recovery for a female patient are
estimated to be 4.8506 times the odds of recovery for a male patient
• Table 3.7 gives the predicted probabilities calculated from the equation

p̂i = 1/(1 + exp{−(−1.3815 + 2.7630xi1 − 1.5791xi2)})
• This time, there are only four different combinations of xi1 , xi2 values and thus only
four different p̂i values, which makes it feasible (although still somewhat time-
consuming) to work out the ROC Curve by hand
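These four fitted probabilities can be computed directly; a short Python check using the estimates in Table 3.6 (helper function name ours):

```python
import math

b0, b1, b2 = -1.3815, 2.7630, -1.5791  # estimates from Table 3.6

def predict_prob(x1, x2):
    # inverse-logit of the linear predictor
    return 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

# the four (treatment, sex) combinations
for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, round(predict_prob(x1, x2), 4))
# probabilities: 0.2008, 0.7992, 0.0492, 0.4508 (matching Table 3.7)
```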
• The workings are shown in Table 3.8, and these give us the table of values in 3.9
that we use to create the ROC Curve in 3.5
Table 3.7: Predicted Probabilities of Recovery for Skin Condition Patients from Logistic
Regression Model
Table 3.8: ROC Curve Workings for Skin Condition Logistic Regression Model
τ=0 τ = 0.0492 τ = 0.2008 τ = 0.4508 τ = 0.7992 τ=1
yi p̂i ŷi Prediction ŷi Prediction ŷi Prediction ŷi Prediction ŷi Prediction ŷi Prediction
Result Result Result Result Result Result
0 0.4508 1 FP 1 FP 1 FP 0 TN 0 TN 0 TN
0 0.0492 1 FP 0 TN 0 TN 0 TN 0 TN 0 TN
1 0.4508 1 TP 1 TP 1 TP 0 FN 0 FN 0 FN
0 0.0492 1 FP 0 TN 0 TN 0 TN 0 TN 0 TN
1 0.7992 1 TP 1 TP 1 TP 1 TP 1 TP 0 FN
1 0.2008 1 TP 1 TP 0 FN 0 FN 0 FN 0 FN
1 0.7992 1 TP 1 TP 1 TP 1 TP 1 TP 0 FN
0 0.2008 1 FP 1 FP 0 TN 0 TN 0 TN 0 TN
0 0.4508 1 FP 1 FP 1 FP 0 TN 0 TN 0 TN
0 0.0492 1 FP 0 TN 0 TN 0 TN 0 TN 0 TN
1 0.4508 1 TP 1 TP 1 TP 0 FN 0 FN 0 FN
0 0.0492 1 FP 0 TN 0 TN 0 TN 0 TN 0 TN
0 0.7992 1 FP 1 FP 1 FP 1 FP 1 FP 0 TN
0 0.2008 1 FP 1 FP 0 TN 0 TN 0 TN 0 TN
1 0.7992 1 TP 1 TP 1 TP 1 TP 1 TP 0 FN
0 0.2008 1 FP 1 FP 0 TN 0 TN 0 TN 0 TN
Sensitivity = #TP/(#TP + #FN), by column: 6/6 = 1 | 6/6 = 1 | 5/6 = 0.8333 | 3/6 = 0.5 | 3/6 = 0.5 | 0/6 = 0
Table 3.9: Sensitivity and 1−Specificity Values for ROC Curve
of Logistic Regression Model of Recovery from Skin Condition

1−Specificity  Sensitivity
0              0
0.1            0.5
0.1            0.5
0.3            0.8333
0.6            1
1              1
Figure 3.5: ROC Curve for Logistic Regression Model Fit to Skin Condition Recovery
Data
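The by-hand ROC workings above can be reproduced programmatically. A Python sketch using the observed responses and fitted probabilities from Table 3.8 and the rule ŷ = 1 when p̂ ≥ τ; sweeping τ from strictest to most lenient recovers the distinct points of the curve:

```python
# Observed responses and fitted probabilities from Table 3.8
y = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
p = [0.4508, 0.0492, 0.4508, 0.0492, 0.7992, 0.2008, 0.7992, 0.2008,
     0.4508, 0.0492, 0.4508, 0.0492, 0.7992, 0.2008, 0.7992, 0.2008]

def roc_point(tau):
    # classify with threshold tau, then return (1 - specificity, sensitivity)
    yhat = [1 if pi >= tau else 0 for pi in p]
    tp = sum(1 for yi, yh in zip(y, yhat) if yi == 1 and yh == 1)
    fp = sum(1 for yi, yh in zip(y, yhat) if yi == 0 and yh == 1)
    pos = sum(y)            # 6 actual positives
    neg = len(y) - pos      # 10 actual negatives
    return round(fp / neg, 4), round(tp / pos, 4)

# sweep thresholds from strictest to most lenient
for tau in [1.01, 0.7992, 0.4508, 0.2008, 0.0492]:
    print(tau, roc_point(tau))
# yields (0, 0), (0.1, 0.5), (0.3, 0.8333), (0.6, 1), (1, 1)
```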
• Like binary logistic regression, probit regression is used when there is a binary
response variable
• Remember that in binary-response regression models our µi = E(yi) is a probability pi (since E(yi) in a Bernoulli distribution is pi, the probability of ‘success’ in the single trial)
• This means we need a link function that is always going to produce values of p̂i
between 0 and 1
◦ The logit link function g(p) = log [p/(1 − p)] used in logistic regression is one such link function; the identity link function g(p) = p that is used in linear regression is not
• In probit regression we use as our link function the inverse of a cumulative dis-
tribution function of a probability distribution, most commonly of the standard
normal distribution
• The standard normal cumulative distribution function Φ(z) gives the area under the standard normal probability density function between −∞ and z; in other words, it gives us Pr(Z < z) from the standard normal distribution for any value z, as pictured in Figure 3.6
Figure 3.6: Graph Showing Cumulative Distribution Function Φ(z) for Standard Normal
Distribution for a particular value of z
Figure 3.7: Graph Showing Cumulative Distribution Function Φ(z) for Standard Normal
Distribution
• The graph of Φ(z) itself is shown in Figure 3.7; it is clear that this function will
always give us values between 0 and 1
• Note that Φ(z) is precisely the function whose values we look up in the ‘Z Table’
(Table 4.3) when we want to find probabilities from the standard normal distribution
• Now, recall that the general form of a GLM is g(µi ) = ηi where (in this case) µi = pi
and where ηi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik
• It is the values of pi that we need to be between 0 and 1, and pi is the input argument of the function g(·) rather than the output; this means that using g(·) = Φ(·) as the link function would not guarantee that pi is between 0 and 1, since the domain of Φ(·) is from −∞ to ∞; it is the range of Φ(·) that is from 0 to 1
• Thus our link function will actually be the inverse of Φ(·): g(pi) = Φ−1(pi) = ηi, which is equivalent to pi = Φ(ηi) = Φ(β0 + β1xi1 + β2xi2 + · · · + βk xik)
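The probit link and its inverse can be sketched numerically without any statistics library: Φ can be built from the error function, and Φ−1 found by bisection (the helper names below are our own; this is an illustrative sketch, not production code):

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    # probit link Phi^{-1}(p) by simple bisection
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eta = Phi_inv(0.3)
print(round(eta, 4))             # about -0.5244
assert abs(Phi(eta) - 0.3) < 1e-9  # the CDF undoes the probit link
```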
Probit Regression: Application in Toxicology
• Technically, any data set that is appropriate for logistic regression is also appropriate
for probit regression; the two are basically competing alternatives that differ only
in their link function
• However, one application in which probit regression is widely used is for estimating
the toxicity of chemical substances to living organisms
• The typical scenario is that one administers different dosage levels of the toxin to a random sample of different organisms and monitors a binary response of death vs. no death, i.e. yi = 1 if the ith organism died and yi = 0 if it did not
• We will explore this by means of an example after first considering two important
concepts from toxicology
• Two important quantities used to measure the toxicity of a chemical substance are
the LD50 and LC50
• The ‘LD’ in LD50 stands for ‘Lethal Dose’, while the ‘LC’ in LC50 stands for
‘Lethal Concentration’
• LD50 refers to the dose level of an administered chemical substance that would be
lethal for 50% of organisms in a population, in other words, that would have a 50%
probability of killing any randomly selected individual
• LC50 has essentially the same meaning but is used for chemical concentration level
in the air rather than a dose of a substance that is administered orally or intra-
venously
• Of course, since β0 and β1 are unknown parameters, we estimate them with their Maximum Likelihood Estimators β̂0 and β̂1; thus our estimated LD50 or LC50 is calculated as

LD̂50 (or LĈ50) = −β̂0/β̂1
• The data is too large to display in full but a few observations are shown for illustra-
tion in Table 3.10
Table 3.11: Parameter Estimates from Probit Regression Model Fit to Picloram Data
• The p-value of the Dose coefficient (β1 ) is < .0001 so we can see that there is a
statistically significant relationship between Dose and the response variable (Death
vs. No Death).
• A more intuitive way to interpret the model output is to calculate the estimated
LD50 :
LD̂50 = −β̂0/β̂1 = −(−2.1011)/1.5577 = 1.349

• Thus we estimate that the Picloram dosage level that will kill about half of larkspur plants is 1.349 kg/ha
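The LD50 calculation and its logic can be checked in a few lines of Python; at the estimated LD50 the predicted death probability should be exactly 0.5, since the linear predictor is zero there:

```python
import math

b0, b1 = -2.1011, 1.5577   # probit estimates from Table 3.11

ld50 = -b0 / b1
print(round(ld50, 3))       # 1.349 (kg/ha)

# sanity check: at the LD50 dose the linear predictor is 0,
# so the predicted death probability is Phi(0) = 0.5
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
assert abs(Phi(b0 + b1 * ld50) - 0.5) < 1e-9
```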
• We can again make predictions using the p̂i values from a probit regression model
and a threshold value τ, and then calculate sensitivity and specificity and construct
an ROC Curve.
• In our picloram example, there are only four distinct dosage levels in the data, so
we will only have four distinct p̂ values (Table 3.12)
Table 3.12: Predicted Probabilities of Death of Larkspur Plants by Picloram Herbicide Dosage, Using Probit Regression
• Under H0, the Wald statistic χ² = β̂j²/Var(β̂j) ∼ χ² with 1 degree of freedom
• Var(β̂j) is the jth diagonal element of the matrix (X⊤VX)−1, where:
◦ X is an n × (k + 1) matrix containing all 1’s in the first column, x11, x21, . . . , xn1 in the second column, x12, x22, . . . , xn2 in the third column, and so on until x1k, x2k, . . . , xnk in the (k + 1)th column
◦ V is a diagonal matrix whose diagonal elements are Var(y1|η1), Var(y2|η2), . . . , Var(yn|ηn)
◦ Where the yi|ηi are binary variables (following a Bernoulli distribution with probability parameter pi), Var(yi|ηi) = pi(1 − pi) (the variance of a Bernoulli-distributed random variable with probability parameter pi)
◦ Our estimate of V is V̂, where we replace each pi with p̂i = 1/(1 + e−(β̂0+β̂1xi1+β̂2xi2+···+β̂k xik)):

V̂ = diag(p̂1(1 − p̂1), p̂2(1 − p̂2), . . . , p̂n(1 − p̂n))

◦ The test statistic is thus computed as

χ² = β̂j² / [(X⊤V̂X)−1]jj
• In this module you will not be asked to calculate this test statistic by hand, but it is
helpful for you to know where it comes from
3.3 Poisson Regression
Poisson Regression Model
• Recall that the Poisson distribution can be used to model probabilities of how many
events occur in a particular interval of time (or a particular space), provided that
they occur with a fixed average rate λ
• If y is a Poisson-distributed random variable with rate parameter λ, that is, if y ∼ Poisson(λ), then Pr(y = k) = e−λλk/k! for k = 0, 1, 2, 3, . . .
• y might be, for example, the number of taxis that arrive at a taxi rank in a certain
period of time, or the number of customers that arrive in a queue in a certain period
of time, or the number of mutations that occur in a certain amount of genetic code
• If we are interested in analysing the relationship between such a ‘count’ response
variable and one or more explanatory variables, we can use a Poisson regression
model, which is a type of GLM
• Our mean in this case is µi = E(yi) = λi, and our link function is g(λ) = log λ (where log is the natural logarithm)
• Thus g(µi) = ηi in this case yields the model equation log λi = ηi = β0 + β1xi1 + β2xi2 + · · · + βk xik
• By taking the inverse of the link function we can also express the model in terms of λi:

log λ = η
λ = exp{η}

• This also shows the reason for our choice of link function: it ensures that λ > 0 (which is a requirement of the Poisson distribution)
• As with other GLMs, the Method of Maximum Likelihood is used to estimate the
parameters
• Table 3.13 provides data on the number of elephant poaching incidents per year
in the Central Luangwa Valley, Zambia, from 1988 to 1995, together with two
explanatory variables, which are law enforcement expenditure in thousands of US
dollars per km2 (x1 ) and number of bonus claims paid to anti-poaching scouts in
thousands (x2 )
• A Poisson regression model is fit to the data and SAS output is displayed in Table
3.14
• Thus the fitted regression equation is
Table 3.13: Elephant Poaching Incidents in Central Luangwa Valley Per Year, 1988-1995
Table 3.14: Parameter Estimates from Poisson Regression Model Fit to Elephant Poach-
ing Data
• We can see from the p-values in the last column that β2 is statistically significant at
5% level while β1 is not
• We can also make predictions from this model of the expected number of elephants poached under different values of the explanatory variables. For instance, supposing that in a certain year the law enforcement expenditure is $50 USD per km2 and that 3000 bonus claims are paid out to scouts, what is the expected number of elephants killed? (Note: be careful of units!)
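The mechanics of such a prediction are the unit conversion followed by the exp inverse link. Since the fitted estimates in Table 3.14 are not reproduced here, the coefficients below are illustrative placeholders only, not the model's actual values:

```python
import math

# Hypothetical Poisson regression coefficients (placeholders, NOT the
# fitted estimates from Table 3.14, which are not shown in these notes)
b0, b1, b2 = 5.0, -2.0, -0.5

# Careful with units: x1 is in thousands of USD per km^2, x2 in thousands
x1 = 50 / 1000    # $50 per km^2  -> 0.05
x2 = 3000 / 1000  # 3000 claims   -> 3

lam_hat = math.exp(b0 + b1 * x1 + b2 * x2)  # inverse of the log link
print(round(lam_hat, 2))  # about 29.96 with these placeholder values
```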
• We have seen that the interpretation of the β̂ j parameter estimates is different for
different GLMs (linear regression, logistic regression, probit regression, Poisson
regression)
• This can be confusing, so it helps to have an intuitive method for deriving the inter-
pretation instead of just memorising it
• In all of these models, the interpretation of the gradient estimates concerns what
happens when a particular explanatory variable x j increases by one unit
• For simplicity let us assume that there is only one explanatory variable x1
• We can derive the interpretation by comparing the estimate of the expected response µ̂i when the explanatory variable value is xi1 to the estimate of the expected response µ̂i* when the explanatory variable value is xi1 + 1
◦ Linear regression case:

ŷi = β̂0 + β̂1xi1
ŷi* = β̂0 + β̂1(xi1 + 1)
ŷi* − ŷi = [β̂0 + β̂1(xi1 + 1)] − [β̂0 + β̂1xi1] = β̂1

Thus we can interpret β̂1 in the linear regression model as the change in ŷi, i.e. the estimated change in the expected response, when xi1 increases by one unit
◦ Logistic regression case:

log [p̂i/(1 − p̂i)] = β̂0 + β̂1xi1
log [p̂i*/(1 − p̂i*)] = β̂0 + β̂1(xi1 + 1)
log [p̂i*/(1 − p̂i*)] − log [p̂i/(1 − p̂i)] = [β̂0 + β̂1(xi1 + 1)] − [β̂0 + β̂1xi1] = β̂1

Thus β̂1 can be interpreted as the change in the log-odds of the response when xi1 increases by 1. Or:
p̂i/(1 − p̂i) = exp{β̂0 + β̂1xi1}
p̂i*/(1 − p̂i*) = exp{β̂0 + β̂1(xi1 + 1)}
[p̂i*/(1 − p̂i*)] / [p̂i/(1 − p̂i)] = exp{β̂0 + β̂1(xi1 + 1)} / exp{β̂0 + β̂1xi1} = exp{β̂1}

Thus exp{β̂1} is the factor by which the odds of the response change when xi1 increases by 1
◦ Applying the same comparison to the Poisson regression case, β̂1 is the change in the log of the expected response (log λ̂i) when xi1 increases by 1; equivalently, exp{β̂1} is the factor by which the expected response λ̂i changes when xi1 increases by 1
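The logistic-case algebra can be confirmed numerically: whatever the starting x, a one-unit increase multiplies the odds by exp{β̂1}. A small Python check with arbitrary illustrative coefficients:

```python
import math

b0, b1 = -1.0, 0.7   # arbitrary illustrative coefficients
x = 2.0

def odds(xval):
    # fitted probability, then convert to odds p / (1 - p)
    p = 1 / (1 + math.exp(-(b0 + b1 * xval)))
    return p / (1 - p)

ratio = odds(x + 1) / odds(x)
# the odds ratio for a one-unit increase equals exp(b1), whatever x is
assert abs(ratio - math.exp(b1)) < 1e-9
print(round(ratio, 4))  # 2.0138
```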
• For the Wald test in Poisson regression, our estimate of V is V̂, where we replace each λi with λ̂i = exp{xi⊤β̂} (since the variance of a Poisson-distributed response equals its mean λi):

V̂ = diag(λ̂1, λ̂2, . . . , λ̂n)
• Our test statistic, with formula as follows, still follows a χ² distribution with 1 degree of freedom under the null hypothesis βj = 0:

χ² = β̂j² / [(X⊤V̂X)−1]jj
• Under the null hypothesis that our model is correctly specified, the residual deviance D follows a χ² distribution with n − (k + 1) degrees of freedom, where k is the number of explanatory variables in our model
• Thus if D is larger than the critical value χ2α,n−k−1 , we would reject the null hypoth-
esis and conclude that our model does not fit the data well
• You will not be expected to calculate D by hand but should be able to interpret it
when encountered in SAS output
• A rule of thumb in Poisson regression is that the residual deviance D should be
roughly equal to its degrees of freedom n−(k+1). If D is much larger than n−(k+1)
(suggesting a lack of fit), this may indicate the presence of overdispersion
• One option we have when our Poisson regression model is overdispersed is to in-
stead use Negative Binomial Regression
• What this means is that the variance of a negative binomial distributed random variable, λ + θλ², is larger than the variance of a Poisson distributed random variable, λ, by the amount θλ²; if θ = 0, the negative binomial distribution reduces to a Poisson distribution
• Thus we can use the negative binomial distribution to model count data which is
overdispersed
• Note that negative binomial regression uses the same link function as Poisson re-
gression, so the model equation is the same and so is the interpretation of the pa-
rameters
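The mean–variance relationship described above is easy to tabulate; a one-function Python sketch (function name ours):

```python
# Negative binomial variance lambda + theta*lambda^2 vs Poisson variance lambda
def nb_variance(lam, theta):
    return lam + theta * lam**2

lam = 4.0
print(nb_variance(lam, 0.0))   # 4.0  -> theta = 0 recovers the Poisson variance
print(nb_variance(lam, 0.5))   # 12.0 -> overdispersed: variance exceeds the mean
```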
4 Introduction to Monte Carlo Simulation
Estimating Quantities through Simulation
• The Monte Carlo method is a technique used to approximate quantities (e.g., inte-
grals) that are difficult or impossible to work out analytically
• In our context, we could use Monte Carlo method to estimate certain probabilities
or expected values from Markov Chains, for instance
• Let us in general consider two types of quantities we may want to estimate using Monte Carlo simulations: the probability of an event A, and the expected value of a random variable X
• The first type reduces to the second if we define an indicator variable X that equals 1 when A occurs and 0 otherwise; thus:

Pr(X = 1) = Pr(A) = p and Pr(X = 0) = Pr(Ac) = 1 − p

so that E(X) = Pr(A) = p
Pseudorandom Number Generation
• How do we generate a ‘random’ value from the distribution of T |X0 = i (or any
other probability distribution)?
• Computers do this using pseudorandom number generator (PRNG) algorithms; the quality of these algorithms has improved greatly over the past few decades in terms of the statistical properties of the numbers they produce
• The basic probability distribution from which PRNGs generate is the continuous
uniform distribution U(0, 1), which produces real-numbered values uniformly dis-
tributed within the interval [0, 1]
• Algorithms can then be produced to transform the U(0, 1) numbers to another prob-
ability distribution
• For instance, consider a Markov Chain with state space S = {0, 1, 2} where Xn = 0 and the transition probabilities for row 0 are as follows:

P0j = Pr(Xn+1 = j|Xn = 0) = 0.2 if j = 0, 0.3 if j = 1, 0.5 if j = 2

• If we wanted to generate a random value from this transition probability distribution (to determine where the process will be at step n + 1), we can generate a U(0, 1) random value un+1 and then apply the following rule: set Xn+1 = 0 if un+1 < 0.2, Xn+1 = 1 if 0.2 ≤ un+1 < 0.5, and Xn+1 = 2 if un+1 ≥ 0.5
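This inverse-transform rule for the transition probabilities above can be sketched in a few lines of Python (function name ours); simulating many draws confirms the target probabilities:

```python
import random

def next_state_from_zero(u):
    # Inverse-transform rule for row 0 of the transition matrix:
    # P(0->0)=0.2, P(0->1)=0.3, P(0->2)=0.5
    if u < 0.2:
        return 0
    elif u < 0.5:
        return 1
    else:
        return 2

random.seed(1)
counts = [0, 0, 0]
for _ in range(100000):
    counts[next_state_from_zero(random.random())] += 1
print([c / 100000 for c in counts])  # close to [0.2, 0.3, 0.5]
```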
• Programmers are often so eager to come up with Monte Carlo estimates that they
forget about the problem of how precise their estimates are
• Ideally we would want to calculate the standard error of our Monte Carlo estimates
and we can then also produce approximate confidence intervals for them using the
Central Limit Theorem
• Consider our problem of estimating the expected value of a random variable X; our Monte Carlo estimator is a sample mean, µ̂MC = x̄ = (1/s) Σ xi (summing over i = 1, . . . , s). Recall from Statistics 1B the sampling distribution of a sample mean: E(X̄) = µ (the true distribution mean) and Var(X̄) = σ²/s, where σ² is the true distribution variance and s is the number of simulation trials
• If σ² is unknown, as will often be the case, we can estimate it with

σ̂² = (1/(s − 1)) Σ (xi − x̄)², summing over i = 1, . . . , s
• Thus the approximate standard error of our Monte Carlo expected value estimator is ŜE(µ̂MC) = σ̂/√s
• Applying the Central Limit Theorem, provided that our number of simulation trials s is large, an approximate (1 − α)100% confidence interval for our unknown expected value µ = E(X) is given by

x̄ ± zα/2 σ̂/√s
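Putting the estimator, standard error, and confidence interval together, here is a stdlib Python sketch estimating E(X) for X ~ U(0, 1) (true value 0.5):

```python
import math
import random

random.seed(42)
s = 10000
xs = [random.random() for _ in range(s)]   # simulation trials

xbar = sum(xs) / s                         # Monte Carlo point estimate
sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (s - 1))
halfwidth = 1.96 * sigma_hat / math.sqrt(s)  # 95% CI halfwidth

print(round(xbar, 3), "+/-", round(halfwidth, 4))
assert abs(xbar - 0.5) < 0.02   # estimate should be near the true mean
```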
• Consider our problem of estimating the probability of event A; our estimator is p̂MC = (1/s) Σ Xi, where Σ Xi is binomially distributed with success probability p and number of trials s. Since the variance of a binomial random variable with s trials and probability of success p is sp(1 − p), we have

Var(p̂MC) = Var((1/s) Σ Xi) = (1/s²) Var(Σ Xi) = (1/s²) · sp(1 − p) = p(1 − p)/s
• Suppose you are estimating the expected value of a random variable X using s
Monte Carlo simulation trials and you obtain a point estimate of µ̂MC = 2.52 and
a sample standard deviation of σ̂ = 0.43. Give the halfwidth of a 95% confidence
interval for µ if the number of simulation trials was 100, 1000, 10000, 100000, or
1000000
• (Note: the halfwidth of the confidence interval is the part after the ±)
Table 4.1: Halfwidth of 95% Confidence Interval for µ for Different Numbers of Simulation Trials

s          zα/2 σ̂/√s
100        0.08428
1000       0.02665
10000      0.008428
100000     0.002665
1000000    0.0008428
• Suppose you are estimating the probability of an event A using s Monte Carlo
simulation trials and you obtain a point estimate of p̂MC = 0.4. Give the halfwidth
of a 95% confidence interval for p if the number of simulation trials was 100, 1000,
10000, 100000, or 1000000
Table 4.2: Halfwidth of 95% Confidence Interval for p for Different Numbers of Simula-
tion Trials

          s     zα/2 √(p̂MC(1 − p̂MC)/s)
        100     0.09602
       1000     0.03036
      10000     0.009602
     100000     0.003036
    1000000     0.0009602
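The entries of Table 4.2 can be verified the same way, using the binomial-based halfwidth zα/2 √(p̂MC(1 − p̂MC)/s):

```python
import math

# Given values from the exercise: z_{0.025} = 1.96 and p_hat = 0.4
z, p_hat = 1.96, 0.4
hw_p = {s: z * math.sqrt(p_hat * (1 - p_hat) / s)
        for s in (100, 1_000, 10_000, 100_000, 1_000_000)}
```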
Table 4.3: Standard Normal Lower Cumulative Probabilities Pr (Z < z)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
Probability Shown is Area under Std. Normal Probability Density Function from −∞ to z
References
Allen, L. J. S. (2008), An introduction to stochastic epidemic models, in F. Brauer,
P. van den Driessche and J. Wu, eds, ‘Mathematical Epidemiology’, Springer, Berlin,
pp. 81–130.
Fox, J. (2016), Applied Regression Analysis & Generalized Linear Models, 3rd edn, Sage,
Thousand Oaks.
Myers, R. (2010), Generalized Linear Models with Applications in Engineering and the
Sciences, 2nd edn, Wiley, Hoboken, NJ.