
Faculty of Applied Sciences

Department of Mathematics & Physics

Biostatistics Lecture Notes

Author: Thomas Farrar

2021

Contents

1 Markov Chains with Biological Applications
  1.1 Definition and Types of Stochastic Process
  1.2 Definition and Properties of a Discrete-Time Markov Chain
  1.3 Long Run Behaviour of a Markov Chain
  1.4 Absorbing States and First Step Analysis
  1.5 Biological Applications of Markov Chains

2 Stochastic Event Processes
  2.1 Poisson Process
  2.2 Birth Process
  2.3 Death Process
  2.4 Birth-and-Death Process

3 Generalised Linear Models with Biological Applications
  3.1 Binary Logistic Regression
  3.2 Probit Regression
  3.3 Poisson Regression

4 Introduction to Monte Carlo Simulation

Textbooks

• The following sources were used in preparing these notes: Pinsky and Karlin (2011),
Allen (2008), Myers (2010), Fox (2016), Givens and Hoeting (2005)

What you will be expected to already know

1. Statistics 1A (descriptive statistics, graphical methods for displaying data, probability concepts and rules, important probability distributions [e.g., binomial, Poisson, normal])

2. Statistics 1B (point and interval estimates, hypothesis testing concepts)

3. Statistics 2A (linear regression, logistic regression)

4. Mathematics (matrices, integral calculus)

5. Programming 2A (basic programming concepts and abilities)

1 Markov Chains with Biological Applications


1.1 Definition and Types of Stochastic Process
Definition of Stochastic Process

• A stochastic process is a set of random variables Xt or X(t), where t is an index parameter (usually describing time)

◦ E.g. we flip a coin repeatedly and record 1 if we get Heads and 0 if we get Tails
◦ Stochastic is just another word for random

• The range of possible values for the random variables is known as the state space
(similar to sample space)

• A variable is called discrete if it can only take on a finite or countably infinite number of distinct values

• A variable is called continuous if it can take on any real-numbered value, at least within a specified interval

• There are four main classes of stochastic processes

Four Types of Stochastic Process

Table 1.1: Four Types of Stochastic Process

                             Random Variable
                  Discrete                       Continuous
Time  Discrete    Xt (e.g., Markov Chain)        Xt (e.g., time series)
      Continuous  X(t) (e.g., Poisson Process)   X(t) (e.g., Brownian Motion)

• In this module we will be focusing on stochastic processes in which the random variable is discrete

Review of Probability Results: Complement

• Let A be an event.

• The event that A does not occur is called the complement of A and is written mathematically as Aᶜ (see Figure 1.1)

Figure 1.1: Illustration of Event Complements

Review of Probability Results: Union

• Let A and B be two events.

• The event that at least one of A or B occurs is called the union of A and B and is
written mathematically as A ∪ B (see Figure 1.2)

Figure 1.2: Illustration of Union of Events

Review of Probability Results: Intersection

• The event that both A and B occur is called the intersection of A and B and is
written mathematically as A ∩ B (see Figure 1.3)

Figure 1.3: Illustration of Intersection of Events

Review of Probability Results: Complement Rule

• Complement Rule of Probability:

Pr (Aᶜ) = 1 − Pr (A)

• This is simply because either event A will happen or it will not happen; there are no other possible outcomes

• It is a rearrangement of the statement

Pr (A) + Pr (Aᶜ) = 1


Review of Probability Results: Additive Rule

• Additive Rule of Probability:

Pr (A ∪ B) = Pr (A) + Pr (B) − Pr (A ∩ B)

• If Pr (A ∩ B) = 0 then A and B are said to be mutually exclusive or disjoint (see Figure 1.4)

• In this case, the Additive Rule reduces to

Pr (A ∪ B) = Pr (A) + Pr (B)

Figure 1.4: Illustration of Disjoint Events

Review of Probability Results: Multiplicative Rule

• Events A and B are independent if the occurrence of event A is unrelated to the probability of B (and vice versa)

• Multiplicative Rule for Independent Events:

Pr (A ∩ B) = Pr (A) × Pr (B)

• Multiplicative Rule for Dependent Events:

Pr (A ∩ B) = Pr (A) × Pr (B|A)

• Or,
Pr (A ∩ B) = Pr (B) × Pr (A|B)

Conditional Probability and Bayes’ Theorem

• The two ways of expressing the multiplicative rule for dependent events can be rearranged as conditional probability formulas:

Pr (B|A) = Pr (A ∩ B) / Pr (A)

Pr (A|B) = Pr (A ∩ B) / Pr (B)

• The two ways of expressing the multiplicative rule for dependent events can also be set equal to each other to give Bayes’ Theorem:

Pr (A) Pr (B|A) = Pr (B) Pr (A|B)

Pr (B|A) = Pr (B) Pr (A|B) / Pr (A)   (Bayes’ Theorem)

Review of Probability Results: Probability Mass Functions

• If X is a discrete random variable and S is the set of possible values that X can take on (with nonzero probability), then the function f (x), defined over the set of all integers, is called the probability mass function of X if and only if

Pr (X = x) = f (x) if x ∈ S, and Pr (X = x) = 0 otherwise

• f (x) must satisfy the following properties:

(a) 0 ≤ f (x) ≤ 1 for all x ∈ S

(b) ∑_{x∈S} f (x) = 1

• The expected value (theoretical mean) of X is defined as

E (X) = ∑_{x∈S} x f (x)

• Example: A random variable X has the probability mass function

f (x) = k/x for x = 2, 4, 8

• Determine the value of k so that f (x) meets the requirements of a probability mass function

∑_{x∈S} f (x) = 1
∑_{x∈{2,4,8}} k/x = 1
k/2 + k/4 + k/8 = 1
7k/8 = 1
k = 8/7

• Determine Pr (X = 4) and E (X)

Pr (X = 4) = f (4) = 8/(7(4)) = 2/7

E (X) = ∑_{x∈S} x f (x) = ∑_{x∈{2,4,8}} x · 8/(7x) = ∑_{x∈{2,4,8}} 8/7 = 3(8)/7 = 24/7
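The arithmetic above is easy to check numerically. A minimal Python sketch using exact fractions from the standard library (the helper name `pmf` is ours, not part of the notes):

```python
from fractions import Fraction

# Solve for k: the probabilities k/2 + k/4 + k/8 must sum to 1
support = [2, 4, 8]
k = 1 / sum(Fraction(1, x) for x in support)   # k = 8/7

def pmf(x):
    """Probability mass function f(x) = k/x on the support {2, 4, 8}."""
    return k / x if x in support else Fraction(0)

print(k)                                  # 8/7
print(pmf(4))                             # Pr(X = 4) = 2/7
print(sum(x * pmf(x) for x in support))   # E(X) = 24/7
```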

6
Law of Total Probability

• The Law of Total Probability states that, for event A and for disjoint (mutually exclusive) events B1, B2, . . . , Bb that partition the whole sample space (i.e., ∑_{k=1}^{b} Pr (Bk) = 1), we have

Pr (A) = ∑_{k=1}^{b} Pr (A ∩ Bk) = ∑_{k=1}^{b} Pr (A|Bk) Pr (Bk)

• This law is visualised in Figure 1.5

Figure 1.5: Illustration of Law of Total Probability

• The law of total probability also gives us another way of expressing Bayes’ Theorem:

Pr (B1|A) = Pr (B1) Pr (A|B1) / Pr (A)
          = Pr (B1) Pr (A|B1) / ∑_{k=1}^{b} Pr (A|Bk) Pr (Bk)

Law of Total Probability: Example

• Suppose that you have three bags that each contain 10 balls. Bag 1 has 3 blue balls and 7 green balls. Bag 2 has 5 blue balls and 5 green balls. Bag 3 has 8 blue balls and 2 green balls. You choose a bag at random and then choose a ball from this bag at random. There is a 1/3 chance that you choose Bag 1, a 1/2 chance that you choose Bag 2, and a 1/6 chance that you choose Bag 3. What is the probability that the chosen ball is blue?

Let A be the event that the chosen ball is blue

Let Bk be the event that the kth bag is chosen

Pr (A) = ∑_{k=1}^{3} Pr (A ∩ Bk) = ∑_{k=1}^{3} Pr (A|Bk) Pr (Bk)
       = Pr (A|B1) Pr (B1) + Pr (A|B2) Pr (B2) + Pr (A|B3) Pr (B3)
       = (3/10)(1/3) + (5/10)(1/2) + (8/10)(1/6)
       = 29/60
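The law of total probability is also the recipe for simulating this experiment: first draw a bag, then draw a ball. A sketch that computes the exact answer and checks it by simulation (variable names are ours):

```python
import random
from fractions import Fraction

# Exact answer via the law of total probability
bag_probs = [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]          # Pr(B_k)
blue_given_bag = [Fraction(3, 10), Fraction(5, 10), Fraction(8, 10)]  # Pr(A|B_k)
p_blue = sum(p * q for p, q in zip(blue_given_bag, bag_probs))
print(p_blue)          # 29/60

# Simulation: pick a bag, then a ball, many times
random.seed(1)
trials = 100_000
hits = 0
for _ in range(trials):
    bag = random.choices([0, 1, 2], weights=[1/3, 1/2, 1/6])[0]
    hits += random.random() < float(blue_given_bag[bag])
print(hits / trials)   # close to 29/60 ≈ 0.4833
```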

1.2 Definition and Properties of a Discrete-Time Markov Chain


Markov Chain: Definition

• A Markov process {Xt} is a stochastic process with the property that, given the value of Xt, the values of Xs for s > t are not influenced by the values of Xu for u < t.

• In words: given the present state of the process, the future state of the process does
not depend on the past states of the process

• Or: Markov processes have a short memory!

• Examples of real-world processes that seem to behave according to the Markov property:

◦ Your sequence of fortunes after successive bets in a casino


◦ The sequence of genes in a line of descent

• A discrete-time Markov Chain is a Markov process whose time index is discrete

• The state space of a Markov Chain, denoted S, is the set of all possible values that
the process random variable can take on

◦ In many applications it is common to have the state space S = {0, 1, 2, . . . , N} where N is the maximum state

• A sequence of random variables (X0, X1, X2, . . . , Xn, Xn+1) is a Markov chain if and only if it satisfies the Markov property, expressed mathematically as follows:

Pr (Xn+1 = j|X0 = i0, . . . , Xn−1 = in−1, Xn = i) = Pr (Xn+1 = j|Xn = i)

for all time points n and all states i0, . . . , i, j

• If Xn = i we say that at time n the process is in state i

Transition Probabilities

• The probability of Xn+1 being in state j given that Xn is in state i is called the one-step transition probability and is denoted by Pij^(n,n+1). That is:

Pij^(n,n+1) = Pr (Xn+1 = j|Xn = i)

• In general, a transition probability depends not only on the initial state (i) and final state (j) but also on the current time (n)

• If the one-step transition probabilities are independent of the current time n (i.e. are the same regardless of the value of n), we refer to the Markov Chain as homogeneous and denote the transition probability from state i to state j by Pij

• In this module, all Markov Chains will be assumed to be homogeneous unless oth-
erwise stated

• It is customary to express transition probabilities in a state diagram or a matrix

State Diagrams

• A state diagram is a graphical way of representing a Markov Chain (see example in Figure 1.6)

Figure 1.6: Markov Chain State Diagram Example

Transition Probability Matrices

• Transition probabilities are stored in a matrix to make calculations easier

• If the state space of a Markov chain is S = {0, 1, 2} then the transition probability matrix would look like this:

        0    1    2
    0 [ P00  P01  P02 ]
P = 1 [ P10  P11  P12 ]
    2 [ P20  P21  P22 ]

• Here, for instance, P10 is the probability of the process being in state 0 given that it
was in state 1 at the previous time step

• In other words, it is the probability that a process in state 1 proceeds next to state 0

• If the state space of a Markov chain is S = {0, 1, 2, . . .} then the transition probability matrix would look like this:

        0    1    2    ···  j    ···
    0 [ P00  P01  P02  ···  P0j  ··· ]
    1 [ P10  P11  P12  ···  P1j  ··· ]
    2 [ P20  P21  P22  ···  P2j  ··· ]
P = ⋮ [  ⋮    ⋮    ⋮    ⋱    ⋮       ]
    i [ Pi0  Pi1  Pi2  ···  Pij  ··· ]
    ⋮ [  ⋮    ⋮    ⋮    ⋱    ⋮       ]

Transition Probability Properties

• Transition probabilities must satisfy the following properties (similar to those for
any probability):

1. Pij ≥ 0 for all i, j ∈ S (all transition probabilities must be non-negative)

2. ∑_{j∈S} Pij = 1 for all i ∈ S (each row of the transition probability matrix must sum to 1)

• The second property implies that a process in state i at time n must be in some state
at time n + 1

Exercise

• A frog can jump between four lily pads, labeled A, B, C and D. If the frog is on lily pad A, it will jump to B with a probability of 3/4 and to C with a probability of 1/4. If the frog is on B, it will jump to A, C, or D each with a probability of 1/3. If the frog is on C it will jump to A with probability 1/2 or stay on C with probability 1/2. If the frog is on D it will jump to B with probability 1. Let Xn denote the frog’s position after the nth jump (Xn = 0 if the frog is at A, Xn = 1 if at B, etc.)

• Draw a state diagram for the Markov Chain {Xn}.

• Draw the transition probability matrix for this Markov Chain.
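Once a transition probability matrix is written down, it is worth checking the two properties above numerically. A Python/NumPy sketch using the frog’s matrix (states A, B, C, D coded as 0, 1, 2, 3; this gives away the second part of the exercise, so attempt it yourself first):

```python
import numpy as np

# Transition probability matrix for the frog (0=A, 1=B, 2=C, 3=D)
P = np.array([
    [0,   3/4, 1/4, 0  ],   # from A
    [1/3, 0,   1/3, 1/3],   # from B
    [1/2, 0,   1/2, 0  ],   # from C
    [0,   1,   0,   0  ],   # from D
])

# Property 1: all entries non-negative; property 2: rows sum to 1
assert (P >= 0).all()
assert np.allclose(P.sum(axis=1), 1.0)
print(P.sum(axis=1))   # [1. 1. 1. 1.]
```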

n-step Transition Probabilities

• The transition probability matrix P gives the one-step transition probabilities

• What if we are interested in the n-step transition probabilities, i.e. the probability that the chain goes from state i to state j in n transitions?

Pij^(n) = Pr (Xm+n = j|Xm = i)

◦ Note: our assumption that the Markov Chain is homogeneous means that m can be any non-negative integer

• The Chapman-Kolmogorov equations give us a means to work out the n-step transition probabilities of any Markov Chain from the one-step transition probabilities

• The Chapman-Kolmogorov equations are stated as follows: for any states i and j in the state space S, and any positive integers m and n,

Pij^(m+n) = ∑_{k∈S} Pik^(m) Pkj^(n)

• Derivation of Chapman-Kolmogorov Equations:

Pij^(m+n) = Pr (Xm+n = j|X0 = i)
          = ∑_{k∈S} Pr (Xm+n = j, Xm = k|X0 = i)   (by law of total probability)
          = ∑_{k∈S} Pr (Xm+n = j|Xm = k, X0 = i) Pr (Xm = k|X0 = i)   (by conditional probability)
          = ∑_{k∈S} Pr (Xm+n = j|Xm = k) Pr (Xm = k|X0 = i)   (by Markov property)
          = ∑_{k∈S} Pik^(m) Pkj^(n)

• By replacing m in the Chapman-Kolmogorov Equations with 1 and replacing n with n − 1, we obtain a recursive formula that can be used to calculate the n-step transition probabilities from the one-step transition probabilities. We would start with n = 2 to get the two-step transition probabilities, then let n = 3 to get the three-step transition probabilities, and so on:

Pij^(n) = ∑_{k∈S} Pik^(1) Pkj^(n−1)

n-step Transition Probability Matrix

• From linear algebra we recognise this as the formula for matrix multiplication

• Thus P^(n) = P × P^(n−1).

• And, by iterating:

P^(n) = P × P × · · · × P (n factors) = Pⁿ

• That is, the n-step transition probability Pij^(n) is the entry in the ith row and jth column of the matrix Pⁿ.

• Note: since P is always a square matrix, it is always possible to multiply it by itself.

Example

• Consider a Markov chain {Xn} with states 0, 1 and 2 which has the following transition probability matrix:

        0    1    2
    0 [ 0.1  0.2  0.7 ]
P = 1 [ 0.2  0.2  0.6 ]
    2 [ 0.6  0.1  0.3 ]

• We want to find Pr (X2 = 1|X0 = 0)

• This involves finding the entry in row 0, column 1 of the matrix P².

     [ 0.1  0.2  0.7 ]   [ 0.1  0.2  0.7 ]   [ 0.47  0.13  0.40 ]
P² = [ 0.2  0.2  0.6 ] × [ 0.2  0.2  0.6 ] = [ 0.42  0.14  0.44 ]
     [ 0.6  0.1  0.3 ]   [ 0.6  0.1  0.3 ]   [ 0.26  0.17  0.57 ]

• Thus Pr (X2 = 1|X0 = 0) = 0.13

• Alternatively, we can use the n-step transition probability formula:

Pij^(2) = ∑_{k=0}^{2} Pik Pkj^(1)

P01^(2) = ∑_{k=0}^{2} P0k Pk1
        = P00 P01 + P01 P11 + P02 P21
        = 0.1 × 0.2 + 0.2 × 0.2 + 0.7 × 0.1
        = 0.13
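Matrix powers like P² above are one line in NumPy; a quick sketch confirming the hand calculation:

```python
import numpy as np

P = np.array([
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.6, 0.1, 0.3],
])

# P^2 holds the two-step transition probabilities
P2 = np.linalg.matrix_power(P, 2)
print(P2[0, 1])   # Pr(X2 = 1 | X0 = 0) ≈ 0.13
```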

Exercise

• A particle moves among the states 0, 1 and 2 according to a Markov chain whose transition probability matrix is:

        0    1    2
    0 [  0   1/2  1/2 ]
P = 1 [ 1/2   0   1/2 ]
    2 [ 1/2  1/2   0  ]

• Let Xn denote the position of the particle at the nth move.

• Calculate Pr (Xn = 0|X0 = 0) for n = 0, 1, 2, 3, 4.

• Hint: it is not necessary to calculate the whole matrix each time, though that would
work.

1.3 Long Run Behaviour of a Markov Chain


Long Run Behaviour of Markov Chains

• Suppose we have a Markov Chain with the following transition probability matrix:

        0     1
P = 0 [ 0.33  0.67 ]
    1 [ 0.75  0.25 ]

• Recall: we can compute the n-step transition probability matrix by computing Pⁿ

• Calculate the first few powers of P by hand. What do you see?

         0       1
P⁸ = 0 [ 0.5286  0.4714 ]
     1 [ 0.5277  0.4723 ]

• It appears that after many steps, the probabilities of being in state j converge to the
same values regardless of the previous state i

• Thus, by computing a high power of P we can find out approximately the long-run
or limiting distribution of the Markov Chain

Exercise

• A sociologist develops a Markov Chain model of social mobility, i.e. the relation-
ship between the economic status of one generation to the next.

• She determines that the transition probability matrix is as follows:

             Lower  Middle  Upper
    Lower  [ 0.40   0.50    0.10 ]
P = Middle [ 0.05   0.70    0.25 ]
    Upper  [ 0.05   0.50    0.45 ]

• Determine what proportion of the population will be in each economic class in the
long run by computing a high power of this matrix

• Hint: you can save time by using the fact that P4 = P2 × P2 and P8 = P4 × P4
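The hint above (repeated squaring) is essentially what `numpy.linalg.matrix_power` does. A sketch that approximates the limiting distribution by computing a high power of P (try the hand calculation first; the exact values are derived later in these notes):

```python
import numpy as np

P = np.array([
    [0.40, 0.50, 0.10],   # Lower
    [0.05, 0.70, 0.25],   # Middle
    [0.05, 0.50, 0.45],   # Upper
])

# After many steps, every row converges to the same limiting distribution
P64 = np.linalg.matrix_power(P, 64)
print(P64[0])   # ≈ [0.0769 0.625 0.2981], i.e. (1/13, 5/8, 31/104)
```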

Regular Markov Chains

• A Markov Chain whose limiting distribution exists is called a regular Markov Chain

• It can be proven that the limiting distribution exists as long as Pij^(K²) > 0 for all i and j, where K is the number of states

◦ That is, if we take P^(K²) then all its elements should be positive

◦ If doing it by hand, we can save time by simply putting a + or a 0 for each element (the exact value of a positive element does not matter for this purpose; it only matters whether it is positive or zero)

• The following two conditions together also guarantee that a Markov Chain is regular:

1. For every pair of states i, j it is possible after some number of steps to go from i to j
2. There is at least one state i for which Pii > 0
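The P^(K²) positivity check is easy to automate. A sketch (the helper name `is_regular` is ours), tested on the 3-state chain from the earlier worked example:

```python
import numpy as np

def is_regular(P):
    """Check regularity: all entries of P^(K^2) are strictly positive,
    where K is the number of states."""
    K = P.shape[0]
    return bool((np.linalg.matrix_power(P, K * K) > 0).all())

P = np.array([
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
    [0.6, 0.1, 0.3],
])
print(is_regular(P))           # True
print(is_regular(np.eye(3)))   # False: the identity chain never mixes
```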

Limiting Distribution of a Markov Chain

• The limiting distribution of a regular Markov Chain is a vector of probabilities π⊤ = (π0, π1, . . . , πN) satisfying the following two properties:

1. πj = ∑_{k=0}^{N} πk Pkj for j = 0, 1, . . . , N

2. ∑_{k=0}^{N} πk = 1

• More generally, we could write πj = ∑_{k∈S} πk Pkj for j ∈ S and ∑_{k∈S} πk = 1 (in case the states are not numbered from 0 to N)

• Or, in matrix form:

1. π = P⊤π
2. 1⊤π = 1 where 1 is an (N + 1)-vector of ones
• The first set of equations above follows from the property that the limiting distribution is stationary with respect to time and independent of the initial state

• To see this, consider writing the Chapman-Kolmogorov Equations as follows:

Pij^(n) = ∑_{k∈S} Pik^(n−1) Pkj , i ∈ S

• Now, suppose that Pik^(n−1) = Pik^(n) (stationarity property)

• Further, write Pij^(n) as πj, to reflect that it does not depend on the initial state i

• The equation now becomes

πj = ∑_{k∈S} πk Pkj , j ∈ S

• The limiting distribution is interpreted as the probability that, after a long duration, the process {Xn} will be found in state j, regardless of the value of X0

• It can also be interpreted as the long run average fraction of time that the process {Xn} is in state j

• In practice we can determine the limiting distribution by solving the system of equations given by (1) and (2) above

◦ Note: this can be done by substitution, or using matrix row operations (Gaussian Elimination)

• IMPORTANT NOTE: the system in (1) has one redundant equation because of the restriction ∑_{k∈S} Pik = 1. Thus one of the equations in (1) must be replaced with (2)
Limiting Distribution of a Markov Chain: Example

• Consider the social mobility Markov Chain discussed above

• From the transition probability matrix we can get the following system of equations:

0.40π0 + 0.05π1 + 0.05π2 = π0   (1)
0.50π0 + 0.70π1 + 0.50π2 = π1   (2)
0.10π0 + 0.25π1 + 0.45π2 = π2   (3)
π0 + π1 + π2 = 1                (4)

• Remember: one of the first three equations is redundant; we can choose which one to get rid of (in this case, (3))

• Thus our system is:

0.40π0 + 0.05π1 + 0.05π2 = π0   (1)
0.50π0 + 0.70π1 + 0.50π2 = π1   (2)
π0 + π1 + π2 = 1                (4)

• It now only remains to solve this system. Moving all terms to the left and clearing the decimals:

−60π0 + 5π1 + 5π2 = 0   (1)
5π0 − 3π1 + 5π2 = 0     (2)
π0 + π1 + π2 = 1        (4)

• We eliminate π2 by subtracting (2) − (1) and (1) − 5 × (4):

65π0 − 8π1 = 0
65π0 = 5

• Thus π0 = 5/65 = 1/13, π1 = 65π0/8 = 5/8, and π2 = 1 − π0 − π1 = 31/104
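The same system can be solved numerically: write the stationarity equations as (P⊤ − I)π = 0, replace the redundant equation with the normalisation ∑ πk = 1, and call a linear solver. A sketch:

```python
import numpy as np

P = np.array([
    [0.40, 0.50, 0.10],
    [0.05, 0.70, 0.25],
    [0.05, 0.50, 0.45],
])

# (P^T - I) pi = 0, with one redundant row replaced by 1^T pi = 1
A = P.T - np.eye(3)
A[-1] = 1.0                      # replaces equation (3) with the normalisation
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)                        # ≈ [0.0769 0.625 0.2981] = (1/13, 5/8, 31/104)
```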

1.4 Absorbing States and First Step Analysis


Definition of Absorbing State

• A state i in a Markov Chain is said to be absorbing if Pii = 1 (that is, the probability
on the main diagonal of the transition probability matrix is 1)

• That is, once the process reaches state i, it will remain there forever

• A state that is not absorbing is referred to as a transient state

Absorbing States: Quantities of Interest

• We define R as the set of absorbing states (obviously, R is a subset of the state space, S)

• Thus, if a process has state space S = {0, 1, 2, 3}, and states 0 and 3 are absorbing,
then R = {0, 3}

• Let T = min{n ≥ 0; Xn ∈ R} be the ‘time of absorption’, i.e. the time at which the
process first reaches (and becomes stuck in) an absorbing state

• Two quantities of interest in Markov Chains involving absorbing states are the ex-
pected time of absorption and the probability of absorption to some particular
absorbing state r

◦ Note that the second quantity is interesting only when we have two or more
absorbing states; if there is only one absorbing state then the probability of
being absorbed to this state is 1

• Both of these quantities can be expressed in terms of T , and they are generally
expressed in conditional terms since they depend on the initial value of the process:

vi = E (T |X0 = i)
ui,r = Pr (XT = r|X0 = i)

Expected Time of Absorption: Example

• Consider a Markov Chain {Xn } where Xn describes the condition of a machine part
at time n

• Time increments by one unit each time the machine is used

◦ Xn = 0 if the part is working perfectly after the nth use


◦ Xn = 1 if the part is showing signs of wear (working imperfectly) after the nth
use
◦ Xn = 2 if the part has failed after the nth use

• The transition probability matrix for such a model could be:

        0    1    2
    0 [ 0.9  0.1   0  ]
P = 1 [  0   0.8  0.2 ]
    2 [  0    0    1  ]

• Clearly, state 2 is an absorbing state since P22 = 1

• In this example, expected time of absorption can be interpreted as expected time until failure, i.e. the mean number of times the machine can be used before the part fails

◦ v0 = E (T |X0 = 0) gives the mean number of times the machine can be used
before the part fails given that the part is currently working perfectly
◦ v1 = E (T |X0 = 1) gives the mean number of times the machine can be used
before the part fails given that the part is currently showing signs of wear
◦ v2 = E (T |X0 = 2) gives the mean number of times the machine can be used
before the part fails given that the part has already failed; clearly v2 = 0 by
inspection

Probability of Absorption to State r: Example

• Consider a rectangular-based pyramid with five vertices (0, 1, 2, 3, 4) as shown in Figure 1.7. An ant walks between the vertices along the edges of the pyramid. The rule governing the ant’s movement is as follows. At each time step, the ant randomly chooses (with equal probabilities) one of the vertices adjacent to its current position, and walks to the chosen vertex. Thus, for example, if the ant is currently at vertex 2, it could move to vertex 0, 1, or 4, each with probability 1/3. Vertices 0 and 4 are absorbing states. If the ant arrives at vertex 0 it finds honey and stays there happily ever after. If the ant arrives at vertex 4 it finds poison and dies.

Figure 1.7: Illustration of Ant-on-Pyramid Markov Chain

• Let Xn be the vertex where the ant is located after the nth move. Clearly {Xn } is a
Markov Chain with state space S = {0, 1, 2, 3, 4}
• The transition probability matrix for this Markov Chain would be as follows:

        0    1    2    3    4
    0 [  1    0    0    0    0  ]
    1 [ 1/3   0   1/3  1/3   0  ]
P = 2 [ 1/3  1/3   0    0   1/3 ]
    3 [ 1/3  1/3   0    0   1/3 ]
    4 [  0    0    0    0    1  ]

• What is the probability that the ant reaches the honey rather than dying, given that
it starts in state 1? This is precisely the probability u1,0 = Pr (XT = 0|X0 = 1)

First Step Analysis Technique for Finding Probability of Absorption to State r

• We can use an ingenious technique called First Step Analysis to determine our
quantities of interest for a particular Markov Chain with absorbing states

• This technique uses the law of total probability, which (you will recall) states that, for events A and B1, B2, . . . , Bb such that ∑_{k=1}^{b} Pr (Bk) = 1,

Pr (A) = ∑_{k=1}^{b} Pr (A|Bk) Pr (Bk)

• Let us begin by seeing how first step analysis is used to find the probability of
absorption to state r
• Suppose we have a Markov Chain with state space S = {0, 1, . . . , N} and at least
two absorbing states (e.g., R = {0, N})
• Suppose further that we are interested in finding ui,r = Pr (XT = r|X0 = i) for all
i ∈ S and for one particular absorbing state r ∈ R
• Event ‘A’ in the law of total probability is in this case the event XT = r (i.e., that the
process is eventually absorbed into state r), while events B1 , B2 , . . . , Bb are in this
case the events X1 = 0, X1 = 1, . . . , X1 = N
• The derivation proceeds as follows:

ui,r = Pr (XT = r|X0 = i)
ui,r = ∑_{k∈S} Pr (XT = r|X0 = i, X1 = k) Pr (X1 = k|X0 = i)   (by law of total probability)
ui,r = ∑_{k∈S} Pr (XT = r|X1 = k) Pr (X1 = k|X0 = i)   (by Markov property)

but since T is random, Pr (XT = r|X1 = k) = Pr (XT = r|X0 = k)

Thus ui,r = ∑_{k∈S} uk,r Pik = u0,r Pi0 + u1,r Pi1 + · · · + uN,r PiN

• The Pik values for all i and k are known from the transition probability matrix, so
we are left with N + 1 unknowns, u0,r , u1,r , . . . , uN,r and N + 1 equations
• If we solve this system of equations we will have our answer
• There is one other observation that simplifies our task considerably: ur,r = 1 and ui,r = 0 for all absorbing states other than r (i.e. for all i ∈ R with i ≠ r)

◦ ur,r = Pr (XT = r|X0 = r) = 1: if we begin in state r, we have already been absorbed to this state, so absorption to this state is certain

◦ Similarly, if i is an absorbing state other than r, then ui,r = Pr (XT = r|X0 = i) = 0: if we begin in absorbing state i, we have already been absorbed to this state, so absorption to state r is impossible
• Thus in practice we need to solve the system of equations

ui,r = ∑_{k∈S} uk,r Pik , i ∉ R

after substituting in 1 for ur,r and 0 for all u values of absorbing states other than r

First Step Analysis: Example 1

• Let us apply first step analysis to solve the pyramid ant problem discussed above:
what is the probability that the ant eventually reaches the honey rather than the
poison?

• In this case, the absorbing states are R = {0, 4}, and we want to find u1,0 (the proba-
bility of being absorbed to state 0 given that the ant starts in state 1)

• We know by inspection that u0,0 = 1 and u4,0 = 0, so the only unknowns remaining
are u1,0 , u2,0 , and u3,0

• We need to solve the following system of equations:

u1,0 = u0,0 P10 + u1,0 P11 + u2,0 P12 + u3,0 P13 + u4,0 P14
u2,0 = u0,0 P20 + u1,0 P21 + u2,0 P22 + u3,0 P23 + u4,0 P24
u3,0 = u0,0 P30 + u1,0 P31 + u2,0 P32 + u3,0 P33 + u4,0 P34

• Substituting in what we know (that u0,0 = 1 and u4,0 = 0, plus all the transition probabilities), we proceed to solve the system of three equations and three unknowns as follows:

u1,0 = (1/3)(1) + u1,0 (0) + (1/3)u2,0 + (1/3)u3,0 + (0)(0)
u2,0 = (1/3)(1) + (1/3)u1,0 + u2,0 (0) + u3,0 (0) + (1/3)(0)
u3,0 = (1/3)(1) + (1/3)u1,0 + u2,0 (0) + u3,0 (0) + (1/3)(0)

u1,0 = 1/3 + (1/3)u2,0 + (1/3)u3,0
u2,0 = 1/3 + (1/3)u1,0
u3,0 = 1/3 + (1/3)u1,0

Substituting the last two equations into the first:

u1,0 = 1/3 + (1/3)(1/3 + (1/3)u1,0) + (1/3)(1/3 + (1/3)u1,0)
u1,0 = 1/3 + 1/9 + (1/9)u1,0 + 1/9 + (1/9)u1,0
(7/9)u1,0 = 5/9
u1,0 = 5/7

• Thus, if the ant starts at vertex 1 of the pyramid, the probability that the ant eventually reaches vertex 0 (the honey pot) is 5/7 (which also implies that the probability of the ant eventually reaching the poison instead [vertex 4] is 1 − 5/7 = 2/7)
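First step analysis always ends in a linear system, so it mechanises well. A NumPy sketch for the ant problem: writing Q for the transitions among the transient states 1, 2, 3, the u-equations become (I − Q)u = (column of P giving one-step absorption into state 0). The state ordering is ours:

```python
import numpy as np

# Ant-on-pyramid chain (states 0..4; 0 and 4 absorbing)
P = np.array([
    [1,   0,   0,   0,   0  ],
    [1/3, 0,   1/3, 1/3, 0  ],
    [1/3, 1/3, 0,   0,   1/3],
    [1/3, 1/3, 0,   0,   1/3],
    [0,   0,   0,   0,   1  ],
])

transient = [1, 2, 3]
Q = P[np.ix_(transient, transient)]   # transitions among transient states

# u_{i,0} = sum_k P_ik u_{k,0} with u_{0,0} = 1, u_{4,0} = 0
# rearranges to (I - Q) u = P[transient, 0]
u = np.linalg.solve(np.eye(3) - Q, P[transient, 0])
print(u[0])   # u_{1,0} = 5/7 ≈ 0.714
```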

First Step Analysis Technique for Finding Expected Time of Absorption

• We have seen how to use first step analysis to determine the probability of absorp-
tion to a particular absorbing state r in cases where there are two or more absorbing
states in the Markov Chain

• A very similar first step analysis technique can be used to determine the expected
time of absorption, vi = E (T |X0 = i)

• This time, instead of using the law of total probability, our derivation uses the law
of total expectation

• This law is very similar to the law of total probability, but uses expected value rather
than probability only

• The law of total expectation may be stated as follows:

◦ If we have a random variable Y and we have events B1, B2, . . . , Bb that partition the probability space (i.e. ∑_{k=1}^{b} Pr (Bk) = 1), then:

E (Y) = ∑_{k=1}^{b} E (Y|Bk) Pr (Bk)

• Suppose we have a Markov Chain with state space S = {0, 1, 2, . . . , N} which contains at least one absorbing state: R ≠ ∅

• In this case, T will be the random variable Y in our law of total expectation formula
and X1 = k will be our event Bk

vi = E (T |X0 = i)
vi = ∑_{k∈S} E (T |X1 = k, X0 = i) Pr (X1 = k|X0 = i)   (by law of total expectation)
vi = ∑_{k∈S} E (T |X1 = k) Pr (X1 = k|X0 = i)   (by Markov property)

But E (T |X1 = k) = 1 + E (T |X0 = k), thus

vi = ∑_{k∈S} (1 + vk) Pik = ∑_{k∈S} Pik + ∑_{k∈S} vk Pik
vi = 1 + ∑_{k∈S} vk Pik   (since ∑_{k∈S} Pik = 1, a property of all transition prob. matrices)
vi = 1 + v0 Pi0 + v1 Pi1 + · · · + vN PiN

• We could again set up a system of N + 1 equations and N + 1 unknowns, but once again we can simplify the situation by recognising that vi = 0 for all i ∈ R, that is, for all absorbing states

◦ For example, if state 0 is absorbing, then v0 = E (T |X0 = 0) = 0: if the pro-
cess was already in an absorbing state at time 0, then the expected time of
absorption is 0

• Thus, we need to solve the system of equations

vi = 1 + ∑_{k∈S} vk Pik , i ∉ R

after substituting in 0 for all v values of absorbing states

First Step Analysis: Example 2

• Returning to our machine part example, let us apply first step analysis to determine
the expected number of times a machine part can be used before failure, given that
the machine part is currently working fine

• Recall that this Markov Chain had states 0 (part working perfectly), 1 (part showing
signs of wear), and 2 (part failed), and transition probability matrix as follows:

        0    1    2
    0 [ 0.9  0.1   0  ]
P = 1 [  0   0.8  0.2 ]
    2 [  0    0    1  ]

• Our absorbing state is state 2 and our task is to find v0 = E (T |X0 = 0)

• By inspection we can say that v2 = 0; we then have a system of two equations and
two unknowns, v0 and v1

v0 = 1 + v0 P00 + v1 P01 + v2 P02


v1 = 1 + v0 P10 + v1 P11 + v2 P12
v0 = 1 + v0 (0.9) + v1 (0.1) + (0)(0) = 1 + 0.9v0 + 0.1v1
v1 = 1 + v0 (0) + v1 (0.8) + (0)(0.2) = 1 + 0.8v1
Working with the second equation,
0.2v1 = 1
v1 = 5
v0 = 1 + 0.9v0 + 0.1(5)
0.1v0 = 1.5
v0 = 15

• Thus the expected number of times that the machine can be used before a part that
is working perfectly will fail is 15
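The v-system is linear too: writing Q for the block of transition probabilities among the transient states, the equations vi = 1 + ∑ vk Pik become (I − Q)v = 1. A sketch for the machine part chain:

```python
import numpy as np

# Machine part chain: states 0 (perfect), 1 (worn), 2 (failed, absorbing)
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
])

Q = P[:2, :2]   # transient-to-transient block (states 0 and 1)

# (I - Q) v = 1 gives the expected times to absorption
v = np.linalg.solve(np.eye(2) - Q, np.ones(2))
print(v)        # [15.  5.], i.e. v0 = 15 and v1 = 5
```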

First Step Analysis Exercises

• Consider the Markov Chain {Xn} whose transition probability matrix is given by:

        0    1    2    3
    0 [  1    0    0    0  ]
    1 [ 0.1  0.4  0.1  0.4 ]
P = 2 [ 0.2  0.1  0.6  0.1 ]
    3 [  0    0    0    1  ]

• If the process starts in state 1, determine:

1. The probability that the Markov Chain ends in state 0


2. The mean time to absorption

System of Linear Equations Method for First Step Analysis

• The general equations that can be used for first step analysis of any Markov Chain with states 0, 1, 2, . . . , N are:

ui,r = ∑_{k∈S} uk,r Pik

vi = 1 + ∑_{k∈S} vk Pik

• The equation should be written out for each state i that is NOT an absorbing state

• The uk,r or vk values for the absorbing states can be determined trivially (always 0
or 1) and substituted into the equations

• We then have a system of an equal number of equations and unknowns that can be
solved

Matrix Method for First Step Analysis

• The quantities vi (expected time of absorption given initial state i) and ui,r (proba-
bility of absorption to state r given initial state i) can also be solved using a matrix
approach, described as follows

• Suppose that our Markov Chain consists of a absorbing states and b transient states
(a + b states in total)

• We first need to reorganise the transition probability matrix into the matrix P* consisting of four sub-matrices, as follows:

     [ Q  0_{b×a} ]
P* = [ R  I_a     ]

where

◦ Q is a b × b matrix of transition probabilities between transient states only
◦ R is an a×b matrix of transition probabilities from transient states to absorbing
states
◦ 0b×a is a b × a matrix of zeroes
◦ Ia is the a × a identity matrix

• We further define the fundamental matrix N = (I_b − Q)^{−1} (computing this inverse
by hand can be time-consuming if the matrix is 3 × 3 or larger)

• This matrix allows us to compute v, the vector of expected absorption times per
initial state, as the row sums of N

• Furthermore, U , the matrix whose (i, r)th element is ui,r , the probability of absorp-
tion into state r given initial state i, can be computed as

U = N R>

Matrix Method for First Step Analysis: Examples

• Consider the machine part example above. P is rearranged into P* as follows:

       [ 0.9  0.1  0 ]
P* =   [ 0    0.8  0 ]
       [ 0    0.2  1 ]

• Here,

       [ 0.9  0.1 ]                                  [ 0 ]
Q =    [ 0    0.8 ] ,   R = [ 0  0.2 ] ,   0_{2×1} = [ 0 ] ,   and   I_1 = [ 1 ]

• Thus,

            [ 0.1  −0.1 ]                                [ 10  5 ]
I_2 − Q =   [ 0     0.2 ]   and   N = (I_2 − Q)^{−1} =   [ 0   5 ]

• Summing the rows of N, we obtain v = [ 15, 5 ]^T, which matches our answers v0 = 15
and v1 = 5 obtained earlier

• As for the ant-on-pyramid problem earlier, we can rearrange P to obtain P* as
follows:

       [  0   1/3  1/3  0  0 ]
       [ 1/3   0    0   0  0 ]
P* =   [ 1/3   0    0   0  0 ]
       [ 1/3  1/3  1/3  1  0 ]
       [  0   1/3  1/3  0  1 ]

• Thus,

       [  0   1/3  1/3 ]
Q =    [ 1/3   0    0  ]   and   R = [ 1/3  1/3  1/3 ]
       [ 1/3   0    0  ]             [  0   1/3  1/3 ]

• From this, we obtain

           [   1   −1/3  −1/3 ]
I_3 − Q =  [ −1/3    1     0  ]
           [ −1/3    0     1  ]

and

                             [ 9  3  3 ]
N = (I_3 − Q)^{−1} = (1/7) · [ 3  8  1 ]
                             [ 3  1  8 ]

• Taking N R^T yields

              [ 5  2 ]
U = (1/7) ·   [ 4  3 ]
              [ 4  3 ]

• Thus, u_{1,0} = 5/7, just as we obtained previously
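• The whole matrix method can be carried out numerically. A minimal sketch in Python (NumPy; not part of the original notes) for the ant-on-pyramid example, following the convention used above in which U = N R^T:

```python
import numpy as np

# Transient-to-transient block Q and (a x b) transient-to-absorbing block R,
# exactly as written out above for the ant-on-pyramid example.
Q = np.array([[0,   1/3, 1/3],
              [1/3, 0,   0  ],
              [1/3, 0,   0  ]])
R = np.array([[1/3, 1/3, 1/3],
              [0,   1/3, 1/3]])

N = np.linalg.inv(np.eye(3) - Q)  # fundamental matrix (I3 - Q)^(-1)
v = N.sum(axis=1)                 # expected absorption time per transient state
U = N @ R.T                       # absorption probabilities, U = N R^T

print(np.round(v, 4))  # first entry is 15/7
print(np.round(U, 4))  # top-left entry is u_{1,0} = 5/7
```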

1.5 Biological Applications of Markov Chains


Markov Chain Models

• Why have we been learning about Markov Chains?

• There are many systems in science that can be described by them

• We will now look at examples: the Ehrenfest model for diffusion across a mem-
brane, genetics models, epidemiological models, and ecological models

1.5.1 The Ehrenfest Urn Model for Diffusion across a Membrane


Ehrenfest Urn Model

• A classical mathematical description of diffusion across a membrane (e.g. cellular


membrane)

• Imagine two containers (urns) containing a total of 2a balls (which represent molecules).

• Suppose the first container, labelled A, holds k balls while the second container, B,
holds the remaining 2a − k balls

• A ball is selected at random and moved to the other container (as in the diagram in
Figure 1.8)

◦ Note: all 2a balls have the same probability of selection

Figure 1.8: Illustration of Ehrenfest Urn Process

• Each selection generates a transition of the process

• What will be the long-term behaviour of such a system? A Markov Chain model
can help us to answer this question

Ehrenfest Urn Model: State Space

• Let Yn be the number of balls in urn A after n time steps, and let

Xn = Yn − a

• Xn represents the excess balls in urn A over and above half the total (thus Xn = 0
means there is an equal number of balls in each urn, while a positive value of Xn
means there are more balls in urn A than in urn B, and a negative value of Xn means
there are fewer balls in urn A than in urn B)

• Then {Xn } is a Markov chain with the state space

S = {−a, −a + 1, . . . , −1, 0, +1, . . . , a − 1, a}

Ehrenfest Urn Model: Transition Probabilities
• Consider an Ehrenfest Urn Model with a = 3, meaning that there are a total of
2a = 6 balls in the system
• Suppose that Xn = 0, meaning there is an equal number of balls in each urn, i.e. 3
balls in urn A and 3 balls in urn B
◦ When one of the six balls is chosen and moved at random, there is a 3/6 = 1/2
chance that it will be a ball from urn A (moved to urn B) and a 3/6 = 1/2 chance
that it will be a ball from urn B (moved to urn A)
◦ Thus, Xn+1 can either be −1 (if a ball is moved from urn A to urn B) or 1 (if a
ball is moved from urn B to urn A), each with probability 1/2
• Suppose that Xn = 1, which means one-more-than-half of the balls are in urn A (4
balls in this case)
◦ When one of the six balls is chosen and moved at random, there is a 4/6 = 2/3
chance that it will be a ball from urn A (moved to urn B) and a 2/6 = 1/3 chance
that it will be a ball from urn B (moved to urn A)
◦ Thus, Xn+1 can either be 0 (if a ball is moved from urn A to urn B, which
happens with probability 2/3) or 2 (if a ball is moved from urn B to urn A,
which happens with probability 1/3)
• Do you see the pattern? In general, the probability that the number of balls in urn
A increases by 1 (Xn+1 = Xn + 1) is proportional to the current number of balls
in urn B, while the probability that the number of balls in urn A decreases by 1
(Xn+1 = Xn − 1) is proportional to the current number of balls in urn A
• Thus the transition probability matrix when there are 2a = 6 balls will be:

           −3    −2    −1     0     1     2     3
     −3 [   0     1     0     0     0     0     0  ]
     −2 [  1/6    0    5/6    0     0     0     0  ]
     −1 [   0    2/6    0    4/6    0     0     0  ]
P =   0 [   0     0    3/6    0    3/6    0     0  ]
      1 [   0     0     0    4/6    0    2/6    0  ]
      2 [   0     0     0     0    5/6    0    1/6 ]
      3 [   0     0     0     0     0     1     0  ]

• Question: does this Markov Chain have any absorbing states?


• In general, when there are 2a balls in the system, the transition probabilities can be
computed using the following formula:
Pij = Pr(Xn+1 = j | Xn = i) =

    (a + i)/(2a)    if j = i − 1
    (a − i)/(2a)    if j = i + 1
    0               otherwise
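• The general formula is easy to turn into code. A minimal sketch in Python (NumPy; not part of the original notes) that builds the Ehrenfest transition matrix for any a:

```python
import numpy as np

def ehrenfest_P(a):
    # Transition matrix for the Ehrenfest urn model with 2a balls.
    # Row/column index k corresponds to state i = k - a (i runs from -a to a).
    n = 2 * a + 1
    P = np.zeros((n, n))
    for k in range(n):
        i = k - a
        if k > 0:
            P[k, k - 1] = (a + i) / (2 * a)  # a ball moves out of urn A
        if k < n - 1:
            P[k, k + 1] = (a - i) / (2 * a)  # a ball moves into urn A
    return P

P = ehrenfest_P(3)       # the 2a = 6 example above
print(P.sum(axis=1))     # every row sums to 1: a valid transition matrix
print(P[3, 2], P[3, 4])  # from state 0: probability 1/2 down, 1/2 up
```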

Ehrenfest Urn Model: Exercise

1. Work out the transition probability matrix P for an Ehrenfest Urn Model with 4
balls in the system

2. Determine, for this model, Pr (X2 = 0|X0 = 2).

3. Determine and interpret the limiting distribution of this Markov Chain

1.5.2 Markov Chain Genetics Models


Random Reproduction Model

• The Wright-Fisher Genetics Model is a Markov Chain model developed by S.


Wright and R.A. Fisher to investigate the fluctuation of gene frequency under the
influence of mutation and selection

• The simplest version of this model is the Random Reproduction Model, which dis-
regards mutation pressures and selective forces

• We assume there is a fixed population size of 2N genes composed of type a and


type A individuals (alleles)

• Let Xn be the number of individuals of allele type a in the population at the nth
generation (which implies that there are 2N − Xn individuals of allele type A)

• We assume that the makeup of the (n + 1)th generation is determined by a binomial


experiment consisting of 2N independent trials

• For each trial (birth), the probability that this new individual is of type a is equal to
the proportion of the parent generation that is of type a, that is,
pi = i/(2N)

(where we assume that Xn = i is the number of type a individuals in generation n)

• The probability that a new individual is of type A is equal to the proportion of the
parent generation that is of type A, that is,

qi = 1 − pi = (2N − i)/(2N) = 1 − i/(2N)

• This implies that {Xn} is a Markov Chain with state space S = {0, 1, 2, . . . , 2N}
whose transition probabilities follow a binomial distribution:

Pij = Pr(Xn+1 = j | Xn = i) = C(2N, j) pi^j (1 − pi)^{2N−j}   for i, j = 0, 1, 2, . . . , 2N

(here C(2N, j) denotes the binomial coefficient ‘2N choose j’)

• You will notice that the only parameter in this model is N (half the population
size); thus, once we know the population size we can fully specify the transition
probability matrix and analyse the long-term behaviour of the process

Random Reproduction Model: Example

• Consider the Wright-Fisher Random Reproduction Model in the case where the
population size is 4 (N = 2)

• Using the binomial transition probability formula above, we can easily determine
that the transition probability matrix is as follows:

          0        1        2        3        4
    0  [  1        0        0        0        0      ]
    1  [  0.3164   0.4219   0.2109   0.0469   0.0039 ]
P = 2  [  0.0625   0.25     0.375    0.25     0.0625 ]
    3  [  0.0039   0.0469   0.2109   0.4219   0.3164 ]
    4  [  0        0        0        0        1      ]
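• The entries above can be reproduced directly from the binomial formula. A minimal sketch in Python (NumPy plus `math.comb` from the standard library, Python 3.8+; not part of the original notes):

```python
import numpy as np
from math import comb

def wright_fisher_P(N):
    # Transition matrix of the Wright-Fisher random reproduction model for a
    # population of 2N genes; state i = number of type-a alleles.
    M = 2 * N
    P = np.zeros((M + 1, M + 1))
    for i in range(M + 1):
        p = i / M  # probability that each of the 2N births is of type a
        for j in range(M + 1):
            P[i, j] = comb(M, j) * p**j * (1 - p)**(M - j)
    return P

P = wright_fisher_P(2)  # population size 4, as in the example above
print(np.round(P, 4))   # matches the matrix shown above
```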

Random Reproduction Model: Long-Term Behaviour

• Notice that, in general,

P00 = C(2N, 0) · 0^0 · 1^{2N} = 1   and, for any 0 < j ≤ 2N,

P0j = C(2N, j) · 0^j · 1^{2N−j} = 0

• Thus, regardless of population size 2N, state 0 is always an absorbing state in this
Markov Chain. Similarly,

P2N,2N = C(2N, 2N) · 1^{2N} · 0^0 = 1   and, for any 0 ≤ j < 2N,

P2N,j = C(2N, j) · 1^j · 0^{2N−j} = 0
j

• Thus, regardless of population size 2N, state 2N is always an absorbing state in this
Markov Chain

• The practical interpretation is simply this: once the entire population consists of
allele type a genes, allele type A becomes extinct, disappearing forever (and vice
versa, if the entire population consists of A genes)

• The biological interpretation of this phenomenon is that the gene pool will tend to
become less diverse over time as some alleles become extinct

• This is certain to happen in the long run; the only uncertainties in the model are

◦ Which of the two alleles will survive and which will become extinct?
◦ How many generations will it take for one of the alleles to go extinct (i.e. for
the chain to reach an absorbing state)?

• These two questions can be answered probabilistically using first step analysis

Exercise

• For the Random Reproduction Model with N = 2, determine the probability that
allele a eventually goes extinct

• For the Random Reproduction Model with N = 2, determine the expected number
of generations until one of the alleles goes extinct

• Give the transition probability matrix for the Random Reproduction Model with
N=3

Random Reproduction Model with Mutation

• A more realistic genetics model must take into account mutations. One way of do-
ing this in our Random Reproduction Model is to assume that, prior to the formation
of the new generation, each gene has the possibility to mutate, i.e., to change into a
gene of the other type

• Specifically, we assume that for each gene the mutation a→A occurs with proba-
bility α, and A→a occurs with probability β

• Otherwise our assumptions are the same: we have a fixed population size of 2N,
and the birth of each new generation is a binomial experiment with 2N independent
trials

• It is the ‘success probabilities’ in our binomial experiment that will change. If we


assume that Xn = i (the number of genes of allele a in generation n is i, after any
mutations), the probability of an individual in generation n + 1 being of allele a is
no longer pi = i/(2N) but can be derived using basic probability principles
• Let Ba be the event that a gene is born as allele a and let Ma→A be the event that a
gene undergoes mutation a→A

• Let BA be the event that a gene is born as allele A and let MA→a be the event that a
gene undergoes mutation A→a

• Let Ga be the event that a gene is of type a when it reproduces (i.e. after a possible
mutation)

• There are two ways that event Ga could happen: either the gene is born as type a
and does not mutate, or the gene is born as type A and does mutate. These two
situations are mutually exclusive; thus,

pi = Pr(Ga) = Pr(Ba ∩ (Ma→A)^c) + Pr(BA ∩ MA→a)
   = Pr(Ba) Pr((Ma→A)^c | Ba) + Pr(BA) Pr(MA→a | BA)
   = (i/(2N))(1 − α) + (1 − i/(2N))β

• Therefore, the transition probabilities for this Markov Chain are binomially dis-
tributed with number of trials 2N and success probabilities pi as given above, i.e.,

Pij = Pr(Xn+1 = j | Xn = i) = C(2N, j) pi^j (1 − pi)^{2N−j} ,  where  pi = (i/(2N))(1 − α) + (1 − i/(2N))β

Random Reproduction Model with Mutation: Long-Term Behaviour

• Provided that both mutation probabilities are nonzero (i.e. αβ > 0), this Markov
Chain has no absorbing states

• Instead, it has a limiting distribution, which you can calculate using the methods
learnt in this chapter
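• One way to approximate the limiting distribution is to raise P to a high power, since every row of P^n converges to it. A minimal sketch in Python (NumPy; the values α = 0.2 and β = 0.1 are illustrative, not from the notes):

```python
import numpy as np
from math import comb

def wf_mutation_P(N, alpha, beta):
    # Wright-Fisher transition matrix with mutation a->A (prob alpha)
    # and A->a (prob beta).
    M = 2 * N
    P = np.zeros((M + 1, M + 1))
    for i in range(M + 1):
        p = (i / M) * (1 - alpha) + (1 - i / M) * beta
        for j in range(M + 1):
            P[i, j] = comb(M, j) * p**j * (1 - p)**(M - j)
    return P

P = wf_mutation_P(2, alpha=0.2, beta=0.1)
Pn = np.linalg.matrix_power(P, 200)  # all rows of P^200 are (nearly) identical
print(np.round(Pn[0], 4))            # approximate limiting distribution
```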

Random Reproduction Model with Mutation: Exercise

• Consider the Random Reproduction Model for a population of size 4 (N = 2)

• It is given that the mutation probabilities are α = 1/10 and β = 1/20

• Give the transition probability matrix for this Markov Chain

• Determine and interpret the limiting distribution of this Markov Chain

Other Genetics Models

• Two other Markov Chain genetics models not covered in these lecture notes are:

◦ An extension of the Random Reproduction Model to include selective forces


◦ The Moran Model, which conceptualises the formation of a new generation in
a different way than the Wright-Fisher Model does

1.5.3 Epidemiological Markov Chain Models


Compartmental Epidemiological Models

• Compartmental epidemiological models are used to model the spread of an epi-


demic within a population by classifying individuals within the population into cer-
tain categories (compartments) based on disease status

• Two classical and simple compartmental epidemiological models are the SIS Model
and the SIR Model

The SIS Model

• In the SIS model, individuals in the population are classified as either ‘susceptible’
(S), meaning that they are not currently infected but could become infected, and
‘infected’ (I), meaning that they are currently infected as well as infectious (can
infect others)

• Within the SIS epidemic model, it is assumed that a susceptible individual can
become infected through contact with an infected individual; it is further assumed
that infected individuals eventually recover and return to the susceptible class

Figure 1.9: Compartmental Diagram for SIS Model

• There is no disease-related death, and there is no vertical transmission (all individ-


uals are susceptible, rather than infected, at birth)

• The SIS Model has been applied to sexually transmitted infections (STIs)

• The compartmental diagram in Figure 1.9 shows how individuals move between the
S and I classes; it is not to be confused with a state diagram, since S and I do not
represent states in a Markov Chain, but disease status of an individual

The SIR Model

• In the SIR model, besides the ‘susceptible’ (S) and ‘infected’ (I) classes, there is a
third class called ‘removed’ (R), sometimes referred to as ‘immune’, which refers
to individuals who were previously infected and are now neither infected nor sus-
ceptible

• Thus, unlike in the SIS Model, infected individuals do not, after recovering, return
to being susceptible; they become immune to the disease

• The SIR Model has been applied to ‘once-off’ childhood diseases such as chicken
pox, measles, and mumps

Figure 1.10: Compartmental Diagram for SIR Model

• The compartmental diagram in Figure 1.10 shows how individuals move between
the S, I, and R classes; again, it is not to be confused with a state diagram

Deterministic vs. Stochastic Epidemiological Models

• In your Biomathematics module, you have learned or will learn about deterministic
versions of the SIS and SIR Models

◦ This includes discrete-time (difference equations) and continuous-time (ordi-
nary differential equations) versions of the models

• In mathematics, ‘deterministic’ means non-random, in other words, the opposite of


stochastic

• In a deterministic model, once we set the initial value(s) and the values of all pa-
rameters, the model tells us exactly what will happen, i.e. the number of susceptible
and the number of infected individuals at any point in time

• In this module, we are studying only stochastic models, not deterministic models.
With a stochastic model, once we set the initial value(s) and the values of all pa-
rameters, the model still cannot tell us exactly what will happen. It allows us to
calculate probabilities and expectations about what will happen.

Markov Chain SIS Model

• Suppose we have a constant population size of N individuals; we assume there are


no births and no deaths

• Let In and S n = N − In denote the number of infected individuals and susceptible


individuals respectively at time n.

• Since Sn = N − In, Sn is completely determined once In is known. Thus we actually
need to focus on only one sequence of random variables, In

• {In } is a Markov Chain with state space {0, 1, 2, . . . , N}

• To reduce the number of transitions and simplify the model, we assume that the
time step size (i.e. the amount of time that passes between time n and time n + 1) is
small enough that the number of infected individuals changes at most by one during
this interval

• The transition probability formula for this model can then be stated as follows:

Pij = Pr(In+1 = j | In = i) =

    β(i/N)((N − i)/N) = βi(N − i)/N²     if j = i + 1
    γi/N                                  if j = i − 1
    1 − [βi(N − i)/N² + γi/N]            if j = i
    0                                     otherwise

• Here, i(N − i) in the Pi,i+1 transition probability reflects that the probability of a new
infection occurring is proportional to the number of potential interactions between
the current number of infected individuals i and the current number of susceptible
individuals N − i

• β is a constant parameter representing how infectious the disease is

• The i in the Pi,i−1 transition probability indicates that the probability of an infected
person recovering and returning to susceptibility is proportional to the current num-
ber of infected individuals i

• γ is a constant parameter representing how rapidly infected individuals tend to re-


cover

• Dividing by N twice in the Pi,i+1 transition probability and once in the Pi,i−1 transi-
tion probability is done just so that the i and N − i terms are expressed as a fraction
of the total population N

◦ We could have formulated the model without dividing by N, in which case


these N constants would have been absorbed into β and γ, giving them a dif-
ferent interpretation

Markov Chain SIS Model: Example

• Consider the case N = 5; the transition probability matrix will be as follows:


        0      1                  2                  3                  4                  5
  0  [  1      0                  0                  0                  0                  0     ]
  1  [ γ/5   1 − 4β/25 − γ/5    4β/25              0                  0                  0     ]
  2  [  0     2γ/5              1 − 6β/25 − 2γ/5   6β/25              0                  0     ]
P = 3 [ 0      0                 3γ/5              1 − 6β/25 − 3γ/5   6β/25              0     ]
  4  [  0      0                  0                 4γ/5              1 − 4β/25 − 4γ/5   4β/25 ]
  5  [  0      0                  0                  0                  γ                1 − γ  ]
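• The SIS transition matrix for any N, β, γ can be generated from the formula above. A minimal sketch in Python (NumPy; β = 0.5 and γ = 0.2 are illustrative values, not from the notes):

```python
import numpy as np

def sis_P(N, beta, gamma):
    # Transition matrix of the Markov Chain SIS model; state i = number infected.
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        up = beta * i * (N - i) / N**2  # probability of one new infection
        down = gamma * i / N            # probability of one recovery
        if i < N:
            P[i, i + 1] = up
        if i > 0:
            P[i, i - 1] = down
        P[i, i] = 1 - up - down
    return P

P = sis_P(5, beta=0.5, gamma=0.2)
print(np.allclose(P.sum(axis=1), 1))  # True: valid transition matrix
print(P[1, 2])                        # beta*1*4/25 = 0.08
```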

Markov Chain SIS Model: Restrictions on Parameters

• A basic restriction is that β > 0, γ > 0, and N > 0 (N must also be an integer);
however, the parameters must also be chosen in such a way that Pi j satisfies the
properties of transition probabilities
• We can see from the transition probability formula above that the property
Σ_{j∈S} Pij = 1 is satisfied by definition:

Σ_{j∈S} Pij = 0 + Pi,i−1 + Pii + Pi,i+1
            = γi/N + βi(N − i)/N² + 1 − βi(N − i)/N² − γi/N
            = 1

• However, we must also ensure that the property Pi j ≥ 0 is satisfied, and this depends
on the values of the parameters

• In particular, we must ensure that βi(N − i)/N² + γi/N ≤ 1 so that Pii ≥ 0

• Let g(i) = βi(N − i)/N² + γi/N; let us use calculus to find the maximum value of
g(i); we will call this g(i*) where i* is the value of i that maximises g(i)

• We will then set g(i*) ≤ 1 to ensure that Pii ≥ 0

g(i) = (β + γ)i/N − βi²/N²

g′(i) = (β + γ)/N − 2βi/N² = 0
2βi/N² = (β + γ)/N
i* = ((β + γ)/N)(N²/(2β)) = N(β + γ)/(2β)

• Note that, by the second derivative test, g(i*) is a maximum:

g″(i) = −2β/N² < 0
• Notice that if γ > β then i* > N, which is impossible since N is the maximum state
◦ Thus if γ > β we set i* = N as the maximum, since N is the highest possible
value of i and the function is still increasing at that point
• Notice further that i* has to be an integer; thus if N(β + γ)/(2β) is not an integer the
true maximum will occur at ⌈N(β + γ)/(2β)⌉ or ⌊N(β + γ)/(2β)⌋
◦ We will ignore this consideration here but it will mean the restrictions can be
relaxed, especially when N is small
• We now have two cases to consider

◦ If γ ≤ β:

g(i*) = (β + γ)i*/N − β(i*)²/N²
      = ((β + γ)/N)(N(β + γ)/(2β)) − (β/N²)[N(β + γ)/(2β)]²
      = (β + γ)²/(2β) − (β + γ)²/(4β)
      = 2(β + γ)²/(4β) − (β + γ)²/(4β)
      = (β + γ)²/(4β)

◦ If γ > β:

g(i*) = g(N) = (β + γ)N/N − βN²/N²
      = β + γ − β
      = γ

• Thus, in order to ensure that Pii ≥ 0, the restriction on parameters must be that
(β + γ)²/(4β) ≤ 1 (if γ ≤ β) or that γ ≤ 1 (if γ > β)

• The first case (γ ≤ β) can be simplified thus:

(β + γ)² ≤ 4β
β² + 2βγ + γ² − 4β ≤ 0
β² + 2(γ − 2)β + γ² ≤ 0

Suppose we choose some fixed γ > 0. The roots of the inequation are:

β = [−2(γ − 2) ± √(4(γ − 2)² − 4(1)(γ²))]/2
  = −(γ − 2) ± √(4(γ² − 4γ + 4) − 4γ²)/2
  = −(γ − 2) ± √(−16γ + 16)/2
  = 2 − γ ± 2√(1 − γ)

• This indicates that 0 < γ ≤ 1 is a restriction, otherwise there will be no valid β


values

• In the γ ≤ β case we can thus state the restrictions as follows:

◦ 0 < γ ≤ 1 and

2 − γ − 2√(1 − γ) ≤ β ≤ 2 − γ + 2√(1 − γ)

◦ But since γ ≤ β, and since 2 − γ − 2√(1 − γ) ≤ γ for all 0 < γ ≤ 1 (see if you
can prove this), this reduces to:

γ ≤ β ≤ 2 − γ + 2√(1 − γ)

• In the γ ≤ β case, the restrictions are: 0 < γ ≤ 1 and γ ≤ β ≤ 2 − γ + 2√(1 − γ)

• In the γ > β case, the restrictions are: 0 < γ ≤ 1 and 0 < β < γ

• In general, the restrictions on γ and β (which, notice, do not depend on N) can be
expressed as follows:

0 < γ ≤ 1
0 < β ≤ 2 − γ + 2√(1 − γ)
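• The boundary case of this restriction can be checked numerically. A minimal sketch in Python (NumPy; N = 50 and γ = 0.5 are illustrative values): at β = 2 − γ + 2√(1 − γ), the maximum of g(i) over the integer states should sit just below 1.

```python
import numpy as np

# Check the gamma <= beta restriction at the boundary value of beta.
N, gamma = 50, 0.5
beta = 2 - gamma + 2 * np.sqrt(1 - gamma)  # boundary of the restriction

i = np.arange(N + 1)
g = beta * i * (N - i) / N**2 + gamma * i / N  # Pii = 1 - g(i)
print(round(g.max(), 4))  # just below 1 (exactly 1 only if i* is an integer)
```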

Further Analysis of the Transition Probabilities
• We may also observe that it will be important to the behaviour of the Markov Chain
which of Pi,i+1 = βi(N − i)/N 2 and Pi,i−1 = γi/N is greater (which determines
whether the number of infected individuals will tend to increase or decrease)
• Let us consider the circumstances under which Pi,i+1 > Pi,i−1:

βi(N − i)/N² − γi/N > 0
(β − γ)i/N − βi²/N² > 0
i[(β − γ)/N − βi/N²] > 0

The roots (zeroes) of the inequation are i = 0 and i = (β − γ)N/β

• Since i is never negative, Pi,i+1 > Pi,i−1 holds for 0 < i < (β − γ)N/β, and such
states exist provided that (β − γ)N/β > 0, which will be true provided that β > γ
• Thus, as long as the intensity rate of infection exceeds the intensity rate of recovery,
the epidemic will tend to spread
Long-Term Behaviour
• It is evident that, when i = 0, Pii = 1 while Pi,i+1 = Pi,i−1 = 0, which means this is
an absorbing state
• It is further evident that, when i = N, Pi,i+1 = 0, Pi,i−1 = γ and Pii = 1 − γ; thus this
is not an absorbing state
• The only absorbing state is state 0, which represents 0 infections (the epidemic has
been eradicated)
• First step analysis can be used to determine the expected amount of time until the
epidemic is eradicated
• We could also ‘pretend’ that any other individual state is an absorbing state and use
first step analysis to find the probability that the epidemic reaches that state (i.e. the
number of infected individuals reaches that number) before it is eradicated, for given
initial conditions
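• The expected time to eradication can be computed with the fundamental matrix method from earlier in this chapter. A minimal sketch in Python (NumPy; N = 5 with illustrative values β = 0.5, γ = 0.4, not from the notes):

```python
import numpy as np

# Expected number of time steps until eradication in the SIS model.
# State 0 is the only absorbing state, so Q is the block for states 1..N.
N, beta, gamma = 5, 0.5, 0.4  # illustrative parameter values

P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    up, down = beta * i * (N - i) / N**2, gamma * i / N
    if i < N:
        P[i, i + 1] = up
    if i > 0:
        P[i, i - 1] = down
    P[i, i] = 1 - up - down

Q = P[1:, 1:]                                 # transient states 1..N
v = np.linalg.inv(np.eye(N) - Q).sum(axis=1)  # expected absorption times
print(np.round(v, 2))  # v[0] = expected steps to eradication from 1 infected
```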
Markov Chain SIR Model
• The Markov Chain SIR Model is more complicated in that it includes three random
variables:

◦ Sn, the number of susceptible individuals at time n
◦ In, the number of infected individuals at time n
◦ Rn, the number of removed (immune) individuals at time n

• Since Sn + In + Rn = N, one of the three random variables can be expressed in terms
of the other two. For instance, Rn = N − Sn − In is fully determined once Sn and In
are known; the process is fully specified by the two random variables Sn and In
• Still, we have two random variables and so our state space is two-dimensional: we
are interested in the joint probability
Pr (S n = s ∩ In = i)

• Our transition probabilities can be expressed as


P si,k j = Pr (S n+1 = k ∩ In+1 = j|S n = s ∩ In = i)

• As before, we assume that our time step size is small enough that at most one change
in state can occur in one time step
• This means that at each time step, either there is no change, or there is one new in-
fection (S n decreases by 1 and In increases by 1) or there is one new removal/recovery
(In decreases by 1).
• The transition probabilities can therefore be expressed as follows:

P_{si,kj} = Pr(Sn+1 = k ∩ In+1 = j | Sn = s ∩ In = i) =

    β(i/N)(s/N) = βis/N²     if (k, j) = (s − 1, i + 1)   (new infection)
    γi/N                      if (k, j) = (s, i − 1)       (new recovery)
    1 − βis/N² − γi/N        if (k, j) = (s, i)           (no change)
    0                         otherwise

• The state space in this case is {0, 1, 2, . . . , N} for both S n and In with the restriction
that S n + In ≤ N
• It is not very convenient to represent the transition probabilities in a two-dimensional
matrix structure, but it can be done; for instance we make the first row (0, 0), the
second row (1, 0), the third row (2, 0), and so on up to the (N + 1)th row (N, 0); then
the (N + 2)th row is (0, 1), followed by (1, 1), and so on up to (N − 1, 1) [since the state
(N, 1) does not exist], then (0, 2), (1, 2), . . ., (N − 2, 2). The pattern continues
until the final row, (0, N)
• In general, the number of rows (and columns) of the transition probability matrix,
that is, the total number of distinct states in the Markov Chain, will be

C(N + 2, 2) = (N + 1)(N + 2)/2

• As with the SIS Markov Chain Model, the parameters β and γ represent the infec-
tion rate and recovery rate respectively
• We will not derive restrictions on the parameters γ and β in the notes, but it is clear
that γ > 0 and β > 0; also we require that g(i, s) = βis/N² + γi/N ≤ 1 to ensure
that P_{si,si} ≥ 0

Markov Chain SIR Model: Example

• Consider a Markov Chain SIR Model where the population size is N = 3

• The transition probability matrix is as follows:


         (0,0)  (1,0)  (2,0)  (3,0)  (0,1)   (1,1)        (2,1)         (0,2)    (1,2)          (0,3)
(0,0)  [   1      0      0      0      0       0            0             0        0              0    ]
(1,0)  [   0      1      0      0      0       0            0             0        0              0    ]
(2,0)  [   0      0      1      0      0       0            0             0        0              0    ]
(3,0)  [   0      0      0      1      0       0            0             0        0              0    ]
(0,1)  [  γ/3     0      0      0    1−γ/3     0            0             0        0              0    ]
(1,1)  [   0     γ/3     0      0      0     1−β/9−γ/3      0            β/9       0              0    ]
(2,1)  [   0      0     γ/3     0      0       0          1−2β/9−γ/3      0       2β/9            0    ]
(0,2)  [   0      0      0      0    2γ/3      0            0           1−2γ/3     0              0    ]
(1,2)  [   0      0      0      0      0     2γ/3           0             0      1−2β/9−2γ/3     2β/9  ]
(0,3)  [   0      0      0      0      0       0            0             γ        0             1−γ   ]

(the matrix above is P, with rows and columns labelled by the states (s, i))
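• Because the two-dimensional state space makes hand construction error-prone, it helps to build the matrix programmatically. A minimal sketch in Python (NumPy; β = 0.6 and γ = 0.3 are illustrative values), using the same row ordering as above:

```python
import numpy as np

def sir_states_and_P(N, beta, gamma):
    # States (s, i) with s + i <= N, ordered (0,0), (1,0), ..., (N,0), (0,1), ...
    states = [(s, i) for i in range(N + 1) for s in range(N + 1 - i)]
    idx = {st: k for k, st in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    for (s, i), k in idx.items():
        infect = beta * i * s / N**2  # one new infection
        recover = gamma * i / N       # one new recovery
        if infect > 0:
            P[k, idx[(s - 1, i + 1)]] = infect
        if recover > 0:
            P[k, idx[(s, i - 1)]] = recover
        P[k, k] = 1 - infect - recover
    return states, P

states, P = sir_states_and_P(3, beta=0.6, gamma=0.3)
print(len(states))                    # (N+1)(N+2)/2 = 10 states
print(np.allclose(P.sum(axis=1), 1))  # True: valid transition matrix
```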

Long-Term Behaviour

• All of the states where the second element is 0 are absorbing since this represents
a disease-free state (no infected individuals; all individuals are either susceptible or
removed)

• By determining the absorption probabilities for these various absorbing states, we
can determine the probability distribution of the number of people in the population
who got the disease before it was eradicated

• For example, if the system with N = 100 is absorbed to the state (1, 0), this means
RT = 100 − 1 − 0 = 99 people contracted the disease and recovered, while S T = 1
person never got the disease. By contrast, if the system is absorbed to the state
(90, 0), this means RT = 100−90 = 10 people contracted the disease and recovered,
while S T = 90 people never got the disease

• It could also be of interest to determine the expected absorption time, i.e. how many
time steps it takes before the epidemic stops

1.5.4 Ecological Markov Chain Models


A Predator-Prey Markov Chain Model

• The Lotka-Volterra Model, involving a system of ordinary differential equations,


is a famous deterministic model for the population size of two interacting species,
one of which preys on the other (e.g., rabbits and foxes, lions and zebras, whale and
krill, etc.)

• The model has the following form:

dx/dt = a10 x − a12 xy
dy/dt = a21 xy − a20 y

where a10, a12, a21, a20 > 0

• Here, x(t) is the population size of the prey species while y(t) is the population size
of the predator species
• In this section we consider a simple stochastic version of this model of predator-
prey dynamics
• Suppose that Xn is the population of the prey species at time n and Yn is the popula-
tion of the predator species at time n
• As in the SIR Markov Chain Model, we have to jointly consider the two random
variables (Xn , Yn ); the transition probabilities of interest are
Pr (Xn+1 = i ∩ Yn+1 = j|Xn = x ∩ Yn = y)

• In order to keep our transition probability matrix dimensions finite, we assume that
each species has some maximum population (carrying capacity): NX for the prey
species and NY for the predator species
◦ Thus population growth is according to a logistic curve rather than exponential
• The state space for Xn is therefore {0, 1, 2, . . . , NX } and the state space for Yn is
{0, 1, 2, . . . , NY }
◦ This time, the two random variables are not jointly constrained, so the number
of states will be
(NX + 1)(NY + 1)
• As in our epidemiological models, we make the simplifying assumption that the
time step size is only large enough to allow a change of one unit in the predator or
prey population size
• The transition probabilities are defined as follows:

P_{xy,ij} = Pr(Xn+1 = i ∩ Yn+1 = j | Xn = x ∩ Yn = y) =

    a10 · x(NX − x)/NX²                           if (i, j) = (x + 1, y)   (prey population increases by 1)

    a12 · xy/(NX NY)                              if (i, j) = (x − 1, y)   (prey population decreases by 1)

    a21 · xy(NY − y)/(NX NY²)                     if (i, j) = (x, y + 1)   (predator population increases by 1)

    a20 · y/NY                                    if (i, j) = (x, y − 1)   (predator population decreases by 1)

    1 − a10 · x(NX − x)/NX² − a12 · xy/(NX NY)
        − a21 · xy(NY − y)/(NX NY²) − a20 · y/NY   if (i, j) = (x, y)       (no change)

    0                                             otherwise

Restrictions on Parameters
• We require that a10, a12, a21, a20 > 0; further restrictions are required to ensure that
P_{xy,xy} ≥ 0

Long-Term Behaviour

• It is clear that (0, 0) is an absorbing state representing extinction of both species

• It is also clear that (NX , 0) is an absorbing state representing extinction of predator


species and survival of prey species with fixed population NX

◦ If the predator species goes extinct first, the prey species can no longer de-
crease and will survive and inevitably reach its carrying capacity NX
◦ If the prey species goes extinct first, the predator species can no longer in-
crease and will also inevitably go extinct

Predator-Prey Markov Chain Model: Example

• Consider a very simple case where the carrying capacity of the prey species is NX =
2 and the carrying capacity of the predator species is also NY = 2

• We can see that even in this case the dimensions of the transition probability matrix
will be 9 × 9
          (0,0)   (1,0)     (2,0)   (0,1)     (1,1)   (2,1)   (0,2)   (1,2)    (2,2)
(0,0)  [    1       0         0       0         0       0       0       0        0     ]
(1,0)  [    0    1−a10/4    a10/4     0         0       0       0       0        0     ]
(2,0)  [    0       0         1       0         0       0       0       0        0     ]
(0,1)  [  a20/2     0         0    1−a20/2      0       0       0       0        0     ]
(1,1)  [    0     a20/2       0     a12/4       *     a10/4     0     a21/8      0     ]
(2,1)  [    0       0       a20/2     0       a12/2     **      0       0      a21/4   ]
(0,2)  [    0       0         0      a20        0       0     1−a20     0        0     ]
(1,2)  [    0       0         0       0        a20      0     a12/2    ***     a10/4   ]
(2,2)  [    0       0         0       0         0      a20      0      a12   1−a20−a12 ]

where the diagonal entries abbreviated above are
*   = 1 − a10/4 − a12/4 − a21/8 − a20/2
**  = 1 − a12/2 − a21/4 − a20/2
*** = 1 − a10/4 − a12/2 − a20
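• Chains of this size are often easiest to explore by simulation. A minimal sketch in Python (NumPy) that simulates one sample path of the predator-prey chain; all parameter values below are illustrative, not from the notes:

```python
import numpy as np

def simulate_predator_prey(x0, y0, NX, NY, a10, a12, a21, a20, steps, rng):
    # Simulate one sample path of the predator-prey Markov Chain by drawing
    # one of the five possible moves at each time step.
    x, y = x0, y0
    path = [(x, y)]
    for _ in range(steps):
        p_up_x = a10 * x * (NX - x) / NX**2             # prey +1
        p_dn_x = a12 * x * y / (NX * NY)                # prey -1
        p_up_y = a21 * x * y * (NY - y) / (NX * NY**2)  # predator +1
        p_dn_y = a20 * y / NY                           # predator -1
        u = rng.random()
        if u < p_up_x:
            x += 1
        elif u < p_up_x + p_dn_x:
            x -= 1
        elif u < p_up_x + p_dn_x + p_up_y:
            y += 1
        elif u < p_up_x + p_dn_x + p_up_y + p_dn_y:
            y -= 1
        path.append((x, y))
    return path

rng = np.random.default_rng(1)
path = simulate_predator_prey(1, 1, 10, 10, 0.4, 0.3, 0.3, 0.2, 500, rng)
print(path[-1])  # final (prey, predator) state
```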

Other Ecological Models

• Similar Markov Chain models could be formulated for competing species or for
symbiotic species (species that help each other to survive)

2 Stochastic Event Processes
Stochastic Event Processes

• In this section we will be looking at stochastic processes that are used to model
countable events occurring in continuous time

• In particular, we will consider Poisson processes, birth processes, death processes,


and birth-and-death processes

2.1 Poisson Process


The Poisson Distribution

• The Poisson Distribution is a discrete probability distribution which is used to


model ‘rare’ events that occur randomly but with a fixed average rate per unit of
time (or space), λ > 0

◦ Note that although the Poisson-distributed random variable Y is discrete, λ


does not have to be an integer

• The probability mass function of the Poisson Distribution is as follows:

Pr(Y = k) = e^{−λ} λ^k / k!   for k = 0, 1, 2, 3, . . .

• The mean and variance of the Poisson Distribution are E (Y) = λ and Var (Y) = λ

• Here are two other interesting results about the Poisson Distribution that you may
not have been aware of:

◦ Let Y1 be a random variable following a Poisson distribution with rate λ1 and
let Y2 be a random variable following a Poisson distribution with rate λ2. If
Y1 and Y2 are independent then the sum Y1 + Y2 follows a Poisson distribution
with rate λ1 + λ2
◦ This can be extended to as many independent Poisson random variables as we
like: Σ_{i=1}^{n} Yi ∼ Poisson(Σ_{i=1}^{n} λi)
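• The additivity property is easy to check empirically. A minimal sketch in Python (NumPy; the rates 2 and 3 are illustrative):

```python
import numpy as np

# If Y1 ~ Poisson(2) and Y2 ~ Poisson(3) are independent,
# then Y1 + Y2 ~ Poisson(5), so its sample mean and variance
# should both be close to 5.
rng = np.random.default_rng(0)
y1 = rng.poisson(2.0, size=100_000)
y2 = rng.poisson(3.0, size=100_000)
s = y1 + y2

print(round(s.mean(), 2))  # close to 5
print(round(s.var(), 2))   # close to 5
```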

2.1.1 Homogeneous Poisson Process


Poisson Process: Definition

• A Poisson Process of rate λ > 0 is an integer-valued stochastic process {X(t); t ≥ 0}


for which the following three properties hold:

1. For any time points t0 = 0 < t1 < t2 < · · · < tn , the process increments
X(t1 )−X(t0 ), X(t2 )−X(t1 ), . . . , X(tn )−X(tn−1 ) are independent random variables
2. For any s ≥ 0 and t > 0, the random variable X(s + t) − X(s) ∼ Poisson(λt)

3. X(0) = 0

• This also implies that E [X(t)] = λt and Var [X(t)] = λt

• Note that time is now continuous, rather than discrete as in the Markov Chains we
studied

• Most quantities that can be modelled using a Poisson distribution can also be mod-
elled as a Poisson process, for example:

◦ Time points of x-ray emissions of a substance undergoing radioactive decay


◦ Occurrence of road accidents at a certain intersection
◦ Location of faults or defects along the length of a cable
◦ Successive arrival times of customers for service
◦ Births at a hospital
◦ Occurrences of a non-communicable disease in a population

Poisson Process: Example 1

• Births at a certain hospital occur at random with an average rate of 0.4 per hour. It
is currently 08:00 at the hospital. Model the births as a Poisson Process and answer
the following questions:

(a) What is the probability that no births occur by 10:00?


(b) Given that no births occur by 10:00, what is the probability that exactly one
birth occurs between 10:00 and 11:00?

• To answer (a) we observe that 10:00 is 2 time units (hours) after 08:00. Thus we
are interested in Pr (X(2) − X(0) = 0), where X(2) − X(0) = X(2), since X(0) = 0
(property 3 above). By assumption, X(2) has a Poisson distribution with parameter
λt = 0.4(2) = 0.8. Thus

$$\Pr(X(2) = 0) = \frac{e^{-0.8}(0.8)^0}{0!} = 0.4493$$

• To answer (b) we observe that 11:00 is 3 time units (hours) after 08:00. The number
of births between 10:00 and 11:00 is therefore X(3) − X(2). We are interested
in Pr (X(3) − X(2) = 1|X(2) − X(0) = 0). However, X(3) − X(2) is independent of
X(2) − X(0) = X(2) (see property 1 above). Thus the conditional probability is
the same as the unconditional probability: Pr (X(3) − X(2) = 1|X(2) = 0) reduces
to Pr (X(3) − X(2) = 1). Also, the distribution of X(3) − X(2) is the same as the
distribution of X(1) − X(0) = X(1) (see property 2 above: letting t = 1 and s = 0 or
s = 2, we see that in both cases we get a random variable with Poisson(λ(1) = 0.4)
distribution). Thus:

$$\Pr(X(3) - X(2) = 1) = \Pr(X(1) = 1) = \frac{e^{-0.4}(0.4)^1}{1!} = 0.2681$$
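The two answers can be checked directly from the Poisson probability mass function. A standard-library sketch (the helper `poisson_pmf` is ours):

```python
import math

def poisson_pmf(k, mu):
    # Pr(Y = k) for Y ~ Poisson(mu)
    return math.exp(-mu) * mu**k / math.factorial(k)

rate = 0.4  # births per hour
# (a) no births between 08:00 and 10:00: X(2) ~ Poisson(0.4 * 2)
p_a = poisson_pmf(0, rate * 2)
# (b) by independent increments, the conditioning drops out:
#     X(3) - X(2) ~ Poisson(0.4 * 1)
p_b = poisson_pmf(1, rate * 1)
print(round(p_a, 4), round(p_b, 4))  # 0.4493 0.2681
```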

Poisson Process: Example 2

• Emergency calls are received at a dispatch centre according to a Poisson process


of rate 0.15 per minute. If it is currently 09:00, what is the probability that exactly
one call has been received by 09:30 and that exactly 20 calls have been received by
11:30?

• Notice in this case that the λ parameter is measured in units of per minute. We must
either transform the units to be per hour (λ = 0.15(60) = 9) or we can keep λ = 0.15
and measure t in minutes in our calculations concerning Pr (X(t)). Let us use both
approaches to show that we get the same answer. In the first case (time measured
in hours) we proceed as follows:
" ! ! #
1 5
Pr X =1∩X = 20
2 2
" ! ! ! #
1 5 1
= Pr X =1∩X −X = 19
2 2 2
" ! # " ! ! #
1 5 1
= Pr X = 1 Pr X −X = 19 (by independence, from property 1)
2 2 2
" ! #
1
= Pr X = 1 Pr [X (2) − X (0) = 19] (by property 2)
2
" ! #
1
= Pr X = 1 Pr [X (2) = 19] (by property 3)
2
e−9/2 (9/2)1 e−9×2) (9 × 2)19
=
1! 19!
= (0.04999)(0.08867) = 0.004433

• In the second case:

$$
\begin{aligned}
&\Pr\left[X(30) = 1 \cap X(150) = 20\right] \\
&= \Pr\left[X(30) = 1 \cap X(150) - X(30) = 19\right] \\
&= \Pr\left[X(30) = 1\right]\Pr\left[X(150) - X(30) = 19\right] && \text{(by independence, from property 1)} \\
&= \Pr\left[X(30) = 1\right]\Pr\left[X(120) - X(0) = 19\right] && \text{(by property 2)} \\
&= \Pr\left[X(30) = 1\right]\Pr\left[X(120) = 19\right] && \text{(by property 3)} \\
&= \frac{e^{-0.15(30)}(0.15(30))^1}{1!} \cdot \frac{e^{-0.15(120)}(0.15(120))^{19}}{19!} \\
&= (0.04999)(0.08867) = 0.004433
\end{aligned}
$$

Poisson Process: Example 3

• Mutations occur in a gene according to a Poisson process with average rate of 1.4
per generation. Determine the following:

(a) The probability that two mutations occur in the first generation

(b) The probability that two mutations occurred in the first generation and that six
mutations occur in the first three generations
(c) The probability that two mutations occurred in the first generation, given that
six mutations occur in the first three generations
(d) The probability that six mutations occur in the first three generations, given
that two mutations occurred in the first generation

• By carefully applying the rules of conditional probability we can solve these.

(a)
$$\Pr(X(1) = 2) = \frac{e^{-1.4(1)}(1.4 \times 1)^2}{2!} = 0.2417$$
(b)
$$
\begin{aligned}
\Pr(X(1) = 2 \cap X(3) = 6) &= \Pr(X(1) = 2 \cap X(3) - X(1) = 4) \\
&= \Pr(X(1) = 2) \times \Pr(X(3) - X(1) = 4) && \text{(by property 1)} \\
&= \Pr(X(1) = 2) \times \Pr(X(2) = 4) && \text{(by properties 2 and 3)} \\
&= \frac{e^{-1.4(1)}(1.4 \times 1)^2}{2!} \times \frac{e^{-1.4(2)}(1.4 \times 2)^4}{4!} \\
&= 0.24167 \times 0.15574 = 0.03764
\end{aligned}
$$

(c) There are actually two methods to calculate this probability. One method is to use conditional probability rules as follows:
$$\Pr(X(1) = 2|X(3) = 6) = \frac{\Pr(X(1) = 2 \cap X(3) = 6)}{\Pr(X(3) = 6)} = \frac{0.037637}{\dfrac{e^{-1.4(3)}(1.4 \times 3)^6}{6!}} = \frac{0.037637}{0.11432} = 0.3292$$
There is a quicker way to calculate this probability. It can be proven that, for a Poisson process $\{X(t)\}$ of rate $\lambda > 0$, for $0 < u < t$ and $0 \leq k \leq n$,
$$\Pr(X(u) = k|X(t) = n) = \binom{n}{k}(u/t)^k(1 - u/t)^{n-k};$$
that is, $X(u)|X(t) = n$ has a binomial distribution with parameters $n = n$ and $p = u/t$.
Thus, in this example,
$$\Pr(X(1) = 2|X(3) = 6) = \binom{6}{2}(1/3)^2(1 - 1/3)^{6-2} = 0.3292$$

(d) Note: in this case we cannot use the binomial shortcut, because here $k > n$ and $u > t$ (the conditioning runs in the opposite direction to the result above).
$$
\begin{aligned}
\Pr(X(3) = 6|X(1) = 2) &= \frac{\Pr(X(1) = 2 \cap X(3) = 6)}{\Pr(X(1) = 2)} \\
&= \frac{\Pr(X(1) = 2)\Pr(X(2) = 4)}{\Pr(X(1) = 2)} && \text{(see (b) above)} \\
&= \Pr(X(2) = 4) = 0.1557 && \text{(see (b) above)}
\end{aligned}
$$
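Both routes to part (c) can be verified numerically; the sketch below (standard library only, `poisson_pmf` is our helper) confirms that the joint-over-marginal calculation and the binomial shortcut agree:

```python
import math

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

lam = 1.4
# Method 1: conditional probability as joint over marginal
joint = poisson_pmf(2, lam * 1) * poisson_pmf(4, lam * 2)  # Pr(X(1)=2, X(3)=6)
cond_1 = joint / poisson_pmf(6, lam * 3)
# Method 2: the binomial shortcut, X(1) | X(3) = 6 ~ Binomial(6, 1/3)
cond_2 = math.comb(6, 2) * (1 / 3) ** 2 * (2 / 3) ** 4
print(round(cond_1, 4), round(cond_2, 4))  # 0.3292 0.3292
```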

Poisson Process: Example 4

• Let {X(t); t ≥ 0} be a Poisson process having rate parameter λ = 2.3. Determine the
following probabilities:

(a) Pr (X(1) ≤ 2)
(b) Pr (X(1) ≥ 2|X(1) ≥ 1)

(a)
$$\Pr(X(1) \leq 2) = \Pr(X(1) = 0) + \Pr(X(1) = 1) + \Pr(X(1) = 2) = \frac{e^{-2.3}(2.3)^0}{0!} + \frac{e^{-2.3}(2.3)^1}{1!} + \frac{e^{-2.3}(2.3)^2}{2!} = 0.5960$$
(b)
$$
\begin{aligned}
\Pr(X(1) \geq 2|X(1) \geq 1) &= \frac{\Pr(X(1) \geq 2 \cap X(1) \geq 1)}{\Pr(X(1) \geq 1)} \\
&= \frac{\Pr(X(1) \geq 2)}{\Pr(X(1) \geq 1)} && \text{since } X(1) \geq 1 \text{ is necessarily true if } X(1) \geq 2 \\
&= \frac{1 - \Pr(X(1) = 1) - \Pr(X(1) = 0)}{1 - \Pr(X(1) = 0)} \\
&= \frac{1 - e^{-2.3}(2.3)^1/1! - e^{-2.3}(2.3)^0/0!}{1 - e^{-2.3}(2.3)^0/0!} = \frac{0.6691}{0.8997} = 0.7437
\end{aligned}
$$

2.1.2 Non-Homogeneous Poisson Process
Non-homogeneous Poisson Process: Definition

• A non-homogeneous Poisson Process is one where the rate λ is not constant but
varies with time: λ = λ(t)
• Properties 1 and 3 of the homogeneous Poisson Process definition still hold (inde-
pendence of increments over disjoint intervals; initial value 0), but Property 2 is
different for a nonhomogeneous Poisson process:

◦ An increment $X(t) - X(s)$, giving the number of events in the interval $(s, t]$, has a Poisson distribution with parameter $\Lambda = \int_s^t \lambda(u)\,du$

• What we are basically doing when we calculate Λ is transforming the non-homogeneous


Poisson Process into a homogeneous Poisson process

Non-homogeneous Poisson Process: Example 1

• Demands on a first aid facility in a certain location occur according to a non-homogeneous Poisson process having the rate function
$$\lambda(t) = \begin{cases} 2t & \text{if } 0 \leq t < 1 \\ 2 & \text{if } 1 \leq t < 2 \\ 4 - t & \text{if } 2 \leq t \leq 4 \end{cases}$$

• t is measured in hours from the opening time of the facility


• What is the probability that two demands occur in the first two hours of operation
and two in the last two hours?
• Since demands in disjoint intervals are independent random variables, we can an-
swer the two questions separately
• For the first two hours:
$$\Lambda = \int_0^1 2t\,dt + \int_1^2 2\,dt = \left[t^2\right]_0^1 + \left[2t\right]_1^2 = 1 + (4 - 2) = 3$$
• Therefore $\Pr(X(2) = 2) = \dfrac{e^{-3}3^2}{2!} = 0.2240$
• For the last two hours:
$$\Lambda = \int_2^4 (4 - t)\,dt = \left[4t - \frac{t^2}{2}\right]_2^4 = (16 - 8) - (8 - 2) = 2$$
• Therefore $\Pr(X(4) - X(2) = 2) = \dfrac{e^{-2}2^2}{2!} = 0.2707$
• Thus $\Pr(X(2) = 2 \cap X(4) - X(2) = 2) = 0.2240 \times 0.2707 = 0.0606$
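When the rate function has no convenient antiderivative, Λ can be obtained by numerical quadrature instead. A standard-library sketch of this example (the helpers `rate`, `integrate`, and `poisson_pmf` are ours; a simple midpoint rule stands in for a proper quadrature routine):

```python
import math

def rate(t):
    # Piecewise rate function from the first aid example (t in hours)
    if t < 1:
        return 2 * t
    if t < 2:
        return 2.0
    return 4 - t

def integrate(f, a, b, n=100_000):
    # Simple midpoint rule; accurate enough for this piecewise-linear rate
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

lam_first = integrate(rate, 0, 2)  # analytically 3
lam_last = integrate(rate, 2, 4)   # analytically 2
p = poisson_pmf(2, lam_first) * poisson_pmf(2, lam_last)
print(round(p, 4))  # 0.0606
```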

Non-homogeneous Poisson Process: Example 2

• A video camera is placed near a watering hole in the Kruger National Park to mon-
itor animals coming to drink. Within each 24 hour period, beginning at 00h00 and
ending at 24h00, the number of animals arriving to drink per hour follows a Poisson
process with the following rate function, where t is measured in hours from 00h00:
$$\lambda(t) = \frac{1}{1000}\left(-t^4 + 48t^3 - 792t^2 + 5184t\right) \quad \text{for } 0 \leq t \leq 24$$
Determine the following quantities:

(a) The probability that at least three animals come to drink before 01h00.
(b) The number of animals that are expected to arrive between 06h00 and 09h30.

(a)
$$
\begin{aligned}
\Lambda &= \int_0^1 \lambda(t)\,dt = \frac{1}{1000}\int_0^1 \left(-t^4 + 48t^3 - 792t^2 + 5184t\right)dt \\
&= \frac{1}{1000}\left[-\frac{t^5}{5} + 12t^4 - 264t^3 + 2592t^2\right]_0^1 \\
&= \frac{1}{1000}\left(-\frac{1}{5} + 12(1)^4 - 264(1)^3 + 2592(1)^2\right) = \frac{2339.8}{1000} = 2.3398
\end{aligned}
$$
$$
\begin{aligned}
\Pr(X(1) \geq 3) &= 1 - \Pr(X(1) = 0) - \Pr(X(1) = 1) - \Pr(X(1) = 2) \\
&= 1 - \frac{e^{-2.3398}(2.3398)^0}{0!} - \frac{e^{-2.3398}(2.3398)^1}{1!} - \frac{e^{-2.3398}(2.3398)^2}{2!} = 0.4145
\end{aligned}
$$

(b)
$$
\begin{aligned}
\Lambda &= \int_6^{9.5} \lambda(t)\,dt = \frac{1}{1000}\int_6^{9.5}\left(-t^4 + 48t^3 - 792t^2 + 5184t\right)dt \\
&= \frac{1}{1000}\left[-\frac{t^5}{5} + 12t^4 - 264t^3 + 2592t^2\right]_6^{9.5} \\
&= \frac{1}{1000}\left(89846.13 - 50284.8\right) = \frac{39561.33}{1000} = 39.561
\end{aligned}
$$
Thus $E(X(9.5) - X(6)) = 39.561$

2.1.3 Sojourn and Waiting Times


A Note on Continuous Probability Distributions
• Discrete probability distributions (like the Poisson distribution) have a probabil-
ity mass function that allows us to calculate the probability of a particular value
directly by substituting that value into the function
• Continuous probability distributions, like those we will encounter in this section
(Exponential and Gamma distributions) have a probability density function, not a
probability mass function
• One does not obtain the probability of a particular value by substituting that value
into the probability density function (in fact, if Y is a continuous random variable,
then Pr(Y = k) = 0 for any particular real number k)
• Rather, one uses the probability density function to find the probability that the
random variable falls within a certain interval by integrating over that interval
(thus finding the area under the probability density function within that interval)
• So, if $Y$ is a continuous random variable with probability density function $f_Y(y)$, then
$$\Pr(k_1 \leq Y \leq k_2) = \int_{k_1}^{k_2} f_Y(y)\,dy.$$

The Exponential Distribution and Memoryless Property


• A non-negative random variable $T$ is said to have an exponential distribution with parameter $\lambda > 0$ if its probability density function is:
$$f(t) = \begin{cases} \lambda e^{-\lambda t} & \text{for } t \geq 0 \\ 0 & \text{for } t < 0 \end{cases}$$

• The mean and variance of this distribution are $E(T) = \dfrac{1}{\lambda}$ and $\text{Var}(T) = \dfrac{1}{\lambda^2}$.
• The exponential distribution is important in the theory of continuous-time stochastic
processes.

• Think of T as the waiting time before an event occurs

• Given that the event has not occurred by time t, it can be shown that the conditional
distribution of the remaining waiting time T − t does not depend on t. For any
t, s > 0,
Pr (T > t + s|T > t) = Pr (T > s)

• This is called the memoryless property and is a continuous equivalent of the


Markov property of Markov Chains

• It means that, given that an event has not occurred up to time t, the random variable
of its remaining lifetime is statistically the same as if the process is just starting, i.e.
t=0

• For a practical example, suppose that you are standing at the side of the road waiting
for a taxi and that the waiting time for a taxi is exponentially distributed with a mean
of 2 minutes. You wait by the side of the road for two minutes and no taxi has come
yet. What is the expected waiting time now? Is it 0? No, it is still 2 minutes. That
is how the memoryless property works.
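The taxi story can be checked by simulation. A standard-library sketch: we draw many exponential waiting times with mean 2 minutes, and compare the overall mean wait with the mean *remaining* wait among the runs where nothing happened in the first 2 minutes:

```python
import random

rng = random.Random(0)
mean_wait = 2.0  # minutes
n = 200_000
waits = [rng.expovariate(1 / mean_wait) for _ in range(n)]

# Unconditional mean waiting time
overall = sum(waits) / n
# Mean *remaining* wait, restricted to runs where no taxi came in the first 2 minutes
leftover = [w - 2 for w in waits if w > 2]
remaining = sum(leftover) / len(leftover)
print(round(overall, 2), round(remaining, 2))  # both close to 2.0
```

The two means agree: conditioning on having already waited 2 minutes leaves the distribution of the remaining wait unchanged.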

The Gamma Distribution

• If we have an integer number $\alpha > 0$ of independent exponentially distributed random variables $T_1, T_2, \ldots, T_\alpha$ having the same parameter $\lambda$, then their sum $U = T_1 + T_2 + \cdots + T_\alpha$ has a Gamma distribution, whose probability density function is as follows:
$$f(u) = \begin{cases} \dfrac{\lambda}{(\alpha - 1)!}(\lambda u)^{\alpha-1}e^{-\lambda u} & \text{for } u \geq 0 \\ 0 & \text{for } u < 0 \end{cases}$$

• The mean of a Gamma-distributed random variable is $E(U) = \dfrac{\alpha}{\lambda}$ and the variance is $\text{Var}(U) = \dfrac{\alpha}{\lambda^2}$

• If $\alpha$ is small enough, the integral needed to work out probabilities can be done using integration by parts (see example below); if $\alpha$ is large, the probabilities can be computed numerically in MATLAB or other statistical software

Figure 2.1: Line Graph Showing State of Poisson Process over Time

Sojourn and Waiting Times of a Poisson Process

• If we have a Poisson process {X(t); t ≥ 0} with rate λ > 0 then we can define the nth
waiting time, that is, the time of occurrence of the nth event, as Wn

• For simplicity we usually set W0 = 0

• The differences S n = Wn+1 − Wn are called sojourn times; S n measures the duration
that the Poisson process spends in state n (see illustration in Figure 2.1)

• It can be shown that the waiting time to the first event, W1 , is a random variable
with an exponential distribution with rate λ; thus the probability density function of
W1 is
fW1 (t) = λe−λt for t ≥ 0

• It can also be shown that the sojourn times S 0 , S 1 , . . . , S n−1 are independent random
variables, each having the exponential distribution with rate λ; thus the probability
density function of S k is
fS k (s) = λe−λs for s ≥ 0

• Furthermore, because $W_n = S_0 + S_1 + S_2 + \cdots + S_{n-1}$, by the definition of the Gamma probability distribution described above, $W_n$ has a Gamma distribution with parameters $\alpha = n$ and $\lambda$; thus the probability density function of $W_n$ is
$$f_{W_n}(t) = \frac{\lambda}{(n-1)!}(\lambda t)^{n-1}e^{-\lambda t} \quad \text{for } t \geq 0$$

Waiting Times Example 1

• A radioactive source emits particles according to a Poisson process of rate λ = 2


particles per minute.
• What is the probability that the first particle appears some time after three minutes
but before five minutes?
• We obtain the answer by integrating over the probability density function of W1
between 3 and 5
$$
\begin{aligned}
\Pr(3 \leq W_1 \leq 5) &= \int_3^5 \lambda e^{-\lambda t}\,dt = 2\int_3^5 e^{-2t}\,dt = 2\left[-\frac{1}{2}e^{-2t}\right]_3^5 \\
&= e^{-2(3)} - e^{-2(5)} = 0.002479 - 0.0000454 = 0.00243
\end{aligned}
$$

• What is the expected waiting time before the third particle is emitted?
$$E(W_3) = \frac{n}{\lambda} = \frac{3}{2} = 1.5$$
It is expected that the third particle will be emitted after 1.5 minutes, that is, 1 minute 30 seconds.
• What is the probability that the waiting time before the third particle is emitted is
more than four minutes?
• Since our interval has no upper bound, we integrate the probability density function of $W_3$ from 4 to $\infty$:
$$\Pr(W_3 > 4) = \int_4^\infty \frac{\lambda}{(n-1)!}(\lambda t)^{n-1}e^{-\lambda t}\,dt = \int_4^\infty \frac{2}{2!}(2t)^2 e^{-2t}\,dt = 4\int_4^\infty t^2 e^{-2t}\,dt$$

At this point we need to use integration by parts to go further. Remember that integration by parts breaks an integral down using the fact that $\int_a^b f(t)g'(t)\,dt = \left[f(t)g(t)\right]_a^b - \int_a^b f'(t)g(t)\,dt$. Here, $f(t) = t^2$ and $g'(t) = e^{-2t}$. Thus:
$$
\begin{aligned}
\Pr(W_3 > 4) &= 4\left(\left[-\frac{1}{2}t^2 e^{-2t}\right]_4^\infty - \int_4^\infty \left(-\frac{1}{2}\right)(2t)e^{-2t}\,dt\right) \\
&= 4\left(\left[-\frac{1}{2}t^2 e^{-2t}\right]_4^\infty + \int_4^\infty t e^{-2t}\,dt\right)
\end{aligned}
$$
We must use integration by parts again, with $f(t) = t$ and $g'(t) = e^{-2t}$:
$$
\begin{aligned}
&= 4\left(\left[-\frac{1}{2}t^2 e^{-2t}\right]_4^\infty + \left[-\frac{1}{2}t e^{-2t}\right]_4^\infty - \int_4^\infty \left(-\frac{1}{2}\right)e^{-2t}\,dt\right) \\
&= 4\left(\left[-\frac{1}{2}t^2 e^{-2t}\right]_4^\infty + \left[-\frac{1}{2}t e^{-2t}\right]_4^\infty + \frac{1}{2}\left[-\frac{1}{2}e^{-2t}\right]_4^\infty\right) \\
&= 4\left(\left[0 - \left(-\frac{1}{2}4^2 e^{-8}\right)\right] + \left[0 - \left(-\frac{1}{2}(4)e^{-8}\right)\right] + \frac{1}{2}\left[0 - \left(-\frac{1}{2}e^{-8}\right)\right]\right) \\
&= 4\left(8e^{-8} + 2e^{-8} + \frac{1}{4}e^{-8}\right) = 4\left(10.25e^{-8}\right) = 0.01375
\end{aligned}
$$
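The integration by parts can also be sidestepped using the duality between waiting times and counts: $W_3 > 4$ exactly when fewer than 3 events have occurred by $t = 4$, so the Gamma survival probability is a short Poisson sum. A standard-library sketch (the helper name `gamma_sf` is ours):

```python
import math

def gamma_sf(t, n, rate):
    # Pr(W_n > t): fewer than n events have occurred by time t,
    # so this equals the sum of Poisson(rate * t) probabilities for k < n
    mu = rate * t
    return sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(n))

print(round(gamma_sf(4, 3, 2), 5))  # 0.01375
```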

2.1.4 Spatial Poisson Processes
Spatial Poisson Processes: Definition

• So far we have only talked about Poisson processes where the index variable is time

• Thus the process described events occurring in time

• However, the Poisson distribution can also describe the occurrence of objects in
space, such as:

◦ The distribution of scratches on a piece of furniture


◦ The distribution of bacteria on a microscope slide
◦ The distribution of weeds in a garden
◦ The distribution of stars in space

• Let A be a set of points in S which is a subset of n-dimensional space (we will be


using n = 2 or n = 3)

• We define {N(A); A ∈ S } to be a spatial Poisson process where N(A) counts the


number of objects of interest in the space defined by A

• We define λ > 0 as the intensity of the process (the ‘rate’ at which objects occur in
space on average)

• We define $|A|$ as the size of $A$ (which could be length, area, or volume, depending on $n$)

Spatial Poisson Processes: Result

• The probability distribution of $N(A)$ is given by:
$$\Pr(N(A) = k) = \frac{e^{-\lambda|A|}(\lambda|A|)^k}{k!} \quad \text{for } k = 0, 1, 2, \ldots$$

Spatial Poisson Processes: Example 1

• Bacteria are distributed throughout a cylindrical container of liquid according to a


spatial Poisson process of intensity λ = 0.06 organisms per mm3 . The container is
10 mm high and has a diameter of 4 mm. What is the probability that more than
two bacteria are in the container?

• The volume of a cylinder is $V = \pi r^2 h$ where $r$ is the radius and $h$ is the height. Thus the volume of this cylinder is:
$$V = \pi r^2 h = 3.14159(2^2)(10) = 125.6637$$

• Thus:
$$
\begin{aligned}
\Pr(N(A) > 2) &= 1 - \left[\Pr(N(A) = 0) + \Pr(N(A) = 1) + \Pr(N(A) = 2)\right] \\
&= 1 - \frac{e^{-0.06(125.6637)}(0.06 \times 125.6637)^0}{0!} - \frac{e^{-0.06(125.6637)}(0.06 \times 125.6637)^1}{1!} \\
&\quad - \frac{e^{-0.06(125.6637)}(0.06 \times 125.6637)^2}{2!} \\
&= 0.9804
\end{aligned}
$$

There is about a 98% chance that there are more than 2 bacteria in the fluid.
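The same calculation in a standard-library sketch (the helper `poisson_cdf` is ours); the only new step compared with a temporal Poisson process is that $\lambda|A|$ replaces $\lambda t$:

```python
import math

def poisson_cdf(k, mu):
    # Pr(N <= k) for N ~ Poisson(mu)
    return sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(k + 1))

volume = math.pi * 2**2 * 10  # cylinder: radius 2 mm, height 10 mm
mu = 0.06 * volume            # intensity times region size, lambda * |A|
p_more_than_2 = 1 - poisson_cdf(2, mu)
print(round(p_more_than_2, 4))  # 0.9804
```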

Spatial Poisson Processes: Example 2

• Defects (air bubbles, chips, etc.) occur over the surface of a varnished table top
(1m by 2m in size) according to a Poisson process at a mean rate of one defect
per square metre. If an inspector checks two such tables for defects, what is the
probability that he will find at least one defect in both tables? (Assume that the
numbers of defects on the two table tops are independent.)

• The area of each table top is $2 \times 1 = 2\,\text{m}^2$. Thus:
$$
\begin{aligned}
\Pr(N_1(A) > 0 \cap N_2(A) > 0) &= \Pr(N_1(A) > 0) \times \Pr(N_2(A) > 0) \\
&= \left(1 - \Pr(N_1(A) = 0)\right) \times \left(1 - \Pr(N_2(A) = 0)\right) \\
&= \left(1 - \frac{e^{-1(2)}(1 \times 2)^0}{0!}\right)^2 \\
&= 0.7476
\end{aligned}
$$

There is about a 75% chance that at least one defect will be found in both tables.

2.2 Birth Process


2.2.1 Pure Birth Process
Pure Birth Process: Definition

• A pure birth process is a generalization of the Poisson process in which the prob-
ability of an event occurring at a given instant of time depends on the number of
events that have already occurred

• Previously this probability was fixed or was a function of time

• Now, instead of λ or λ(t) we have λk : the rate at which events occur depends on k,
the current state of the process

• The name comes from the fact that this type of process can be used to model births
in a population

• The process variable X(t) denotes the number of births in the time interval (0, t],
not necessarily the population size

• We define Pn (t) = Pr (X(t) = n|X(0) = 0) as the probability that there are n births by
time t, given that there were 0 births at time 0

• However, it can be shown that the times between births S k are exponentially dis-
tributed random variables:

S k ∼ Exponential(λk ) for k = 0, 1, 2, . . .

• $P_n(t)$ can be expressed using the following integral equation:
$$P_n(t) = \lambda_{n-1} e^{-\lambda_n t} \int_0^t e^{\lambda_n x} P_{n-1}(x)\,dx \quad \text{for } n = 1, 2, \ldots$$


• Once we impose the restriction that $\sum_{n=0}^{\infty} P_n(t) = 1$ and assume that no two birth parameters $\lambda_0, \lambda_1, \ldots$ are equal, this integral equation solves as follows:
$$P_0(t) = e^{-\lambda_0 t}$$
$$P_n(t) = \left(\prod_{\ell=0}^{n-1} \lambda_\ell\right) \sum_{k=0}^{n} B_{k,n} e^{-\lambda_k t} = \lambda_0 \cdots \lambda_{n-1}\left(B_{0,n} e^{-\lambda_0 t} + \cdots + B_{n,n} e^{-\lambda_n t}\right), \quad n = 1, 2, \ldots, \text{ where}$$
$$B_{k,n} = \prod_{j=0;\, j \neq k}^{n} \left(\lambda_j - \lambda_k\right)^{-1} = \frac{1}{(\lambda_0 - \lambda_k)\cdots(\lambda_{k-1} - \lambda_k)(\lambda_{k+1} - \lambda_k)\cdots(\lambda_n - \lambda_k)}, \quad k = 0, 1, \ldots, n$$

• We can use this formula to find any Pn (t), although obviously this formula gets very
complicated as n increases

Pure Birth Process: Example

• A pure birth process starting from X(0) = 0 and with time measured in years has
birth parameters λ0 = 1, λ1 = 3, λ2 = 2, λ3 = 5.

1. Determine Pn (t) for n = 0, 1, 2, 3


2. What is the probability that there will be more than three births in the first two
years?

1.
$$P_0(t) = e^{-\lambda_0 t} = e^{-t}$$
$$
\begin{aligned}
P_1(t) &= \lambda_0\left(B_{0,1}e^{-\lambda_0 t} + B_{1,1}e^{-\lambda_1 t}\right) = \lambda_0\left(\frac{1}{\lambda_1 - \lambda_0}e^{-\lambda_0 t} + \frac{1}{\lambda_0 - \lambda_1}e^{-\lambda_1 t}\right) \\
&= \frac{1}{3-1}e^{-t} + \frac{1}{1-3}e^{-3t} = \frac{1}{2}e^{-t} - \frac{1}{2}e^{-3t}
\end{aligned}
$$
$$
\begin{aligned}
P_2(t) &= \lambda_0\lambda_1\left(B_{0,2}e^{-\lambda_0 t} + B_{1,2}e^{-\lambda_1 t} + B_{2,2}e^{-\lambda_2 t}\right) \\
&= 3\left(\frac{1}{(\lambda_1-\lambda_0)(\lambda_2-\lambda_0)}e^{-t} + \frac{1}{(\lambda_0-\lambda_1)(\lambda_2-\lambda_1)}e^{-3t} + \frac{1}{(\lambda_0-\lambda_2)(\lambda_1-\lambda_2)}e^{-2t}\right) \\
&= 3\left(\frac{1}{(3-1)(2-1)}e^{-t} + \frac{1}{(1-3)(2-3)}e^{-3t} + \frac{1}{(1-2)(3-2)}e^{-2t}\right) \\
&= 3\left(\frac{1}{2}e^{-t} + \frac{1}{2}e^{-3t} - e^{-2t}\right)
\end{aligned}
$$
$$
\begin{aligned}
P_3(t) &= \lambda_0\lambda_1\lambda_2\left(B_{0,3}e^{-\lambda_0 t} + B_{1,3}e^{-\lambda_1 t} + B_{2,3}e^{-\lambda_2 t} + B_{3,3}e^{-\lambda_3 t}\right) \\
&= 6\left(\frac{1}{(3-1)(2-1)(5-1)}e^{-t} + \frac{1}{(1-3)(2-3)(5-3)}e^{-3t}\right. \\
&\qquad\left. + \frac{1}{(1-2)(3-2)(5-2)}e^{-2t} + \frac{1}{(1-5)(3-5)(2-5)}e^{-5t}\right) \\
&= 6\left(\frac{1}{8}e^{-t} + \frac{1}{4}e^{-3t} - \frac{1}{3}e^{-2t} - \frac{1}{24}e^{-5t}\right)
\end{aligned}
$$

2. The probability that there will be more than three births in the first two years
is:


$$
\begin{aligned}
\Pr(X(2) > 3) &= \sum_{n=4}^{\infty} P_n(2) = 1 - \sum_{n=0}^{3} P_n(2) \\
&= 1 - P_0(2) - P_1(2) - P_2(2) - P_3(2) \\
&= 1 - e^{-2} - \left(\frac{1}{2}e^{-2} - \frac{1}{2}e^{-3(2)}\right) - 3\left(\frac{1}{2}e^{-2} + \frac{1}{2}e^{-3(2)} - e^{-2(2)}\right) \\
&\quad - 6\left(\frac{1}{8}e^{-2} + \frac{1}{4}e^{-3(2)} - \frac{1}{3}e^{-2(2)} - \frac{1}{24}e^{-5(2)}\right) \\
&= 1 - \frac{15}{4}e^{-2} - \frac{5}{2}e^{-6} + 5e^{-4} + \frac{1}{4}e^{-10} \\
&= 0.5779
\end{aligned}
$$

• Figure 2.2 displays graphically the probabilities we have calculated as functions of


time:

Figure 2.2: Probability Functions Pn (t) for a Pure Birth Process, n = 0, 1, 2, 3
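The general formula for $P_n(t)$ is easy to mechanise. Below is a minimal standard-library sketch (the helper name `birth_pn` is ours) that builds each $B_{k,n}$ coefficient directly and reproduces the answer to part 2:

```python
import math

def birth_pn(t, n, lams):
    # P_n(t) for a pure birth process with distinct rates lams[0], lams[1], ...
    if n == 0:
        return math.exp(-lams[0] * t)
    total = 0.0
    for k in range(n + 1):
        b = 1.0  # B_{k,n} = product over j != k of 1 / (lambda_j - lambda_k)
        for j in range(n + 1):
            if j != k:
                b /= lams[j] - lams[k]
        total += b * math.exp(-lams[k] * t)
    return math.prod(lams[:n]) * total

lams = [1, 3, 2, 5]
p_more_than_3 = 1 - sum(birth_pn(2, n, lams) for n in range(4))
print(round(p_more_than_3, 4))  # 0.5779
```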

2.2.2 Yule Process
The Yule Process

• The Yule Process is a special case of the Pure Birth Process in which λk = kβ
◦ This means that the birth rate is directly proportional to the population size,
with the proportionality constant being the individual birth rate β
◦ As such, the Yule Process is a stochastic analogue of the deterministic population growth model represented by the ODE $dy/dt = \beta y$
• With a Yule Process we usually let X(0) = 1
• We can derive the following ODE:
$$P_n'(t) = -\beta\left(nP_n(t) - (n-1)P_{n-1}(t)\right) \quad \text{for } n = 1, 2, \ldots$$

• Under the initial conditions $P_1(0) = 1$, $P_n(0) = 0$ for $n = 2, 3, \ldots$ the solution is:
$$P_n(t) = e^{-\beta t}\left(1 - e^{-\beta t}\right)^{n-1} \quad \text{for } n \geq 1$$

• This is the probability mass function of the geometric distribution with p = e−βt
• Remember that the negative binomial distribution is used to model the number of
independent binomial experiment trials required to achieve k successes where the
probability of success in each trial is p
• The geometric distribution is a special case of the negative binomial distribution
where k = 1
• Thus $E(X(t)) = \dfrac{1}{e^{-\beta t}} = e^{\beta t}$ and $\text{Var}(X(t)) = \dfrac{1 - e^{-\beta t}}{\left(e^{-\beta t}\right)^2} = e^{2\beta t} - e^{\beta t}$
Yule Process: Example

• Suppose a microorganism follows a Yule Process with an individual growth rate of


β = 3.2 per minute.
• Given that the initial population size is 1, what is the probability that, after 45
seconds:
1. The population size will be exactly 6?
2. The population size will be more than 8?

1.
$$P_6(3/4) = e^{-3.2(3/4)}\left(1 - e^{-3.2(3/4)}\right)^{6-1} = e^{-2.4}\left(1 - e^{-2.4}\right)^5 = 0.0564$$
The probability that the population size will be exactly 6 after 45 seconds is about 6%

2. We will use the following formula for the first $N$ terms of a geometric series:
$$\sum_{n=1}^{N} ar^{n-1} = a\left(\frac{1 - r^N}{1 - r}\right)$$
In this case, $a = e^{-\beta t}$ and $r = 1 - e^{-\beta t}$, so we have:
$$\sum_{n=1}^{N} ar^{n-1} = e^{-\beta t}\,\frac{1 - (1 - e^{-\beta t})^N}{1 - (1 - e^{-\beta t})} = e^{-\beta t}\,\frac{1 - (1 - e^{-\beta t})^N}{e^{-\beta t}} = 1 - (1 - e^{-\beta t})^N$$
$$
\begin{aligned}
\sum_{n=9}^{\infty} P_n(3/4) &= 1 - \sum_{n=0}^{8} P_n(3/4) \\
&= 1 - \sum_{n=1}^{8} P_n(3/4) && \text{since } P_0(t) = 0 \\
&= 1 - \left(1 - (1 - e^{-\beta t})^8\right) && \text{(geometric series formula with } N = 8\text{)} \\
&= (1 - e^{-\beta t})^8 = \left(1 - e^{-(3.2)(3/4)}\right)^8 = 0.4673
\end{aligned}
$$

Thus the probability that the population size will be greater than 8 after 45
seconds is about 47%.
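Because $X(t)$ is geometric with $p = e^{-\beta t}$, both answers are one-line computations. A standard-library sketch:

```python
import math

beta, t = 3.2, 0.75      # individual birth rate per minute; 45 s = 0.75 min
p = math.exp(-beta * t)  # 'success' probability of the geometric distribution

p_exactly_6 = p * (1 - p) ** 5   # geometric pmf at n = 6
p_more_than_8 = (1 - p) ** 8     # geometric survival function at n = 8
print(round(p_exactly_6, 4), round(p_more_than_8, 4))  # 0.0564 0.4673
```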

2.3 Death Process


2.3.1 Pure Death Process
Pure Death Process: Definition

• A pure death process begins in state N and then moves successively through states
N − 1, N − 2, . . . , 2, 1 and finally is absorbed into state 0

• The process is specified by death parameters µk > 0 for k = 1, 2, . . . , N

• This parameter gives the average rate of deaths which depends on the current pop-
ulation

• By convention we set µ0 = 0

• The times between deaths S k ∼ Exponential(µk ) for k = N, N − 1, . . . , 1

• Similarly to the pure birth process, as long as no two death parameters are equal
(µ j , µk if j , k) then we can state the transition probabilities of the process
explicitly

• If we define $P_n(t) = \Pr(X(t) = n|X(0) = N)$ then:
$$P_N(t) = e^{-\mu_N t}$$
$$P_n(t) = \left(\prod_{\ell=n+1}^{N} \mu_\ell\right)\sum_{k=n}^{N} A_{k,n}e^{-\mu_k t} = \mu_{n+1}\mu_{n+2}\cdots\mu_N\left(A_{n,n}e^{-\mu_n t} + \cdots + A_{N,n}e^{-\mu_N t}\right), \quad n = N-1, N-2, \ldots, 2, 1, 0,$$
where
$$A_{k,n} = \prod_{j=n;\, j \neq k}^{N}\left(\mu_j - \mu_k\right)^{-1} = \frac{1}{(\mu_n - \mu_k)\cdots(\mu_{k-1} - \mu_k)(\mu_{k+1} - \mu_k)\cdots(\mu_N - \mu_k)} \quad \text{for } k = n, n+1, \ldots, N$$

Pure Death Process: Example

• A pure death process starting from X(0) = 3 has death parameters µ0 = 0, µ1 =


3, µ2 = 2, µ3 = 5.

1. Determine Pn (t) for n = 0, 1, 2, 3.


2. Determine the expected population value at time t = 2.

1. To find $P_n(t)$ for $n = 0, 1, 2, 3$:
$$P_3(t) = e^{-\mu_3 t} = e^{-5t}$$
$$
\begin{aligned}
P_2(t) &= \mu_3\left(A_{2,2}e^{-\mu_2 t} + A_{3,2}e^{-\mu_3 t}\right) = 5\left(\frac{1}{\mu_3-\mu_2}e^{-2t} + \frac{1}{\mu_2-\mu_3}e^{-5t}\right) \\
&= 5\left(\frac{1}{3}e^{-2t} - \frac{1}{3}e^{-5t}\right) = \frac{5}{3}e^{-2t} - \frac{5}{3}e^{-5t}
\end{aligned}
$$
$$
\begin{aligned}
P_1(t) &= \mu_2\mu_3\left(A_{1,1}e^{-\mu_1 t} + A_{2,1}e^{-\mu_2 t} + A_{3,1}e^{-\mu_3 t}\right) \\
&= (2)(5)\left(\frac{1}{(\mu_2-\mu_1)(\mu_3-\mu_1)}e^{-3t} + \frac{1}{(\mu_1-\mu_2)(\mu_3-\mu_2)}e^{-2t} + \frac{1}{(\mu_1-\mu_3)(\mu_2-\mu_3)}e^{-5t}\right) \\
&= 10\left(-\frac{1}{2}e^{-3t} + \frac{1}{3}e^{-2t} + \frac{1}{6}e^{-5t}\right) = -5e^{-3t} + \frac{10}{3}e^{-2t} + \frac{5}{3}e^{-5t}
\end{aligned}
$$
For $P_0(t)$, we have two options: we can either use the formula again, or we can use the complement rule, which will be quicker: since the death process starts from state 3, the only possible values of $X(t)$ are 0, 1, 2, and 3, and therefore $P_0(t) = 1 - [P_1(t) + P_2(t) + P_3(t)]$.

$$
\begin{aligned}
P_0(t) &= 1 - P_3(t) - P_2(t) - P_1(t) \\
&= 1 - e^{-5t} - \frac{5}{3}e^{-2t} + \frac{5}{3}e^{-5t} + 5e^{-3t} - \frac{10}{3}e^{-2t} - \frac{5}{3}e^{-5t} \\
&= 1 + 5e^{-3t} - 5e^{-2t} - e^{-5t}
\end{aligned}
$$
As an exercise, use the general $P_n(t)$ formula to find $P_0(t)$ and see if you get the same answer (you should!)
2. To find $E[X(2)]$, we can use the basic definition of expected value for a discrete random variable $Y$, i.e. $E(Y) = \sum_y y\Pr(Y = y)$. In this case, our random variable is $X(t)|X(0) = N$, so:
$$
\begin{aligned}
E[X(t)|X(0) = N] &= \sum_{n=0}^{N} nP_n(t) = 0P_0(t) + 1P_1(t) + 2P_2(t) + 3P_3(t) \\
&= \left(-5e^{-3t} + \frac{10}{3}e^{-2t} + \frac{5}{3}e^{-5t}\right) + 2\left(\frac{5}{3}e^{-2t} - \frac{5}{3}e^{-5t}\right) + 3\left(e^{-5t}\right) \\
&= -5e^{-3t} + \frac{20}{3}e^{-2t} + \frac{4}{3}e^{-5t} \\
E[X(2)] &= -5e^{-3(2)} + \frac{20}{3}e^{-2(2)} + \frac{4}{3}e^{-5(2)} = 0.1098
\end{aligned}
$$
The expected number of individuals remaining at time $t = 2$ is 0.1098.

• Figure 2.3 displays the probabilities we have calculated as functions of t

Figure 2.3: Probability Functions Pn (t) for a Pure Death Process, n = 0, 1, 2, 3

2.3.2 Linear Death Process


Linear Death Process: Definition

• This is the pure death process equivalent of the Yule Process

• We suppose that µk = kα, that is, the average death rate is proportional to the current
population

• α is the proportionality constant representing the individual death rate in the popu-
lation

• In this case it can be shown that:
$$P_n(t) = \binom{N}{n}\left(e^{-\alpha t}\right)^n\left(1 - e^{-\alpha t}\right)^{N-n} \quad \text{for } n = 0, \ldots, N$$

• That is, $X(t)$ is binomially distributed with the number of trials being $N$ and the probability of success $p = e^{-\alpha t}$

• This implies that $E(X(t)) = Ne^{-\alpha t}$ and $\text{Var}(X(t)) = Ne^{-\alpha t}\left(1 - e^{-\alpha t}\right)$

• We can also define $T$ as the time of extinction, that is, $T = \min\{t \geq 0;\, X(t) = 0\}$

• It can be shown that $\Pr(T < t) = \left(1 - e^{-\alpha t}\right)^N$ and $E(T) = \dfrac{1}{\alpha}\left(1 + \dfrac{1}{2} + \dfrac{1}{3} + \cdots + \dfrac{1}{N}\right)$


• Note that, for large $N$, the harmonic sum $1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N} \approx \ln N + \gamma$, where $\gamma \approx 0.5772156649$ is the Euler-Mascheroni constant. Thus we can approximate
$$E(T) \approx \frac{1}{\alpha}(\ln N + \gamma)$$

Linear Death Process: Example

• Consider a linear death process in which the initial population is 5 and the individual
average death rate is 2.

1. Determine Pr (X(t) = 3|X(0) = 5).


2. Determine the probability of extinction before time t = 2
3. Determine the expected time of extinction

1. With $N = 5$ and $\alpha = 2$:
$$\Pr(X(t) = 3|X(0) = 5) = P_3(t) = \binom{5}{3}e^{-2(3)t}\left(1 - e^{-2t}\right)^{5-3} = 10e^{-6t}\left(1 - e^{-2t}\right)^2$$
2.
$$\Pr(T < 2) = \left(1 - e^{-\alpha t}\right)^N = \left(1 - e^{-2(2)}\right)^5 = \left(1 - e^{-4}\right)^5 = 0.9117$$
There is a 91% chance that the population will be extinct by time $t = 2$.
3.
$$E(T) = \frac{1}{\alpha}\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{N}\right) = \frac{1}{2}\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \frac{1}{5}\right) = 1.1417$$

The expected time of extinction is t = 1.1417.
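All three answers follow from the binomial and extinction-time formulas above. A standard-library sketch (the helper `p_n` is ours):

```python
import math

N, alpha = 5, 2  # initial population; individual death rate

def p_n(t, n):
    # Pr(X(t) = n | X(0) = N): binomial with survival probability exp(-alpha * t)
    p = math.exp(-alpha * t)
    return math.comb(N, n) * p**n * (1 - p) ** (N - n)

# Probability of extinction before t = 2
p_extinct = (1 - math.exp(-alpha * 2)) ** N
# Expected extinction time via the harmonic sum
e_T = sum(1 / k for k in range(1, N + 1)) / alpha
print(round(p_extinct, 4), round(e_T, 4))  # 0.9117 1.1417
```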

2.4 Birth-and-Death Process
Birth-and-Death Process: Definition

• A birth-and-death process combines the birth process and death process into a
single process

• We have a population in which births occur at rate λk and deaths occur with rate µk
both of which depend on the current population k

• If the process is in state k, it will sooner or later transition either to state k + 1 or to


state k − 1

• We can define Pi j (t) = Pr (X(t + s) = j|X(s) = i) for all s ≥ 0 as the probability that
the population changes from i to j during a period of length t

• This leads to some stochastic differential equations that are beyond our scope, but
we consider one special case: linear growth with immigration

Linear Growth with Immigration

• Suppose we have a Birth-and-Death Process with birth parameters λk = a + λk and


µk = µk for k = 0, 1, . . . where:

◦ λ > 0 is the individual birth rate


◦ a > 0 is the rate of immigration into the population
◦ µ > 0 is the individual death rate

• If the initial population is $X(0) = i$, it can be shown that the expected population size at time $t$ satisfies the equation:
$$E[X(t)] = at + i \quad \text{if } \lambda = \mu$$
$$E[X(t)] = \frac{a}{\lambda - \mu}\left[e^{(\lambda-\mu)t} - 1\right] + ie^{(\lambda-\mu)t} \quad \text{if } \lambda \neq \mu$$

• If we take the limit as $t \to \infty$, we see that $E[X(t)] \to \infty$ if $\lambda \geq \mu$ and $E[X(t)] \to \dfrac{a}{\mu - \lambda}$ if $\lambda < \mu$.
• This indicates that in the long run, if the death rate is higher than the birth rate, the
population stabilizes into a statistical equilibrium.

Linear Growth with Immigration: Example

• Suppose a population undergoes linear growth with immigration where λ = 2, µ =


2.2, a = 3 and time is measured in weeks. The initial population is 400. Determine
the expected population size after 10 weeks.

• Expected population size after 10 weeks:
$$E[X(10)] = \frac{a}{\lambda - \mu}\left[e^{(\lambda-\mu)t} - 1\right] + ie^{(\lambda-\mu)t} = \frac{3}{2 - 2.2}\left[e^{(2-2.2)(10)} - 1\right] + 400e^{(2-2.2)(10)} = 67.104$$

• We expect the population size after 10 weeks to be about 67
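The expectation formula and the long-run limit can be evaluated in a few lines. A standard-library sketch (the helper `expected_pop` is ours):

```python
import math

lam, mu, a, i = 2.0, 2.2, 3.0, 400  # birth, death, immigration rates; X(0) = i

def expected_pop(t):
    if lam == mu:                    # exact comparison is fine for these fixed constants
        return a * t + i
    g = math.exp((lam - mu) * t)
    return a / (lam - mu) * (g - 1) + i * g

print(round(expected_pop(10), 3))  # 67.104
# Since mu > lam, the population settles toward a / (mu - lam) in the long run
print(round(a / (mu - lam), 1))    # 15.0
```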

3 Generalised Linear Models with Biological Applications
The Response Variable

• In Statistics 2A, you were introduced to the linear regression model (both sim-
ple and later multiple), which relates a dependent variable or response variable,
usually denoted yi , to one or more independent variables or explanatory vari-
ables x1 , x2 , . . . , xk and their coefficients β0 , β1 , β2 , . . . , βk (including the intercept
β0 ), which are treated as constant but unknown parameters

• Our model equation is written as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

• (Here, i represents the ith observation, where i = 1, 2, . . . , n)

• Thus, we deal with the fact that our response variable does not have an exact linear relationship with our explanatory variable(s) by including $\varepsilon_i$, a random error or disturbance

• We make certain assumptions about the $\varepsilon_i$, namely that the $\varepsilon_i$ are all independent of each other, and that they are normally distributed with a mean of 0 and a constant variance

• Since the normal distribution is a continuous distribution, this means that the yi
values are also continuous, and so if we want the linear regression model to fit our
data well, the yi values in our data should be continuous or at least approximately
continuous

• What happens if we want to model the relationship between a response variable that
is not continuous and some explanatory variables?

• Examples of such response variables could be:

◦ A binary or dichotomous variable (e.g., 'Yes' vs. 'No', 'Survived' vs. 'Died', 'Passed' vs. 'Failed', etc.)
◦ A categorical variable with more than two categories (a polychotomous vari-
able), where the measurement scale could be nominal or ordinal
◦ A count variable that counts events or objects in time or space; the values
will therefore be integers

• In this section we will discuss a class of models called Generalized Linear Mod-
els (GLMs), and a couple of special cases for modelling relationships involving a
binary response variable or a count response variable

Expected Response in the Linear Regression Model

• In order to introduce GLMs, let us consider how to generalise some of the features
of the linear regression model

• The model equation, again, is written as:

yi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik + i

• The left side of the equation, as we know, consists of the response variable yi

• The right side consists of two parts: β0 + β1 xi1 + β2 xi2 + · · · + βk xik , which we refer
to as the linear predictor (which we can call ηi ), and i , the random error

◦ Thus the model equation can also be written simply as yi = ηi + i where


ηi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik

• Now, one quantity of great interest in the model is the expected response, E (yi ),
which we can call µi

• This can be derived easily as follows for the linear regression model:

µi = E (yi ) = E (ηi + i )
= E (ηi ) + E (i )
= ηi + 0
= ηi

• The linear predictor has an expected value equal to itself because it does not contain
any random variables; the β j are all assumed to be constants and the xi j are assumed
to be fixed as well

• The random error has an expected value of 0 because this is one of the model
assumptions

• Thus, in the linear regression model, the expected response µi is equal to the linear
predictor ηi :
µi = ηi

‘Generalizing’ the Linear Regression Model

• GLMs are more general than the linear regression model in that, instead of assuming
that µi = ηi (that the linear predictor is equal to the expected response), they assume
that g(µi ) = ηi (that the linear predictor is equal to some function of the expected
response)

• Thus, to write in full,

g(µi ) = g(E (yi )) = β0 + β1 xi1 + β2 xi2 + · · · + βk xik
• The function g(·) is known as the link function because it links the linear predictor
to the expected response
• From the above equation we can see that the following is also true:
µi = E (yi ) = g−1 (ηi ) = g−1 (β0 + β1 xi1 + β2 xi2 + · · · + βk xik ), where g−1 (·) is the inverse of g(·)
• Note the linear regression model is a GLM; it is just that in this case the link function
is g(µi ) = µi
• In this section we will be considering four other GLMs that differ from linear re-
gression in two ways:
◦ The probability distribution of the response variable (conditional on the linear
predictor): in the linear regression model the distribution is assumed to be
normal, but this is not a helpful assumption when we have a binary or count
response
◦ The link function will no longer be simply g(µi ) = µi
• Specifically, we will be considering four models:
1. Binary Logistic Regression
2. Probit Regression
3. Poisson Regression
4. Negative Binomial Regression
• The first two models above have a binary response variable (they can be extended
to deal with polychotomous response variables, whether nominal or ordinal, but we
will not cover that)
• The last two models above have a count response variable
• The name of each model comes either from the link function or from the conditional
distribution of the response variable

3.1 Binary Logistic Regression


Motivational Example: Predicting Caesarean Sections
• An obstetrics researcher is interested in building a statistical model that can pre-
dict whether a mother-to-be will need a delivery by Caesarean section (C-section)
based on two explanatory variables: the woman’s age and the length of pregnancy
(gestation)
• The researcher observes a random sample of 30 pregnant women. He records their
ages in years (x1 ), gestation in weeks (x2 ), and whether or not the woman’s baby
was delivered by C-section. This is defined as a binary variable as follows:
y = 1 if the woman’s baby was delivered by C-section, and y = 0 if the woman’s
baby was not delivered by C-section
• The data collected is displayed in Table 3.1

Table 3.1: Caesarean Section Data from a Sample of n = 30 Women

yi xi1 xi2 yi xi1 xi2


0 25 40 0 25 38
1 30 40 0 33 40
1 37 40 0 29 40
0 26 38 0 37 38
0 33 36 1 31 42
0 26 39 0 16 38
1 24 42 0 22 39
0 19 40 0 30 40
1 26 41 0 21 38
0 18 42 0 17 42
1 32 40 0 18 39
1 23 40 0 23 42
1 24 42 0 28 38
0 29 38 1 31 43
1 23 42 1 38 37

The Bernoulli Distribution


• A binary random variable follows what is called a Bernoulli Distribution, which
is nothing other than a binomial distribution where the number of trials (usually
denoted n) is 1 and the probability of success is p
• The probability mass function for the Bernoulli Distribution is
Pr (Y = k) = p^k (1 − p)^(1−k) , k = 0, 1
• The expected value is E (Y) = p; thus the probability of success is also the mean of
the distribution
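The Bernoulli pmf and its mean can be verified numerically; the following is a minimal sketch (the function name is our own, not from the notes):

```python
# Bernoulli pmf: Pr(Y = k) = p^k * (1 - p)^(1 - k) for k in {0, 1}
def bernoulli_pmf(k, p):
    return p**k * (1 - p)**(1 - k)

p = 0.3
# The two probabilities sum to 1, and E(Y) = 0*Pr(Y=0) + 1*Pr(Y=1) = p
total = bernoulli_pmf(0, p) + bernoulli_pmf(1, p)
mean = 0 * bernoulli_pmf(0, p) + 1 * bernoulli_pmf(1, p)
print(total, mean)  # 1.0 0.3
```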
• Since yi in our C-section example is a binary random variable (C-section vs. no C-section),
it follows a Bernoulli Distribution
• The probability of success p in this case is the probability that a particular woman’s
delivery is done by C-section; we should rather denote it pi because it will not be
the same for each observation
• Since we want to use a GLM to predict yi , our assumption is that µi = E (yi ) = pi
is not constant but is related to the linear predictor ηi = β0 + β1 xi1 + β2 xi2 , a linear
function of the woman’s age and length of gestation
• The question is, how can we relate the expected response pi (which, again, is also
the probability that a birth is by C-section) to the linear predictor?
• A naive approach is to use the identity link function g(pi ) = pi , which means we
are effectively using a multiple linear regression model

Would Linear Regression Work with a Binary Response?

• Would it make sense to use linear regression in this case?


• If we fit the multiple linear regression model yi = β0 +β1 xi1 +β2 xi2 +i (assuming that
the expected response is related to the linear predictor as pi = ηi ) to the above data
using Ordinary Least Squares (OLS) estimation, the fitted model equation becomes

ŷ = p̂ = β̂0 + β̂1 x1 + β̂2 x2 = −6.703 + 0.03777x1 + 0.1525x2

• The multiple coefficient of determination r^2 = 0.3516, and the p-values of the


significance tests on β1 and β2 are 0.0056 and 0.0012 respectively (both < 0.05),
suggesting statistical significance: a linear relationship does exist between x1 and y
and between x2 and y
• Everything sounds good until we calculate the predicted value for each of the 30
observations and plot these p̂i on a graph (Figure 3.1)

Figure 3.1: Estimated Probabilities of C-Section from Linear Regression Model

• Although it is true that the predicted values ŷi = p̂i tend to be smaller when the
actual yi value is 0 than when the actual yi value is 1, the predicted values are not
actually 0’s and 1’s

• Hence we can instead interpret them as probabilities p̂i , but there is again a major
problem: some of the fitted p̂i values are below 0, and others are above 1, as can be
seen in Figure 3.1

• These values can obviously not be interpreted as probabilities

• So, to summarise, a major difficulty with using a linear regression model with a
binary response variable is that we cannot ensure that our predicted probabilities
are between 0 and 1

The Logit Link

• The problem we have just discussed motivates us to consider a different form of the
link function g(·) that can ensure that the predicted probabilities fall between 0 and
1

• One solution is to use a logit function as the link function:


g(p) = log(p/(1 − p)), where log is the natural logarithm (base e)

• This function is graphed over the domain p ∈ [0, 1] in Figure 3.2

Figure 3.2: Logit Function Plotted in Domain [0, 1]

• When the input to a logit function is a probability (as it is in this case), we can also
refer to the logit as the log-odds

◦ Definition: if A is an event then the odds of A is the ratio of the probability
that A occurs to the probability that A does not occur:

Odds(A) = Pr (A)/Pr (A^c ) = Pr (A)/(1 − Pr (A))

◦ Notice that if Pr (A) = 1/2, the odds will be 1. If Pr (A) > 1/2, the odds will be
> 1, and if Pr (A) < 1/2, the odds will be < 1
◦ Notice further that if Pr (A) = 0, the odds will be 0, and if Pr (A) = 1, the odds
will be infinite; thus odds will always fall in the interval [0, ∞)
◦ In the logit function, p is a probability and thus p/(1 − p) is an odds, so the
logit is the log of the odds, or log-odds

• In order to see the usefulness of the logit function we need to find its inverse:

η = log(p/(1 − p))
e^η = p/(1 − p)
p = e^η (1 − p) = e^η − e^η p
p (1 + e^η ) = e^η
p = e^η /(1 + e^η )

• Thus g−1 (η) = e^η /(1 + e^η )

• Note that this function can also be written as 1/(1 + e^(−η) ):

g−1 (η) = e^η /(1 + e^η ) × e^(−η) /e^(−η) = e^(η−η) /(e^(−η) + e^(η−η) ) = 1/(1 + e^(−η) )

• We can readily see that the values of g−1 (η) will always fall in the interval [0, 1],
so predicted probabilities generated from this model will also fall into this interval
(see Figure 3.3)

• We can observe from Figure 3.2 (or from the vertical axis of Figure 3.3, which is
basically Figure 3.2 turned on its side) that the logit function is almost a straight
line for much of its domain, for approximately p ∈ [0.2, 0.8]

• This means it is only for extreme values that a linear model with logit link function
will give very different results from a linear model with identity link function (as in
linear regression)
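The logit and its inverse, and the round-trip between them, can be sketched in a few lines of Python (the function names are our own):

```python
import math

def logit(p):
    # log-odds: log(p / (1 - p))
    return math.log(p / (1 - p))

def inv_logit(eta):
    # inverse logit: e^eta / (1 + e^eta) = 1 / (1 + e^(-eta))
    return 1 / (1 + math.exp(-eta))

# Round trip: inv_logit(logit(p)) recovers p, and inv_logit always
# maps the whole real line into [0, 1]
p = 0.8
eta = logit(p)                        # log(0.8/0.2) = log(4)
print(round(inv_logit(eta), 4))       # 0.8
print(inv_logit(-50), inv_logit(50))  # values squeezed towards 0 and 1
```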

Figure 3.3: Inverse of Logit Function

The Logistic Regression Model

• Using the link function described above, we can specify the model equation for
logistic regression as follows:
log(pi /(1 − pi )) = ηi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik , where pi = Pr (yi = 1|ηi )

• Alternatively, using the inverse of the logit function, we can specify the model
equation as
pi = 1/(1 + exp{−(β0 + β1 xi1 + β2 xi2 + · · · + βk xik )})

• Note that exp{·} is another notation for e^(·)

• For estimating the parameters of the logistic regression model β0 , β1 , . . . , βk , we


do not use Ordinary Least Squares (OLS) but rather a method called Maximum
Likelihood Estimation (MLE)

• Since this method cannot be implemented easily by hand and is quite technical, we
will omit the details; we can use statistical software to fit the model for us
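Although we leave the fitting to software, the idea behind MLE can be illustrated: the estimates are the values of (β0, β1, β2) that maximise the Bernoulli log-likelihood Σ[yi log pi + (1 − yi) log(1 − pi)]. A minimal sketch (the data is transcribed from Table 3.1 and the fitted estimates are those later reported in the SAS output; the function name is our own) shows that the fitted values give a higher log-likelihood than, say, the null values (0, 0, 0):

```python
import math

# C-section data from Table 3.1: (y, age x1, gestation x2)
data = [(0,25,40),(1,30,40),(1,37,40),(0,26,38),(0,33,36),(0,26,39),
        (1,24,42),(0,19,40),(1,26,41),(0,18,42),(1,32,40),(1,23,40),
        (1,24,42),(0,29,38),(1,23,42),(0,25,38),(0,33,40),(0,29,40),
        (0,37,38),(1,31,42),(0,16,38),(0,22,39),(0,30,40),(0,21,38),
        (0,17,42),(0,18,39),(0,23,42),(0,28,38),(1,31,43),(1,38,37)]

def log_likelihood(b0, b1, b2):
    # Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)
    ll = 0.0
    for y, x1, x2 in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))  # inverse logit
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

ll_mle = log_likelihood(-56.7017, 0.3061, 1.1933)  # fitted (ML) estimates
ll_null = log_likelihood(0, 0, 0)                  # all p_i = 0.5
print(ll_mle, ll_null)  # the ML estimates give the larger (less negative) value
```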

Predicting Caesarean Sections: Model Output and Interpretation

• SAS output from fitting the Caesarean Section data to a logistic regression model
is shown in Table 3.2

Table 3.2: Parameter Estimates from Logistic Regression Model Fit to Caesarean Section
Data

• The β̂0 , β̂1 , and β̂2 estimates respectively are shown in the Estimate column; thus
the fitted model equation is
log(p̂i /(1 − p̂i )) = −56.7017 + 0.3061x1 + 1.1933x2 or, equivalently,
p̂i = 1/(1 + exp{56.7017 − 0.3061x1 − 1.1933x2 })

• (Note the change of signs inside the exp if we are using the second equation!)

• The p-values for testing the null hypothesis H0 : β j = 0 against alternative H1 :
β j ≠ 0, j = 0, 1, 2 are shown in the Pr > ChiSq column

• Note that in this case the significance test is not a t-test but an approximate χ2 test;
for further details of the test refer to Statistics 2A notes

• How do we interpret the parameter estimates?

◦ Let us start with the intercept β̂0 . Recall that in a linear regression model,
β̂0 is the expected value of the response variable y when all the explanatory
variables are set to 0. In this case, as we can see from the fitted model equation,
β̂0 is the value of the log-odds when all the explanatory variables are set to 0:
log(p̂i /(1 − p̂i )) = β̂0 + β̂1 (0) + β̂2 (0) = β̂0

◦ We can make this interpretation a little bit more friendly by taking exp{·} of
both sides, to give us the odds rather than the log-odds
p̂i /(1 − p̂i ) = exp{β̂0 }

◦ Thus, in this case, if a woman is 0 years old and her gestation
period is 0 weeks, the expected odds of a Caesarean Section are
exp{−56.7017} = 2.3701 × 10^(−25). Of course, this makes little sense in the
context of this problem, because women do not give birth when they are 0 years
old and do not give birth after 0 weeks!

◦ The interpretations of the gradient estimates β̂1 and β̂2 are more meaningful
and useful. Recall that in a linear regression, we interpret the gradient β̂1
as follows: if x1 increases by 1 unit, the expected response ŷ increases by
β̂1 units. In this case, we interpret as follows: if x1 increases by 1 unit, the
expected log-odds of a Caesarean Section, log(p̂i /(1 − p̂i )), increases by β̂1 units.
Or, taking e to the power of both sides, we can say: if x1 increases by 1 unit, the
expected odds of a Caesarean Section, p̂i /(1 − p̂i ), increases by a factor of exp{β̂1 }
(or, if β̂1 < 0, decreases by a factor of exp{−β̂1 })
◦ Thus, in our Caesarean Section model, we interpret as follows: for every one-
year increase in a woman’s age, the odds of her having a Caesarean Section
are expected to increase by a factor of exp{0.3061} = 1.3581 (that is, by
35.81%). For every one-week increase in a woman’s gestation period, the
odds of her having a Caesarean Section are expected to increase by a factor of
exp{1.1933} = 3.2979 (that is, by more than three times, or by 329.79%)

Making Predictions from a Logistic Regression Model

• We can use a logistic regression model to make predictions: by substituting the xi1
and xi2 values of individual observations into the fitted model equation (usually the
one in the form p̂i = 1/(1 + exp{−(β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂k xk )})), we obtain predicted
probabilities that the response variable equals 1 (in this case, that the woman has a
Caesarean Section)

• For example, the first woman in the data is 25 years old (xi1 = 25) and had a 40-
week gestation (xi2 = 40), so

p̂i = 1/(1 + exp{56.7017 − 0.3061(25) − 1.1933(40)}) = 0.2113

• This probability is fairly low, so it is more likely that yi = 0 in this case (as, in fact,
it is)
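This calculation can be reproduced directly; the following is a small sketch using the fitted coefficients from Table 3.2 (the function name is our own):

```python
import math

b0, b1, b2 = -56.7017, 0.3061, 1.1933  # fitted coefficients from Table 3.2

def predict_prob(x1, x2):
    # p-hat = 1 / (1 + exp{-(b0 + b1*x1 + b2*x2)})
    return 1 / (1 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

p_hat = predict_prob(25, 40)  # first woman: age 25, gestation 40 weeks
print(round(p_hat, 4))        # 0.2113
```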

• To make predictions, we need a decision rule consisting of a threshold value for p̂,
called τ, so that ŷ = 1 if p̂ ≥ τ, and ŷ = 0 if p̂ < τ

• An obvious value of τ to choose would be 0.5, since p̂ ≥ 0.5 means that the event
y = 1 is more likely than the event y = 0 (since its probability is ≥ 0.5), while
p̂ < 0.5 means that the event y = 1 is less likely than the event y = 0

• However, we may not always want to use a threshold of τ = 0.5; sometimes we are
modelling rare events (e.g., insurance fraud) where the probability that the response
is 1 is very low; using τ = 0.5 would lead us to predict y = 0 for all observations,
which means we would never successfully predict fraud

Evaluating the Predictive Ability of a Logistic Regression Model: Sensitivity and
Specificity

• How do we know how well a logistic regression model is predicting the response
variable?

• For starters, we should not really evaluate a predictive model using the same data
that we used to fit the model (which is called training data); we should actually
obtain additional data that was not used to fit the model and make predictions on
this data (which is called test data)

◦ If you continue your studies to the Advanced Diploma, you will learn about
a more advanced way of splitting the available data set into training and test
sets, known as cross-validation
◦ You will also learn many other classification methods that, like logistic re-
gression, are designed to ‘classify,’ that is, to predict the values of a binary
response

• However, for simplicity’s sake let us consider the problem of evaluating predictions
on our training data in the Caesarean Section example

• To do so, we first need to define the concepts of sensitivity and specificity

◦ Sensitivity refers to the ability of a predictive model to correctly predict ‘suc-
cesses’ or ‘positives’, i.e. cases where y = 1. It can be defined as

Sensitivity = (# of positives predicted as positives)/(total # of positives)
            = Pr (y = 1 ∩ ŷ = 1)/Pr (y = 1)

◦ Specificity refers to the ability of a predictive model to correctly predict ‘fail-
ures’ or ‘negatives’, i.e. cases where y = 0. It can be defined as

Specificity = (# of negatives predicted as negatives)/(total # of negatives)
            = Pr (y = 0 ∩ ŷ = 0)/Pr (y = 0)

◦ The Accuracy of a model is the overall proportion of cases that were correctly
predicted, regardless of class:

Accuracy = (# of negatives predicted as negatives + # of positives predicted as positives)/(total # of cases predicted)
         = Pr ((y = 0 ∩ ŷ = 0) ∪ (y = 1 ∩ ŷ = 1))

The proportion 1 − Accuracy is known as the Misclassification Rate


◦ A prediction should be made on each observation (ŷi determined using the
threshold value) and then compared to the observed yi value to classify the
observation as either a true positive (TP), true negative (TN), false positive
(FP), or false negative (FN)

◦ This is done for the Caesarean Section data (using a threshold of τ = 0.5) in
Table 3.3
Table 3.3: Making and Classifying Predictions on Caesarean Section Data with Threshold
τ = 0.5
Observation xi1 xi2 yi p̂i ŷi Result
1 25 40 0 0.2116 0 True Negative
2 30 40 1 0.5536 1 True Positive
3 37 40 1 0.9136 1 True Positive
4 26 38 0 0.0324 0 True Negative
5 33 36 0 0.0256 0 True Negative
6 26 39 0 0.0995 0 True Negative
7 24 42 1 0.6825 1 True Positive
8 19 40 0 0.041 0 True Negative
9 26 41 1 0.5459 1 True Positive
10 18 42 0 0.2551 0 True Negative
11 32 40 1 0.6958 1 True Positive
12 23 40 1 0.127 0 False Negative
13 24 42 1 0.6825 1 True Positive
14 29 38 0 0.0774 0 True Negative
15 23 42 1 0.6128 1 True Positive
16 25 38 0 0.0241 0 True Negative
17 33 40 0 0.7565 1 False Positive
18 29 40 0 0.4773 0 True Negative
19 37 38 0 0.4928 0 True Negative
20 31 42 1 0.9482 1 True Positive
21 16 38 0 0.0016 0 True Negative
22 22 39 0 0.0315 0 True Negative
23 30 40 0 0.5536 1 False Positive
24 21 38 0 0.0072 0 True Negative
25 17 42 0 0.2014 0 True Negative
26 18 39 0 0.0095 0 True Negative
27 23 42 0 0.6128 1 False Positive
28 28 38 0 0.0582 0 True Negative
29 31 43 1 0.9837 1 True Positive
30 38 37 1 0.2858 0 False Negative

◦ In order to calculate the empirical sensitivity and specificity it is convenient to


organise the results into a contingency table (two-way frequency table) as in
Table 3.4. Such a table that displays the actual vs. predicted frequencies for a
classification model is known as a Confusion Matrix
Table 3.4: Confusion Matrix for Caesarean Section Logistic Regression Model

Actual Condition
Positive(yi = 1) Negative(yi = 0) Total
Predicted Positive (ŷi = 1) 9 (TP) 3 (FP) 12
Condition Negative(ŷi = 0) 2 (FN) 16 (TN) 18
Total 11 19 30

◦ From Table 3.4, we can see that the sensitivity is 9/11 = 0.8182 (9 out of 11
actual positives were predicted as positives) and the specificity is 16/19 = 0.8421
(16 out of 19 actual negatives were predicted as negatives)
◦ Note that sensitivity is equivalent to the concept of ‘power’ in hypothesis test-
ing
◦ Note that, in addition to sensitivity and specificity, we can calculate the em-
pirical Type I error and Type II error rates as follows:

Type I Error Rate = (# of false positives)/(total # of negatives) = 1 − Specificity
Type II Error Rate = (# of false negatives)/(total # of positives) = 1 − Sensitivity

◦ Thus, in this case the type I error rate is 1 − 16/19 = 3/19 = 0.1579 and the type II
error rate is 1 − 9/11 = 2/11 = 0.1818
◦ Similarly, we can compute the model accuracy:

Accuracy = (# of true positives + # of true negatives)/(# of cases predicted)
         = (9 + 16)/30 = 0.8333

And thus the misclassification rate is 1 − Accuracy = 1 − 0.8333 = 0.1667
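These confusion-matrix calculations can be scripted; a sketch using the counts from Table 3.4 (variable names are our own):

```python
# Counts from Table 3.4 (threshold tau = 0.5)
TP, FP, FN, TN = 9, 3, 2, 16

sensitivity = TP / (TP + FN)                 # 9/11  = 0.8182
specificity = TN / (TN + FP)                 # 16/19 = 0.8421
type1 = 1 - specificity                      # 3/19  = 0.1579
type2 = 1 - sensitivity                      # 2/11  = 0.1818
accuracy = (TP + TN) / (TP + FP + FN + TN)   # 25/30 = 0.8333
misclassification = 1 - accuracy             # 0.1667
print(round(sensitivity, 4), round(specificity, 4), round(accuracy, 4))
```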

Evaluating the Predictive Ability of a Logistic Regression Model: ROC Curve

• The sensitivity and specificity values above were calculated for one specific choice
of the threshold τ

• Sensitivity and specificity are inversely related (like type I and type II error), so that
when one increases the other decreases

• Consider two extreme cases:

◦ Extreme Case 1: we set our threshold to be τ = 1. This means we will predict


ŷi = 0 for all observations. The result is that all our actual negatives will be
predicted as negatives (specificity= 1) while all our actual positives will also
be predicted as negatives (sensitivity= 0)
◦ For instance, in our Caesarean Section example, if we set τ = 1, we will have
11 false negatives and 19 true negatives
◦ Extreme Case 2: we set our threshold to be τ = 0. This means we will predict
ŷi = 1 for all observations. The result is that all our actual positives will be
predicted as positives (sensitivity= 1) while all our actual negatives will also
be predicted as positives (specificity= 0)
◦ For instance, in our Caesarean Section example, if we set τ = 0, we will have
11 true positives and 19 false positives

• By cycling through various τ values from 0 to 1, we can produce a table of sensi-


tivity and specificity values for various thresholds

• We can then create a graph with 1 − specificity (essentially type I error rate) on the
horizontal axis and sensitivity (essentially power) on the vertical axis

• We can also include a dashed line through the origin with gradient 1 on the graph;
this corresponds to a baseline model where our predictions are completely random
(e.g., we generate a random number between 0 and 1 and use this as our p̂i )
◦ Can you see why, in such cases, sensitivity and 1−specificity will tend to be
equal?
• This graph is called a Receiver Operating Characteristic (ROC) Curve, and
gives us an idea of how good our predictive model is overall
• Basically, the closer to the top left corner of the graph the curve is, the better the
predictive model
• The closer to the 45◦ line the curve is, the worse the predictive model (i.e., the more
similar it is to a completely random prediction)
• The ROC curve for the Caesarean Section logistic regression model is shown in
Figure 3.4
Figure 3.4: Receiver Operating Characteristic Curve for Logistic Regression Model Fit to
Caesarean Section Data

• We can see that this particular model performs reasonably well; the ROC curve is
quite high above that of the baseline random prediction model

• The reason why the curve has steps instead of being smooth is that the sensitivity
and specificity values only change whenever we overtake one of the p̂i values for
the observations in our data. If we had a very large and diverse data set, the ROC
curve would be more smooth

Logistic Regression with Binary Explanatory Variables

• Just as the response variable in binary logistic regression is binary, we can also have
explanatory variables that are binary rather than continuous
• Consider a medical researcher who is investigating the effectiveness of a certain
treatment for treating male and female patients with a certain skin condition
• Table 3.5 displays data collected on sixteen patients, half of whom were given a new
experimental treatment and the other half received the standard treatment. After 6
months it was recorded whether each patient’s skin condition had healed. We define
the following variables:
Let yi = 1 if the ith patient had recovered from the skin condition after 6 months,
and yi = 0 if the ith patient had not recovered
Let xi1 = 1 if the ith patient received the new experimental treatment, and xi1 = 0
if the ith patient received the standard treatment
Let xi2 = 1 if the ith patient was male, and xi2 = 0 if the ith patient was female

Table 3.5: Data on Patient Recovery after Treatment for Skin Condition

Observation # yi xi1 xi2 Observation # yi xi1 xi2


1 0 1 1 9 0 1 1
2 0 0 1 10 0 0 1
3 1 1 1 11 1 1 1
4 0 0 1 12 0 0 1
5 1 1 0 13 0 1 0
6 1 0 0 14 0 0 0
7 1 1 0 15 1 1 0
8 0 0 0 16 0 0 0

• A logistic regression model is fit to the data using Maximum Likelihood Estimation.
The SAS output is shown in Table 3.6
• Now, we can see that the β1 and β2 coefficients are not statistically significant in
this case at 5% significance level (their p-values are above 0.05), but let us proceed
anyway with interpreting the parameter estimates and making predictions
• The interpretation of the parameter estimates changes because it no longer makes
sense to speak of a ‘one-unit increase’ in x1 or x2 since they are categorical variables

Table 3.6: Parameter Estimates from Logistic Regression Model Fit to Skin Condition
Data

• Instead we make a comparison between the ‘1’ category and the ‘0’ category of the
variables

◦ In the case of the intercept estimate β̂0 = −1.3815, this corresponds to the
expected log-odds of a patient having recovered from the skin condition given
that the patient was on the standard treatment (x1 = 0) and that the patient was
female (x2 = 0)
◦ Taking exp{−1.3815} = 0.2512, we can say that the estimated odds of a female
patient who was on the standard treatment having recovered from the skin
condition are 0.2512
◦ Coming to the gradients, β̂1 = 2.7630 is the estimated difference in the log-
odds of recovery between a patient on the new experimental treatment and a
patient on the standard treatment. A simpler interpretation is possible if
we take exp{β̂1 } = exp{2.7630} = 15.8473. This is an odds ratio: the odds of
recovery for a patient on the new experimental treatment are estimated to be
15.8473 times the odds of recovery for a patient on the standard treatment
◦ Similarly, exp{β̂2 } = exp{−1.5791} = 0.2062 means that the odds of recovery
for a male patient are estimated to be 0.2062 times the odds of recovery for
a female patient. Or, we can change the sign to exp{1.5791} = 4.8506 and
change the comparison around: the odds of recovery for a female patient are
estimated to be 4.8506 times the odds of recovery for a male patient
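These odds-ratio interpretations can be checked numerically; a sketch using the estimates from Table 3.6 (variable names are our own):

```python
import math

b0, b1, b2 = -1.3815, 2.7630, -1.5791  # estimates from Table 3.6

odds_baseline = math.exp(b0)  # odds of recovery: female, standard treatment (0.2512)
or_treatment = math.exp(b1)   # new treatment vs standard (15.8473)
or_male = math.exp(b2)        # male vs female (0.2062)
or_female = math.exp(-b2)     # flipping the comparison around (4.8506)
print(round(odds_baseline, 4), round(or_treatment, 4),
      round(or_male, 4), round(or_female, 4))
```

Note that the two sex odds ratios are reciprocals of each other, which is why changing the sign of the coefficient simply reverses the comparison.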

• Table 3.7 gives the predicted probabilities calculated from the equation
p̂i = 1/(1 + exp{−(−1.3815 + 2.7630xi1 − 1.5791xi2 )})

• This time, there are only four different combinations of xi1 , xi2 values and thus only
four different p̂i values, which makes it feasible (although still somewhat time-
consuming) to work out the ROC Curve by hand

• The workings are shown in Table 3.8, and these give us the table of values in
Table 3.9 that we use to create the ROC Curve in Figure 3.5

Table 3.7: Predicted Probabilities of Recovery for Skin Condition Patients from Logistic
Regression Model

Observation # xi1 xi2 yi p̂i


1 1 1 0 0.4508
2 0 1 0 0.0492
3 1 1 1 0.4508
4 0 1 0 0.0492
5 1 0 1 0.7992
6 0 0 1 0.2008
7 1 0 1 0.7992
8 0 0 0 0.2008
9 1 1 0 0.4508
10 0 1 0 0.0492
11 1 1 1 0.4508
12 0 1 0 0.0492
13 1 0 0 0.7992
14 0 0 0 0.2008
15 1 0 1 0.7992
16 0 0 0 0.2008

Table 3.8: ROC Curve Workings for Skin Condition Logistic Regression Model

                τ = 0       τ = 0.0492  τ = 0.2008  τ = 0.4508  τ = 0.7992  τ = 1
yi   p̂i        ŷi Result   ŷi Result   ŷi Result   ŷi Result   ŷi Result   ŷi Result
0    0.4508    1  FP       1  FP       1  FP       0  TN       0  TN       0  TN
0    0.0492    1  FP       0  TN       0  TN       0  TN       0  TN       0  TN
1    0.4508    1  TP       1  TP       1  TP       0  FN       0  FN       0  FN
0    0.0492    1  FP       0  TN       0  TN       0  TN       0  TN       0  TN
1    0.7992    1  TP       1  TP       1  TP       1  TP       1  TP       0  FN
1    0.2008    1  TP       1  TP       0  FN       0  FN       0  FN       0  FN
1    0.7992    1  TP       1  TP       1  TP       1  TP       1  TP       0  FN
0    0.2008    1  FP       1  FP       0  TN       0  TN       0  TN       0  TN
0    0.4508    1  FP       1  FP       1  FP       0  TN       0  TN       0  TN
0    0.0492    1  FP       0  TN       0  TN       0  TN       0  TN       0  TN
1    0.4508    1  TP       1  TP       1  TP       0  FN       0  FN       0  FN
0    0.0492    1  FP       0  TN       0  TN       0  TN       0  TN       0  TN
0    0.7992    1  FP       1  FP       1  FP       1  FP       1  FP       0  TN
0    0.2008    1  FP       1  FP       0  TN       0  TN       0  TN       0  TN
1    0.7992    1  TP       1  TP       1  TP       1  TP       1  TP       0  FN
0    0.2008    1  FP       1  FP       0  TN       0  TN       0  TN       0  TN

Sensitivity (#TP/(#TP + #FN)):
  6/6 = 1     6/6 = 1     5/6 = 0.8333  3/6 = 0.5   3/6 = 0.5   0/6 = 0
1−Specificity (#FP/(#TN + #FP)):
  10/10 = 1   6/10 = 0.6  3/10 = 0.3    1/10 = 0.1  1/10 = 0.1  0/10 = 0

Table 3.9: Sensitivity and 1−Specificity Values for ROC Curve
of Logistic Regression Model of Recovery from Skin Condition

1−Specificity Sensitivity
0 0
0.1 0.5
0.1 0.5
0.3 0.8333
0.6 1
1 1
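The points in Table 3.9 can be generated by sweeping thresholds over the distinct p̂ values; the sketch below (with variable names of our own) uses the rule ŷ = 1 when p̂ ≥ τ, taking each distinct p̂ value as a threshold plus one value above the maximum, which reproduces the curve's points:

```python
# (y, p-hat) pairs for the 16 patients, transcribed from Table 3.7
pairs = [(0,0.4508),(0,0.0492),(1,0.4508),(0,0.0492),(1,0.7992),(1,0.2008),
         (1,0.7992),(0,0.2008),(0,0.4508),(0,0.0492),(1,0.4508),(0,0.0492),
         (0,0.7992),(0,0.2008),(1,0.7992),(0,0.2008)]

P = sum(y for y, _ in pairs)   # total actual positives (6)
N = len(pairs) - P             # total actual negatives (10)

roc = []
thresholds = sorted({p for _, p in pairs}) + [1.1]  # 1.1 forces all y-hat = 0
for tau in thresholds:
    tp = sum(1 for y, p in pairs if y == 1 and p >= tau)
    fp = sum(1 for y, p in pairs if y == 0 and p >= tau)
    roc.append((round(fp / N, 4), round(tp / P, 4)))  # (1-specificity, sensitivity)

print(roc)  # [(1.0, 1.0), (0.6, 1.0), (0.3, 0.8333), (0.1, 0.5), (0.0, 0.0)]
```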

Figure 3.5: ROC Curve for Logistic Regression Model Fit to Skin Condition Recovery
Data

3.2 Probit Regression


The Probit Link

• Like binary logistic regression, probit regression is used when there is a binary
response variable
• Remember that in binary-response regression models our µi = E (yi ) is a probability

pi (since E (yi ) in a Bernoulli Distribution is pi , the probability of ‘success’ in the
single trial)

• This means we need a link function that is always going to produce values of p̂i
between 0 and 1
◦ The logit link function g(p) = log(p/(1 − p)) used in logistic regression is one
such link function; the identity link function g(p) = p that is used in linear
regression is not

• In probit regression we use as our link function the inverse of a cumulative dis-
tribution function of a probability distribution, most commonly of the standard
normal distribution

• The cumulative distribution function of the standard normal distribution is given by

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} exp{−y^2 /2} dy

• This function gives the area under the standard normal probability density function
between −∞ and z; in other words it gives us Pr (Z < z) from the standard normal
distribution for any value z, as pictured in Figure 3.6

Figure 3.6: Graph Showing Cumulative Distribution Function Φ(z) for Standard Normal
Distribution for a particular value of z

Figure 3.7: Graph Showing Cumulative Distribution Function Φ(z) for Standard Normal
Distribution

• The graph of Φ(z) itself is shown in Figure 3.7; it is clear that this function will
always give us values between 0 and 1

• Note that Φ(z) is precisely the function whose values we look up in the ‘Z Table’
(Table 4.3) when we want to find probabilities from the standard normal distribution

• Now, recall that the general form of a GLM is g(µi ) = ηi where (in this case) µi = pi
and where ηi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik

• It is the values of pi that we need to be between 0 and 1, and pi is the input argument
of the function g(·) rather than the output; this means that using g(·) = Φ(·) as the
link function does not guarantee that pi is between 0 and 1 (since the domain of
Φ(·) is from −∞ to ∞); it is the range of Φ(·) that is from 0 to 1.

• Thus our link function will actually be the inverse of Φ(·): g(pi ) = Φ−1 (pi ) = ηi ,
which is equivalent to

pi = Φ(ηi ) = Φ(β0 + β1 xi1 + β2 xi2 + · · · + βk xik )

• Finding the inverse of Φ(·) is essentially what we do when we do a ‘reverse lookup’


in our Z Table (Table 4.3): the function’s input is the probability and the function’s
output is the z score
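Values of Φ(z) need not come from the Z Table; they can be computed via the error function in Python's standard library (a minimal sketch; the function name is our own):

```python
import math

def Phi(z):
    # Standard normal CDF via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(Phi(0), 4))     # 0.5 (half the area lies below the mean)
print(round(Phi(1.96), 4))  # 0.975, the familiar Z-table value
```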

Probit Regression: Application in Toxicology

• Technically, any data set that is appropriate for logistic regression is also appropriate
for probit regression; the two are basically competing alternatives that differ only
in their link function

• However, one application in which probit regression is widely used is for estimating
the toxicity of chemical substances to living organisms

• The typical scenario is that one administers different dosage levels of the toxin to a
random sample of different organisms, and monitors a binary response of death vs.
no death, i.e. yi = 1 if the ith organism died, and yi = 0 if the ith organism did not die

• We will explore this by means of an example after first considering two important
concepts from toxicology

LD50 and LC50

• Two important quantities used to measure the toxicity of a chemical substance are
the LD50 and LC50

• The ‘LD’ in LD50 stands for ‘Lethal Dose’, while the ‘LC’ in LC50 stands for
‘Lethal Concentration’

• LD50 refers to the dose level of an administered chemical substance that would be
lethal for 50% of organisms in a population, in other words, that would have a 50%
probability of killing any randomly selected individual

• LC50 has essentially the same meaning but is used for chemical concentration level
in the air rather than a dose of a substance that is administered orally or intra-
venously

• In a probit regression where x1 is the dosage or concentration level of the toxin


and y is the binary death/no death response defined above, the LD50 or LC50 can be
determined by substituting pi = 0.5 into the model equation and solving for x1 , as
follows:

Φ−1 (pi ) = β0 + β1 xi1
Φ−1 (0.5) = β0 + β1 xi1
0 = β0 + β1 xi1 (since 0 is the z score that gives Φ(z) = 0.5)
β1 xi1 = −β0
xi1 = −β0 /β1
LD50 (or LC50 ) = −β0 /β1

• Of course, since β0 and β1 are unknown parameters, we estimate them with their
Maximum Likelihood Estimators, β̂0 and β̂1 ; thus our estimated LD50 or LC50 is
calculated as
LD̂50 (or LĈ50 ) = −β̂0 /β̂1

Probit Regression Example: Toxicity of Picloram to the Larkspur Plant

• Picloram is a herbicide (plant-killing toxin). A plant toxicologist administers dose levels of 0, 1.1, 2.2, and 4.5 kilograms per hectare (kg/ha) to a random sample of 313 larkspur plants, and the response (whether the plant died [yi = 1] or lived [yi = 0]) is recorded

• The data set is too large to display in full, but a few observations are shown for illustration in Table 3.10

Table 3.10: Toxicology Data on Picloram Dosage Administered to Larkspur Plants

xi1 (Dose) yi (Died)


1.1 0
4.5 1
2.2 1
0 0
2.2 1
2.2 1
0 0
0 0
0 0
0 0
2.2 1
2.2 1
2.2 0
4.5 1
1.1 1
...        ...

• As with logistic regression, we use the Maximum Likelihood Method to obtain estimates of the parameters

• SAS output from this model is displayed in Table 3.11

• We can see from this that the fitted model equation is

p̂ = Φ(β̂0 + β̂1 x1) = Φ(−2.1011 + 1.5577 x1)

Table 3.11: Parameter Estimates from Probit Regression Model Fit to Picloram Data

• The p-value of the Dose coefficient (β1 ) is < .0001 so we can see that there is a
statistically significant relationship between Dose and the response variable (Death
vs. No Death).

• Interpretation of Parameter Estimates

◦ β̂0 = −2.1011 corresponds to the estimated value of the linear predictor ηi when no dose of toxin is administered (xi1 = 0). Thus, when xi1 = 0, the estimated probability of death is p̂i = Φ(−2.1011) ≈ 1 − Φ(2.10) = 1 − 0.9821 = 0.0179 (the value of Φ(2.10) was taken from the Z Table)
◦ β̂1 = 1.5577 does not have as intuitive an interpretation as the gradient estimates from linear or logistic regression, but can be explained as follows: if the dosage x1 increases by 1 unit, the inverse-probit function Φ⁻¹(p̂) is expected to increase by 1.5577 units

• A more intuitive way to interpret the model output is to calculate the estimated LD50:

LD̂50 = −β̂0/β̂1 = −(−2.1011)/1.5577 = 1.349

• Thus we estimate that the Picloram dosage level that will kill about half of larkspur plants is 1.349 kg/ha
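The notes use SAS for model fitting, but the fitted values above are easy to verify by hand. A minimal Python sketch (standard library only; Φ is computed from the error function) that reproduces the estimated LD50 and the predicted probabilities implied by the Table 3.11 estimates:

```python
import math

def phi(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Parameter estimates from the fitted probit model (Table 3.11)
b0, b1 = -2.1011, 1.5577

# Predicted probability of death at each dose level (compare Table 3.12)
for dose in [0, 1.1, 2.2, 4.5]:
    print(dose, round(phi(b0 + b1 * dose), 4))

# Estimated LD50 = -b0/b1, about 1.349 kg/ha
ld50 = -b0 / b1
print(ld50)
```

Small differences from tabulated values arise because the Z Table rounds z to two decimal places, while the code evaluates Φ exactly.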

Predicting Using Probit Regression Model

• We can again make predictions using the p̂i values from a probit regression model
and a threshold value τ, and then calculate sensitivity and specificity and construct
an ROC Curve.

• In our picloram example, there are only four distinct dosage levels in the data, so
we will only have four distinct p̂ values (Table 3.12)

Wald Inference on Parameters in Generalised Linear Models

• We have been referring to p-values on model parameters βj and interpreting them in our SAS output from logistic and probit regression models

• It is worthwhile making a brief comment on the theory behind these significance tests, which are based not on a t-test as in linear regression but rather on an (approximate) χ² test

Table 3.12: Predicted Probabilities of Death of Larkspur Plants by Picloram Herbicide Dosage, Using Probit Regression

x1 (Dose) η̂ (Linear Predictor) p̂ (Predicted Prob. of Death)


0 -2.1011 0.0179
1.1 -0.3876 0.3483
2.2 1.3259 0.9082
4.5 4.9087 1.0000

• Suppose we want to test the null hypothesis H0 : βj = 0 against a two-tailed alternative H1 : βj ≠ 0

• Under H0, the Wald statistic χ² = β̂j² / Var(β̂j) ∼ χ²₁
• Var(β̂j) is the jth diagonal element of the matrix (X⊤V X)⁻¹, where:

◦ X is an n × (k + 1) matrix containing all 1's in the first column, x11, x21, . . . , xn1 in the second column, x12, x22, . . . , xn2 in the third column, and so on until x1k, x2k, . . . , xnk in the (k + 1)th column
◦ V is a diagonal matrix whose diagonal elements are Var (y1 |η) , Var (y2 |η) , . . . , Var (yn |η)
◦ Where the yi |ηi are binary variables (following a Bernoulli distribution with
probability parameter pi ), Var (yi |ηi ) = pi (1 − pi ) (the variance of a Bernoulli-
distributed random variable with probability parameter pi )
◦ Our estimate of V is V̂, where we replace each pi with p̂i = 1/(1 + e^{−(β̂0 + β̂1 xi1 + β̂2 xi2 + ··· + β̂k xik)}):

V̂ = diag( p̂1(1 − p̂1), p̂2(1 − p̂2), . . . , p̂n(1 − p̂n) )

• Thus the test statistic may be expressed as:

χ² = β̂j² / [(X⊤V̂ X)⁻¹]jj

• In this module you will not be asked to calculate this test statistic by hand, but it is
helpful for you to know where it comes from
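To make the matrix recipe concrete, here is a sketch in Python (standard library only) that computes the Wald χ² for the slope of a single-predictor logistic model. The data and coefficient values are invented purely for illustration, and since (X⊤V̂X) is only 2 × 2 here, it is inverted with the adjugate formula:

```python
import math

# Hypothetical data and coefficient estimates (illustrative only, not a real fit)
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
b0, b1 = -2.0, 1.2

# Fitted probabilities and the diagonal of V-hat: p_i(1 - p_i)
p_hat = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
w = [p * (1 - p) for p in p_hat]

# A = X^T V-hat X for X = [1 | x] is 2x2, so invert via the adjugate
a11 = sum(w)
a12 = sum(wi * xi for wi, xi in zip(w, x))
a22 = sum(wi * xi * xi for wi, xi in zip(w, x))
det = a11 * a22 - a12 * a12

var_b1 = a11 / det                        # slope's diagonal entry of (X^T V-hat X)^{-1}
chi2 = b1 ** 2 / var_b1                   # Wald statistic
p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square with 1 df
print(chi2, p_value)
```

The p-value line uses the identity Pr(χ²₁ > x) = 2(1 − Φ(√x)) = erfc(√(x/2)), which avoids needing a chi-square routine.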

3.3 Poisson Regression
Poisson Regression Model

• Recall that the Poisson distribution can be used to model probabilities of how many
events occur in a particular interval of time (or a particular space), provided that
they occur with a fixed average rate λ
• If y is a Poisson-distributed random variable with rate parameter λ, that is, if y ∼ Poisson(λ), then Pr(y = k) = e^{−λ} λ^k / k! for k = 0, 1, 2, 3, . . .
• y might be, for example, the number of taxis that arrive at a taxi rank in a certain
period of time, or the number of customers that arrive in a queue in a certain period
of time, or the number of mutations that occur in a certain amount of genetic code
• If we are interested in analysing the relationship between such a ‘count’ response
variable and one or more explanatory variables, we can use a Poisson regression
model, which is a type of GLM
• Our mean in this case is µi = E(yi) = λi, and our link function is g(λ) = log λ (where log is the natural logarithm)
• Thus g(µi ) = ηi in this case yields the model equation

log λi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik

• By taking the inverse of the link function we can also express the model in terms of λi:

log λ = η
λ = exp{η}

• This also shows the reason for our choice of link function: it ensures that λ > 0
(which is a requirement of the Poisson distribution)
• As with other GLMs, the Method of Maximum Likelihood is used to estimate the
parameters

Poisson Regression Model Example: Elephant Poaching in Zambia

• Table 3.13 provides data on the number of elephant poaching incidents per year
in the Central Luangwa Valley, Zambia, from 1988 to 1995, together with two
explanatory variables, which are law enforcement expenditure in thousands of US
dollars per km2 (x1 ) and number of bonus claims paid to anti-poaching scouts in
thousands (x2 )
• A Poisson regression model is fit to the data and SAS output is displayed in Table
3.14
• Thus the fitted regression equation is

λ̂ = exp{β̂0 + β̂1 x1 + β̂2 x2} = exp{3.4453 − 9.5224 x1 − 0.0627 x2}

Table 3.13: Elephant Poaching Incidents in Central Luangwa Valley Per Year, 1988-1995

Year   yi (No. of Elephants Killed)   xi1 (Law Enforcement Expenditure, Thousands of US$ per km²)   xi2 (Bonus Claims Paid, Thousands)
1988   39   0.00311    0.054
1989   16   0.02178    0.372
1990   27   0.04884    1.189
1991   16   0.05147    2.692
1992    7   0.04766   22.537
1993    9   0.03141    9.823
1994   12   0.03698    3.483
1995   23   0.02373    0.557

Table 3.14: Parameter Estimates from Poisson Regression Model Fit to Elephant Poach-
ing Data

• We can see from the p-values in the last column that β2 is statistically significant at the 5% level while β1 is not

• Interpretation of the parameter estimates proceeds as follows:

◦ Intercept: exp{β̂0} = exp{3.4453} = 31.3527 is the expected number of elephants poached in a year when there is no law enforcement expenditure and there are no bonus claims paid to scouts
◦ Gradients: exp{β̂1} = exp{−9.5224} = 7.319 × 10⁻⁵; if law enforcement expenditure increases by $1000 per km², the expected number of elephants poached changes by a factor of 7.319 × 10⁻⁵; that is, it decreases by a factor of exp{9.5224} = 1.3662 × 10⁴. exp{β̂2} = exp{−0.0627} = 0.9392; if the number of bonus claims paid out increases by 1000, the expected number of elephants poached changes by a factor of 0.9392; that is, it decreases by about 6.1%

• We can also make predictions from this model of the expected number of elephants poached under different values of the explanatory variables. For instance, suppose that in a certain year the law enforcement expenditure is US$50 per km² and that 3000 bonus claims are paid out to scouts; what is the expected number of elephants
killed? (Note: be careful of units!)

λ̂ = exp{3.4453 − 9.5224(0.05) − 0.0627(3)} = 16.136

• In this year we would expect about 16 elephants to be poached in the valley
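The arithmetic of this prediction (minding the units: both expenditure and claims enter in thousands) can be checked with a few lines of Python:

```python
import math

# Estimates from Table 3.14
b0, b1, b2 = 3.4453, -9.5224, -0.0627

# US$50 per km^2 -> x1 = 0.05 (thousands of US$); 3000 claims -> x2 = 3 (thousands)
lam_hat = math.exp(b0 + b1 * 0.05 + b2 * 3)
print(lam_hat)  # about 16.14 expected poaching incidents
```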

Interpreting Parameter Estimates in GLMs

• We have seen that the interpretation of the β̂ j parameter estimates is different for
different GLMs (linear regression, logistic regression, probit regression, Poisson
regression)

• This can be confusing, so it helps to have an intuitive method for deriving the interpretation instead of just memorising it

• In all of these models, the interpretation of the gradient estimates concerns what
happens when a particular explanatory variable x j increases by one unit

• For simplicity let us assume that there is only one explanatory variable x1

• We can derive the interpretation by comparing the estimate of the expected response µ̂i when the explanatory variable value is xi1 to the estimate µ̂*i when the explanatory variable value is xi1 + 1

◦ Linear regression case:

ŷi = β̂0 + β̂1 xi1
ŷ*i = β̂0 + β̂1(xi1 + 1) = β̂0 + β̂1 xi1 + β̂1
ŷ*i − ŷi = β̂0 + β̂1 xi1 + β̂1 − [β̂0 + β̂1 xi1] = β̂1

Thus we can interpret β̂1 in the linear regression model as the change in ŷi when xi1 increases by one unit; that is, β̂1 is the estimated change in the response when xi1 increases by one unit
◦ Logistic regression case:

log[p̂i/(1 − p̂i)] = β̂0 + β̂1 xi1
log[p̂*i/(1 − p̂*i)] = β̂0 + β̂1(xi1 + 1)
log[p̂*i/(1 − p̂*i)] − log[p̂i/(1 − p̂i)] = β̂0 + β̂1(xi1 + 1) − [β̂0 + β̂1 xi1] = β̂1

Thus β̂1 can be interpreted as the change in the expected log-odds of the event yi = 1 when xi1 increases by 1. Or:

p̂i/(1 − p̂i) = exp{β̂0 + β̂1 xi1}
p̂*i/(1 − p̂*i) = exp{β̂0 + β̂1(xi1 + 1)}
[p̂*i/(1 − p̂*i)] / [p̂i/(1 − p̂i)] = exp{β̂0 + β̂1(xi1 + 1)} / exp{β̂0 + β̂1 xi1}
  = exp{β̂0 + β̂1(xi1 + 1) − [β̂0 + β̂1 xi1]}
  = exp{β̂1}

Thus exp{β̂1} can be interpreted as the multiplicative change in the odds of the event yi = 1 when xi1 increases by 1
◦ Poisson regression case:

log λ̂i = β̂0 + β̂1 xi1
log λ̂*i = β̂0 + β̂1(xi1 + 1)
log λ̂*i − log λ̂i = β̂0 + β̂1(xi1 + 1) − [β̂0 + β̂1 xi1] = β̂1

Thus β̂1 can be interpreted as the change in the log of the expected response when xi1 increases by 1. Or:

λ̂i = exp{β̂0 + β̂1 xi1}
λ̂*i = exp{β̂0 + β̂1(xi1 + 1)}
λ̂*i/λ̂i = exp{β̂0 + β̂1(xi1 + 1)} / exp{β̂0 + β̂1 xi1}
  = exp{β̂0 + β̂1(xi1 + 1) − [β̂0 + β̂1 xi1]}
  = exp{β̂1}

Thus exp{β̂1} can be interpreted as the multiplicative change in the expected response λ̂i when xi1 increases by 1
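A quick numeric check of the last identity: the ratio of fitted Poisson means at x1 + 1 and at x1 equals exp{β̂1} no matter which x1 we start from. A short sketch with illustrative coefficient values:

```python
import math

b0, b1 = 0.8, -0.35   # illustrative values, not from any fitted model

for x in [0.0, 1.7, 10.0]:
    # Ratio of fitted means one unit apart; equals exp(b1) regardless of x
    ratio = math.exp(b0 + b1 * (x + 1)) / math.exp(b0 + b1 * x)
    print(x, ratio, math.exp(b1))
```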

Wald Inference on Parameters in Poisson Regression

• The method for testing the significance of an individual parameter βj in Poisson regression is the same as in logistic or probit regression, except that the diagonal elements of the V matrix discussed earlier are λ1, λ2, . . . , λn rather than p1(1 − p1), p2(1 − p2), . . . , pn(1 − pn), since the variance of a Poisson-distributed random variable with rate parameter λi is λi (the mean and variance of a Poisson random variable are equal)

• Our estimate of V is V̂, where we replace each λi with λ̂i = exp{xi⊤β̂}:

V̂ = diag( λ̂1, λ̂2, . . . , λ̂n )

• Our test statistic, with formula as follows, still follows a χ²₁ distribution under the null hypothesis βj = 0:

χ² = β̂j² / [(X⊤V̂ X)⁻¹]jj

Goodness of Fit in Poisson Regression

• We cannot classify our predicted values in Poisson regression as True Positives, True Negatives, False Positives, and False Negatives, and thus cannot compute sensitivity or specificity or an ROC Curve, because we no longer have a binary response; the response variable can take on any non-negative integer value

• In GLMs there is no r² 'multiple coefficient of determination' statistic to use in measuring goodness of fit, though some statisticians have developed so-called 'pseudo-r²' statistics in order to produce something analogous

• One commonly used goodness-of-fit method in Poisson regression is to conduct a goodness-of-fit test using the Residual Deviance statistic:

D = 2 ∑_{i=1}^{n} [ yi log(yi/λ̂i) − (yi − λ̂i) ]

• Under the null hypothesis that our model is correctly specified, D follows a χ2
distribution with n−(k+1) degrees of freedom, where k is the number of explanatory
variables in our model

• Thus if D is larger than the critical value χ²α, n−(k+1), we would reject the null hypothesis and conclude that our model does not fit the data well

• You will not be expected to calculate D by hand but should be able to interpret it
when encountered in SAS output
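Although you will not compute D by hand, seeing it computed once helps. A Python sketch evaluating D for the elephant poaching model, using the data of Table 3.13 and the rounded estimates of Table 3.14 (so the result will differ slightly from the SAS value):

```python
import math

# Elephant poaching data (y, x1, x2) from Table 3.13
data = [(39, 0.00311, 0.054), (16, 0.02178, 0.372), (27, 0.04884, 1.189),
        (16, 0.05147, 2.692), (7, 0.04766, 22.537), (9, 0.03141, 9.823),
        (12, 0.03698, 3.483), (23, 0.02373, 0.557)]
b0, b1, b2 = 3.4453, -9.5224, -0.0627   # estimates from Table 3.14

D = 0.0
for y, x1, x2 in data:
    lam = math.exp(b0 + b1 * x1 + b2 * x2)   # fitted mean for this year
    D += y * math.log(y / lam) - (y - lam)
D *= 2

df = len(data) - 3   # n - (k + 1) with k = 2 explanatory variables
print(D, df)         # compare D to its degrees of freedom
```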

Overdispersion in Poisson Regression

• Overdispersion is a violation of the model assumption that the response variable follows a Poisson distribution

• In particular, overdispersion means Var(Yi | xi) > λi, whereas the model assumes Var(Yi | xi) = λi

• A rule of thumb in Poisson regression is that the residual deviance D should be
roughly equal to its degrees of freedom n−(k+1). If D is much larger than n−(k+1)
(suggesting a lack of fit), this may indicate the presence of overdispersion

• One option we have when our Poisson regression model is overdispersed is to in-
stead use Negative Binomial Regression

• The negative binomial distribution is a discrete probability distribution we have not previously encountered. We will not discuss it in detail here; we simply note that it is also useful for modelling count data

• Through some clever transformations it can be shown that if Y is a negative binomially distributed random variable and E(Y) = λ, then Var(Y) = λ + θλ² where θ > 0 is a transformed parameter called the dispersion parameter

• What this means is that the variance of a negative binomially distributed random variable exceeds the variance of a Poisson-distributed random variable (with the same mean) by the amount θλ², and as θ → 0 the negative binomial distribution reduces to a Poisson distribution

• Thus we can use the negative binomial distribution to model count data which is
overdispersed

• Note that negative binomial regression uses the same link function as Poisson re-
gression, so the model equation is the same and so is the interpretation of the pa-
rameters
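One way to see the Var(Y) = λ + θλ² relationship is by simulation: a Poisson variable whose rate is itself drawn from a Gamma distribution with mean λ and variance θλ² is negative binomially distributed. A sketch (standard library only; the Poisson sampler uses Knuth's method, which is adequate for small rates):

```python
import math
import random

random.seed(7)   # reproducible

def poisson_draw(rate):
    """Knuth's method for simulating a Poisson variate (fine for small rates)."""
    limit, k, prod = math.exp(-rate), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

lam, theta = 4.0, 0.5
s = 50_000

# Gamma(shape = 1/theta, scale = theta*lam) has mean lam and variance theta*lam^2;
# mixing a Poisson over this Gamma yields a negative binomial response
ys = [poisson_draw(random.gammavariate(1 / theta, theta * lam)) for _ in range(s)]

mean = sum(ys) / s
var = sum((y - mean) ** 2 for y in ys) / (s - 1)
print(mean, var)   # should be near lam = 4 and lam + theta*lam^2 = 12
```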

4 Introduction to Monte Carlo Simulation
Estimating Quantities through Simulation

• The Monte Carlo method is a technique used to approximate quantities (e.g., integrals) that are difficult or impossible to work out analytically

• In our context, we could use the Monte Carlo method to estimate certain probabilities or expected values from Markov Chains, for instance

• Let us in general consider two types of quantities we may want to estimate using Monte Carlo simulation: the expected value of a random variable and the probability of an event

1. If we want to estimate the expected value of a random variable Y, i.e., E(Y), we need to randomly generate s values y1, y2, . . . , ys from the probability distribution of Y; our estimate of µ = E(Y) will then be:

µ̂MC = ȳ = (1/s) ∑_{i=1}^{s} yi

2. If we want to estimate the probability of event A, i.e., p = Pr(A), we need to randomly simulate s times the situation in which either A or Aᶜ occurs; we then define a random variable X such that X = 1 if event A occurs and X = 0 if event A does not occur. Thus:

Pr(X = 1) = Pr(A) = p and Pr(X = 0) = Pr(Aᶜ) = 1 − p

X follows a Bernoulli distribution (a binomial distribution with only one trial). If we generate s independent values x1, x2, . . . , xs from the distribution of X (all of which will be 0's and 1's), then ∑_{i=1}^{s} xi is a value generated from a binomial distribution with success probability p and number of trials s. The expected value of such a binomial random variable is sp, so (1/s) ∑_{i=1}^{s} Xi has expected value (1/s)(sp) = p. Therefore, we can estimate p using the method of (1) above, taking x1, x2, . . . , xs as our y1, y2, . . . , ys, so that

p̂MC = (1/s) ∑_{i=1}^{s} xi = (# of occurrences of event A) / (# of trials)
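Both recipes can be sketched in a few lines of Python. Here we estimate E|Z| (true value √(2/π) ≈ 0.798) and Pr(Z > 1.5) (true value 1 − Φ(1.5) ≈ 0.0668) for a standard normal Z, chosen so the answers can be checked against known values:

```python
import random

random.seed(1)   # reproducible
s = 200_000
draws = [random.gauss(0, 1) for _ in range(s)]

# (1) Expected value: estimate E|Z|, true value sqrt(2/pi) ≈ 0.7979
mu_hat = sum(abs(z) for z in draws) / s

# (2) Probability: estimate Pr(Z > 1.5), true value 1 - Phi(1.5) ≈ 0.0668
p_hat = sum(1 for z in draws if z > 1.5) / s

print(mu_hat, p_hat)
```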

Pseudorandom Number Generation

• How do we generate a ‘random’ value from the distribution of T |X0 = i (or any
other probability distribution)?

• Strictly speaking this is impossible; we simply cannot produce something truly 'random'

• Instead we produce what is called a pseudorandom number using a pseudorandom number generator (PRNG), which is an algorithm designed to produce a sequence of numbers that behave as though they were random

• The quality of PRNG algorithms has improved greatly over the past few decades in
terms of the properties of the random numbers they produce

• The basic probability distribution from which PRNGs generate is the continuous uniform distribution U(0, 1), which produces real-numbered values uniformly distributed within the interval [0, 1]

• Algorithms can then be produced to transform the U(0, 1) numbers to another probability distribution

• For instance, consider a Markov Chain with state space S = {0, 1, 2} where Xn = 0 and the transition probabilities for row 0 are as follows:

P0j = Pr(Xn+1 = j | Xn = 0) = 0.2 if j = 0; 0.3 if j = 1; 0.5 if j = 2

• If we wanted to generate a random value from this transition probability (to determine where the process will be at step n + 1), we can generate a U(0, 1) random value un+1 and then apply the following rule:

Xn+1 = 0 if 0 ≤ un+1 < 0.2; 1 if 0.2 ≤ un+1 < 0.5; 2 if 0.5 ≤ un+1 ≤ 1
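This rule is straightforward to implement. A sketch using Python's standard `random` module (whose default generator is the Mersenne Twister, a widely used PRNG):

```python
import random

random.seed(42)   # reproducible

def next_state(u):
    """Map a U(0,1) draw to the next state using the row-0 transition probabilities."""
    if u < 0.2:
        return 0
    elif u < 0.5:
        return 1
    return 2

# Simulate many one-step transitions out of state 0 and check the proportions
s = 100_000
draws = [next_state(random.random()) for _ in range(s)]
for j in [0, 1, 2]:
    print(j, draws.count(j) / s)   # should be close to 0.2, 0.3, 0.5
```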

Precision of Monte Carlo Estimators

• Programmers are often so eager to come up with Monte Carlo estimates that they
forget about the problem of how precise their estimates are

• Ideally we would want to calculate the standard error of our Monte Carlo estimates
and we can then also produce approximate confidence intervals for them using the
Central Limit Theorem

• Consider our problem of estimating the expected value of a random variable X; our Monte Carlo estimator is a sample mean µ̂MC = x̄ = (1/s) ∑_{i=1}^{s} xi. Recall from Statistics 1B the sampling distribution of a sample mean: E(X̄) = µ (the true distribution mean) and Var(X̄) = σ²/n, where σ² is the true distribution variance (here the sample size n is our number of simulation trials s).
• If σ² is unknown, as will often be the case, we can estimate it with

σ̂² = (1/(s − 1)) ∑_{i=1}^{s} (xi − x̄)²

• Thus the approximate standard error of our Monte Carlo expected value estimator is ŜE(µ̂MC) = σ̂/√s

• Applying the Central Limit Theorem, provided that our number of simulation trials s is large, an approximate (1 − α)100% confidence interval for our unknown expected value µ = E(X) is given by

x̄ ± zα/2 σ̂/√s
• Consider our problem of estimating the probability of event A; our estimator is p̂MC = (1/s) ∑_{i=1}^{s} Xi, where ∑_{i=1}^{s} Xi is binomially distributed with success probability p and number of trials s. We know that the variance of a binomial random variable with s trials and probability of success p is sp(1 − p), so:

Var(p̂MC) = Var( (1/s) ∑_{i=1}^{s} Xi ) = (1/s²) Var( ∑_{i=1}^{s} Xi ) = (1/s²)(sp(1 − p)) = p(1 − p)/s

• Since p is unknown we replace it with p̂MC

• Thus ŜE(p̂MC) = √( p̂MC(1 − p̂MC)/s )

• This gives us an approximate (1 − α)100% confidence interval for p:

p̂MC ± zα/2 √( p̂MC(1 − p̂MC)/s )

Precision of Monte Carlo Estimators: Examples

• Suppose you are estimating the expected value of a random variable X using s
Monte Carlo simulation trials and you obtain a point estimate of µ̂MC = 2.52 and
a sample standard deviation of σ̂ = 0.43. Give the halfwidth of a 95% confidence
interval for µ if the number of simulation trials was 100, 1000, 10000, 100000, or
1000000

• (Note: the halfwidth of the confidence interval is the part after the ±)

• Answers are given in Table 4.1

Table 4.1: Halfwidth of 95% Confidence Interval for µ for Different Numbers of Simulation Trials
s            zα/2 σ̂/√s
100          0.08428
1000         0.02665
10000        0.008428
100000       0.002665
1000000      0.0008428

• Suppose you are estimating the probability of an event A using s Monte Carlo
simulation trials and you obtain a point estimate of p̂MC = 0.4. Give the halfwidth
of a 95% confidence interval for p if the number of simulation trials was 100, 1000,
10000, 100000, or 1000000

• Answers are given in Table 4.2

Table 4.2: Halfwidth of 95% Confidence Interval for p for Different Numbers of Simulation Trials
s            zα/2 √(p̂MC(1 − p̂MC)/s)
100          0.09602
1000         0.03036
10000        0.009602
100000       0.003036
1000000      0.0009602
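The entries in Tables 4.1 and 4.2 can be reproduced directly from the halfwidth formulas; a short Python check:

```python
import math

z = 1.96                      # z_{alpha/2} for a 95% confidence level
sigma_hat, p_hat = 0.43, 0.4  # values given in the two examples

for s in [100, 1_000, 10_000, 100_000, 1_000_000]:
    hw_mu = z * sigma_hat / math.sqrt(s)           # Table 4.1 halfwidth
    hw_p = z * math.sqrt(p_hat * (1 - p_hat) / s)  # Table 4.2 halfwidth
    print(s, round(hw_mu, 7), round(hw_p, 7))
```

Note how each tenfold increase in s shrinks the halfwidth by a factor of √10 ≈ 3.16.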

Table 4.3: Standard Normal Lower Cumulative Probabilities Pr (Z < z)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998

Probability Shown is Area under Std. Normal Probability Density Function from −∞ to z

References
Allen, L. J. S. (2008), An introduction to stochastic epidemic models, in F. Brauer,
P. van den Driessche and J. Wu, eds, ‘Mathematical Epidemiology’, Springer, Berlin,
pp. 81–130.

Fox, J. (2016), Applied Regression Analysis & Generalized Linear Models, 3rd edn, Sage,
Thousand Oaks.

Givens, G. and Hoeting, J. (2005), Computational Statistics, Wiley, Hoboken.

Myers, R. (2010), Generalized Linear Models with Applications in Engineering and the
Sciences, 2nd edn, Wiley, Hoboken, NJ.

Pinsky, M. A. and Karlin, S. (2011), An Introduction to Stochastic Modelling, 4th edn, Academic Press, Boston.
