
MA2404

a module for the BSc in Actuarial Science

Principles of Financial Modelling


Edition 2, September 2014

© University of Leicester 2014

All rights reserved. No part of the publication may be reproduced, stored in
a retrieval system or transmitted in any form or by any means, mechanical,
photocopying, recording or otherwise, without the prior written consent of
the University of Leicester.

Preface
Welcome to module MA2404 Principles of Financial Modelling.

This module is part of your studies towards the BSc Diploma in Actuarial
Science with the University of Leicester. The module has no formal
prerequisites beyond familiarity with the standard probability theory
covered in the first year of your studies.

An important thread through all actuarial and financial disciplines is the use
of appropriate models. For example, it is obvious that human mortality is of
fundamental importance in the pricing of pension policies and life insurance
contracts; robust mathematical models of mortality are therefore needed by
institutions involved in the provision of such products. Actuaries within
these institutions need a thorough understanding of a variety of models so
that they can choose the best model in any given situation. The aim of this
module is to provide an introduction to some such models, with an emphasis
on Markov models.

We begin the module with a chapter that explains the underlying concepts
of financial and actuarial modelling in general. We then discuss the Monte
Carlo method, a family of simple probabilistic algorithms for simulating
systems in which there is underlying randomness. Next, after a short review
of probability theory and stochastic processes, we study the theory of Markov
processes and their application to financial and actuarial modelling. In the
last chapters, we pay particular attention to mortality.

Dr Bogdan Grechuk

Contents
1 Principles of actuarial and financial modelling
1.1 Why are models used?
1.2 Key steps in the modelling process
1.3 Benefits and limitations of modelling
1.4 Stochastic and deterministic models
1.5 Suitability of a model
1.6 Short-run and long-run properties of a model
1.7 Analysing model output
1.8 Sensitivity testing
1.9 Communicating the results
1.10 Summary

2 The Monte Carlo method
2.1 A motivating example
2.2 Accuracy of the method
2.3 Application to stochastic modelling
2.4 Random number generation
2.4.1 Random variate generation from the uniform distribution
2.4.2 Random variate generation from a specified distribution
2.4.3 Random variate generation from the standard normal distribution
2.5 Summary

3 Probability theory and stochastic processes
3.1 Probability theory
3.1.1 Probability measure and probability space
3.1.2 Random variables, expectation and variance
3.1.3 Probability distributions
3.1.4 Independence
3.1.5 Conditional Probability
3.2 Stochastic processes
3.2.1 Rough classification of random processes
3.2.2 General definitions
3.3 Summary

4 Markov Chains
4.1 The Markov property
4.2 Definition of Markov Chains
4.3 The Chapman-Kolmogorov equations
4.4 Time dependency of Markov chains
4.4.1 Time-inhomogeneous Markov chains
4.4.2 Time-homogeneous Markov chains
4.5 Further applications
4.5.1 The simple (unrestricted) random walk
4.5.2 The restricted random walk
4.5.3 The modified NCD model
4.5.4 A model of accident proneness
4.5.5 General principles of modelling using Markov chains
4.6 Stationary distributions
4.7 The long-term behaviour of Markov chains
4.8 Summary

5 Markov Jump Processes
5.1 Poisson process
5.1.1 Interarrival times
5.1.2 Compound Poisson process
5.2 The time-inhomogeneous Markov jump process
5.3 Transition rates
5.4 Time-homogeneous Markov jump processes
5.5 Applications
5.5.1 Survival model
5.5.2 Sickness-death model
5.5.3 Marriage model
5.6 Summary

6 The two-state Markov survival model
6.1 The two-state Markov model
6.2 Developing probabilities
6.3 Developing the two-state model
6.4 The maximum likelihood estimator
6.5 The distribution of
6.6 Summary

7 The multiple-state Markov model
7.1 Developing probabilities
7.2 Solving the Kolmogorov equations for a simple multiple state model
7.3 The maximum likelihood estimators
7.4 Properties of the maximum likelihood estimators
7.5 Summary

A Chapter 1 solutions
B Chapter 2 solutions
C Chapter 3 solutions
D Chapter 4 solutions
E Chapter 5 solutions
F Chapter 6 solutions
G Chapter 7 solutions

The following book has been used as the basis for these lecture notes:

Faculty and Institute of Actuaries, CT4 Core Reading

Chapter 1
Principles of actuarial and financial modelling

Introduction
In this chapter, we introduce the idea of using models to represent and
examine various real life systems or processes. We discuss:

• why models are used
• how to construct a model
• analysing the output from a model
• communicating the results of a model

Modelling is a key aspect of financial and actuarial work. Financial analysts
and actuaries use models to solve real business problems. These business
problems typically involve analysing future financial events, especially when
the amount of a future payment, or the timing of when it is paid, is uncertain.
Examples of areas in which actuaries analyse such financial events include:
• Life assurance: A life assurance policy pays out a lump sum on the
  death of the insured life if it occurs during the period of cover.
  Actuaries help to calculate the charge for providing such life assurance
  (the premiums) and the amount of money the life company needs to hold
  (the reserves) to ensure that it can meet the cost of paying such lump
  sums.

• Pensions: A pension is a sum of money paid regularly as a retirement
  benefit. There are many different types of pension arrangements and
  actuaries are heavily involved in providing financial analysis and
  advice for these. This includes: producing projections illustrating future
  pension benefits, advising companies on how much money to put into
  their pension schemes to meet the cost of future pensions and providing
  advice on how to invest pension funds whilst ensuring the level of
  investment risk is managed.

• Permanent Health Insurance (PHI): PHI provides regular payments
  while an insured life is ill or unable to work. The payments provide
  protection against lost salary and will be stopped once the insured
  life returns to work. Actuaries provide similar support to that
  provided for life assurance. However, unlike with death, the insured
  can move into and out of the state of ill-health.

Within each section of this chapter, we discuss examples to put the modelling
process into financial and actuarial contexts.
This chapter presents the principles of financial and actuarial modelling,
primarily as a theoretical concept. The rest of the module then expands on
this theoretical basis, providing details of the models actually used by
financial analysts and actuaries. You are therefore likely to benefit from
revisiting this chapter whilst studying the rest of the module.

1.1 Why are models used?
A model is an imitation of a system or process. It is usually built to represent
and explore an event which could occur in the real world, for example the
effect of medical treatment on a cancer patient, the flow of traffic on a
street or the result of a horse race.
We can use a model to investigate the possible outcome and consequences of
a particular scenario without having to wait for the actual scenario to run its
course. This allows us to observe systems in condensed time and so to plan
for possible consequences, or even to decide not to proceed with a certain
project at all.
The model also enables us to investigate the effect different input param-
eters have on the results of a model.
Example 1.1. Let us consider the expected lifetime of a terminally ill
cancer patient. Suppose we set up a model exploring the effect that taking
various quantities of a new drug has on the patient's lifetime. Then the
quantity of the new drug is an input parameter. Our model may then enable
us to optimise the amount of the drug to give to the patient in order to
maximise their lifetime.
Example 1.2. A 20-year-old married man has decided to take out a personal
pension with a life company. He wishes to pay a fixed percentage of his
salary into this personal pension until he retires. The life company directs
him to a Pensions Calculator on its website, which projects his pension
at retirement. The Calculator asks for details such as current earnings, the
percentage of salary he wishes to pay into the pension, the age at which
he wishes to retire and whether he wishes to provide a pension for his wife
(a spouse's pension) if he dies before she does. These details are all
examples of input parameters and affect the size of the projected pension
at retirement. The Pensions Calculator itself is an example of a model.

Models are simplified versions of real world systems and involve making a
number of assumptions about how the system works. This includes whether
and how we choose to model the relationships between various parameters
in the model. The assumptions made can reduce the level of complexity of
the model compared with that of the real world system.
Example 1.3. In Example 1.2 above, there are many factors which will
affect the projected level of pension. Some of the financial factors include
future salary increases, price inflation and future investment returns. There
is likely to be some correlation between these factors, e.g. salary increases
tend to be higher when price inflation is higher. However, expected salary
increases may also depend on the individual concerned, e.g. is the man
a high-flyer?
Any Pensions Calculator will either include such factors as input parameters
(i.e. the user explicitly chooses these assumptions) or the assumptions will be
implicitly made within the model. There will need to be a balance between
making the Pensions Calculator easy to use and simple for customers and
ensuring accurate pensions projections are produced.
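
As a rough illustration of Examples 1.2 and 1.3, the sketch below shows how a
very simple deterministic Pensions Calculator might separate explicit input
parameters from implicit assumptions. Everything in it is hypothetical: the
function name, the default growth rates and the annuity factor are
assumptions made purely for illustration and are not taken from any real
calculator.

```python
def project_pension(salary, contribution_rate, age, retirement_age,
                    salary_growth=0.03, investment_return=0.05,
                    annuity_factor=20.0):
    """Project an annual pension at retirement (illustrative model only).

    salary, contribution_rate, age and retirement_age are explicit inputs
    chosen by the user. salary_growth, investment_return and annuity_factor
    are implicit assumptions with made-up default values.
    """
    fund = 0.0
    for _ in range(retirement_age - age):
        # Contributions are assumed to be paid at the start of each year.
        fund = (fund + salary * contribution_rate) * (1 + investment_return)
        salary *= 1 + salary_growth
    # Convert the fund into a level annual pension with a fixed annuity factor.
    return fund / annuity_factor


base = project_pension(25_000, 0.05, 20, 65)
higher = project_pension(25_000, 0.10, 20, 65)
print(f"5% contributions:  {base:,.0f} a year")
print(f"10% contributions: {higher:,.0f} a year")
```

Moving an assumption such as salary_growth from a default value to a
required argument makes the model more flexible but harder for a customer to
use, which is the balance described above.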

We will need to use input data in order to produce and parametrise the
model. Such data may relate to past observations, current observations or
expected future values (e.g. target inflation rate). Statistical methods can be
used to fit the model to the data, if the data is considered to be appropriate.
When deciding on the appropriate level of detail to include in the model, it
is important to consider the objectives of the model. That is, we need to
balance the cost and time of producing the most accurate model against the
purpose for which it is required.
Example 1.4. An actuary advises a large final salary pension scheme (i.e.
a pension scheme which provides a pension based on the member's salary at
retirement). She has been asked to estimate the current funds required to
meet the cost of the pension benefits built up to date in the pension scheme.
This will involve, amongst other things, assumptions about the lifetime of
the members of the pension scheme.
The actuary could send a questionnaire to each member asking them about
their current health, whether they smoke, how much they drink etc. She
could then use this data in her model to estimate more precisely how long
each member can be expected to survive. However, this would be a
time-consuming and costly exercise. Given the size of the pension scheme, it
is unlikely to add significantly to the accuracy of her estimate. It would be
more appropriate for her to use standard life tables, produced by life
companies or from census data, and to adjust these if necessary, to assess
the expected lifetime of members. For example, if the company is a large
accountancy firm, employees are likely to have lower mortality than the
general population (for many reasons; for example, because the general
population includes people in much more hazardous jobs) and so she may use
the standard life table rated down by two or three years.

1.2 Key steps in the modelling process
The modelling process is varied and does not rigidly follow a prescribed series
of steps. Movement between each stage of modelling is fluid, and actuaries
will often revisit earlier work in order to improve and fine-tune their model.
However, there are 12 key stages which should always be considered when
constructing a model. These are outlined below.

1. Objectives: Develop a clear set of objectives to model the relevant
system or process.

2. Plan and validate: Plan the model around the chosen objectives,
ensuring that the model's output can be validated, i.e. checked to ensure
it accurately reflects the anticipated output from the relevant system
or process.

3. Data: Collect and analyse the necessary data for the model. This
will include assigning appropriate values to the input parameters and
justifying any assumptions made as part of the modelling process.

4. Capture real world system: The initial model should be described
so as to capture the main features of the real world system. The level
of detail in the model can always be reduced at a later stage.

5. Expertise: Involve the experts on the relevant system or process.
They will be able to give feedback on the validity of the model before
it is developed further.

6. Choose computer program: Decide on whether the model should
be built using a simulation package or a general purpose language.

7. Write model: Write the computer program for the model.

8. Debug the program.

9. Test model output: Test the reasonableness of the output from the
model. The experts on the relevant system or process should be
involved at this stage.

10. Review and amend: Review and carefully consider the appropriate-
ness of the model in the light of small changes in input parameters.

11. Analyse output: Analyse the output from the model.

12. Document and communicate: Communicate and document the re-
sults and the model.

Example 1.5. Bill is thinking about buying a property to rent out to
students. He has chosen the property and approached the bank for details
on mortgages already. He asks for your help with assessing whether this is
a sensible venture. You start by quickly talking him through how to build a
model to answer this question. Your discussions should cover the following
steps:

1. Objectives: Bill obviously wants to make a profit from this venture.
Further discussions with him indicate that he is particularly keen to
know at what point he may be able to pay off the mortgage he will take
out on the property. Therefore the objective of the model could be to
calculate the projected date on which Bill will pay off his mortgage,
ignoring any change in the actual property value.

2. Plan and validate: Bill's plan is to use all of the
rental income to pay off the mortgage. He intends to do any repair
work or maintenance himself and to rent the property out directly
rather than through an agent. He has obtained a mortgage offer from
the bank and will compare this with his own estimate of how quickly
he will pay off the mortgage.

3. Data: Bill will need to collect data on the cost of mortgages, the level
of rent he can expect to obtain from the property, and the anticipated
cost of any repair work or ongoing maintenance of the property. He will
also need to have an idea of the likely length of voids on the property.

4. Capture real world system: You discuss with Bill some of the
complexities involved with getting a mortgage and estimating the future
rental income, such as fluctuating interest rates, inflation and tax.
These would all need to be taken into account in order to model the true
time period it could take to pay off his mortgage. However, Bill says that
at this stage he just wants an approximate idea of the period. He accepts
that this is subject to some potential risk, upside or downside.

5. Consult experts: He has bought some magazines on buy-to-rent and
has spoken to a bank about this issue. He will use this information to
check whether the outcome from this model is reasonable. He also has
an acquaintance who has been renting out properties for many years,
and has built up a large portfolio. He will discuss this project with
him.
6. Choose computer program: Bill decides to ignore the potential
volatility of the input parameters, and to build a simple model in Excel.
This treats interest rates, void periods (i.e. periods where the property
is empty and hence no rent is received), and increases in rental income
as fixed values.
7. Write model: Bill uses Excel to project four figures for each year after
he buys the house: the mortgage level at the start of the year, the
interest required for that year, the rental income received over the year
and the mortgage level at the end of the year. He multiplies the rental
income by a certain percentage (below 100) to allow for void periods.
8. Debug: You explain to Bill that he should check to ensure that he
has not made any silly errors in the model. He should also check the
figures in each row appear reasonable and that the mortgage level is
decreasing each year: if not, he will have either made a mistake or he
will be unable to pay off his mortgage based on rental income alone.
9. Test model output: Bill should look at the overall results to check
that they appear reasonable, based on the discussions he has had with
his acquaintance and the bank.
10. Review and amend: You are concerned that Bill has such a relaxed
attitude to the potential volatility of the input parameters. You suggest
he reruns his model looking at different scenarios, e.g. consider a best
case scenario with low interest rates, high rental growth and low voids
and a worst case scenario with high interest rates, low rental growth
and high voids. He should try to understand any correlation that exists
between the financial and property market, to put these scenarios into
context. His magazines may help here.
11. Analyse output: He should ensure that he is comfortable with the
results, taking into account the different scenarios covered. He may
wish to consider other options, for example, selling the property early
under the worst case scenario.
12. Document and communicate: This is the point at which he will
need to make a decision about whether to proceed. He should summarise
the information available at this stage to ensure he has fully
understood this investment.
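
Steps 6 to 10 of Example 1.5 could equally be sketched in a few lines of
Python rather than a spreadsheet. The sketch below is only an illustration:
the function name is our own, and every figure (loan size, rent, interest
rates, rental growth and occupancy percentages) is a hypothetical value
chosen for the example, not data from the text.

```python
def years_to_repay(loan, annual_rent, interest_rate, rent_growth,
                   occupancy, max_years=100):
    """Return the year in which the mortgage is repaid, or None.

    Follows the four yearly figures of Example 1.5: opening mortgage,
    interest for the year, rental income received and closing mortgage.
    occupancy is the fraction (below 1) of the year the property is let.
    """
    balance = loan
    for year in range(1, max_years + 1):
        interest = balance * interest_rate      # interest for the year
        income = annual_rent * occupancy        # rent received, net of voids
        balance = balance + interest - income   # mortgage at end of year
        if balance <= 0:
            return year
        annual_rent *= 1 + rent_growth          # rent for the following year
    return None  # not repaid from rent alone within max_years


# Hypothetical best, central and worst case scenarios (step 10).
scenarios = {
    "best":  dict(interest_rate=0.03, rent_growth=0.04, occupancy=0.95),
    "base":  dict(interest_rate=0.05, rent_growth=0.02, occupancy=0.90),
    "worst": dict(interest_rate=0.08, rent_growth=0.00, occupancy=0.80),
}
for name, params in scenarios.items():
    result = years_to_repay(loan=150_000, annual_rent=12_000, **params)
    print(f"{name:>5}: {result}")
```

A result of None corresponds to the warning in step 8: under those
assumptions the rental income never clears the mortgage.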

1.3 Benefits and limitations of modelling
Benefits of modelling

• Compressed timeframe: We are able to compress the time it takes
  to examine the results of a real world system. This is particularly
  useful in actuarial work, where financial planning is required, often for
  events far in the future. For example, funding a pension scheme, where
  contributions to the scheme are made now towards pensions that may
  be paid many years in the future.

• Ability to incorporate randomness: Standard mathematical or
  logical models are not always capable of allowing for random elements,
  such as interest rates, life expectancy or currency rates, and for
  correlations between such elements. The nature of such random elements
  is often important to include in a model, enabling the user to see
  the range of possible outcomes. Such randomness can be incorporated
  into stochastic models.

• Scenario testing: Models enable us to run several different scenarios,
  varying parameters, and to easily observe the effects of such variation.
  For example, we could see the expected increase in pension resulting
  from paying a larger percentage of salary into a personal pension plan.

• Greater control over experimental conditions: A model allows
  us to set up the experimental conditions. This is in contrast to a real
  world system where we often do not have the ability to influence some
  of the conditions. This allows us to examine the output from a model,
  without encountering unnecessary variation in the results.

• Cost control: By building a model to represent a real world system
  we can avoid making costly investments in the actual system before
  fully understanding the implications.

Limitations of modelling
Whilst models are extremely useful when dealing with actuarial problems,
they have their limitations. Modelling a process is not always the most
effective or efficient way to approach the problem. Some of the general
limitations are explained below.

• Time and cost: Modelling complex systems can require the investment
  of a significant amount of time and expertise. This, in turn, leads
  to a significant cost to the client.

• Several runs required: For a stochastic model, each run is only an
  estimate of a model's output. The model needs to be run a number
  of times to construct an accurate indication of the distribution of the
  potential outcome. Generally, models are more useful to examine the
  effects of different input parameters than to optimise outputs.

• Validation and verification: It is not easy to see past the complexity
  of a model in order to ensure that it actually mimics the real world
  system.

• Reliance on data input: The model relies on accurate data being
  used to set up and parametrise the model. If this is not the case, the
  model is likely to be inappropriate, i.e. rubbish in, rubbish out.

• Inappropriate use: The model must be properly understood by its
  user and communicated appropriately to the client. Without this level
  of understanding, there is scope for the model to be applied in the
  wrong situations.

• Limited scope: It is not possible to create a model which covers all
  possible future events. For example, the introduction of new legislation
  may invalidate the results of our model. However, we are not always
  able to anticipate such a change.

• Difficulty interpreting some outputs: Some results may only make
  sense in a relative sense, i.e. they allow us to understand the effect
  that varying different input parameters has on the output, but a single
  output on its own may add little or no understanding of the real world
  system.

1.4 Stochastic and deterministic models


A stochastic model is a simulation model that allows for uncertain events
by assigning probability distributions to some or all of the model
parameters, i.e. the parameters are treated as random. A deterministic
model is a model which does not include any element of randomness.

When we run a stochastic model we obtain one possible outcome from the
model, i.e. the output is random. Each specific output is only one estimate
of the real world system. Therefore, several independent runs are required
in order to obtain an idea of the distribution of the output. The more runs,
the more accurate a picture we can obtain. For complex models, thousands
of runs may be required. (We will develop this idea later in this module, when
we meet Monte Carlo simulations.)
In contrast, a deterministic model uses fixed parameters and has a single
set of outputs. You could view a deterministic model as a single run of a
stochastic model with fixed input parameters. That is, you only need to run
a deterministic model once to determine the results.
Example 1.6. Consider the expected income from investing in a unit trust
over a three year period. Imagine the return in each year is independent of
the returns in the other years and is distributed such that there is an equal
chance of a 20% fall, a 10% rise or a 50% rise.
If we use a deterministic approach, we may decide to determine the result
simply by calculating the mean return per annum and assuming this is the
return each year, i.e.
\[ \frac{0.8 + 1.1 + 1.5}{3} = 1.1333, \quad \text{i.e. 13.3\% p.a.,} \]
and
\[ 1.1333^{3} = 1.4557, \quad \text{i.e. a 45.6\% return over 3 years.} \]
This result provides no indication of the possible range of returns from this
investment.
Alternatively, we could use a stochastic approach. We could use a string of
three random numbers, each equally likely to be 1, 2 or 3, to determine the
outcome for each year of investment, attributing 1 to a 20% loss, 2 to a 10%
gain and 3 to a 50% gain. For example, if our string of three random numbers
is 1,1,3, our return would be:
\[ 0.8 \times 0.8 \times 1.5 = 0.96, \quad \text{i.e. a 4\% loss over the 3 years.} \]
We could then rerun this model, with more strings of three random numbers,
to obtain predicted values for the possible investment return. The
random-number strings and corresponding output may look something like:
(1,1,3),(1,2,3),(1,2,2),(2,2,2),(3,2,3),(3,1,3),...
which translates to 3 year returns of:
-4%, 32%, -3.2%, 33.1%, 147.5%, 80%,...
This approach does provide an indication of the possible range of returns
from this investment. This may be vital if the investor requires a lump sum
of a certain amount in three years' time.

In actual fact, we could calculate the distribution for this investment and
do not need to examine the output from a number of runs to obtain an
indication of the distribution. However, this is often not the case for more
complex models.
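
The closing remark of Example 1.6 can be checked directly. The sketch below
is our own illustration rather than part of the original example: it
enumerates all 27 equally likely three-year paths to obtain the exact
distribution of the growth factor, and then compares the exact mean with a
simulated one. The seed and the number of runs are arbitrary choices.

```python
import itertools
import random
from collections import Counter

GROWTH_FACTORS = (0.8, 1.1, 1.5)  # 20% fall, 10% rise, 50% rise
YEARS = 3


def three_year_factor(path):
    """Multiply the yearly growth factors along one path."""
    result = 1.0
    for factor in path:
        result *= factor
    return result


# Exact distribution: every path of three yearly outcomes is equally likely.
paths = list(itertools.product(GROWTH_FACTORS, repeat=YEARS))
exact = Counter(round(three_year_factor(p), 4) for p in paths)
exact_mean = sum(three_year_factor(p) for p in paths) / len(paths)
prob_loss = sum(three_year_factor(p) < 1 for p in paths) / len(paths)

# Stochastic approach: draw random paths and look at the sample of outcomes.
rng = random.Random(2014)  # fixed seed so that the run is repeatable
runs = [three_year_factor(rng.choices(GROWTH_FACTORS, k=YEARS))
        for _ in range(10_000)]
simulated_mean = sum(runs) / len(runs)

print(f"exact mean factor     : {exact_mean:.4f}")
print(f"simulated mean factor : {simulated_mean:.4f}")
print(f"probability of a loss : {prob_loss:.4f}")
for outcome, count in sorted(exact.items()):
    print(f"  factor {outcome:.4f} with probability {count}/{len(paths)}")
```

Because the yearly returns are independent, the exact mean factor coincides
with the deterministic figure of 1.4557, but the enumeration also shows that
10 of the 27 paths end in a loss, which the deterministic calculation hides.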

Whether a stochastic or a deterministic model is appropriate will depend
on the real world system the model is imitating and the objectives
of the model. However, if there is a random element in the real world system
then we should explicitly decide whether to incorporate this into the model.
Output from deterministic models is usually obtained by direct calculation.
However, where this is not possible, numerical approximations will often be
used, e.g. the Newton-Raphson method to solve equations or the trapezium
rule to approximate integrals.
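
To illustrate the kind of numerical approximation meant here, the sketch
below applies the trapezium rule to an integral and the Newton-Raphson
iteration to an equation. The test problems (the integral of e^(-x) over
[0, 1] and the equation x^2 = 2) and the function names are our own choices,
picked only because their exact answers are known.

```python
import math


def trapezium(f, a, b, n):
    """Approximate the integral of f over [a, b] using n equal strips."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total


def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Solve f(x) = 0 by Newton-Raphson iteration starting from x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")


integral = trapezium(lambda x: math.exp(-x), 0.0, 1.0, 1000)
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(f"trapezium estimate  {integral:.6f}, exact {1 - math.exp(-1):.6f}")
print(f"Newton-Raphson root {root:.10f}, exact {math.sqrt(2):.10f}")
```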
Stochastic models can rarely be solved by analytical methods, i.e. by solving
the equations to obtain a recognisable distribution for the model output. Even
when they can, it is useful to be able to check the results of any complex
calculations by analysing the output from several runs.
If we cannot solve our stochastic model, or wish to check our results, we need
to produce and analyse a simulation of the model. We will need to carry out
a number of simulations to be able to assess the possible range of values
that could result from a stochastic model. This approach does not provide
an easy way of finding the optimum output for a model. However, it does
enable us to construct an idea of the distribution of results possible from the
model, and hence the real world system we are trying to imitate. The more
simulations carried out, the more accurate a picture we can construct.
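
The effect of the number of simulations can be illustrated with the unit
trust of Example 1.6. The sketch below is an illustration under our own
assumptions (the seed and the run sizes are arbitrary): it estimates the mean
three-year growth factor from samples of increasing size, together with an
estimated standard error, which shrinks roughly in proportion to one over the
square root of the number of runs.

```python
import math
import random
import statistics

GROWTH_FACTORS = (0.8, 1.1, 1.5)  # 20% fall, 10% rise, 50% rise
EXACT_MEAN = (sum(GROWTH_FACTORS) / 3) ** 3  # valid because years are independent


def simulate(n_runs, rng):
    """Return n_runs simulated three-year growth factors."""
    return [math.prod(rng.choices(GROWTH_FACTORS, k=3)) for _ in range(n_runs)]


rng = random.Random(1)  # fixed seed so that the run is repeatable
for n_runs in (100, 1_000, 10_000, 100_000):
    sample = simulate(n_runs, rng)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n_runs)
    print(f"{n_runs:>7} runs: mean {mean:.4f} +/- {std_error:.4f} "
          f"(exact {EXACT_MEAN:.4f})")
```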
Monte Carlo simulation, which you will meet later in this module, uses a
string of computer-generated pseudo-random numbers to generate random values
for the parameters of a stochastic model. Applying this approach enables us
to view a sample of the possible results from a stochastic model. We may
also benefit from trying to obtain the expected values (the mean) or median
values using a deterministic approach.

1.5 Suitability of a model


In assessing the suitability of a model for a particular exercise it is important
to consider the following:

• Relevance to the objectives: A model should represent the system
  we are trying to imitate, to the extent that is required by the objectives.
  A model that is too simple may not provide results sufficient to
  achieve the objectives. On the other hand, the model should not be
  too complicated or provide more results than required.

• The validity of the model: At all stages we should bear in mind the
  validity of the model for the purpose to which it is to be applied.

• The validity of the data to be used: The input data for the model
  should be credible and appropriate. For example, if an input parameter
  is highly unpredictable, it should probably be included in a model as a
  random parameter rather than a deterministic one.

• Acceptability of the possible errors: A model may not be a perfect
  representation of the real world system we are trying to model, but we
  must ensure any simplifications to this system are acceptable.

• The impact of correlations between the random parameters:
  We should actively consider correlations which exist between random
  variables used within the model. For example, imagine modelling the
  average level of a mortgage repayment, as a percentage of salary. You
  should, amongst other things, consider the link between salary inflation
  and interest rates as this will affect the likelihood of certain scenarios.
  If higher interest rates tend to coincide with lower salary inflation,
  both of which are bad news for the borrower, then the two events are
  more likely to happen at the same time than if they were independent.

• The current relevance of models written and used in the past:
  Any previous or current models which can be used to help build our
  model should be used, i.e. we should not look to reinvent the wheel.
  However, care must be taken to ensure our model is tailored to our
  objectives.

• The credibility of the results output: The results output should
  be believable. Any concern over its credibility should be addressed as
  it could indicate an error in the model or inputs.

• The dangers of spurious accuracy: One must be careful not to
  imply a spurious level of accuracy, either within the model or when
  communicating the results. This wastes time when trying to establish
  parameters and can also lead the client to lose sight of the fact that
  results are indicative only and not the actual outcome of the event
  being modelled.

• The ease of the communication: The model, and its results, must
  be easily communicated to its intended audience. Any correlations
  between results should be understood and communicated to the client.

1.6 Short-run and long-run properties of a model


Properties observed within a real world system that appear true over a
short period (short-run), may not actually apply in the long term (long-
run). Hence, care is needed when extending a model developed from data
and observations over a short period to project over a longer period.
Conversely, similar problems can arise from using long-run models for short-
run projections. Approximations which hold in the long-run may not be
appropriate for short-run projections.
Example 1.7. Consider modelling the future growth of a rabbit population.
The growth of such a population is exponential, ignoring disease, lack of food
and man's intervention. However, the approximation for exponential growth
over the short run, i.e. for small $t$, is
\[ e^t = 1 + t + O(t^2), \]
where $O(t^2)$ denotes a remainder term that is negligible for small $t$.
Hence, in the short term, the growth may appear linear as opposed to
exponential. However, extending this linear approximation to the long run
would lead to an inaccurate model.
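The point of Example 1.7 can be checked numerically. The following is a minimal sketch in Python (the function names are ours and purely illustrative): it compares exponential growth with its short-run linear approximation for a few values of t.

```python
import math

def linear_approx(t):
    """Short-run (first-order) approximation 1 + t to exp(t)."""
    return 1.0 + t

def relative_error(t):
    """Relative error of the linear approximation against exp(t)."""
    return abs(math.exp(t) - linear_approx(t)) / math.exp(t)

# The approximation is excellent for small t and useless for large t.
for t in (0.01, 0.1, 1.0, 5.0):
    print(f"t = {t:4.2f}  exp(t) = {math.exp(t):9.4f}  "
          f"1 + t = {linear_approx(t):6.4f}  "
          f"relative error = {relative_error(t):.2%}")
```

The relative error is tiny for small t but grows quickly, which is why a model calibrated on short-run behaviour should not be extrapolated without care.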

1.7 Analysing model output


The purpose of analysing the model's output is two-fold:

1. To check that the model's results are consistent with the type of results
we would expect from the system we are trying to imitate. For real
world systems the comparison can be carried out using a Turing test.
This involves experts in the real world system comparing several sets of
data from the model and data from the real world system. The experts
should be unable to differentiate between the two sets of data. If they
are able to differentiate, their method of differentiation can then be
used to adapt and improve the model.

2. To ensure that we fully understand the results. We must be aware that
the scenarios generated by the model are produced by the same model
and cannot be treated as independent. Correlations between outputs
should be identified and allowed for. With this understanding, samples
from the model's output can be investigated to yield some knowledge
about the system concerned.

1.8 Sensitivity testing


The choice of parameters for a model can be crucial. In order to obtain
an idea of how sensitive a model's output is to these parameters, sensitivity
testing is used. This involves making small changes to the model's input
parameters and analysing the effect this has on the model's output. Usually
only one parameter is changed for each comparative run, to gauge the extent
of its significance in the model.
Sensitivity testing is particularly useful for deterministic models where ran-
dom parameters have been assigned fixed values. It then allows us to examine
the validity of making this assumption. In the case of a stochastic model,
the statistical distributions for parameters may be reviewed as part of the
sensitivity testing.
If a small change in a parameter results in a large change in the outputs, the
appropriateness of the model should be reviewed and the model refined if
necessary. Any large sensitivity to assumptions should be communicated to
the client.
Example 1.8. An actuary has traditionally used a deterministic approach
when assessing the value of a pension scheme's assets and liabilities. He
makes a number of assumptions as part of this process, for example future
interest rates, salary inflation and life expectancy of members. He refers
to the set of assumptions made as the basis. When assessing the scheme's
funding position he will often present the results on several different bases to
allow the client to appreciate the sensitivity of the results to these
assumptions. For example, he may have an optimistic basis, a best-estimate basis
and a pessimistic basis.
Example 1.9. A company provides a lump sum payment to its employees
when they retire at age 65. The lump sum payment is based on their total
service with the company and salary at retirement. The present value of the
lump sum benefits accrued has been calculated to be £4.2m. This assumes
that the investment return on this amount will be 7% p.a. The average age of
employees, weighted by past service and current salary, is 45. Use a sensitivity
analysis to investigate the effect that changing the investment return assumption
has on the present value of the accrued benefits.

Consider an employee aged 45 now. This employee will receive his lump sum
in 20 years' time. The present value of this benefit will include a discounting
factor to allow for investment return over this period, i.e. $1/(1.07)^{20}$.

If we change our interest rate assumption from 7% p.a. to 8% p.a. then we
need to strip out the 7% p.a. discounting factor in the original present value
and replace it with a discounting factor of 8% p.a., i.e. multiply the original
present value by
\[ \frac{(1.07)^{20}}{(1.08)^{20}} = 0.8302, \]
i.e. a 17% reduction in the present value.
Since the lump sum benefit is based on past service and salary, using the
average age weighted by service and salary provides the appropriate average
to obtain an approximate indication of the effect this change in investment
return would have on our total present value. Hence, if we increase our
interest rate assumption by 1% p.a. we would expect the present value
to fall by 17%, from £4.2m to £3.5m. This is a significant change and
it is important that this level of sensitivity to the chosen investment return is
communicated to the company.
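The rescaling used in Example 1.9 is easy to repeat for other assumptions. The sketch below is illustrative only: the function name is ours, and it relies on the same simplification as the example, namely that all benefits are paid at a single average term of 20 years.

```python
def rescaled_present_value(pv, old_rate, new_rate, years):
    """Strip out discounting at old_rate and re-discount at new_rate.

    Assumes, as in Example 1.9, that all benefits fall due after the
    same number of years (the weighted average term).
    """
    return pv * (1.0 + old_rate) ** years / (1.0 + new_rate) ** years

base_pv = 4.2      # present value in millions, calculated at 7% p.a.
base_rate = 0.07
term = 20          # years to retirement for the average employee

# One parameter is changed per run, as usual in sensitivity testing.
for rate in (0.06, 0.07, 0.08):
    pv = rescaled_present_value(base_pv, base_rate, rate, term)
    print(f"rate = {rate:.0%}  present value = {pv:.2f}m  "
          f"change = {pv / base_pv - 1.0:+.1%}")
```

A fuller analysis would vary the term and the salary assumptions in the same way, one at a time.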

1.9 Communicating the results


Throughout this chapter we have referred to issues relating to communicating
the results of a model. Whilst this is the final step of developing the model,
it is key to bear it in mind at each stage of the process. We should design our
model with the target audience in mind, ensuring that we take into account
their knowledge and requirements. The client may not need or want to
understand the technicalities of the model in detail. However, they must be
able to understand and interpret the results and appreciate any limitations
of these results.
Example 1.10. When you obtain a quote for your car insurance, you will
need to provide information e.g. age, past claims, points on licence etc. This
information is then used to calculate your premium.
You are rarely interested in how your premium has been calculated, but
wish to understand exactly what is covered by your insurance policy and
what effect changes like increasing the excess on your policy may have on the
premium. Hence, you would not want to be given lots of detail about the
assumptions made in the model used to calculate your premium.
Example 1.11. In Example 1.8 we discussed how an actuary will value a pension
scheme's assets and liabilities on several different bases, e.g. optimistic,
best-estimate and pessimistic. It is important the client fully understands
the difference between each of these results.
The client should not simply choose the optimistic basis on the grounds that
it allows them to pay a lower contribution to the pension scheme now. Such
an opinion would result in a misunderstanding of the true cost of the pension
scheme and may unnecessarily threaten the funding position of the scheme,
and hence reduce the security of pension benefits for the scheme members.
On the other hand, the client should also be careful not to simply pay contri-
butions based on the pessimistic basis. This may lead to the scheme becoming
overfunded on a statutory basis (i.e. a basis prescribed by the government),
which can have tax implications, or the client deciding to reduce future pen-
sion benefits unnecessarily, because they appear to cost too much.
Therefore, it is vital that the actuary communicates such results as carefully
as possible. She should ensure the implications of changing each key
assumption are fully understood, and if necessary that results are provided on
further bases.
Indeed, in recent years, there has been a move towards carrying out such
valuations on a stochastic basis. This enables the client to fully understand
the distribution of the projected assets and liabilities.

1.10 Summary
In this chapter, we have:

introduced the concept of using a model to investigate a real world
system or process;

discussed why a model is used;

outlined the 12 key steps involved in the modelling process;

discussed some of the decisions that need to be made during this pro-
cess, e.g. should the model be deterministic or stochastic?

discussed how to analyse the outputs from a model;

introduced the idea of sensitivity testing;

emphasised the issues to consider when communicating a model.

Questions

1. What input parameters might you have for a model assessing the cost
of providing life assurance?

2. Explain the key steps involved in constructing a model.

3. At what stage in the modelling process should you debug your program?

4. Briefly discuss the issues to consider when assessing the suitability of
a model.

5. Explain the difference between a stochastic and a deterministic model.

6. What does sensitivity analysis mean?

7. What is a Turing test?

8. Mr Jones plans to retire this month. He asks you to calculate what
amount of money he would require now in order to draw a £10,000
p.a. pension at the end of each year from his funds. He says that he
expects to live for 20 years and to be able to achieve a future investment
return of 7% p.a.

(a) Calculate the funds required by Mr Jones based on his expected
future lifetime and investment return.
You wish to demonstrate to Mr Jones the effect of any uncertainty
around his future lifetime and future investment returns. Therefore
you carry out some sensitivity testing on his required funds.
What funds will be required by Mr Jones if:
(b) he only survives for 15 years?
(c) he survives for 25 years?
(d) his future investment returns are 6% p.a.?
(e) his future investment returns are 8% p.a.?

Chapter 2
The Monte Carlo method
A complicated stochastic model can rarely be completely solved, in the sense
that the exact distribution of the results is analytically derived. If this is not
possible, the most natural approach one could imagine is just to run the model
several times with different input parameters to get an idea of the possible
range of the model results. For example, assume that the model is intended
to estimate a single quantity of interest $F$, such as a possible profit of an
investment, or the number of claims to an insurance company. Running the
model $m$ times, we get values $f_1, f_2, \dots, f_m$, which provide us with an
indication of what $F$ can be. Based on this, different quantities of interest can
be estimated; for example, the expected value of $F$ can be approximated as
$\frac{1}{m}\sum_{i=1}^{m} f_i$. This simple idea of several runs is called Monte Carlo
simulation of a model.
Thus, the Monte Carlo method provides simple probabilistic algorithms
for simulating systems where an underlying randomness exists. The main
advantage of the method is that the basic concepts are easily understood
and can be programmed relatively quickly, even for the most exotic models.
With the advent of cheap high-powered computers, the Monte Carlo method
has become extremely important in all financial institutions.
Even though an actuary's interest in the method is a practical one, it is
important to have an understanding of the theoretical background of the
method. For example, one can ask the following questions.
Accuracy of the method: How many runs of the Monte Carlo method
are required to obtain reasonably accurate results for a particular model?
Input data generation: Assuming that the input data for a model
follow complicated probability distributions, how can such input data
be generated appropriately?

These and other questions will be considered in this chapter together with
some practical examples. We begin with a discussion of the basic concepts,
motivated by a simple example: the evaluation of a deterministic integral.
We then proceed to discuss the generation of random numbers, which is a
fundamental requirement of the Monte Carlo method, before considering the
formal development of the method as applied to solving stochastic financial
and actuarial models. We will see that the Monte Carlo method is
the preferred numerical method for the high-dimensional problems that typically
arise in financial and actuarial modelling.

2.1 A motivating example
By way of introduction to the Monte Carlo method we look at a straightforward
example. Consider the evaluation of a deterministic integral over
the unit interval $[0, 1]$:
\[ I = \int_0^1 g(x)\,dx. \tag{1} \]
The Monte Carlo method requires a probabilistic representation of the prob-
lem, even though a deterministic example is being considered. This is a
fundamental requirement of the method. We can do this by noting that
the probability density function for a Uniform distribution over this interval,
$U(0, 1)$, is $f(x) = 1$. $I$ can therefore be represented as
\[ I = \int_0^1 g(x) f(x)\,dx = E\,g(\xi), \tag{2} \]

where $\xi$ is a random variable with uniform distribution over the interval $[0, 1]$,
i.e. from $U(0, 1)$, and $E$ represents an expectation under this distribution.
Using this representation we can now propose a probabilistic algorithm for
evaluating the integral; this will be the Monte Carlo method.
Let us first assume that we have an algorithm for generating random numbers
that draws points $\xi_1, \xi_2, \dots, \xi_M$ independently from $U(0, 1)$. We can then
produce an approximation to equation (2) in terms of an arithmetic average
of evaluations of $g(x)$ at a number of random points:
\[ I = E\,g(\xi) \approx I_M = \frac{1}{M}\sum_{m=1}^{M} g(\xi_m). \tag{3} \]

Note that since $I_M$ is formed from a finite sum of $M$ random quantities, it is
a random quantity itself. $I_M$ is the Monte Carlo estimate of the deterministic
integral $I$. Assuming that $g$ is an integrable function over the interval, then,
by the strong law of large numbers, $I_M \to I$ with probability 1 as
$M \to \infty$.
Further, if $g$ is square integrable, we can denote the variance of $g(\xi)$ as
\[ \sigma_g^2 = \mathrm{Var}[g(\xi)] = \int_0^1 f(x)\,(g(x) - I)^2\,dx = \int_0^1 (g(x) - I)^2\,dx. \tag{4} \]

Example 2.1. Question: A computer program generating random numbers
from the $U(0, 1)$ distribution has returned the values 0.659, 0.931, 0.710, 0.688
and 0.711. Use these data to approximate the integral of $x^2$ over the unit
interval.
Answer: The integral can be approximated by computing
\[ I_5 = \frac{1}{5}\sum_{i=1}^{5} u_i^2, \]
where $\{u_i\}$ are the random numbers given. Using these values, we obtain
\[ I_5 = \frac{1}{5}\left(0.659^2 + 0.931^2 + 0.710^2 + 0.688^2 + 0.711^2\right) = 0.557. \]
Note that the actual value is 1/3 and so we see that the method has
overestimated the integral using this very small number of random numbers. In
practice much higher values of $M$ are required; see Example 2.2 for the orders
of magnitude.
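The calculation in Example 2.1 can be reproduced, and then repeated with more draws, in a few lines. This is a minimal sketch in Python using the standard library generator in place of the unspecified program of the example; the function name and the seed are our own choices.

```python
import random

def mc_integral(g, draws):
    """Monte Carlo estimate (3): the average of g over the supplied draws."""
    return sum(g(u) for u in draws) / len(draws)

def g(x):
    return x * x

# The five U(0,1) values quoted in Example 2.1.
given = [0.659, 0.931, 0.710, 0.688, 0.711]
i5 = mc_integral(g, given)
print(f"M = 5 (given values): {i5:.4f}")

# Fresh draws from Python's own generator; the seed is arbitrary.
rng = random.Random(2404)
for m in (100, 10_000, 100_000):
    draws = [rng.random() for _ in range(m)]
    print(f"M = {m}: {mc_integral(g, draws):.4f}")
```

Each run with a different seed gives different estimates, but they cluster more tightly around 1/3 as M grows.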

2.2 Accuracy of the method


In practical applications it is impossible to have $M \to \infty$, and a finite number
of random numbers is used to compute $I_M$. The Monte Carlo method
therefore provides only an approximation to the integral $I$ and it is useful to
have a feel for how accurate this approximation is for a given value of $M$.
We define $R_M := I_M - I$ as the error; this is a random quantity dependent
on $M$. For large $M$ it is approximately normally distributed with mean 0 and
standard deviation $\sigma_g/\sqrt{M}$:
\[ R_M \sim N\left(0, \frac{\sigma_g^2}{M}\right), \]
which results from the central limit theorem and equation
(4). We therefore see that the quality of the approximation improves with
increasing $M$, as is expected.
The parameter $\sigma_g$ is unknown in practice but can be estimated in the same
experiment as
\[ \hat{\sigma}_g^2 = \frac{1}{M}\sum_{m=1}^{M} g^2(\xi_m) - I_M^2. \tag{5} \]
This means that, from the values of the function at the random points on the
interval, we obtain not only an estimate of the integral, but also a measure
of the error in the estimate. We note that the bias of the estimation of the
integral, defined as $E[R_M]$, is zero. This indicates that the method produces
no systematic error in this application. Further, the statistical error, defined
as $\mathrm{Var}[R_M] = \sigma_g^2/M$, results from the finite number of random numbers
generated, and tends to zero as $M$ increases. These two different measures
of the error are important in approximation techniques.
Since $I_M$ is a random estimate of the integral, we can form confidence
intervals for its value from the mean and variance as follows:
\[ I_M \in \left( I - c\,\frac{\sigma_g}{\sqrt{M}},\; I + c\,\frac{\sigma_g}{\sqrt{M}} \right), \]
with probability approximately 0.994 for $c = 2.75$ and 0.95 for $c = 1.96$, for
example. The confidence interval demonstrates how the statistical error of the
Monte Carlo method is of practical importance and emphasises that it is of
$O(M^{-1/2})$. This means that to add one decimal place of precision the method
requires 100 times as many random-number computations.
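The estimate, its standard error and a confidence interval can all be obtained from the same set of draws. The following Python sketch is illustrative only: the function name is ours, the variance is estimated as in equation (5), and the interval is centred on the estimate (the usual practical rearrangement of the statement above).

```python
import math
import random

def mc_with_error(g, m, rng):
    """Return the Monte Carlo estimate I_M and its estimated standard error.

    The variance of g is estimated from the same draws, following
    equation (5): mean of g^2 minus the square of the mean of g.
    """
    total = 0.0
    total_sq = 0.0
    for _ in range(m):
        y = g(rng.random())
        total += y
        total_sq += y * y
    estimate = total / m
    variance = max(total_sq / m - estimate ** 2, 0.0)
    return estimate, math.sqrt(variance / m)

rng = random.Random(1)  # arbitrary seed, for reproducibility only
for m in (100, 10_000, 1_000_000):
    est, se = mc_with_error(lambda x: x * x, m, rng)
    # Approximate 95% interval: estimate plus or minus 1.96 standard errors.
    print(f"M = {m:>9}: I_M = {est:.5f}, std error = {se:.5f}, "
          f"95% interval = ({est - 1.96 * se:.5f}, {est + 1.96 * se:.5f})")
```

Multiplying M by 100 should shrink the standard error by a factor of about 10, in line with the $O(M^{-1/2})$ rate.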
The trapezoidal rule, which you may be familiar with from your undergraduate
studies, is an alternative deterministic method for approximating $I$. In
particular,
\[ I \approx \frac{g(0) + g(1)}{2M} + \frac{1}{M}\sum_{m=1}^{M-1} g\left(\frac{m}{M}\right). \tag{6} \]
In contrast, the error in this approximation is of $O(M^{-2})$, which demonstrates
that the Monte Carlo method is not competitive for the 1-dimensional integral.
However, the great advantage of the Monte Carlo method is that it can,
in principle, deal with the curse of dimension: for some numerical
integration techniques the computational cost increases exponentially with
the dimension of the problem, whereas for the Monte Carlo method it does
not. This is due to the fact that the rate of convergence
of $O(M^{-1/2})$ is not restricted to integrals over the unit interval. What was
done in this simple example can easily be extended to estimating an integral
over $[0, 1]^d$ or any other domain in $\mathbb{R}^d$ for all dimensions $d$. Of course, when
we change the dimension, we change the function $g$ and so we change $\sigma_g^2$, but
the statistical error still has the form $\sigma_g/\sqrt{M}$ for an estimate computed from
$M$ draws from $[0, 1]^d$. In particular, the $O(M^{-1/2})$ convergence rate holds for
all $d$.
In contrast, the error produced by the trapezoidal rule in $d$ dimensions is
of $O(M^{-2/d})$. This degradation in convergence rate with increasing dimension
is characteristic of deterministic integration methods. Thus, Monte
Carlo methods are attractive for evaluating integrals in high dimensions, which
typically arise in financial and actuarial applications.
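The two convergence rates can be compared directly in one dimension. The sketch below (Python; function names and seed are our own) implements the trapezoidal rule as written in equation (6) and the plain Monte Carlo average, both applied to the integral of $x^2$ from Example 2.1.

```python
import random

def trapezoidal(g, m):
    """Equation (6): trapezoidal rule on [0, 1] with m equal sub-intervals."""
    interior = sum(g(k / m) for k in range(1, m))
    return (g(0.0) + g(1.0)) / (2 * m) + interior / m

def monte_carlo(g, m, rng):
    """Plain Monte Carlo average of g over m draws from U(0, 1)."""
    return sum(g(rng.random()) for _ in range(m)) / m

def g(x):
    return x * x

exact = 1.0 / 3.0
rng = random.Random(42)  # arbitrary seed
for m in (10, 100, 1000, 10_000):
    trap_err = abs(trapezoidal(g, m) - exact)
    mc_err = abs(monte_carlo(g, m, rng) - exact)
    print(f"M = {m:>6}: trapezoidal error = {trap_err:.2e}, "
          f"Monte Carlo error = {mc_err:.2e}")
```

In one dimension the trapezoidal error falls much faster; the advantage of Monte Carlo only appears as the dimension increases, which this one-dimensional sketch does not attempt to show.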

2.3 Application to stochastic modelling
The Monte Carlo method was introduced by the very simple application of
approximating deterministic integrals over a unit domain, see §2.1. As was
discussed, the method is not as efficient as standard methods of numerical
integration (for example, the trapezoidal method) in the 1-dimensional case,
but has the great advantage that it does not suffer from the curse of dimension
for general d-dimensional systems. Monte Carlo methods are therefore to be
preferred in multi-dimensional problems.
Assume that we have built a complicated stochastic model which is intended
to estimate a single quantity of interest $F$. Let $X_1, X_2, \dots, X_n$ be (random)
input parameters, which for simplicity will be assumed to be independent.
The model can be viewed as a black box returning the result in response
to input. From this point of view, the model can be described as a single
equation
\[ F = f(X_1, X_2, \dots, X_n). \]
The underlying function $f$, however, is usually so complicated that it cannot
be analysed analytically. We can, however, evaluate the value of this function
for any particular input parameters $x_1, x_2, \dots, x_n$.
The expected value of $F$ (as well as the variance and other characteristics we
may be interested in) is an $n$-dimensional integral and can be evaluated using
the Monte Carlo method as described in the previous sections. Namely, assume
that we are able to generate $M$ tuples of input parameters
$x_{i1}, x_{i2}, \dots, x_{in}$, $i = 1, 2, \dots, M$, which are independent random variates
from the distributions of $X_1, X_2, \dots, X_n$, respectively. Then we can evaluate
\[ f_i = f(x_{i1}, x_{i2}, \dots, x_{in}), \quad i = 1, 2, \dots, M, \]
and estimate the expected value (say) of $F$ as
\[ E[F] \approx \frac{1}{M}\sum_{i=1}^{M} f_i. \]

From §2.2 we know that the statistical error of this estimate decreases as
$O(1/\sqrt{M})$, no matter how complicated the underlying function $f$ is, and how
many input parameters it has.
As we see, in order to apply the Monte Carlo method to stochastic modelling,
we need to be able to generate random numbers from distributions much more
complicated than U (0, 1). We therefore consider the generation of random
numbers from a general distribution in the next section.
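The black-box view of this section translates directly into code. In the Python sketch below the function `model` is a purely hypothetical stand-in for $f$, chosen only so that the exact answer is known, and the inputs are taken to be independent $U(0, 1)$ variates because other distributions have not yet been discussed.

```python
import random

def model(x1, x2, x3):
    """Hypothetical black-box model f; any function of the inputs would do."""
    return x1 + x2 * x3

def monte_carlo_expectation(f, n_inputs, m, rng):
    """Estimate E[F] by averaging f over m independently drawn input tuples."""
    total = 0.0
    for _ in range(m):
        inputs = [rng.random() for _ in range(n_inputs)]  # one tuple x_i1..x_in
        total += f(*inputs)
    return total / m

rng = random.Random(2014)  # arbitrary seed
estimate = monte_carlo_expectation(model, 3, 100_000, rng)
# For this toy model the exact value is 1/2 + 1/4 = 0.75.
print(f"Estimated E[F] = {estimate:.4f}")
```

The estimation routine never looks inside `model`; it only evaluates it, which is exactly the sense in which the method is indifferent to the complexity of $f$.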

2.4 Random number generation
As we have seen in §2.1, the Monte Carlo method is a statistical sampling
technique where one evaluates a non-random quantity as an expectation of
a random variable (see equation (3), for example). In order to apply the
technique it is therefore necessary to use a large number of random numbers
from a specified distribution. Generating these is a fundamental task in
implementing any Monte Carlo approximation.
Truly random numbers are generated from physical processes, such as ther-
mal noise and quantum phenomena, that can be exploited in physical devices.
Although the use of such devices is clearly impractical in nearly all conceiv-
able applications, in 1955 the RAND Corporation published a table of one
million random digits obtained with such a device. Subsets of these can be
incorporated into software and such lists of random numbers are convenient
to use in many applications. However, when implementing the Monte Carlo
method millions of random numbers are often required and problems of peri-
odicity in a finite-length table can therefore arise. For this reason one needs a
better source of random numbers and it is typical to use pseudo-random num-
bers that are generated by random number generating algorithms (RNGs).
These are not truly random in the sense that they are generated from a
small set of initial values (the seed) and a recursive formula which forms the
algorithm, but they approximate the statistical properties of truly random
numbers. In what follows we shall refer to realisations of a random variable
generated by a computer as random variates to distinguish them from truly
random numbers.
RNGs have the great advantage of being easily incorporated into the coding of
Monte Carlo simulations, producing an unlimited supply of random variates
quickly and without resorting to physical means. They have the further
advantage of being reproducible: if the same seed is given at the beginning of
two runs of the RNG, identical sequences of random variates will be produced.
Exact reproductions of individual simulations may be very useful when
debugging the code, and the advantage of having to store only a single seed
rather than a very large sequence of random numbers is clear.

2.4.1 Random variate generation from the uniform distribution


All commonly used computing packages have RNGs incorporated for the
generation of random variates from the uniform distribution $U(0, 1)$, for
example the RAND function in Excel and the rand function in MATLAB. For this
reason details of RNGs for this distribution will not be discussed further here.
It is sufficient to understand that the variates are generated from a seed and
a recursive formula. If further information is required you are invited to
research linear congruential generators or Fibonacci generators as particular
examples of such methods.
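To make the idea of "a seed and a recursive formula" concrete, here is a minimal sketch of a linear congruential generator in Python. The multiplier, increment and modulus are commonly quoted textbook values used purely for illustration; this is not the generator used by Excel or MATLAB, and for real work a library generator should be preferred.

```python
from itertools import islice

def lcg(seed, a=1664525, c=1013904223, modulus=2**32):
    """Linear congruential generator: state -> (a * state + c) mod modulus.

    Yields pseudo-random variates in [0, 1). The constants are
    illustrative defaults, not a recommendation.
    """
    state = seed % modulus
    while True:
        state = (a * state + c) % modulus
        yield state / modulus

# The same seed reproduces exactly the same sequence of variates.
first_run = list(islice(lcg(seed=12345), 5))
second_run = list(islice(lcg(seed=12345), 5))
print(first_run)
print("identical sequences:", first_run == second_run)
```

Reproducibility is the practical point: storing the seed is enough to replay a whole simulation.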
In what follows it is assumed that a source of random variates from the
U (0, 1)-distribution has been established and the question of how to use
these to generate random variates from other distributions is considered.

Example 2.2. Question: Use Excel to compute Monte Carlo estimates
of the integral from Example 2.1 for a range of values of $M$.
Answer: Estimates of the integral that were obtained using Excel's RAND()
function for $M = 10^2$ and $10^3$ are shown below. The actual value is being
approached, but still higher values of $M$ are required in order to obtain
accuracy to the third decimal place.

  M        I_M
  10^2     0.327
  10^3     0.330

Note that each computation using this method will produce a different
random value of $I_M$.

2.4.2 Random variate generation from a specified distribution


We first discuss general methods for the generation of random variates.
Explicit algorithms may exist for the generation of random variates from
particular distributions, as we shall see in §2.4.3 for the standard normal
distribution, but it is useful to have an understanding of how to use these more
general methods.

Inverse transform method


Suppose we require an algorithm for the generation of random variates from
a distribution. The distribution function $F(x)$ necessarily returns a number
in the interval $[0, 1]$ and we denote the inverse function to $F(x)$ as $F^{-1}(y)$,
defined for all $y$ on that interval.
If a random variable $U$ is uniformly distributed over the interval (i.e. $U$ is
from $U(0, 1)$ and $P(U \le x) = x$ for $x \in [0, 1]$), then the random variable
$X = F^{-1}(U)$ has the distribution function $F(x)$. This forms the basis of the
inverse transform method, because
\[ P(X \le x) = P\left(F^{-1}(U) \le x\right) = P(U \le F(x)) = F(x). \]


Therefore, if we require a random variate x from a given distribution, we can
use the following short algorithm:

1. Generate a random variate u from U (0, 1);

2. Return $x = F^{-1}(u)$.

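Example 2.3. Question: A random variable $X$ has a Burr distribution
(with parameters $\alpha, \lambda, \gamma > 0$) if its probability density function is
\[ f(x) = \frac{\alpha\gamma\lambda^{\alpha} x^{\gamma-1}}{(\lambda + x^{\gamma})^{\alpha+1}}, \quad \text{for } x > 0. \]
Generate a random variate from this distribution using the inverse transform
method.
Answer: The distribution function of $X$ is given by
\[ F(x) = P(X \le x) = \int_0^x f(s)\,ds = 1 - \left(\frac{\lambda}{\lambda + x^{\gamma}}\right)^{\alpha}. \]
To find $F^{-1}(u)$ we solve the equation $u = F(x)$ for the variable $x$, leading to
\[ F^{-1}(u) = x = \left(\lambda\left[(1 - u)^{-1/\alpha} - 1\right]\right)^{1/\gamma}. \]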

Therefore, to generate a random variate $x$ from the Burr distribution we can
use the following algorithm:

1. Generate a random variate $u$ from $U(0, 1)$;

2. Return $x = \left(\lambda\left[(1 - u)^{-1/\alpha} - 1\right]\right)^{1/\gamma}$.
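The two-step algorithm of Example 2.3 can be sketched in Python as follows. We assume the parametrisation $F(x) = 1 - (\lambda/(\lambda + x^{\gamma}))^{\alpha}$ for the Burr distribution function; the function names and the parameter values in the demonstration are our own and purely illustrative.

```python
import random

def burr_cdf(x, alpha, lam, gamma):
    """Assumed Burr distribution function F(x) for x > 0."""
    return 1.0 - (lam / (lam + x ** gamma)) ** alpha

def burr_inverse_cdf(u, alpha, lam, gamma):
    """Solve u = F(x) for x, valid for 0 <= u < 1."""
    return (lam * ((1.0 - u) ** (-1.0 / alpha) - 1.0)) ** (1.0 / gamma)

def burr_variate(alpha, lam, gamma, rng):
    """Inverse transform method: draw u from U(0,1), return F^{-1}(u)."""
    u = rng.random()  # lies in [0, 1), so 1 - u is never zero
    return burr_inverse_cdf(u, alpha, lam, gamma)

rng = random.Random(23)            # arbitrary seed
alpha, lam, gamma = 2.0, 3.0, 1.5  # illustrative parameter values
sample = [burr_variate(alpha, lam, gamma, rng) for _ in range(10_000)]

# Check: the proportion of variates below the theoretical median is near 0.5.
median = burr_inverse_cdf(0.5, alpha, lam, gamma)
print(sum(x <= median for x in sample) / len(sample))
```

Comparing empirical proportions with $F$ at a few points, as in the last lines, is a cheap sanity check on any variate generator.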

The main disadvantage of the inverse transform method is the need for either
an explicit expression for the inverse of the distribution function, $F^{-1}(y)$, or
a numerical method to solve $y = F(x)$ for an unknown $x$. This means that
it is impractical for distributions where no explicit expression exists
and the numerical computation (using Newton's method, for example) is too
expensive. For example, to generate a random variate from the standard
normal distribution using the inverse transform method requires the inverse
of the distribution function
\[ F(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt. \]
Since no explicit solution to the equation $u = F(x)$ can be found in this case,
numerical methods must be used.
Using the inverse transform method, it is also possible to generate random
variates from discrete distributions. Let $X$ be a discrete random variable
which can take values $x_1, x_2, \dots, x_N$, where $x_1 < x_2 < \dots < x_N$. The
probability function of $X$ is given by
\[ P(X = x_i) = p_i, \quad \text{for } i = 1, 2, \dots, N, \]
where $p_i > 0$ and $\sum_{i=1}^{N} p_i = 1$. The distribution function of $X$ is therefore
\[ F(x) = P[X \le x] = \sum_{i:\, x_i \le x} p_i. \]
If $x < x_1$ then $\sum_{i:\, x_i \le x} p_i = 0$.


The algorithm that generates random variates $x$ from a discrete distribution
is:
1. Generate $u$ from $U(0, 1)$;
2. Find the positive integer $i$ such that $F(x_{i-1}) < u \le F(x_i)$, taking $F(x_0) = 0$;
3. Return $x = x_i$.

This algorithm can only return variates $x$ from the range $\{x_1, x_2, \dots, x_N\}$,
and the probability that a particular value $x = x_n$ is returned is given by
\[ P[\text{value returned is } x_n] = P[F(x_{n-1}) < U \le F(x_n)] = F(x_n) - F(x_{n-1}) = p_n. \]

Example 2.4. Question: A random variable can take values 0, 1, 2 or 3,
with probabilities 0.027, 0.189, 0.441, 0.343, respectively. Describe how
you would generate random variates from this distribution, using the inverse
transform method and a sequence of random numbers $\{u_i\}$ from $U(0, 1)$.
Answer: The random variates can take values 0, 1, 2 or 3, with distribution
function
\[ F(0) = 0.027, \quad F(1) = 0.216, \quad F(2) = 0.657, \quad F(3) = 1. \]
If you have a random number $u$ then return
0 if $0 \le u \le 0.027$,
1 if $0.027 < u \le 0.216$,
2 if $0.216 < u \le 0.657$,
3 if $0.657 < u \le 1$.
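The lookup in Example 2.4 amounts to a search in the table of cumulative probabilities. A minimal Python sketch, using the standard `bisect` module for the search (the variable and function names are our own):

```python
import bisect
import random
from itertools import accumulate

values = [0, 1, 2, 3]
probabilities = [0.027, 0.189, 0.441, 0.343]

# Cumulative distribution F(x_i); the last entry is forced to exactly 1
# so that rounding cannot leave a gap at the top of the range.
cdf = list(accumulate(probabilities))
cdf[-1] = 1.0

def discrete_variate(u):
    """Return x_i for the smallest i with u <= F(x_i)."""
    return values[bisect.bisect_left(cdf, u)]

rng = random.Random(27)  # arbitrary seed
sample = [discrete_variate(rng.random()) for _ in range(100_000)]
for value, p in zip(values, probabilities):
    print(f"x = {value}: target {p:.3f}, "
          f"observed {sample.count(value) / len(sample):.3f}")
```

The binary search costs $O(\log N)$ per variate, which matters only when the number of possible values $N$ is large.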

Acceptance-Rejection method

[Plot of the density f(x) against x for x between 0 and 5, with a randomly
chosen point (X, Y) under the curve and the abscissa x0 marked.]

Figure 1: Choosing a random point under the graph of f .

This method is best motivated from a visual point of view: consider the den-
sity function f (x) plotted in Figure 1. If any point is selected at random from
the area enclosed by the graph and the horizontal axis, then the point can be
considered a random variable (X, Y ). In particular, the x-coordinate of the
point, X, is a random variable with density function f . This is intuitively
obvious but we can justify this slightly more formally as follows.
The quantity $P(X \le x_0)$ is the probability that the $x$-coordinate of the
randomly chosen point is at most $x_0$, i.e. it is the probability that the point is
within the area to the left of $x_0$. By saying that the point has been selected
randomly, we mean that it was selected uniformly from the region under the
graph, which has total area 1. The probability therefore equals the area under
the density function to the left of $x_0$, namely $\int_{-\infty}^{x_0} f(x)\,dx$.
Differentiating this with respect to $x_0$ we see that the density function of $X$ is $f$.
This reasoning forms the basis of the acceptance-rejection method for
random-variate generation. If it is too difficult to generate points at random from
under the graph of $f$, a reasonable alternative is to generate points from
a larger area which includes the region under the graph of $f$, then discard
any points which are not acceptable. To this end, a simpler density function
$h(x)$ is constructed, from under whose graph it is straightforward to draw
random points and which is such that $f(x)/h(x)$ is bounded for all $x$. Once this
function is found, we define
\[ C = \sup_x \frac{f(x)}{h(x)} \quad \text{and} \quad g(x) = \frac{f(x)}{C\,h(x)} \quad \text{for all } x, \tag{7} \]
so that $0 \le g(x) \le 1$. This construction means that once a point $(x, y)$ is
drawn at random from under the graph of $C h(x)$, the value $g(x)$ gives the
probability that the point also falls under the graph of $f$.
The following algorithm therefore arises for the acceptance-rejection method:

1. Generate a random variate u from U (0, 1);

2. Generate a random variate z from the distribution with density h(x);

3. If $u > g(z)$ then go to step 1 (reject), otherwise return $x = z$ (accept).

The algorithm demonstrates that $C$ is related to the efficiency of the
procedure, as it represents the number of values which must be generated on
average in order to produce a single acceptable value $x$. When choosing $h(x)$
it is desirable that the resulting $C$ is as small as possible.
To demonstrate that the algorithm does actually generate random variates
from the distribution with density function f (x) we need to show that x is a
realisation of a random variable X with the density function f . This is left
as a question at the end of the chapter.

Example 2.5. Question: Using the acceptance-rejection method, generate
a random variate $x$ from the logistic distribution with density function
\[ f(x) = \frac{e^{-x}}{(1 + e^{-x})^2} \quad \text{for } -\infty < x < \infty. \]

[Plot of the density f(x) and the envelope Ch(x) against x for x between
-10 and 10.]

Figure 2: Acceptance-rejection sampling for the logistic function.

Answer: Inspection of the behaviour of $f$ at 0 and at $\pm\infty$ suggests that the
double exponential density may be appropriate (see Figure 2), with
\[ h(x) = \frac{1}{2} e^{-|x|}. \]
In order to find $C$ we first consider the ratio $f(x)/h(x)$,
\[ \frac{f(x)}{h(x)} = \frac{2}{\left(1 + e^{-|x|}\right)^2}. \]
According to equation (7), $C$ is the supremum of this ratio,
which is 2 (approached as $|x| \to \infty$), and this leaves
\[ g(x) = \frac{f(x)}{C\,h(x)} = \left(1 + e^{-|x|}\right)^{-2}. \]

The algorithm for generation of random variates $x$ from the logistic
distribution is therefore given by

1. Generate $u$ from $U(0, 1)$;

2. Generate $z$ from the double exponential distribution (see end of chapter
questions for this);

3. If $u > g(z)$ go to step 1, otherwise return $x = z$.
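The three steps above can be sketched in Python as follows. The generator for the double exponential distribution is our own assumption (an exponential variate with a random sign, itself obtained by the inverse transform method), since the notes leave it to the end of chapter questions; the acceptance probability is $g(z) = (1 + e^{-|z|})^{-2}$ as derived in the answer.

```python
import math
import random

def double_exponential_variate(rng):
    """Assumed sampler for h(x) = exp(-|x|)/2: an Exp(1) variate with random sign."""
    magnitude = -math.log(1.0 - rng.random())  # 1 - u lies in (0, 1]
    return magnitude if rng.random() < 0.5 else -magnitude

def acceptance_probability(z):
    """g(z) = f(z) / (C h(z)) with C = 2 for the logistic density."""
    return (1.0 + math.exp(-abs(z))) ** -2

def logistic_variate(rng):
    """Acceptance-rejection: repeat steps 1-3 until a candidate is accepted."""
    while True:
        u = rng.random()                     # step 1
        z = double_exponential_variate(rng)  # step 2
        if u <= acceptance_probability(z):   # step 3: accept, else try again
            return z

rng = random.Random(25)  # arbitrary seed
sample = [logistic_variate(rng) for _ in range(10_000)]
# The logistic distribution function is 1 / (1 + exp(-x)).
print("proportion below 1:", sum(x <= 1.0 for x in sample) / len(sample))
print("theoretical value: ", 1.0 / (1.0 + math.exp(-1.0)))
```

On average about C = 2 candidates are generated per accepted variate, which is the efficiency cost discussed before the example.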

The acceptance-rejection method is best considered in the visual sense
described above and is typically used only for continuous probability
distributions. Discrete applications can be contrived, but are beyond the scope of
this course.

2.4.3 Random variate generation from the standard normal distribution

Many applications require random variates drawn from the normal
distribution, $N(\mu, \sigma^2)$. This is particularly true in financial applications where the
Wiener process is fundamental. The two general methods discussed in §2.4.2
are of little practical use for producing such random variates, as they rely on
the manipulation of the density or distribution function, which is not easy to
work with in this case. Fortunately, there exist explicit methods for generating
standard normally distributed variates from uniformly distributed variates. We
shall see three such methods in this section. Each method is based on the
normal distribution's particular properties, but an understanding of the
mathematics behind each method is beyond the scope of the syllabus. The
algorithms for the methods are therefore stated without justification.
Note that the scaling property of the normal distribution means that any
standard normally distributed variate, $Z \sim N(0, 1)$, generated from these
methods is easily transformed to a normally distributed variate, $X \sim N(\mu, \sigma^2)$,
via $X = \mu + \sigma Z$.

Box-Muller algorithm
The following algorithm can be used for generating a pair of independent random variates from the standard normal distribution, $z_1$, $z_2$:
1. Generate two random variates from U(0, 1), $u_1$ and $u_2$;

2. Return $z_1 = \sqrt{-2 \ln u_1}\,\cos(2\pi u_2)$ and $z_2 = \sqrt{-2 \ln u_1}\,\sin(2\pi u_2)$.
The Box-Muller method is easy to incorporate into computer code; however, it suffers from the disadvantage that the computation of the sin and cos functions is time consuming. For this reason an alternative formulation of this method is generally preferred when very large numbers of random variates are required, as, for example, in the Monte Carlo method. This alternative is called the Polar algorithm.
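
As an illustration, the Box-Muller transformation can be sketched in a few lines of Python. The function name is our own choice, and the input pair is the first pair of uniforms used in Example 2.6.

```python
import math


def box_muller(u1, u2):
    """Map two U(0, 1) variates (u1 must be non-zero) to two N(0, 1) variates."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


z1, z2 = box_muller(0.587, 0.155)   # first pair of uniforms in Example 2.6
x1, x2 = 1 + 2 * z1, 1 + 2 * z2     # rescaled to N(1, 4) via X = mu + sigma * Z
```

Rounded to three decimal places, z1 and z2 agree with the first row of the Box-Muller table in Example 2.6.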

Polar algorithm
The Polar algorithm is very similar to the Box-Muller method in its justification (not discussed here) but is modified through use of the acceptance-rejection method to avoid computation of the trigonometric functions.
The Polar algorithm is as follows:
1. Generate two random variates from U(0, 1), $u_1$ and $u_2$;

2. Set $\nu_1 = 2u_1 - 1$, $\nu_2 = 2u_2 - 1$ and $s = \nu_1^2 + \nu_2^2$;

3. If $s > 1$ go to step 1. Otherwise, return $z_1 = \sqrt{\frac{-2 \ln s}{s}}\,\nu_1$ and $z_2 = \sqrt{\frac{-2 \ln s}{s}}\,\nu_2$.

As with the Box-Muller method, the Polar method generates a pair of independent random variates from the standard normal distribution.
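
A possible Python sketch of the Polar algorithm is given below. The function names are illustrative, and the extra check for s = 0 is our own addition (it guards against taking the logarithm of zero) rather than part of the algorithm as stated.

```python
import math
import random


def polar_pair(u1, u2):
    """Apply steps 2-3 to one pair of uniforms; return None if the pair is rejected."""
    v1 = 2.0 * u1 - 1.0
    v2 = 2.0 * u2 - 1.0
    s = v1 * v1 + v2 * v2
    if s > 1.0 or s == 0.0:  # s == 0 is an added guard against log(0)
        return None
    factor = math.sqrt(-2.0 * math.log(s) / s)
    return factor * v1, factor * v2


def sample_polar(rng):
    """Repeat from step 1 until a pair is accepted."""
    while True:
        pair = polar_pair(rng.random(), rng.random())
        if pair is not None:
            return pair


accepted = polar_pair(0.587, 0.155)   # first pair of Example 2.6
rejected = polar_pair(0.048, 0.224)   # third pair of Example 2.6, s > 1
```

Here `accepted` reproduces the first row of the Polar table in Example 2.6 to three decimal places, while `rejected` is `None` because s exceeds 1 for that pair.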

Example 2.6. Question: A computer program generating random numbers from the U(0, 1) distribution has returned the values 0.587, 0.155, 0.030, 0.447, 0.048, 0.224, 0.593, 0.478, 0.165 and 0.113. Use the Box-Muller and Polar algorithms to generate variates from the N(1, 4) distribution, based on these data.
Answer: Standard normal variates are computed in pairs using the above algorithms and each is transformed via X = 1 + 2Z to obtain the N(1, 4)-variates. Results for the Box-Muller and Polar algorithms are shown below:
u1      u2      Z1(BM)   Z2(BM)   X1(BM)   X2(BM)
0.587   0.155    0.580    0.854    2.160    2.708
0.030   0.447   -2.503    0.866   -4.006    2.732
0.048   0.224    0.401    2.432    1.802    5.864
0.593   0.478   -1.013    0.141   -1.026    1.282
0.165   0.113    1.440    1.237    3.880    3.474

u1      u2      Z1(P)    Z2(P)    X1(P)    X2(P)
0.587   0.155    0.285   -1.131    1.570   -1.262
0.030   0.447   -0.468   -0.053    0.064    0.894
0.048   0.224    N/A      N/A      N/A      N/A
0.593   0.478    2.504   -0.592    6.008   -0.184
0.165   0.113    N/A      N/A      N/A      N/A
Here Zi and Xi denote the N(0, 1)- and N(1, 4)-variates, respectively, computed using the Box-Muller (BM) and Polar (P) methods; N/A marks a pair rejected by the Polar algorithm because s > 1.
Note that the Box-Muller method produced 10 variates but the Polar method produced only 6 from the same set of U(0, 1)-variates. This is a consequence of the acceptance-rejection method incorporated into the Polar method. If 10 variates were required from the Polar method, more pairs of U(0, 1)-variates would have to be generated.

Approximate method
When the exact distribution of the variates is not important, a method often used is to generate a sequence of U(0, 1)-variates, $u_1, u_2, \ldots, u_{12}$, and set $z = \sum_{i=1}^{12} u_i - 6$. The resulting variate has mean 0 and variance 1 and so, by the Central Limit Theorem, is approximately normally distributed. This is called the approximate method. If M variates are required from this method, 12M variates from the U(0, 1)-distribution are required.
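
The approximate method is short enough to state directly in code. The sketch below uses Python's standard `random` module; the function name and the seed are our own choices.

```python
import random


def approx_standard_normal(rng):
    """Sum of 12 U(0, 1) variates minus 6: mean 0, variance 1, approximately normal."""
    return sum(rng.random() for _ in range(12)) - 6.0


rng = random.Random(12)  # arbitrary seed
zs = [approx_standard_normal(rng) for _ in range(20000)]
sample_mean = sum(zs) / len(zs)
sample_var = sum((z - sample_mean) ** 2 for z in zs) / len(zs)
```

Note that every variate produced this way lies in the interval [-6, 6], so the tails of the normal distribution are not reproduced; this is one reason the method is only approximate.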

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.

Faculty and Institute of Actuaries, CT8 Core Reading;

G. N. Milstein & M. V. Tretyakov, Stochastic Numerics for Mathematical Physics;

P. Glasserman, Monte Carlo Methods in Financial Engineering.

2.5 Summary
The Monte Carlo method is a statistical sampling technique where one evaluates a non-random quantity as an expectation of a random variable.
The method can be applied to the approximation of deterministic integrals:
$$I = \int_0^1 g(x) f(x)\,dx = E[g(\xi)] \approx I_M = \frac{1}{M} \sum_{m=1}^{M} g(\xi_m),$$
where $\{\xi_m\}$ are M random variates from the U(0, 1) distribution.

As $M \to \infty$, $I_M \to I$ with probability 1.
In this setting the method produces zero bias and has statistical error of $O(M^{-1/2})$. This is unchanged by integration domain and dimension of the problem.
Random variates from the U(0, 1) distribution can be obtained from RNGs incorporated in most standard software packages.
The inverse-transform and acceptance-rejection algorithms can be used to generate random variates for most distributions from U(0, 1)-variates.
Variates from the standard normal distribution can be obtained from the Box-Muller and Polar algorithms. The approximate method allows approximately standard normal distributed variates to be generated easily. These are often also incorporated into standard software packages.
The Monte Carlo method is the preferred numerical method for high-dimensional problems that typically arise in actuarial modelling.
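
To make the first points of the summary concrete, here is a minimal Python sketch of Monte Carlo integration over [0, 1]. The function name, the seed and the test integrand $g(x) = x^2$ (whose exact integral is 1/3) are illustrative assumptions, not taken from the text.

```python
import math
import random


def mc_integrate(g, m, rng):
    """Estimate the integral of g over [0, 1] from m U(0, 1) variates.

    Returns the estimate I_M and its estimated standard error.
    """
    values = [g(rng.random()) for _ in range(m)]
    estimate = sum(values) / m
    sample_var = sum((v - estimate) ** 2 for v in values) / (m - 1)
    return estimate, math.sqrt(sample_var / m)


rng = random.Random(42)  # arbitrary seed
for m in (100, 10_000):
    estimate, std_err = mc_integrate(lambda x: x * x, m, rng)
    print(m, round(estimate, 4), round(std_err, 4))
```

Increasing M by a factor of 100 should reduce the standard error by a factor of roughly 10, in line with the $O(M^{-1/2})$ statistical error.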

Questions

1. Use Excel's RAND() function to generate 30 U(0, 1)-random variates. Use them to estimate the integral of $f(x) = x e^{2x} + 1$ over the domain $x \in [0, 1]$.

2. Obtain the 95%-confidence interval for the Monte Carlo estimate of the
integral in question 1, based on M random simulations.

3. Write an algorithm to generate random variates from the double exponential distribution with density function $f(x) = \frac{1}{2} e^{-|x|}$ for $x \in \mathbb{R}$, using the inverse-transform method.

4. Demonstrate that the acceptance-rejection method algorithm does generate random variates from the distribution with density function f(x).

5. Estimate the proportion of computed numbers that would be rejected from the acceptance-rejection method if $f(x) = \frac{3}{32}(1 - x)(x - 5)$ for $1 \le x \le 5$ and h(x) is the density function for the U(1, 5) distribution.

6. Use Excel's RAND() function to generate 6 U(0, 1)-random variates. Use them to generate 5 random variates from the N(10, 16) distribution.

Chapter 3
Probability theory and stochastic processes
The aim of this chapter is to give a review of probability theory and stochastic processes. This background material is necessary for understanding the models that will be developed in later chapters. This chapter is written quite formally in terms of definitions and theorems and is therefore of a different style to nearly all chapters you have seen in previous modules. This has been done so that you are able to look up key definitions and concepts if required when the ideas are applied in later chapters. Although a certain amount of revision is contained here, the focus is on the rigorous development of the stochastic process in the general sense, which includes both discrete and continuous models.
The rigorous development of probability theory and stochastic processes can be very technical. Although we have attempted to avoid unnecessary technicalities in this text, a certain level of technicality is unavoidable. As this chapter is intended as background material it is most important that you understand the material on the intuitive level. However, it is also important that you have a level of technical knowledge sufficient to solve quantitative problems when necessary.
We begin with the formal development of probability theory before moving
on to stochastic processes.

3.1 Probability theory


3.1.1 Probability measure and probability space
Intuitively, a random variable is a function with random values. For example, it may be the lifetime of an individual, or the number of car accidents in the next year. To model such randomness mathematically it is convenient to assume that a random variable is a function $\xi$ defined on the space of all possible states of the world, $\Omega$. The randomness comes from the fact that we do not know exactly which state $\omega \in \Omega$ we are in.
As you already know, in the discrete case a probability space is a triple $(\Omega, \mathcal{F}, P)$, where the set $\Omega = \{\omega_1, \omega_2, \ldots\}$ consists of the elementary events, $\omega_k$, which can be thought of as possible states of the world. The set $\mathcal{F}$ consists of all subsets of $\Omega$ and the probability measure $P$ maps $\mathcal{F}$ into [0, 1].
It is necessary that $P$ has the properties

$$P(\Omega) = 1, \quad P(\emptyset) = 0,$$
$$P(A^c) = 1 - P(A) \quad \forall A \in \mathcal{F},$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \quad \forall A, B \in \mathcal{F}.$$

Note that $\emptyset$ is the impossible event, A and B are possible events and the superscript c denotes the complement of an event.
This definition allows one to construct random variables with countably many possible outcomes. However, in many cases it is necessary to assume that a random variable can take any real value, for example, if it measures distance or temperature. This requires one to consider a continuum of events in a continuous probability space.
To include this possibility of uncountably many outcomes arising from a continuous probability space, we need to extend the above definition to the case of uncountable $\Omega$.
Further, on a discrete probability space one cannot construct an infinite sequence of independent events, such as an infinite number of coin tosses. Indeed, if $\Omega$ represents all possible outcomes of an infinite number of coin tosses, then
$$\Omega = \{\omega : \omega = (a_1, a_2, \ldots),\ a_i \in \{0, 1\}\}$$

(where $a_i$ is the result of the ith toss: 1 if the coin lands head up; 0 otherwise). This set is already uncountable, i.e. its elements cannot be enumerated by integers as $\Omega = \{\omega_k\}_{k=1}^{\infty}$.

The concept of uncountable events is therefore important in continuous and some discrete probability spaces. We now use an example to demonstrate that on an uncountable $\Omega$ our intuition from countable $\Omega$ fails.

Example 3.1. Suppose that we toss a fair coin (i.e. the probability of a head is equal to the probability of a tail, and so these probabilities are equal to 1/2). Let

$$\Omega_N := \{\omega : a_1 = \cdots = a_N = 1\}$$

be the event that all of the first N coins land head up. It is natural to assume that $P(\Omega_N) = 2^{-N}$, since the event $\Omega_N$ corresponds to one of the $2^N$ equally likely outcomes of the first N tosses. Therefore, for
$$\Omega_\infty := \{\omega = (a_1, a_2, \ldots) : a_i = 1 \ \ \forall i = 1, 2, \ldots\}$$

we must have $P(\Omega_\infty) \le P(\Omega_N) = 2^{-N}$ for all N, since $\Omega_\infty \subset \Omega_N$. It therefore must be that $P(\Omega_\infty) = 0$.
With the same logic we get $P(\omega) = 0$ for every $\omega \in \Omega$. However, $P(\Omega) = 1$ and $\Omega = \bigcup_{\omega \in \Omega} \{\omega\}$, and the natural desire to use our intuition that
$$P(\Omega) = \sum_{\omega \in \Omega} P(\omega) = \sum_{\omega \in \Omega} 0 = 0$$

leads to a logical contradiction.

The above example demonstrates a technical difficulty which turns out to be general in constructing uncountable probability spaces. For countable $\Omega$ it is sufficient to define the probability measure $P$ on each $\omega \in \Omega$ in such a way that
$$\sum_{\omega \in \Omega} P(\omega) = 1,$$

and then we can naturally set

$$P(A) := \sum_{\omega \in A} P(\omega) \quad \text{for all } A \subset \Omega.$$

In the case of uncountable $\Omega$, however, we cannot define $P$ on each $\omega$ and then extend it.
To overcome this difficulty we need the probability measure to be defined on some collection $\mathcal{F}$ of subsets of $\Omega$ rather than directly on all $\omega \in \Omega$.
Before we consider the formal definition of a probability space it is necessary to define the notion of a $\sigma$-algebra.

Definition 3.1. Let $\Omega$ be a non-empty set and $\mathcal{F}$ be a collection of subsets of $\Omega$. We say that $\mathcal{F}$ is a $\sigma$-algebra if it satisfies the following properties:

1. $\emptyset, \Omega \in \mathcal{F}$;

2. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ and $\bigcap_{n=1}^{\infty} A_n \in \mathcal{F}$;

3. If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$.

The key feature of the above definition is that a $\sigma$-algebra is closed under countably infinite unions and intersections, but it need not be closed under uncountable unions.
We can now give a formal definition for the notion of a probability space.

Definition 3.2. Let $\Omega$ be a non-empty set and $\mathcal{F}$ be some $\sigma$-algebra of subsets of $\Omega$. Suppose that a function $P : \mathcal{F} \to [0, 1]$ satisfies
1. $P(\emptyset) = 0$, $P(\Omega) = 1$;
2. $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$, provided $A_1, A_2, \ldots \in \mathcal{F}$ is a collection of disjoint sets, i.e. $A_i \cap A_j = \emptyset$ for all $i \ne j$.

Then we call such $P$ a probability measure and the triple $(\Omega, \mathcal{F}, P)$ a probability space. The subsets of $\Omega$ belonging to $\mathcal{F}$ are called events.

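From these properties of a probability measure, we can easily deduce some other naturally desirable properties, such as $P(E) + P(E^c) = 1$ for all $E \in \mathcal{F}$ and $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ for all $A, B \in \mathcal{F}$.
An important feature is that the second property of the probability measure allows one to deduce the following continuity properties:

If $A_1, A_2, \ldots \in \mathcal{F}$, then $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{N \to \infty} P\left(\bigcup_{n=1}^{N} A_n\right)$.

If $B_1 \supset B_2 \supset B_3 \supset \cdots$ with all $B_n \in \mathcal{F}$, then $P\left(\bigcap_{n=1}^{\infty} B_n\right) = \lim_{n \to \infty} P(B_n)$.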

An important example of a probability space is the so-called standard probability space $([0, 1], \mathcal{B}, \lambda)$, where $\mathcal{B}$ is the Borel $\sigma$-algebra, i.e. the smallest $\sigma$-algebra containing all the intervals $(a, b]$, and $\lambda$ denotes the Lebesgue measure. Roughly speaking, the Borel $\sigma$-algebra consists of all sets which can be produced by taking countable intersections and unions of intervals of the type $(a, b)$, $(a, b]$, $[a, b)$, $[a, b] \subset [0, 1]$, and the Lebesgue measure is a natural extension of the functional which assigns to each interval its length,

$$\lambda((a, b]) = b - a,$$

to the whole of $\mathcal{B}$. There is a theorem which ensures that any countable structure of random events or random variables can be constructed on this standard probability space.

Example 3.2. The results of tossing the coin in Example 3.1 an infinite (but countable) number of times can be constructed as

$$\xi_i(\omega) := \begin{cases} 1, & \text{if the last digit of the integer part of } 2^i \omega \text{ in the binary system equals 1,} \\ 0, & \text{otherwise} \end{cases}$$

for all $\omega \in [0, 1]$; in other words, $\xi_i(\omega)$ is the ith digit of the binary expansion of $\omega$. We can say that the ith coin lands head up if $\xi_i = 1$, and tail up if $\xi_i = 0$. It is easy to see that $P(\xi_i = 0) = P(\xi_i = 1) = 1/2$ for all $i = 1, 2, \ldots$. Further, $\{\xi_i\}_{i=1}^{\infty}$ is a system of mutually independent random variables (see the definition of independence in Section 3.1.4 below), which makes the model equivalent to the infinite sequence of coin tosses (so-called Bernoulli trials).
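
The construction of coin tosses from the binary digits of a single point of [0, 1] can be tried out numerically. The following Python sketch is illustrative only: the function name and the seed are our own, and because a floating-point number carries only about 53 binary digits, only the first few dozen tosses obtained this way are meaningful.

```python
import math
import random


def coin_toss(omega, i):
    """Return the i-th binary digit of omega, i = 1, 2, ... (the i-th toss)."""
    return math.floor(2 ** i * omega) % 2


# omega = 0.625 is 0.101 in binary, so the first tosses are 1, 0, 1, 0.
tosses = [coin_toss(0.625, i) for i in range(1, 5)]

# Empirical check that a toss is fair when omega is uniform on [0, 1).
rng = random.Random(0)  # arbitrary seed
omegas = [rng.random() for _ in range(20000)]
freq_heads_third_toss = sum(coin_toss(w, 3) for w in omegas) / len(omegas)
```

The observed frequency of heads on the third toss should be close to 1/2.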

Recall that a Bernoulli random variable is a random variable assuming just two different values with probabilities $p \in (0, 1)$ and $q := 1 - p$ respectively. Typically the two values are $\{0, 1\}$ or $\{-1, 1\}$. In these cases the random variable is called a standard Bernoulli variable. A sequence of independent standard Bernoulli variables is called a sequence of Bernoulli trials.

3.1.2 Random variables, expectation and variance


We begin with the formal mathematical definitions. Given a probability space $(\Omega, \mathcal{F}, P)$ we call a mapping

$$\xi : \Omega \to \mathbb{R}$$

a random variable if it is measurable, i.e. satisfies

$$\{\omega : \xi(\omega) \le r\} \in \mathcal{F} \quad \text{for all } r \in \mathbb{R}.$$

(Equivalently, the preimage under $\xi$ of every Borel set belongs to $\mathcal{F}$. Borel sets are the elements of the Borel $\sigma$-algebra, the smallest $\sigma$-algebra containing all open sets; see above.)
We may interpret a random variable as a number produced by an experiment, such as the temperature or the oil price tomorrow. The mathematical formalism allows us to be precise. In particular, the property of measurability is important to make sense of the probability of the event that a random variable does not exceed a threshold: $P(\xi \le r) := P(\{\omega : \xi(\omega) \le r\})$.
We say that the random variables $\xi$ and $\eta$ are equal almost surely (a.s.) if

$$P(\xi = \eta) := P(\{\omega : \xi(\omega) = \eta(\omega)\}) = 1.$$

We need this technical feature because on a continuous probability space we have $P(\omega) = 0$ for all $\omega \in \Omega$, and therefore if we change a random variable at a point, nothing really changes.

Example 3.3. As an example take $([0, 1], \mathcal{B}, \lambda)$ and the random variables $\xi, \eta : [0, 1] \to \mathbb{R}$, given by $\xi \equiv 0$ and $\eta(\omega) = 0$ for $\omega \ne 1/2$; $\eta(1/2) = 10$. These random variables are equal almost surely (denoted as $\xi = \eta$ a.s.).

Often the term "almost surely" is omitted since in almost all probability applications everything is defined up to a set of probability zero.

Example 3.4. The simplest random variables are indicators. For a set (event) $A \in \mathcal{F}$, its indicator is defined by

$$I_A(\omega) := \begin{cases} 1, & \omega \in A, \\ 0, & \omega \in A^c. \end{cases}$$

For example, if A is the event that tomorrow there is a thunderstorm, we set $I_A$ to 1 if a thunderstorm is really going to take place and 0 otherwise. We define the expectation of an indicator $I_A$ by

$$E[I_A] := P(A).$$

In a natural way we can use indicators to define simple random variables, representable as
$$\xi = \sum_{k=1}^{N} a_k I_{A_k}, \qquad (8)$$

for some events $A_k$ and real numbers $a_k$, $k = 1, \ldots, N$. Here we define

$$E[\xi] := \sum_{k=1}^{N} a_k E[I_{A_k}] = \sum_{k=1}^{N} a_k P(A_k). \qquad (9)$$

The expectation of a random variable is the average of its possible values weighted according to their probabilities. Sometimes in the literature, the expectation is also called the mathematical expectation or mean.
To calculate the average of possible values for a general random variable $\xi$ (not necessarily of the form (8)), the summation in (9) is replaced by integration. By definition, the expectation of a random variable $\xi$ is its integral over $\Omega$ with respect to $P$, so that
$$E[\xi] := \int_\Omega \xi \, dP = \int_\Omega \xi(\omega) \, dP(\omega).$$

It is easy to check that for a random variable of the form (8) this integral reduces to (9).
If $E[|\xi|]$ exists and is finite, then $\xi$ is called integrable. The class of all integrable random variables is denoted by $L^1(\Omega, \mathcal{F}, P)$ or just $L^1$. In particular, all bounded random variables are integrable. (A random variable $\xi$ is called bounded if there exists a constant M such that $P(|\xi| < M) = 1$.)
The key property of the expectation is linearity:

$$E[a\xi + b\eta] = aE[\xi] + bE[\eta] \quad \text{for all } \xi, \eta \in L^1 \text{ and constants } a, b \in \mathbb{R}.$$

The variance of a random variable $\xi$ is defined by

$$\mathrm{Var}(\xi) := E[(\xi - E[\xi])^2] = E[\xi^2] - E[\xi]^2.$$

Not every $\xi \in L^1$ has a finite variance. The class of random variables with finite variance is denoted by $L^2(\Omega, \mathcal{F}, P)$ or just $L^2$. A random variable with finite variance is called square-integrable. Variance is always non-negative and is equal to zero only for random variables that are (almost surely) constant.
The square root of the variance, $\sigma(\xi) = \sqrt{\mathrm{Var}(\xi)}$, is called the standard deviation of $\xi$.
Note that

$$\mathrm{Var}(a\xi) = a^2 \mathrm{Var}(\xi), \quad \sigma(a\xi) = |a|\,\sigma(\xi) \quad \text{for all } \xi \in L^2 \text{ and constants } a \in \mathbb{R}.$$

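For two random variables $\xi, \eta \in L^2$ their covariance is defined by

$$\mathrm{Cov}(\xi, \eta) := E[(\xi - E[\xi])(\eta - E[\eta])] = E[\xi\eta] - E[\xi]E[\eta].$$

For a sum of two random variables we have

$$\mathrm{Var}(\xi + \eta) = E\big[\big((\xi - E[\xi]) + (\eta - E[\eta])\big)^2\big]$$
$$= E[(\xi - E[\xi])^2] + 2E[(\xi - E[\xi])(\eta - E[\eta])] + E[(\eta - E[\eta])^2]$$
$$= \mathrm{Var}(\xi) + 2\mathrm{Cov}(\xi, \eta) + \mathrm{Var}(\eta).$$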

Covariance is equal to zero if the random variables are independent (see Section 3.1.4 for the definitions of independence). If the covariance is close to zero, random variables are sometimes considered as "almost independent".
Sometimes it is convenient to normalise the covariance. For non-constant random variables $\xi$ and $\eta$, define the correlation of $\xi$ and $\eta$ by
$$\mathrm{Corr}(\xi, \eta) := \frac{\mathrm{Cov}(\xi, \eta)}{\sqrt{\mathrm{Var}(\xi)}\sqrt{\mathrm{Var}(\eta)}}.$$
The Cauchy-Schwarz inequality says
$$(E[\xi\eta])^2 \le E[\xi^2]E[\eta^2] \quad \text{for all } \xi, \eta \in L^2,$$

and one can easily deduce that

$$-1 \le \mathrm{Corr}(\xi, \eta) \le 1 \quad \text{for all non-constant } \xi, \eta \in L^2.$$

The correlation equals 1 if and only if $\xi = a\eta + b$ with some constants $a > 0$ and $b$.

3.1.3 Probability distributions
Given a random variable $\xi$ we define its cumulative distribution function by

$$F_\xi(x) := P(\xi \le x).$$

Clearly the distribution function is non-decreasing in x and $\lim_{x \to -\infty} F_\xi(x) = 0$, $\lim_{x \to +\infty} F_\xi(x) = 1$.

A random variable $\xi$ is called continuously distributed if its distribution function is continuous. In this case it can be represented as
$$F_\xi(x) = \int_{-\infty}^{x} \rho_\xi(z)\,dz$$

for some non-negative function $\rho_\xi$, which is called the probability density function of $\xi$.
The expectation of a random variable can be calculated using its distribution function or its density (if the latter exists) by
$$E[\xi] = \int_{-\infty}^{\infty} x\,dF_\xi(x) = \int_{-\infty}^{\infty} x\,\rho_\xi(x)\,dx.$$

Or more generally, if for some random variable $\xi$ and function $f : \mathbb{R} \to \mathbb{R}$ the expectation $E[f(\xi)]$ exists, then
$$E[f(\xi)] = \int_{-\infty}^{\infty} f(x)\,dF_\xi(x) = \int_{-\infty}^{\infty} f(x)\,\rho_\xi(x)\,dx.$$

In particular, for the variance we have

$$\mathrm{Var}(\xi) = \int_{-\infty}^{\infty} (x - E[\xi])^2\,dF_\xi(x) = \int_{-\infty}^{\infty} (x - E[\xi])^2\,\rho_\xi(x)\,dx.$$

Note that quite different random variables $\xi \ne \eta$ can have identical probability distributions $F_\xi = F_\eta$. In this case we say that $\xi$ and $\eta$ are identically distributed (i.d.). It can be shown that the random variables $\xi$ and $\eta$ are identically distributed if and only if $E[f(\xi)] = E[f(\eta)]$ for any $f : \mathbb{R} \to \mathbb{R}$ for which these expectations exist.

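Example 3.5. Question: We say that a random variable X has the exponential distribution with parameter $\lambda > 0$, and write $X \sim \mathrm{Exp}(\lambda)$, if its probability density function is

$$\rho_X(x) = \lambda e^{-\lambda x} \quad \text{for } x > 0.$$

Calculate E[X] and Var(X).
Answer: The expectation is calculated as
$$E[X] = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$
The variance is calculated as
$$\mathrm{Var}(X) = \int_0^\infty \left(x - \frac{1}{\lambda}\right)^2 \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda^2}.$$

Some elementary calculus was required in each case.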
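
These two results can be checked by simulation. The sketch below generates exponential variates with the inverse-transform method of Chapter 2 and compares the sample mean and variance with $1/\lambda$ and $1/\lambda^2$; the function name, the seed and the choice $\lambda = 2$ are illustrative assumptions.

```python
import math
import random


def sample_exponential(lam, rng):
    """Inverse-transform draw from Exp(lam): solve u = 1 - exp(-lam * x) for x."""
    u = rng.random()                  # u in [0, 1), so 1 - u is in (0, 1]
    return -math.log(1.0 - u) / lam


lam = 2.0                             # illustrative parameter value
rng = random.Random(3)                # arbitrary seed
xs = [sample_exponential(lam, rng) for _ in range(50000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
```

With $\lambda = 2$ the sample mean should be close to 0.5 and the sample variance close to 0.25.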

For several random variables $\xi_1, \ldots, \xi_d$ their joint distribution function is defined as

$$F_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d) = P[\xi_1 \le x_1, \xi_2 \le x_2, \ldots, \xi_d \le x_d].$$

If it can be represented as
$$F_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} \rho_{\xi_1, \ldots, \xi_d}(z_1, \ldots, z_d)\,dz_1 \ldots dz_d$$

for some non-negative function $\rho_{\xi_1, \ldots, \xi_d}(z_1, \ldots, z_d)$, the latter is called the joint probability density function of $\xi_1, \ldots, \xi_d$.

If for some random variables $\xi_1, \ldots, \xi_d$ and function $f : \mathbb{R}^d \to \mathbb{R}$ the expectation $E[f(\xi_1, \ldots, \xi_d)]$ exists, then
$$E[f(\xi_1, \ldots, \xi_d)] = \int_{\mathbb{R}^d} f(x_1, \ldots, x_d)\,\rho_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d)\,dx_1 \ldots dx_d.$$

3.1.4 Independence

The notion of independence is one of the most important in Probability Theory. Intuitively we would like to call two events or random variables independent if there is no mutual dependency. For example, the outcomes of tossing a coin and of rolling a die are independent.
Recall that two events A and B are called independent if

$$P(A \cap B) = P(A)P(B).$$

This property is consistent with the intuition of independence.

For random variables we have the following definition.

Definition 3.3. Random variables $\xi$ and $\eta$ are called independent if the events
$$A = \{\omega : a < \xi(\omega) < b\}$$
$$B = \{\omega : c < \eta(\omega) < d\}$$
are independent for all real numbers a, b, c, d.

However, it is not sufficient to define independence only for pairs of events or random variables. It is possible that all the pairs of events (A, B), (A, C) and (B, C) are independent, but the triple (A, B, C) is not mutually independent.

Example 3.6. Consider a non-traditional die with four faces. We number the first three faces with numbers 1, 2 and 3 respectively, and on the fourth face we put all three numbers 1, 2 and 3. Now let us throw the die, and let
A be the event that 1 is on the face down;
B be the event that 2 is on the face down;
C be the event that 3 is on the face down.
Simple logic says that A depends on (B, C), since if B and C happen simultaneously then A happens too with probability one. But one can easily check that all the pairs (A, B), (A, C) and (B, C) are independent (see the question at the end of the chapter).
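
The claim in Example 3.6 can be verified by direct enumeration of the four equally likely faces. The following Python sketch uses exact fractions; the helper name `prob` and the variable names are our own.

```python
from fractions import Fraction

# Each face is the set of numbers written on it; all four faces are equally likely.
FACES = [{1}, {2}, {3}, {1, 2, 3}]


def prob(*numbers):
    """Probability that every number in `numbers` is on the face that lands down."""
    hits = sum(1 for face in FACES if all(n in face for n in numbers))
    return Fraction(hits, len(FACES))


p_a, p_b, p_c = prob(1), prob(2), prob(3)
pairwise_independent = (
    prob(1, 2) == p_a * p_b
    and prob(1, 3) == p_a * p_c
    and prob(2, 3) == p_b * p_c
)
mutually_independent = pairwise_independent and prob(1, 2, 3) == p_a * p_b * p_c
```

Each single event has probability 1/2 and each pair has probability 1/4, so the pairs are independent, but all three events occur together with probability 1/4 rather than 1/8.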

Definition 3.4. A collection of events is called mutually independent if for every finite subset $A_1, \ldots, A_n$ of the collection we have

$$P(A_1 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i).$$

Similarly, we may define mutual independence for random variables.

Definition 3.5. Random variables $\xi_1, \ldots, \xi_n$ are called mutually independent if the events
$$A_1 = \{\omega : a_1 < \xi_1(\omega) < b_1\}$$
$$\vdots$$
$$A_n = \{\omega : a_n < \xi_n(\omega) < b_n\}$$

are mutually independent for all real numbers $\{a_k, b_k\}_{k=1}^{n}$. An infinite set of random variables $\{\xi_\alpha\}$ is called mutually independent if any finite subset $\{\xi_{\alpha_1}, \ldots, \xi_{\alpha_n}\}$ is mutually independent.

This definition can be reformulated in terms of the joint probability distribution. The random variables $\xi_1, \ldots, \xi_d$ are mutually independent if and only if their joint distribution function (i.e. the distribution function of the random vector $(\xi_1, \ldots, \xi_d)$) is the product of the distribution functions of the $\xi_k$'s:
$$F_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d) = \prod_{k=1}^{d} F_{\xi_k}(x_k).$$

Moreover, it is easy to check that if the joint distribution function can be expressed as $F_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d) = \prod_{k=1}^{d} G_k(x_k)$ with some functions $G_k$, then $F_{\xi_k}(x) = G_k(x)\left(\lim_{y \to \infty} G_k(y)\right)^{-1}$ for each k, and so $\xi_1, \ldots, \xi_d$ are mutually independent.
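If the variables $\xi_1, \ldots, \xi_d$ have probability densities, then they are mutually independent if and only if the random vector $(\xi_1, \ldots, \xi_d)$ also has a density, which can be factorised as $\rho_{\xi_1, \ldots, \xi_d}(x_1, \ldots, x_d) = \prod_{k=1}^{d} \rho_{\xi_k}(x_k)$.
For independent random variables $\xi$ and $\eta$ we have
$$E[f(\xi)g(\eta)] = E[f(\xi)]E[g(\eta)]$$

for any functions f and $g : \mathbb{R} \to \mathbb{R}$ (such that $f(\xi)$ and $g(\eta) \in L^1$). Indeed, when the densities exist,

$$E[f(\xi)g(\eta)] = \int_{\mathbb{R}^2} f(x)g(y)\,\rho_{\xi,\eta}(x, y)\,dx\,dy = \int_{\mathbb{R}^2} f(x)g(y)\,\rho_\xi(x)\rho_\eta(y)\,dx\,dy$$
$$= \int_{-\infty}^{+\infty} f(x)\rho_\xi(x)\,dx \int_{-\infty}^{+\infty} g(y)\rho_\eta(y)\,dy = E[f(\xi)]E[g(\eta)].$$

More generally, for mutually independent $\xi_1, \ldots, \xi_n$ we have

$$E\Big[\prod_{k=1}^{n} f_k(\xi_k)\Big] = \prod_{k=1}^{n} E[f_k(\xi_k)]$$

for any functions $f_k : \mathbb{R} \to \mathbb{R}$ (such that $f_k(\xi_k) \in L^1$).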


This is a very convenient property. For instance, it allows us to prove that
$$\mathrm{Cov}(\xi, \eta) := E[(\xi - E[\xi])(\eta - E[\eta])] = E[\xi - E[\xi]]\,E[\eta - E[\eta]] = 0$$

for independent $\xi$ and $\eta$. So the covariance could serve as a proxy measure for the degree of dependence: the closer the covariance (or correlation) is to zero, the less dependent are the random variables. However, it is not a very good measure, since there exist dependent random variables with zero covariance.
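
A small example of dependent random variables with zero covariance can be computed exactly. The example below is our own illustration rather than one from the text: $\xi$ takes the values -1, 0, 1 with probability 1/3 each and $\eta = \xi^2$, so that $\eta$ is completely determined by $\xi$.

```python
from fractions import Fraction

# Hypothetical example: xi is uniform on {-1, 0, 1} and eta = xi**2.
values = [-1, 0, 1]
p = Fraction(1, 3)

e_xi = sum(p * x for x in values)
e_eta = sum(p * x ** 2 for x in values)
e_xi_eta = sum(p * x * x ** 2 for x in values)
cov = e_xi_eta - e_xi * e_eta

# Dependence: P(xi = 0, eta = 0) differs from P(xi = 0) * P(eta = 0).
p_joint = p            # eta = 0 exactly when xi = 0
p_product = p * p      # P(xi = 0) = P(eta = 0) = 1/3
```

The covariance is exactly zero, yet the joint probability 1/3 differs from the product 1/9, so the two variables are not independent.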

For any random variables $\xi, \eta \in L^2$ such that $\mathrm{Cov}(\xi, \eta) = 0$ (so, in particular, for any independent $\xi$ and $\eta$) we have

$$\mathrm{Var}(\xi + \eta) = \mathrm{Var}(\xi) + 2\mathrm{Cov}(\xi, \eta) + \mathrm{Var}(\eta) = \mathrm{Var}(\xi) + \mathrm{Var}(\eta).$$

More generally,
$$\mathrm{Var}\Big(\sum_{k=1}^{n} \xi_k\Big) = \sum_{k=1}^{n} \mathrm{Var}(\xi_k)$$
for all $\xi_1, \ldots, \xi_n \in L^2$ provided $\mathrm{Cov}(\xi_j, \xi_k) = 0$ for all $j \ne k$ (in particular if the random variables are mutually independent). Random variables with zero covariance are called uncorrelated.
Note that random variables that are independent and identically distributed are often denoted i.i.d.

3.1.5 Conditional Probability


Sometimes, while estimating the probability of some event A, we can use the
information that some other event B happened.

Example 3.7. An insurance company, which provides regular payments while a client is ill and unable to work, needs to estimate the probability of the event A = {a new customer will be ill during the next year}. They have statistics according to which, out of 100 current customers, 20 were ill at least once during the last year (10 of them smoke) and 80 were not ill during the last year (20 of them smoke). Based on this, they can estimate $P(A) = \frac{20}{100} = 0.2$ (the ratio of the number of customers that were ill to the total number of customers). However, if they know that the new customer smokes, they can estimate the probability in question as $\frac{10}{10 + 20} = \frac{1}{3} > 0.2$ (the ratio of the number of smokers that were ill to the total number of smokers).
In general, the probability of event A, when event B is known to occur, is called the conditional probability of A given B, denoted P[A|B], and can be evaluated as
$$P[A|B] = \frac{P[A \cap B]}{P[B]}.$$
In the example above, with B = {a customer is a smoker}, $P(A \cap B) = \frac{10}{100} = 0.1$, $P(B) = \frac{30}{100} = 0.3$, and $P[A|B] = \frac{P(A \cap B)}{P(B)} = \frac{0.1}{0.3} = \frac{1}{3}$.
Events A and B (with P[B] > 0) are independent if and only if $P[A|B] = \frac{P[A \cap B]}{P[B]} = \frac{P[A]P[B]}{P[B]} = P[A]$. Intuitively, this means that information about event B does not have any influence on the probability of the event A.

Knowing P[A], P[B], and P[A|B], one can calculate P[B|A] as follows:

$$P[B|A] = \frac{P[A \cap B]}{P[A]} = \frac{P[A|B]P[B]}{P[A]},$$

which is called Bayes' theorem.

If $\{B_n, n = 1, 2, \ldots\}$ is a finite or countably infinite partition of $\Omega$ (that is, $B_i \cap B_j = \emptyset$ for $i \ne j$ and $\bigcup_n B_n = \Omega$), then, for any event A,
$$P[A] = \sum_n P[A \cap B_n] = \sum_n P[A|B_n]P[B_n].$$

The last relation is called the law of total probability.

Example 3.8. A car insurance company, which classifies drivers as "new" and "experienced", wants to estimate the probability of the event A = {a randomly selected driver will cause a car accident during the next year}. From their past data, they estimate that the probability of A is 0.3 for "new" drivers, and 0.05 for "experienced" ones. If 40% of their current customers are "new", then $P[A] = P[A|N]P[N] + P[A|E]P[E] = 0.3 \cdot 0.4 + 0.05 \cdot 0.6 = 0.15$ (here N and E are the events that the driver is "new" and "experienced", respectively).
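
The calculation in Example 3.8 can be reproduced with exact fractions, and Bayes' theorem then gives the probability that a driver who caused an accident is "new". The variable names below are our own, and the Bayes step is an added illustration rather than part of the example.

```python
from fractions import Fraction

p_new = Fraction(2, 5)                 # P[N]: 40% of customers are "new"
p_exp = 1 - p_new                      # P[E]
p_acc_given_new = Fraction(3, 10)      # P[A|N] = 0.3
p_acc_given_exp = Fraction(1, 20)      # P[A|E] = 0.05

# Law of total probability: P[A] = P[A|N]P[N] + P[A|E]P[E].
p_acc = p_acc_given_new * p_new + p_acc_given_exp * p_exp

# Bayes' theorem: P[N|A] = P[A|N]P[N] / P[A].
p_new_given_acc = p_acc_given_new * p_new / p_acc
```

Here `p_acc` equals 3/20 = 0.15, as in the example, and `p_new_given_acc` equals 4/5: although only 40% of customers are "new", they account for 80% of the expected accidents.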

3.2 Stochastic processes


A random variable is a suitable model for a number produced by a single experiment, at a specified moment of time, for example the temperature tomorrow at 14.00. If we are interested in how the temperature will change in time, we actually need to consider the whole collection of random variables $\{X_t : t \in T\}$, where $X_t$ is the (random) temperature at a specified moment t, and T is the set of all times we are interested in. We will call such a family of random variables a stochastic process (or random process).
Formally, a stochastic process is a family of random variables $\{X_t : t \in T\}$, where T is an arbitrary index set. For example, any random variable is a stochastic process with a one-element set T. But typically the parameter t represents time, and the most common examples for T are T = {0, 1, 2, ...} and $T = \mathbb{R}$ (or $[0, \infty)$). In the first case the stochastic process is called discrete time, and is actually just a sequence of random variables; in the second case the random process is called continuous time.

3.2.1 Rough classification of random processes
There are many classifications of random processes. One of the most basic is
to classify them with respect to the time (index) set T and with respect to
the state space. By definition the state space S is the set of possible values
of a random process Xt .

Discrete state spaces with discrete time changes. Most typically T is {0, 1, 2, ...} in this case and the state space S is a discrete set. For example, a motor insurance company reviews the status of each customer yearly with respect to three possible levels of discount S = {0, 10%, 25%}.
It is not necessary that S should be a set of numbers. For example, it may be the credit rating, or information from an application form. Typical examples
of discrete state processes with discrete time change are Markov chains which
are discussed in Section 4.3.

Discrete state spaces with continuous time changes. In this case typically $T = [0, \infty)$ and the state space S is a discrete set. For instance, an individual insurance policy holder can be classified as healthy, sick or dead. So S = {healthy, sick, dead}. It is natural to take the time set as continuous, as illness or death can happen at any time. An important special case of this class are so-called counting processes. A process $(N_t)_{t \in [0, \infty)}$ is a counting process if it is non-decreasing and takes values in {0, 1, 2, ...}. For instance, it can be the cumulative number of customers who appear at random times. Such processes are discussed in Section 4.8.

Continuous state spaces with continuous time changes. Typically in this case T is $[0, \infty)$ or $\mathbb{R}$ and $S = [0, \infty)$, $(0, \infty)$ or $\mathbb{R}$. For instance, it is natural to consider the exchange rate GBP/USD as a random process with the state space $(0, \infty)$ and continuous time. We shall be concerned with this class of models in later chapters.

Continuous state spaces with discrete time changes. The typical example of these is when an essentially continuously valued process such as price or temperature is measured only at certain time intervals (days, months, quarters, years). For example, if we do not care about intraday changes of the GBP/USD exchange rate, then we can consider this as a discrete-time process.

Mixed type processes. There are special types of continuous-time processes, with continuous or discrete state spaces, which have some specifically structured changes at predetermined times. For example, the market price of a coupon-paying bond changes at the deterministic times of the coupon payments, but it also changes randomly all the time before its maturity, due to the current situation in the market.

For every real-life process to be analysed, it is important to establish whether it is most usefully modelled using a discrete, a continuous, or a mixed time
domain. Usually the choice of state space will be clear from the nature of the
process being studied (as, for example, with the Healthy-Sick-Dead model),
but whether a continuous or discrete time set is used will often depend on
the specific aspects of the process which are of interest, and upon practical
issues like the time points for which data are available. Continuous time
and continuous state space stochastic processes, although conceptually more
difficult than discrete ones, are often more convenient to deal with, in the
same way as it is often easier to calculate an integral than to sum an infinite
series.

3.2.2 General definitions


In this section we give some general definitions related to random processes
and do not concentrate on a particular $T$, assuming only $T \subseteq \mathbb{R}$.

Sample paths. To determine a particular value for a general stochastic
process $\{X_t : t \in T\}$, we need to specify a time $t \in T$ and a particular
realization $\omega \in \Omega$. From this perspective we can interpret a stochastic process
as a function of two variables: $t$ and $\omega$. If we fix a particular state of nature
$\omega \in \Omega$, we get a particular realization of the stochastic process, which is a
deterministic function from $T$ to the state space $S$. This function is called a
sample path of the process.
For example, consider the exchange rate GBP/USD during a particular
month. In advance, it is hard to predict the exact exchange rate at every
moment of time during this month, so it is natural to model it as a random
process. During the month, we can observe a realization of this process as
a function of time, which is an example of a sample path. If we use a
continuous-time model, a sample path is a function of continuous time,
defined for every moment $t$ during the month. If we are interested only in
exchange rates at (say) 9.00 every day, the suitable model is a discrete-time
process, and the sample path is just a sequence of numbers. In this case we
will sometimes refer to it as a sample sequence of the discrete-time process.

Describing a stochastic process. A random variable $\xi$ is completely
defined if we know $\xi(\omega)$ for every $\omega \in \Omega$. However, for many practical
applications, knowing only the distribution function $F_\xi(x)$ can be considered
as a satisfactory description. For example, if $\xi$ is the temperature tomorrow
at 9.00, the knowledge of $F_\xi(x)$ allows us to answer all the questions of the
form "what is the probability that the temperature will be between $x$ and
$y$?", which is actually all we would like to know. If $\eta$ is another forecast for
tomorrow's 9.00 temperature with $F_\eta(x) \equiv F_\xi(x)$, then $\xi$ and $\eta$ can be considered
as indistinguishable from the practical point of view.
For a stochastic process $\{X_t : t \in T\}$, however, the collection of all distribution
functions $\{F_{X_t}(x) : t \in T\}$ is far from a satisfactory description of the whole
process, because it says nothing about the dependencies among the underlying
random variables. To describe the whole process, we need to specify the joint
distributions of $X_{t_1}, X_{t_2}, \dots, X_{t_n}$ for all $t_1, t_2, \dots, t_n$ in $T$ and all integers $n$.
The collection of the joint distributions above is called the family of finite
dimensional probability distributions (f.f.d. for short), and two stochastic
processes $\{X_t : t \in T_1\}$ and $\{Y_t : t \in T_2\}$ are said to be identically distributed
if they have the same f.f.d. and $T_1 = T_2$.
To describe a stochastic process in practice, we will rarely give the exact
formulas for its f.f.d., but will rather use some indirect intuitive descriptions.
For example, take the familiar Bernoulli trials of consecutive tosses of a fair
coin. A sequence of i.i.d. Bernoulli variables $(\xi_t)_{t=1}^{\infty}$ is a stochastic process,
and its f.f.d. is fully determined by this description. Indeed, for any sequence
of distinct times $t_1, t_2, \dots, t_n$ in $T = \{1, 2, \dots\}$ and results $x_1, x_2, \dots, x_n$ in $S$, we
are able to compute the probability $P(\xi_{t_1} = x_1 \cap \xi_{t_2} = x_2 \cap \dots \cap \xi_{t_n} = x_n)$,
and it is equal to $2^{-n}$.

Stationarity. In the example above, the probability of observing any sequence
of results $x_1, x_2, \dots, x_n$ does not depend on the times $t_1, t_2, \dots, t_n$. This
means that the statistical properties of the process remain unchanged over
time, which is intuitively obvious for tosses of a fair coin. If, however,
a stochastic process describes tomorrow's temperature, it would be
reasonable to expect a lower temperature during the morning than at noon.
Formally, a stochastic process $\{X_t : t \in T\}$ is said to be stationary, or strictly
stationary, if for all integers $n$ and all $t, t_1, t_2, \dots, t_n$ in $T$ the joint
distributions of $X_{t_1}, X_{t_2}, \dots, X_{t_n}$ and $X_{t+t_1}, X_{t+t_2}, \dots, X_{t+t_n}$ coincide.
Substituting $n = 1$, we can see that, in particular, the distribution functions
$F_{X_t}(x)$ are the same for all $t \in T$. Consequently, all parameters
depending only on the distribution (such as mean and variance), if they exist, also
do not change over time.
Strict stationarity is a strong requirement which may be difficult to test fully
in real life. Actually, a much weaker condition, known as weak stationarity, is
often already very useful for applications. A stochastic process $\{X_t : t \in T\}$
is said to be weakly stationary if the mean of the process, $m(t) = E[X_t]$, is
constant, and the covariance of the process, $C(s, t) = E[(X_s - m(s))(X_t - m(t))]$,
depends only on the time difference $t - s$. Obviously, any strictly
stationary stochastic process with finite mean and variance is also weakly
stationary.

Increments. If a stochastic process $\{X_t : t \in T\}$ describes tomorrow's
temperature, we are interested in the value of $X_t$ itself. Sometimes, however,
the dynamics of how the value changes over time are much more interesting.
For example, if $X_t$ is the price of a stock share, a forecast $X_t = \pounds 100$
provides almost no information by itself. If the current stock price is $X_0 = \pounds 60$,
the above forecast is very optimistic; if, however, $X_0 = \pounds 120$, it is
pessimistic. What we are really interested in is the price dynamics: whether
the price increases or decreases, and by how much.
Formally, an increment of the stochastic process $\{X_t : t \in T\}$ is the quantity
$X_{t+u} - X_t$, $u > 0$. Many processes are most naturally defined through
their increments. For example, let $X_t$ be the total amount of money in the bank
account of a person A on the first day of month $t$. Assume that the monthly salary
of A is a fixed amount $C$, and the monthly expenses $Y_t$ are random. Then the
stochastic process $X_t$ is naturally defined through its increments
$X_{t+1} - X_t = C - Y_t$.
In the above example, the process $X_t$ is not stationary (even weakly) unless
$Y_t \equiv C$ for all $t$. For example, if $E[Y_t] < C$ for all $t$, the total amount of money in the
bank account increases (on average) with time. However, if the $Y_t$ are identically
distributed, the rate of growth of $X_t$ remains unchanged over time. Such
processes are said to have stationary increments. If, moreover, the monthly
expenses $Y_t$ are (jointly) independent, the rate of growth of $X_t$ does not depend
on its history, and we say that $X_t$ has independent increments.
Formally, a stochastic process $\{X_t : t \in T\}$ has stationary (or time-homogeneous)
increments if for every $u > 0$ the increment $Z_t = X_{t+u} - X_t$ is a stationary
process; a process $\{X_t : t \in T\}$ is said to have independent increments if for
any $a, b, c, d \in T$ such that $a < b < c < d$, the random variables $X_b - X_a$ and
$X_d - X_c$ are independent.
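The bank account example can be simulated directly from its increments. The following is a minimal sketch only: the salary of 2000, the uniform expense distribution and the function names are illustrative assumptions, not part of the module text. Because the expenses are drawn independently from one fixed distribution, the simulated process has stationary and independent increments.

```python
import random

def simulate_balance(x0, salary, expense_sampler, months, rng):
    """Return the path X_0, ..., X_months with X_{t+1} - X_t = salary - Y_t."""
    path = [x0]
    for _ in range(months):
        y = expense_sampler(rng)           # random monthly expense Y_t
        path.append(path[-1] + salary - y)
    return path

def uniform_expenses(rng):
    # Hypothetical expense model (an assumption for illustration):
    # Y_t is uniform on [1500, 2300], so E[Y_t] = 1900 < C = 2000.
    return rng.uniform(1500.0, 2300.0)

rng = random.Random(0)
path = simulate_balance(x0=0.0, salary=2000.0,
                        expense_sampler=uniform_expenses, months=120, rng=rng)

# Increments X_{t+1} - X_t = C - Y_t; their average estimates C - E[Y_t].
increments = [b - a for a, b in zip(path, path[1:])]
mean_increment = sum(increments) / len(increments)
```

Under these assumed parameters the expected increment is 2000 - 1900 = 100 per month, so the balance drifts upwards on average and the process itself is not stationary, even though its increments are.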

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.
R.G. Grimmett & D.R. Stirzaker, Probability and Random Processes.
J. Jacod & P. Protter, Probability Essentials.
A.N. Shiriaev, Probability.

3.3 Summary
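Let $\Omega$ be a non-empty set and $\mathcal{F}$ be a collection of subsets of $\Omega$. We say that
$\mathcal{F}$ is a $\sigma$-algebra if it satisfies the following properties:

1. $\emptyset, \Omega \in \mathcal{F}$

2. If $A_1, A_2, \dots \in \mathcal{F}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ and $\bigcap_{n=1}^{\infty} A_n \in \mathcal{F}$

3. If $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$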

Further, suppose that a function $P : \mathcal{F} \to [0, 1]$ satisfies

1. $P(\emptyset) = 0$, $P(\Omega) = 1$

2. $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$, provided $A_1, A_2, \dots \in \mathcal{F}$ is a collection of
disjoint sets, i.e. $A_i \cap A_j = \emptyset$ for all $i \neq j$.

Then we call such $P$ a probability measure and the triple $(\Omega, \mathcal{F}, P)$ a
probability space. The subsets of $\Omega$ belonging to $\mathcal{F}$ are called events.
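Given a probability space $(\Omega, \mathcal{F}, P)$ we call a mapping $\xi : \Omega \to \mathbb{R}$ a random
variable if it is measurable, i.e. satisfies $\{\omega : \xi(\omega) \le r\} \in \mathcal{F}$ for all $r \in \mathbb{R}$.
For a random variable $\xi$, its cumulative distribution function (cdf) is
$F_\xi(x) := P(\xi \le x)$. If $F_\xi(x)$ can be represented as
$$F_\xi(x) = \int_{-\infty}^{x} \rho_\xi(z)\, dz,$$
for some non-negative function $\rho_\xi$, the latter is called the probability density
function (pdf) of $\xi$.
The expectation of a random variable $\xi$ is defined as $E[\xi] := \int_\Omega \xi \, dP$. It can
be calculated as
$$E[\xi] = \int_{-\infty}^{\infty} x \, dF_\xi(x) = \int_{-\infty}^{\infty} x \, \rho_\xi(x)\, dx,$$
where the last expression applies when the pdf exists.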

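The variance of a random variable $\xi$ is defined by
$\mathrm{Var}(\xi) := E[(\xi - E[\xi])^2] \equiv E[\xi^2] - E[\xi]^2$. For two random variables
$\xi, \eta$, their covariance is defined by
$\mathrm{Cov}(\xi, \eta) := E[(\xi - E[\xi])(\eta - E[\eta])] \equiv E[\xi\eta] - E[\xi]E[\eta]$.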
Two events $A$ and $B$ are called independent if $P(A \cap B) = P(A)P(B)$.
Random variables $\xi$ and $\eta$ are called independent if the events
$A = \{\omega \in \Omega : a < \xi(\omega) < b\}$ and $B = \{\omega \in \Omega : c < \eta(\omega) < d\}$ are independent for
all real numbers $a, b, c, d$. If $\xi$ and $\eta$ are independent, $\mathrm{Cov}(\xi, \eta) = 0$, but the
converse is not always true.
The conditional probability of $A$ given $B$, denoted $P[A|B]$, can be evaluated
as $P[A|B] = \frac{P[A \cap B]}{P[B]}$, provided $P[B] > 0$.
A stochastic process is a family of random variables indexed by time, $\{X_t : t \in T\}$.
The time set $T$ can be discrete or continuous, as can the state space $S$ in
which the variables take their values.
Stochastic processes can be roughly classified into the following groups:

Discrete S and discrete T ;

Continuous S and discrete T ;

Discrete S and continuous T ;

Continuous S and continuous T ; or

Mixed processes.

A stochastic process $\{X_t : t \in T\}$ is said to be stationary if for all integers
$n$ and all $t, t_1, t_2, \dots, t_n$ in $T$ the joint distributions of $X_{t_1}, X_{t_2}, \dots, X_{t_n}$ and
$X_{t+t_1}, X_{t+t_2}, \dots, X_{t+t_n}$ coincide. It is called weakly stationary if the mean
of the process, $m(t) = E[X_t]$, is constant, and the covariance of the process,
$C(s, t) = E[(X_s - m(s))(X_t - m(t))]$, depends only on the time difference
$t - s$.
An increment of the stochastic process $\{X_t : t \in T\}$ is the quantity $X_{t+u} - X_t$,
$u > 0$. A stochastic process has stationary (or time-homogeneous)
increments if for every $u > 0$ the increment $Z_t = X_{t+u} - X_t$ is a stationary
process; a process $\{X_t : t \in T\}$ is said to have independent increments if for
any $a, b, c, d \in T$ such that $a < b < c < d$, the random variables $X_b - X_a$ and
$X_d - X_c$ are independent.

Questions

1. Let $X$ be a random variable from the continuous uniform distribution,
$X \sim U(0.5, 1.0)$. Starting with the probability density function, derive
expressions for the cumulative distribution function, expectation and
variance of $X$.

2. A sample space consists of five elements $\Omega = \{a_1, a_2, a_3, a_4, a_5\}$. For
which of the following sets of probabilities does the corresponding triple
$(\Omega, \mathcal{A}, P)$ become a probability space? Why?

(a) p(a1 ) = 0.3; p(a2 ) = 0.2; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;
(b) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.1; p(a4 ) = 0.1; p(a5 ) = 0.1;
(c) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.2; p(a4 ) = 0.1; p(a5 ) = 0.1;
(d) p(a1 ) = 0.4; p(a2 ) = 0.3; p(a3 ) = 0.2; p(a4 ) = 0.1; p(a5 ) = 0.1.

3. Assets A and B have the following distribution of returns in various
states:

State   Asset A   Asset B   Probability
1       10%       -2%       0.2
2       8%        15%       0.2
3       25%       0%        0.3
4       -14%      6%        0.3

Show that the correlation between the returns on asset A and asset B
is equal to -0.3830.

4. Formalise Example 3.6 as $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$,
$P(\omega_1) = P(\omega_2) = P(\omega_3) = P(\omega_4) = 1/4$ and

$$A := \{\omega_1, \omega_4\}, \quad B := \{\omega_2, \omega_4\}, \quad C := \{\omega_3, \omega_4\}.$$

Prove that the pairs $(A, B)$, $(A, C)$ and $(B, C)$ are independent, but
the triple $(A, B, C)$ is not mutually independent according to Definition
3.4.

5. You intend to model the maximum daily temperature in your office as
a stochastic process. What time set and state space would you use?

Chapter 4
Markov Chains
The Markov property is used extensively in actuarial mathematics to
develop two-state and multi-state Markov models of mortality and other
decrements. The rest of this course is devoted to a thorough description of
the Markov property in a general context and its applications to actuarial
modelling.
We will distinguish between two types of stochastic process that possess the
Markov property: Markov chains and Markov jump processes. Both have a
discrete state space, but Markov chains have a discrete time set and Markov
jump processes have a continuous time set.
We begin with Markov chains and discuss the mathematical formulation of
such processes, leading to one important actuarial application: the
no-claims-discount process used in motor insurance. We then move on to Markov jump
processes.
The practical considerations of applying these models in actuarial
mathematics will be discussed in detail in later sections. In this chapter we focus
on the mathematical development of Markov models without reference to
their calibration to real data.

4.1 The Markov property


A major simplification to the general stochastic processes discussed in Section
3.2 occurs if the development of a process can be predicted from its current
state, i.e. without any reference to its past history. This is the Markov
property.
In this chapter we are concerned with a stochastic process $\{X_t\}$ defined on
a state space $S$ and time set $t \ge 0$.
The Markov property can be stated mathematically as

$$P[X_t \in A \mid X_{s_1} = x_1, X_{s_2} = x_2, \dots, X_{s_n} = x_n, X_s = x] = P[X_t \in A \mid X_s = x], \qquad (10)$$

for all $s_1 < s_2 < \dots < s_n < s < t$ in the time set, all states $x_1, x_2, \dots, x_n, x$
in the state space $S$ and all subsets $A$ of $S$.
Working with subsets $A \subseteq S$ is necessary so that the above definition of
the Markov property covers both the discrete and continuous state spaces.
Recall from Section 3.1 that in the continuous case the probability that $X_t$
is a particular value is zero, and so it is necessary to work with probabilities
of $X_t$ lying in some subset of the state space in any general definition.
Although we are entirely concerned with discrete state spaces in this chapter,
it is important to realise that the Markov property can be possessed by
general stochastic processes.

For discrete state spaces the Markov property is written as

$$P[X_t = a \mid X_{s_1} = x_1, X_{s_2} = x_2, \dots, X_{s_n} = x_n, X_s = x] = P[X_t = a \mid X_s = x], \qquad (11)$$

for all $s_1 < s_2 < \dots < s_n < s < t$ and all states $a, x_1, x_2, \dots, x_n, x$ in $S$.
An important result is that any process with independent increments has the
Markov property.

Example 4.1. Question: Prove that any process with independent increments
has the Markov property.
Answer: We begin with equation (10) and use the fact that $X_t = X_t - X_s + x$
when $X_s = x$ to introduce an increment:

$$\begin{aligned}
& P[X_t \in A \mid X_{s_1} = x_1, X_{s_2} = x_2, \dots, X_{s_n} = x_n, X_s = x] \\
&\quad = P[X_t - X_s + x \in A \mid X_{s_1} = x_1, X_{s_2} = x_2, \dots, X_{s_n} = x_n, X_s = x] \\
&\quad = P[X_t - X_s + x \in A \mid X_s = x] = P[X_t \in A \mid X_s = x].
\end{aligned}$$

The second equality arises from the definition of independent increments and
the fact that $x$ is known.
A Markov process with a discrete state space and a discrete time set is called
a Markov chain; these are considered in this chapter. A Markov process with
a discrete state space and a continuous time set is called a Markov jump process;
these are considered in the next chapter.

4.2 Definition of Markov Chains


A Markov chain is a discrete-time stochastic process with a countable state
space $S$, obeying the Markov property. It is therefore a sequence of random
variables $\{X_t\}$ with the property given by equation (11), which we rewrite for
notational convenience as

$$P[X_n = j \mid X_0 = i_0, X_1 = i_1, \dots, X_{m-1} = i_{m-1}, X_m = i] = P[X_n = j \mid X_m = i],$$

for all integer times $n > m$ and states $i_0, i_1, \dots, i_{m-1}, i, j \in S$.

We define the transition probabilities as

$$p_{ij}(n, n+1) = P[X_{n+1} = j \mid X_n = i]. \qquad (12)$$

Therefore, $p_{ij}(n, n+1)$ is the probability of being in state $j$ at time $n+1$,
having been in state $i$ at time $n$.
For each fixed $n = 0, 1, \dots$, we can form a matrix of transition probabilities
from time $n$ to the next time step $n+1$:

$$\mathbf{P}(n, n+1) = \big(p_{ij}(n, n+1)\big)_{i,j \in S}. \qquad (13)$$

Note that P(n, n+1) is a finite matrix in the case of a finite number of states,
and an infinite matrix in the case of an infinite number of states.

Example 4.2. Question: Consider a no claims discount (NCD) model
for car-insurance premiums. The insurance company offers discounts of 0%,
30% and 60% of the full premium, determined by the following rules:

1. All new policyholders start at the 0% level.

2. If no claim is made during the current year the policyholder moves up
one discount level, or remains at the 60% level.

3. If one or more claims are made the policyholder moves down one level,
or remains at the 0% level.

The insurance company believes that the chance of claiming each year is
independent of the current discount level and has a probability of 1/4. Why
can this process be modelled as a Markov chain? Give the state space and
transition matrix.
Answer: The model can be considered as a Markov chain since the future
discount depends only on the current level, not the entire history. The state
space is $S = \{0\%, 30\%, 60\%\}$, which is convenient to denote as $S = \{0, 1, 2\}$
(where state 0 is the 0% state, 1 is the 30% state and 2 is the 60% state).
The transition probability matrix between two states in a unit time is given
by
$$\mathbf{P} = \begin{pmatrix} \frac{1}{4} & \frac{3}{4} & 0 \\ \frac{1}{4} & 0 & \frac{3}{4} \\ 0 & \frac{1}{4} & \frac{3}{4} \end{pmatrix}. \qquad (14)$$

A matrix $A$ is called a stochastic matrix if

1. All its entries are non-negative, and
2. The sum of entries in any row is one.

It is clear that the transition matrix in Example 4.2 is a stochastic matrix
by this definition. More generally, every transition matrix $\mathbf{P}(n, n+1)$ of a
Markov chain is a stochastic matrix. Indeed, all the transition probabilities
$p_{ij}(n, n+1)$ are by definition non-negative, and $\sum_{j \in S} p_{ij}(n, n+1) = 1$ for all $i$ since the
system must move to some state from any state $i$.
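The two conditions are easy to check mechanically. The sketch below is illustrative only: the function name `is_stochastic` is our own, and exact fractions are used so that the row sums can be compared with one without rounding error. It checks the NCD matrix of Example 4.2.

```python
from fractions import Fraction

def is_stochastic(matrix):
    """True if all entries are non-negative and every row sums to one."""
    return all(
        all(entry >= 0 for entry in row) and sum(row) == 1
        for row in matrix
    )

q = Fraction(1, 4)   # probability of at least one claim in a year (Example 4.2)

# Rows and columns are ordered by state: 0 (0%), 1 (30%), 2 (60%).
NCD = [
    [q, 1 - q, 0],
    [q, 0, 1 - q],
    [0, q, 1 - q],
]

ncd_is_stochastic = is_stochastic(NCD)
```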
A clear way of representing Markov chains is by a transition graph. The
states are represented by circles linked by arrows indicating each possible
transition. Next to each arrow is the corresponding transition probability.
Example 4.3. Question: Draw the transition graph for the NCD system
defined in Example 4.2.
Answer: See Figure 3

Figure 3: Transition graph for the NCD system of Example 4.2. Reproduced
with permission of the Faculty and Institute of Actuaries.

4.3 The Chapman-Kolmogorov equations

Equation (12) defines the probabilities of transition over a single time step.
Similarly, the $n$-step transition probabilities $p_{ij}(m, m+n)$ denote the
probability that a process in state $i$ at time $m$ will be in state $j$ at time $m+n$.
That is:
$$p_{ij}(m, m+n) = P[X_{m+n} = j \mid X_m = i].$$
The transition probabilities of a Markov process satisfy the system of
equations called the Chapman-Kolmogorov equations
$$p_{ij}(m, n) = \sum_{k \in S} p_{ik}(m, l)\, p_{kj}(l, n), \qquad (15)$$
for all states $i, j \in S$ and all integer times $m < l < n$. This can be expressed
in terms of $n$-step stochastic matrices as

$$\mathbf{P}(m, n) = \mathbf{P}(m, l)\mathbf{P}(l, n),$$

where $\mathbf{P}(m, l)\mathbf{P}(l, n)$ is the product of matrices in the usual sense.

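Example 4.4. Question: Prove equation (15).

Answer: We use the Markov property and the law of total probability. For
integer times $m < l < n$,

$$\begin{aligned}
p_{ij}(m, n) &= P(X_n = j \mid X_m = i) \\
&= \sum_{k \in S} P(X_n = j, X_l = k \mid X_m = i) \\
&= \sum_{k \in S} P(X_n = j \mid X_l = k, X_m = i)\, P(X_l = k \mid X_m = i) \\
&= \sum_{k \in S} P(X_n = j \mid X_l = k)\, P(X_l = k \mid X_m = i) \\
&= \sum_{k \in S} p_{ik}(m, l)\, p_{kj}(l, n).
\end{aligned}$$

The second equality is the law of total probability and the fourth uses the
Markov property. This is the required result.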

The Chapman-Kolmogorov equations provide a method for computing the
$n$-step transition probabilities from the one-step transition probabilities. The
distribution of a Markov chain is therefore fully determined once the following
are specified:

The one-step transition probabilities $p_{ij}(n, n+1)$.

The initial probability distribution $\pi_{j_0} = P(X_0 = j_0)$.

The probability of any path can then be determined from
$$P[X_0 = j_0, X_1 = j_1, \dots, X_N = j_N] = \pi_{j_0} \prod_{n=0}^{N-1} p_{j_n, j_{n+1}}(n, n+1). \qquad (16)$$

This should be intuitively clear but a formal proof is left as a question at the
end of the chapter.
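Equation (16) translates directly into a short computation. The sketch below is an illustration under our own naming: `path_probability` and the callable `transition(n)`, which returns the one-step matrix for the step from time n to n + 1, are hypothetical helpers. It is applied to the NCD chain of Example 4.2, whose matrix happens to be the same at every step.

```python
from fractions import Fraction

def path_probability(initial, transition, path):
    """Probability of X_0 = path[0], ..., X_N = path[N], as in equation (16).

    initial[j]     -- P(X_0 = j)
    transition(n)  -- one-step matrix for the step from time n to n + 1
    """
    prob = initial[path[0]]
    for n in range(len(path) - 1):
        prob *= transition(n)[path[n]][path[n + 1]]
    return prob

q = Fraction(1, 4)                       # probability of a claim in a year
NCD = [[q, 1 - q, 0], [q, 0, 1 - q], [0, q, 1 - q]]
initial = [Fraction(1), Fraction(0), Fraction(0)]   # new policyholders start at 0%

# Path 0% -> 30% -> 60% -> 30%: no claim, no claim, then a claim.
p = path_probability(initial, lambda n: NCD, [0, 1, 2, 1])
```

For this path the product is 1 × 3/4 × 3/4 × 1/4 = 9/64.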

4.4 Time dependency of Markov chains


4.4.1 Time-inhomogeneous Markov chains
For a time-inhomogeneous Markov chain, the transition probabilities $p_{ij}(t, t+1)$
change with time $t$. The transition probabilities will therefore form a sequence
of stochastic matrices denoted by $\mathbf{P}(t)$:

$$\mathbf{P}(t) = \big(p_{ij}(t, t+1)\big)_{i,j \in S} = \begin{pmatrix} p_{00}(t, t+1) & p_{01}(t, t+1) & \cdots \\ p_{10}(t, t+1) & p_{11}(t, t+1) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}.$$

The value of t can represent many factors such as time of year, age of pol-
icyholder or the length of time the policy has been in force. For example,
young drivers and very old drivers may have more accidents than middle-
aged drivers and therefore t might represent the age or age group of the
driver purchasing a motor insurance policy.
Although time-inhomogeneous models are important in practical modelling,
a further analysis is beyond the scope of this course.

4.4.2 Time-homogeneous Markov chains


A Markov chain is called time-homogeneous if the transition probabilities do not
depend on time. This is a significant simplification to any Markov-chain
model. In particular, for a time-homogeneous Markov chain, equation (16)
becomes
$$P[X_n = j_n,\ n = 0, 1, 2, \dots, N] = P[X_0 = j_0] \prod_{n=0}^{N-1} p_{j_n j_{n+1}}. \qquad (17)$$

It is therefore clear that the matrix of the $n$-step transition probabilities is
the $n$-th power of the matrix of 1-step transition probabilities $\{p_{ij}\}$:
$$P[X_{n+m} = j \mid X_m = i] := p_{ij}^{(n)} = \sum_{k_1, k_2, \dots, k_{n-1}} \underbrace{p_{i k_1} p_{k_1 k_2} \cdots p_{k_{n-2} k_{n-1}} p_{k_{n-1} j}}_{n \text{ terms}}.$$

If we let $\mathbf{P}^{(n)}$ denote the $n$-step transition matrix, then

$$\mathbf{P}^{(n)} = \mathbf{P}^n,$$

where $\mathbf{P}$ is the one-step matrix of transition probabilities.

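Example 4.5. Question: Calculate the 2-step transition matrix for the
NCD system from Example 4.2 and confirm that it is a stochastic matrix.

Answer: The 1-step transition matrix is given by equation (14) and so we
can compute that
$$\mathbf{P}^{(2)} = \begin{pmatrix} \frac{1}{4} & \frac{3}{4} & 0 \\ \frac{1}{4} & 0 & \frac{3}{4} \\ 0 & \frac{1}{4} & \frac{3}{4} \end{pmatrix} \begin{pmatrix} \frac{1}{4} & \frac{3}{4} & 0 \\ \frac{1}{4} & 0 & \frac{3}{4} \\ 0 & \frac{1}{4} & \frac{3}{4} \end{pmatrix} = \frac{1}{16} \begin{pmatrix} 4 & 3 & 9 \\ 1 & 6 & 9 \\ 1 & 3 & 12 \end{pmatrix}.$$

We note that the two conditions for $\mathbf{P}^{(2)}$ to be a stochastic matrix are
satisfied: all entries are non-negative and each row sums to 16/16 = 1.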
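The same calculation can be reproduced with exact arithmetic. This is a minimal sketch using only the standard library; the helper names `mat_mul` and `mat_pow` are our own, and the matrix is the one from equation (14).

```python
from fractions import Fraction

def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def mat_pow(p, n):
    """n-th power of a square matrix, by repeated multiplication."""
    size = len(p)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

q = Fraction(1, 4)
NCD = [[q, 1 - q, 0], [q, 0, 1 - q], [0, q, 1 - q]]

P2 = mat_pow(NCD, 2)                       # 2-step transition matrix
row_sums = [sum(row) for row in P2]        # each should equal 1
```

Replacing the exponent 2 by any n gives the n-step matrix $\mathbf{P}^n$ in the same way.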

Example 4.6. Question: Using the 2-step transition matrix from Exam-
ple 4.5 state the probabilities that

(a) A policyholder initially in the 0%-state is in the 60%-state after 2 years.

(b) A policyholder initially in the 60%-state is in the 30%-state after 2


years.

(c) A policyholder initially in the 0%-state is in the 0%-state after 2 years.

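Answer: Rows and columns of $\mathbf{P}^{(2)}$ are numbered from 1, so the 0% state
corresponds to row and column 1.

(a) Element $(\mathbf{P}^{(2)})_{1,3}$ gives the required probability, 9/16.
Note that this is consistent with the path $0\% \to 30\% \to 60\%$, i.e. no
claims for two years, therefore the probability is $3/4 \times 3/4 = 9/16$.

(b) $(\mathbf{P}^{(2)})_{3,2} = 3/16$.
Note that this is consistent with the path $60\% \to 60\% \to 30\%$,
therefore the probability is $3/4 \times 1/4 = 3/16$.

(c) $(\mathbf{P}^{(2)})_{1,1} = 4/16$.
Note that this is consistent with either path $0\% \to 0\% \to 0\%$ or
$0\% \to 30\% \to 0\%$, therefore the probability is $1/4 \times 1/4 + 3/4 \times 1/4 = 4/16$.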

4.5 Further applications


The simple NCD system of Example 4.2 gives a practical example of a time-
homogeneous Markov chain. We now consider three further examples.

4.5.1 The simple (unrestricted) random walk
A simple random walk is a stochastic process $\{X_n\}$ with state space $S = \mathbb{Z}$,
i.e. the integers. The process is defined by

$$X_n = Y_1 + Y_2 + \dots + Y_n,$$

where $Y_1, Y_2, \dots$ is a sequence of i.i.d. Bernoulli variables such that

$$P(Y_i = 1) = p \quad \text{and} \quad P(Y_i = -1) = 1 - p.$$

The simple random walk has the Markov property, that is:

$$\begin{aligned}
& P(X_{m+n} = j \mid X_1 = i_1, X_2 = i_2, \dots, X_m = i) \\
&\quad = P(X_m + Y_{m+1} + Y_{m+2} + \dots + Y_{m+n} = j \mid X_1 = i_1, X_2 = i_2, \dots, X_m = i) \\
&\quad = P(Y_{m+1} + Y_{m+2} + \dots + Y_{m+n} = j - i) \\
&\quad = P(X_{m+n} = j \mid X_m = i).
\end{aligned}$$

Hence a simple random walk is a time-homogeneous Markov chain with
transition probabilities:

$$p_{ij} = \begin{cases} p & \text{if } j = i + 1 \\ 1 - p & \text{if } j = i - 1 \\ 0 & \text{otherwise} \end{cases}$$

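Since we are considering an unrestricted simple random walk, the transition
graph (Figure 4) and 1-step transition matrix are infinite.
In particular the 1-step transition matrix is given by
$$\mathbf{P} = \begin{pmatrix} \ddots & \ddots & & & \\ \ddots & 0 & p & & \\ & 1-p & 0 & p & \\ & & 1-p & 0 & \ddots \\ & & & \ddots & \ddots \end{pmatrix}.$$

It is clear that this is a stochastic matrix: the only non-zero entries in each
row are $1-p$ and $p$, which sum to one.
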


Figure 4: Transition graph for the unrestricted random walk. Reproduced
with permission of the Faculty and Institute of Actuaries.

To determine the $n$-step transition probabilities, consider moving from state
$i$ to state $j$ in $n$ steps. Let the number of positive steps be $r$ (that is, $r$ is
the total number of steps $k$ where $X_{k+1} - X_k = 1$), and the number of negative
steps be $l$ (that is, $l$ is the total number of steps $k$ where $X_{k+1} - X_k = -1$).
Since there are $n$ steps in total, it follows that $r + l = n$ and that $r - l = j - i$,
the excess of positive steps over negative steps. Solving these simultaneous
equations for $r$ and $l$ gives
$$r = \tfrac{1}{2}(n + j - i) \quad \text{and} \quad l = \tfrac{1}{2}(n - j + i).$$
From this we can see that the $n$-step transition probabilities are
$$p_{ij}^{(n)} = \binom{n}{\frac{1}{2}(n + j - i)}\, p^{\frac{1}{2}(n + j - i)} (1 - p)^{\frac{1}{2}(n - j + i)},$$
where $\binom{n}{r}$ is the number of possible paths with $r$ positive steps, each of which
occurs with probability $p^r (1 - p)^{n - r}$. The expression arises since the
distribution of the number of positive steps in $n$ steps is Binomial with parameters
$n$ and $p$. Since $r$ and $l$ must be non-negative integers, it follows that both
$n + j - i$ and $n - j + i$ must be non-negative even numbers; otherwise $p_{ij}^{(n)} = 0$.
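The closed-form expression can be checked numerically against a direct enumeration of all $2^n$ step sequences. The sketch below is illustrative; the function names are our own and the brute-force check is only practical for small n.

```python
from itertools import product
from math import comb

def rw_n_step(i, j, n, p):
    """n-step transition probability of the simple random walk (closed form)."""
    d = n + j - i                       # equals 2r, twice the number of up-steps
    if d < 0 or d > 2 * n or d % 2 != 0:
        return 0.0                      # j cannot be reached from i in n steps
    r = d // 2
    return comb(n, r) * p**r * (1 - p)**(n - r)

def rw_n_step_brute_force(i, j, n, p):
    """Same probability, by summing over every sequence of n steps of +1 or -1."""
    total = 0.0
    for steps in product((1, -1), repeat=n):
        if i + sum(steps) == j:
            ups = steps.count(1)
            total += p**ups * (1 - p)**(n - ups)
    return total

# Compare the two for a walk started at 0, with n = 5 and p = 0.3.
max_gap = max(
    abs(rw_n_step(0, j, 5, 0.3) - rw_n_step_brute_force(0, j, 5, 0.3))
    for j in range(-6, 7)
)
```

Spatial homogeneity is visible in the code: the result depends on i and j only through the difference j - i.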
In addition to being time-homogeneous, a simple random walk is spatially
homogeneous, that is
$$p_{ij}^{(n)} = P(X_n = j \mid X_0 = i) = P(X_n = j + r \mid X_0 = i + r)$$
for any integer $r$.

A simple random walk with $p = q = \frac{1}{2}$, where $q = 1 - p$, is called a symmetric simple random
walk.

4.5.2 The restricted random walk


We introduce the restricted random walk with an example:

A man needs to raise £$N$ to fund a specific project and asks his very rich
friend to accompany him to a casino where he hopes to win this money. The
man plays the following game: A fair coin is tossed. If it lands heads-up
the man wins £1; if it lands tails-up the man loses £1. If he loses all his
money he will borrow £1 from his friend and continue to play until he has
the required £$N$. Once he has accumulated £$N$ he will stop playing the
game.
The restricted random walk is therefore a simple random walk with boundary
conditions. In this example the boundary conditions are specified at 0 and
$N$. At $N$ the barrier is an absorbing barrier, while at 0 it is called a reflecting
barrier.
More formally, an absorbing barrier is a value $b$ such that:

$$P(X_{n+s} = b \mid X_n = b) = 1 \quad \text{for all } s > 0.$$

In other words, once state $b$ is reached, the random walk stops and remains
in this state thereafter.
A reflecting barrier is a value $c$ such that:

$$P(X_{n+1} = c + 1 \mid X_n = c) = 1.$$

In other words, once state $c$ is reached, the random walk is pushed away.
A mixed barrier is a value $d$ such that:

$$P(X_{n+1} = d \mid X_n = d) = \alpha \quad \text{and} \quad P(X_{n+1} = d + 1 \mid X_n = d) = 1 - \alpha,$$

for all $n$ and some $\alpha \in [0, 1]$. In other words, once state $d$ is reached, the
random walk remains in this state with probability $\alpha$ or moves to the
neighbouring state $d + 1$ with probability $1 - \alpha$, i.e. it is an absorbing barrier with
probability $\alpha$ and a reflecting barrier with probability $1 - \alpha$. These
definitions are written for a lower barrier; at an upper barrier the walk is
pushed to the neighbouring state below instead.
If, in the example above, the man does not take his rich friend, he will
continue to gamble until either his money reaches the target £$N$ or he runs
out of money. In each case reaching the boundary means that the wealth
will remain there forever; the barriers therefore become absorbing barriers.
The transition graph for the general case of a restricted random walk with
two mixed barriers is given in Figure 5. The special cases of reflecting and
absorbing boundary conditions are obtained by taking $\alpha$ or $\beta$ equal to 0 or
1.

Figure 5: Transition graph for the restricted random walk with mixed bound-
ary conditions. Reproduced with permission of the Faculty and Institute of
Actuaries.

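The 1-step transition matrix is given by
$$\mathbf{P} = \begin{pmatrix} \alpha & 1-\alpha & & & \\ 1-p & 0 & p & & \\ & \ddots & \ddots & \ddots & \\ & & 1-p & 0 & p \\ & & & 1-\beta & \beta \end{pmatrix},$$
where $\alpha$ and $\beta$ are the probabilities of remaining at the lower and upper
barrier respectively.
Note that the matrix is finite, which is in contrast to the transition matrix
for the unrestricted random walk.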
The simple NCD model given in Example 4.2 is a practical example of a
restricted random walk.
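To see why the NCD model is a restricted random walk, the matrix above can be built for any number of states. The sketch below is an illustration under our own naming: `restricted_walk_matrix` is a hypothetical helper, `alpha` and `beta` denote the probabilities of remaining at the lower and upper barriers, and `size` is assumed to be at least 2.

```python
from fractions import Fraction

def restricted_walk_matrix(size, p, alpha, beta):
    """One-step matrix on states 0, ..., size - 1 with mixed barriers at both ends."""
    matrix = [[0] * size for _ in range(size)]
    # Lower barrier: stay with probability alpha, otherwise step up.
    matrix[0][0], matrix[0][1] = alpha, 1 - alpha
    # Upper barrier: stay with probability beta, otherwise step down.
    matrix[-1][-1], matrix[-1][-2] = beta, 1 - beta
    # Interior states: up with probability p, down with probability 1 - p.
    for i in range(1, size - 1):
        matrix[i][i - 1], matrix[i][i + 1] = 1 - p, p
    return matrix

# NCD model of Example 4.2: an up-step means a claim-free year (probability 3/4).
ncd = restricted_walk_matrix(3, p=Fraction(3, 4),
                             alpha=Fraction(1, 4), beta=Fraction(3, 4))

# Gambler without the rich friend, target N = 4: both barriers absorbing.
gambler = restricted_walk_matrix(5, p=Fraction(1, 2), alpha=1, beta=1)
```

Taking `alpha=0` and `beta=1` instead gives the version of the casino example with the rich friend: reflecting at 0 and absorbing at N.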

4.5.3 The modified NCD model


The simple NCD model given in Example 4.2 can be modified with a number
of improvements. One such improvement is to have the following states:
State 0: no discount
State 1: 25% discount
State 2: 40% discount
State 3: 60% discount
The transition rules are as before except that when there is a claim during the
current year, the discount status moves down either two levels if there was a
claim in the previous year, or one level if the previous year was claim-free.
It is clear that the discount status of a policyholder at time $n$, $X_n$, does not
form a Markov chain since the future discount status depends not only
on the current status but also on the previous year's status.
For example
$$P[X_{n+1} = 1 \mid X_n = 2, X_{n-1} = 1] > 0, \qquad (18)$$
whereas
$$P[X_{n+1} = 1 \mid X_n = 2, X_{n-1} = 3] = 0. \qquad (19)$$

A Markov chain can be constructed from this non-Markov chain by splitting
state 2 into two states defined as:

$2^+$: 40% discount and no claim in the previous year, that is, the state
corresponding to $\{X_n = 2, X_{n-1} = 1\}$.

$2^-$: 40% discount and claim in the previous year, that is, the state
corresponding to $\{X_n = 2, X_{n-1} = 3\}$.

Assuming that the probability of making no claims in any year is still 3/4, the
Markov chain on the modified state space $S' = \{0, 1, 2^+, 2^-, 3\}$ has transition
graph given by Figure 6, and 1-step transition matrix (with rows and columns
ordered as in $S'$) given by

$$\mathbf{P} = \begin{pmatrix} 1/4 & 3/4 & 0 & 0 & 0 \\ 1/4 & 0 & 3/4 & 0 & 0 \\ 0 & 1/4 & 0 & 0 & 3/4 \\ 1/4 & 0 & 0 & 0 & 3/4 \\ 0 & 0 & 0 & 1/4 & 3/4 \end{pmatrix}.$$

Figure 6: Transition graph for the modified NCD process. Reproduced with
permission of the Faculty and Institute of Actuaries.

Note that a policyholder can only be in state $2^+$ by moving up from state
1, and in state $2^-$ by moving down from state 3. Hence equations (18) and
(19) become
$$P[X_{n+1} = 1 \mid X_n = 2^+] = 1/4,$$
and
$$P[X_{n+1} = 1 \mid X_n = 2^-] = 0,$$
respectively. The transition probabilities are now determined by the current
discount status only and the process is Markov.
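One way to check the matrix above is to encode the transition rules on the split state space and build the matrix from them. The sketch below is illustrative: the state labels and the helpers `next_state` and `build_matrix` are our own, and the rule tables are our reading of the rules in this section.

```python
from fractions import Fraction

STATES = ["0", "1", "2+", "2-", "3"]

def next_state(state, claim):
    """Next state under the modified NCD rules on the split state space."""
    if not claim:
        # Claim-free year: move up one discount level, or stay at the top.
        return {"0": "1", "1": "2+", "2+": "3", "2-": "3", "3": "3"}[state]
    # Claim this year: from 2- (claim last year too) drop two levels to 0;
    # from 2+ and 3 (last year claim-free) drop one level; from 0 and 1 the
    # policyholder ends at level 0 whichever rule applies.
    return {"0": "0", "1": "0", "2+": "1", "2-": "0", "3": "2-"}[state]

def build_matrix(p_no_claim):
    """One-step transition matrix with rows and columns ordered as STATES."""
    index = {s: k for k, s in enumerate(STATES)}
    matrix = [[Fraction(0)] * len(STATES) for _ in STATES]
    for s in STATES:
        matrix[index[s]][index[next_state(s, False)]] += p_no_claim
        matrix[index[s]][index[next_state(s, True)]] += 1 - p_no_claim
    return matrix

P = build_matrix(Fraction(3, 4))
```

Because `next_state` needs only the current (split) state and whether a claim occurs, the process on the enlarged state space depends on its past only through its current state, which is exactly the Markov property.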

4.5.4 A model of accident proneness


An insurance company may want to use the whole history of the claims from
a given driver to estimate his/her accident proneness. Let $Y_i$ be the number
of claims during period $i$. In the simplest model, we may assume that there
can be no more than 1 claim per period, so $Y_i$ is either 0 or 1. By time
$t$ a driver has a history of the form $Y_1 = y_1, Y_2 = y_2, \dots, Y_t = y_t$, where
$y_i \in \{0, 1\}$, $i = 1, \dots, t$. Based on this history, the probability of a future claim
can be estimated, say, as

$$P[Y_{t+1} = 1 \mid Y_1 = y_1, Y_2 = y_2, \dots, Y_t = y_t] = \frac{f(y_1 + y_2 + \dots + y_t)}{g(t)},$$

where $f, g$ are two given increasing functions satisfying $0 \le f(m) \le g(m)$ for all $m$.


The stochastic process $\{Y_t, t = 0, 1, 2, \dots\}$ does not have the Markov property
(11). However, the cumulative number of claims from the driver, given by
$$X_t = \sum_{i=1}^{t} Y_i,$$
satisfies (11). Indeed,

$$P[X_{t+1} = 1 + x_t \mid X_1 = x_1, X_2 = x_2, \dots, X_t = x_t]
= P[Y_{t+1} = 1 \mid Y_1 = x_1, Y_2 = x_2 - x_1, \dots, Y_t = x_t - x_{t-1}] = \frac{f(x_t)}{g(t)},$$

which does not depend on the past history $x_1, x_2, \dots, x_{t-1}$. Thus,
$\{X_t, t = 0, 1, 2, \dots\}$ is a Markov chain (in general a time-inhomogeneous one,
since the transition probability $f(x_t)/g(t)$ depends on $t$).

4.5.5 General principles of modelling using Markov chains
In this section we summarise the examples above and identify the key steps in
modelling real-life situations using Markov chains. For simplicity, we discuss
only time-homogeneous models here.

Step 1. Setting up the state space: The most natural choice
is usually to identify the state space of the Markov chain with a set
of observations. For example, this is the case for the NCD system
from Example 4.2, where we set up the state space $S = \{0, 1, 2\}$
in correspondence with the possible discounts {0%, 30%, 60%}. However,
as we saw in the example with the modified NCD system (see Section
4.5.3), such a natural state space may not be suitable to form a Markov
chain, because the Markov property may fail. In the modified NCD
example, a small modification of the state space allowed us to construct
a Markov chain.

Step 2. Estimating the transition probabilities: Once the state space
is determined, the Markov model must be fitted to the data by estimating
the transition probabilities. In the NCD model (Example 4.2) we
have just claimed that "the company believes that the chance of claiming
each year ... has a probability of 1/4". In practice, however, the
transition probabilities should be estimated from the data. Naturally,
the probability $p_{ij}$ of transition from state $i$ to state $j$ should be
estimated as the number of transitions from $i$ to $j$, divided by the total number
of transitions from state $i$. More formally, let $x_1, x_2, \dots, x_N$ be a set of
available observations, $n_i$ be the number of times $t$ ($1 \le t \le N - 1$)
such that $x_t = i$, and $n_{ij}$ be the number of times $t$ ($1 \le t \le N - 1$) such
that $x_t = i$ and $x_{t+1} = j$. Then the best estimate for $p_{ij}$ is $\hat{p}_{ij} = \frac{n_{ij}}{n_i}$,
and the 95% confidence interval can be approximated as
$$\left( \hat{p}_{ij} - 1.96 \sqrt{\frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{n_i}},\ \ \hat{p}_{ij} + 1.96 \sqrt{\frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{n_i}} \right).$$

This follows from the fact that the conditional distribution of $N_{ij}$ given
$N_i$ is binomial with parameters $N_i$ and $p_{ij}$.

Step 3. Checking the Markov property: Once the state space and
transition probabilities are found, the model is fully determined. But,
to ensure that the fit of the model to the data is adequate, we need to
check that the Markov property seems to hold. In practice, it is often
considered sufficient to look at triplets of successive observations. For
a set of observations $x_1, x_2, \dots, x_N$, let $n_{ijk}$ be the number of times $t$
($1 \le t \le N - 2$) such that $x_t = i$, $x_{t+1} = j$, and $x_{t+2} = k$. If the Markov
property holds, $n_{ijk}$ is an observation from a Binomial distribution with
parameters $n_{ij}$ and $p_{jk}$. An effective test to check this is a $\chi^2$ test: the
statistic

$$X^2 = \sum_i \sum_j \sum_k \frac{(n_{ijk} - n_{ij}\hat{p}_{jk})^2}{n_{ij}\hat{p}_{jk}}$$

should approach the $\chi^2$ distribution with $r = |S|^3$ degrees of freedom.
For example, if $|S| = 4$, the statistic $X^2$ does not exceed the critical
level 83.675 with probability 95%. Thus, exceeding this level is a
strong indication that the Markov property does not hold. The chi-square
distribution table up to the level $r = 1000$ can be found, say, at
http://www.medcalc.org/manual/chi-square-table.php

Step 4. Using the model: Once the model parameters are determined
and the Markov property checked, we can use the established
model to estimate different quantities of interest. In particular, we
have used the Markov model for Example 4.2 to address questions
like "What is the probability that a policyholder initially in the 0%-state
is in the 0%-state after 2 years?" (see Example 4.6). If the Markov
model is too complicated to answer questions of this type analytically,
we can use Monte Carlo simulation (see chapter 2). Simulating a
time-homogeneous Markov chain is relatively straightforward. In addition
to commercial simulation packages, even standard spreadsheet software
can easily cope with the practical aspects of estimating transition
probabilities and performing a simulation.
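
Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the three-state matrix is an assumed NCD-style example (claim probability 1/4, one level down after a claim) rather than a matrix quoted from this module, and cells with zero expected count are simply skipped in the $X^2$ statistic.

```python
import random
from collections import Counter
from math import sqrt

def simulate_chain(P, start, n_steps, rng):
    """Step 4: simulate a time-homogeneous chain on states 0..len(P)-1."""
    states = list(range(len(P)))
    path = [start]
    for _ in range(n_steps):
        # Draw the next state using the row of P for the current state.
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path

def estimate_transitions(obs):
    """Step 2: return {(i, j): (p_hat, lower, upper)} with 95% intervals."""
    n_i = Counter(obs[:-1])                  # visits to i at times 1..N-1
    n_ij = Counter(zip(obs[:-1], obs[1:]))   # observed transitions i -> j
    out = {}
    for (i, j), count in n_ij.items():
        p_hat = count / n_i[i]
        half = 1.96 * sqrt(p_hat * (1 - p_hat) / n_i[i])
        out[(i, j)] = (p_hat, p_hat - half, p_hat + half)
    return out

def triplet_chi_square(obs):
    """Step 3: X^2 statistic built from triplets of successive observations."""
    p_hat = {pair: est[0] for pair, est in estimate_transitions(obs).items()}
    n_ij = Counter(zip(obs[:-2], obs[1:-1]))            # pairs with t <= N-2
    n_ijk = Counter(zip(obs[:-2], obs[1:-1], obs[2:]))  # triplets
    states = sorted(set(obs))
    statistic = 0.0
    for i in states:
        for j in states:
            for k in states:
                expected = n_ij[(i, j)] * p_hat.get((j, k), 0.0)
                if expected > 0:             # assumption: skip empty cells
                    statistic += (n_ijk[(i, j, k)] - expected) ** 2 / expected
    return statistic

# Assumed three-state NCD-style matrix (illustrative, not taken from the text).
P = [[0.25, 0.75, 0.0],
     [0.25, 0.0, 0.75],
     [0.0, 0.25, 0.75]]
rng = random.Random(2404)
path = simulate_chain(P, start=0, n_steps=5000, rng=rng)
estimates = estimate_transitions(path)
statistic = triplet_chi_square(path)
```

The value of `statistic` would then be compared with the critical level of the $\chi^2$ distribution with $|S|^3$ degrees of freedom, as described in Step 3.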

4.6 Stationary distributions


In many cases the distribution of $X_n$ converges to a limit in the sense that

$$P(X_n = j \mid X_0 = i) \to \pi_j \quad \text{as } n \to \infty, \tag{20}$$

and the limit is the same regardless of the starting point.

The distribution $\{\pi_j\}_{j \in S}$ is said to be a stationary distribution of a Markov
chain with transition matrix $P$ if

1. $\pi_j = \sum_{i \in S} \pi_i p_{ij}$ for all $j$, which can be expressed as $\pi = \pi P$, where $\pi$ is
a row vector and $\pi P$ is the usual vector-matrix product; and

2. $\pi_j \ge 0$ for all $j$ and $\sum_{j \in S} \pi_j = 1$.

The interpretation of $\pi$ is that, if the initial probability distribution is $\pi$, i.e.
$\pi_i = P(X_0 = i)$, then at time 1 the probability distribution of $X_1$ is again
given by $\pi$. Mathematically,

$$P(X_1 = j) = \sum_{i \in S} P(X_1 = j \mid X_0 = i)\, P(X_0 = i) = \sum_{i \in S} \pi_i p_{ij} = \pi_j.$$

By induction we have that

$$P(X_n = j) = \sum_{i \in S} P(X_n = j \mid X_{n-1} = i)\, P(X_{n-1} = i) = \sum_{i \in S} \pi_i p_{ij} = \pi_j.$$

Hence if the initial distribution for a Markov chain is a stationary distribution,
then $X_n$ has the same probability distribution for all $n$.
A general Markov chain does not necessarily have a stationary probability
distribution, and if it does it need not be unique. For instance, the unrestricted
random walk discussed in section 4.5 has no stationary distribution, and
the uniqueness of the stationary distribution in the restricted random walk
depends on the parameters $\alpha$ and $\beta$.
However, it is known that a Markov chain with finite state space has at least
one stationary probability distribution. This is stated without proof.
Whether the stationary distribution is unique is more subtle and requires
that we consider only irreducible chains. A chain is irreducible if
any state $j$ can be reached from any other state $i$ in a finite number of
steps. In other words, a chain is irreducible if for any pair of states $i$ and $j$
there exists an integer $n$ such that $p_{ij}^{(n)} > 0$. It is often sufficient to view the
transition graph to determine whether a Markov chain is irreducible or not.

Example 4.7. Question: Are the simple NCD, modified NCD, unrestricted
and restricted random walk processes irreducible?
Answer: It is clear from Figures 3, 4 & 6 that both NCD processes and
the unrestricted random walk are irreducible, as all states have a non-zero
probability of being reached from any other state in a finite number of steps.
For the restricted random walk, Figure 5 shows that it is irreducible unless
either boundary is absorbing, i.e. it is irreducible for $\alpha \ne 1$ and $\beta \ne 1$.

An irreducible Markov chain with a finite state space has a unique stationary
probability distribution. This is stated without proof.

Example 4.8. Question: Do the simple NCD, modified NCD, unrestricted
and restricted random walk processes have a unique stationary distribution?
Answer: The simple NCD process is irreducible and has a finite state space.
It therefore has a unique stationary distribution.
The modified NCD process is irreducible and has a finite state space. It
therefore has a unique stationary distribution.
The unrestricted random walk is irreducible but does not have a finite state
space, so the result above does not apply. Indeed, as noted earlier, it has no
stationary distribution.
The restricted random walk has a finite state space and is irreducible for
$\alpha \ne 1$ and $\beta \ne 1$. It therefore has a unique stationary distribution for $\alpha \ne 1$
and $\beta \ne 1$.

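Example 4.9. Question: Compute the stationary distribution for the
modified NCD model defined in section 4.5.
Answer: The conditions for a stationary distribution defined above lead to
the following expressions

$$\begin{aligned}
\pi_0 &= \tfrac{1}{4}\pi_0 + \tfrac{1}{4}\pi_1 + \tfrac{1}{4}\pi_{2-}, \\
\pi_1 &= \tfrac{3}{4}\pi_0 + \tfrac{1}{4}\pi_{2+}, \\
\pi_{2+} &= \tfrac{3}{4}\pi_1, \\
\pi_{2-} &= \tfrac{1}{4}\pi_3, \\
\pi_3 &= \tfrac{3}{4}\pi_{2+} + \tfrac{3}{4}\pi_{2-} + \tfrac{3}{4}\pi_3.
\end{aligned}$$

This system of equations is not linearly independent, since adding all the
equations results in an identity. This is a general feature of $\pi = \pi P$ due to
the property $\sum_{j \in S} p_{ij} = 1$.
We therefore discard one of the equations (discarding the last one will simplify
the system) and work in terms of a working variable, say $\pi_1$:

$$3\pi_0 - \pi_{2-} = \pi_1, \qquad 3\pi_0 + \pi_{2+} = 4\pi_1, \qquad \pi_{2+} = \tfrac{3}{4}\pi_1, \qquad 4\pi_{2-} - \pi_3 = 0.$$

This system is solved with

$$\pi_{2+} = \tfrac{3}{4}\pi_1, \qquad \pi_0 = \tfrac{13}{12}\pi_1, \qquad \pi_{2-} = \tfrac{9}{4}\pi_1, \qquad \pi_3 = 9\pi_1.$$

Using the requirement that $\sum_{j \in S} \pi_j = 1$, we arrive at the stationary distribution
(with the states ordered as $0, 1, 2+, 2-, 3$)

$$\pi = \left( \frac{13}{169}, \frac{12}{169}, \frac{9}{169}, \frac{27}{169}, \frac{108}{169} \right).$$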
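
The calculation in Example 4.9 can be checked with exact arithmetic. The sketch below is illustrative only: the transition matrix is not quoted from the text but read off from the stationarity equations above (states ordered 0, 1, 2+, 2-, 3), and `stationary_distribution` is a hypothetical helper that replaces the last stationarity equation by the normalisation condition and solves the resulting linear system by Gauss-Jordan elimination.

```python
from fractions import Fraction as F

# Transition matrix inferred from the stationarity equations of Example 4.9.
# State order: 0, 1, 2+, 2-, 3.
P = [
    [F(1, 4), F(3, 4), 0, 0, 0],
    [F(1, 4), 0, F(3, 4), 0, 0],
    [0, F(1, 4), 0, 0, F(3, 4)],
    [F(1, 4), 0, 0, 0, F(3, 4)],
    [0, 0, 0, F(1, 4), F(3, 4)],
]

def stationary_distribution(matrix):
    """Solve pi = pi P together with sum(pi) = 1 using exact fractions."""
    n = len(matrix)
    # Row j encodes sum_i pi_i p_ij - pi_j = 0 for j = 0..n-2.
    A = [[F(matrix[i][j]) - (1 if i == j else 0) for i in range(n)]
         for j in range(n - 1)]
    A.append([F(1)] * n)                 # normalisation: sum_j pi_j = 1
    b = [F(0)] * (n - 1) + [F(1)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] *= inv
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return b

pi = stationary_distribution(P)
```

Working with `Fraction` avoids rounding, so the result can be compared directly with the fractions obtained by hand.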

4.7 The long-term behaviour of Markov chains


It is natural to expect the distribution of a Markov chain to tend to the
stationary distribution $\pi$ for large times if $\pi$ exists. However, certain
phenomena can complicate this. For example, a state $i$ is said to be periodic
with period $d > 1$ if a return to $i$ is possible only in a number of steps that
is a multiple of $d$. More specifically, $p_{ii}^{(n)} = 0$ unless $n = md$ for some integer
$m$.
Any periodic behaviour is usually evident from the transition graph. For
example, both NCD models considered above are aperiodic; the unrestricted
random walk has period 2 and the restricted random walk is aperiodic unless
$\alpha$ and $\beta$ are either 0 or 1.
We state the following result about convergence of a Markov chain without
proof:
Let $p_{ij}^{(n)}$ be the $n$-step transition probability of an irreducible aperiodic Markov
chain on a finite state space. Then $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$ for each $i$ and $j$.

Example 4.10. Question: An insurance company has 10,000 policyholders
on the modified NCD system defined in section 4.5. Estimate the number of
policyholders on each discount rate.
Answer: The model is irreducible and aperiodic. Therefore, assuming that
the policies have been held for a sufficient length of time, the distribution of
policyholders amongst states is given by the stationary distribution computed
in Example 4.9. We would therefore expect the following distribution:

State 0: no discount     10,000 × 13/169 ≈ 769
State 1: 25% discount    10,000 × 12/169 ≈ 710
State 2: 40% discount    10,000 × (9/169 + 27/169) ≈ 2,130
State 3: 60% discount    10,000 × 108/169 ≈ 6,391
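
The convergence result and the figures of Example 4.10 can be illustrated numerically. In the sketch below the matrix is again the one inferred from the equations of Example 4.9 (state order 0, 1, 2+, 2-, 3), and the choice of 200 steps is an arbitrary assumption standing in for "a sufficient length of time".

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step_matrix(P, n):
    """Return the n-step transition matrix P^n by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(len(P))]
              for i in range(len(P))]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Modified NCD matrix inferred from Example 4.9 (states 0, 1, 2+, 2-, 3).
P = [[0.25, 0.75, 0.0, 0.0, 0.0],
     [0.25, 0.0, 0.75, 0.0, 0.0],
     [0.0, 0.25, 0.0, 0.0, 0.75],
     [0.25, 0.0, 0.0, 0.0, 0.75],
     [0.0, 0.0, 0.0, 0.25, 0.75]]

P_n = n_step_matrix(P, 200)
stationary = [13 / 169, 12 / 169, 9 / 169, 27 / 169, 108 / 169]

# Every row of P^n should be close to the stationary distribution.
max_gap = max(abs(P_n[i][j] - stationary[j]) for i in range(5) for j in range(5))

# Expected policyholders per discount level (2+ and 2- share the 40% level).
row = P_n[0]
expected_counts = {
    "0%": round(10_000 * row[0]),
    "25%": round(10_000 * row[1]),
    "40%": round(10_000 * (row[2] + row[3])),
    "60%": round(10_000 * row[4]),
}
```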

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.

Faculty and Institute of Actuaries, CT4 Core Reading;

D. R. Cox & H. D. Miller, The Theory of Stochastic Processes;

S. Ross, Stochastic Processes.

4.8 Summary
For discrete state spaces the Markov property is written as

$$P[X_t = a \mid X_{s_1} = x_1, X_{s_2} = x_2, \dots, X_{s_n} = x_n, X_s = x] = P[X_t = a \mid X_s = x],$$

for all $s_1 < s_2 < \dots < s_n < s < t$ and all states $a, x_1, x_2, \dots, x_n, x$ in $S$.
Any process with independent increments has the Markov property.
Markov chains are discrete-time and discrete-state-space stochastic processes
satisfying the Markov property. You should be familiar with the simple
NCD, modified NCD, unrestricted random walk and restricted random walk
processes.
In general, the $n$-step transition probabilities $p_{ij}(m, m + n)$ denote the
probability that a process in state $i$ at time $m$ will be in state $j$ at time $m + n$.
The transition probabilities of a Markov process satisfy the Chapman-Kolmogorov
equations:

$$p_{ij}(m, n) = \sum_{k \in S} p_{ik}(m, l)\, p_{kj}(l, n),$$

for all states $i, j \in S$ and all integer times $m < l < n$. This can be expressed
in terms of $n$-step stochastic matrices as

$$P(m, n) = P(m, l)\, P(l, n).$$

An irreducible time-homogeneous Markov chain with a finite state space has
a unique stationary probability distribution, $\pi$, such that

$$\pi = \pi P^{(n)}.$$

Aperiodic processes will converge to the stationary distribution as $n \to \infty$.

Questions

1. Consider a Markov chain with state space $S = \{0, 1, 2\}$ and transition
matrix

$$P = \begin{pmatrix} p & q & 0 \\ 1/4 & 0 & 3/4 \\ p - 1/2 & 7/10 & 1/5 \end{pmatrix}.$$

(a) Calculate values for $p$ and $q$.
(b) Draw the transition graph for the process.
(c) Calculate the transition probabilities $p_{ij}^{(3)}$.
(d) Find any stationary distributions for the process.
2. Prove equation (16) relating the probability of a particular path occur-
ring in a Markov chain.
3. A No-Claims Discount system operated by a motor insurer has the
following four levels:
Level 1: 0% discount;
Level 2: 25% discount;
Level 3: 40% discount;
Level 4: 60% discount.
The rules for moving between these levels are as follows:
Following a year with no claims, move to the next higher level, or
remain at level 4.
Following a year with one claim, move to the next lower level, or
remain at level 1.
Following a year with two or more claims, move down two levels,
or move to level 1 (from level 2) or remain at level 1.
For a given policyholder in a given year the probability of no claims is
0.85 and the probability of making one claim is 0.12. Xt denotes the
level of the policyholder in year t.
(i) Explain why Xt is a Markov chain. Write down the transition
matrix of this chain.
(ii) Calculate the probability that a policyholder who is currently at
level 2 will be at level 2 after:

i. one year.
ii. two years.
iii. three years.
(iii) Explain whether the chain is irreducible and/or aperiodic.
(iv) Does this Markov chain converge to a stationary distribution?
(v) Calculate the long-run probability that a policyholder is in dis-
count level 2.

Chapter 5
Markov Jump Processes
A Markov jump process is a stochastic process with discrete state space and
continuous time set, which has the Markov property.
The mathematical development of Markov jump processes is similar to that of Markov
chains considered in the previous chapter. For example, the Chapman-Kolmogorov
equations have the same format. However, Markov jump processes
are in continuous time, so the notion of a one-step transition probability
does not exist and we are forced to consider time intervals of arbitrarily
small length. Taking the limit of these intervals to zero leads to the
reformulation of the Chapman-Kolmogorov equations in terms of differential
equations.
We begin by discussing the Poisson process, which is the simplest example of
a Markov jump process. In doing so we will encounter some general features
of Markov jump processes.

5.1 Poisson process


The Poisson process $\{N_t\}_{t \in [0, \infty)}$ is an example of a counting process. That
is, it has state space $S = \{0, 1, 2, \dots, n, \dots\}$ corresponding to the number
of occurrences of some event. The events occur singly and can occur at
any time. Counting processes are useful in modelling customers in a queue,
insurance claims or car accidents, for example.
Informally, a counting process (which counts, for example, customers in a
queue) is a Poisson process if customers arrive independently and uniformly
in time, i.e. with a constant rate $\lambda$ of customers per time unit. Thus,
in a time interval of length $h$ we would expect on average about $\lambda h$ customers.
To make the above intuition more formal, let us assume that the time interval
$(t, t + h)$ is very short, such that the probability of two or more events during
this interval can be neglected. In this case the expected number of events
(which should be about $\lambda h$ by the intuition above) is $0 \cdot (1 - p) + 1 \cdot p = p$,
where $p$ is the probability that an event occurs.
Formally, the probability of an event in any short time interval $(t, t + h)$ is
$\lambda h + o(h)$, where a function $f$ is said to be $o(h)$ if

$$\lim_{h \to 0} \frac{f(h)}{h} = 0.$$

The Poisson process can then be defined as follows:

The counting process $\{N_t\}_{t \in [0, \infty)}$ is said to be a Poisson process with rate
$\lambda > 0$ if

1. $N_0 = 0$;

2. the process has stationary and independent increments;

3. $P(N_{t+h} - N_t = 1) = \lambda h + o(h)$;

4. $P(N_{t+h} - N_t > 1) = o(h)$.

Example 5.1. Question: Prove that a Poisson process is a Markov jump
process.
Answer: A Poisson process has independent increments (property 2),
therefore it has the Markov property. The state space $S = \{0, 1, 2, \dots, n, \dots\}$
of the process is discrete, and the time set $t \in [0, \infty)$ is continuous, thus it is a
Markov jump process by definition.
It is possible to show that the Poisson process defined above coincides with
the other standard definition of the Poisson process, that is, a process having
independent stationary Poisson-distributed increments. More formally,
for any $t > 0$, $N_t$ follows a Poisson distribution with parameter $\lambda t$, that is

$$P(N_t = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}, \quad \text{for any } n = 0, 1, 2, \dots \tag{21}$$

More generally, for any $t, s > 0$, $N_{t+s} - N_s$ has the same probability
distribution as $N_t$.

Example 5.2. Question: Prove that the two definitions of a Poisson
process are consistent.
Answer: Define the probability that there have been $n$ events by time $t$ as
$p_n(t) = P(N_t = n)$. Then,

$$\begin{aligned}
p_0(t + h) &= P(N_{t+h} = 0) \\
&= P(N_t = 0,\ N_{t+h} - N_t = 0) \\
&= P(N_t = 0)\, P(N_{t+h} - N_t = 0) \\
&= p_0(t)\,(1 - \lambda h + o(h)).
\end{aligned}$$

Rearranging this equation and dividing by $h$ yields

$$\frac{p_0(t + h) - p_0(t)}{h} = -\lambda p_0(t) + \frac{o(h)\, p_0(t)}{h}.$$

Taking the limit as $h \to 0$ leads to the differential equation

$$\frac{dp_0(t)}{dt} = -\lambda p_0(t),$$

with the initial condition $p_0(0) = 1$. It is clear that this has solution

$$p_0(t) = e^{-\lambda t}. \tag{22}$$

Similarly, for $n \ge 1$,

$$\begin{aligned}
p_n(t + h) &= P(N_{t+h} = n) \\
&= P(N_t = n,\ N_{t+h} - N_t = 0) + P(N_t = n - 1,\ N_{t+h} - N_t = 1) + o(h) \\
&= P(N_t = n)\, P(N_{t+h} - N_t = 0) + P(N_t = n - 1)\, P(N_{t+h} - N_t = 1) + o(h) \\
&= p_n(t)\, p_0(h) + p_{n-1}(t)\, p_1(h) + o(h) \\
&= (1 - \lambda h)\, p_n(t) + \lambda h\, p_{n-1}(t) + o(h).
\end{aligned}$$

Rearranging this for $p_n(t + h)$, and again taking the limit as $h \to 0$, we obtain
the differential equation

$$\frac{dp_n(t)}{dt} = -\lambda p_n(t) + \lambda p_{n-1}(t), \tag{23}$$

for $n = 1, 2, 3, \dots$.
It can be shown by mathematical induction, or using generating functions,
that the solution to the differential equation (23), with initial condition
$p_n(0) = 0$, yields equation (21), as required.
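
As a numerical sanity check (not a proof), the differential equations (22) and (23) can be integrated with a simple forward Euler scheme and compared with the Poisson probabilities (21). The function names, the rate 2.0, the time horizon 1.0 and the step count below are illustrative assumptions.

```python
from math import exp, factorial

def poisson_probabilities_by_euler(rate, t, n_max, n_steps=20000):
    """Integrate p_0' = -rate*p_0 and p_n' = -rate*p_n + rate*p_{n-1}."""
    h = t / n_steps
    p = [1.0] + [0.0] * n_max            # p_0(0) = 1, p_n(0) = 0 for n >= 1
    for _ in range(n_steps):
        new_p = [p[0] - h * rate * p[0]]
        for n in range(1, n_max + 1):
            new_p.append(p[n] + h * (-rate * p[n] + rate * p[n - 1]))
        p = new_p
    return p

def poisson_pmf(rate, t, n):
    """Right-hand side of equation (21)."""
    return (rate * t) ** n / factorial(n) * exp(-rate * t)

numeric = poisson_probabilities_by_euler(rate=2.0, t=1.0, n_max=5)
exact = [poisson_pmf(2.0, 1.0, n) for n in range(6)]
largest_error = max(abs(a - b) for a, b in zip(numeric, exact))
```

Forward Euler is chosen only for transparency; its error shrinks in proportion to the step size, so `largest_error` should be small but not zero.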

A Poisson process takes non-negative integer values and can jump at any time
$t \in [0, \infty)$. However, since time is continuous, the probability of a jump at
any specific time point $t$ is zero. The process can be pictured as the upwards
staircase shown in Figure 7.

5.1.1 Interarrival times


Since the Poisson process changes only by unit upward jumps, its sample
paths are fully characterised by the times at which the jumps take place.
Consider a Poisson process and let $\tau_1$ be the time at which the first event
occurs and let $\tau_n$ for $n > 1$ denote the time between the $(n - 1)$th and the
$n$th event. It is clear that $\tau_n$ for $n \ge 1$ is a continuous random variable which
takes values in the range $[0, \infty)$.

Figure 7: Sample Poisson process. Horizontal distance is time.

The sequence $\{\tau_n\}_{n \ge 1}$ is called the sequence of interarrival times (or holding
times). These are the horizontal distances between each step in Figure 7.
The random variables $\tau_1, \tau_2, \dots$ are i.i.d., each having the exponential
distribution with parameter $\lambda$. They therefore each have the density function

$$f(t) = \lambda e^{-\lambda t} \quad \text{for } t > 0. \tag{24}$$

To demonstrate this for general $\tau_n$, first consider $\tau_1$ and note that the event
$\tau_1 > t$ occurs if and only if there are zero events of the Poisson process in the
fixed interval $(0, t]$, that is

$$P(\tau_1 > t) = P(N_t = 0) = e^{-\lambda t}.$$

The distribution function of $\tau_1$ is therefore

$$P(\tau_1 \le t) = 1 - e^{-\lambda t},$$

and so $\tau_1$ is exponentially distributed with parameter $\lambda$.

Now consider the distribution of $\tau_2$ conditional on $\tau_1$:

$$\begin{aligned}
P(\tau_2 > t \mid \tau_1 = s) &= P(\text{0 events in } (s, s + t] \mid \tau_1 = s) \\
&= P(N_{t+s} - N_s = 0 \mid \tau_1 = s) \\
&= P(N_{t+s} - N_s = 0) \quad \text{(by independent increments)} \\
&= p_0(t) = e^{-\lambda t}.
\end{aligned}$$

Therefore $\tau_2$ is independent of $\tau_1$ and has the same exponential distribution
as $\tau_1$.
The same argument can be repeated for $\tau_3, \tau_4, \dots$, leading to the conclusion
that the interarrival times are i.i.d. random variables that are exponentially
distributed with parameter $\lambda$.
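
The characterisation by interarrival times gives a direct way to simulate a Poisson process: add up independent exponential holding times until the time horizon is exceeded. The following is a minimal sketch; the function name, the rate 3.0, the horizon 2.0 and the seed are illustrative assumptions.

```python
import random

def simulate_jump_times(rate, horizon, rng):
    """Return the jump times of a Poisson process with the given rate on [0, horizon]."""
    jump_times = []
    current_time = rng.expovariate(rate)      # first interarrival time
    while current_time <= horizon:
        jump_times.append(current_time)
        current_time += rng.expovariate(rate)  # next independent holding time
    return jump_times

rng = random.Random(5)
counts = [len(simulate_jump_times(3.0, 2.0, rng)) for _ in range(20000)]
mean_count = sum(counts) / len(counts)        # should be close to rate * horizon
```

The number of jumps in $[0, t]$ obtained this way plays the role of $N_t$, so its sample mean over many runs should be close to the rate multiplied by $t$.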
Further, it can be shown using similar arguments that if $N_t'$ and $N_t''$ are
two independent Poisson processes with parameters $\lambda_1$ and $\lambda_2$ respectively,
then their sum $N_t = N_t' + N_t''$ is a Poisson process with parameter $\lambda_1 + \lambda_2$.
This result follows immediately from our intuitive interpretation of a Poisson
process: assume that male customers are arriving uniformly with rate $\lambda_1$,
and female customers are arriving independently and uniformly with rate $\lambda_2$.
Then $N_t'$ describes the cumulative number of male customers and $N_t''$ the
cumulative number of female customers, so $N_t = N_t' + N_t''$ is the total number of
customers, who clearly also arrive uniformly, with rate $\lambda_1 + \lambda_2$.
This can be extended to the sum of any number of Poisson processes and is
a very useful result.

Example 5.3. An insurance company assumes that the number of claims on
an individual motor insurance policy in a year is a Poisson random variable
with parameter $q$. Claims in successive time intervals are assumed to be
independent. The company holds 10,000 such motor insurance policies, which
are assumed to be independent.
For 10,000 independent policies, the total number of claims in any year will
therefore be Poisson with mean $10{,}000q$.
The total number of claims on a policy in a two-year period is a Poisson
random variable with mean $2q$.

5.1.2 Compound Poisson process


The Poisson process $\{N_t\}_{t \in [0, \infty)}$ is a natural model for counting the number of
claims reaching an insurance company during the time period $[0, t]$. In practice,
however, the cumulative size of the claims is more important. If $Y_i$ is the size
of claim $i$, the cumulative size is given by $X_t = \sum_{i=1}^{N_t} Y_i$. The simplest model
is to assume that all claims $Y_i$ are independent and identically distributed.
In this case the stochastic process $\{X_t\}_{t \in [0, \infty)}$ is called a compound Poisson
process.
Formally, a compound Poisson process with rate $\lambda > 0$ and jump size
distribution $F$ is a continuous-time stochastic process given by

$$X_t = \sum_{i=1}^{N_t} Y_i, \tag{25}$$

where $\{N_t\}_{t \in [0, \infty)}$ is a Poisson process with rate $\lambda$, and $\{Y_i,\ i \ge 1\}$ are
independent and identically distributed random variables, with distribution
function $F$, which are also independent of $N_t$.
The expected value and variance of the compound Poisson process are
given by

$$E[X_t] = \lambda t\, E[Y], \qquad Var[X_t] = \lambda t\, E[Y^2], \tag{26}$$

where $Y$ is a random variable with distribution function $F$.

Example 5.4. In Example 5.3 assume that the size of each claim is a random
variable uniformly distributed on $[a, b]$. All claim sizes are independent.
What are the mean and variance of the cumulative size of the claims from all
policies during 3 years?
Answer: The cumulative size of the claims is the compound Poisson process
$X_t = \sum_{i=1}^{N_t} Y_i$, where $N_t$ is the number of claims from all policies, which is
the Poisson process with parameter $\lambda = 10{,}000q$, and $Y_i$ is the size of claim $i$.
Then

$$E[Y_i] = \frac{1}{b - a} \int_a^b x \, dx = \frac{a + b}{2}, \qquad E[Y_i^2] = \frac{1}{b - a} \int_a^b x^2 \, dx = \frac{a^2 + ab + b^2}{3},$$

which gives

$$E[X_3] = 3\lambda E[Y_i] = 30{,}000q \cdot \frac{a + b}{2} = 15{,}000q(a + b),$$

and

$$Var[X_3] = 3\lambda E[Y^2] = 30{,}000q \cdot \frac{a^2 + ab + b^2}{3} = 10{,}000q(a^2 + ab + b^2).$$
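
The arithmetic of Example 5.4 can be reproduced from formula (26). In the sketch below the helper names and the numerical values $q = 0.1$, $a = 100$, $b = 500$ are hypothetical and chosen only to illustrate the calculation.

```python
def uniform_moments(a, b):
    """First and second moments of a uniform distribution on [a, b]."""
    return (a + b) / 2, (a * a + a * b + b * b) / 3

def compound_poisson_moments(rate, t, mean_y, second_moment_y):
    """Mean and variance of X_t from formula (26)."""
    return rate * t * mean_y, rate * t * second_moment_y

# Hypothetical inputs: claim rate per policy q, claim sizes uniform on [a, b].
q, a, b = 0.1, 100.0, 500.0
rate = 10_000 * q                          # all 10,000 policies combined
mean_y, second_moment_y = uniform_moments(a, b)
mean_x3, var_x3 = compound_poisson_moments(rate, 3, mean_y, second_moment_y)

# Closed-form expressions derived in Example 5.4, for comparison.
closed_form_mean = 15_000 * q * (a + b)
closed_form_var = 10_000 * q * (a * a + a * b + b * b)
```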

Assume that a company has initial capital $u$, premium rate $c$, and the cumulative
claims size $X_t$ is given by (25). Then the basic problem in risk theory
is to estimate the probability of ruin at time $t > 0$, defined as

$$\psi_t(u) = P[u + ct - X_t < 0]. \tag{27}$$

5.2 The time-inhomogeneous Markov jump process


Similar to the Markov chain, we introduce transition probabilities for a general
Markov jump process

$$p_{ij}(s, t) = P[X_t = j \mid X_s = i], \quad \text{where } p_{ij}(s, t) \ge 0 \text{ and } s < t. \tag{28}$$

The transition probabilities must also satisfy the Chapman-Kolmogorov equations

$$p_{ij}(t_1, t_3) = \sum_{k \in S} p_{ik}(t_1, t_2)\, p_{kj}(t_2, t_3), \quad \text{for } t_1 < t_2 < t_3. \tag{29}$$

In matrix form, these are expressed as

$$P(t_1, t_3) = P(t_1, t_2)\, P(t_2, t_3).$$

The proof of these is analogous to that for equation (15) in discrete time,
and is left as a question at the end of the chapter.
We require that the transition probabilities satisfy the continuity condition

$$\lim_{t \to s^+} p_{ij}(s, t) = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \tag{30}$$

This condition means that as the time difference between two observations
approaches zero, the probability that the process does not change its state
approaches one.
It is easy to see that this condition is consistent with the Chapman-Kolmogorov
equations. Indeed, taking the limits $t_2 \to t_3^-$ or $t_2 \to t_1^+$ in equation (29) we
obtain an identity.
However, this condition does not follow from the Chapman-Kolmogorov
equations. For example, $p_{ij}(s, t) = \frac{1}{2}$ for $i, j = 1, 2$ satisfy equation (29),
since

$$\begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix},$$

but they do not satisfy (30).

5.3 Transition rates


Let us assume that transition probabilities $p_{ij}(s, t)$ for $t > s$ have derivatives
with respect to $t$ and $s$. Also, assume for simplicity that the state space $S$ is
finite. Then by the standard definition of a derivative we have

$$\begin{aligned}
\frac{\partial p_{ij}(s, t)}{\partial t} &= \lim_{h \to 0} \frac{p_{ij}(s, t + h) - p_{ij}(s, t)}{h} \\
&= \lim_{h \to 0} \frac{\sum_k p_{ik}(s, t)\, p_{kj}(t, t + h) - p_{ij}(s, t)}{h} \\
&= \lim_{h \to 0} \left( \sum_{k \ne j} p_{ik}(s, t) \frac{p_{kj}(t, t + h)}{h} + p_{ij}(s, t) \frac{p_{jj}(t, t + h) - 1}{h} \right).
\end{aligned} \tag{31}$$

It follows from equation (31) that the quotients inside the bracket approach certain
limits as $h \to 0$. In particular, we define

$$\begin{aligned}
\lim_{h \to 0} \frac{p_{jj}(t, t + h) - 1}{h} &:= q_{jj}(t), \\
\lim_{h \to 0} \frac{p_{kj}(t, t + h)}{h} &:= q_{kj}(t), \quad \text{for } k \ne j.
\end{aligned} \tag{32}$$

The quantities $q_{jj}(t)$, $q_{kj}(t)$ are called transition rates. For $k \ne j$, $q_{kj}(t)$ corresponds to
the rate of transition from state $k$ to state $j$ in a small time interval $h$, given
that state $k$ is occupied at time $t$.
Transition probabilities $p_{kj}(t, t + h)$ can be expressed through the transition
rates as

$$p_{kj}(t, t + h) = \begin{cases} h\, q_{kj}(t) + o(h), & k \ne j \\ 1 + h\, q_{jj}(t) + o(h), & k = j \end{cases} \tag{33}$$

It follows from equation (31) that

$$\frac{\partial p_{ij}(s, t)}{\partial t} = \sum_{k \in S} p_{ik}(s, t)\, q_{kj}(t). \tag{34}$$

These differential equations are called Kolmogorov's forward equations. In
matrix form, they can be written as

$$\frac{\partial P(s, t)}{\partial t} = P(s, t)\, Q(t),$$

where $Q(t)$ is called the generator matrix with entries $q_{ij}(t)$.
Repeating the procedure but differentiating with respect to $s$, we have

$$\begin{aligned}
\frac{\partial p_{ij}(s, t)}{\partial s} &= \lim_{h \to 0} \frac{p_{ij}(s + h, t) - p_{ij}(s, t)}{h} \\
&= \lim_{h \to 0} \frac{p_{ij}(s + h, t) - \sum_k p_{ik}(s, s + h)\, p_{kj}(s + h, t)}{h} \\
&= -\lim_{h \to 0} \left( \sum_{k \ne i} p_{kj}(s + h, t) \frac{p_{ik}(s, s + h)}{h} + p_{ij}(s + h, t) \frac{p_{ii}(s, s + h) - 1}{h} \right).
\end{aligned}$$

Therefore

$$\frac{\partial p_{ij}(s, t)}{\partial s} = -\sum_{k \in S} q_{ik}(s)\, p_{kj}(s, t), \tag{35}$$

and we see that the derivative with respect to $s$ can also be expressed in
terms of the transition rates. The differential equations (35) are called
Kolmogorov's backward equations. In matrix form these are written as

$$\frac{\partial P(s, t)}{\partial s} = -Q(s)\, P(s, t).$$

Therefore if transition probabilities $p_{ij}(s, t)$ for $t > s$ have derivatives with
respect to $t$ and $s$, transition rates are well-defined and given by equation
(32).
Alternatively, if we can assume the existence of transition rates, then it
follows that transition probabilities $p_{ij}(s, t)$ for $t > s$ have derivatives with
respect to $t$ and $s$, given by equations (34) and (35). These equations are
compatible, and we may ask whether we can find transition probabilities,
given transition rates, by solving equations (34) and (35).
It can be shown that each row of the generator matrix $Q(s)$ has zero sum.
That is,

$$q_{ii}(s) = -\sum_{j \ne i} q_{ij}(s).$$

The residual holding time for a general Markov jump process is denoted $R_s$.
This is the random amount of time between time $s$ and the next jump:

$$\{R_s > w,\ X_s = i\} = \{X_u = i,\ s \le u \le s + w\}.$$

It can be proved that

$$P(R_s > w \mid X_s = i) = \exp\left( \int_s^{s+w} q_{ii}(t)\, dt \right).$$

Similarly, the current holding time is denoted $C_t$. This is the time between
the last jump and time $t$:

$$\{C_t \ge w,\ X_t = j\} = \{X_u = j,\ t - w \le u \le t\}.$$

We will not study these questions further for general Markov processes, but
will investigate such and related questions for time-homogeneous Markov
processes below.

5.4 Time-homogeneous Markov jump processes
Just as we defined time-homogeneous Markov chains (equation (17)), we can
define time-homogeneous Markov jump processes.
Consider the transition probabilities for a Markov process given by equation
(28). A Markov process in continuous time is called time-homogeneous if the
transition probabilities satisfy $p_{ij}(s, t) = p_{ij}(0, t - s)$ for all $i, j \in S$ and $s, t > 0$.
In other words, a Markov process in continuous time is called time-homogeneous
if the probability $P(X_t = j \mid X_s = i)$ depends only on the time interval $t - s$.
In this case we can write

$$\begin{aligned}
p_{ij}(s, t) &= P(X_t = j \mid X_s = i) = p_{ij}(t - s), \\
p_{ij}(t, t + s) &= P(X_{t+s} = j \mid X_t = i) = p_{ij}(s), \\
p_{ij}(0, t) &= P(X_t = j \mid X_0 = i) = p_{ij}(t).
\end{aligned}$$

Here, for example, the $p_{ij}(s)$ form a stochastic matrix for every $s$, that is

$$p_{ij}(s) \ge 0 \quad \text{and} \quad \sum_{j \in S} p_{ij}(s) = 1,$$

and are assumed to satisfy the continuity conditions at $s = 0$

$$\lim_{s \to 0^+} p_{ij}(s) = p_{ij}(0) = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases}$$

Also, the $p_{ij}(s)$ satisfy the Chapman-Kolmogorov equations, which, for a
time-homogeneous Markov process, take the form

$$p_{ij}(t + s) = \sum_{k \in S} p_{ik}(t)\, p_{kj}(s). \tag{36}$$

In matrix form, the Chapman-Kolmogorov equations become

$$P(t + s) = P(t)\, P(s). \tag{37}$$

Note that $P(0) = I$ is the identity matrix.


If a time-homogeneous Markov process is currently in state $i$, it follows from
equation (30) that the probability of remaining in $i$ is non-zero for all $t$, that
is, $p_{ii}(t) > 0$. Indeed, from equation (36) it follows that $p_{ii}(t) \ge p_{ii}(t/n)^n$ for
any integer $n$. For example,

$$p_{ii}(t) = \sum_{k \in S} p_{ik}(t/2)\, p_{ki}(t/2) \ge p_{ii}(t/2)\, p_{ii}(t/2).$$

The argument for different values of $n$ is similar. So, if for some $t$ we had
$p_{ii}(t) = 0$, this would imply $p_{ii}(t/n) = 0$ for all $n$, which contradicts
(30).
The following properties of transition functions and transition rates for a
time-homogeneous process are stated without proof:

1. Transition rates

$$q_{ij} = \left. \frac{dp_{ij}(t)}{dt} \right|_{t=0} = \lim_{h \to 0} \frac{p_{ij}(h) - \delta_{ij}}{h}$$

exist for all $i, j$. Equivalently, as $h \to 0$, $h > 0$,

$$p_{ij}(h) = \begin{cases} h\, q_{ij} + o(h), & i \ne j \\ 1 + h\, q_{ii} + o(h), & i = j \end{cases} \tag{38}$$

Comparing this to equation (33) we see that the only difference between
the time-homogeneous and time-inhomogeneous cases is that the transition
rates $q_{ij}$ are not allowed to change over time.

2. Transition rates are non-negative and finite for $i \ne j$, and are non-positive
when $i = j$, that is

$$q_{ij} \ge 0 \text{ for } i \ne j \quad \text{but} \quad q_{ii} \le 0.$$

Differentiating $\sum_{j \in S} p_{ij}(t) = 1$ with respect to $t$ at $t = 0$ yields

$$q_{ii} = -\sum_{j \ne i} q_{ij}.$$

3. If the set of states $S$ is finite, all transition rates are finite.

Kolmogorov's forward equations for a time-homogeneous process take the
form

$$\frac{dp_{ij}(t)}{dt} = \sum_{k \in S} p_{ik}(t)\, q_{kj},$$

and in matrix form

$$\frac{dP(t)}{dt} = P(t)\, Q,$$

where $Q$ is the generator matrix with entries $q_{kj}$.
Similarly, Kolmogorov's backward equations are:

$$\frac{dP(t)}{dt} = Q\, P(t).$$

Note that since $q_{ii} = -\sum_{j \ne i} q_{ij}$, each row of the matrix $Q$ has zero sum.
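
For a finite state space the forward equations can be integrated numerically, starting from the identity matrix at time 0. The sketch below uses a forward Euler scheme on a hypothetical two-state generator with rates `a` and `b` (values chosen arbitrarily) and compares the result with the known closed form for a two-state process, $p_{00}(t) = \frac{b}{a+b} + \frac{a}{a+b} e^{-(a+b)t}$.

```python
from math import exp

def solve_forward_equations(Q, t, n_steps=20000):
    """Integrate dP/dt = P Q with P(0) = I by the forward Euler method."""
    n = len(Q)
    h = t / n_steps
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(n_steps):
        PQ = [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        P = [[P[i][j] + h * PQ[i][j] for j in range(n)] for i in range(n)]
    return P

# Hypothetical two-state generator: rate a from state 0 to 1, rate b back.
a, b = 0.5, 1.5
Q = [[-a, a],
     [b, -b]]                      # each row sums to zero

P_t = solve_forward_equations(Q, t=1.0)
closed_form_p00 = b / (a + b) + a / (a + b) * exp(-(a + b) * 1.0)
```

Because each row of `Q` sums to zero, every Euler step preserves the row sums of `P`, so the rows of `P_t` remain probability distributions up to rounding.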

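Example 5.5. Consider the Poisson process again. The rate at which
events occur is a constant $\lambda$, leading to

$$q_{ij} = \begin{cases} \lambda, & j = i + 1 \\ 0, & j \ne i, i + 1 \\ -\lambda, & j = i \end{cases} \tag{39}$$

and $p_{ij}(t) = P(N_{t+s} = j \mid N_s = i)$.

The Kolmogorov forward equations are

$$\begin{aligned}
\frac{dp_{i0}(t)}{dt} &= -\lambda p_{i0}(t), \\
\frac{dp_{ij}(t)}{dt} &= -\lambda p_{ij}(t) + \lambda p_{i,j-1}(t),
\end{aligned}$$

with $p_{ij}(0) = \delta_{ij}$. These equations are essentially the same as equations (22)
and (23).
The backward equations are

$$\frac{dp_{ij}(t)}{dt} = -\lambda p_{ij}(t) + \lambda p_{i+1,j}(t).$$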

5.5 Applications
In this section we briefly discuss a number of applications of Markov jump
processes to actuarial modelling. In each case the models can be made time-
homogeneous by insisting that the transition rates are independent of time.
A more detailed discussion of the survival model is postponed to the next
chapters.

5.5.1 Survival model


Consider a two-state model where the two states are alive and dead, i.e.
transition is in one direction only, from the state alive (A) to the state dead
(D) with transition rate $\mu(t)$. This is the survival model and has discrete
state space $S = \{A, D\}$. The transition graph is given in Figure 8.
In actuarial notation, the transition rate $\mu(t)$ is identified with the force
of mortality at age $t$.

Figure 8: Transition graph for the survival model. Reproduced with permis-
sion of the Faculty and Institute of Actuaries.

It is clear that the generator matrix $Q(t)$ is given by

$$Q(t) = \begin{pmatrix} -\mu(t) & \mu(t) \\ 0 & 0 \end{pmatrix}.$$

The Kolmogorov forward equations therefore become

$$\frac{\partial p_{AA}(s, t)}{\partial t} = -\mu(t)\, p_{AA}(s, t),$$

and it is clear that the solution corresponding to the initial condition $p_{AA}(s, s) = 1$
is

$$p_{AA}(s, t) = \exp\left( -\int_s^t \mu(x)\, dx \right).$$

Note that $p_{AA}(s, t)$ is the probability that an individual alive at time (age) $s$
will still be alive at time (age) $t$.
Equivalently, consider the probability that an individual now aged $s$ will
survive until at least age $s + w$, denoted ${}_w p_s$ in the standard mortality notation:

$${}_w p_s = p_{AA}(s, s + w) = \exp\left( -\int_s^{s+w} \mu(x)\, dx \right) = \exp\left( -\int_0^w \mu(s + u)\, du \right).$$
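
When the force of mortality has no convenient antiderivative, the survival probability can be approximated by numerical integration. The sketch below uses the midpoint rule; the function names are hypothetical, and the Gompertz-type force of mortality with parameters `B` and `c` is an assumed example, not a model taken from this module.

```python
from math import exp, log

def survival_probability(force_of_mortality, s, w, n_intervals=1000):
    """Approximate exp(-integral of mu(x) over [s, s + w]) by the midpoint rule."""
    h = w / n_intervals
    integral = h * sum(force_of_mortality(s + (k + 0.5) * h)
                       for k in range(n_intervals))
    return exp(-integral)

def gompertz(x, B=0.0001, c=1.09):
    """Hypothetical Gompertz-type force of mortality mu(x) = B * c**x."""
    return B * c ** x

# Constant force of mortality 0.02: the exact answer is exp(-0.02 * w).
p_constant = survival_probability(lambda x: 0.02, s=40, w=10)

# Age-dependent force of mortality, compared with its closed-form integral.
p_gompertz = survival_probability(gompertz, s=40, w=10)
p_gompertz_exact = exp(-0.0001 * (1.09 ** 50 - 1.09 ** 40) / log(1.09))
```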

5.5.2 Sickness-death model


The survival model can be extended to include the state of health of an
individual. In this so-called sickness-death model, the state of an individual
is described as being healthy (H), sick (S), or dead (D). The discrete state
space is therefore $S = \{H, S, D\}$.
An individual in state H can jump to either state S or state D. Similarly, an
individual in state S can jump to either state H or state D. Time-inhomogeneity
arises through the following age-dependent transition rates:

H → S : $\sigma(t)$
H → D : $\mu(t)$
S → H : $\rho(t)$
S → D : $\nu(t)$

Figure 9: Transition graph for the sickness-death model. Reproduced with
permission of the Faculty and Institute of Actuaries.

The transition graph is given in Figure 9.

The generator matrix is:

$$Q(t) = \begin{pmatrix} -(\sigma(t) + \mu(t)) & \sigma(t) & \mu(t) \\ \rho(t) & -(\rho(t) + \nu(t)) & \nu(t) \\ 0 & 0 & 0 \end{pmatrix}.$$

Under this formulation it is possible to calculate probabilities such as

the probability that an individual who is healthy at time $s$ will still be
healthy at time $t$; or

the probability that an individual who is sick at time $s$ will still be sick
at time $t$.

These are expressed in terms of the residual holding times as

$$P(R_s > t - s \mid X_s = H) = \exp\left( -\int_s^t (\sigma(u) + \mu(u))\, du \right),$$

and

$$P(R_s > t - s \mid X_s = S) = \exp\left( -\int_s^t (\rho(u) + \nu(u))\, du \right),$$

respectively.
We note that transition probabilities can be related to each other. For example,
the probability of a transition from state H at time $s$ to S at time $t$
would be

$$p_{HS}(s, t) = \int_0^{t-s} \exp\left( -\int_s^{s+w} (\sigma(u) + \mu(u))\, du \right) \sigma(s + w)\, p_{SS}(s + w, t)\, dw.$$

This is interpreted as follows: the individual remains in the healthy state from time
$s$ to time $s + w$, then jumps to the state sick at time $s + w$, and is in the
state sick at time $t$. The derivation of this equation is beyond the scope of the course;
however, similar expressions can be written down intuitively.
This sickness-death model can be extended to include the length of time an
individual has been in state S. This leads to the so-called long term care
model where the rate of transition out of state S will depend on the current
holding time in state S.

5.5.3 Marriage model


A further example of a time-inhomogeneous model is the marriage model
under which an individual can be either never married (B), married (M),
divorced (D), widowed (W) or dead ($\Delta$). A Markov jump process can be
formulated on the state space $S = \{B, M, D, W, \Delta\}$.
The transition graph is given in Figure 10, where we can see that the death
rate has been taken to be independent of the marital status for simplicity.

Figure 10: Transition graph for the marriage model. Reproduced with per-
mission of the Faculty and Institute of Actuaries.

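Example 5.6. Question: State an expression for the probability of being married at time $t$ and of having been so for at least $w$, given that you have never been married at time $s$ ($w < t - s$).
Answer: If $C_t$ is the current holding time, we have
$$P[X_t = M, C_t > w \mid X_s = B] = \int_w^{t-s}\Big[p_{BB}(s,t-v)\,\alpha(t-v) + p_{BW}(s,t-v)\,r(t-v) + p_{BD}(s,t-v)\,\rho(t-v)\Big]\exp\left(-\int_{t-v}^{t}(\mu(u)+\nu(u)+d(u))\,du\right)dv,$$
where $v$ is the time spent in state M up to time $t$; $\alpha$, $r$ and $\rho$ denote the rates of entering state M from states B, W and D respectively, and $\mu + \nu + d$ is the total rate of leaving state M.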

This mathematical statement can be read as follows: the individual is in state B at
time $s$, where he either remains until time $t - v$, or jumps to state W or D
by time $t - v$. At time $t - v$ he then jumps to state M and remains there
until time $t$.

References
The following texts were used in the preparation of this chapter and you are
referred there for further reading if required.

Faculty and Institute of Actuaries, CT4 Core Reading;

D. R. Cox & H. D. Miller, The Theory of Stochastic Processes;

S. Ross, Stochastic Processes.

5.6 Summary
Markov jump processes are continuous-time and discrete-state-space stochas-
tic processes satisfying the Markov property. You should be familiar with
the Poisson, survival, sickness-death and marriage models.
The Poisson process is a simple Markov jump process. It is time-homogeneous
with stationary increments that are Poisson distributed with rate $\lambda > 0$ (the increment over an interval of length $t$ has mean $\lambda t$).
Waiting times between jumps are exponentially distributed with mean $1/\lambda$.
As with Markov chains, transition probabilities exist for a general Markov
jump process
$$p_{ij}(s,t) = P[X_t = j \mid X_s = i], \quad\text{where } p_{ij}(s,t) \ge 0 \text{ and } s < t,$$
which must also satisfy the Chapman-Kolmogorov equations.
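The quantities $q_{jj}(t)$, $q_{kj}(t)$ are the transition rates, defined by
$$q_{jj}(t) := \lim_{h\to 0}\frac{p_{jj}(t,t+h)-1}{h}, \qquad q_{kj}(t) := \lim_{h\to 0}\frac{p_{kj}(t,t+h)}{h} \quad\text{for } k \ne j.$$
Kolmogorov's forward and backward equations are respectively
$$\frac{\partial}{\partial t}p_{ij}(s,t) = \sum_{k\in S}p_{ik}(s,t)\,q_{kj}(t) \quad\text{and}\quad \frac{\partial}{\partial s}p_{ij}(s,t) = -\sum_{k\in S}q_{ik}(s)\,p_{kj}(s,t).$$

These can be written in matrix form as
$$\frac{\partial}{\partial t}P(s,t) = P(s,t)\,Q(t) \quad\text{and}\quad \frac{\partial}{\partial s}P(s,t) = -Q(s)\,P(s,t),$$
where $Q(t)$ is the generator matrix with entries $q_{ij}(t)$.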
In time-homogeneous models the time-dependence of the transition proba-
bilities and transition rates (therefore generator matrices) is removed.
The residual holding time $R_s$ is the random amount of time between time $s$
and the next jump:
$$\{R_s > w, X_s = i\} = \{X_u = i,\ s \le u \le s + w\}.$$
It can be proved that
$$P(R_s > w \mid X_s = i) = \exp\left(\int_s^{s+w} q_{ii}(t)\,dt\right),$$
where we recall that $q_{ii}(t) \le 0$.

Questions

1. Claims are known to follow a Poisson process with a uniform rate of 3 per day.

(a) Calculate the probability that there will be fewer than 1 claim on
a given day.
(b) Estimate the probability that another claim will be reported dur-
ing the next hour. State all assumptions made.
(c) If there have not been any claims for over a week, calculate the
expected time before a new claim occurs.

2. Prove equation (29) which gives the Chapman-Kolmogorov equations for a Markov jump process.

3. For the sickness-death model given in Figure 9, write down an integral expression for $p_{HD}(s, t)$.

4. Let $\{X_t, t \ge 0\}$ be a time-homogeneous Markov process with state
space S = {0, 1} and transition rates q01 = , q10 = .

(a) Write down the generator matrix for this process.
(b) Solve Kolmogorov's forward equations for this Markov jump
process to find all transition probabilities.
(c) Check that the Chapman-Kolmogorov equations hold.
(d) What is the probability that the process will be in state 0 in the
long term? Does it depend on the initial state?

Chapter 6
The two-state Markov survival model

Introduction

This chapter describes the application of the two-state Markov model to
mortality; namely, we will use a two-state Markov jump process to model future
lifetime by considering the two states, alive and dead, and the transition
intensity between these two states.
We will establish a probability density function for this model based on two
variables: the total number of deaths and the observation time for the popula-
tion concerned. We use this probability density function to find the maximum
likelihood estimator for the transition intensity, and develop this estimator's
distribution. We will also provide examples of how the maximum likelihood
estimate can be calculated using empirical data.
The two-state Markov model can easily be extended to cope with multiple
states, something we will explore in chapter 7. Some of the advantages of
the Markov model include:

a complex model can be explained, as the Markov model is intuitively straightforward;

the approach allows the problem to be broken down into component parts which can be separately analysed and modelled;

the mathematics involved with this model is relatively tractable.

6.1 The two-state Markov model


In the MA7414 (Mortality) course you have treated the future lifetime of a
person aged x as a continuous random variable, denoted by Tx . You have
derived relationships between different terms and introduced actuarial nota-
tion:

${}_tq_x := P[T_x \le t]$ is the rate of mortality. It is the probability that a life aged $x$ dies within the next $t$ years.

${}_tp_x := 1 - {}_tq_x$ is the probability that a life aged $x$ survives for the next $t$ years.

$q_x := {}_1q_x$; $p_x := {}_1p_x$.

$\mu_x := \lim_{h\to 0^+} \frac{1}{h} P[T_x \le h]$ is the force of mortality at age $x$, or hazard rate.

Our approach in this section is based on the fact that a life is in one of two
states: alive or dead, and can only move from the state of being alive to the
state of being dead.
$$\text{Alive} \xrightarrow{\;\mu_x\;} \text{Dead}$$
Here $x$ represents the age of the life. In other words, our model is a Markov
jump process with just 2 states, and the hazard rate $\mu_x$ is the transition
intensity from state Alive to state Dead.
The two-state Markov model makes the following assumptions:

1. The Markov assumption: The probability that a life at any given age will be found in either state at any subsequent age depends only on the ages involved and on the state currently occupied.

2. For a short time interval $dt$:

$${}_{dt}q_{x+t} = \mu_{x+t}\,dt + o(dt)$$

for $t > 0$, where $o(dt)$ is a small correction term.

3. $\mu_{x+t}$ is constant for positive integers $x$ and $0 \le t < 1$.

We discuss each of these assumptions in more detail below.

The Markov assumption

The Markov assumption means that the only information that affects the
future lifetime of a life is its current age and state. Therefore, we ignore
any previous or current medical conditions or lifestyle factors, such as
smoking or exercise, and treat every life aged $x$ as identical. The
Markov assumption also means that the probability of survival after age $x + t$
is independent of the history of the life up to age $x + t$.

When studying mortality for a heterogeneous group, our results will reveal
only the average result for the group. Therefore the Markov assumption is
clearly a simplification.
In reality, the population could be split into homogeneous groups taking into
account characteristics thought to have a significant effect on mortality. A
separate model could then be developed for each group. For example, we
could model mortality for a group of 40 year old males who smoke and rarely
exercise. However, this would reduce the size of the populations studied for
each model.

${}_{dt}q_{x+t} = \mu_{x+t}\,dt + o(dt)$ for $t > 0$

$\mu_{x+t}$ is an age-dependent transition intensity between the two states.
The correction term $o(dt)$ allows for the fact that the transition intensity will
change from $\mu_{x+t}$ to $\mu_{x+t+dt}$ over the period $dt$. By definition, the small
correction term $o(dt)$ is such that $\lim_{dt\to 0^+} \frac{o(dt)}{dt} = 0$.

$\mu_{x+t}$ is constant for positive integers $x$ and $0 \le t < 1$

The third assumption means that the transition intensity is a step function,
with discrete steps at each birthday. This assumption is made to simplify
the model. In reality, the transition intensity is continuous, i.e. you are not
suddenly more likely to die aged 41 than you were aged 40 and 364 days. On
making this assumption, the most accurate value to use for the constant
intensity between ages $x$ and $x + 1$ is $\mu_{x+0.5}$.

6.2 Developing probabilities


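In section 5.5.1 we derived the relationship:
$${}_tp_x = \exp\left(-\int_0^t \mu_{x+s}\,ds\right).$$

Let us show how to derive this relationship directly from our three assumptions
above.
By definition:
$$\frac{\partial}{\partial t}\,{}_tp_x = \lim_{dt\to 0^+}\frac{{}_{t+dt}p_x - {}_tp_x}{dt}.$$

Since the Markov assumption means that the probability of survival after
age $x + t$ is independent of the history of the life up to age $x + t$, we have:
$${}_{t+dt}p_x = {}_tp_x \cdot {}_{dt}p_{x+t}.$$
So
$$\frac{\partial}{\partial t}\,{}_tp_x = \lim_{dt\to 0^+}\frac{{}_tp_x \cdot {}_{dt}p_{x+t} - {}_tp_x}{dt}.$$

Then, by assumption 2:
$$\frac{\partial}{\partial t}\,{}_tp_x = \lim_{dt\to 0^+}\frac{{}_tp_x\,(1 - \mu_{x+t}\,dt + o(dt)) - {}_tp_x}{dt}.$$
Hence, since $\lim_{dt\to 0^+}\frac{o(dt)}{dt} = 0$ by definition of the small correction term,
$$\frac{\partial}{\partial t}\,{}_tp_x = -{}_tp_x\,\mu_{x+t}.$$
Hence, by integration (using ${}_0p_x = 1$), we obtain our required result, i.e. ${}_tp_x = \exp\left(-\int_0^t \mu_{x+s}\,ds\right)$.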
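The derivation above can be checked numerically. The sketch below compares the integral formula with the result of repeatedly applying the one-step relationship for a small step dt, dropping the o(dt) term. The Gompertz-style `hazard` function and the choice of age 60 over 10 years are hypothetical and serve only as an illustration.

```python
import math

def hazard(age):
    """Hypothetical Gompertz-style force of mortality (an assumption for this sketch)."""
    return 0.0005 * math.exp(0.09 * (age - 30.0))

def survival_closed_form(x, t, steps=20_000):
    """exp(-integral_0^t mu_{x+s} ds), with the integral taken by the midpoint rule."""
    h = t / steps
    integral = sum(hazard(x + (k + 0.5) * h) for k in range(steps)) * h
    return math.exp(-integral)

def survival_by_stepping(x, t, steps=20_000):
    """Repeatedly apply p(t + dt) = p(t) * (1 - mu_{x+t} dt), ignoring o(dt)."""
    dt = t / steps
    p = 1.0
    for k in range(steps):
        p *= 1.0 - hazard(x + k * dt) * dt
    return p

p_formula = survival_closed_form(60.0, 10.0)
p_stepped = survival_by_stepping(60.0, 10.0)
print(p_formula, p_stepped)
```

The two values agree more closely as the number of steps increases, which mirrors the limit dt → 0 in the derivation.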

6.3 Developing the two-state model


Suppose we observe a total of $N$ identical¹ lives, whilst aged $x$ last birthday².
The $N$ lives are not all necessarily observed for the entire year or at the
same time.
Let us introduce some terminology so that we can develop a joint probability
distribution, based on the total number of deaths and observation time for
the population concerned. This probability distribution will include reference
to the transition intensity, $\mu_x$, and will enable us to use empirical data from
actual observations to find the maximum likelihood estimate of $\mu_x$. We will
develop our model for the age band between turning x and x + 1, as, under
the third assumption, the transition intensity in question is assumed constant
during this period.
Definition 6.1. Consider life $i$ in an observation of $N$ lives, where $1 \le i \le N$. Then let:
$x + a_i$ denote the exact age of life $i$ when observation of this life commences;
$x + b_i$ denote the exact age of life $i$ when observation of this life must cease,
if it survives to this time.

¹ In theory our lives should be identical. However, this is impossible to achieve, and
even creating near identical groups may reduce the population down to too small a group
for accurate models to be built. Instead we use identical to refer to the fact that the lives
are from an acceptably homogeneous group; where the group has been segregated based
on the characteristics of each life thought to have a significant effect on mortality.
² Aged $x$ last birthday means the life's age falls between, and including, their birthday
when they turn $x$ to, and excluding, their birthday when they turn $x + 1$.
Clearly $0 \le a_i < b_i \le 1$, since our observation is only for the period whilst
life $i$ is aged $x$ last birthday.
Definition 6.2. Define $D_i$ as the random variable which indicates whether
or not life $i$ is observed to die during the observation.
Let $D_i = 1$ if life $i$ is observed to die during the observation, and 0 otherwise.
Definition 6.3. Define the waiting time, denoted by $V_i$, to be the actual
time of the observation for life $i$. Hence $0 < V_i \le b_i - a_i$.

Hence $V_i$ is equal to the age when the observation of life $i$ actually ceases less
the age when the observation of life $i$ starts.
$V_i$ is a random variable whose distribution is a mixture of discrete and continuous
parts: it is continuous for $0 < V_i < b_i - a_i$, and discrete at $V_i = b_i - a_i$
because the observation will definitely cease at this point, if the life has not
already died at an earlier point. We refer to this end point, $b_i - a_i$, as having
a probability mass, i.e. there is a positive probability that the random variable $V_i$
will be exactly equal to $b_i - a_i$.

From the above definitions we can deduce the following:

$0 < V_i \le b_i - a_i \le 1$, since the values of $a_i$ and $b_i$ set upper and lower limits for the length of observation of life $i$.

$D_i = 0 \Rightarrow V_i = b_i - a_i$, i.e. if no death is observed then our observation continues from age $x + a_i$ to age $x + b_i$.

$D_i = 1 \Rightarrow 0 < V_i < b_i - a_i$, i.e. if a death is observed then it must occur during the observation period.

Example 6.1. Let us consider an investigation into the mortality of a group
of 60 year old male smokers with lung cancer. Imagine that Bill, aged 60
years and 3 months, agrees to take part in this investigation until he reaches
age 61. Then:
$x = 60$, $a_i = 0.25$ and $b_i = 1$.
Bill then dies aged 60 and 8 months. Hence:
$D_i = 1$ and $V_i = 60\tfrac{8}{12} - 60\tfrac{3}{12} = \tfrac{5}{12}$.
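The bookkeeping in this example can be written as a small helper. This is only a sketch: the function name `observe` and its argument convention (all times given as offsets in years from exact age x) are assumptions made for illustration.

```python
from fractions import Fraction

def observe(a, b, death=None):
    """Return (d_i, v_i) for one life.

    a, b  : offsets from exact age x at which observation starts / must cease
    death : offset from age x at which the life dies, or None if no death
            occurs while the life is under observation
    """
    if not (0 <= a < b <= 1):
        raise ValueError("need 0 <= a < b <= 1")
    if death is not None and a < death < b:
        return 1, death - a      # death observed: waiting time ends at death
    return 0, b - a              # survived to the end of the observation

# Bill: observed from 60 and 3 months until 61, dies at 60 and 8 months.
bill = observe(Fraction(3, 12), Fraction(1), death=Fraction(8, 12))
print(bill)
```

Had Bill survived to age 61, the same call without the `death` argument would return a death indicator of 0 and a waiting time of 9 months.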
Definition 6.4. Let $f_i(d_i, v_i)$ be the joint distribution of $(D_i, V_i)$, where we
use $(d_i, v_i)$ to represent a single observation, a sample from $(D_i, V_i)$.
Then, we can represent $f_i(d_i, v_i)$ in terms of $\mu_x$ using the following logic:

Consider the case where $d_i = 0$, i.e. life $i$ is not observed to die. Then
$v_i = b_i - a_i$ and the probability of this is ${}_{b_i-a_i}p_{x+a_i}$.
Consider the case where $d_i = 1$, i.e. life $i$ dies. Then this life has:

survived from age $x + a_i$ to age $x + a_i + v_i$, the probability of this being ${}_{v_i}p_{x+a_i}$, then

died at age $x + a_i + v_i$, the probability of which over a short interval $dt$ is, by the second assumption, approximately equal to $\mu_{x+a_i+v_i}\,dt$.

Hence,
$$f_i(d_i, v_i) = \begin{cases} {}_{b_i-a_i}p_{x+a_i} & (d_i = 0) \\ {}_{v_i}p_{x+a_i}\cdot\mu_{x+a_i+v_i} & (d_i = 1) \end{cases}$$
Applying the relationship ${}_tp_x = \exp\left(-\int_0^t \mu_{x+s}\,ds\right)$, which we derived earlier, we obtain the relationship:
$$f_i(d_i, v_i) = \exp\left(-\int_0^{v_i}\mu_{x+a_i+t}\,dt\right)\cdot(\mu_{x+a_i+v_i})^{d_i}$$
And since we are only considering $x$ to be a positive integer, and the third
assumption above states that $\mu_{x+t}$ is constant for $0 \le t < 1$, then if we
denote $\mu_{x+t}$ by $\mu$:
$$f_i(d_i, v_i) = e^{-\mu v_i}\cdot\mu^{d_i}$$

Definition 6.5. Consider an observation of $N$ independent lives. Let:
$$D = \sum_{i=1}^{N} D_i \quad\text{and}\quad V = \sum_{i=1}^{N} V_i.$$

Let $(d, v)$ denote a sample drawn from the distribution $(D, V)$ and let $f(d, v)$
denote the joint probability function for $(d, v)$.
Note: The observed waiting time, denoted by $v$, can also be referred to as
the central exposed to risk.
Our representation of the joint probability function $f_i(d_i, v_i)$ can be extended
to derive $f(d, v)$:
$$f(d, v) = \prod_{i=1}^{N} e^{-\mu v_i}\cdot\mu^{d_i} = e^{-\mu v}\cdot\mu^{d}, \quad\text{where } d = \sum_{i=1}^{N} d_i \text{ and } v = \sum_{i=1}^{N} v_i.$$

Assumption 3 states that $\mu$ is constant between birthdays, hence the probability
distribution function for $D_i$ and the probability density function for
$V_i$ can be represented by:

Probability distribution function for $D_i$:
$$P[D_i = 0] = {}_{b_i-a_i}p_{x+a_i} = e^{-\mu(b_i-a_i)}$$
$$P[D_i = 1] = {}_{b_i-a_i}q_{x+a_i} = 1 - {}_{b_i-a_i}p_{x+a_i} = 1 - e^{-\mu(b_i-a_i)}$$

The probability density function for $V_i$:
$$f(v_i) = \begin{cases} {}_{v_i}p_{x+a_i}\cdot\mu_{x+a_i+v_i} & 0 < v_i < b_i - a_i \\ {}_{b_i-a_i}p_{x+a_i} & v_i = b_i - a_i \end{cases}$$
which simplifies to:
$$f(v_i) = \begin{cases} e^{-\mu v_i}\cdot\mu & 0 < v_i < b_i - a_i \\ e^{-\mu(b_i-a_i)} & v_i = b_i - a_i \end{cases} \tag{40}$$

Example 6.2. Investigation has shown that $\mu_{60} = 0.01$ for a certain population.
Consider a member of this population currently aged $60\tfrac{1}{2}$. Find the
probability distribution function for observing this individual's death before
age 61 and the probability density function for the waiting period of this
observation.
Using notation consistent with that above, denote this life by $i$. Then:
$\mu = 0.01$, $a_i = 0.5$ and $b_i = 1$.
Hence, the probability distribution function for $D_i$ is:
$$P[D_i = 0] = e^{-\mu(b_i-a_i)} = e^{-0.01\times 0.5} = 0.995$$
$$P[D_i = 1] = 1 - e^{-\mu(b_i-a_i)} = 0.005$$
and the probability density function for $V_i$ is:
$$f(v_i) = \begin{cases} 0.01\,e^{-0.01 v_i} & 0 < v_i < 0.5 \\ e^{-0.005} & v_i = 0.5 \end{cases}$$
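The figures in Example 6.2 can be reproduced with a few lines of code. This is a minimal sketch under the constant-intensity assumption; the function names `death_indicator_pmf` and `waiting_time_density` are hypothetical helpers introduced only for this illustration.

```python
import math

def death_indicator_pmf(mu, a, b):
    """P[D_i = 0] and P[D_i = 1] for a life observed from x + a to x + b."""
    p_survive = math.exp(-mu * (b - a))
    return {0: p_survive, 1: 1.0 - p_survive}

def waiting_time_density(mu, a, b, v):
    """Density of V_i on 0 < v < b - a, and probability mass at v = b - a."""
    if 0 < v < b - a:
        return mu * math.exp(-mu * v)
    if v == b - a:
        return math.exp(-mu * (b - a))
    return 0.0

pmf = death_indicator_pmf(mu=0.01, a=0.5, b=1.0)
mass_at_end = waiting_time_density(0.01, 0.5, 1.0, 0.5)
print(round(pmf[0], 3), round(pmf[1], 3))
```

Note that the probability mass of the waiting time at its end point equals the probability that no death is observed, as the two events coincide.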

6.4 The maximum likelihood estimator


Following our result in the previous section, we can establish that the likelihood
function for $\mu$, based on an observation $(d, v)$, is:
$$L(\mu; d, v) = e^{-\mu v}\mu^{d}$$
From this likelihood function we can derive our maximum likelihood estimate
(MLE) for $\mu$, denoted by $\hat{\mu}$, to be:
$$\hat{\mu} = \frac{d}{v}$$

The maximum likelihood estimator, which is a random variable and is denoted by $\tilde{\mu}$,
is therefore:
$$\tilde{\mu} = \frac{D}{V}.$$
As usual, this MLE is established by taking logs and differentiating with
respect to $\mu$ to find the maximum point.
Our MLE here is what we would expect intuitively, i.e. we estimate the
hazard rate as total number of deaths divided by the total amount of time
for which lives are observed.
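As an informal check of the estimate, the sketch below compares d/v with the value that maximises the log-likelihood over a grid of candidate intensities. The figures (32 deaths over 78 units of waiting time) and the grid spacing are illustrative assumptions.

```python
import math

def log_likelihood(mu, d, v):
    """Log of L(mu; d, v) = exp(-mu * v) * mu**d."""
    return -mu * v + d * math.log(mu)

def mle(d, v):
    """Deaths divided by total observed waiting time."""
    return d / v

# Illustrative data: 32 deaths over 78 units of observed waiting time.
d, v = 32, 78.0
mu_hat = mle(d, v)

# Brute-force search over candidate values 0.0001, 0.0002, ..., 2.0.
grid = [k / 10_000 for k in range(1, 20_001)]
best_on_grid = max(grid, key=lambda m: log_likelihood(m, d, v))
print(mu_hat, best_on_grid)
```

The grid search is deliberately naive; it is there only to confirm that the closed-form estimate sits at the maximum of the likelihood.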

6.5 The distribution of $\tilde{\mu}$

In this section we will show that the distribution of the random variable $\tilde{\mu}$
is asymptotically normal, with mean $\mu$ and variance $\frac{\mu}{E[V]}$, i.e. $\tilde{\mu} \sim N\left(\mu, \frac{\mu}{E[V]}\right)$.
In order to derive this result, we need to first consider $E[V]$ and $E[D]$. Let
us start, as we did before, by considering just one life, life $i$, with random
variable $(D_i, V_i)$. Recall that the probability density function (p.d.f.) for the
sample $(d_i, v_i)$ is:
$$f_i(d_i, v_i) = \exp\left(-\int_0^{v_i}\mu_{x+a_i+t}\,dt\right)\cdot(\mu_{x+a_i+v_i})^{d_i} \quad\text{for } 0 < v_i \le b_i - a_i,\ d_i = 0, 1.$$
We now need to consider the distribution of $D_i - \mu V_i$. This distribution has
mean zero and variance $E[D_i]$. That is:

1. $E[D_i - \mu V_i] = 0$

2. $Var[D_i - \mu V_i] = E[D_i]$

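Let us prove the first of these relationships. Since a p.d.f. integrates (or sums) to
1 over its support, the relation (40) gives us the following:
$$\int_0^{b_i-a_i}\mu e^{-\mu t}\,dt + e^{-\mu(b_i-a_i)} = 1$$

Differentiating this relationship with respect to $\mu$:
$$\int_0^{b_i-a_i} e^{-\mu t}\,dt - \int_0^{b_i-a_i}\mu t\,e^{-\mu t}\,dt - (b_i-a_i)e^{-\mu(b_i-a_i)} = 0$$

and multiplying through by $\mu$ leads to:
$$\int_0^{b_i-a_i}\mu e^{-\mu t}\,dt - \mu\left[\int_0^{b_i-a_i} t\,\mu e^{-\mu t}\,dt + (b_i-a_i)e^{-\mu(b_i-a_i)}\right] = 0 \quad\text{(Equation A)}$$
Now,
$$E[D_i] = 0\cdot P[D_i = 0] + 1\cdot P[D_i = 1] = \int_0^{b_i-a_i}\mu e^{-\mu t}\,dt$$
and
$$E[V_i] = \int_0^{b_i-a_i} t\,\mu e^{-\mu t}\,dt + (b_i-a_i)e^{-\mu(b_i-a_i)},$$

hence Equation A states that $E[D_i] - \mu E[V_i] = 0$, i.e. $E[D_i - \mu V_i] = 0$.

We will not prove the result for the variance of $D_i - \mu V_i$ here.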
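Now extend this to consider the distribution of $\frac{1}{N}(D - \mu V)$. (Recall, $D = \sum_{i=1}^{N} D_i$
and $V = \sum_{i=1}^{N} V_i$.)

We can use the following facts:

1. the central limit theorem, which states that the distribution of the mean of a large number of independent observations approaches a normal distribution,

2. $E[D_i - \mu V_i] = 0$ implies $E[\frac{1}{N}(D - \mu V)] = 0$,

3. $Var[D_i - \mu V_i] = E[D_i]$ implies:

$$Var\left[\tfrac{1}{N}(D - \mu V)\right] = \tfrac{1}{N^2}Var[D - \mu V] = \tfrac{1}{N^2}E[D]$$
to deduce that, asymptotically:
$$\tfrac{1}{N}(D - \mu V) \sim N\left(0, \frac{E[D]}{N^2}\right)$$
Now consider
$$\tilde{\mu} - \mu = \frac{D}{V} - \mu = \frac{D - \mu V}{V} = \frac{N}{V}\cdot\frac{1}{N}(D - \mu V).$$
Since $\lim_{N\to\infty}\frac{V}{N} = E[V_i]$ and $E[V] = N E[V_i]$, then asymptotically:
$$\tilde{\mu} - \mu \sim \frac{N}{E[V]}\,N\left(0, \frac{E[D]}{N^2}\right) = N\left(0, \frac{E[D]}{E[V]^2}\right)$$

Since $E[D] = \mu E[V]$ (from 2 above), we conclude that $\tilde{\mu} \sim N\left(\mu, \frac{\mu}{E[V]}\right)$.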
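The asymptotic result can be explored by simulation. The sketch below is illustrative only: it assumes every life is observed from exact age x for at most one year (so a_i = 0 and b_i = 1), a constant intensity of 0.4, 1,000 lives per replication and 400 replications, none of which come from the module.

```python
import math
import random

def simulate_mu_hat(mu, n_lives, max_obs, rng):
    """One replication: returns D / V for n_lives each observed for at most max_obs."""
    deaths, waiting = 0, 0.0
    for _ in range(n_lives):
        lifetime = rng.expovariate(mu)   # constant hazard gives an exponential lifetime
        if lifetime < max_obs:
            deaths += 1
            waiting += lifetime
        else:
            waiting += max_obs           # censored at the end of the observation
    return deaths / waiting

rng = random.Random(2014)                # fixed seed so the run is repeatable
mu, n_lives, max_obs = 0.4, 1000, 1.0
estimates = [simulate_mu_hat(mu, n_lives, max_obs, rng) for _ in range(400)]

sample_mean = sum(estimates) / len(estimates)
sample_var = sum((e - sample_mean) ** 2 for e in estimates) / (len(estimates) - 1)

expected_v = n_lives * (1.0 - math.exp(-mu * max_obs)) / mu   # E[V] = N * E[V_i]
asymptotic_var = mu / expected_v
print(sample_mean, sample_var, asymptotic_var)
```

With these settings the sample mean of the estimates should be close to the true intensity and the sample variance of the same order as the asymptotic variance, although individual runs will differ.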

Example 6.3. As part of a medical experiment you observe a total of 1,000
mice, infected with a virus, for 1 month. Your hypothesis is that the hazard
rate is 0.4. Calculate the following:
1. the expected total waiting time.

2. the expected number of deaths.

You actually observe 32 deaths and a total waiting time of 78 years.

3. Discuss whether or not your initial hypothesis appears valid.

1. Expected total waiting time:
$$E[V] = N E[V_i] = 1{,}000\left(\int_0^{1/12} t\,e^{-0.4t}\,0.4\,dt + \tfrac{1}{12}e^{-0.4\cdot\frac{1}{12}}\right).$$
Then, using integration by parts, $E[V] = 1{,}000\cdot\frac{1}{0.4}\left(1 - e^{-0.4\cdot\frac{1}{12}}\right) = 81.96$.

2. Expected number of deaths:
$$E[D] = 1{,}000\left(\int_0^{1/12} e^{-0.4t}\,0.4\,dt\right) = 1{,}000\left(1 - e^{-0.4\cdot\frac{1}{12}}\right) = 32.78$$

3. Based on the observations, our maximum likelihood estimate of $\mu$ is:
$$\hat{\mu} = \frac{32}{78} = 0.4103$$
Based on the hypothesis that $\mu = 0.4$, the asymptotic distribution of
$\tilde{\mu} = \frac{D}{V}$ for this observation is:
$$N\left(0.4, \frac{0.4}{81.96}\right) = N(0.4, 0.0699^2)$$

The probability of observing a value in excess of 0.4103 is:
$$P[\tilde{\mu} > 0.4103] = P\left[Z > \frac{0.4103 - 0.4}{0.0699}\right] = 1 - \Phi(0.1474) = 1 - 0.5586 = 0.44.$$

Hence the hypothesis appears acceptable based on this observation.
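The arithmetic in Example 6.3 can be reproduced as follows. This is a sketch under the assumptions that the hazard rate of 0.4 is an annual rate, that each mouse is observed for exactly one twelfth of a year unless it dies, and that the observed waiting time of 78 is measured in the same unit of years; the helper `std_normal_cdf` is written from the error function rather than taken from any library the module uses.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu0, n_mice, period = 0.4, 1000, 1.0 / 12.0   # annual hazard, one month of observation
p_death = 1.0 - math.exp(-mu0 * period)       # probability a mouse dies while observed

expected_deaths = n_mice * p_death            # E[D]
expected_waiting = n_mice * p_death / mu0     # E[V] = E[D] / mu

mu_hat = 32 / 78                              # observed deaths / observed waiting time
std_dev = math.sqrt(mu0 / expected_waiting)   # asymptotic standard deviation
z = (mu_hat - mu0) / std_dev
p_value = 1.0 - std_normal_cdf(z)
print(expected_waiting, expected_deaths, mu_hat, std_dev, p_value)
```

Small differences from the figures quoted in the example arise because the example rounds intermediate values before continuing.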

6.6 Summary
The two-state Markov model treats a life as being in one of two states: alive
or dead. It relies on the validity of three assumptions:

1. The Markov assumption: The probability that a life at any given age will be found in either state at any subsequent age depends only on the ages involved and on the state currently occupied.

2. For a short time interval $dt$:

$${}_{dt}q_{x+t} = \mu_{x+t}\,dt + o(dt)$$

for $t > 0$, where $o(dt)$ is a small correction term.

3. $\mu_{x+t}$ is constant for $x$ an integer and $0 \le t < 1$.

The Markov approach is useful in that it can easily be extended to cope with
more than one state and/or transition intensity.
For a life $i$, we recorded our observation of the two-state model by using the
terms $a_i$, $b_i$, $d_i$, $v_i$. We then summarised our observation of a population
of size $N$ using the notation $d = \sum_{i=1}^{N} d_i$ and $v = \sum_{i=1}^{N} v_i$.

The distribution function of an observation, $(d_i, v_i)$, is:
$$f_i(d_i, v_i) = e^{-\mu v_i}\cdot\mu^{d_i}$$
The joint distribution function of an observation, $(d, v)$, is:
$$f(d, v) = e^{-\mu v}\cdot\mu^{d}$$
The maximum likelihood estimator (MLE) for $\mu$ is $\tilde{\mu} = \frac{D}{V}$, which is asymptotically
normally distributed, with mean $\mu$ and variance $\frac{\mu}{E[V]}$, i.e.
$\tilde{\mu} \sim N\left(\mu, \frac{\mu}{E[V]}\right)$.
Models of lifetimes can be extended past modelling human mortality. We will
introduce some of the more varied uses for these models, including medical
research and engineering, in the next chapter.

Questions

1. The two-state Markov model involves a one-directional, single, age-dependent transition intensity. Explain what this means.

2. State the assumptions underlying the two-state Markov model.

3. Discuss the Markov assumption in the context of heterogeneity. What model have we met which addresses heterogeneity?

4. Using the terminology introduced in this chapter, discuss the possible values of $V_i$ if:

(a) $D_i = 0$
(b) $D_i = 1$

5. A large actuarial firm is investigating its level of attrition³ post-qualification. We examine the employment records for the first year after
qualification. Fiona joined the firm 3 months after qualifying. She
stayed with the firm for 8 months, before being head-hunted by one of
the firm's competitors. A two-state Markov model will be used for this
investigation, with $x :=$ time since qualification. What are the following
values for Fiona:

(a) ai
(b) bi
(c) Vi

Discuss the potential heterogeneity within this investigation.

6. Using the result:

$$f_i(d_i, v_i) = e^{-\mu v_i}\cdot\mu^{d_i}$$

and assuming that all lives are independent, state the joint probability
function for $(D, V)$, where $D = \sum_{i=1}^{N} D_i$ and $V = \sum_{i=1}^{N} V_i$.

³ Attrition is the percentage of employees leaving a firm over a defined period, usually
a year.

7. A university is assessing its drop out rates for the first year of a three
year degree course, and believes $\mu = 0.15$. At the end of the first term,
there are 50 students on the course. Assuming that all three terms are of
equal length, each student's decision to drop out of a course is independent, and
courses run back to back, determine the following:

(a) the probability distribution function for an individual on this
course leaving before the end of the first year;
(b) the probability distribution function for the remaining time an
individual spends on the course for the remainder of the first year;
(c) the joint probability function for all the students on the course.

8. Derive the maximum likelihood estimate, $\hat{\mu} = \frac{d}{v}$, from the likelihood function for $\mu$, $L(\mu; d, v) = e^{-\mu v}\mu^{d}$.

9. Based on a mortality investigation, we estimate $\mu_{40} = 0.00102$ and $\mu_{41} = 0.00112$. Assuming the assumptions underlying the two-state
model apply, calculate the probability a life aged 40 survives for a year.

10. An investigation took place into the mortality of males aged between
60 and 61 years suffering from angina. The table below summarises the
results of this investigation. For each person it gives the age at which
observation of this life began (Start), the age at which observation
ceased (End) and the reason for it ceasing (Reason), i.e. D = observation
ceased due to death, W = withdrew from observation for a reason other
than death.

Life   Start      End        Reason
1      60 1/12    60 5/12    D
2      60 4/12    60 11/12   W
3      60         61         W
4      60 6/12    61         W
5      60         60 2/12    D
6      60 11/12   61         W
7      60 4/12    60 10/12   W
8      60         60 3/12    D
9      60         61         W
10     60 1/12    60 6/12    W

Estimate $q_{60}$ using a two-state Markov model, stating any assumptions made.

Chapter 7
The multiple-state Markov model

Introduction
In the previous chapter, we introduced the two-state Markov model. In this
chapter we extend this model to consider scenarios where more than two states
exist and/or where movement between states is two-directional. Examples
of where such models can be used include:
Income protection (IP): This type of insurance plan is designed to provide
a replacement income when your salary stops if you are unable to
work due to illness. The diagram below illustrates the relevant states for
IP, and the possible transitions between these states.

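[Transition diagram: able → ill with intensity $\sigma_x$; ill → able with intensity $\rho_x$; able → dead with intensity $\mu_x$; ill → dead with intensity $\nu_x$.]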

Cancer patients: The states to consider for cancer patients can vary
depending on the research being carried out. At their simplest, four states
can be considered: able, ill, remission and dead. On a more complex
level, states can be constructed so as to take into account more complex
detail such as family history and genetic testing.

Extending the two-state Markov model


The two-state Markov model was based on three underlying assumptions.
We can extend these assumptions to enable us to develop a multiple-state
model. This simply involves adapting the second and third assumption to
allow for transition between more than one state and the respective multiple
transition intensities.
Before we specify these assumptions it is necessary to introduce some termi-
nology to cope with more than two states and one transition intensity.
Definition 7.1. Let $g$ and $h$ denote any two distinct states in the process
we are trying to model. Then let:

$\mu_x^{gh}$ denote the transition intensity from state $g$ to state $h$ at age $x$.
${}_tp_x^{gh} = P[\text{In state } h \text{ at age } x + t \mid \text{In state } g \text{ at age } x]$, i.e. the probability that
a life in state $g$ at age $x$ will be in state $h$ at age $x + t$.
${}_tp_x^{gg} = P[\text{In state } g \text{ at age } x + t \mid \text{In state } g \text{ at age } x]$, i.e. the probability that
a life in state $g$ at age $x$ will be in state $g$ at age $x + t$.
${}_tp_x^{\overline{gg}} = P[\text{In state } g \text{ from age } x \text{ to age } x + t \mid \text{In state } g \text{ at age } x]$, i.e. the
probability that a life in state $g$ at age $x$ will remain in state $g$ until age
$x + t$.
The difference between ${}_tp_x^{gg}$ and ${}_tp_x^{\overline{gg}}$ may not be immediately clear. For ${}_tp_x^{gg}$
we are only concerned with the probability that a life is in state $g$ at age $x + t$ given
it is in state $g$ at age $x$. However, we are not concerned about the state the
life has been in between these ages, i.e. it could have left state $g$ and returned
to it during the period. For ${}_tp_x^{\overline{gg}}$ we are considering the probability that a life
aged $x$, and in state $g$, remains in state $g$ until age $x + t$. It is possible that
for some models ${}_tp_x^{gg}$ and ${}_tp_x^{\overline{gg}}$ are equal; however, this is only the case when
either you cannot re-enter the state once you have left it or you are unable
to leave the state. Let us illustrate this by considering two examples.
Example 7.1. Let us consider the a=able, i=ill, d=dead states. Then ${}_tp_x^{aa}$
is not equal to ${}_tp_x^{\overline{aa}}$ since it is possible that the life becomes ill and recovers
between ages $x$ and $x + t$.
Example 7.2. Let us consider a model for cancer sufferers, where a=able,
i=ill, r=remission, c=recovered and d=dead.

[Transition diagram for the states able, ill, remission, recovered and dead; no transition leads back into the able state.]

Then ${}_tp_x^{aa}$ is equal to ${}_tp_x^{\overline{aa}}$, since you cannot re-enter this state after leaving
it.
Let us now state the three assumptions that are the foundation of the
multiple-state Markov model.

1. The Markov assumption: The probability that a life at any given age will be found in a given state at any subsequent age depends only on the ages involved and on the state currently occupied.

2. For any two distinct states, $g$ and $h$, over a short time interval $dt$:
$${}_{dt}p_{x+t}^{gh} = \mu_{x+t}^{gh}\,dt + o(dt) \quad\text{for } t > 0$$

where $o(dt)$ is a small correction term, which allows for the probability
that a life makes any two or more transitions in time period $dt$.

3. For any two distinct states, $g$ and $h$, $\mu_{x+t}^{gh}$ is constant for $x$ an integer and $0 \le t < 1$.

7.1 Developing probabilities


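In the notation above, the Kolmogorov equations (see section 5.3) take the
form
$$\frac{\partial}{\partial t}\,{}_tp_x^{gh} = \sum_{j\ne h}\left({}_tp_x^{gj}\,\mu_{x+t}^{jh} - {}_tp_x^{gh}\,\mu_{x+t}^{hj}\right)$$

$$\frac{\partial}{\partial t}\,{}_tp_x^{\overline{gg}} = -{}_tp_x^{\overline{gg}}\sum_{j\ne g}\mu_{x+t}^{gj}$$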

Example 7.3. Let us prove the equation
$$\frac{\partial}{\partial t}\,{}_tp_x^{gh} = \sum_{j\ne h}\left({}_tp_x^{gj}\,\mu_{x+t}^{jh} - {}_tp_x^{gh}\,\mu_{x+t}^{hj}\right)$$
directly.
Start with the definition of partial differentiation:
$$\frac{\partial}{\partial t}\,{}_tp_x^{gh} = \lim_{dt\to 0^+}\frac{{}_{t+dt}p_x^{gh} - {}_tp_x^{gh}}{dt}$$

Now consider ${}_{t+dt}p_x^{gh}$, the probability that a life in state $g$ at age $x$ will be
in state $h$ at age $x + t + dt$. Let us split this probability by considering the
state the life will be in at age $x + t$: it can either be in state $h$, or in some
state other than $h$, i.e.
$${}_{t+dt}p_x^{gh} = {}_tp_x^{gh}\cdot{}_{dt}p_{x+t}^{hh} + \sum_{j\ne h}{}_tp_x^{gj}\cdot{}_{dt}p_{x+t}^{jh}$$

By applying the second assumption listed above for the multiple state model:
$${}_{dt}p_{x+t}^{jh} = \mu_{x+t}^{jh}\,dt + o(dt)$$
and by noting that ${}_{dt}p_{x+t}^{hh} = 1 - \sum_{j\ne h}{}_{dt}p_{x+t}^{hj} = 1 - \sum_{j\ne h}\mu_{x+t}^{hj}\,dt + o(dt)$ we can
derive the relationship:
$${}_{t+dt}p_x^{gh} = {}_tp_x^{gh}\left(1 - \sum_{j\ne h}\mu_{x+t}^{hj}\,dt + o(dt)\right) + \sum_{j\ne h}{}_tp_x^{gj}\left(\mu_{x+t}^{jh}\,dt + o(dt)\right)$$

Hence
$$\frac{\partial}{\partial t}\,{}_tp_x^{gh} = \lim_{dt\to 0^+}\frac{{}_tp_x^{gh}\left(1 - \sum_{j\ne h}\mu_{x+t}^{hj}\,dt + o(dt)\right) + \sum_{j\ne h}{}_tp_x^{gj}\left(\mu_{x+t}^{jh}\,dt + o(dt)\right) - {}_tp_x^{gh}}{dt}$$

which simplifies to give our required relationship.

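Our second relationship,
$$\frac{\partial}{\partial t}\,{}_tp_x^{\overline{gg}} = -{}_tp_x^{\overline{gg}}\sum_{j\ne g}\mu_{x+t}^{gj}$$

can be used to derive ${}_tp_x^{\overline{gg}}$ by dividing through by ${}_tp_x^{\overline{gg}}$ and integrating with
respect to $t$. This yields the following result:
$${}_tp_x^{\overline{gg}} = \exp\left(-\int_0^t\sum_{j\ne g}\mu_{x+s}^{gj}\,ds\right)$$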

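Example 7.4. Consider the three state model with a=able,
i=ill and d=dead. Write down equations for ${}_tp_x^{\overline{ii}}$ and $\frac{\partial}{\partial t}\,{}_tp_x^{ai}$.
$${}_tp_x^{\overline{ii}} = \exp\left(-\int_0^t\left(\mu_{x+s}^{ia} + \mu_{x+s}^{id}\right)ds\right)$$
$$\frac{\partial}{\partial t}\,{}_tp_x^{ai} = {}_tp_x^{aa}\,\mu_{x+t}^{ai} - {}_tp_x^{ai}\,\mu_{x+t}^{ia} + {}_tp_x^{ad}\,\mu_{x+t}^{di} - {}_tp_x^{ai}\,\mu_{x+t}^{id}$$

Since $\mu_{x+t}^{di} = 0$ this simplifies to:
$$\frac{\partial}{\partial t}\,{}_tp_x^{ai} = {}_tp_x^{aa}\,\mu_{x+t}^{ai} - {}_tp_x^{ai}\left(\mu_{x+t}^{ia} + \mu_{x+t}^{id}\right)$$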

7.2 Solving the Kolmogorov equations for a simple multiple state model

Often the Kolmogorov equations are too complex to solve algebraically and
numerical techniques must be used. However, there are some simple cases
where the model is sufficiently tractable to derive the probabilities directly.
Let us consider a simple Markov model with three states, relating to a life as-
surance product: 0=active member paying premiums for life cover, 1=lapsed
member, no longer paying premiums for life cover and 2=dead. Assume the
transition intensities illustrated in the diagrams are constant (and hence not
time-dependent) and that the records terminate when a member moves to
state 1 or 2, i.e. we do not record transition between states 1 and 2.

[Transition diagram: state 0 → state 1 with intensity $\mu^{01}$; state 0 → state 2 with intensity $\mu^{02}$.]

Note that this case is simplified by the fact the transitions are one directional
only and you cannot revisit states.
Then the Kolmogorov equations for the model can be solved by noting the
following:

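${}_tp_x^{00} = 1 - ({}_tp_x^{01} + {}_tp_x^{02})$, since if you are not in state 0 you must be in
either state 1 or state 2.

${}_tp_x^{00} = {}_tp_x^{\overline{00}}$

${}_tp_x^{\overline{00}} = \exp\left(-\int_0^t(\mu^{01} + \mu^{02})\,ds\right)$, from the previous section.

Hence, ${}_tp_x^{00} = e^{-(\mu^{01}+\mu^{02})t}$

Since the transition intensities are constant and transition from states
1 and 2 is not possible, $\frac{{}_tp_x^{01}}{{}_tp_x^{02}} = \frac{\mu^{01}}{\mu^{02}}$

Hence the following equations hold:
$${}_tp_x^{01} = \frac{\mu^{01}}{\mu^{01}+\mu^{02}}\left[1 - e^{-(\mu^{01}+\mu^{02})t}\right]$$

$${}_tp_x^{02} = \frac{\mu^{02}}{\mu^{01}+\mu^{02}}\left[1 - e^{-(\mu^{01}+\mu^{02})t}\right]$$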
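For models that cannot be solved algebraically, the Kolmogorov equations are integrated numerically. As a small illustration, the sketch below applies a simple Euler scheme to this three-state model and compares the result with the closed-form probabilities above. The intensities 0.05 (lapse) and 0.01 (death), the 10-year horizon and the step count are hypothetical values chosen for the example.

```python
import math

def closed_form(mu01, mu02, t):
    """Occupancy probabilities (p00, p01, p02) from the closed-form solution."""
    total = mu01 + mu02
    stay = math.exp(-total * t)
    return stay, mu01 / total * (1.0 - stay), mu02 / total * (1.0 - stay)

def euler_forward(mu01, mu02, t, steps=100_000):
    """Euler integration of the forward equations for this model:
    d/dt p00 = -(mu01 + mu02) * p00,  d/dt p01 = mu01 * p00,  d/dt p02 = mu02 * p00.
    """
    dt = t / steps
    p00, p01, p02 = 1.0, 0.0, 0.0
    for _ in range(steps):
        # The right-hand side uses the values at the start of the step.
        p00, p01, p02 = (p00 - (mu01 + mu02) * p00 * dt,
                         p01 + mu01 * p00 * dt,
                         p02 + mu02 * p00 * dt)
    return p00, p01, p02

exact = closed_form(0.05, 0.01, 10.0)     # hypothetical lapse and death intensities
approx = euler_forward(0.05, 0.01, 10.0)
print(exact)
print(approx)
```

Euler's method is the crudest possible choice and is used here only because it mirrors the dt argument in the derivation; in practice a higher-order scheme would normally be preferred.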

7.3 The maximum likelihood estimators


Based on our third assumption for multiple state Markov models, i.e. $\mu_{x+t}^{gh}$
is constant for $0 \le t < 1$, we can study a group between ages $x$ and $x + 1$
and use our observations to estimate the transition intensities. Let us return
to the illness-death model considered earlier.

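[Transition diagram: able → ill with intensity $\sigma_x$; ill → able with intensity $\rho_x$; able → dead with intensity $\mu_x$; ill → dead with intensity $\nu_x$.]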

Definition 7.2. Let us introduce some terminology describing the parameters
of the illness-death model. Consider observing a life, call it life $i$, from
a given population. Then the following statistics are of interest for this life:
$V_i$ = Waiting time of life $i$ in the able state
$W_i$ = Waiting time of life $i$ in the ill state
$S_i$ = Number of transitions able to ill by life $i$
$R_i$ = Number of transitions ill to able by life $i$
$D_i$ = Number of transitions able to dead by life $i$
$U_i$ = Number of transitions ill to dead by life $i$
Define the total waiting time in the able state as $V = \sum_{i=1}^{N} V_i$, where $N$ is
the number of observed lives, and define the totals for the other parameters
above in a consistent manner.
Consistent with the two-state Markov model, use lower case symbols to denote
the observed sample values for the parameters.

Let us make some observations about the terms we have defined above and
then use an example to reinforce them.
Life $i$ may enter and leave both the able and ill state several times during
the period. $V_i$ and $W_i$ represent the total waiting time spent in the able
and ill state respectively. Since we are assuming the transition intensities are
constant, this is an acceptable approach.
At the end of the observation the life has either died or not died, hence
$D_i + U_i$ is either 0 or 1. Each of $D_i$ and $U_i$ can be either 0 or 1, but
they cannot both be 1, as you die either from the able state or from the ill state, but not
both.
Definition 7.3. Let us use the following notation for transition intensities of the illness-death model:
Let $\mu$ denote the transition intensity $\mu^{ad}$.
Let $\sigma$ denote the transition intensity $\mu^{ai}$.
Let $\rho$ denote the transition intensity $\mu^{ia}$.
Let $\nu$ denote the transition intensity $\mu^{id}$.
Note: $\nu$ is the Greek letter nu, and is different from the letter $v$, used to denote the total sample observed waiting time in the able state.
Example 7.5. Consider two lives, life A and life B, who take part in an investigation to establish an illness-death model for the age interval 40 to 41. Summaries of the observation of these lives are provided below.
Life A: Joins the investigation age 40 years 2 months in an ill state. Recovers age 40 years 5 months. Becomes ill again age 40 years 7 months and dies age 40 years 9 months.
Life B: Joins the investigation age 40 years in a well state. Falls ill age 40 years 5 months. Recovers age 40 years 9 months and remains well until age 41.
The parameters for these sample lives are summarised in the table below.

Life   v_i    w_i    s_i   r_i   d_i   u_i
A      2/12   5/12   1     1     0     1
B      8/12   4/12   1     1     0     0
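The likelihood function for the four parameters is proportional to:
$$ L(\mu, \nu, \sigma, \rho) = e^{-(\mu+\sigma)v} \, e^{-(\nu+\rho)w} \, \mu^{d} \nu^{u} \sigma^{s} \rho^{r} $$
This relationship is derived using a similar approach to that seen in chapter 5. Let us consider life $i$. Then using the relationship:
$$ {}_t p_x^{\overline{gg}} = \exp\left( -\int_0^t \sum_{j \neq g} \mu^{gj}_{x+s} \, ds \right) $$
we can derive the probability of remaining in the able state for $v_i$ to be $e^{-(\mu+\sigma)v_i}$. Similarly, the probability of remaining in the ill state for $w_i$ is $e^{-(\nu+\rho)w_i}$.
The second assumption for multiple state Markov models allows us to deduce that the probability of making the observed number of transitions between states is proportional to $\mu^{d_i} \nu^{u_i} \sigma^{s_i} \rho^{r_i}$.
Hence the joint distribution for life $i$ is proportional to:
$$ e^{-(\mu+\sigma)v_i} \, e^{-(\nu+\rho)w_i} \, \mu^{d_i} \nu^{u_i} \sigma^{s_i} \rho^{r_i} $$
and by considering this distribution over all $N$ lives, we obtain the likelihood function:
$$ L(\mu, \nu, \sigma, \rho) = e^{-(\mu+\sigma)v} \, e^{-(\nu+\rho)w} \, \mu^{d} \nu^{u} \sigma^{s} \rho^{r}, $$
which enables us to derive the maximum likelihood estimators:
$$ \hat{\mu} = \frac{D}{V}, \qquad \hat{\nu} = \frac{U}{W}, \qquad \hat{\sigma} = \frac{S}{V}, \qquad \hat{\rho} = \frac{R}{W} $$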
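The estimators above are simply ratios of total transition counts to total waiting times. A minimal Python sketch, assuming the per-life statistics are held in a list of dictionaries (a layout chosen here purely for illustration), applies them to the two lives of Example 7.5. With only two lives the resulting figures are of no practical use; the point is the mechanics.

```python
from fractions import Fraction

def illness_death_mle(lives):
    """Return the estimates D/V, U/W, S/V and R/W from per-life statistics.

    Each life is a dict with waiting times v, w and counts s, r, d, u
    (hypothetical data layout, not prescribed by the text).
    """
    totals = {key: sum(life[key] for life in lives) for key in "vwsrdu"}
    return {
        "mu": Fraction(totals["d"]) / totals["v"],     # able -> dead
        "nu": Fraction(totals["u"]) / totals["w"],     # ill -> dead
        "sigma": Fraction(totals["s"]) / totals["v"],  # able -> ill
        "rho": Fraction(totals["r"]) / totals["w"],    # ill -> able
    }

# Lives A and B from Example 7.5 (waiting times in years)
example_7_5 = [
    {"v": Fraction(2, 12), "w": Fraction(5, 12), "s": 1, "r": 1, "d": 0, "u": 1},
    {"v": Fraction(8, 12), "w": Fraction(4, 12), "s": 1, "r": 1, "d": 0, "u": 0},
]
estimates = illness_death_mle(example_7_5)
print(estimates)
```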

Example 7.6. Imagine we have access to the following data from a six-month study into the sickness of men aged between 95 and 96. The investigation started with 10,000 healthy lives age 95. During the study we observe the following movements:

• 1,700 deaths are observed, 30% of which are from a state of ill-health.

• 2,600 illnesses are diagnosed during the period, and 780 recoveries from these illnesses are observed.

• The average period of illness was 4 months.

• 4,300 lives were censored during the investigation, for various reasons, including turning 96.

• There were 4,000 lives under observation when the investigation terminated.

Based on the following assumptions:

• lives were censored uniformly through the six-month period

• deaths occurred uniformly through the six-month period

• illness occurred uniformly through the six-month period

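Then we can estimate the following statistics in the Markov sickness model.
$$ v + w = 10{,}000 \times \tfrac{6}{12} - 4{,}300 \times \tfrac{3}{12} - 1{,}700 \times \tfrac{3}{12} = 3{,}500 $$
$$ w = 2{,}600 \times \tfrac{4}{12} \times \left( \tfrac{2}{6} + \tfrac{1}{6}\left( \tfrac{7}{8} + \tfrac{5}{8} + \tfrac{3}{8} + \tfrac{1}{8} \right) \right) = 577.78 $$
$$ v = 3{,}500 - 577.78 = 2{,}922.22 $$
$$ s = 2{,}600, \quad r = 780, \quad d = 1{,}190, \quad u = 510 $$
Therefore, the maximum likelihood estimates for these parameters, based on this observation, are:
$$ \hat{\mu} = \frac{1{,}190}{2{,}922.22} = 0.4072, \qquad \hat{\nu} = \frac{510}{577.78} = 0.8827, $$
$$ \hat{\sigma} = \frac{2{,}600}{2{,}922.22} = 0.8897, \qquad \hat{\rho} = \frac{780}{577.78} = 1.3500 $$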
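The arithmetic of Example 7.6 can be reproduced with a few lines of Python. This is only a sketch of the calculation as set out in the example; the variable names are illustrative.

```python
# Data from Example 7.6 (six-month study, times in years)
lives, censored, deaths = 10_000, 4_300, 1_700
illnesses, recoveries = 2_600, 780

# Total waiting time v + w: censored lives and deaths contribute
# three months on average under the uniformity assumptions.
total_exposure = lives * 6 / 12 - censored * 3 / 12 - deaths * 3 / 12

# Approximate waiting time in the ill state, following the example
w = illnesses * 4 / 12 * (2 / 6 + (1 / 6) * (7 / 8 + 5 / 8 + 3 / 8 + 1 / 8))
v = total_exposure - w

u = deaths * 30 // 100  # deaths from the ill state (30% of all deaths)
d = deaths - u          # deaths from the able state

estimates = {
    "mu": d / v,
    "nu": u / w,
    "sigma": illnesses / v,
    "rho": recoveries / w,
}
for name, value in estimates.items():
    print(name, round(value, 4))
```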
As the example above illustrates, it is sometimes necessary to estimate $v$ and $w$. There are two reasons for this:

1. The precise data is not always available. For example, observations may only take place at regular intervals, as when using census data.

2. The precise data may be available but it is time-consuming to calculate the waiting times exactly.

When we approximate $v$ and $w$ we make a number of assumptions about the distribution of the observed statistics. The validity of our approximation relies on these assumptions being suitable.

7.4 Properties of the maximum likelihood estimators


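The maximum likelihood estimators are clearly not independent. For example, if $D_i = 1$ then $U_i = 0$. However, they are asymptotically independent, with each estimator asymptotically distributed in a similar way to the two-state Markov model, i.e.

$$ \hat{\mu} \sim N\left(\mu, \frac{\mu}{E[V]}\right), \qquad \hat{\nu} \sim N\left(\nu, \frac{\nu}{E[W]}\right), $$

$$ \hat{\sigma} \sim N\left(\sigma, \frac{\sigma}{E[V]}\right), \qquad \hat{\rho} \sim N\left(\rho, \frac{\rho}{E[W]}\right) $$

Each estimator's asymptotic distribution is uncorrelated with the others.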
The above asymptotic distributions can be derived using a similar approach
to that in chapter 5. We will not derive these as part of this course.
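To illustrate how the asymptotic distributions might be used in practice, the sketch below forms an approximate 95% confidence interval for an intensity. It relies on a plug-in assumption that is not stated in the text: the unknown intensity is replaced by its estimate and $E[V]$ by the observed waiting time $v$. The figures are those of Example 7.6 and the function name is hypothetical.

```python
import math

def approx_ci(count, exposure, z=1.96):
    """Approximate 95% interval for an intensity estimated as count / exposure.

    Uses the plug-in variance estimate / exposure (an assumption: the
    observed waiting time stands in for its expectation).
    """
    estimate = count / exposure
    std_error = math.sqrt(estimate / exposure)
    return estimate - z * std_error, estimate + z * std_error

# Able -> dead intensity from Example 7.6: d = 1,190 and v = 2,922.22
low, high = approx_ci(1_190, 2_922.22)
print(round(low, 4), round(high, 4))
```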

Practical applications
Markov models allow us to reduce the problem of analysing morbidity (the rate of incidence of illness) and mortality experience to the problem of modelling the transitions between states. They offer a practical, intuitive and essentially tractable solution to many problems.

Where the transition intensities do not depend on the length of time in the
current state, then we apply a standard Markov model. If the intensities do
depend on the length of time in the current state we require a semi-Markov
model. The transition intensities can be estimated from transition data, and
with them we can calculate any probabilities by solving some differential
equations.
Once we have calculated the probabilities of transition we can value all sorts of insurance benefits. One important and quite controversial example is the use of a Markov multiple state model to explore the effect of genetic information on insurance and the risk of adverse selection.
Definition 7.4. In an actuarial context, selection is the process by which
lives in a population are divided into separate homogeneous groups. Where
an individual has the ability to influence the group they fall into, adverse
selection can occur.
For example, if an individual's parents have both died at an early age from heart disease, the individual may choose to take out additional life insurance. If insurance company A does not take family history into account when setting premiums, but other insurance companies do, the individual may approach company A to try to avoid higher premiums. Company A is then subject to adverse selection.

Insurers were extremely worried about the effect of adverse selection on their
solvency. Adverse selection usually arises from an information asymmetry,
where the policyholder exploits the additional information he holds about
his own claim risk. Models were used to explore the idea that the risk of
adverse selection could, to a large extent, be managed by restrictions on the
level of death benefits under insurance policies (which are very sensitive to
mortality experience). If large sums insured were only available to applicants
prepared to undergo genetic testing, then the adverse selection issues were
not as daunting as the insurance industry had assumed.
Since 1999 the Genetics and Insurance Research Centre, based at Heriot-Watt University in Edinburgh, has continued to apply Markov and semi-Markov models to specific disorders having a genetic component, including breast and ovarian cancer, Alzheimer's disease and Huntington's disease. Such
research is used when pricing critical illness insurance, which pays a benefit
in the event that the policyholder suffers from one of a list of diseases within
a specified term. The implications of adverse selection for critical illness
insurance have been explored in many of their papers, and the work assesses
what information is required for a potential policyholder to be eligible for
insurance, and at what premium.

Multiple state models are also highly appropriate for modelling policyholder
or investor behaviour, for example life insurance lapse experience.
They are also used for econometric models, for example, analysing stock
data or the state of the economy. The mathematics is somewhat more com-
plex, but it opens opportunities also for more advanced financial engineering
applications.

7.5 Summary
The multiple state model relies on the following three assumptions:

1. The Markov assumption: The probability that a life at any given age will be found in a given state at any subsequent age depends only on the ages involved and on the state currently occupied.
2. For any two distinct states, $g$ and $h$, over a short time interval $dt$:
$$ {}_{dt} p_{x+t}^{gh} = \mu^{gh}_{x+t} \, dt + o(dt) \quad \text{for } t > 0, $$
where $o(dt)$ is a small correction term, which allows for the probability that a life makes two or more transitions in time period $dt$.
3. For any two distinct states, $g$ and $h$: $\mu^{gh}_{x+t}$ is constant for $x$ an integer and $0 \le t < 1$.

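These assumptions allow us to derive relationships such as:

$$ \frac{\partial}{\partial t} \, {}_t p_x^{gh} = \sum_{j \neq h} \left( {}_t p_x^{gj} \, \mu^{jh}_{x+t} - {}_t p_x^{gh} \, \mu^{hj}_{x+t} \right) $$

$$ \frac{\partial}{\partial t} \, {}_t p_x^{\overline{gg}} = - {}_t p_x^{\overline{gg}} \sum_{j \neq g} \mu^{gj}_{x+t} $$

We can find the probability of remaining in the same state over a period $t$:

$$ {}_t p_x^{\overline{gg}} = \exp\left( -\int_0^t \sum_{j \neq g} \mu^{gj}_{x+s} \, ds \right) $$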

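We then introduced terminology to summarise data for the illness-death model, i.e. $V$, $W$, $R$, $S$, $D$ and $U$, and established that the likelihood function for the four parameters $\mu$, $\nu$, $\sigma$ and $\rho$ is proportional to:
$$ L(\mu, \nu, \sigma, \rho) = e^{-(\mu+\sigma)v} \, e^{-(\nu+\rho)w} \, \mu^{d} \nu^{u} \sigma^{s} \rho^{r} $$
The maximum likelihood estimators are:
$$ \hat{\mu} = \frac{D}{V}, \qquad \hat{\nu} = \frac{U}{W}, \qquad \hat{\sigma} = \frac{S}{V}, \qquad \hat{\rho} = \frac{R}{W} $$
and the asymptotic distributions for these estimators are:

$$ \hat{\mu} \sim N\left(\mu, \frac{\mu}{E[V]}\right), \qquad \hat{\nu} \sim N\left(\nu, \frac{\nu}{E[W]}\right), $$

$$ \hat{\sigma} \sim N\left(\sigma, \frac{\sigma}{E[V]}\right), \qquad \hat{\rho} \sim N\left(\rho, \frac{\rho}{E[W]}\right) $$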

Questions

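1. The model below represents a multiple state model for an Income Protection Plan, where there are 3 states: a = able, i = ill and d = dead. Outline the following terms in words:

(a) ${}_t p_x^{ad}$

(b) ${}_t p_x^{ii}$

(c) ${}_t p_x^{\overline{ii}}$

(d) ${}_t p_x^{dd}$

[Figure: the illness-death model with three states: able, ill and dead. Transitions: able → ill (intensity $\sigma_x$), ill → able (intensity $\rho_x$), able → dead (intensity $\mu_x$), ill → dead (intensity $\nu_x$).]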

2. Consider a simple model investigating the effects of smoking on mortality, where n = non-smoker, s = smoker, f = former smoker, d = dead. Which of the following statements is true:

(a) ${}_t p_x^{nn} = {}_t p_x^{\overline{nn}}$

(b) ${}_t p_x^{ss} = {}_t p_x^{\overline{ss}}$

(c) ${}_t p_x^{ff} = {}_t p_x^{\overline{ff}}$

(d) ${}_t p_x^{dd} = {}_t p_x^{\overline{dd}}$

3. Prove the relationship
$$ \frac{\partial}{\partial t} \, {}_t p_x^{\overline{gg}} = - {}_t p_x^{\overline{gg}} \sum_{j \neq g} \mu^{gj}_{x+t} $$

4. Consider three lives, lives A, B and C, who take part in an investigation to establish an illness-death model for the age interval 70 to 71. The results of the observation of these lives are outlined below. Summarise the results from these observed samples using the appropriate notation (e.g. $v_i$, $w_i$ etc.).

(a) Life A: Joins observation age 70 when ill. Recovers age 70 years 1 month. Falls ill again age 70 years 2 months. Recovers age 70 years 4 months. Leaves study age 70 years 11 months.
(b) Life B: Joins observation age 70 years 3 months when well. Dies age 70 years 10 months.
(c) Life C: Joins observation age 70 and well. Falls ill age 70 years 3 months. Recovers age 70 years 9 months. Falls ill again age 70 years 10 months. Dies age 70 years 11 months.

5. Define W , D and U .

6. Following an investigation into the sickness of a defined age group within a population we observe the following results:

• 211 subjects fall ill during the observation
• 135 subjects recover during the observation
• 72 subjects die during the observation, 62 of whom were ill at the time of death.
• Subjects were observed for 1,304 years (treating each life as a separate contribution to the total): for 503 of these years the subjects were in an ill state.

Assuming the transition intensities are constant for this age band, draw a diagram to represent the Markov model for sickness and use the observation above to estimate the relevant transition intensities.

7. Can a transition intensity be greater than one? Justify your answer.

A Chapter 1 solutions
1. What input parameters might you have for a model assessing the cost of providing life assurance?

• date of birth
• sex
• smoker/non-smoker indicator
• height
• weight
• married
• profession
• units of alcohol consumed per week
• family history of illness
• criminal record indicator

The factors above will help a life company to accurately assess the likelihood of an individual dying during the period the life cover is in force. In addition to this, the sum assured (i.e. the amount which will be paid out on death) and the period of cover may also be input parameters, chosen by the customer rather than specified by the insurer.

2. Explain the key steps involved in constructing a model.

(a) Objectives: Develop a clear set of objectives to model the relevant system or process.
(b) Plan and validate: Plan the model around the chosen objectives, ensuring that the model's output can be validated, i.e. checked to ensure it accurately reflects the anticipated output from the relevant system or process.
(c) Data: Collect and analyse the necessary data for the model. This will include assigning appropriate values to the input parameters and justifying any assumptions made as part of the modelling process.
(d) Capture real world system: The initial model should be described so as to capture the main features of the real world system. The level of detail in the model can always be reduced at a later stage.
(e) Expertise: Involve the experts on the relevant system or process. They will be able to give feedback on the validity of the model before it is developed further.
(f) Choose computer program: Decide on whether the model should be built using a simulation package or a general purpose language.
(g) Write model: Write the computer program for the model.
(h) Debug the program.
(i) Test model output: Test the reasonableness of the output from the model. The experts on the relevant system or process should be involved at this stage.
(j) Review and amend: Review and carefully consider the appropriateness of the model in the light of small changes in input parameters.
(k) Analyse output: Analyse the output from the model.
(l) Document and communicate: Document and communicate the results of the model, including an appropriate amount of detail around the methodology used.

3. At what stage in the modelling process should you debug your program?

You should debug your program:

• before running it: to ensure there are no obvious errors. Models can take a while to run and so it is worth double checking the programming before running it.
• after the program has run: when errors that prevent the program from running smoothly are picked up, and following the investigation of the validity of the output.
• after the program has been modified: if we modify the program it is necessary to revisit the first two steps above, i.e. debug before the program is rerun and after it has been run.

4. Briefly discuss the issues to consider when assessing the suitability of a model.

• objectives: does the model achieve our objectives? i.e. does it accurately represent the system we are trying to represent, to our required level?
• data: are the data and assumptions used in the model appropriate? have we addressed correlations between parameters used in the model to an acceptable level? do we understand the impact of simplifications and assumptions made in the model?
• past models: have we used what information we can from previous relevant models?
• accuracy: is the model sufficiently accurate to meet its objectives? We need to balance accuracy against the cost and complexity involved with building the model. Also, care must be taken to ensure the output and results are not communicated at a spurious level of accuracy.
• credibility of output: is the output from the model credible?
• communicable: are the results easy to communicate to the relevant audience?
• extendability: examine whether the model can be adapted for further investigation here or used in future projects.

5. Explain the difference between a stochastic and a deterministic model.

A stochastic model is a simulation model that allows for uncertain events with probability distributions for some or all of the model parameters, i.e. the parameters are treated as random. A deterministic model is a model which does not include any element of randomness.

When we run a stochastic model we obtain one possible outcome from the model, i.e. the output is random. Each specific output is only one estimate of the real world system. Therefore, several independent runs are required in order to obtain an idea of the distribution of the output.

In contrast, a deterministic model uses fixed parameters and has a single set of outputs. You could view a deterministic model as a single run of a stochastic model with fixed input parameters. That is, you only need to run a deterministic model once to determine the results.

6. What does sensitivity analysis mean?

Sensitivity analysis involves making small changes to the model input parameters and analysing the effect this has on the model's output. Usually only one parameter is changed for each comparative run, to gauge the extent of its significance in the model. It then allows us to examine the validity of making this assumption.

7. What is a Turing test?

A Turing test involves experts in the real world system analysing the
output from a model against real life data from the system the model
is trying to recreate. The experts are asked to compare several sets
of data: some from the model and some from the real world system.
They are not told which data is which. The experts are then asked
whether they can differentiate between the two sets of data. If they are
able to differentiate, their method of differentiation can then be used
to adapt and improve the model.

8. Mr Jones plans to retire this month. He asks you to calculate what amount of money he would require now in order to draw a pension of 10,000 p.a. at the end of each year from his funds. He says that he expects to live for 20 years and to be able to achieve a future investment return of 7% p.a.

(a) Calculate the funds required by Mr Jones based on his expected future lifetime and investment return.
$10{,}000 \, a_{\overline{20}|}$ where $i = 7\%$ p.a. $= 10{,}000 \times \frac{1 - v^{20}}{i} = 105{,}940$
You wish to demonstrate to Mr Jones the effect of any uncertainty around his future lifetime and future investment returns. Therefore you carry out some sensitivity testing on his required funds. What funds will be required by Mr Jones if:

(b) he only survives for 15 years?
$10{,}000 \, a_{\overline{15}|}$ where $i = 7\%$ p.a. $= 91{,}079$
(c) he survives for 25 years?
$10{,}000 \, a_{\overline{25}|}$ where $i = 7\%$ p.a. $= 116{,}536$
(d) his future investment returns are 6% p.a.?
$10{,}000 \, a_{\overline{20}|}$ where $i = 6\%$ p.a. $= 114{,}699$
(e) his future investment returns are 8% p.a.?
$10{,}000 \, a_{\overline{20}|}$ where $i = 8\%$ p.a. $= 98{,}181$
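The sensitivity tests in this solution all use the same annuity formula with one input changed at a time, which makes them easy to script. A minimal Python sketch follows; the function and scenario names are illustrative.

```python
def annuity_immediate(n, i):
    """Present value of 1 p.a. payable at the end of each year for n years."""
    v = 1 / (1 + i)  # one-year discount factor
    return (1 - v ** n) / i

PENSION = 10_000
scenarios = {
    "(a) base case": (20, 0.07),
    "(b) 15 years": (15, 0.07),
    "(c) 25 years": (25, 0.07),
    "(d) 6% return": (20, 0.06),
    "(e) 8% return": (20, 0.08),
}
funds = {name: PENSION * annuity_immediate(n, i) for name, (n, i) in scenarios.items()}
for name, value in funds.items():
    print(name, round(value))
```

Changing one input per scenario, as here, shows directly how sensitive the required fund is to each assumption.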

B Chapter 2 solutions
1. Use Excel's RAND() function to generate 30 $U(0, 1)$-random variates. Use them to estimate the integral of $f(x) = x e^{-2x} + 1$ over the domain $x \in [0, 1]$.

Answer: In view of the strong law of large numbers, we expect that

$$ I := \int_0^1 f(x) \, dx = E(f(U)) \approx \hat{I}_M := \frac{1}{M} \sum_{m=1}^{M} f(U_m), $$

where $M \in \mathbb{N}$ is large enough and $U, U_1, \dots, U_M$ are independent $U(0, 1)$ distributed random variables. Suppose now that $M = 30$ and that the realisations of the random variables $U_m$ for $1 \le m \le M$ are given by

(u_1, ..., u_30) = (0.587, 0.030, 0.048, 0.593, 0.165, 0.788, 0.714, 0.265,
                    0.712, 0.630, 0.569, 0.766, 0.638, 0.984, 0.721, 0.028,
                    0.726, 0.218, 0.792, 0.656, 0.155, 0.447, 0.224, 0.478,
                    0.113, 0.493, 0.980, 0.320, 0.524, 0.933)

Using these numbers, the estimator of the integral is equal to $\frac{1}{M} \sum_{m=1}^{M} f(u_m) = 1.147872$. The integral $I$ can be evaluated as

$$ I = \int_0^1 (x e^{-2x} + 1) \, dx = \left[ -\tfrac{1}{2} x e^{-2x} - \tfrac{1}{4} e^{-2x} + x \right]_{x=0}^{x=1} = \frac{5}{4} - \frac{3}{4e^2} = 1.148499. $$

2. Obtain the 95%-confidence interval for the integral in Question 1, based on the Monte Carlo estimate with $M$ random simulations.
Answer: If $M$ is bigger than 30, say, we may assume that $Z := \sqrt{M} \, (\hat{I}_M - I) / \tilde{\sigma}_f$ is approximately $N(0, 1)$ distributed, where we use the sample variance

$$ \tilde{\sigma}_f^2 = \frac{1}{M - 1} \sum_{m=1}^{M} \left( f(U_m) - \hat{I}_M \right)^2, $$

which is an unbiased estimator of the variance $\mathrm{Var}(f(U))$. Let $z$ be such that

$$ 0.95 = P(-z \le Z \le z) = \Phi(z) - \Phi(-z) = 2\Phi(z) - 1, $$

i.e. $\Phi(z) = 0.975$. From the table of the standard normal distribution function $\Phi$, we then get the value $z = 1.96$. This implies that

$$ 0.95 \approx P\left( \hat{I}_M - 1.96 \frac{\tilde{\sigma}_f}{\sqrt{M}} \le I \le \hat{I}_M + 1.96 \frac{\tilde{\sigma}_f}{\sqrt{M}} \right), $$

i.e. $\left[ \hat{I}_M - 1.96 \frac{\tilde{\sigma}_f}{\sqrt{M}}, \; \hat{I}_M + 1.96 \frac{\tilde{\sigma}_f}{\sqrt{M}} \right]$ is a 95%-confidence interval for $I$. In the situation of Question 1, we obtain the interval $[1.1316, 1.1642]$, which contains both the true value $I = 1.148499$ and the point estimate 1.147872. Here the standard deviation is $\tilde{\sigma}_f = 0.045532$.
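The estimate and confidence interval of Questions 1 and 2 can be reproduced from the thirty listed variates. The following Python sketch uses only the standard library, with the integrand taken as $f(x) = x e^{-2x} + 1$; the variable names are illustrative.

```python
import math
import statistics

# The thirty U(0, 1) realisations listed in the answer to Question 1
U = [0.587, 0.030, 0.048, 0.593, 0.165, 0.788, 0.714, 0.265,
     0.712, 0.630, 0.569, 0.766, 0.638, 0.984, 0.721, 0.028,
     0.726, 0.218, 0.792, 0.656, 0.155, 0.447, 0.224, 0.478,
     0.113, 0.493, 0.980, 0.320, 0.524, 0.933]

def f(x):
    return x * math.exp(-2 * x) + 1

values = [f(u) for u in U]
estimate = statistics.fmean(values)        # Monte Carlo estimate of the integral
sd = statistics.stdev(values)              # sample standard deviation (divisor M - 1)
half_width = 1.96 * sd / math.sqrt(len(values))
ci = (estimate - half_width, estimate + half_width)
exact = 5 / 4 - 3 / (4 * math.e ** 2)      # closed-form value of the integral

print(round(estimate, 6), round(sd, 6), ci, round(exact, 6))
```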
3. Write an algorithm to generate $M \in \mathbb{N}$ random variates from the double exponential distribution with density function $f(x) = \frac{\lambda}{2} e^{-\lambda |x|}$ for $x \in \mathbb{R}$ where $\lambda > 0$, using the inverse-transform method.
Answer: The corresponding distribution function is given by
$$ F(x) = \int_{-\infty}^{x} f(y) \, dy = \begin{cases} \frac{1}{2} e^{\lambda x}, & \text{if } x < 0, \\ 1 - \frac{1}{2} e^{-\lambda x}, & \text{if } x \ge 0. \end{cases} $$

We note that $F(x) < 1/2$ if and only if $x < 0$. The inverse $F^{-1}$ is easily derived as
$$ F^{-1}(y) = \begin{cases} \frac{1}{\lambda} \log(2y), & \text{if } 0 < y < \frac{1}{2}, \\ -\frac{1}{\lambda} \log(2(1 - y)), & \text{if } \frac{1}{2} \le y < 1. \end{cases} $$
We can use the following algorithm:
(a) Generate a sequence of independent random variates $u_1, \dots, u_M$ from the uniform distribution $U(0, 1)$.
(b) If $u_m < 1/2$ then let $x_m = \frac{1}{\lambda} \log(2u_m)$, otherwise let $x_m = -\frac{1}{\lambda} \log(2(1 - u_m))$, $(m = 1, \dots, M)$.
(c) Return the sequence $x_1, \dots, x_M$.
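Steps (a) to (c) translate almost directly into code. The Python sketch below is one possible rendering; the function names are illustrative, and the guard against a uniform draw of exactly zero is an implementation detail added here because `random.random()` can return 0.0.

```python
import math
import random

def laplace_cdf(x, lam):
    """Distribution function F of the double exponential distribution."""
    if x < 0:
        return 0.5 * math.exp(lam * x)
    return 1 - 0.5 * math.exp(-lam * x)

def laplace_inverse_cdf(y, lam):
    """Inverse F^{-1}(y) for 0 < y < 1."""
    if y < 0.5:
        return math.log(2 * y) / lam
    return -math.log(2 * (1 - y)) / lam

def sample_double_exponential(m, lam, rng):
    """Generate m variates by the inverse-transform method."""
    sample = []
    while len(sample) < m:
        u = rng.random()      # step (a): uniform variate on [0, 1)
        if u == 0.0:          # avoid log(0); redraw in this rare case
            continue
        sample.append(laplace_inverse_cdf(u, lam))  # step (b)
    return sample             # step (c)

xs = sample_double_exponential(5, 2.0, random.Random(1))
print(xs)
```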
4. Demonstrate that the acceptance-rejection method algorithm does generate random variates from the distribution function with density function $f(x)$.

Answer: The answer requires us to formulate the algorithm with the help of random variables rather than their realisations. Let $U$ and $Z$ be independent random variables, where we assume that $U \sim U(0, 1)$ and that $Z$ has a suitable density $h(x)$. We consider a probability density $f(x)$ and set

$$ C = \sup_x \frac{f(x)}{h(x)}, \qquad g(x) = \frac{f(x)}{C \, h(x)}. $$

Here we use that $0/0 := 0$ and $1/0 = \infty$. We assume that $C$ is finite. The algorithm for the simulation of a random variable $X$ from density $f(x)$ states that we return $X = Z$ if $U \le g(Z)$ and reject otherwise. So we have to prove that the distribution of $Z$ given $U \le g(Z)$ has the density $f(x)$. For this, it suffices to show that the distribution function $P(Z \le z \mid U \le g(Z))$ coincides with the distribution function $\int_{-\infty}^{z} f(x) \, dx$ for arbitrary (fixed) $z \in \mathbb{R}$. Using the properties of conditional expectations, we get

$$ P(U \le g(Z)) = E[I_{\{U \le g(Z)\}}] = E(E[I_{\{U \le g(Z)\}} \mid Z]) = E(g(Z)) $$
$$ = \int_{\mathbb{R}} g(x) h(x) \, dx = \int_{\mathbb{R}} \frac{f(x)}{C \, h(x)} h(x) \, dx = \frac{1}{C} \int_{\mathbb{R}} f(x) \, dx = \frac{1}{C}. $$

Note that $I$ denotes an indicator function, where, for any event $A$, we have $I_A(\omega) = 1$ if $\omega \in A$ and $I_A(\omega) = 0$ otherwise. Similarly to the above,

$$ P(Z \le z, \, U \le g(Z)) = E[I_{\{Z \le z, \, U \le g(Z)\}}] = E[I_{\{Z \le z\}} I_{\{U \le g(Z)\}}] = E(E[I_{\{Z \le z\}} I_{\{U \le g(Z)\}} \mid Z]) $$
$$ = E(I_{\{Z \le z\}} E[I_{\{U \le g(Z)\}} \mid Z]) = E(I_{\{Z \le z\}} \, g(Z)) = \int_{-\infty}^{z} g(x) h(x) \, dx $$
$$ = \int_{-\infty}^{z} \frac{f(x)}{C \, h(x)} h(x) \, dx = \frac{1}{C} \int_{-\infty}^{z} f(x) \, dx. $$

Combining the above formulae, we obtain

$$ P(Z \le z \mid U \le g(Z)) = \frac{P(Z \le z, \, U \le g(Z))}{P(U \le g(Z))} = \int_{-\infty}^{z} f(x) \, dx $$

as required.
5. Estimate the proportion of computed numbers that would be rejected from the acceptance-rejection method if $f(x) = \frac{3}{32}(1 - x)(x - 5)$ for $1 \le x \le 5$ and $h(x)$ is the density function for the $U(1, 5)$ distribution.
Answer: Let us use the notation from the answer to Question 4. The proportion of computed numbers that would be rejected is estimated by the probability $P(U > g(Z)) = 1 - P(U \le g(Z)) = 1 - \frac{1}{C}$. So we have to evaluate the constant $C$:
$$ C = \sup_x \frac{f(x)}{h(x)} = \sup_{x \in (1,5)} \frac{\frac{3}{32}(x - 1)(5 - x)}{\frac{1}{4}} = \frac{3}{8} \sup_{x \in (1,5)} \big( (x - 1)(5 - x) \big). $$

The function $(x - 1)(5 - x)$, $(x \in (1, 5))$, assumes its maximum at $x = 3$. Hence $C = \frac{3}{8} \times 4 = \frac{3}{2}$ and, in turn, $P(U > g(Z)) = \frac{1}{3} = 0.3333$.
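The rejection probability of one third can also be checked by simulation. The Python sketch below runs the acceptance-rejection method with proposals from $U(1, 5)$ and $C = 3/2$; the number of proposals and the seed are arbitrary choices made for illustration.

```python
import random

def f(x):
    """Target density 3/32 (x - 1)(5 - x) on [1, 5]."""
    return 3 / 32 * (x - 1) * (5 - x) if 1 <= x <= 5 else 0.0

H = 1 / 4   # density of the U(1, 5) proposal distribution
C = 3 / 2   # supremum of f / h, attained at x = 3

def simulate(n_proposals, seed=0):
    """Return the accepted variates out of n_proposals proposals."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        z = rng.uniform(1, 5)        # proposal Z with density h
        u = rng.random()             # independent U(0, 1) variate
        if u <= f(z) / (C * H):      # accept if U <= g(Z)
            accepted.append(z)
    return accepted

N = 100_000
sample = simulate(N)
rejected_share = 1 - len(sample) / N
print(round(rejected_share, 3))
```

The observed share of rejected proposals should be close to $1 - 1/C = 1/3$, subject to simulation noise.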
6. Use Excel's RAND() function to generate 6 $U(0, 1)$-random variates. Use them to generate 5 random variates from the $N(10, 16)$ distribution.
Answer: Assume that the generated numbers are 0.686, 0.033, 0.248, 0.046, 0.229, and 0.617. We use the Box-Muller algorithm with pairs of uniform random variates $(u_1^{(1)}, u_2^{(1)})$, $(u_1^{(2)}, u_2^{(2)})$, and $(u_1^{(3)}, u_2^{(3)})$ given by $(0.686, 0.033)$, $(0.248, 0.046)$, and $(0.229, 0.617)$. We simply return

$$ z_1^{(j)} = 10 + 4 \sqrt{-2 \ln u_1^{(j)}} \, \cos(2\pi u_2^{(j)}), \qquad z_2^{(j)} = 10 + 4 \sqrt{-2 \ln u_1^{(j)}} \, \sin(2\pi u_2^{(j)}), $$

for all $j = 1, 2, 3$, i.e. 13.398, 10.715, 16.403, 11.904, 4.906, and 5.394. Since we have generated 6 random variates, one of them (say the last one) can be dropped.
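The Box-Muller step above can be written as a short function. This Python sketch reproduces the six values from the three pairs of uniforms given in the answer; the function name and default arguments are illustrative.

```python
import math

def box_muller(u1, u2, mean=10.0, sd=4.0):
    """Turn two U(0, 1) variates into two N(mean, sd^2) variates."""
    radius = sd * math.sqrt(-2 * math.log(u1))
    angle = 2 * math.pi * u2
    return mean + radius * math.cos(angle), mean + radius * math.sin(angle)

pairs = [(0.686, 0.033), (0.248, 0.046), (0.229, 0.617)]
variates = [z for pair in pairs for z in box_muller(*pair)]
print([round(z, 3) for z in variates])
```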

C Chapter 3 solutions
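1. Let $X$ be a random variable from the continuous uniform distribution, $X \sim U(0.5, 1.0)$. Starting with the probability density function, derive expressions for the cumulative distribution function, expectation and variance of $X$.

Answer: The probability density function $f_X(x)$ of $X$ is a positive constant on the interval $(0.5, 1.0)$ and zero otherwise. Further it has integral 1, so the right choice is given by

$$ f_X(x) = 2 \, I_{(1/2,1)}(x) = \begin{cases} 2, & \text{if } x \in (1/2, 1), \\ 0, & \text{otherwise.} \end{cases} $$

Here $I_A(x)$ denotes the indicator function of a set $A$, which is equal to 1 if $x \in A$ and equal to 0 otherwise. The cumulative distribution function is given by

$$ F_X(x) = \int_{-\infty}^{x} f_X(z) \, dz = \begin{cases} 0, & \text{if } x \le 1/2, \\ \int_{1/2}^{x} 2 \, dz, & \text{if } x \in (1/2, 1), \\ 1, & \text{if } x \ge 1 \end{cases} $$
$$ = 2 \left( x - \tfrac{1}{2} \right) I_{(1/2,1)}(x) + I_{[1,\infty)}(x), \qquad (x \in \mathbb{R}). $$

Hence the expectation of $X$ is

$$ E[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx = \int_{1/2}^{1} 2x \, dx = \left[ x^2 \right]_{x=1/2}^{x=1} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75. $$

The variance of $X$ can be calculated like this:

$$ \mathrm{Var}(X) = \int_{-\infty}^{\infty} \left( x - \tfrac{3}{4} \right)^2 f_X(x) \, dx = 2 \int_{1/2}^{1} \left( x - \tfrac{3}{4} \right)^2 dx = \frac{2}{3} \left[ \left( x - \tfrac{3}{4} \right)^3 \right]_{x=1/2}^{x=1} $$
$$ = \frac{2}{3} \left[ \frac{1}{4^3} + \frac{1}{4^3} \right] = \frac{1}{3} \cdot \frac{1}{16} = \frac{1}{48} = 0.020833. $$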
2. A sample space consists of five elements $\Omega = \{a_1, a_2, a_3, a_4, a_5\}$. For which of the following sets of probabilities does the corresponding triple $(\Omega, \mathcal{A}, P)$ become a probability space? Why?
(a) $p(a_1) = 0.3$; $p(a_2) = 0.2$; $p(a_3) = 0.1$; $p(a_4) = 0.1$; $p(a_5) = 0.1$;

(b) $p(a_1) = 0.4$; $p(a_2) = 0.3$; $p(a_3) = 0.1$; $p(a_4) = 0.1$; $p(a_5) = 0.1$;

(c) $p(a_1) = 0.4$; $p(a_2) = 0.3$; $p(a_3) = 0.2$; $p(a_4) = -0.1$; $p(a_5) = 0.1$;

(d) $p(a_1) = 0.4$; $p(a_2) = 0.3$; $p(a_3) = 0.2$; $p(a_4) = 0.1$; $p(a_5) = 0.1$.

Answer: Since $\Omega$ is finite, we may assume that $\mathcal{A}$ is the $\sigma$-algebra of all subsets of $\Omega$. So in order to answer the question, we only have to look at the point probabilities $p(a_i) = P(\{a_i\})$ for $i = 1, \dots, 5$. From the definition of a discrete probability distribution, we know that the sum of all point probabilities must be equal to 1, i.e. here we must have

$$ p(a_1) + p(a_2) + \dots + p(a_5) = 1. $$

In parts (a) and (d), the value of this sum is equal to 0.8 and 1.1, respectively, which means that $P$ is not a probability distribution and therefore $(\Omega, \mathcal{A}, P)$ is not a probability space.
We further know that probabilities can never be negative. In part (c), we have $p(a_4) = -0.1$, which means that $(\Omega, \mathcal{A}, P)$ is not a probability space.
In part (b), $(\Omega, \mathcal{A}, P)$ is indeed a probability space, since here all requirements are met.

3. Assets A and B have the following distribution of returns in various states:

State   Asset A   Asset B   Probability
1       10%       -2%       0.2
2       8%        15%       0.2
3       25%       0%        0.3
4       -14%      6%        0.3

Show that the correlation between the returns on asset A and asset B is equal to $-0.3830$.

Answer: Let $R_A$ and $R_B$ be the returns on assets A and B, respectively. Then the correlation between $R_A$ and $R_B$ is given by

$$ \mathrm{Corr}(R_A, R_B) = \frac{\mathrm{Cov}(R_A, R_B)}{\sqrt{\mathrm{Var}(R_A) \mathrm{Var}(R_B)}}, $$

where $\mathrm{Cov}(R_A, R_B) = E(R_A R_B) - E(R_A) E(R_B)$ is the covariance between $R_A$ and $R_B$. We have

$$ E(R_A) = (10 \times 0.2 + 8 \times 0.2 + 25 \times 0.3 + (-14) \times 0.3)\% = 6.9\%, $$
$$ \mathrm{Var}(R_A) = E(R_A^2) - (E(R_A))^2 = \left( 10^2 \times 0.2 + 8^2 \times 0.2 + 25^2 \times 0.3 + (-14)^2 \times 0.3 - 6.9^2 \right)\%\% = (15.2148)^2 \, \%\%, $$
$$ \sqrt{\mathrm{Var}(R_A)} = 15.2148\%, $$
$$ E(R_B) = ((-2) \times 0.2 + 15 \times 0.2 + 0 \times 0.3 + 6 \times 0.3)\% = 4.4\%, $$
$$ \mathrm{Var}(R_B) = E(R_B^2) - (E(R_B))^2 = \left( (-2)^2 \times 0.2 + 15^2 \times 0.2 + 0^2 \times 0.3 + 6^2 \times 0.3 - 4.4^2 \right)\%\% = (6.1025)^2 \, \%\%, $$
$$ \sqrt{\mathrm{Var}(R_B)} = 6.1025\%, $$
$$ E(R_A R_B) = \left( 10 \times (-2) \times 0.2 + 8 \times 15 \times 0.2 + 25 \times 0 \times 0.3 + (-14) \times 6 \times 0.3 \right)\%\% = -5.2\%\%. $$

Note that % and %% stand for $1/100$ and $1/100^2$, respectively. Using the values above, we obtain

$$ \mathrm{Corr}(R_A, R_B) = \frac{-5.2/100^2 - 6.9 \times 4.4/100^2}{0.152148 \times 0.061025} = -0.3830, $$
as required.
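The calculation can be verified numerically. The Python sketch below takes the returns as $-14\%$ for asset A in state 4 and $-2\%$ for asset B in state 1, which are the signs implied by the expectations $E(R_A) = 6.9\%$ and $E(R_B) = 4.4\%$ computed in the answer; the helper name is illustrative.

```python
import math

returns_a = [10, 8, 25, -14]   # asset A returns in percent, states 1 to 4
returns_b = [-2, 15, 0, 6]     # asset B returns in percent, states 1 to 4
probs = [0.2, 0.2, 0.3, 0.3]

def expect(values):
    """Expectation of a discrete random variable under the state probabilities."""
    return sum(p * x for p, x in zip(probs, values))

mean_a = expect(returns_a)
mean_b = expect(returns_b)
var_a = expect([a * a for a in returns_a]) - mean_a ** 2
var_b = expect([b * b for b in returns_b]) - mean_b ** 2
cov = expect([a * b for a, b in zip(returns_a, returns_b)]) - mean_a * mean_b
corr = cov / math.sqrt(var_a * var_b)

print(round(mean_a, 1), round(mean_b, 1), round(corr, 4))
```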

4. Formalise Example 3.6 as $\Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\}$, $P(\{\omega_1\}) = P(\{\omega_2\}) = P(\{\omega_3\}) = P(\{\omega_4\}) = 1/4$ and

$$ A := \{\omega_1, \omega_4\}, \qquad B := \{\omega_2, \omega_4\}, \qquad C := \{\omega_3, \omega_4\}. $$

Prove that the pairs $(A, B)$, $(A, C)$ and $(B, C)$ are independent, but the triple $(A, B, C)$ is not mutually independent according to Definition 3.4.

Answer: We have
$$ P(A) = P(B) = P(C) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}, $$
$$ P(A \cap B) = P(\{\omega_4\}) = \frac{1}{4} = P(A)P(B), $$
$$ P(A \cap C) = P(\{\omega_4\}) = \frac{1}{4} = P(A)P(C), $$
$$ P(B \cap C) = P(\{\omega_4\}) = \frac{1}{4} = P(B)P(C), $$
which shows that the pairs $(A, B)$, $(A, C)$ and $(B, C)$ are independent. However
$$ P(A \cap B \cap C) = P(\{\omega_4\}) = \frac{1}{4} \neq \frac{1}{8} = P(A)P(B)P(C). $$
So the triple $(A, B, C)$ is not mutually independent.
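Because the sample space is finite, the same check can be carried out mechanically. In the Python sketch below the outcomes $\omega_1, \dots, \omega_4$ are represented by the integers 1 to 4, a labelling chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

P = {w: Fraction(1, 4) for w in (1, 2, 3, 4)}   # equally likely outcomes
events = {"A": {1, 4}, "B": {2, 4}, "C": {3, 4}}

def prob(event):
    """Probability of an event given as a set of outcomes."""
    return sum(P[w] for w in event)

pairwise = all(
    prob(events[x] & events[y]) == prob(events[x]) * prob(events[y])
    for x, y in combinations(events, 2)
)
triple_lhs = prob(events["A"] & events["B"] & events["C"])
triple_rhs = prob(events["A"]) * prob(events["B"]) * prob(events["C"])
mutual = triple_lhs == triple_rhs

print(pairwise, mutual)
```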

5. You intend to model the maximum daily temperature in your office as a stochastic process. What time set and state space would you use?

Answer: It is reasonable to use a suitable discrete time set such as $T = \{0, 1, 2, \dots\}$ and a continuous state space such as $S = \mathbb{R}$.

D Chapter 4 solutions
1. Consider a Markov chain with state space $S = \{0, 1, 2\}$ and transition matrix
$$ P = \begin{pmatrix} p & q & 0 \\ 1/4 & 0 & 3/4 \\ p - 1/2 & 7/10 & 1/5 \end{pmatrix}. $$

(a) Calculate values for $p$ and $q$.

(b) Draw the transition graph for the process.

(c) Calculate the transition probabilities $p^{(3)}_{i,j}$.

(d) Find any stationary distributions for the process.

Answer: (a) The sum of all entries in the last row must be equal to 1, as a consequence of which $p = 1 - \frac{1}{5} - \frac{7}{10} + \frac{1}{2} = \frac{3}{5}$. In view of the first row, we see that $q = \frac{2}{5}$.
(b) [Figure: transition graph on the states 0, 1 and 2 with edges 0 → 0 (3/5), 0 → 1 (2/5), 1 → 0 (1/4), 1 → 2 (3/4), 2 → 0 (1/10), 2 → 1 (7/10) and 2 → 2 (1/5).]

(c) With the help of (a) we get

$$ P = \begin{pmatrix} 3/5 & 2/5 & 0 \\ 1/4 & 0 & 3/4 \\ 1/10 & 7/10 & 1/5 \end{pmatrix}, \qquad P^2 = \begin{pmatrix} 23/50 & 6/25 & 3/10 \\ 9/40 & 5/8 & 3/20 \\ 51/200 & 9/50 & 113/200 \end{pmatrix}, $$

$$ P^3 = \begin{pmatrix} 183/500 & 197/500 & 6/25 \\ 49/160 & 39/200 & 399/800 \\ 509/2000 & 199/400 & 31/125 \end{pmatrix} = \begin{pmatrix} 0.366 & 0.394 & 0.24 \\ 0.30625 & 0.195 & 0.49875 \\ 0.2545 & 0.4975 & 0.248 \end{pmatrix}. $$

The values $p^{(3)}_{i,j}$ are the entries of $P^3$, i.e. $P^3 = (p^{(3)}_{i,j})_{i,j \in S}$. For example, we have $p^{(3)}_{1,2} = 0.49875$. It should be mentioned that higher powers of $P$ can be evaluated using the property $P^{k+\ell} = P^k P^\ell$, $(k, \ell \in \mathbb{N})$. E.g. the calculation of $P^4 = (P^2)^2$ does not require the calculation of $P^3$.
(d) It can be shown that the only stationary distribution is given by
$$ \pi = (\pi_1, \pi_2, \pi_3) = \left( \frac{55}{179}, \frac{64}{179}, \frac{60}{179} \right) \approx (0.30726, 0.35754, 0.33520). $$
Indeed this follows if we solve the linear equations $\pi P = \pi$ for $\pi_1, \pi_2, \pi_3 \in [0, 1]$ with $\pi_1 + \pi_2 + \pi_3 = 1$. More precisely, we have
$$ \tfrac{3}{5} \pi_1 + \tfrac{1}{4} \pi_2 + \tfrac{1}{10} \pi_3 = \pi_1, \qquad (41) $$
$$ \tfrac{2}{5} \pi_1 + 0 \, \pi_2 + \tfrac{7}{10} \pi_3 = \pi_2, \qquad (42) $$
$$ 0 \, \pi_1 + \tfrac{3}{4} \pi_2 + \tfrac{1}{5} \pi_3 = \pi_3, \qquad (43) $$
$$ \pi_1 + \pi_2 + \pi_3 = 1. \qquad (44) $$
From (43) it follows that $\pi_3 = \frac{15}{16} \pi_2$. Using this in (42), we see that $\pi_2 = \frac{64}{55} \pi_1$ and, in turn, $\pi_3 = \frac{12}{11} \pi_1$. In view of (44), we then get $\pi_1 \left( 1 + \frac{64}{55} + \frac{12}{11} \right) = 1$, i.e. $\pi_1 = \frac{55}{179}$. From the above, we then obtain the remaining values $\pi_2$ and $\pi_3$ as indicated. We did not use (41). This equation must be valid, since $P$ is a stochastic matrix. Therefore, (41) can be used to check our solution.
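The matrix powers and the stationary distribution can be checked with exact rational arithmetic. The Python sketch below uses the standard `fractions` module and a small hand-written matrix product; the helper name `matmul` is illustrative.

```python
from fractions import Fraction as F

# Transition matrix with p = 3/5 and q = 2/5 (rows and columns are states 0, 1, 2)
P = [
    [F(3, 5), F(2, 5), F(0)],
    [F(1, 4), F(0), F(3, 4)],
    [F(1, 10), F(7, 10), F(1, 5)],
]

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

P2 = matmul(P, P)
P3 = matmul(P2, P)

# Candidate stationary distribution from part (d), checked against pi P = pi
pi = [F(55, 179), F(64, 179), F(60, 179)]
pi_next = matmul([pi], P)[0]

print(P3[1][2], pi_next == pi, sum(pi))
```

Working with fractions avoids rounding error, so the stationarity check is an exact equality rather than an approximate one.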

2. Prove equation (28) relating the probability of a particular path occurring in a Markov chain.

Answer: We have to show that, if $\{X_k\}_{k \in \mathbb{Z}_+}$ ($\mathbb{Z}_+ = \{0, 1, 2, \dots\}$) is a Markov chain, then
$$ P(X_0 = j_0, X_1 = j_1, \dots, X_N = j_N) = P(X_0 = j_0) \prod_{n=0}^{N-1} p_{j_n, j_{n+1}}(n, n + 1) $$

for $N \in \mathbb{N}$ and states $j_0, \dots, j_N \in S$. Note that we do not assume a time-homogeneous chain, as a consequence of which the one-step transition probabilities $p_{i,j}(n, n + 1)$ also depend on time $n$. The above equation can be proved by induction over $N$. Indeed, if $N = 1$, then the equation can be shown this way:

$$ P(X_0 = j_0, X_1 = j_1) = P(X_0 = j_0) P(X_1 = j_1 \mid X_0 = j_0) = P(X_0 = j_0) \, p_{j_0, j_1}(0, 1). $$

Suppose the equation is true for some $N \in \mathbb{N}$. Then, using the Markov property of $\{X_k\}$, we get

$$ P(X_0 = j_0, X_1 = j_1, \dots, X_{N+1} = j_{N+1}) $$
$$ = P(X_{N+1} = j_{N+1} \mid X_0 = j_0, X_1 = j_1, \dots, X_N = j_N) \, P(X_0 = j_0, X_1 = j_1, \dots, X_N = j_N) $$
$$ = P(X_{N+1} = j_{N+1} \mid X_N = j_N) \, P(X_0 = j_0) \prod_{n=0}^{N-1} p_{j_n, j_{n+1}}(n, n + 1) $$
$$ = p_{j_N, j_{N+1}}(N, N + 1) \, P(X_0 = j_0) \prod_{n=0}^{N-1} p_{j_n, j_{n+1}}(n, n + 1) $$
$$ = P(X_0 = j_0) \prod_{n=0}^{N} p_{j_n, j_{n+1}}(n, n + 1), $$

which completes our induction proof.

3. A No-Claims Discount system operated by a motor insurer has the following four levels:
Level 1: 0% discount;
Level 2: 25% discount;
Level 3: 40% discount;
Level 4: 60% discount.
The rules for moving between these levels are as follows:
• Following a year with no claims, move to the next higher level, or remain at level 4.
• Following a year with one claim, move to the next lower level, or remain at level 1.
• Following a year with two or more claims, move down two levels, or move to level 1 (from level 2) or remain at level 1.
For a given policyholder in a given year the probability of no claims is 0.85 and the probability of making one claim is 0.12. $X_t$ denotes the level of the policyholder in year $t$.
(i) Explain why $X_t$ is a Markov chain. Write down the transition matrix of this chain.

(ii) Calculate the probability that a policyholder who is currently at level
2 will be at level 2 after:

(a) one year.


(b) two years.
(c) three years.

(iii) Explain whether the chain is irreducible and/or aperiodic.

(iv) Does this Markov chain converge to a stationary distribution?

(v) Calculate the long-run probability that a policyholder is in discount
level 2.

Answer:
(i) $X_t$ is a Markov chain: next year's level depends only on the current
level and on the number of claims made this year, so once the present
state is known, any additional information about the past is irrelevant
for predicting the next transition.
The transition matrix is given by

$$P = \begin{pmatrix}
0.15 & 0.85 & 0 & 0 \\
0.15 & 0 & 0.85 & 0 \\
0.03 & 0.12 & 0 & 0.85 \\
0 & 0.03 & 0.12 & 0.85
\end{pmatrix}.$$

(ii) (a) For the one-year transition $p_{22} = 0$, since with probability 1 the
chain will leave state 2.
(b) The second order transition matrix is given by

$$P^{(2)} = P \cdot P = \begin{pmatrix}
0.15 & 0.1275 & 0.7225 & 0 \\
0.048 & 0.2295 & 0 & 0.7225 \\
0.0225 & 0.051 & 0.204 & 0.7225 \\
0.0081 & 0.0399 & 0.1275 & 0.8245
\end{pmatrix},$$

and thus $p_{22}^{(2)} = 0.2295$.
(c) The relevant entry from the third order transition matrix is
$p_{22}^{(3)} = 0.062475$.

(iii) The chain is irreducible, as every state can be reached from every other
state. It is also aperiodic. From states 1 and 4 the chain can simply
remain where it is for a step. This is not the case for states 2 and 3.
However, these are also aperiodic: starting from 2, the chain can return
in 2 and in 3 transitions (see the previous part of the question).
Similarly, the chain started at 3 can return to 3 in two steps (look at
$P^{(2)}$) and in three steps.

(iv) The chain is irreducible and has a finite state space and thus has a
unique stationary distribution. Since the chain is also aperiodic, it
converges to this stationary distribution.

(v) To find the long-run probability that the chain is at level 2 we need to
calculate the unique stationary distribution $\pi$. This amounts to solving
the matrix equation $\pi P = \pi$. This is a system of 4 equations in 4
unknowns given by

$$\begin{aligned}
\pi_1 &= 0.15\pi_1 + 0.15\pi_2 + 0.03\pi_3 && (45) \\
\pi_2 &= 0.85\pi_1 + 0.12\pi_3 + 0.03\pi_4 && (46) \\
\pi_3 &= 0.85\pi_2 + 0.12\pi_4 && (47) \\
\pi_4 &= 0.85\pi_3 + 0.85\pi_4. && (48)
\end{aligned}$$

We discard the first equation and replace it by

$$\pi_1 + \pi_2 + \pi_3 + \pi_4 = 1.$$

Using substitutions or any other method we solve the system to obtain

$$(\pi_1, \pi_2, \pi_3, \pi_4) = (0.01424, 0.05269, 0.13996, 0.79311).$$

Let $p_{ij}^{(n)}$ be the $n$-step transition probability of an irreducible aperiodic
Markov chain on a finite state space. Then $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$ for each $i$ and
$j$. Thus the long-run probability that the chain is in state 2 is given by
$\pi_2 = 0.05269$.
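
The matrix arithmetic in parts (ii) and (v) can be reproduced with a few lines of code. This is a sketch rather than the method used above: instead of solving the linear system by substitution, it approximates the stationary distribution by repeatedly applying the transition matrix (power iteration), which relies on the convergence result just quoted. The helper name `mat_mul` is our own.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# One-step transition matrix from part (i); index 0 corresponds to level 1.
P = [
    [0.15, 0.85, 0.00, 0.00],
    [0.15, 0.00, 0.85, 0.00],
    [0.03, 0.12, 0.00, 0.85],
    [0.00, 0.03, 0.12, 0.85],
]

P2 = mat_mul(P, P)
P3 = mat_mul(P2, P)
p22_two_years = P2[1][1]    # part (ii)(b)
p22_three_years = P3[1][1]  # part (ii)(c)

# Power iteration: start at level 2 and apply pi <- pi P many times.
pi = [0.0, 1.0, 0.0, 0.0]
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
```

After the loop, `pi` agrees with the stationary distribution found above to the quoted number of decimal places, whichever level the iteration starts from.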

E Chapter 5 solutions
1. Claims are known to follow a Poisson process with a uniform rate of 3 per
day.

(a) Calculate the probability that there will be fewer than 1 claim on a
given day.

(b) Estimate the probability that another claim will be reported during the
next hour. State all assumptions made.

(c) If there have not been any claims for over a week, calculate the expected
time before a new claim occurs.

Answer: Let $\{N_t\}_{t \in [0,\infty)}$ denote our Poisson process with rate $\lambda = 3$, where
the time is measured in days.
(a) We have to evaluate $P(N_{t+1} - N_t < 1)$ for a fixed $t \ge 0$. But this is
equal to

$$P(N_1 = 0) = e^{-\lambda} = e^{-3} = 0.04979.$$

(b) We look for the probability that, during the time interval $(t, t + \frac{1}{24}]$
for a fixed $t$, at least one claim will be reported, i.e.

$$P(N_{t+1/24} - N_t \ge 1) = P(N_{1/24} \ge 1) = 1 - P(N_{1/24} = 0) = 1 - e^{-\lambda/24} = 1 - e^{-1/8} = 0.11750.$$

(c) Conditional on $N_7 = 0$, we can assume that $\{N_{t+7}\}_{t \in [0,\infty)}$ behaves like
a Poisson process $(\tilde{N}_t)_{t \in [0,\infty)}$ with parameter $\lambda = 3$. But here the first jump
(claim) occurs at a random time $\tilde{\tau}_1$, which has an exponential distribution
with parameter $\lambda$. It is well known that its expectation is $E(\tilde{\tau}_1) = \frac{1}{\lambda} = \frac{1}{3}$
days, i.e. 8 hours.
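
The three numerical answers can be checked directly. The sketch below assumes only what the solution uses, namely that the number of claims in an interval of length $d$ days is Poisson with mean $3d$; the names `RATE_PER_DAY` and `prob_no_claims` are ours.

```python
import math

RATE_PER_DAY = 3.0  # claims per day


def prob_no_claims(rate, duration):
    """P(no claims in an interval of the given length) for a Poisson process."""
    return math.exp(-rate * duration)


p_a = prob_no_claims(RATE_PER_DAY, 1.0)             # (a) no claims in one day
p_b = 1.0 - prob_no_claims(RATE_PER_DAY, 1.0 / 24)  # (b) at least one claim in an hour
expected_wait_days = 1.0 / RATE_PER_DAY             # (c) mean of the exponential waiting time
```

Part (c) does not use the week without claims at all, which reflects the memoryless property of the exponential waiting time.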

2. Prove equation (38), which gives the Chapman-Kolmogorov equations for
a Markov jump process.

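Answer: Let $\{X_t\}_{t \in [0,\infty)}$ be a (not necessarily time-homogeneous) Markov
process with discrete state space $S$ and transition probabilities
$p_{i,j}(s, t) = P(X_t = j \mid X_s = i)$, where $i, j \in S$, $0 \le s < t < \infty$ and we assume that
$P(X_s = i) > 0$. We have to show that

$$p_{i,j}(t_1, t_3) = \sum_{k \in S} p_{i,k}(t_1, t_2)\, p_{k,j}(t_2, t_3),$$

where $i, j \in S$ and $0 \le t_1 < t_2 < t_3 < \infty$. We have

$$\begin{aligned}
p_{i,j}(t_1, t_3) &= P(X_{t_3} = j \mid X_{t_1} = i) \\
&= \sum_{k \in S} P(X_{t_3} = j, X_{t_2} = k \mid X_{t_1} = i) \\
&= \sum_{k \in S} P(X_{t_3} = j \mid X_{t_1} = i, X_{t_2} = k)\, P(X_{t_2} = k \mid X_{t_1} = i) \\
&= \sum_{k \in S} P(X_{t_3} = j \mid X_{t_2} = k)\, P(X_{t_2} = k \mid X_{t_1} = i) \\
&= \sum_{k \in S} p_{i,k}(t_1, t_2)\, p_{k,j}(t_2, t_3).
\end{aligned}$$

The second equality is the law of total probability, the third is the
definition of conditional probability, and the fourth uses the Markov
property.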

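3. Consider the sickness-death model given in Figure 9. Write down an
integral expression for $p_{HD}(s, t)$.
Answer: Writing $\sigma(u)$ for the transition rate from healthy to sick and $\mu(u)$
for the transition rate from healthy to dead at time $u$, we have

$$\begin{aligned}
p_{HD}(s, t) &= \int_0^{t-s} \Big[ p_{SD}(s + w, t)\,\sigma(s + w) + p_{DD}(s + w, t)\,\mu(s + w) \Big] \exp\left( -\int_s^{s+w} (\sigma(u) + \mu(u))\, du \right) dw \\
&= \int_0^{t-s} \Big[ p_{SD}(s + w, t)\,\sigma(s + w) + \mu(s + w) \Big] \exp\left( -\int_s^{s+w} (\sigma(u) + \mu(u))\, du \right) dw.
\end{aligned}$$

The individual remains in the healthy state from time $s$ to time $s+w$ and then
jumps either to the state dead (where he remains) or to the state sick (from
which he reaches the state dead by time $t$). Note that here $p_{DD}(s+w, t) = 1$. Further note
that the formula for $p_{HS}(s, t)$ did not contain the term $p_{DS}(s + w, t)\,\mu(s + w)$,
since the probability of jumping from dead to sick is equal to zero.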

4. Let $\{X_t, t \ge 0\}$ be a time-homogeneous Markov process with state space
$S = \{0, 1\}$ and transition rates $q_{01} = \lambda$, $q_{10} = \mu$.

(a) Write down the generator matrix for this process.

(b) Solve Kolmogorov's forward equations for this Markov jump process
to find all transition probabilities.

(c) Check that the Chapman-Kolmogorov equations hold.

(d) What is the probability that the process will be in state 0 in the long
term? Does it depend on the initial state?

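Answer: (a) The generator matrix $Q$ is given by

$$Q = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix}.$$

(b) The Kolmogorov forward equations $\frac{d\mathbf{P}(t)}{dt} = \mathbf{P}(t)Q$ therefore become

$$\begin{aligned}
\frac{dp_{00}(t)}{dt} &= -\lambda p_{00}(t) + \mu p_{01}(t), \\
\frac{dp_{01}(t)}{dt} &= \lambda p_{00}(t) - \mu p_{01}(t), \\
\frac{dp_{10}(t)}{dt} &= -\lambda p_{10}(t) + \mu p_{11}(t), \\
\frac{dp_{11}(t)}{dt} &= \lambda p_{10}(t) - \mu p_{11}(t).
\end{aligned}$$

Substituting $p_{01}(t) = 1 - p_{00}(t)$ in the first equation, we get the equation

$$\frac{dp_{00}(t)}{dt} = -\lambda p_{00}(t) + \mu(1 - p_{00}(t)) = -(\lambda + \mu)p_{00}(t) + \mu,$$

which has the general solution

$$p_{00}(t) = \frac{\mu}{\lambda + \mu} + Ce^{-(\lambda + \mu)t}.$$

The initial condition $p_{00}(0) = 1$ leads to $C = \frac{\lambda}{\lambda + \mu}$, so finally we get

$$p_{00}(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)t}.$$

Transition probabilities $p_{01}(t)$, $p_{10}(t)$, and $p_{11}(t)$ can be found similarly. They
are:

$$\begin{aligned}
p_{01}(t) &= \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)t}; \\
p_{10}(t) &= \frac{\mu}{\lambda + \mu} - \frac{\mu}{\lambda + \mu} e^{-(\lambda + \mu)t}; \\
p_{11}(t) &= \frac{\lambda}{\lambda + \mu} + \frac{\mu}{\lambda + \mu} e^{-(\lambda + \mu)t}.
\end{aligned}$$
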
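(c) For a time-homogeneous Markov process the Chapman-Kolmogorov
equations take the form

$$p_{ij}(t + s) = \sum_{k \in S} p_{ik}(t)\, p_{kj}(s).$$

In our case, $S = \{0, 1\}$, thus there are 4 equations. For example, for $i = j = 0$
we get

$$p_{00}(t + s) = p_{00}(t)\, p_{00}(s) + p_{01}(t)\, p_{10}(s).$$

So we should check that

$$\begin{aligned}
\frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)(t+s)}
&= \left( \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)t} \right) \left( \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)s} \right) \\
&\quad + \left( \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu)t} \right) \left( \frac{\mu}{\lambda + \mu} - \frac{\mu}{\lambda + \mu} e^{-(\lambda + \mu)s} \right),
\end{aligned}$$

which is a straightforward exercise. The three other Chapman-Kolmogorov
equations can be checked similarly.
(d)

$$\lim_{t \to \infty} p_{00}(t) = \lim_{t \to \infty} p_{10}(t) = \frac{\mu}{\lambda + \mu},$$

so the probability that the process will be in state 0 in the long term is
$\frac{\mu}{\lambda + \mu}$, and it does not depend on the initial state.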
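
Parts (c) and (d) can also be checked numerically. The sketch below is illustrative only: `lam` and `mu` stand for the rates out of state 0 and out of state 1 respectively, and their values, like the times `t` and `s`, are arbitrary choices of ours. It evaluates the closed-form transition matrix from part (b), compares $P(t+s)$ with $P(t)P(s)$, and looks at $P(t)$ for large $t$.

```python
import math


def transition_matrix(lam, mu, t):
    """Closed-form P(t) for the two-state process with q01 = lam, q10 = mu."""
    total = lam + mu
    decay = math.exp(-total * t)
    return [
        [mu / total + lam / total * decay, lam / total - lam / total * decay],
        [mu / total - mu / total * decay, lam / total + mu / total * decay],
    ]


def mat_mul(a, b):
    """Multiply two 2x2 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


lam, mu = 0.7, 0.2  # hypothetical rates, for illustration only
t, s = 1.3, 0.4

lhs = transition_matrix(lam, mu, t + s)
rhs = mat_mul(transition_matrix(lam, mu, t), transition_matrix(lam, mu, s))

# Largest entrywise gap between P(t + s) and P(t) P(s); all four equations at once.
ck_error = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))

# For large t both rows approach (mu, lam) / (lam + mu).
long_run = transition_matrix(lam, mu, 200.0)
```

A numerical check of this kind does not replace the algebra, but it is a cheap way to spot a sign error in the solved transition probabilities.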

F Chapter 6 solutions
1. The two-state Markov model involves a one-directional, single, age
dependent transition intensity. Explain what this means.

It refers to the fact that we only have two possible states: alive or dead.
Transition between these states can only take place in one direction,
from alive to dead. Hence the terms single and one-directional.

The transition intensity between these two states varies with age, e.g.
it is much higher for a 90 year old than for a 30 year old. Hence the term
age dependent.

2. State the assumptions underlying the two-state Markov model.

(a) The Markov assumption: The probability that a life at any
given age will be found in either state at any subsequent age
depends only on the ages involved and on the state currently
occupied.
(b) For a short time interval $dt$:

$${}_{dt}q_{x+t} = \mu_{x+t}\, dt + o(dt)$$

for $t > 0$, where $o(dt)$ is a correction term that is negligible
compared with $dt$ as $dt \to 0$.

(c) $\mu_{x+t}$ is constant for $0 \le t < 1$.

3. Discuss the Markov assumption in the context of heterogeneity. What
model have we met which addresses heterogeneity?

The Markov assumption means that the only information that affects
the future lifetime of the life is its current age and state. Therefore we
ignore any previous or current medical conditions or lifestyle factors,
such as smoking or exercise, and treat every life aged $x$ as an identical
life. The Markov assumption also means that the probability of survival
after age $x + t$ is independent of the probability of survival up to age
$x$.

When studying mortality for a heterogeneous group, our results will
reveal only the average result for the group. Therefore the Markov
assumption is clearly a simplification.

In reality, the population could be split into homogeneous groups, taking
into account characteristics thought to have a significant effect on
mortality. A separate model could then be developed for each group.
Alternatively, the Cox model could be used instead of the two-state model.
This addresses heterogeneity directly by allowing for different
covariates.

4. Using the terminology introduced in this chapter, discuss the possible
values of $V_i$ if:

(a) $D_i = 0$
(b) $D_i = 1$

(a) $D_i = 0$: Then a death has not been observed and observation has
taken place between age $x + a_i$ and $x + b_i$. Hence, $V_i = b_i - a_i$.
(b) $D_i = 1$: Then a death has been observed during the observation
period and hence $0 \le V_i < b_i - a_i$.

5. A large actuarial firm is investigating its level of attrition⁴ post
qualification. We examine the employment records for the first year after
qualification. Fiona joined the firm 3 months after qualifying. She
stayed with the firm for 8 months, before being head-hunted by one of
the firm's competitors. A two-state Markov model will be used for this
investigation, with $x$ := time since qualification. What are the following
values for Fiona:

(a) $a_i$
(b) $b_i$
(c) $v_i$

Discuss the potential heterogeneity within this investigation.

(a) $a_i = \frac{3}{12}$
(b) $b_i = 1$
(c) $v_i = \frac{8}{12}$

Possible heterogeneity links to how long the employee has been with
the firm before qualifying (or joining the organisation if later), time to
qualification and age at qualification.

⁴ Attrition is the percentage of employees leaving a firm, over a defined period, usually
a year.

6. Using the result:

$$f_i(d_i, v_i) = e^{-\mu v_i} \mu^{d_i}$$

and assuming that all lives are independent, state the joint probability
function for $(D, V)$, where $D = \sum_{i=1}^{N} D_i$ and $V = \sum_{i=1}^{N} V_i$ for $1 \le i \le N$.

$$f(d, v) = \prod_{i=1}^{N} e^{-\mu v_i} \mu^{d_i} = e^{-\mu v} \mu^{d}, \quad \text{where } d = \sum_{i=1}^{N} d_i \text{ and } v = \sum_{i=1}^{N} v_i.$$

7. A university is assessing its drop out rates for the first year of a three
year degree course, and believes $\mu = 0.15$. At the end of the first term,
there are 50 students on the course. Assuming that all three terms
are equal, each student's decision to drop out of a course is independent and
courses run back to back, determine the following:
(a) the probability distribution function for an individual on this
course leaving before the end of the first year
(b) the probability distribution function for the remaining time an
individual spends on the course for the remainder of the first year.
(c) state the joint probability function for the 50 students

(a) Let $D_i$ be the random variable which indicates whether student $i$
remains on the course until the end of the year, with $D_i = 0$ if
they remain to the end of the year and $D_i = 1$ if they leave. Then:

$$P[D_i = 0] = e^{-0.15 \times \frac{2}{3}} = 0.9048$$
$$P[D_i = 1] = 1 - P[D_i = 0] = 0.0952$$

(b) Let $V_i$ denote the waiting time of student $i$, i.e. the time that they
remain on the course from the start of the second term until the
end of the year. Then:

$$f(v_i) = \begin{cases} 0.15\, e^{-0.15 v_i} & 0 < v_i < \frac{2}{3} \\ e^{-0.15 \times \frac{2}{3}} & v_i = \frac{2}{3} \end{cases}$$

(c) Let $d = \sum_{i=1}^{50} d_i$ and $v = \sum_{i=1}^{50} v_i$. Then:

$$f(d, v) = e^{-0.15 v} \times 0.15^{d}.$$

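8. Derive the maximum likelihood estimate, $\hat{\mu} = \frac{d}{v}$, from the likelihood
function for $\mu$, $L(\mu; d, v) = e^{-\mu v} \mu^{d}$.

Taking logs, $\log L = -\mu v + d \log \mu$.

Then differentiating with respect to $\mu$: $\frac{d \log L}{d\mu} = -v + \frac{d}{\mu}$.

Setting $\frac{d \log L}{d\mu}$ equal to zero to find a turning point leads to $\hat{\mu} = \frac{d}{v}$.

To prove this is a maximum, differentiate $\frac{d \log L}{d\mu}$ with respect to $\mu$:

$\frac{d^2 \log L}{d\mu^2} = -\frac{d}{\mu^2} < 0$ (for $d > 0$). Hence $\frac{d}{v}$ is a maximum and our maximum likelihood
estimate is $\hat{\mu} = \frac{d}{v}$.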

9. Based on a mortality investigation, we estimate $\mu_{40} = 0.00102$ and
$\mu_{41} = 0.00112$. Assuming the assumptions underlying the two-state
model apply, calculate the probability a life aged 40 survives for a year.

Over the year of being 40 we can estimate the average $\mu$ using
$\mu_{40.5} = \frac{\mu_{40} + \mu_{41}}{2} = 0.00107$.

Under the Markov model, we assume $\mu$ is constant over the year and
derive ${}_tp_x = e^{-\mu t}$. Hence:

$$p_{40} = e^{-0.00107 \times 1} = 0.9989.$$

10. An investigation took place into the mortality of males aged between
60 and 61 years suffering from angina. The table below summarises the
results of this investigation. For each person it gives the age at which
observation of this life began (Start), the age at which observation
ceased (End) and the reason for it ceasing (Reason), i.e. D = observation
ceased due to death, W = withdrew from observation for a reason other
than death.

Life   Start      End        Reason
1      60 1/12    60 5/12    D
2      60 4/12    60 11/12   W
3      60         61         W
4      60 6/12    61         W
5      60         60 2/12    D
6      60 11/12   61         W
7      60 4/12    60 10/12   W
8      60         60 3/12    D
9      60         61         W
10     60 1/12    60 6/12    W

Estimate $q_{60}$ using a two-state Markov model, stating any assumptions
made.
Life   d_i   v_i
1      1     4/12
2      0     7/12
3      0     1
4      0     6/12
5      1     2/12
6      0     1/12
7      0     6/12
8      1     3/12
9      0     1
10     0     5/12

Hence $d = 3$ and $v = 4\frac{10}{12}$, so $\hat{\mu}_{60} = \frac{3}{4\frac{10}{12}} = 0.6207$.

Assuming the force of mortality is constant over (60,61):

$$q_{60} = 1 - e^{-0.6207} = 0.4624.$$
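
The same calculation can be scripted, which helps when the table is longer. This is a minimal sketch under the same assumption of a constant force of mortality over (60, 61); the tuple layout (start and end measured in months past exact age 60) is our own encoding of the table.

```python
import math
from fractions import Fraction

# (start, end, reason) in months past exact age 60; "D" = death, "W" = withdrawal.
observations = [
    (1, 5, "D"), (4, 11, "W"), (0, 12, "W"), (6, 12, "W"), (0, 2, "D"),
    (11, 12, "W"), (4, 10, "W"), (0, 3, "D"), (0, 12, "W"), (1, 6, "W"),
]

# d: number of observed deaths; v: total waiting time in years.
deaths = sum(1 for _, _, reason in observations if reason == "D")
waiting_time = sum(Fraction(end - start, 12) for start, end, _ in observations)

mu_hat = deaths / waiting_time        # maximum likelihood estimate d / v
q60 = 1.0 - math.exp(-float(mu_hat))  # assumes a constant force over (60, 61)
```

Using `Fraction` keeps the waiting time exact (58/12 years), so rounding enters only in the final exponential.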

G Chapter 7 solutions
1. The model below represents a multiple state model for an Income
Protection Plan, where a = able, i = ill and d = dead. Outline the following
terms in words:

(a) ${}_tp_x^{ad}$
(b) ${}_tp_x^{ii}$
(c) ${}_tp_x^{\overline{ii}}$
(d) ${}_tp_x^{dd}$

[Diagram: three states, able, ill and dead. Transitions are possible in both
directions between able and ill, and from each of able and ill to dead; each
arrow carries an age-dependent transition intensity.]

(a) ${}_tp_x^{ad}$ is the probability of an individual who is able at age $x$ dying
in the next $t$ years, i.e. by age $x + t$.
(b) ${}_tp_x^{ii}$ is the probability of an individual who is ill at age $x$ being ill
at age $x + t$.
(c) ${}_tp_x^{\overline{ii}}$ is the probability of an individual who is ill at age $x$ remaining
ill, without lapse, until age $x + t$.
(d) ${}_tp_x^{dd}$ is the probability of an individual who is dead at age $x$ being
dead at age $x + t$. Clearly this probability is equal to 1.

2. Consider a simple model investigating the effects of smoking on
mortality, where n = non-smoker, s = smoker, f = former smoker, d = dead.
Which of the following statements are true:

(a) ${}_tp_x^{nn} = {}_tp_x^{\overline{nn}}$
(b) ${}_tp_x^{ss} = {}_tp_x^{\overline{ss}}$
(c) ${}_tp_x^{ff} = {}_tp_x^{\overline{ff}}$
(d) ${}_tp_x^{dd} = {}_tp_x^{\overline{dd}}$

(a) True, since you cannot return to non-smoker status once you have
left it.
(b) False, you can leave smoker status and return to it during the
period between age $x$ and $x + t$.
(c) False, you can leave former smoker status and return to it during the
period between age $x$ and $x + t$.
(d) True, since you remain dead once you have died. In fact, these
probabilities are equal to 1.

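3. Prove the relationship $\frac{\partial}{\partial t}\, {}_tp_x^{\overline{gg}} = -{}_tp_x^{\overline{gg}} \sum_{j \ne g} \mu_{x+t}^{gj}$.

Start with the definition of partial differentiation:

$$\frac{\partial}{\partial t}\, {}_tp_x^{\overline{gg}} = \lim_{dt \to 0^+} \frac{{}_{t+dt}p_x^{\overline{gg}} - {}_tp_x^{\overline{gg}}}{dt}$$

Now consider ${}_{t+dt}p_x^{\overline{gg}}$, the probability that a life in state $g$ at age $x$ will
remain in state $g$ until age $x + t + dt$. Let us split this probability by
considering the state the life will be in at age $x + t$: it has to be in state
$g$.

$${}_{t+dt}p_x^{\overline{gg}} = {}_tp_x^{\overline{gg}} \cdot {}_{dt}p_{x+t}^{\overline{gg}}$$

By applying the second assumption listed above for the multiple state
model, and recognising that $dt$ is sufficiently small to assume only one
transition can take place:

$${}_{dt}p_{x+t}^{\overline{gg}} = 1 - \sum_{j \ne g} {}_{dt}p_{x+t}^{gj} = 1 - \sum_{j \ne g} \mu_{x+t}^{gj}\, dt + o(dt).$$

We can use the above equations to derive the relationship:

$${}_{t+dt}p_x^{\overline{gg}} = {}_tp_x^{\overline{gg}} \Big( 1 - \sum_{j \ne g} \mu_{x+t}^{gj}\, dt + o(dt) \Big)$$

Hence

$$\frac{\partial}{\partial t}\, {}_tp_x^{\overline{gg}} = \lim_{dt \to 0^+} \frac{{}_tp_x^{\overline{gg}} \Big( 1 - \sum_{j \ne g} \mu_{x+t}^{gj}\, dt + o(dt) \Big) - {}_tp_x^{\overline{gg}}}{dt}$$

Since $o(dt)/dt \to 0$ as $dt \to 0$, this simplifies to give:

$$\frac{\partial}{\partial t}\, {}_tp_x^{\overline{gg}} = -{}_tp_x^{\overline{gg}} \sum_{j \ne g} \mu_{x+t}^{gj}$$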
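
As a brief follow-up sketch (not part of the question as set): writing ${}_tp_x^{\overline{gg}}$ for the probability of remaining in state $g$ throughout and $\mu_{x+t}^{gj}$ for the transition intensity from $g$ to $j$, the differential equation can be integrated using the initial condition ${}_0p_x^{\overline{gg}} = 1$, which is an assumption we add here because a life in state $g$ at age $x$ has certainly not yet left it.

```latex
\[
\frac{\partial}{\partial t} \log {}_tp_x^{\overline{gg}}
  = -\sum_{j \neq g} \mu_{x+t}^{gj},
\qquad {}_0p_x^{\overline{gg}} = 1
\quad\Longrightarrow\quad
{}_tp_x^{\overline{gg}}
  = \exp\!\left( -\int_0^t \sum_{j \neq g} \mu_{x+s}^{gj}\, ds \right).
\]
```

With a single exit intensity that is constant over the interval, this reduces to the familiar form $e^{-\mu t}$ used for the two-state model.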

4. Consider three lives, lives A, B and C, who take part in an investigation
to establish an illness death model for the age interval 70 to 71. The
results of the observation of these lives are outlined below. Summarise
the results from these observed samples using the appropriate notation
(e.g. $v_i$, $w_i$ etc.).

(a) Life A: Joins observation age 70 when ill. Recovers age 70 years
1 month. Falls ill again age 70 years 2 months. Recovers age 70
years 4 months. Leaves study age 70 years 11 months.
(b) Life B: Joins observation age 70 years 3 months when well. Dies
age 70 years 10 months.
(c) Life C: Joins observation age 70 and well. Falls ill age 70 years
3 months. Recovers age 70 years 9 months. Falls ill again age 70
years 10 months. Dies age 70 years 11 months.

Life   v_i    w_i    s_i   r_i   d_i   u_i
A      8/12   3/12   1     2     0     0
B      7/12   0      0     0     1     0
C      4/12   7/12   2     1     0     1
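
The table can be reproduced mechanically from the three histories. The sketch below uses our own encoding, which is an assumption for illustration: each life is a list of (month past age 70, state) pairs, ending with either "dead" or "exit" (left observation alive). Here v and w are the waiting times in the able and ill states, and s, r, d, u count the transitions able to ill, ill to able, able to dead and ill to dead.

```python
from fractions import Fraction


def summarise(history):
    """Waiting times (in years) and transition counts for one observed life."""
    summary = {"v": Fraction(0), "w": Fraction(0), "s": 0, "r": 0, "d": 0, "u": 0}
    counters = {("able", "ill"): "s", ("ill", "able"): "r",
                ("able", "dead"): "d", ("ill", "dead"): "u"}
    for (start, state), (end, new_state) in zip(history, history[1:]):
        # Time spent in the current state counts towards v (able) or w (ill).
        summary["v" if state == "able" else "w"] += Fraction(end - start, 12)
        # Leaving observation alive ("exit") is not a counted transition.
        if (state, new_state) in counters:
            summary[counters[(state, new_state)]] += 1
    return summary


life_a = summarise([(0, "ill"), (1, "able"), (2, "ill"), (4, "able"), (11, "exit")])
life_b = summarise([(3, "able"), (10, "dead")])
life_c = summarise([(0, "able"), (3, "ill"), (9, "able"), (10, "ill"), (11, "dead")])
```

Summing each field over the lives gives the totals that enter the estimates of the transition intensities.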

5. Define $W$, $D$ and $U$.

$W_i$ = Waiting time of life $i$ in the ill state. $W = \sum_{i=1}^{N} W_i$, i.e. the total
time spent in the ill state over all lives under observation.

$D_i$ = Number of transitions able to dead by life $i$. $D = \sum_{i=1}^{N} D_i$, i.e.
the total number of deaths from an able state during the observation
period.

$U_i$ = Number of transitions ill to dead by life $i$. $U = \sum_{i=1}^{N} U_i$, i.e. the
total number of deaths from an ill state during the observation period.

6. Following an investigation into the sickness of a defined age group
within a population we observe the following results:

• 211 subjects fall ill during the observation
• 135 subjects recover during the observation
• 72 subjects die during the observation: 62 of whom were ill at the
time of death.
• Subjects were observed for 1,304 years (treating each life as a
separate contribution to the total): for 503 of these years the
subjects were in an ill state.

Assuming the transition intensities are constant for this age band, draw
a diagram to represent the Markov model for sickness and use the
observation above to estimate the relevant transition intensities.

$s = 211$, $r = 135$, $u = 62$, $d = 72 - 62 = 10$, $v = 1304 - 503 = 801$, $w = 503$

Each intensity is estimated by the number of transitions divided by the
waiting time in the state being left:

$\hat{\mu} = \frac{10}{801} = 0.0125$ (able to dead)
$\hat{\nu} = \frac{62}{503} = 0.1233$ (ill to dead)
$\hat{\sigma} = \frac{211}{801} = 0.2634$ (able to ill)
$\hat{\rho} = \frac{135}{503} = 0.2684$ (ill to able)
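
The four estimates follow one pattern, which a short script makes explicit. This is a sketch only; the variable and dictionary key names are ours, and the figures are those given in the question.

```python
# Observed totals from the investigation.
fell_ill, recovered = 211, 135
deaths_total, deaths_while_ill = 72, 62
years_total, years_ill = 1304, 503

deaths_while_able = deaths_total - deaths_while_ill  # d = 10
years_able = years_total - years_ill                 # v = 801

# Each estimate is (number of transitions) / (waiting time in the state left).
estimates = {
    "able_to_dead": deaths_while_able / years_able,
    "ill_to_dead": deaths_while_ill / years_ill,
    "able_to_ill": fell_ill / years_able,
    "ill_to_able": recovered / years_ill,
}
```

Note that transitions out of the able state are divided by the 801 years spent able, and transitions out of the ill state by the 503 years spent ill.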

7. Can a transition intensity be greater than one? Justify your answer.

Yes. For example, a stroke patient may have a transition intensity of
6, say, in the first month after his operation. Hence, his probability of
dying during this month would be approximately $6 \times \frac{1}{12} = 0.5$ (or
$1 - e^{-6/12} \approx 0.39$ if the intensity is constant over the month). A
transition intensity is not the same as a probability: it is a rate that
applies to a very short time period.
