Non-Life Insurance: Mathematics & Statistics - Lecture Notes

Mario V. Wüthrich
RiskLab Switzerland
Department of Mathematics
ETH Zurich
Version September 2, 2013

Version September 2, 2013, M.V. Wüthrich, ETH Zurich

Preface and Terms of Use
Lecture notes. The present lecture notes cover the lecture Non-Life Insurance:
Mathematics & Statistics which is held at the Department of Mathematics at ETH
Zurich. This lecture is a merger of the two lectures Nicht-Leben
Versicherungsmathematik and Risk Theory for Insurance. It is held for the first
time in Spring 2014. The lecture aims at providing a basis in non-life insurance
mathematics which forms a core subject of actuarial sciences. After this course,
the students may follow lectures that give a deeper specialization in non-life
insurance mathematics, such as Credibility Theory, Non-Life Insurance Pricing with
Generalized Linear Models, Stochastic Claims Reserving Methods, Market-Consistent
Actuarial Valuation, Quantitative Risk Management, etc.
Prerequisites. The prerequisites for this lecture are a solid education in
mathematics, in particular, in probability theory and statistics.
Terms of Use. These lecture notes are an ongoing project which will be
continuously revised and updated. Of course, there may be errors in the notes and
there is always room for improvement. Therefore, I appreciate any comments and/or
corrections that readers may have. However, I would like you to respect the
following rules:
• These notes are provided solely for educational, personal and non-commercial
use. Any commercial use or reproduction is forbidden.
• All rights remain with the author. He may update the manuscript or withdraw
the manuscript at any time. There is no right to the availability of any
(old) version of these notes. The author may also change these terms of use
at any time.
• The author disclaims all warranties, including but not limited to the use or
the contents of these notes. On using these notes, you fully agree to this.
• Citation: please use the SSRN URL.

Acknowledgment
Writing these notes, I profited greatly from various inspiring as well as ongoing
discussions, concrete contributions and critical comments with and by several
people: first of all our students who have been following our lectures at ETH
Zurich since 2006; furthermore Hans Bühlmann, Philippe Deprez, Paul Embrechts,
Laurent Huber, Michael Merz, Gareth Peters, Simon Rentzmann. I especially thank
Alois Gisler for providing his lecture notes [48] and the corresponding exercises.
Zurich, September 2, 2013 Mario V. Wüthrich

Contents
1 Introduction
1.1 Nature of non-life insurance
1.1.1 Non-life insurance and the law of large numbers
1.1.2 Risk components and premium elements
1.2 Probability theory and statistics
1.2.1 Random variables and distribution functions
1.2.2 Terminology in statistics
2 Collective Risk Model
2.1 Compound distributions
2.2 Explicit claims count distributions
2.2.1 Binomial distribution
2.2.2 Poisson distribution
2.2.3 Mixed Poisson distribution
2.2.4 Negative-binomial distribution
2.3 Parameter estimation
2.3.1 Method of moments
2.3.2 Maximum likelihood estimators
2.3.3 Example and $\chi^2$-goodness-of-fit analysis
3 Individual Claim Size Modeling
3.1 Data analysis
3.2 Selected parametric claims size distributions
3.2.1 Gamma distribution
3.2.2 Weibull distribution
3.2.3 Log-normal distribution
3.2.4 Log-gamma distribution
3.2.5 Pareto distribution
3.3 Model selection
3.3.1 Kolmogorov-Smirnov (KS) test
3.3.2 Anderson-Darling (AD) test
3.3.3 Goodness-of-fit and information criteria
3.4 Calculating within layers for claim sizes
3.4.1 Claim size modeling using layers
3.4.2 Re-insurance layers and deductibles
4 Approximations for Compound Distributions
4.1 Approximations
4.1.1 Normal approximation
4.1.2 Translated gamma and log-normal approximations
4.1.3 Edgeworth approximation
4.2 Algorithms for compound distributions
4.2.1 Panjer algorithm
4.2.2 Fast Fourier transform
5 Ruin Theory in Discrete Time
5.1 Net profit condition
5.2 Lundberg bound
5.3 Pollaczek-Khinchin formula
5.4 Subexponential claims
6 Premium Calculation Principles
6.1 Simple risk-based principles
6.2 Advanced premium calculation principles
6.2.1 Utility theory pricing principles
6.2.2 Esscher premium
6.2.3 Probability distortion pricing principles
6.2.4 Cost-of-capital principles
6.2.5 Deflator based pricing principles
7 Tariffication and Generalized Linear Models
7.1 Simple tariffication methods
7.2 Gaussian approximation
7.2.1 Maximum likelihood estimation
7.2.2 Goodness-of-fit analysis
7.3 Generalized linear models
7.3.1 GLM for Poisson claims counts
7.3.2 GLM for gamma claim sizes
7.3.3 Variable reduction analysis
8 Bayesian Models and Credibility Theory
8.1 Exact Bayesian models
8.1.1 Poisson-gamma model
8.1.2 Exponential dispersion family with conjugate priors
8.2 Linear credibility estimation
8.2.1 Bühlmann-Straub (BS) model
8.2.2 Bühlmann-Straub credibility formula
8.2.3 Estimation of structural parameters
8.2.4 Prediction error in the Bühlmann-Straub model
9 Claims Reserving
9.1 Outstanding loss liabilities
9.2 Claims reserving algorithms
9.2.1 Chain-ladder algorithm
9.2.2 Bornhuetter-Ferguson algorithm
9.3 Stochastic claims reserving methods
9.3.1 Gamma-gamma Bayesian CL model
9.3.2 Over-dispersed Poisson model
9.4 Claims development result
10 Solvency Considerations
10.1 Balance sheet and solvency
10.2 Risk modules
10.3 Insurance liability variables
10.3.1 Market-consistent value
10.3.2 Insurance risk

Chapter 1
Introduction
1.1 Nature of non-life insurance
1.1.1 Non-life insurance and the law of large numbers
Insurance originates from a general demand of our society that asks for protection
against unforeseeable random events which might cause serious (financial) damage
to individuals and society. Insurance then organizes the financial protection against
such unforeseeable random events, meaning that it takes care of the financial re-
placements of the damage. The general idea is to build a community to which
everybody contributes a certain amount (fixed deterministic premium) and then
the (random) financial damages are financed by the savings of this community.
[Figure: the insured (policyholder) pays a fixed (deterministic) premium to the insurer (insurance company); in return, the insurance policy covers risks (random events).]
The basic features of such communities are that every member faces similar risks
and by building such communities the individual members receive diversification
benefits in the form of a law of large numbers that applies to the community.
Insurance companies organize this equal balance within the community.
Modern insurance as known today is traced back to the Great Fire of London in
1666, which destroyed a large part of London. This event initiated fire insurance
protection against such disastrous events. Today, fire insurance belongs to the
branch of non-life insurance, which is also known as property and casualty
insurance in the US and general insurance in the UK. Non-life insurance comprises
car insurance, liability insurance, property insurance, accident and health
insurance, marine insurance, credit insurance, legal protection insurance, travel
insurance and other similar products. Insurance contracts for these types of
products have in common that they specify an insurance period (typically one
year), and all (random) events that occur within this insurance period and cause
financial damage to which the insurance contract applies are compensated. Such
random events to which insurance contracts apply are called insurance claims.
[Figure: Great Fire of London 1666]
Typically, the insurance premium for these contracts is paid at the beginning of
the insurance period (in advance). To determine this insurance premium, the
insurance company pools similar risks whose individual insurance claims can be
described by a sequence $Y_1, \ldots, Y_n$, $n \in \mathbb{N}$, of random
variables. These insurance claims $Y_i$ are random at the beginning of the
insurance period and therefore need to be described with probability theory.
Assume we have a probability space $(\Omega, \mathcal{F}, P)$ and
$Y_1, \ldots, Y_n$ are uncorrelated and identically distributed random variables
on that probability space with finite mean $\mu = E[Y_1] < \infty$. In that case
we can apply the weak law of large numbers (LLN) which says that for all
$\varepsilon > 0$
\[ \lim_{n \to \infty} P\left[ \left| \frac{1}{n} \sum_{i=1}^n Y_i - \mu \right| \ge \varepsilon \right] = 0. \tag{1.1} \]
[Portrait: J. Bernoulli]
Basically, this means that the total claim amount becomes more “predictable” with
increasing portfolio size n, and therefore we can calculate the insurance premium
quite accurately for large portfolio sizes n because this provides the required equal
balance. The weak law of large numbers is therefore considered to be a theoret-
ical cornerstone of insurance. It goes back to the Swiss mathematician Jakob
Bernoulli (1655-1705) of the famous Bernoulli family and was first published in

his path-breaking work Ars Conjectandi, which appeared in 1713, eight years
after his death, see Bolthausen-Wüthrich [15].
For independent and identically distributed random variables $Y_1, Y_2, \ldots$
with finite variance $\sigma^2 < \infty$, the weak law of large numbers can
further be refined by Chebychev's inequality, which provides rates of
convergence, and by the central limit theorem (CLT), which provides the
asymptotic limit distribution. The CLT states under the above assumptions that we
have the following convergence in distribution
\[ \frac{\sum_{i=1}^n Y_i - n\mu}{\sqrt{n}\,\sigma} \Rightarrow \mathcal{N}(0, 1) \quad \text{as } n \to \infty, \tag{1.2} \]
i.e. in the limit we obtain a standard Gaussian distribution. The crucial feature
is that the denominator only increases of order $\sqrt{n}$, i.e. it increases at
a slower rate than $n$. This exactly implies that the total claim amount of the
portfolio becomes predictable in the limit because the relative confidence bounds
get narrower the bigger the portfolio is. These are the basics of why insurance
works. The CLT goes back to Abraham De Moivre (1667-1754), who published a first
article on the CLT in 1733 based on coin tossing (this was way ahead of its
time), and to Pierre-Simon Laplace (1749-1827), who provided an extension in 1812.
[Portraits: A. De Moivre, P.-S. Laplace]
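As a rough numerical illustration of (1.1) and (1.2), the following sketch simulates portfolios of i.i.d. claims and reports the empirical coefficient of variation of the total claim amount for growing portfolio sizes n. The exponential claim size distribution, the function name relative_spread and all sample sizes are assumptions chosen for illustration only; they are not part of the notes.

```python
import math
import random
import statistics


def relative_spread(n, n_sim=1000, mean_claim=1.0, seed=1):
    """Empirical coefficient of variation of the total claim of n i.i.d. claims.

    Assumption for illustration: claims are exponential with mean `mean_claim`,
    so mu = sigma = mean_claim and the exact value is 1 / sqrt(n).
    """
    rng = random.Random(seed)
    totals = [
        sum(rng.expovariate(1.0 / mean_claim) for _ in range(n))
        for _ in range(n_sim)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


if __name__ == "__main__":
    # Simulated relative spread next to the theoretical rate 1 / sqrt(n).
    for n in (10, 100, 1000):
        print(n, round(relative_spread(n), 4), round(1.0 / math.sqrt(n), 4))
```

Under these assumptions the simulated values should track 1/√n, which is the sense in which the relative confidence bounds get narrower as the portfolio grows.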
1.1.2 Risk components and premium elements

Insurance contracts involve many different risk components. We briefly present
them from the insurance company point of view.
1. Pure randomness: The outcomes of the claims Yi are uncertain/random.
This risk is taken care of by the volume n of the insurance portfolio (as
described above). This implies that it can be controlled in a sufficient way if
the insurance portfolio is built in an appropriately large fashion.
2. Model risk: The description of the randomness of Yi , described in the previous

item, is always based on a stochastic model, i.e. we aim to describe their
random outcomes in a model world. This modeling should have the minimal
requirement that it characterizes the nature of Yi in a sufficiently accurate
way. Model risk now derives from the fact that we try to explain real world
behavior with models. There are different things that can go wrong in this
modeling task:
(a) the model world does not provide an appropriate description of real world
behavior;

(b) the parameters in the chosen model are misspecified;

(c) risk factors change over time so that past observations do not appropri-
ately describe what may happen in the future.
All these uncertainties ask for a risk loading (risk margin) beyond the pure risk
premium defined by µ = E[Yi ]. This will be described in detail below.
We close this section by describing the elements that are considered for the insur-
ance premium calculation. The premium items are:
• pure risk premium µ = E[Yi ]
• risk margin to protect against the risks mentioned above
• profit margin
• − financial gains on investments
• sales commissions to agents
• other administrative expenses
• state taxes
The sum of all these items specifies the insurance premium. Non-life insurance
mathematics and statistics typically studies the first two items. This will be done
in the subsequent chapters.
1.2 Probability theory and statistics

1.2.1 Random variables and distribution functions
In this section we briefly recall the crucial notation and the key results of probability
theory used in these notes. We denote the underlying probability space by (Ω, F, P)
and assume throughout that this probability space is sufficiently rich so that it
carries all the objects that we are going to consider.
Random variables on this probability space $(\Omega, \mathcal{F}, P)$ are denoted
by capital letters $X, Y, S, N, \ldots$ and the corresponding observations are
denoted by $x, y, s, n, \ldots$. That is, $x$ constitutes a realization of $X$.
Random vectors are denoted by boldface, e.g., $\boldsymbol{X} = (X_1, \ldots, X_d)'$
and the corresponding observation by $\boldsymbol{x} = (x_1, \ldots, x_d)'$ for
a given dimension $d \in \mathbb{N}$. Since there is broad similarity between
random variables and random vectors, we restrict to random variables for
introducing the crucial terminology from probability theory.
Random variables $X$ are characterized by (probability) distribution functions
$F = F_X : \mathbb{R} \to [0, 1]$, meaning that for all $x \in \mathbb{R}$
\[ F(x) = F_X(x) = P[X \le x] \in [0, 1] \]
denotes the probability that $X$ has an outcome less than or equal to $x$. In
general, we drop the subscript in the distribution function $F = F_X$ and we
simply write $X \sim F$ for $X$ having distribution function $F$; a distribution
function is a right-continuous, non-decreasing function with
$\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
We distinguish two important types of random variables:
(i) a random variable $X \sim F$ is called discrete if $F$ is a step function
with countably many steps in discrete points $k \in A \subset \mathbb{R}$. In
this case we write
\[ p_k = P[X = k] > 0 \quad \text{for } k \in A, \]
with $\sum_{k \in A} p_k = 1$. We call $p_k$ the probability weight of $X$ in
$k \in A$;
(ii) a random variable $X \sim F$ is called absolutely continuous if there exists
a measurable function $f \ge 0$ with $f = F'$, i.e.
\[ F(x) = \int_{-\infty}^{x} f(y)\,dy \quad \text{for all } x \in \mathbb{R}. \]
This function $f$ is called the density of $X$ and in that case we also use the
terminology $X \sim f$.
Assume $X \sim F$ and $h : \mathbb{R} \to \mathbb{R}$ is a sufficiently nice
measurable function. We define the expected value of $h(X)$ by
\[ E[h(X)] = \int_{\mathbb{R}} h(x)\,dF(x) = \begin{cases} \sum_{k \in A} h(k)\,p_k & \text{if $X$ is discrete,} \\ \int_{\mathbb{R}} h(x) f(x)\,dx & \text{if $X$ is absolutely continuous.} \end{cases} \]
The middle term uses the general framework of the Riemann-Stieltjes integral
$\int_{\mathbb{R}} h\,dF$. The "sufficiently nice" refers to the fact that
$E[h(X)]$ is only defined upon existence. The most important functions $h$ in our
analysis define the following moments (based upon existence):
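• mean, expectation or first moment of $X \sim F$
\[ \mu_X = E[X] = \int_{\mathbb{R}} x\,dF(x), \]
• $k$-th moment of $X \sim F$
\[ E[X^k] = \int_{\mathbb{R}} x^k\,dF(x), \]
• variance of $X \sim F$
\[ \sigma_X^2 = \operatorname{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2, \]
• standard deviation and coefficient of variation of $X \sim F$
\[ \sigma_X = \operatorname{Var}(X)^{1/2} \quad \text{and} \quad \operatorname{Vco}(X) = \frac{\sigma_X}{E[X]} \quad \text{for } E[X] > 0, \]
• skewness of $X \sim F$
\[ \varsigma_X = \frac{E\left[(X - E[X])^3\right]}{\sigma_X^3}, \]
• moment generating function of $X \sim F$ at $r \in \mathbb{R}$
\[ M_X(r) = E[\exp\{rX\}]. \]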
The moment generating function will be crucial to identify the properties of
random variables $X$. If we replace $r$ by $-r$, i.e. if we consider $M_X(-r)$,
we obtain the Laplace-Stieltjes transform of the distribution function $F$ at
position $r$ which will be denoted
\[ \widehat{m}_F(r) = M_X(-r). \]
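For a discrete random variable, the moments listed above reduce to finite sums over the probability weights. The following minimal sketch evaluates them; the function name moments and the dictionary representation of the weights are illustrative assumptions, not notation from these notes.

```python
import math


def moments(weights):
    """Mean, variance, coefficient of variation and skewness of a discrete X.

    `weights` maps each point k in A to its probability weight p_k = P[X = k].
    """
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-12:
        raise ValueError("probability weights must sum to 1")
    mean = sum(k * p for k, p in weights.items())
    var = sum((k - mean) ** 2 * p for k, p in weights.items())
    sd = math.sqrt(var)
    third = sum((k - mean) ** 3 * p for k, p in weights.items())
    return {
        "mean": mean,
        "variance": var,
        # Vco is only defined for E[X] > 0, see the definition above.
        "vco": sd / mean if mean > 0 else float("nan"),
        "skewness": third / sd ** 3,
    }


# Illustrative example: X takes the value 1 with probability 0.2, else 0.
print(moments({0: 0.8, 1: 0.2}))
```

The same sums with the weights replaced by a density and an integral give the absolutely continuous case.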
Lemma 1.1. Choose $X \sim F$ and assume that there exists $r_0 > 0$ such that
$M_X(r) < \infty$ for all $r \in (-r_0, r_0)$. Then $M_X(r)$ has a power series
expansion for $r \in (-r_0, r_0)$ with
\[ M_X(r) = \sum_{k \ge 0} \frac{r^k}{k!}\, E[X^k]. \]
Proof. Note that it suffices to choose $r \in (-r_0, r_0)$ with $r \neq 0$.
Since $e^{|rx|} \le e^{-rx} + e^{rx}$, the assumptions imply integrability
$E[\exp\{|rX|\}] < \infty$. This implies that $E[|X|^k] < \infty$ for all
$k \in \mathbb{N}$ because $|x|^k$ is dominated by $e^{|rx|}$ for sufficiently
large $|x|$. It also implies that the partial sums
$|f_m(x)| = \left|\sum_{k=0}^m (rx)^k / k!\right|$ are uniformly bounded by the
integrable (w.r.t. $dF$) function $\sum_{k \ge 0} |rx|^k / k! = e^{|rx|}$. This
allows us to apply the dominated convergence theorem which provides
\[ \lim_{m \to \infty} \sum_{k=0}^m \frac{r^k}{k!}\, E[X^k] = \lim_{m \to \infty} E[f_m(X)] = E\left[\lim_{m \to \infty} f_m(X)\right] = M_X(r). \]
This proves the lemma. □
Lemma 1.1 implies that the power series converges for all $r \in (-r_0, r_0)$ for
given $r_0 > 0$ and, thus, we have a strictly positive radius of convergence
$\rho_0 > 0$. A standard result from analysis implies that in the interior of the
interval $[-\rho_0, \rho_0]$ we can differentiate $M_X(\cdot)$ arbitrarily often
(term by term of the power series) and the derivatives at the origin are given by
\[ \frac{d^k}{dr^k} M_X(r)\Big|_{r=0} = E[X^k] < \infty \quad \text{for } k \in \mathbb{N}_0. \tag{1.3} \]
Lemma 1.2. Choose a random variable X ∼ F and assume that there exists r0 > 0
such that MX (r) < ∞ for all r ∈ (−r0 , r0 ). Then the distribution function F of X
is completely determined by its moment generating function MX .

Proof. The existence of a strictly positive radius of convergence ρ0 implies that all moments
of X exist and that they are directly determined by the moment generating function via (1.3).
Theorem 30.1 of Billingsley [13] then implies that there is at most one distribution function F
which has the same moments (1.3) for all k ∈ N. □
For one-sided random variables the statement even holds true in general:
Lemma 1.3. Assume $X \ge 0$, $P$-a.s. The distribution function $F$ of $X$ is
completely determined by its moment generating function $M_X$.
Proof. See Section 22 of Billingsley [13], in particular Theorem 22.2. □
In particular, Lemma 1.3 gives for two random variables $X \sim F$ and
$Y \sim G$ with $X \ge 0$ and $Y \ge 0$, $P$-a.s., the following implication
\[ M_X \equiv M_Y \quad \Longrightarrow \quad X \overset{(d)}{=} Y. \]
This property is often used to identify distribution functions.
Lemma 1.4. Assume that the random variables $X_n$, $n \in \mathbb{N}$, and $X$
have finite moment generating functions $M_{X_n}$, $n \in \mathbb{N}$, and $M_X$
on a common interval $(-r_0, r_0)$ with $r_0 > 0$. Suppose
$\lim_{n \to \infty} M_{X_n}(r) = M_X(r)$ for all $r \in (-r_0, r_0)$. Then
$(X_n)_n$ converges in distribution to $X$, write $X_n \Rightarrow X$ for
$n \to \infty$.
[Portrait: P.L. Chebychev]
Proof. See Section 30 of Billingsley [13]. Basically, Chebychev's inequality implies tightness of
the underlying probability measures from which the convergence in distribution is derived. □
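Example 1.5 (Gaussian distribution). Assume $X \sim \mathcal{N}(\mu, \sigma^2)$
has a Gaussian distribution with parameters $\mu \in \mathbb{R}$ and
$\sigma^2 > 0$. $X$ is an absolutely continuous random variable with density
$f(x)$ for $x \in \mathbb{R}$ given by
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2} \right\}. \]
The moment generating function of $X$ is given by
\[ M_X(r) = \exp\left\{ r\mu + r^2\sigma^2/2 \right\} < \infty \quad \text{for } r \in \mathbb{R}. \tag{1.4} \]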
This moment generating function is obtained by direct calculation completing the
square. Observe that $M_X(\cdot)$ is finite on $\mathbb{R}$ and, thus, all
moments exist and
\[ \mu_X = E[X] = \frac{d}{dr} M_X(r)\Big|_{r=0} = \exp\left\{ r\mu + \tfrac{1}{2} r^2\sigma^2 \right\} \left( \mu + r\sigma^2 \right) \Big|_{r=0} = \mu, \]
and for the second moment we obtain
\[ E[X^2] = \frac{d^2}{dr^2} M_X(r)\Big|_{r=0} = \exp\left\{ r\mu + \tfrac{1}{2} r^2\sigma^2 \right\} \left[ (\mu + r\sigma^2)^2 + \sigma^2 \right] \Big|_{r=0} = \mu^2 + \sigma^2. \]
This implies for the variance of Gaussian distributions
\[ \sigma_X^2 = \operatorname{Var}(X) = E[X^2] - E[X]^2 = \sigma^2. \]
Moreover, any random variable $Y$ that has moment generating function of the form
(1.4) is Gaussian with mean $\mu_Y = \mu$ and variance $\sigma_Y^2 = \sigma^2$,
see Lemma 1.2.
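The relation (1.3) between derivatives of the moment generating function and moments can be checked numerically for the Gaussian case. The sketch below approximates the first two derivatives of (1.4) at the origin by central differences; the step size h, the helper names and the parameter values are assumptions made only for this illustration.

```python
import math


def gaussian_mgf(r, mu, sigma2):
    """Moment generating function (1.4) of N(mu, sigma2)."""
    return math.exp(r * mu + 0.5 * r * r * sigma2)


def first_two_moments(mgf, h=1e-4):
    """Approximate E[X] and E[X^2] by central differences of the mgf at r = 0."""
    m1 = (mgf(h) - mgf(-h)) / (2.0 * h)
    m2 = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)
    return m1, m2


mu, sigma2 = 1.5, 4.0  # illustrative parameter values
m1, m2 = first_two_moments(lambda r: gaussian_mgf(r, mu, sigma2))
print(m1, m2, m2 - m1 ** 2)  # close to mu, mu^2 + sigma2 and sigma2
```

Finite differences are used here only to avoid symbolic differentiation; the exact derivatives are the ones computed in Example 1.5.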
Exercise 1 (Gaussian distribution).
(a) Assume $X \sim \mathcal{N}(0, 1)$. Prove that $a + bX \sim \mathcal{N}(a, b^2)$ for $a, b \in \mathbb{R}$.
(b) Assume that $X_i$ are independent and $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$. Prove that
$\sum_i X_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$.
(c) Assume $X \sim \mathcal{N}(0, 1)$. Prove that $E[X^{2k+1}] = 0$ for all $k \in \mathbb{N}_0$.

The Gaussian distribution is named after Carl Friedrich Gauss (1777-1855). He was
one of the greatest mathematicians and has contributed to many different fields
in mathematics and physics. We recommend the novel of Kehlmann [58] that
fictitiously describes the lives of Carl Friedrich Gauss and of the natural
scientist Alexander von Humboldt (1769-1859).
[Portrait: C.F. Gauss]
Often we do not directly consider the moment generating function $M_X$ of a
random variable $X$ but rather its logarithm. The cumulant generating function of
$X$ is given by
\[ \log M_X(r) = \log E[\exp\{rX\}]. \]
Assume that $M_X$ is finite on $(-r_0, r_0)$ with $r_0 > 0$. We have
\[ \frac{d}{dr} \log M_X(r)\Big|_{r=0} = \frac{M_X'(r)}{M_X(r)}\Big|_{r=0} = E[X] = \mu_X, \]
\[ \frac{d^2}{dr^2} \log M_X(r)\Big|_{r=0} = \frac{M_X''(r) M_X(r) - (M_X'(r))^2}{(M_X(r))^2}\Big|_{r=0} = \operatorname{Var}(X) = \sigma_X^2, \tag{1.5} \]
\[ \frac{d^3}{dr^3} \log M_X(r)\Big|_{r=0} = E\left[(X - E[X])^3\right] = \varsigma_X\, \sigma_X^3. \]
Lemma 1.6. Assume that $M_X$ is finite on $(-r_0, r_0)$ with $r_0 > 0$. Then
$\log M_X(\cdot)$ is a convex function on $(-r_0, r_0)$.
Proof. In order to prove convexity we calculate the second derivative at
$r \in (-r_0, r_0)$
\[ \frac{d^2}{dr^2} \log M_X(r) = \frac{M_X''(r) M_X(r) - (M_X'(r))^2}{(M_X(r))^2} = \frac{M_X''(r)}{M_X(r)} - \left( \frac{M_X'(r)}{M_X(r)} \right)^2 = \frac{E[X^2 e^{rX}]}{E[e^{rX}]} - \left( \frac{E[X e^{rX}]}{E[e^{rX}]} \right)^2. \]
Define the new function $F_r$ by
\[ F_r(x) = \frac{1}{M_X(r)} \int_{-\infty}^{x} e^{ry}\,dF(y). \tag{1.6} \]
Observe that $F_r$ is a distribution function. Thus, we can choose a random
variable $X_r \sim F_r$ whose variance is given by
\[ 0 \le \operatorname{Var}(X_r) = E[X_r^2] - E[X_r]^2 = \frac{E[X^2 e^{rX}]}{E[e^{rX}]} - \left( \frac{E[X e^{rX}]}{E[e^{rX}]} \right)^2 = \frac{d^2}{dr^2} \log M_X(r). \]
This proves the claim. □
Remark. The distribution function $F_r$ defined in (1.6) gives the Esscher
measure of $F$. The Esscher measure was introduced by Bühlmann [19] for a new
premium calculation principle. Bühlmann called it the Esscher pricing principle
because of its formal connection to the Esscher transform. We come back to this
in Section 6.2.2.
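For a discrete distribution, the Esscher measure (1.6) simply reweights each probability weight by exp{rk} and renormalizes by the moment generating function. The sketch below, with an arbitrarily chosen three-point distribution and hypothetical helper names, compares the variance under the tilted weights with a finite-difference second derivative of log M_X(r), as in the proof of Lemma 1.6.

```python
import math


def esscher_tilt(weights, r):
    """Probability weights of X_r ~ F_r in (1.6) for a discrete X with weights p_k."""
    m = sum(math.exp(r * k) * p for k, p in weights.items())  # M_X(r)
    return {k: math.exp(r * k) * p / m for k, p in weights.items()}


def log_mgf(weights, r):
    """Cumulant generating function log M_X(r) of a discrete X."""
    return math.log(sum(math.exp(r * k) * p for k, p in weights.items()))


def variance(weights):
    """Variance of a discrete distribution given by its probability weights."""
    mean = sum(k * p for k, p in weights.items())
    return sum((k - mean) ** 2 * p for k, p in weights.items())


p = {0: 0.5, 1: 0.3, 2: 0.2}  # illustrative distribution, not from the notes
r, h = 0.4, 1e-4
second_derivative = (log_mgf(p, r + h) - 2 * log_mgf(p, r) + log_mgf(p, r - h)) / h ** 2
print(variance(esscher_tilt(p, r)), second_derivative)  # the two values nearly agree
```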
The next formula will often be used: Assume that $X \sim F$ is non-negative,
$P$-a.s., and has finite first moment. Then we have the identity
\[ E[X] = \int_0^\infty x\,dF(x) = \int_0^\infty [1 - F(x)]\,dx = \int_0^\infty P[X > x]\,dx. \]
The proof goes by integration by parts and the result says that we can calculate
expected values from the survival function $\bar{F}(x) = 1 - F(x) = P[X > x]$.
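The identity can be checked numerically by integrating a survival function. The following sketch uses a plain trapezoidal rule on a truncated interval; the exponential example, the truncation point and the function name are assumptions for illustration.

```python
import math


def mean_from_survival(survival, upper, steps=100_000):
    """Trapezoidal approximation of the integral of P[X > x] over [0, upper]."""
    dx = upper / steps
    total = 0.5 * (survival(0.0) + survival(upper))
    total += sum(survival(i * dx) for i in range(1, steps))
    return total * dx


# Illustrative choice: exponential claim sizes with mean 2, P[X > x] = exp(-x / 2).
approx = mean_from_survival(lambda x: math.exp(-x / 2.0), upper=100.0)
print(approx)  # close to the exact mean 2
```

The truncation at a finite upper bound is harmless here because the exponential survival function is negligible beyond it; heavy-tailed distributions would need more care.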
Often we deal with sequences $X_1, X_2, \ldots$ of random variables which are
independent and identically distributed (i.i.d.) with distribution function $F$.
In this case we use the notation
$X_1, X_2, \ldots \overset{\text{i.i.d.}}{\sim} F$.
Another property that is going to be used quite frequently is the so-called tower
property, see Williams [85]. It states that for any sub-σ-algebra
$\mathcal{G} \subset \mathcal{F}$ on our probability space
$(\Omega, \mathcal{F}, P)$ we have for any integrable random variable $X \sim F$
\[ E[X] = E\left[ E[X \mid \mathcal{G}] \right]. \tag{1.7} \]
In particular, if $X$ and $Y$ are two random variables on
$(\Omega, \mathcal{F}, P)$ we have
\[ E[X] = E\left[ E[X \mid Y] \right], \]
where $E[X \mid Y]$ is an abbreviation for $E[X \mid \sigma(Y)]$ with
$\sigma(Y) \subset \mathcal{F}$ denoting the σ-algebra generated by the random
variable $Y$.
Assume that $X$ is square integrable. Then the tower property (1.7) implies
\[ \operatorname{Var}(X) = E\left[ \operatorname{Var}(X \mid \mathcal{G}) \right] + \operatorname{Var}\left( E[X \mid \mathcal{G}] \right). \]
We have mentioned above that distribution functions $F$ are right-continuous and
non-decreasing. This allows us to define the left-continuous generalized inverse
of $F$ by
\[ F^{\leftarrow}(p) = \inf\{x;\ F(x) \ge p\}, \]
where we use the convention that $\inf \emptyset = \infty$. For $p \in (0, 1)$,
$F^{\leftarrow}(p)$ is often called the $p$-quantile of $X \sim F$. The
generalized inverse $F^{\leftarrow}$ is only tricky at the places where $F$ has
a discontinuity or where $F$ is not strictly increasing. It satisfies the
following properties, see Proposition A3 in McNeil et al. [68],
1. $F^{\leftarrow}$ is non-decreasing and left-continuous.
2. $F$ is continuous iff $F^{\leftarrow}$ is strictly increasing.
3. $F$ is strictly increasing iff $F^{\leftarrow}$ is continuous.
4. (If $F$ is right-continuous, then) $F(x) \ge z$ iff $F^{\leftarrow}(z) \le x$.
5. $F^{\leftarrow}(F(x)) \le x$.
6. $F(F^{\leftarrow}(z)) \ge z$.
7. If $F$ is strictly increasing, then $F^{\leftarrow}(F(x)) = x$.
8. If $F$ is continuous, then $F(F^{\leftarrow}(z)) = z$.
Items 4. to 8. need that $F^{\leftarrow}(z) < \infty$. Note that the first part
of item 4. is put in brackets because distribution functions are
right-continuous. However, generalized inverses can also be defined for functions
that are not right-continuous (as long as they are non-decreasing) and then the
condition in the bracket of item 4. is needed.
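For a discrete distribution with finitely many atoms, the generalized inverse can be evaluated by a search over the cumulative probabilities. The sketch below is one possible way to do this; the function name, the list-based representation and the example distribution are illustrative assumptions.

```python
import bisect
import itertools
import math


def generalized_inverse(points, weights):
    """Return p -> inf{x : F(x) >= p} for a discrete distribution.

    `points` are the sorted atoms of X and `weights` the matching probabilities.
    """
    cdf = list(itertools.accumulate(weights))  # F evaluated at the atoms

    def quantile(p):
        if p <= 0.0:
            return -math.inf  # F(x) >= p holds for every x
        # Index of the first atom whose cumulative probability reaches p.
        idx = bisect.bisect_left(cdf, p)
        # If no atom reaches p the set is empty and inf of the empty set is +inf.
        return points[idx] if idx < len(points) else math.inf

    return quantile


q = generalized_inverse([1, 2, 5], [0.25, 0.5, 0.25])
print(q(0.25), q(0.26), q(0.75), q(1.0))  # 1 2 2 5
```

The value q(0.25) = 1 illustrates left-continuity at a jump of F: the quantile stays at the atom where the level 0.25 is first reached and only moves on for levels strictly above it.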
1.2.2 Terminology in statistics

In many situations we face the problem that we need to predict the outcome of
a random variable $X \sim F$. This problem is solved by specifying an appropriate
predictor $\widehat{X}$. For instance, we can choose as predictor
$\widehat{X} = \mu_X = E[X]$.
On the other hand a distribution function $F$ often involves unknown parameters.
These unknown parameters need to be estimated, for instance, using past
experience and expert opinion. For example, we can estimate the (unknown) mean
$\mu_X$ of $X$ by an estimator $\widehat{\mu}_X$.
If we now choose the predictor $\widehat{X} = \widehat{\mu}_X$ for predicting
$X$, then $\widehat{\mu}_X$ serves at the same time as estimator for $\mu_X$ and
as predictor for $X$. In this sense we obtain an estimation error which is
specified by the difference $\mu_X - \widehat{\mu}_X$ and we obtain a prediction
error which is characterized by the following difference
\[ X - \widehat{X} = X - \widehat{\mu}_X = (X - \mu_X) + (\mu_X - \widehat{\mu}_X). \tag{1.8} \]
The second term on the right-hand side of (1.8) again specifies the estimation
error and the first term on the right-hand side of (1.8) is often called pure
process error which is due to the stochastic nature of $X$, see also Section 9.3.
Statistical tests deal with the problem of making decisions. Assume we have an
observation $\boldsymbol{x}$ of a random vector $\boldsymbol{X} \sim F_\theta$
with given but unknown parameter $\theta$ which lies in a given set $\Theta$ of
possible parameters. The aim is to test whether the (true, unknown) parameter
$\theta$ that has generated $\boldsymbol{x}$ may belong to some subset
$\Theta_0 \subset \Theta$. In the simplest case we have a singleton
$\Theta_0 = \{\theta_0\}$. Assume that we would like to check whether
$\boldsymbol{x}$ may have been generated by a given parameter $\theta_0$.
• Null hypothesis $H_0 : \theta = \theta_0$.
• (Two-sided) alternative hypothesis $H_1 : \theta \neq \theta_0$.
We then build a test statistic $T(\boldsymbol{X})$ whose distribution function is
known under the null hypothesis $H_0$ and we consider the question whether
$T(\boldsymbol{x})$ takes an unlikely value under the null hypothesis. This means
that one chooses a significance level $q \in (0, 1)$ (typically 5% or 1%) which
provides a critical region $C_q$ with $P[T(\boldsymbol{X}) \in C_q] \le q$ under
the null hypothesis. The null hypothesis is then rejected if $T(\boldsymbol{x})$
falls into this critical region.
In practice, one often calculates the so-called p-value. This denotes the
critical probability at which the null hypothesis is just rejected. For instance,
if we choose a significance level of 5% and the resulting p-value of
$T(\boldsymbol{x})$ is less than or equal to 5%, then the test rejects the null
hypothesis on the 5% significance level.
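As a small worked illustration of these steps, consider testing the mean of Gaussian observations with known standard deviation. This particular test, the made-up data and the function name are assumptions chosen for the example; the notes do not prescribe them. The sketch computes the test statistic, its two-sided p-value and the decision at a given significance level.

```python
import math
from statistics import NormalDist


def two_sided_z_test(observations, mu0, sigma, level=0.05):
    """Test H0: mu = mu0 against H1: mu != mu0 for Gaussian data with known sigma.

    Returns the observed test statistic T(x), its p-value and the decision.
    """
    n = len(observations)
    sample_mean = sum(observations) / n
    t = (sample_mean - mu0) / (sigma / math.sqrt(n))  # N(0, 1) under H0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_value, p_value <= level


x = [10.8, 9.4, 11.9, 10.1, 12.3, 11.2, 10.6, 11.7]  # made-up observations
print(two_sided_z_test(x, mu0=10.0, sigma=1.0))
```

For these made-up numbers the p-value is roughly 0.005, so the null hypothesis would be rejected on the 5% significance level.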
Exercise 2 ($\chi^2$-distribution). Assume that $X_k$ has a
$\chi^2_k$-distribution with $k \in \mathbb{N}$ degrees of freedom, i.e. $X_k$ is
absolutely continuous with density
\[ f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1} \exp\{-x/2\}, \quad \text{for } x \ge 0. \]
(a) Prove that $f$ is a density. Hint: see Section 3.2.1 and proof of Proposition
2.20.
(b) Prove
\[ M_{X_k}(r) = (1 - 2r)^{-k/2} \quad \text{for } r < 1/2. \]
(c) Choose $Z \sim \mathcal{N}(0, 1)$ and prove $Z^2 \overset{(d)}{=} X_1$.
(d) Choose $Z_1, \ldots, Z_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$.
Prove $\sum_{i=1}^k Z_i^2 \overset{(d)}{=} X_k$ and calculate the first two
moments of the latter.
Chapter 2
Collective Risk Model
The aim of this chapter is to describe the probability distribution of the total
claim amount $S$ that an insurance company faces within a fixed time period. For
the time period we take one (accounting) year. Assume that $N$ counts all claims
that occur within this fixed accounting year. The total claim amount is then
given by
\[ S = Y_1 + Y_2 + \ldots + Y_N = \sum_{i=1}^N Y_i, \]
where $Y_1, \ldots, Y_N$ model the individual claim sizes. If we are at the
beginning of this accounting year then neither the number of claims $N$ nor the
individual claim sizes $Y_1, \ldots, Y_N$ are known. Therefore, we model all
these unknowns with random variables that describe the possible outcomes of the
total claim amount $S$ (which, of course, then is also a random variable). We
call such models for $S$ collective risk models because we consider the whole
portfolio as a collective. The hope is to discover a law of large numbers for the
total insurance portfolio claim so that the insurance company can benefit from
diversification benefits (between individual risks) that allow us to predict
possible outcomes of $S$ (more) accurately.
2.1 Compound distributions

The starting point of the modeling of S is a compound distribution. This compound
distribution is based on rather strong model assumptions on the one hand, but on
the other hand it already leads to a good description and understanding of the
possible outcomes of the total claim amount S.
Model Assumptions 2.1 (compound distribution). The total claim amount $S$ is
given by the following compound distribution
\[ S = Y_1 + Y_2 + \ldots + Y_N = \sum_{i=1}^N Y_i, \]
with the three standard assumptions
1. $N$ is a discrete random variable which only takes values in $A \subset \mathbb{N}_0$;
2. $Y_1, Y_2, \ldots \overset{\text{i.i.d.}}{\sim} G$ with $G(0) = 0$;
3. $N$ and $(Y_1, Y_2, \ldots)$ are independent.
Remarks.
• If S satisfies these three standard assumptions from Model Assumptions 2.1

we say that S has a compound distribution.
• The first assumption of the compound distribution says that the number of
claims N takes only non-negative integer values. The event {N = 0} means
that no claim occurs which provides a total claim amount of S = 0.
• The second assumption means that the individual claims Yi do not affect each
other, i.e. if we face a large first claim Y1 this does not give any information
for the remaining claims Yi , i ≥ 2. Moreover, we have homogeneity meaning
that all claims have the same marginal distribution function G with
\[ 0 = G(0) = P[Y_1 \le 0], \]
i.e. the individual claim sizes $Y_i$ are strictly positive, P-a.s. We use the
terms (individual) claim size, (individual) claim and claims
severity synonymously for $Y_i$.
• Finally, the last assumption says that the individual claim sizes are not af-
fected by the number of claims and vice versa, for instance, if we observe
many claims this does not contain any information whether these claims are
of smaller or larger size.
This compound distribution is the base model for collective risk modeling and we
are going to describe different choices for the claims count distribution of N and
for the individual claim size distributions of Yi . We start with the basic recognition
features of compound distributions.
Proposition 2.2. Assume S has a compound distribution. We have
\begin{align*}
E[S] &= E[N]\, E[Y_1], \\
\mathrm{Var}(S) &= \mathrm{Var}(N)\, E[Y_1]^2 + E[N]\, \mathrm{Var}(Y_1), \\
\mathrm{Vco}(S) &= \sqrt{\mathrm{Vco}(N)^2 + \frac{1}{E[N]}\, \mathrm{Vco}(Y_1)^2}, \\
M_S(r) &= M_N(\log(M_{Y_1}(r))) \quad \text{for } r \in \mathbb{R},
\end{align*}
whenever they exist.
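Proof. Using the tower property (1.7) we obtain for the mean of S
\begin{align*}
E[S] &= E\left[\sum_{i=1}^N Y_i\right] = E\left[E\left[\left.\sum_{i=1}^N Y_i \,\right|\, N\right]\right] = E\left[\sum_{i=1}^N E[Y_i \mid N]\right] = E\left[\sum_{i=1}^N E[Y_i]\right] \\
&= E[N\, E[Y_1]] = E[N]\, E[Y_1].
\end{align*}
For the second statement we have
\begin{align*}
\mathrm{Var}(S) &= \mathrm{Var}\left(\sum_{i=1}^N Y_i\right) = \mathrm{Var}\left(E\left[\left.\sum_{i=1}^N Y_i \,\right|\, N\right]\right) + E\left[\mathrm{Var}\left(\left.\sum_{i=1}^N Y_i \,\right|\, N\right)\right] \\
&= \mathrm{Var}\left(\sum_{i=1}^N E[Y_i \mid N]\right) + E\left[\sum_{i=1}^N \mathrm{Var}(Y_i \mid N)\right] \\
&= \mathrm{Var}\left(\sum_{i=1}^N E[Y_i]\right) + E\left[\sum_{i=1}^N \mathrm{Var}(Y_i)\right] = \mathrm{Var}(N)\, E[Y_1]^2 + E[N]\, \mathrm{Var}(Y_1).
\end{align*}
Finally, for the moment generating function we have
\begin{align*}
M_S(r) &= E\left[\exp\left\{r \sum_{i=1}^N Y_i\right\}\right] = E\left[E\left[\left.\prod_{i=1}^N \exp\{r Y_i\} \,\right|\, N\right]\right] = E\left[\prod_{i=1}^N E[\exp\{r Y_i\} \mid N]\right] \\
&= E\left[M_{Y_1}(r)^N\right] = E[\exp\{N \log(M_{Y_1}(r))\}] = M_N(\log(M_{Y_1}(r))).
\end{align*}
This proves the proposition. □
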
Under Model Assumptions 2.1 the distribution function of S can be written as
\begin{align*}
P[S \le x] &= \sum_{k \in A} P\left[\left.\sum_{i=1}^N Y_i \le x \,\right|\, N = k\right] P[N = k] \tag{2.1} \\
&= \sum_{k \in A} P\left[\sum_{i=1}^k Y_i \le x\right] P[N = k] = \sum_{k \in A} G^{*k}(x)\, P[N = k],
\end{align*}
where $G^{*k}$ denotes the k-th convolution of the distribution function G. In particular,
we have for $Y_1, Y_2 \overset{i.i.d.}{\sim} G$
\[ P[Y_1 + Y_2 \le x] = \int G(x - y)\, dG(y) = G^{*2}(x). \]
With formula (2.1) we obtain a closed form solution for the distribution function
of S. However, in general, this formula is not useful due to the computational
complexity of calculating G∗k for too many k ∈ A. We present other solutions
for the calculation of the distribution function of S. These involve simulations,
approximations and smart analytic techniques under additional model assumptions.
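Formula (2.1) can be made concrete in the special case of integer-valued claim sizes and a claims count with finite support, where the convolutions $G^{*k}$ reduce to finite sums. The following is a minimal sketch under these assumptions; the function names and the toy probability weights are hypothetical and chosen for illustration only. It also checks the first two moment formulas of Proposition 2.2 numerically.

```python
def convolve(a, b):
    """Weights of the sum of two independent integer-valued variables (index = value)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def compound_pmf(count_pmf, claim_pmf):
    """Weights of S = Y_1 + ... + Y_N following (2.1).

    count_pmf[k] = P[N = k], claim_pmf[y] = P[Y_1 = y] with claim_pmf[0] == 0.
    """
    k_max = len(count_pmf) - 1
    y_max = len(claim_pmf) - 1
    total = [0.0] * (k_max * y_max + 1)
    conv_k = [1.0]  # 0-th convolution: point mass in 0 (no claim)
    for k, p_k in enumerate(count_pmf):
        if k > 0:
            conv_k = convolve(conv_k, claim_pmf)  # k-th convolution of the claim weights
        for s, q in enumerate(conv_k):
            total[s] += p_k * q
    return total


def mean_var(pmf):
    """Mean and variance of a distribution given by weights on 0, 1, 2, ..."""
    m = sum(x * p for x, p in enumerate(pmf))
    v = sum((x - m) ** 2 * p for x, p in enumerate(pmf))
    return m, v


# Hypothetical toy inputs: N takes values in {0,1,2,3}, claim sizes in {1,2,3}.
count_pmf = [0.4, 0.3, 0.2, 0.1]
claim_pmf = [0.0, 0.5, 0.3, 0.2]

s_pmf = compound_pmf(count_pmf, claim_pmf)
mean_n, var_n = mean_var(count_pmf)
mean_y, var_y = mean_var(claim_pmf)
mean_s, var_s = mean_var(s_pmf)
cdf_at_2 = sum(s_pmf[:3])  # P[S <= 2]
```

For a claims count with infinite support (e.g. Poisson) the sum over k would have to be truncated, and the number of convolutions grows with the expected number of claims, which illustrates why (2.1) quickly becomes expensive.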
2.2 Explicit claims count distributions

In this section we give explicit distribution functions for modeling the number
of claims N. The three most commonly used distribution functions are the binomial
distribution, the Poisson distribution and the negative-binomial distribution. Our
aim is to present these three distribution functions, describe the properties of the
resulting compound distributions, and discuss parameter estimation.
In a non-life insurance context the claims count random variable N should always
be understood in relation to an underlying (deterministic) volume v > 0. There-
fore, we consistently use a volume measure to describe N . Often this is not done
in the related literature. The volume measure will become especially important for
diversification benefit studies, parameter estimation and the evaluation of parame-
ter uncertainty. The volume measures can be of different nature depending on the
insurance policy and one should always choose the most appropriate one. Typical
volume measures are: number of insured, number of policies, number of risks. But
in health and accident insurance it could also be the aggregated wages insured or
in fire insurance the total value of the buildings insured. To keep the language simple
we assume that v > 0 denotes the number of risks insured and N counts the number
of risks that have a claim. The ratio N/v is called claims frequency and the
expected number of claims is given by
\[ E[N] = \lambda v, \]
where λ > 0 denotes the expected claims frequency. Under these assumptions we
would like to describe the probability weights
\[ p_k = P[N = k] \quad \text{for } k \in A \subset \mathbb{N}_0. \]
2.2.1 Binomial distribution

For the binomial distribution we choose a fixed volume v ∈ N and a fixed default
probability p ∈ (0, 1) (expected claims frequency).
We say N has a binomial distribution, write N ∼ Binom(v, p), if
\[ p_k = P[N = k] = \binom{v}{k} p^k (1 - p)^{v-k} \quad \text{for all } k \in \{0, \ldots, v\} = A. \]
The binomial formula provides $\sum_{k \in A} p_k = 1$, see e.g. Section 5.3 in Merz-Wüthrich
[70], and, hence, we have a discrete distribution function on the set A = {0, . . . , v}.
The special case v = 1 is called Bernoulli distribution or Bernoulli experiment, write
N ∼ Bernoulli(p), and reflects the coin tossing experiment
\[ P[N = k] = \begin{cases} 1 - p & \text{for } k = 0, \\ p & \text{for } k = 1. \end{cases} \]
This describes whether a single risk defaults or not.

Proposition 2.3. Assume N ∼ Binom(v, p) for fixed v ∈ N and p ∈ (0, 1). Then
\begin{align*}
E[N] &= vp, \qquad \mathrm{Var}(N) = vp(1 - p), \qquad \mathrm{Vco}(N) = \sqrt{\frac{1 - p}{vp}}, \\
M_N(r) &= (p e^r + (1 - p))^v \quad \text{for all } r \in \mathbb{R}.
\end{align*}
Proof. We calculate the moment generating function and then the first two moments follow from
formula (1.5). For the moment generating function we have
\begin{align*}
M_N(r) &= \sum_{k \in A} e^{rk} \binom{v}{k} p^k (1 - p)^{v-k} = \sum_{k \in A} \binom{v}{k} (p e^r)^k (1 - p)^{v-k} \\
&= (p e^r + (1 - p))^v \sum_{k \in A} \binom{v}{k} \left(\frac{p e^r}{p e^r + (1 - p)}\right)^k \left(\frac{1 - p}{p e^r + (1 - p)}\right)^{v-k}.
\end{align*}
The last sum is again a summation over probability weights $p_k^*$, k ∈ A, of a binomial distribution
with default probability $p^* = (p e^r)/(p e^r + (1 - p)) \in (0, 1)$. Therefore it adds up to 1 which
completes the proof. □
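The statements of Proposition 2.3 are easy to check numerically. The sketch below computes the binomial probability weights and compares their mean, variance and moment generating function with the closed forms; the parameter values v = 20, p = 0.1 and r = 0.3 are arbitrary illustrative choices.

```python
from math import comb, exp


def binom_pmf(v, p):
    """Probability weights p_k of Binom(v, p) for k = 0, ..., v."""
    return [comb(v, k) * p ** k * (1 - p) ** (v - k) for k in range(v + 1)]


v, p, r = 20, 0.1, 0.3  # hypothetical values for illustration
pk = binom_pmf(v, p)

mean = sum(k * q for k, q in enumerate(pk))                # should equal v p
var = sum((k - mean) ** 2 * q for k, q in enumerate(pk))   # should equal v p (1 - p)
mgf = sum(exp(r * k) * q for k, q in enumerate(pk))        # should equal (p e^r + 1 - p)^v
mgf_closed_form = (p * exp(r) + (1 - p)) ** v
```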
Next we give a second characterization of the binomial distribution which leads to

the interpretation of the binomial distribution.
Corollary 2.4. Assume that N ∼ Binom(v, p) with given v ∈ N and p ∈ (0, 1).
Choose $X_1, \ldots, X_v \overset{i.i.d.}{\sim} \mathrm{Bernoulli}(p)$. Then we have
\[ N \overset{(d)}{=} \sum_{i=1}^v X_i. \]
Proof. In view of Lemma 1.3 it suffices to prove that $N$ and $X = \sum_{i=1}^v X_i$ have the same
moment generating function. The moment generating function of the latter is given by
\[ M_X(r) = E\left[\prod_{i=1}^v e^{r X_i}\right] = \prod_{i=1}^v E\left[e^{r X_i}\right] = \prod_{i=1}^v (p e^r + (1 - p)) = M_N(r). \]
This completes the proof. □
Remarks. The corollary states that N describes the number of defaults within
a portfolio of fixed size v ∈ N. Every risk in this portfolio has the same default
probability p and defaults between different risks do not influence each other (are
independent). Thus, if N has a binomial distribution then every risk in such a
portfolio can at most default once. This is the case, for instance, for life insurance
policies where an insured can die at most once. In non-life insurance this distri-
bution is less commonly used because for typical non-life insurance policies we can
have more than one claim within a fixed time interval, e.g., a car insurance policy
can suffer two or more accidents within the same accounting year. Therefore, the
binomial distribution is only of marginal interest in non-life insurance modeling.

Definition 2.5 (compound binomial model). The total claim amount S has a
compound binomial distribution, write
S ∼ CompBinom(v, p, G),
if S has a compound distribution with N ∼ Binom(v, p) for given v ∈ N and

p ∈ (0, 1) and individual claim size distribution G.
Proposition 2.6. Assume S ∼ CompBinom(v, p, G). We have
\begin{align*}
E[S] &= vp\, E[Y_1], \\
\mathrm{Var}(S) &= vp \left(E[Y_1^2] - p\, E[Y_1]^2\right), \\
\mathrm{Vco}(S) &= \sqrt{\frac{1}{vp}}\, \sqrt{1 - p + \mathrm{Vco}(Y_1)^2}, \\
M_S(r) &= (p\, M_{Y_1}(r) + (1 - p))^v \quad \text{for } r \in \mathbb{R}.
\end{align*}

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.3. □
Remark. The coefficient of variation Vco(S) is a measure for the degree of
diversification within the portfolio. If S has a compound binomial distribution with
fixed default probability p and fixed claim size distribution G having finite second
moment, then the coefficient of variation converges to zero of order $v^{-1/2}$ as the
portfolio size v increases.
Corollary 2.7 (aggregation property). Assume that $S_1, \ldots, S_n$ are independent
with $S_j \sim \mathrm{CompBinom}(v_j, p, G)$ for all j = 1, . . . , n. The aggregated claim has a
compound binomial distribution with
\[ S = \sum_{j=1}^n S_j \sim \mathrm{CompBinom}\left(\sum_{j=1}^n v_j,\, p,\, G\right). \]
Proof. Exercise. □
2.2.2 Poisson distribution

For the Poisson distribution we choose a fixed volume v > 0 and a fixed expected
claims frequency λ > 0.
We say N has a Poisson distribution, write N ∼ Poi(λv), if
\[ p_k = P[N = k] = e^{-\lambda v}\, \frac{(\lambda v)^k}{k!} \quad \text{for all } k \in A = \mathbb{N}_0. \]
The power series expansion of the exponential function $e^{\lambda v}$
provides $\sum_{k \ge 0} p_k = 1$ and thus we have a discrete distribution
function on the set $A = \mathbb{N}_0$.
The Poisson distribution goes back to Siméon Denis Poisson (1781-1840) who
published his work on probability theory in 1837.
Note that the parameters λ and v only appear as the product λv in
the Poisson distribution. Therefore, we could also define
c = λv > 0 and then work solely with c. This is how the
Poisson distribution is typically treated in the literature. However, we would like
to keep the separation of c into λ and v because we would like to have the frequency
interpretation for λ, which allows for the study of diversification benefits. This is
exactly the statement of the next proposition.
Proposition 2.8. Assume N ∼ Poi(λv) for fixed λ, v > 0. Then
\begin{align*}
E[N] &= \lambda v = \mathrm{Var}(N), \qquad \mathrm{Vco}(N) = \sqrt{\frac{1}{\lambda v}}, \\
M_N(r) &= \exp\{\lambda v (e^r - 1)\} \quad \text{for all } r \in \mathbb{R}.
\end{align*}
Proof. We calculate the moment generating function and then the first two moments follow from
formula (1.5). For the moment generating function we have, using the power series expansion of
the exponential function,
\[ M_N(r) = \sum_{k \ge 0} e^{rk}\, e^{-\lambda v}\, \frac{(\lambda v)^k}{k!} = e^{-\lambda v} \sum_{k \ge 0} \frac{(\lambda v e^r)^k}{k!} = \exp\{-\lambda v + \lambda v e^r\}. \]
□
Proposition 2.8 provides the interpretation of the parameter λ. For given volume
v > 0 the expected claims frequency is
\[ E\left[\frac{N}{v}\right] = \lambda. \]
Moreover, for the coefficient of variation of the claims frequency N/v we obtain
\[ \mathrm{Vco}\left(\frac{N}{v}\right) = (\lambda v)^{-1/2} \to 0 \quad \text{for } v \to \infty. \tag{2.2} \]
Next we give a constructive characterization of the Poisson distribution.
Lemma 2.9. Assume that $N_v \sim \mathrm{Binom}(v, p)$ with v ∈ N and p = p(v) ∈ (0, 1)
such that $\lim_{v \to \infty} vp = c > 0$. Then $N_v$ converges in distribution to N ∼ Poi(c) as
v → ∞.

Proof. In view of Lemma 1.4 we need to prove that the moment generating functions of $N_v$ have
the appropriate convergence property. We have
\[ M_{N_v}(r) = (p e^r + (1 - p))^v = \left[\left(1 + p(v)(e^r - 1)\right)^{1/p(v)}\right]^{v p(v)}. \]
Note that p(v) → 0 as v → ∞. If we apply this limit to the inner bracket $(1 + p(v)(e^r - 1))^{1/p(v)}$
we exactly obtain the limit definition of the exponential function $\exp\{e^r - 1\}$, see Definition 14.30
in Merz-Wüthrich [70]. This together with the fact that vp(v) → c > 0 as v → ∞ provides the proof. □
Interpretation. Binomially distributed claims counts Nv can be approximated
by a Poisson distribution if the default probability p is very small compared to the
portfolio size v.
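The quality of the Poisson approximation in Lemma 2.9 can be inspected numerically. The sketch below compares the probability weights of Binom(v, c/v) and Poi(c) for growing v and records the largest absolute difference over k = 0, ..., 50. The value c = 2, the truncation at k = 50 and the function names are assumptions made for this illustration; the weights are built recursively from the ratio of consecutive weights to avoid large factorials.

```python
from math import exp


def binomial_weights(v, p, k_max):
    """p_k of Binom(v, p) for k = 0..k_max, using p_k / p_{k-1} = (v-k+1)/k * p/(1-p)."""
    w = [(1 - p) ** v]
    for k in range(1, k_max + 1):
        w.append(w[-1] * (v - k + 1) / k * p / (1 - p))
    return w


def poisson_weights(c, k_max):
    """p_k of Poi(c) for k = 0..k_max, using p_k / p_{k-1} = c / k."""
    w = [exp(-c)]
    for k in range(1, k_max + 1):
        w.append(w[-1] * c / k)
    return w


def max_weight_gap(v, c, k_max=50):
    """Largest |P[N_v = k] - P[N = k]| over k = 0..k_max (requires k_max <= v)."""
    b = binomial_weights(v, c / v, k_max)
    q = poisson_weights(c, k_max)
    return max(abs(x - y) for x, y in zip(b, q))


c = 2.0  # hypothetical limit of v * p(v)
gaps = [max_weight_gap(v, c) for v in (50, 500, 5000)]
```

In this example the gap shrinks roughly like 1/v, in line with the interpretation that the approximation improves as p = c/v becomes small.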
Definition 2.10 (compound Poisson model). The total claim amount S has a
compound Poisson distribution, write
S ∼ CompPoi(λv, G),
if S has a compound distribution with N ∼ Poi(λv) for given λ, v > 0 and individual
claim size distribution G.
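Proposition 2.11. Assume S ∼ CompPoi(λv, G). We have
\begin{align*}
E[S] &= \lambda v\, E[Y_1], \\
\mathrm{Var}(S) &= \lambda v\, E[Y_1^2], \\
\mathrm{Vco}(S) &= \sqrt{\frac{1}{\lambda v}}\, \sqrt{1 + \mathrm{Vco}(Y_1)^2}, \\
M_S(r) &= \exp\{\lambda v (M_{Y_1}(r) - 1)\} \quad \text{for } r \in \mathbb{R}.
\end{align*}

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.8. □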
Remark. If S has a compound Poisson distribution with fixed expected claims
frequency λ > 0 and fixed claim size distribution G having finite second moment,
then the coefficient of variation converges to zero of order $v^{-1/2}$ as the portfolio
size v increases.
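As a small numerical illustration of this remark, the following sketch evaluates the coefficient of variation from Proposition 2.11 for increasing volumes. The expected claims frequency 0.1 and the claim size coefficient of variation 2 are hypothetical values.

```python
from math import sqrt


def vco_compound_poisson(lam, v, vco_claim):
    """Vco(S) for S ~ CompPoi(lam * v, G), given Vco(Y_1) = vco_claim."""
    return sqrt((1.0 + vco_claim ** 2) / (lam * v))


volumes = [100, 10_000, 1_000_000]
vcos = [vco_compound_poisson(0.1, v, 2.0) for v in volumes]
# Each factor 100 in volume reduces Vco(S) by a factor 10 (order v^{-1/2}).
ratios = [vcos[i] / vcos[i + 1] for i in range(len(vcos) - 1)]
```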
The compound Poisson distribution has the so-called aggregation property and the
disjoint decomposition property. These are two extremely beautiful and useful
properties which explain part of the popularity of the compound Poisson model.
We first state and prove these two properties and then we give interpretations in
the light of non-life insurance portfolio modeling.

Theorem 2.12 (aggregation ↑ of compound Poisson distributions). Assume
$S_1, \ldots, S_n$ are independent with $S_j \sim \mathrm{CompPoi}(\lambda_j v_j, G_j)$ for all j = 1, . . . , n. The
aggregated claim has a compound Poisson distribution
\[ S = \sum_{j=1}^n S_j \sim \mathrm{CompPoi}(\lambda v, G), \]
with
\[ v = \sum_{j=1}^n v_j, \qquad \lambda = \sum_{j=1}^n \frac{v_j}{v}\, \lambda_j \qquad \text{and} \qquad G = \sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v}\, G_j. \]
Proof. We have assumed that $G_j(0) = 0$ for all j = 1, . . . , n which implies that S ≥ 0, P-a.s.
From Lemma 1.3 it follows that we only need to identify the moment generating function of S in
order to prove that it is compound Poisson distributed. Observe that $M_S(r)$ exists at least for
r ≤ 0. Thus, we calculate (using the independence of the $S_j$'s)
\begin{align*}
M_S(r) &= E\left[\exp\left\{r \sum_{j=1}^n S_j\right\}\right] = E\left[\prod_{j=1}^n \exp\{r S_j\}\right] = \prod_{j=1}^n E[\exp\{r S_j\}] \\
&= \prod_{j=1}^n \exp\left\{\lambda_j v_j \left(M_{Y_1^{(j)}}(r) - 1\right)\right\} = \exp\left\{\lambda v \left(\sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v}\, M_{Y_1^{(j)}}(r) - 1\right)\right\},
\end{align*}
where we have assumed $Y_1^{(j)} \sim G_j$. This is a compound Poisson distribution with expected
number of claims λv and the claim size distribution G is obtained from the moment generating
function $\sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v} M_{Y_1^{(j)}}(r)$: note that $G = \sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v} G_j$ is a distribution function (non-decreasing,
right-continuous, $\lim_{x \to -\infty} G(x) = 0$ and $\lim_{x \to \infty} G(x) = 1$). We choose Y ∼ G and obtain
\begin{align*}
M_Y(r) &= \int_0^\infty e^{ry}\, dG(y) = \int_0^\infty e^{ry}\, d\left(\sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v}\, G_j(y)\right) \\
&= \sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v} \int_0^\infty e^{ry}\, dG_j(y) = \sum_{j=1}^n \frac{\lambda_j v_j}{\lambda v}\, M_{Y_1^{(j)}}(r).
\end{align*}
Using Lemma 1.3 once more for these claim sizes proves the theorem. □
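The parameters of the aggregated portfolio in Theorem 2.12 are straightforward to compute. The sketch below does so for two hypothetical sub-portfolios with exponential claim size distributions; the function names, frequencies, volumes and claim size means are assumptions chosen only to illustrate the formulas for v, λ and the mixture G.

```python
from math import exp


def aggregate_compound_poisson(lams, vols):
    """Return (v, lam, weights) of the aggregated portfolio in Theorem 2.12."""
    v = sum(vols)
    lam = sum(v_j / v * lam_j for lam_j, v_j in zip(lams, vols))
    weights = [lam_j * v_j / (lam * v) for lam_j, v_j in zip(lams, vols)]
    return v, lam, weights


def mixture_cdf(weights, cdfs, y):
    """G(y) = sum_j weights[j] * G_j(y)."""
    return sum(w * G_j(y) for w, G_j in zip(weights, cdfs))


# Hypothetical sub-portfolios: frequencies, volumes and exponential claim sizes.
lams = [0.10, 0.05]
vols = [1000.0, 3000.0]
cdfs = [
    lambda y: 1.0 - exp(-y / 2000.0) if y > 0 else 0.0,
    lambda y: 1.0 - exp(-y / 8000.0) if y > 0 else 0.0,
]

v, lam, weights = aggregate_compound_poisson(lams, vols)
g_at_5000 = mixture_cdf(weights, cdfs, 5000.0)
```

Here the expected number of claims λv equals the sum of the expected numbers of claims λ_j v_j of the sub-portfolios, and the mixture weights are the shares of each sub-portfolio in this total.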
Next we analyze the disjoint decomposition property. To this end we slightly extend
the compound Poisson model. Let $(p_j^+)_{j=1,\ldots,m}$ be a discrete probability distribution
on the finite set {1, . . . , m}. Assume $p_j^+ > 0$ for all j. We can interpret the set
{1, . . . , m} as different sub-portfolios or different lines of business. For instance,
if we have a car insurance portfolio, a property insurance portfolio and a liability
insurance portfolio we set m = 3, with j ∈ {1, 2, 3} labeling the portfolios of the three
lines of business. Assume $G_j$ are the corresponding claim size distributions of the
sub-portfolios with $G_j(0) = 0$. Then, we can define the mixture distribution by
\[ G(y) = \sum_{j=1}^m p_j^+\, G_j(y) \quad \text{for } y \in \mathbb{R}. \]
Theorem 2.12 provides exactly such a mixture distribution with $p_j^+ = \lambda_j v_j/(\lambda v)$ if
we aggregate the sub-portfolios.
In the next theorem we go the opposite way, i.e. we aim at decomposing G.
We define a discrete random variable I which indicates to which sub-portfolio a
particular claim Y belongs: define
\[ P[I = j] = p_j^+ \quad \text{for all } j \in \{1, \ldots, m\}. \tag{2.3} \]
This allows us to extend the compound Poisson model from Definition 2.10.
Definition 2.13 (extended compound Poisson model). The total claim amount
$S = \sum_{i=1}^N Y_i$ has a compound Poisson distribution as defined in Definition 2.10. In
addition, we assume that $(Y_i, I_i)_{i \ge 1}$ are i.i.d. and independent of N with $Y_i$ having
marginal distribution function G with G(0) = 0 and $I_i$ having marginal distribution
function given by (2.3).
Remark. Note that Definition 2.13 gives a well-defined extension, i.e. it fully
respects the assumptions made in Definition 2.10 because (Yi , Ii )i≥1 are i.i.d. and
independent of N with Yi having the appropriate marginal distribution function
G. Observe that we do not specify the dependence structure between Yi and Ii . If
we choose m = 1 in (2.3) we are back in the classical compound Poisson model.
Therefore, the next theorem especially applies to the compound Poisson model.
Before stating the theorem we introduce an admissible, measurable disjoint
decomposition of the total space: the random vector $(Y_1, I_1)$ takes values in
$\mathbb{R}_+ \times \{1, \ldots, m\}$. On this space we choose a finite sequence $A_1, \ldots, A_n$ of (measurable)
sets such that $A_k \cap A_l = \emptyset$ for all $k \ne l$ and
\[ \bigcup_{k=1}^n A_k = \mathbb{R}_+ \times \{1, \ldots, m\}. \tag{2.4} \]
Such a sequence $A_1, \ldots, A_n$ is called a measurable disjoint decomposition of
$\mathbb{R}_+ \times \{1, \ldots, m\}$. This measurable disjoint decomposition is called admissible for $(Y_1, I_1)$
if for all k = 1, . . . , n
\[ p^{(k)} = P[(Y_1, I_1) \in A_k] > 0. \]
Note that $\sum_{k=1}^n p^{(k)} = 1$, due to (2.4) and the mutual disjointness.

Theorem 2.14 (disjoint decomposition ↓ of compound Poisson distributions).
Assume that S fulfills the extended compound Poisson model assumptions of Definition
2.13. We choose an admissible, measurable disjoint decomposition $A_1, \ldots, A_n$ for
$(Y_1, I_1)$. Define for k = 1, . . . , n the random variables
\[ S_k = \sum_{i=1}^N Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}. \]
$S_k$ are independent and $\mathrm{CompPoi}(\lambda_k v_k, G_k)$ distributed for k = 1, . . . , n with
\[ \lambda_k v_k = \lambda v\, p^{(k)} > 0 \qquad \text{and} \qquad G_k(y) = P[Y_1 \le y \mid (Y_1, I_1) \in A_k]. \]
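Proof of Theorem 2.14. We prove the theorem using the multivariate version of the
moment generating function. Choose $r = (r_1, \ldots, r_n)' \in \mathbb{R}^n$. The multivariate moment generating
function of $\mathbf{S} = (S_1, \ldots, S_n)'$ is given by
\begin{align*}
M_{\mathbf{S}}(r) &= E[\exp\{r' \mathbf{S}\}] = E\left[\exp\left\{\sum_{k=1}^n r_k S_k\right\}\right] = E\left[\exp\left\{\sum_{k=1}^n r_k \sum_{i=1}^N Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\}\right] \\
&= E\left[\prod_{i=1}^N E\left[\left.\exp\left\{\sum_{k=1}^n r_k Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\} \,\right|\, N\right]\right] \\
&= E\left[\prod_{i=1}^N E\left[\exp\left\{\sum_{k=1}^n r_k Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\}\right]\right].
\end{align*}
We calculate the inner expected values.
\begin{align*}
E\left[\exp\left\{\sum_{k=1}^n r_k Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\}\right]
&= \sum_{l=1}^n E\left[\exp\left\{\sum_{k=1}^n r_k Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\} 1_{\{(Y_i, I_i) \in A_l\}}\right] \\
&= \sum_{l=1}^n E\left[\left.\exp\left\{\sum_{k=1}^n r_k Y_i\, 1_{\{(Y_i, I_i) \in A_k\}}\right\} \,\right|\, (Y_i, I_i) \in A_l\right] P[(Y_i, I_i) \in A_l] \\
&= \sum_{l=1}^n E[\exp\{r_l Y_i\} \mid (Y_i, I_i) \in A_l]\; p^{(l)} = \sum_{l=1}^n p^{(l)}\, M_{Y_1^{(l)}}(r_l),
\end{align*}
where we assume $Y_1^{(l)} \sim G_l$. Collecting all terms we obtain
\begin{align*}
M_{\mathbf{S}}(r) &= E\left[\left(\sum_{l=1}^n p^{(l)} M_{Y_1^{(l)}}(r_l)\right)^N\right] = E\left[\exp\left\{N \log\left(\sum_{l=1}^n p^{(l)} M_{Y_1^{(l)}}(r_l)\right)\right\}\right] \\
&= \exp\left\{\lambda v \left(\sum_{l=1}^n p^{(l)} M_{Y_1^{(l)}}(r_l) - 1\right)\right\} = \exp\left\{\lambda v \sum_{l=1}^n p^{(l)} \left(M_{Y_1^{(l)}}(r_l) - 1\right)\right\} \\
&= \prod_{l=1}^n \exp\left\{\lambda v\, p^{(l)} \left(M_{Y_1^{(l)}}(r_l) - 1\right)\right\} = \prod_{l=1}^n M_{S_l}(r_l).
\end{align*}
This proves the theorem because we have obtained a product (i.e. independence) of moment
generating functions of compound Poisson distributed random variables $S_l$, l = 1, . . . , n. □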

Remarks 2.15 (Aggregation ↑ and disjoint decomposition ↓ properties).
• The aggregation property implies that we can follow a bottom-up ↑ modeling

approach for the entire insurance business. Thus, we model each sub-portfolio
Sj independently with a compound Poisson distribution. The total portfolio
is then easily obtained by the aggregation theorem and we stay in the same
family of distributions. This theorem is of special importance when we esti-
mate the frequency parameters λj and the individual claim size distributions
Gj on the bottom level.
• The disjoint decomposition property implies that we can also follow a top-down ↓
modeling approach. Thus, we model the overall portfolio by a compound Poisson
distribution and by the disjoint decomposition property we can easily allocate the
total claim to sub-portfolios. The crucial result here, surprising at first sight, is
that this allocation results in independent compound Poisson distributions for Sj .
This property does not hold true for other compound distributions because it
essentially uses the independent space and time decoupling property of Poisson
point processes, see also Section 3.3.2 in Mikosch [71].
• For I we have chosen a finite (discrete) indicator. Of course, this model can
easily be extended to other indicators. The crucial property is the i.i.d. as-
sumption on the random vectors (Yi , Ii ). We have chosen a finite indicator I

because this has the natural interpretation of sub-portfolios. If I = 1, P-a.s.,
then we can completely drop this indicator.
The choice of the appropriate volume on the sub-portfolios depends on the choice
of the indicator I. If m = 1, i.e. if we only consider one portfolio, but we apply a
disjoint decomposition of this portfolio as follows
\[ Y_i = Y_i\, 1_{\{Y_i \in A_1\}} + \ldots + Y_i\, 1_{\{Y_i \in A_n\}}, \]
then it is natural to set $v_k = v$ and $\lambda_k = \lambda p^{(k)}$ for k = 1, . . . , n. That is, the volume
v > 0 remains constant but the expected claims frequencies $\lambda_k$ change according
to $A_k$. This is also called thinning of the Poisson point process.
The second extreme case is m = n > 1 and the disjoint decomposition is given by
\[ \{(Y_i, I_i) \in A_k\} = \{I_i = k\}, \]
i.e. we only consider a decomposition according to different sub-portfolios
k = 1, . . . , m. In this case we would rather define $v_k > 0$ by the volume of portfolio k
and $\lambda_k = \lambda p^{(k)} v / v_k$.

Example 2.16 (large claims separation). An important application of the disjoint

decomposition property of compound Poisson distributions is the separation of
large claims from small claims. Often, there is not one parametric distribution
function G that applies to the entire range of possible outcomes of the individual
claim sizes Yi . Therefore, these individual claim sizes are separated into different
layers. Let us assume that we would like to model two layers. We choose a large
claims threshold M > 0 such that G(M) ∈ (0, 1), i.e. G(M) is bounded away from
zero and one. We then define the disjoint decomposition $A_1, A_2$ of $\mathbb{R}_+$ by
\[ A = A_1 = \{Y_1 \le M\} \qquad \text{and} \qquad A^c = A_2 = \{Y_1 > M\}. \]
Assume that S ∼ CompPoi(λv, G). We define the total claim $S_{sc}$ in the small
claims layer and the total claim $S_{lc}$ in the large claims layer by
\[ S_{sc} = \sum_{i=1}^N Y_i\, 1_{\{Y_i \le M\}} \qquad \text{and} \qquad S_{lc} = \sum_{i=1}^N Y_i\, 1_{\{Y_i > M\}}. \]
Theorem 2.14 implies that $S_{sc}$ and $S_{lc}$ are independent and compound Poisson
distributed with
\[ S_{sc} \sim \mathrm{CompPoi}\left(\lambda_{sc} v = \lambda G(M) v,\; G_{sc}(y) = P[Y_1 \le y \mid Y_1 \le M]\right), \]
and
\[ S_{lc} \sim \mathrm{CompPoi}\left(\lambda_{lc} v = \lambda (1 - G(M)) v,\; G_{lc}(y) = P[Y_1 \le y \mid Y_1 > M]\right). \]
In particular, this means that we can model the small and the large claims layers
completely separately and then obtain the total claim distribution by a simple
convolution of the two resulting distribution functions (due to independence), see
Example 4.11, below.
For the large claims layer we need to determine the expected large claims frequency
$\lambda_{lc}$. The individual claim sizes $Y_1 | \{Y_1 > M\}$ are often modeled with a Pareto distribution
with threshold M , see Sections 3.2.5 and 3.4.1.
The small claims layer is often approximated by a parametric distribution function:
we have seen in (2.1) that compound distributions may lead to rather time consum-
ing complexity when the expected number of claims λsc v is large. Therefore, one
typically assumes that the expected number of claims is sufficiently large so that
we are already in the asymptotic regime of the central limit theorem and then we
approximate this compound distribution by the Gaussian distribution (or maybe
by a distribution function that is slightly skewed, see Sections 4.1.2 and 4.1.3).
Note that the small claims layer cannot be distorted by large claims because they
are already sorted out by the threshold M . We will describe this in more detail in
Section 3.4.1, below.
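The parameters of the two layers in Example 2.16 follow directly from λ, v, G and the threshold M. The sketch below computes the expected numbers of claims and the conditional claim size distributions of both layers. The exponential claim size distribution, the numerical values and the function name `split_layers` are assumptions made for this illustration and do not come from the notes.

```python
from math import exp


def split_layers(lam, v, G, M):
    """Small/large claims layers of S ~ CompPoi(lam * v, G) at threshold M.

    Returns ((expected count, cdf) of the small layer, (expected count, cdf) of the large layer).
    """
    g_m = G(M)
    if not 0.0 < g_m < 1.0:
        raise ValueError("threshold M must satisfy 0 < G(M) < 1")

    def G_sc(y):
        # P[Y_1 <= y | Y_1 <= M] = G(min(y, M)) / G(M)
        return min(G(y), g_m) / g_m

    def G_lc(y):
        # P[Y_1 <= y | Y_1 > M] = (G(y) - G(M)) / (1 - G(M)) for y > M, else 0
        return max(G(y) - g_m, 0.0) / (1.0 - g_m)

    return (lam * g_m * v, G_sc), (lam * (1.0 - g_m) * v, G_lc)


def G(y):
    """Hypothetical claim size distribution: exponential with mean 1000."""
    return 1.0 - exp(-y / 1000.0) if y > 0 else 0.0


M = 3000.0
(n_sc, G_sc), (n_lc, G_lc) = split_layers(0.1, 5000.0, G, M)
```

Recombining the two layers with weights G(M) and 1 − G(M) recovers G, which is the mixture representation of Theorem 2.12 read in the opposite direction.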

2.2.3 Mixed Poisson distribution

Above we have introduced the binomial and the Poisson distribution. These two
distributions have the following property:
binomial distribution: E[N] > Var(N),
Poisson distribution: E[N] = Var(N).
However, insurance data often suggests
E[N] < Var(N).
Therefore, we present more claims count distributions for N . In particular, the

mixed Poisson distribution enjoys the latter property of a variance dominating the
mean. We remark that similar constructions could also be done for the binomial
distribution. We refrain from doing so because the Poisson case is more appropriate
for non-life insurance modeling.
The mixed Poisson distribution gives the general principle and a specific example
will be given in the next section. The idea is to attach volatility (or uncertainty)
to the claims frequency parameter λ, thus, the claims frequency will be modeled
as a latent variable, and based on this latent variable we choose the claims count
distribution.
Definition 2.17 (mixed Poisson distribution).
• Assume Λ ∼ H with H(0) = 0, E [Λ] = λ and Var(Λ) > 0.

• Conditionally, given Λ, N ∼ Poi(Λv) for a fixed volume v > 0.
Lemma 2.18. Assume N satisfies Definition 2.17. We have E[N] < Var(N).

Proof. The tower property implies E[N] = E[E[N|Λ]] = E[Λv] = λv and
\[ \mathrm{Var}(N) = E[\mathrm{Var}(N \mid \Lambda)] + \mathrm{Var}(E[N \mid \Lambda]) = v\, E[\Lambda] + v^2\, \mathrm{Var}(\Lambda) > \lambda v. \]
□
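The over-dispersion of Lemma 2.18 can be verified on a simple example. The sketch below assumes, purely for illustration, a latent frequency Λ that takes two values with equal probability; it builds the mixed Poisson probability weights (truncated at a large k) and compares their mean and variance with λv and vE[Λ] + v²Var(Λ). All names and numbers are hypothetical.

```python
from math import exp


def poisson_weights(c, k_max):
    """p_k of Poi(c) for k = 0..k_max, using p_k / p_{k-1} = c / k."""
    w = [exp(-c)]
    for k in range(1, k_max + 1):
        w.append(w[-1] * c / k)
    return w


def mixed_poisson_weights(levels, probs, v, k_max):
    """P[N = k] when Lambda takes the values `levels` with probabilities `probs`."""
    w = [0.0] * (k_max + 1)
    for lam_j, h_j in zip(levels, probs):
        for k, q in enumerate(poisson_weights(lam_j * v, k_max)):
            w[k] += h_j * q
    return w


levels, probs, v = [0.05, 0.15], [0.5, 0.5], 100.0  # hypothetical two-point mixing distribution H
lam = sum(h * l for l, h in zip(levels, probs))                 # E[Lambda]
var_lam = sum(h * (l - lam) ** 2 for l, h in zip(levels, probs))  # Var(Lambda)

pk = mixed_poisson_weights(levels, probs, v, k_max=100)  # tail beyond k = 100 is negligible here
mean_n = sum(k * q for k, q in enumerate(pk))
var_n = sum((k - mean_n) ** 2 * q for k, q in enumerate(pk))
var_formula = v * lam + v ** 2 * var_lam
```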
In the next section we make an explicit choice for the distribution function H.
2.2.4 Negative-binomial distribution

In this section we assume that N has a mixed Poisson distribution and we assume
that the latent variable Λ is drawn from a gamma distribution. Therefore, we
briefly introduce the gamma distribution, which is described in more detail in
Section 3.2.1, below.

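We say X has a gamma distribution, write X ∼ Γ(γ, c), with shape parameter
γ > 0 and scale parameter c > 0 if X is a non-negative, absolutely continuous
random variable with density
\[ f(x) = \frac{c^\gamma}{\Gamma(\gamma)}\, x^{\gamma - 1} \exp\{-cx\} \quad \text{for } x \ge 0. \]
The moments of X are given by
\[ E[X] = \frac{\gamma}{c}, \qquad \mathrm{Var}(X) = \frac{\gamma}{c^2} \qquad \text{and} \qquad M_X(r) = \left(\frac{c}{c - r}\right)^\gamma \quad \text{for } r < c. \]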
The gamma distribution has many nice properties and it is used rather frequently
for the modeling of latent variables and for the modeling of individual claim sizes,
see Section 3.2.1.
Definition 2.19 (negative-binomial distribution). We say N has a negative-
binomial distribution, write N ∼ NegBin(λv, γ), with volume v > 0, expected claims
frequency λ > 0 and dispersion parameter γ > 0, if
• Θ ∼ Γ(γ, γ), and

• conditionally, given Θ, N ∼ Poi(Θλv).
Note that for Λ = Θλ we are exactly in the context of Definition 2.17 with the first
two moments given by
\[ E[\Lambda] = \lambda \qquad \text{and} \qquad \mathrm{Var}(\Lambda) = \lambda^2/\gamma > 0. \]
Proposition 2.20 (negative-binomial distribution, 2nd definition). The
negative-binomial distribution as defined in Definition 2.19 satisfies
\[ p_k = P[N = k] = \binom{k + \gamma - 1}{k} (1 - p)^\gamma\, p^k \quad \text{for } k \in A = \mathbb{N}_0, \]
where we choose p = (λv)/(γ + λv) ∈ (0, 1).
This latter representation is the definition often used for the negative-binomial
distribution. However, in our context, it is simpler to work with the first definition.
Especially, parameter estimation will give an explicit meaning to the latent variable
Θ.

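Proof of Proposition 2.20. We apply the tower property which implies
\begin{align*}
P[N = k] &= E[P[N = k \mid \Theta]] = E\left[\exp\{-\Theta \lambda v\}\, \frac{(\Theta \lambda v)^k}{k!}\right] \\
&= \int_0^\infty \exp\{-x \lambda v\}\, \frac{(x \lambda v)^k}{k!}\, \frac{\gamma^\gamma}{\Gamma(\gamma)}\, x^{\gamma - 1} \exp\{-\gamma x\}\, dx \\
&= \frac{(\lambda v)^k\, \gamma^\gamma}{\Gamma(\gamma)\, k!}\, \frac{\Gamma(\gamma + k)}{(\gamma + \lambda v)^{\gamma + k}} \int_0^\infty \frac{(\gamma + \lambda v)^{\gamma + k}}{\Gamma(\gamma + k)}\, x^{\gamma + k - 1} \exp\{-(\gamma + \lambda v) x\}\, dx \\
&= \frac{\Gamma(\gamma + k)}{\Gamma(\gamma)\, k!} \left(\frac{\gamma}{\gamma + \lambda v}\right)^\gamma \left(\frac{\lambda v}{\gamma + \lambda v}\right)^k = \binom{k + \gamma - 1}{k} (1 - p)^\gamma\, p^k,
\end{align*}
notice that the second last equality follows because we have a gamma density with shape
parameter γ + k and scale parameter γ + λv under the integral. This trick of completion should
be remembered because it is applied rather frequently. □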
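The probability weights of Proposition 2.20 can be evaluated for non-integer γ through the log-gamma function, since the generalized binomial coefficient equals Γ(γ + k)/(Γ(γ) k!). The sketch below does this and checks that the weights sum to one and reproduce the mean λv and the variance λv(1 + λv/γ) that follow from the proof of Lemma 2.18 with Var(Λ) = λ²/γ. The values λv = 5 and γ = 2 and the truncation at k = 400 are assumptions for illustration.

```python
from math import exp, lgamma, log


def negbin_pmf(k, lam_v, gamma):
    """P[N = k] for N ~ NegBin(lam_v, gamma) in the parametrization of Proposition 2.20."""
    p = lam_v / (gamma + lam_v)
    # log of the generalized binomial coefficient Gamma(gamma + k) / (Gamma(gamma) k!)
    log_binom = lgamma(gamma + k) - lgamma(gamma) - lgamma(k + 1)
    return exp(log_binom + gamma * log(1.0 - p) + k * log(p))


lam_v, gamma = 5.0, 2.0  # hypothetical parameters
pk = [negbin_pmf(k, lam_v, gamma) for k in range(401)]  # tail beyond k = 400 is negligible here

total = sum(pk)
mean_n = sum(k * q for k, q in enumerate(pk))
var_n = sum((k - mean_n) ** 2 * q for k, q in enumerate(pk))
```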
Proposition 2.21. Assume N ∼ NegBin(λv, γ) for fixed λ, v, γ > 0. Then
\begin{align*}
E[N] &= \lambda v, \\
\mathrm{Var}(N) &= \lambda v (1 + \lambda v/\gamma) > \lambda v, \\
\mathrm{Vco}(N) &= \sqrt{\frac{1}{\lambda v}}\, \sqrt{1 + \lambda v/\gamma} > \gamma^{-1/2} > 0, \\
M_N(r) &= \left(\frac{1 - p}{1 - p e^r}\right)^\gamma \quad \text{for all } r < -\log p,
\end{align*}
and p = (λv)/(γ + λv) ∈ (0, 1).

Proof. The first three statements are a direct consequence of the proof of Lemma 2.18 and the
properties of the gamma distribution. Therefore, it remains to calculate the moment generating
function. The tower property implies
\[ M_N(r) = E\left[E\left[e^{rN} \mid \Theta\right]\right] = E[\exp\{\Theta \lambda v (e^r - 1)\}] = M_\Theta(\lambda v (e^r - 1)), \]
from which the claim follows for Θ ∼ Γ(γ, γ) and 1 − p = γ/(γ + λv). □
Proposition 2.21 provides a nice interpretation. For given volume v > 0 the
expected claims frequency is
\[ E\left[\frac{N}{v}\right] = \lambda. \]
Moreover, for the coefficient of variation of the claims frequency N/v we obtain
\[ \mathrm{Vco}\left(\frac{N}{v}\right) = \sqrt{(\lambda v)^{-1} + \gamma^{-1}} \to \gamma^{-1/2} > 0 \quad \text{for } v \to \infty. \]
Thus, the random variable Θ reflects the uncertainty in the “true” underlying
frequency parameter of the Poisson distribution. This uncertainty also remains
in the portfolio for infinitely large volume v, i.e. it is not diversifiable, and the
positive lower bound is determined by the dispersion parameter γ ∈ (0, ∞). The
interpretation of this is as follows: consider the time series N1 , N2 , . . . of claims

counts in different accounting years 1, 2, . . .. Each of these accounting years has

its own (risk) characteristics Θ1 , Θ2 , . . ., like weather conditions, inflation index,
portfolio fluctuations, etc. Since we do not know these characteristics a priori,
i.e. prior to future accounting years, we model these characteristics with a latent
factor (Θt )t≥1 which then provides the “true” frequency parameter for accounting
year t given by Λt = Θt λ. This differs from the Poisson case, see (2.2).
Example 2.22 (claims count distributions). We compare the binomial, Poisson
and the negative-binomial distributions. We assume that they have identical means
E[N ] = 500 with v = 1000, p = λ = 0.5 and γ = 100.
Figure 2.1: Probability weights $p_k$ (vertical axis) of the binomial, Poisson and negative-binomial
distributions with identical means, shown for k between 200 and 800.
In Figure 2.1 we plot the corresponding probability weights pk . We observe that

the coefficient of variation is increasing from the binomial over the Poisson to the
negative-binomial distribution, which gives successively more uncertainty to claims
counts.
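The ordering of the coefficients of variation in Example 2.22 can be reproduced from Propositions 2.3, 2.8 and 2.21 with the parameters stated in the example. The following short sketch only evaluates these closed forms; the dictionary layout is an illustrative choice.

```python
from math import sqrt

v, p, lam, gamma = 1000, 0.5, 0.5, 100.0  # parameters of Example 2.22

# (mean, variance) according to Propositions 2.3, 2.8 and 2.21
moments = {
    "binomial": (v * p, v * p * (1 - p)),
    "Poisson": (lam * v, lam * v),
    "negative-binomial": (lam * v, lam * v * (1 + lam * v / gamma)),
}
vco = {name: sqrt(var) / mean for name, (mean, var) in moments.items()}
```

With these parameters the variances are 250, 500 and 3000, so the coefficient of variation increases from the binomial over the Poisson to the negative-binomial case.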
Definition 2.23 (compound negative-binomial model). The total claim amount S

has a compound negative-binomial distribution, write
S ∼ CompNB(λv, γ, G),
if S has a compound distribution with N ∼ NegBin(λv, γ) for given λ, v, γ > 0 and
individual claim size distribution G.

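Proposition 2.24. Assume S ∼ CompNB(λv, γ, G). We have, whenever they
exist,
\begin{align*}
E[S] &= \lambda v\, E[Y_1], \\
\mathrm{Var}(S) &= \lambda v\, E[Y_1^2] + (\lambda v)^2\, E[Y_1]^2/\gamma, \\
\mathrm{Vco}(S) &= \sqrt{\frac{1}{\lambda v}}\, \sqrt{1 + \mathrm{Vco}(Y_1)^2 + \lambda v/\gamma} > \gamma^{-1/2}, \\
M_S(r) &= \left(\frac{1 - p}{1 - p\, M_{Y_1}(r)}\right)^\gamma \quad \text{for } r \in \mathbb{R} \text{ such that } M_{Y_1}(r) < 1/p,
\end{align*}
with p = (λv)/(γ + λv) ∈ (0, 1).

Proof. The proof is an immediate consequence of Propositions 2.2 and 2.21. □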
2.3 Parameter estimation

Once we have specified the distribution functions for N and Yi we still need to
determine their parameters. In the case of the claims count distribution for N
these are (i) default probability p for the binomial distribution; (ii) expected claims
frequency λ for the Poisson distribution; or (iii) expected claims frequency λ and
dispersion parameter γ for the negative-binomial distribution. Essentially, there
are three different ways to estimate these parameters:
1. method of moments (MM),
2. maximum likelihood estimation (MLE) method,
3. Bayesian inference method (inverse probability method).

In this section we describe the first two methods. For the Bayesian inference method
we refer to Chapter 8.
2.3.1 Method of moments

We start with a simple example to explain the method of moments. Assume that
we have $X_1, \ldots, X_T \overset{i.i.d.}{\sim} F$, where F is a parametric distribution function that
depends (for simplicity) on a two dimensional parameter $(\vartheta_1, \vartheta_2)$. Assume that the
first two moments of $X_1$ are finite, and thus, for all t = 1, . . . , T we have
\[ \mu = \mu(\vartheta_1, \vartheta_2) = E[X_t] < \infty \qquad \text{and} \qquad \sigma^2 = \sigma^2(\vartheta_1, \vartheta_2) = \mathrm{Var}(X_t) < \infty. \]
Remark. For general d-dimensional parameters (ϑ1 , . . . , ϑd ) we extend the argu-

ment to the first d moments of Xt .

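We define the sample mean and sample variance by, T ≥ 2 for the latter,
\[ \hat{\mu}_T = \frac{1}{T} \sum_{t=1}^T X_t \qquad \text{and} \qquad \hat{\sigma}_T^2 = \frac{1}{T - 1} \sum_{t=1}^T (X_t - \hat{\mu}_T)^2. \tag{2.5} \]
A straightforward calculation shows that these are unbiased estimators for µ and
σ², that is,
\[ E[\hat{\mu}_T] = \mu = \mu(\vartheta_1, \vartheta_2) \qquad \text{and} \qquad E[\hat{\sigma}_T^2] = \sigma^2 = \sigma^2(\vartheta_1, \vartheta_2). \tag{2.6} \]
This motivates the moment estimator $(\hat{\vartheta}_1, \hat{\vartheta}_2)$ for $(\vartheta_1, \vartheta_2)$ by solving the system of
equations
\[ \hat{\mu}_T = \mu(\hat{\vartheta}_1, \hat{\vartheta}_2) \qquad \text{and} \qquad \hat{\sigma}_T^2 = \sigma^2(\hat{\vartheta}_1, \hat{\vartheta}_2). \]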
In our situation the problem is more involved. Assume we have a vector of observations $\mathbf{N} = (N_1, \ldots, N_T)'$, where $N_t$ denotes the number of claims in accounting year $t$. The difficulty is that $N_t$, $t = 1, \ldots, T$, are not i.i.d. because they depend on different volumes $v_t$. That is, in general, the portfolio changes over accounting years. Therefore, we need to slightly modify the framework described above.
Assumption 2.25. Assume that there exist strictly positive volumes $v_1, \ldots, v_T$ such that the components of $\mathbf{F} = (N_1/v_1, \ldots, N_T/v_T)'$ are independent with
\[ \lambda = E[N_t/v_t] \quad \text{and} \quad \tau_t^2 = \mathrm{Var}(N_t/v_t) \in (0, \infty), \]
for all $t = 1, \ldots, T$.
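Lemma 2.26. We make Assumption 2.25. The unbiased linear (in $\mathbf{F}$) estimator for $\lambda$ with minimal variance is given by
\[ \widehat{\lambda}_T^{\mathrm{MV}} = \left( \sum_{t=1}^T \frac{1}{\tau_t^2} \right)^{-1} \sum_{t=1}^T \frac{N_t/v_t}{\tau_t^2}, \]
the variance of this estimator is given by
\[ \mathrm{Var}\left( \widehat{\lambda}_T^{\mathrm{MV}} \right) = \left( \sum_{t=1}^T \frac{1}{\tau_t^2} \right)^{-1}. \]
The upper index MV stands for “minimum variance” estimator.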
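The weighting in Lemma 2.26 can be illustrated numerically. The following R sketch is only an illustration: the function name `mv_estimator` and the toy inputs are assumptions made here and are not part of the notes, and the variances $\tau_t^2$ are treated as known.

```r
# Minimum variance unbiased linear estimator of Lemma 2.26 (illustrative sketch).
# freq: observed frequencies N_t / v_t; tau2: their variances tau_t^2 (assumed known).
mv_estimator <- function(freq, tau2) {
  w <- (1 / tau2) / sum(1 / tau2)      # weights proportional to 1 / tau_t^2, summing to 1
  list(estimate = sum(w * freq),       # weighted average of the frequencies
       variance = 1 / sum(1 / tau2),   # (sum_t 1 / tau_t^2)^(-1)
       weights  = w)
}

# Hypothetical toy input: three years observed with decreasing precision.
res <- mv_estimator(freq = c(0.05, 0.06, 0.04), tau2 = c(1, 2, 4) * 1e-4)
res$estimate   # (4 * 0.05 + 2 * 0.06 + 1 * 0.04) / 7
```

The most precise year receives the largest weight, and the variance of the combined estimator is smaller than each individual $\tau_t^2$.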
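Proof. We apply the method of Lagrange, see Section 24.3 in Merz-Wüthrich [70]. We define the mean vector $\boldsymbol{\lambda} = \lambda \mathbf{e} = \lambda (1, \ldots, 1)' \in \mathbb{R}^T$ and the diagonal positive definite covariance matrix $\mathbf{T} = \mathrm{diag}(\tau_1^2, \ldots, \tau_T^2)$ of $\mathbf{F}$. Then we would like to solve the following minimization problem
\[ x^+ = \arg\min_{\{x \in \mathbb{R}^T;\; x'\boldsymbol{\lambda} = \lambda\}} \frac{1}{2}\, x' \mathbf{T} x, \]
thus, we minimize the variance $\mathrm{Var}(x'\mathbf{F}) = x'\mathbf{T}x$ subject to all unbiased linear combinations of $\mathbf{F}$ which gives the constraint $\lambda = E[x'\mathbf{F}] = x'\boldsymbol{\lambda}$. The Lagrangian for this problem is given by
\[ L(x, c) = -\frac{1}{2}\, x' \mathbf{T} x - c\,(x'\boldsymbol{\lambda} - \lambda), \]
with Lagrange multiplier $c$. The optimal value $x^+$ is found by the solution of
\[ \frac{\partial}{\partial x} L(x, c) = -\mathbf{T}x - c\boldsymbol{\lambda} = 0 \quad \text{and} \quad \frac{\partial}{\partial c} L(x, c) = -x'\boldsymbol{\lambda} + \lambda = 0. \]
The first requirement implies $x = -c\,\mathbf{T}^{-1}\boldsymbol{\lambda} = -c\lambda\, \mathbf{T}^{-1}\mathbf{e}$. Plugging this into the second requirement implies $\lambda = x'\boldsymbol{\lambda} = -c\lambda^2\, \mathbf{e}'\mathbf{T}^{-1}\mathbf{e}$. If we solve this for the Lagrange multiplier we obtain $-c = \lambda^{-1} (\mathbf{e}'\mathbf{T}^{-1}\mathbf{e})^{-1}$. This provides
\[ x^+ = \frac{1}{\mathbf{e}'\mathbf{T}^{-1}\mathbf{e}}\, \mathbf{T}^{-1}\mathbf{e} = \left( \sum_{t=1}^T \frac{1}{\tau_t^2} \right)^{-1} \left( \tau_1^{-2}, \ldots, \tau_T^{-2} \right)'. \]
This implies that $\widehat{\lambda}_T^{\mathrm{MV}} = (x^+)'\mathbf{F}$ and the variance is given by
\[ \mathrm{Var}\left( \widehat{\lambda}_T^{\mathrm{MV}} \right) = (x^+)' \mathbf{T} x^+ = (\mathbf{e}'\mathbf{T}^{-1}\mathbf{e})^{-1} = \left( \sum_{t=1}^T \frac{1}{\tau_t^2} \right)^{-1}. \]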
We apply this lemma to the case of the binomial and the Poisson distributions. Assume that $N_t$, $t = 1, \ldots, T$, are independent with $N_t \sim \mathrm{Binom}(v_t, p)$ or $N_t \sim \mathrm{Poi}(\lambda v_t)$, respectively. Then we have in the binomial case
\[ E[N_t/v_t] = p \quad \text{and} \quad \mathrm{Var}(N_t/v_t) = p(1-p)/v_t = \tau_t^2, \]
and in the Poisson case
\[ E[N_t/v_t] = \lambda \quad \text{and} \quad \mathrm{Var}(N_t/v_t) = \lambda/v_t = \tau_t^2. \]
Note that in both cases the unknown parameter $p$ and $\lambda$, respectively, appears in the variance. However, the appearance is of multiplicative type which implies that it cancels in the weights $w_t = \tau_t^{-2} \left( \sum_{s=1}^T \tau_s^{-2} \right)^{-1}$; in both cases we obtain $w_t = v_t / \sum_{s=1}^T v_s$. Therefore we get the following moment estimators in the binomial and the Poisson cases.
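Estimator 2.27 (moment estimators in the binomial and Poisson cases).
We have the following unbiased linear minimum variance estimators:
• binomial case for $p$
\[ \widehat{p}_T^{\mathrm{MV}} = \frac{1}{\sum_{s=1}^T v_s} \sum_{t=1}^T N_t = \sum_{t=1}^T \frac{v_t}{\sum_{s=1}^T v_s}\, \frac{N_t}{v_t}; \]
• Poisson case for $\lambda$
\[ \widehat{\lambda}_T^{\mathrm{MV}} = \frac{1}{\sum_{s=1}^T v_s} \sum_{t=1}^T N_t = \sum_{t=1}^T \frac{v_t}{\sum_{s=1}^T v_s}\, \frac{N_t}{v_t}. \]

The variances of these estimators are given by
\[ \mathrm{Var}\left( \widehat{p}_T^{\mathrm{MV}} \right) = \frac{p(1-p)}{\sum_{s=1}^T v_s} \quad \text{and} \quad \mathrm{Var}\left( \widehat{\lambda}_T^{\mathrm{MV}} \right) = \frac{\lambda}{\sum_{s=1}^T v_s}. \]
These variances (and uncertainties) converge to zero for $\sum_{s=1}^T v_s \to \infty$, and they can be estimated by replacing the unknown parameters $p$ and $\lambda$, respectively, by their estimators. Note that we can explicitly give the distributions of the estimators because in the former case $\sum_{t=1}^T N_t \sim \mathrm{Binom}(\sum_{t=1}^T v_t, p)$ and in the latter case $\sum_{t=1}^T N_t \sim \mathrm{Poi}(\lambda \sum_{t=1}^T v_t)$.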
The negative-binomial case is more complex. Assume that $N_t$, $t = 1, \ldots, T$, are independent with $N_t \sim \mathrm{NegBin}(\lambda v_t, \gamma)$. For the first two moments we have
\[ E[N_t/v_t] = \lambda \quad \text{and} \quad \mathrm{Var}(N_t/v_t) = \lambda/v_t + \lambda^2/\gamma = \tau_t^2. \]
The variance term has two unknown parameters $\lambda$ and $\gamma$ and we lose the nice multiplicative structure from the binomial and the Poisson cases, which allowed us to apply Lemma 2.26 in a straightforward manner. If we drop the condition “minimum variance” we obtain the following unbiased linear estimator.
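Estimator 2.28 (moment estimator in the negative-binomial case (1/2)).
We have the following unbiased linear estimator for $\lambda$
\[ \widehat{\lambda}_T^{\mathrm{NB}} = \frac{1}{\sum_{s=1}^T v_s} \sum_{t=1}^T N_t = \sum_{t=1}^T \frac{v_t}{\sum_{s=1}^T v_s}\, \frac{N_t}{v_t}. \]
The unbiasedness of $\widehat{\lambda}_T^{\mathrm{NB}}$ immediately follows from the assumptions of the negative-binomial distribution. The variance of this estimator is given by
\[ \mathrm{Var}\left( \widehat{\lambda}_T^{\mathrm{NB}} \right) = \left( \frac{1}{\sum_{s=1}^T v_s} \right)^2 \sum_{t=1}^T \mathrm{Var}(N_t) = \frac{\sum_{t=1}^T \left( \lambda v_t + (\lambda v_t)^2/\gamma \right)}{\left( \sum_{s=1}^T v_s \right)^2}. \]
There remains the estimate of $\gamma$. Therefore, we define
\[ \widehat{V}_T^2 = \frac{1}{T-1} \sum_{t=1}^T v_t \left( \frac{N_t}{v_t} - \widehat{\lambda}_T^{\mathrm{NB}} \right)^2. \qquad (2.7) \]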
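Lemma 2.29. In the negative-binomial model $\widehat{V}_T^2$ satisfies
\[ E\left[ \widehat{V}_T^2 \right] = \lambda + \frac{1}{T-1} \left( \sum_{t=1}^T v_t - \frac{\sum_{t=1}^T v_t^2}{\sum_{t=1}^T v_t} \right) \frac{\lambda^2}{\gamma}. \]

This motivates the following estimator.
Estimator 2.30 (moment estimator in the negative-binomial case (2/2)).
The method of moments suggests the following estimator for $\gamma$
\[ \widehat{\gamma}_T^{\mathrm{NB}} = \frac{(\widehat{\lambda}_T^{\mathrm{NB}})^2}{\widehat{V}_T^2 - \widehat{\lambda}_T^{\mathrm{NB}}}\; \frac{1}{T-1} \left( \sum_{t=1}^T v_t - \frac{\sum_{t=1}^T v_t^2}{\sum_{t=1}^T v_t} \right), \]
for $\widehat{V}_T^2 > \widehat{\lambda}_T^{\mathrm{NB}}$, otherwise use the Poisson or the binomial model (no over-dispersion in data $N_1, \ldots, N_T$).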
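Proof of Lemma 2.29. Using the unbiasedness of $\widehat{\lambda}_T^{\mathrm{NB}}$ for $\lambda$ we have
\[ E\left[ \widehat{V}_T^2 \right] = \frac{1}{T-1} \sum_{t=1}^T v_t\, E\left[ \left( \frac{N_t}{v_t} - \widehat{\lambda}_T^{\mathrm{NB}} \right)^2 \right] = \frac{1}{T-1} \sum_{t=1}^T v_t\, \mathrm{Var}\left( \frac{N_t}{v_t} - \widehat{\lambda}_T^{\mathrm{NB}} \right) \]
\[ = \frac{1}{T-1} \sum_{t=1}^T v_t \left[ \mathrm{Var}\left( \frac{N_t}{v_t} \right) - 2\, \mathrm{Cov}\left( \frac{N_t}{v_t}, \widehat{\lambda}_T^{\mathrm{NB}} \right) + \mathrm{Var}\left( \widehat{\lambda}_T^{\mathrm{NB}} \right) \right] \]
\[ = \frac{1}{T-1} \left[ \sum_{t=1}^T \frac{\lambda v_t + (\lambda v_t)^2/\gamma}{v_t} - \frac{\sum_{t=1}^T \left( \lambda v_t + (\lambda v_t)^2/\gamma \right)}{\sum_{s=1}^T v_s} \right]. \]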
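We justify these estimators in the case of $v_t = v$ for all $t = 1, \ldots, T$. This uniform volume case provides
\[ \widehat{\lambda}_T^{\mathrm{NB}}\, v = \frac{1}{T} \sum_{t=1}^T N_t = \widehat{\mu}_T, \]
which is the sample mean of i.i.d. random variables $N_t$. For the $\gamma$ estimate we obtain in the uniform volume case
\[ \widehat{\gamma}_T^{\mathrm{NB}} = \frac{(\widehat{\lambda}_T^{\mathrm{NB}} v)^2}{\widehat{V}_T^2\, v - \widehat{\lambda}_T^{\mathrm{NB}} v} \quad \text{with} \quad \widehat{V}_T^2\, v = \frac{1}{T-1} \sum_{t=1}^T \left( N_t - \widehat{\lambda}_T^{\mathrm{NB}} v \right)^2 = \widehat{\sigma}_T^2, \]
where the latter term is the sample variance of i.i.d. random variables $N_t$. Or in other words, the proposed estimators in the uniform volume $v_t = v$ case are found by looking at the system of equations (2.6). In the negative-binomial model this system is given by
\[ E[\widehat{\mu}_T] = \mu = \lambda v \quad \text{and} \quad E[\widehat{\sigma}_T^2] = \sigma^2 = \lambda v + (\lambda v)^2/\gamma. \]
Replacing $\mu$ and $\sigma^2$ by their sample estimators and solving the system of equations provides $\widehat{\lambda}_T^{\mathrm{NB}}$ and $\widehat{\gamma}_T^{\mathrm{NB}}$ in the uniform volume case.
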
2.3.2 Maximum likelihood estimators

The MLE method was popularized by Sir Ronald Aylmer Fisher (1890-1962), but it had already been used earlier by Gauss, Laplace and others. The philosophy behind MLE differs from that of the method of moments. For MLE the primary objective is not unbiasedness but maximizing the probability of a given observation. MLE can be done for densities or for probability weights; we formulate it for the latter because at the moment we are looking at discrete random variables $N$.
Assume that the components of $\mathbf{N} = (N_1, \ldots, N_T)'$ are independent with probability weights $p_k^{(t)}(\vartheta) = P_\vartheta[N_t = k] = P[N_t = k]$ which depend on a common unknown parameter $\vartheta$. The independence property of $N_1, \ldots, N_T$ implies that their joint likelihood function is given by
\[ L_{\mathbf{N}}(\vartheta) = \prod_{t=1}^T p_{N_t}^{(t)}(\vartheta), \]
and their joint log-likelihood function is given by
\[ \ell_{\mathbf{N}}(\vartheta) = \sum_{t=1}^T \log p_{N_t}^{(t)}(\vartheta). \]
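The MLE for $\vartheta$ is based on the rationale that $\vartheta$ should be chosen such that the probability of observing $\{N_1, \ldots, N_T\}$ is maximized. The MLE $\widehat{\vartheta}_T^{\mathrm{MLE}}$ for $\vartheta$ based on the observation $\mathbf{N}$ is given by
\[ \widehat{\vartheta}_T^{\mathrm{MLE}} = \arg\max_{\vartheta} L_{\mathbf{N}}(\vartheta) = \arg\max_{\vartheta} \ell_{\mathbf{N}}(\vartheta). \]
Under suitable regularity properties the MLE $\widehat{\vartheta}_T^{\mathrm{MLE}}$ is found as solution of
\[ \frac{\partial}{\partial \vartheta} \ell_{\mathbf{N}}(\vartheta) = \frac{\partial}{\partial \vartheta} \sum_{t=1}^T \log p_{N_t}^{(t)}(\vartheta) = 0. \]
If the probability weights $p_k^{(t)}(\vartheta)$ are sufficiently regular as a function of $\vartheta$ in a regular domain which contains the true parameter $\vartheta$, then the MLE $\widehat{\vartheta}_T^{\mathrm{MLE}}$ is asymptotically unbiased for $T \to \infty$ and under appropriate scaling it has an asymptotic Gaussian distribution with inverse Fisher’s information as covariance matrix, for details see Theorem 4.1 in Lehmann [61].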

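Estimator 2.31 (MLE in the binomial case).
Assume that $N_1, \ldots, N_T$ are independent and $\mathrm{Binom}(v_t, p)$. The MLE is given by
\[ \widehat{p}_T^{\mathrm{MLE}} = \frac{1}{\sum_{s=1}^T v_s} \sum_{t=1}^T N_t = \sum_{t=1}^T \frac{v_t}{\sum_{s=1}^T v_s}\, \frac{N_t}{v_t} = \widehat{p}_T^{\mathrm{MV}}. \]
Proof. The log-likelihood function is given by
\[ \ell_{\mathbf{N}}(p) = \sum_{t=1}^T \left[ \log \binom{v_t}{N_t} + N_t \log p + (v_t - N_t) \log(1-p) \right]. \]
Calculating the derivative w.r.t. $p$ provides the requirement
\[ \frac{\partial}{\partial p} \ell_{\mathbf{N}}(p) = \sum_{t=1}^T \left[ \frac{N_t}{p} - \frac{v_t - N_t}{1-p} \right] = 0. \]
Solving this for $p$ proves the claim. □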
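Estimator 2.32 (MLE in the Poisson case).
Assume that $N_1, \ldots, N_T$ are independent and $\mathrm{Poi}(\lambda v_t)$. The MLE is given by
\[ \widehat{\lambda}_T^{\mathrm{MLE}} = \frac{1}{\sum_{s=1}^T v_s} \sum_{t=1}^T N_t = \sum_{t=1}^T \frac{v_t}{\sum_{s=1}^T v_s}\, \frac{N_t}{v_t} = \widehat{\lambda}_T^{\mathrm{MV}}. \]
Proof. The log-likelihood function is given by
\[ \ell_{\mathbf{N}}(\lambda) = \sum_{t=1}^T \left[ -(\lambda v_t) + N_t \log(\lambda v_t) - \log(N_t!) \right]. \]
Calculating the derivative w.r.t. $\lambda$ provides the requirement
\[ \frac{\partial}{\partial \lambda} \ell_{\mathbf{N}}(\lambda) = \sum_{t=1}^T \left[ -v_t + \frac{N_t}{\lambda} \right] = 0. \]
Solving this for $\lambda$ provides the claim. □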

Estimator 2.33 (MLE in the negative-binomial case).
Assume that $N_1, \ldots, N_T$ are independent and $\mathrm{NegBin}(\lambda v_t, \gamma)$. The MLE $(\widehat{\lambda}^{\mathrm{MLE}}, \widehat{\gamma}^{\mathrm{MLE}})$ is the solution of
\[ \frac{\partial}{\partial(\lambda, \gamma)} \sum_{t=1}^T \left[ \log \binom{N_t + \gamma - 1}{N_t} + \gamma \log(1 - p_t) + N_t \log p_t \right] = 0, \]
with $p_t = (\lambda v_t)/(\gamma + \lambda v_t) \in (0, 1)$.
Unfortunately, this system of equations does not have a closed form solution, and a root search algorithm is needed to find the MLE solution for $(\lambda, \gamma)$, see also page 61 below.
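As a hedged illustration of such a numerical search, the following R sketch maximizes the negative-binomial log-likelihood directly with the general-purpose optimizer `optim()` instead of solving the score equations. It uses the water claims data of Table 2.1 in Section 2.3.3 below. The log-parametrization, the starting values and the variable names are assumptions made for this sketch. R's `dnbinom(size = γ, mu = λ v_t)` has mean $\lambda v_t$ and variance $\lambda v_t + (\lambda v_t)^2/\gamma$, which matches the parametrization used here.

```r
# Data of Table 2.1: volumes v_t and claims counts N_t, years 1982-1991.
v <- c(240755, 255571, 269739, 281708, 306888, 320265, 323481, 334753, 340265, 344757)
N <- c(13153, 14186, 14207, 13461, 21261, 19934, 15796, 15157, 17483, 19185)

# Negative log-likelihood in (log lambda, log gamma), so that both stay positive.
negloglik <- function(par) {
  lambda <- exp(par[1])
  gam    <- exp(par[2])
  -sum(dnbinom(N, size = gam, mu = lambda * v, log = TRUE))
}

start <- c(log(sum(N) / sum(v)), log(50))   # assumed starting values
fit   <- optim(start, negloglik)            # Nelder-Mead search (optim default)

lambda_mle <- exp(fit$par[1])
gamma_mle  <- exp(fit$par[2])
```

In practice one would also inspect `fit$convergence` and try several starting values, because the search is only local.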

2.3.3 Example and χ2 -goodness-of-fit analysis

We apply the claims count models (Poisson and
negative-binomial) to a real data set. We take
the data set provided in Gisler [48]. This data
set describes the number of claims in an insur-
ance portfolio that protects private households
against water claims. The data is displayed in
Table 2.1 and Figure 2.2.
year t   volume v_t   number of claims N_t   frequency N_t/v_t
1982     240’755      13’153                 5.46%
1983     255’571      14’186                 5.55%
1984     269’739      14’207                 5.27%
1985     281’708      13’461                 4.78%
1986     306’888      21’261                 6.93%
1987     320’265      19’934                 6.22%
1988     323’481      15’796                 4.88%
1989     334’753      15’157                 4.53%
1990     340’265      17’483                 5.14%
1991     344’757      19’185                 5.56%
total    3’018’182    163’823                5.43%
Table 2.1: Private households water insurance: number of policies and claims counts, source Gisler [48].
We observe a strong growth of volume of more than 40% in this insurance portfolio from $v_{1982}$ = 240’755 policies to $v_{1991}$ = 344’757 policies. Such a strong growth
might question the stationarity assumption in the expected claims frequency λt ≡ λ
because it might also reflect a substantial change in the portfolio (and product
maybe). Nevertheless we assume its validity (because the observed claims frequen-
cies Nt /vt do not show any structure such as a linear trend, see Figure 2.2) and
we fit the Poisson and, if necessary, the negative-binomial distribution to this data
set.
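Poisson model. We assume that $N_t$ are independent with $N_t \sim \mathrm{Poi}(\lambda v_t)$. The minimal variance estimator and the MLE for $\lambda$ are given by, see Estimator 2.32,
\[ \widehat{\lambda}_T^{\mathrm{MV}} = \widehat{\lambda}_T^{\mathrm{MLE}} = \frac{1}{\sum_{s=1982}^{1991} v_s} \sum_{t=1982}^{1991} N_t = 5.43\%. \]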

Figure 2.2: Observed yearly claims frequencies $N_t/v_t$ from t = 1982 to 1991 compared to the overall average of 5.43%, see Table 2.1.
The coefficient of variation in the Poisson model is given by, see (2.2),
\[ \mathrm{Vco}(N_t/v_t) = (\lambda v_t)^{-1/2}. \]
This coefficient of variation is estimated using $\widehat{\lambda}_T^{\mathrm{MV}}$ by
\[ \widehat{\mathrm{Vco}}(N_t/v_t) = (\widehat{\lambda}_T^{\mathrm{MV}} v_t)^{-1/2} \approx 0.8\%. \]
If we choose 1 standard deviation as confidence bounds, i.e. if we consider the confidence interval $\mathrm{CI}_t = (\lambda \pm \lambda (\lambda v_t)^{-1/2})$, we obtain estimated confidence intervals (for any $t$) of roughly
\[ \widehat{\mathrm{CI}}_t = (5.39\%, 5.47\%). \]
These resulting confidence bounds are very narrow and we observe that most of the observed claims frequencies $N_t/v_t$ in Table 2.1 lie outside of these confidence bounds, see Figure 2.3 (lhs). This clearly rejects the assumption of having Poisson distributions for the number of claims and suggests that we should study the negative-binomial model for $N_t$.
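The Poisson figures above can be reproduced from Table 2.1 with a few lines of R. This is a sketch for illustration only; the variable names are assumptions made here, and the bounds are computed per year rather than rounded to a single interval.

```r
# Data of Table 2.1: volumes v_t and claims counts N_t, years 1982-1991.
v <- c(240755, 255571, 269739, 281708, 306888, 320265, 323481, 334753, 340265, 344757)
N <- c(13153, 14186, 14207, 13461, 21261, 19934, 15796, 15157, 17483, 19185)

lambda_hat <- sum(N) / sum(v)              # MV estimator = MLE in the Poisson case
vco_hat    <- (lambda_hat * v)^(-1/2)      # estimated Vco(N_t / v_t), one value per year
ci_lower   <- lambda_hat * (1 - vco_hat)   # 1 standard deviation bounds per year
ci_upper   <- lambda_hat * (1 + vco_hat)

# Number of years whose observed frequency lies outside its bounds.
outside <- sum(N / v < ci_lower | N / v > ci_upper)
```

Only the year 1982 falls inside its one standard deviation bounds, which is the pattern visible in Figure 2.3 (lhs).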
Negative-binomial model. As described above, the negative-binomial model is

able to model the heterogeneity over different accounting years t. It assumes that
every accounting year t is characterized by a (latent) risk factor Θt which describes
the nature of that particular accounting year t. A priori all years are similar which
is expressed by the i.i.d. property of Θt with Θt ∼ Γ(γ, γ) for identical dispersion
parameters γ > 0. We estimate this dispersion parameter with Estimator 2.30.

Figure 2.3: Observed yearly claims frequencies $N_t/v_t$ from t = 1982 to 1991 compared to the estimated overall frequency of 5.43%. (lhs): 1 standard deviation confidence bounds Poisson case; (rhs): 1 standard deviation confidence bounds negative-binomial case.
We expect that it substantially differs from $\infty$, i.e. $\widehat{V}_T^2 > \widehat{\lambda}_T^{\mathrm{NB}} = \widehat{\lambda}_T^{\mathrm{MV}}$. We obtain $\widehat{V}_T^2 = 15.84 > 5.43\%$. Thus, we have a clear over-dispersion which results in the estimate
\[ \widehat{\gamma}_T^{\mathrm{NB}} = 56.23 \quad \text{and} \quad \widehat{\mathrm{Vco}}(N_t/v_t) = \sqrt{(\widehat{\lambda}_T^{\mathrm{NB}} v_t)^{-1} + (\widehat{\gamma}_T^{\mathrm{NB}})^{-1}} \approx 13\%. \]
If we calculate the estimated 1 standard deviation confidence bounds we obtain for all $t$ roughly
\[ \widehat{\mathrm{CI}}_t = (4.70\%, 6.15\%). \]
This makes much more sense in view of the observed frequencies $N_t/v_t$ in Table 2.1. We see that 7/10 of the observations are within these confidence bounds, see Figure 2.3 (rhs).
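The negative-binomial moment estimates can be checked in the same way. The following R sketch evaluates (2.7) and Estimator 2.30 on the data of Table 2.1; the variable names are assumptions made for this illustration.

```r
# Data of Table 2.1: volumes v_t and claims counts N_t, years 1982-1991.
v <- c(240755, 255571, 269739, 281708, 306888, 320265, 323481, 334753, 340265, 344757)
N <- c(13153, 14186, 14207, 13461, 21261, 19934, 15796, 15157, 17483, 19185)
n_years <- length(N)

lambda_hat <- sum(N) / sum(v)                                   # Estimator 2.28
V2 <- sum(v * (N / v - lambda_hat)^2) / (n_years - 1)           # statistic (2.7)

# Estimator 2.30, only meaningful if V2 > lambda_hat (over-dispersion).
gamma_hat <- lambda_hat^2 / (V2 - lambda_hat) *
  (sum(v) - sum(v^2) / sum(v)) / (n_years - 1)

vco_hat <- sqrt(1 / (lambda_hat * v) + 1 / gamma_hat)           # per-year Vco estimate
inside  <- sum(abs(N / v - lambda_hat) < lambda_hat * vco_hat)  # years within 1 sd bounds
```

The term $1/\widehat{\gamma}_T^{\mathrm{NB}}$ dominates $1/(\widehat{\lambda}_T^{\mathrm{NB}} v_t)$ here, so the estimated coefficient of variation is almost the same for every year.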
We close this example with a statistical test. In the previous example it was obvious that the Poisson model does not fit the data. In situations where this is less obvious we can use the following χ²-goodness-of-fit test.
Null hypothesis $H_0$: $N_t$ are independent and $\mathrm{Poi}(\lambda v_t)$ distributed for $t = 1, \ldots, T$.
We are going to build a test statistic for the evaluation of this null hypothesis $H_0$. We define
\[ \chi^* = \chi^*(\mathbf{N}) = \sum_{t=1}^T \frac{(N_t/v_t - \lambda)^2}{\lambda/v_t}. \]

It is not straightforward to determine the explicit distribution function of $\chi^*$. Therefore, we give an approximate answer to this hypothesis testing problem. The aggregation and disjoint decomposition theorems (Theorems 2.12 and 2.14) imply that $N_t \sim \mathrm{Poi}(\lambda v_t)$ can be understood as a sum of $v_t$ i.i.d. random variables $X_i \sim \mathrm{Poi}(\lambda)$. That is,
\[ N_t \overset{(d)}{=} \sum_{i=1}^{v_t} X_i, \]
with $E[X_1] = \lambda$ and $\mathrm{Var}(X_1) = \lambda$. But then the CLT (1.2) applies with
\[ \widetilde{Z}_t = \frac{N_t/v_t - \lambda}{\sqrt{\lambda/v_t}} = \frac{N_t - \lambda v_t}{\sqrt{\lambda v_t}} \overset{(d)}{=} \frac{\sum_{i=1}^{v_t} X_i - \lambda v_t}{\sqrt{\lambda v_t}} \Rightarrow \mathcal{N}(0, 1) \quad \text{as } v_t \to \infty. \]
This explains that $\widetilde{Z}_t$ can be approximated in distribution by a standard Gaussian random variable $Z_t \sim \mathcal{N}(0, 1)$ for $v_t$ sufficiently large.
Next, if we assume that $Z_1, \ldots, Z_T \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ then a standard result in statistics says that $\sum_{t=1}^T Z_t^2$ has a χ²-distribution with $T$ degrees of freedom, denoted by $\chi_T^2$, see also Exercise 2 on page 21. Therefore, we obtain the asymptotic approximation in distribution
\[ \chi^* = \chi^*(\mathbf{N}) = \sum_{t=1}^T \frac{(N_t/v_t - \lambda)^2}{\lambda/v_t} = \sum_{t=1}^T \widetilde{Z}_t^2 \overset{(d)}{\approx} \sum_{t=1}^T Z_t^2 \sim \chi_T^2. \]
In the last step we need to replace the unknown parameter $\lambda$ by its estimate $\widehat{\lambda}_T^{\mathrm{MLE}}$. By doing so, we lose one degree of freedom, thus, we get the test statistic $\widehat{\chi}^*$ and the corresponding distributional approximation
\[ \widehat{\chi}^* = \sum_{t=1}^T v_t\, \frac{\left( N_t/v_t - \widehat{\lambda}_T^{\mathrm{MLE}} \right)^2}{\widehat{\lambda}_T^{\mathrm{MLE}}} \overset{(d)}{\approx} \chi_{T-1}^2. \]
For the data in Table 2.1 we obtain $\widehat{\chi}^*$ = 2’627. The 99%-quantile of the χ²-distribution with $T - 1 = 9$ degrees of freedom is given by 21.67. Since this is by far smaller than $\widehat{\chi}^*$ we reject the null hypothesis $H_0$ on the significance level of q = 1%. This, of course, is not surprising in view of Figure 2.3 (lhs).
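The test can be carried out in R as follows. This is a sketch under the assumptions of the text (Poisson null hypothesis, χ² approximation with $T - 1$ degrees of freedom); the variable names are chosen here for illustration.

```r
# Data of Table 2.1: volumes v_t and claims counts N_t, years 1982-1991.
v <- c(240755, 255571, 269739, 281708, 306888, 320265, 323481, 334753, 340265, 344757)
N <- c(13153, 14186, 14207, 13461, 21261, 19934, 15796, 15157, 17483, 19185)

lambda_hat <- sum(N) / sum(v)                            # Poisson MLE
chi_stat   <- sum(v * (N / v - lambda_hat)^2) / lambda_hat   # test statistic
df         <- length(N) - 1                              # one parameter estimated
crit       <- qchisq(0.99, df = df)                      # 99% quantile of chi^2_9
reject     <- chi_stat > crit                            # TRUE means H0 is rejected
p_value    <- pchisq(chi_stat, df = df, lower.tail = FALSE)
```

For Exercise 3 the same lines can be reused after replacing `v` and `N` by the data of Table 2.2 and the level 0.99 by 0.95.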
Exercise 3. Consider the data given in Table 2.2. Estimate the parameters for the Poisson and the negative-binomial models. Which model is preferred? Does a χ²-goodness-of-fit test reject the null hypothesis on the 5% significance level of having Poisson distributions?

t     1        2        3        4        5        6        7        8        9        10
N_t   1’000    997      985      989      1’056    1’070    994      986      1’093    1’054
v_t   10’000   10’000   10’000   10’000   10’000   10’000   10’000   10’000   10’000   10’000
Table 2.2: Observed claims counts $N_t$ and corresponding volumes $v_t$.
Chapter 3
Individual Claim Size Modeling
In Model Assumptions 2.1 we have introduced the compound distribution
\[ S = Y_1 + Y_2 + \ldots + Y_N = \sum_{i=1}^N Y_i, \]
with the three standard assumptions
1. $N$ is a discrete random variable which takes values in $A \subset \mathbb{N}_0$;
2. $Y_1, Y_2, \ldots \overset{\text{i.i.d.}}{\sim} G$ with $G(0) = 0$;
3. $N$ and $(Y_1, Y_2, \ldots)$ are independent.
In Chapter 2 we have discussed the modeling of the claims count distribution of $N$. In this chapter we concentrate on the modeling of the individual claim sizes $Y_i$.

To get an understanding for the modeling of $G$ we present an analysis based on two explicit data sets. The first data set is a private property (PP) insurance data set that consists of 72’769 claims records. The second data set is a commercial property (CP) insurance data set that consists of 18’285 claims records.
Before presenting sophisticated mathematical modeling methods we analyze these
two data sets using tools from descriptive statistics.
3.1 Data analysis

The first observation is that the two data sets contain many claims records with
zero claims payments. That is, many of the recorded claims were settled without
any payments. In the case of PP insurance these were about 16% of the reported
claims and in the case of CP insurance we observe about 21% of zero claims. Zero
claims are due to reasons such as: the final claim does not exceed the deductible,
the insurance company is not liable for the claim, another insurance policy covers
the claim, reporting a (small) claim reduces the no-claims-benefit too much so that
the insured decides to withdraw the claim, etc.
We can deal with such zero claims in two different ways: (i) estimate the probability of a zero claim separately and add a probability weight to $G$ at 0; (ii) simply reduce the expected claims frequency $\lambda$ by these zero claims. The first way (i) is mathematically consistent, but contradicts our model assumption $G(0) = 0$; the second way (ii) perfectly fits into the compound Poisson modeling framework due to the disjoint decomposition Theorem 2.14. In general, the second version is the simpler one to deal with (however, one may lose some information). Therefore, we assume that $G(0) = 0$ and $E[N] = \lambda v$, where $v > 0$ is the portfolio size and the expected claims frequency $\lambda > 0$ only assesses strictly positive claims. Hence, after subtracting these zero claims we have n = 61’053 claims records in PP insurance and n = 14’532 in CP insurance.
We start with the scatter plots of the data, see Figures 3.1 and 3.2. We plot the individual claim sizes (ordered by arrival date) both on the original scale (lhs) and on the log scale (rhs). These scatter plots do not offer much information because they are overloaded; at least they do not show any obvious trends.

Figure 3.1: Scatter plot of the n = 61’053 strictly positive claims records of PP insurance ordered by arrival date: original scale (lhs) and log scale (rhs).

We calculate the sample means and the sample variances of the observations, see also (2.5),
\[ \widehat{\mu}_n = \frac{1}{n} \sum_{i=1}^n Y_i \quad \text{and} \quad \widehat{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \widehat{\mu}_n)^2. \]
For our data sets we obtain
PP: $\widehat{\mu}_n^{\mathrm{PP}}$ = 3’116, $\widehat{\sigma}_n^{\mathrm{PP}}$ = 7’534, $\widehat{\mathrm{Vco}}_n^{\mathrm{PP}}$ = 2.42; (3.1)
CP: $\widehat{\mu}_n^{\mathrm{CP}}$ = 6’850, $\widehat{\sigma}_n^{\mathrm{CP}}$ = 28’505, $\widehat{\mathrm{Vco}}_n^{\mathrm{CP}}$ = 4.16. (3.2)

Figure 3.2: Scatter plot of the n = 14’532 strictly positive claims records of CP insurance ordered by arrival date: original scale (lhs) and log scale (rhs).

Figure 3.3: Histogram of the n = 61’053 strictly positive claims records of PP insurance: original scale (lhs) and log scale (rhs).
Next we give the histogram for PP insurance, see Figure 3.3 (lhs). We see that a
few large claims distort the whole picture so that the histogram is not helpful. We
could plot a second one only considering small claims. In Figure 3.3 (rhs) we plot
the histogram for logged claim sizes. In Figure 3.4 we give the corresponding box
plots, they show positive skewness. The ultimate goal is to have the full distribution
functions G(y) = P[Y ≤ y] of the two claims classes PP and CP. Since we have
so many observations we could directly work (at least for small claims, see Section 3.4.1) with the empirical distribution function which is given by
\[ \widehat{G}_n(y) = \frac{1}{n} \sum_{i=1}^n 1_{\{Y_i \le y\}}. \]

Figure 3.4: Box plots of claims records of PP and CP insurance: original scale (lhs) and log scale (rhs).

The empirical distribution function with logged claim sizes is given in Figure 3.5 (lhs).

Figure 3.5: Empirical distribution functions $\widehat{G}_n$ of PP and CP insurance on log scale (lhs) and corresponding empirical loss size index functions $\widehat{I}_n$ (rhs).

For a sequence of observations $Y_1, \ldots, Y_n$ we denote the ordered sample by $Y_{(1)} \le Y_{(2)} \le \ldots \le Y_{(n)}$. For the next definitions we assume that $Y \sim G$ has finite mean. We define the loss size index function and the empirical loss size index function by
\[ I(G(y)) = \frac{\int_0^y z\, dG(z)}{\int_0^\infty z\, dG(z)} \quad \text{and} \quad \widehat{I}_n(\alpha) = \frac{\sum_{i=1}^{\lfloor n\alpha \rfloor} Y_{(i)}}{\sum_{i=1}^n Y_i}, \]
for $\alpha \in [0, 1]$. The loss size index function $I(G(y))$ chooses a claim size threshold $y$ and then it evaluates the relative expected claim amount that can be explained by claim sizes up to this threshold $y$. The resulting empirical graphs are presented in Figure 3.5 (rhs). Rather typically in non-life insurance we see that the 20% largest claims roughly cause 75% of the total claim size! Basically, this explains that we need to understand large claims well because they heavily influence the total claim amount.
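The empirical loss size index function is straightforward to evaluate. The following R sketch is an illustration only; the function name `loss_size_index` and the toy claim sizes are assumptions made here.

```r
# Empirical loss size index: share of the total claim amount that is caused
# by the floor(n * alpha) smallest claims.
loss_size_index <- function(y, alpha) {
  y_sorted <- sort(y)                    # ordered sample Y_(1) <= ... <= Y_(n)
  k <- floor(length(y) * alpha)          # number of smallest claims considered
  sum(y_sorted[seq_len(k)]) / sum(y)     # seq_len(0) gives an empty sum, i.e. 0
}

# Hypothetical toy sample with two large claims (total claim amount 100).
y <- c(1, 1, 2, 2, 3, 3, 4, 4, 30, 50)
loss_size_index(y, 0.8)   # 0.2: the 20% largest claims cause 80% of the total
```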
We have already seen in the previous figures that large claims may lead to several modeling difficulties. Two plots that especially focus on large claims are the mean excess plot and the log-log plot. We define the mean excess function and empirical mean excess function by
\[ e(u) = E[Y_i - u \mid Y_i > u] \quad \text{and} \quad \widehat{e}_n(u) = \frac{\sum_{i=1}^n (Y_i - u)\, 1_{\{Y_i > u\}}}{\sum_{i=1}^n 1_{\{Y_i > u\}}}. \]
The (empirical) mean excess plot is obtained by
\[ u \mapsto e(u) \quad \text{and} \quad u \mapsto \widehat{e}_n(u), \]
and the (empirical) log-log plot by
\[ y \mapsto \left( \log y, \log(1 - G(y)) \right) \quad \text{and} \quad y \mapsto \left( \log y, \log(1 - \widehat{G}_n(y)) \right). \]

Figure 3.6: Empirical log-log plot (lhs) and empirical mean excess plot (rhs) of PP and CP insurance data.

In Figure 3.6 we present the empirical log-log and mean excess plots of the two
data sets. Linear decrease in the log-log plot and linear increase in the mean excess
plot will have the interpretation of heavy tailed distributions in the sense that the
survival function Ḡ = 1 − G is regularly varying at infinity.
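Both diagnostics can be computed directly from a sample. The R sketch below is an illustration with a hypothetical toy sample; the function name `mean_excess` is an assumption, ties in the data are ignored, and the largest observation is dropped from the log-log coordinates because the empirical survival function is zero there.

```r
# Empirical mean excess function: average excess over u of all claims above u.
mean_excess <- function(y, u) {
  exceed <- y[y > u]
  if (length(exceed) == 0) return(NA_real_)   # undefined above the largest claim
  mean(exceed - u)
}

y <- c(1, 2, 3, 4, 10)     # hypothetical toy claim sizes
mean_excess(y, 2)          # (1 + 2 + 8) / 3

# Coordinates of the empirical log-log plot at the ordered observations.
ys   <- sort(y)
n    <- length(ys)
surv <- 1 - seq_len(n) / n                    # 1 - G_n(Y_(i)), equals 0 for i = n
loglog <- cbind(log_y = log(ys[-n]), log_surv = log(surv[-n]))
```

Plotting `loglog` (for example with `plot(loglog)`) and `mean_excess` evaluated over a grid of thresholds gives the two panels of a figure such as Figure 3.6.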
3.2 Selected parametric claims size distributions

In this section we introduce the most popular parametric claim size distributions. We only consider distribution functions $G$ with unbounded support in $\mathbb{R}_+$. We use the following notation for a random variable $Y \sim G$:
$g$  density of $Y$ for $G$ absolutely continuous,
$M_Y(r)$  moment generating function of $Y$ in $r \in \mathbb{R}$, where it exists,
$\mu_Y$  expected value of $Y$, if it exists,
$\sigma_Y^2$  variance of $Y$, if it exists,
$\mathrm{Vco}(Y)$  coefficient of variation of $Y$, if it exists,
$\varsigma_Y$  skewness of $Y$, if it exists,
$\bar{G} = 1 - G$  survival function of $Y$, i.e. $\bar{G}(y) = P[Y > y]$.
For the analysis of $G$ also the following quantities are of interest:
$E[Y 1_{\{u_1 < Y \le u_2\}}]$  expected value of $Y$ within layer $(u_1, u_2]$,
$I(G(y)) = E[Y 1_{\{Y \le y\}}]/\mu_Y$  loss size index function for level $y$,
$e(u) = E[Y - u \mid Y > u]$  mean excess function of $Y$ above $u$.
If $G$ depends on the parameter $\vartheta$ and we have i.i.d. observations $Y_i \sim G$ then we can estimate this parameter from the data. The method of moments estimator is denoted by $\widehat{\vartheta}^{\mathrm{MM}}$ and the maximum likelihood estimator by $\widehat{\vartheta}^{\mathrm{MLE}}$, see also Section 2.3. Note that if one estimates this parameter one should also try to assess the precision of the estimate.

For the analysis of the tail of the distribution we consider the property of regular variation at infinity. Therefore, we assume that $G$ has an infinite support and we say that the survival function $\bar{G} = 1 - G$ is regularly varying at infinity with (tail) index $\alpha > 0$, we write $\bar{G} \in \mathcal{R}_{-\alpha}$, if for all $t > 0$
\[ \lim_{x \to \infty} \frac{1 - G(xt)}{1 - G(x)} = t^{-\alpha}. \qquad (3.3) \]
If the above holds true for $\alpha = 0$ then we say $\bar{G}$ is slowly varying at infinity and we write $\bar{G} \in \mathcal{R}_0$. From an insurance point of view distribution functions $G$ with $\bar{G} \in \mathcal{R}_{-\alpha}$ for some $\alpha > 0$ are dangerous because they have a large potential for big claims, see Chapter 3 in Embrechts et al. [36]. Therefore, it is crucial to know this index of regular variation at infinity, see also Remarks 5.17.
3.2.1 Gamma distribution
The gamma distribution has two parameters, a shape parameter $\gamma > 0$ and a scale parameter $c > 0$. We write $Y \sim \Gamma(\gamma, c)$. The distribution function of $Y$ has positive support $\mathbb{R}_+$ with density for $y \ge 0$ given by
\[ g(y) = \frac{c^\gamma}{\Gamma(\gamma)}\, y^{\gamma - 1} \exp\{-cy\}. \]
There is no closed form solution for the distribution function $G$. For $y \ge 0$ it can only be expressed as
\[ G(y) = \int_0^y \frac{c^\gamma}{\Gamma(\gamma)}\, x^{\gamma - 1} e^{-cx}\, dx = \frac{1}{\Gamma(\gamma)} \int_0^{cy} z^{\gamma - 1} e^{-z}\, dz = \mathcal{G}(\gamma, cy), \]
where $\mathcal{G}(\cdot, \cdot)$ is the incomplete gamma function. From this we see that the family of gamma distributions is closed under multiplication with a positive constant, that is, for $\rho > 0$ we have
\[ \rho Y \sim \Gamma(\gamma, c/\rho). \qquad (3.4) \]
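This property is important when we deal with claims inflation. For the moment generating function and the moments we have
\[ M_Y(r) = \left( \frac{c}{c - r} \right)^{\gamma} \quad \text{for } r < c, \]
\[ \mu_Y = \frac{\gamma}{c}, \qquad \sigma_Y^2 = \frac{\gamma}{c^2}, \]
\[ \mathrm{Vco}(Y) = \gamma^{-1/2}, \qquad \varsigma_Y = 2\gamma^{-1/2} > 0. \]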

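For $0 \le u_1 < u_2$ and $u, y > 0$ we obtain
\[ E[Y 1_{\{u_1 < Y \le u_2\}}] = \frac{\gamma}{c} \left[ \mathcal{G}(\gamma + 1, cu_2) - \mathcal{G}(\gamma + 1, cu_1) \right], \]
\[ I(G(y)) = \mathcal{G}(\gamma + 1, cy), \]
\[ e(u) = \frac{\gamma}{c} \left( \frac{1 - \mathcal{G}(\gamma + 1, cu)}{1 - \mathcal{G}(\gamma, cu)} \right) - u. \]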
Exercise 4. Assume $Y \sim \Gamma(\gamma, c)$.
• Prove the statements of the moment generating function $M_Y$ and the loss size index function $I(G(y))$. Hint: use the trick of the proof of Proposition 2.20.
• Prove the statements
\[ e(u) = \mu_Y\, \frac{1 - I(G(u))}{1 - G(u)} - u, \qquad E[Y 1_{\{u_1 < Y \le u_2\}}] = \mu_Y \left( I(G(u_2)) - I(G(u_1)) \right). \]
The gamma distribution does not have a regularly varying tail at infinity, see Table 3.4.4 in Embrechts et al. [36]. In fact, $\bar{G}(y) = 1 - G(y)$ decays roughly as $\exp\{-cy\}$ to 0 as $y \to \infty$ because $\exp\{-cy\}$ gives an asymptotic lower bound and $\exp\{-(c - \varepsilon)y\}$ an asymptotic upper bound for any $\varepsilon > 0$ on $\bar{G}(y)$.
For generating gamma random numbers in R the following code is used (n stands for the number of random numbers to be generated)
> rgamma(n, shape=γ, rate=c).
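The method of moments estimators are given by
\[ \widehat{c}^{\,\mathrm{MM}} = \frac{\widehat{\mu}_n}{\widehat{\sigma}_n^2} \quad \text{and} \quad \widehat{\gamma}^{\mathrm{MM}} = \frac{\widehat{\mu}_n^2}{\widehat{\sigma}_n^2}. \]
For the MLE we have log-likelihood function, set $\mathbf{Y} = (Y_1, \ldots, Y_n)'$,
\[ \ell_{\mathbf{Y}}(\gamma, c) = \sum_{i=1}^n \left[ \gamma \log c - \log \Gamma(\gamma) + (\gamma - 1) \log Y_i - cY_i \right]. \]
The MLE $\widehat{\gamma}^{\mathrm{MLE}}$ of $\gamma$ is the solution of
\[ \log \gamma - \log \widehat{\mu}_n - \frac{\Gamma'(\gamma)}{\Gamma(\gamma)} + \frac{1}{n} \sum_{i=1}^n \log Y_i = 0. \qquad (3.5) \]
This is solved numerically, and the MLE for $c$ is then given by
\[ \widehat{c}^{\,\mathrm{MLE}} = \frac{\widehat{\gamma}^{\mathrm{MLE}}}{\widehat{\mu}_n}. \]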

For the numerical solution in R one can use the command (from the R package MASS)
> fitdistr(data, "gamma").
The numerical fitting does not always work when the range of observations $\mathbf{Y}$ is too large. In such cases it is recommended that in the first step the data is scaled by a constant factor $\rho > 0$, this can be done due to (3.4), and parameters are estimated for the scaled data; in the second step the estimate of $c$ is scaled back by the same factor. An alternative way is to explicitly program the function given in (3.5) and then apply the root search command uniroot(). The term $\Gamma'(\gamma)/\Gamma(\gamma)$ is calculated with digamma(), see also Section 3.9.5 in Kaas et al. [57].
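The alternative via uniroot() and digamma() can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `gamma_mle`, the search interval and the simulated test data are assumptions made here.

```r
# MLE for the gamma distribution by solving equation (3.5) with a root search.
gamma_mle <- function(y) {
  # s = log(mean) - mean(log) is positive by Jensen's inequality
  # (unless all observations are equal), so the root below is unique.
  s <- log(mean(y)) - mean(log(y))
  score <- function(g) log(g) - digamma(g) - s      # left-hand side of (3.5)
  g_hat <- uniroot(score, interval = c(1e-3, 1e3), tol = 1e-10)$root
  c(shape = g_hat, rate = g_hat / mean(y))          # c_hat = gamma_hat / mu_hat
}

# Check on simulated data with known parameters (assumed example values).
set.seed(1)
y   <- rgamma(5000, shape = 2, rate = 0.5)
est <- gamma_mle(y)
```

The search interval (0.001, 1000) for the shape parameter is an assumption; data with an extremely small coefficient of variation would need a wider upper bound.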
Remark 3.1 (exponential and χ²-distributions). The special case $\gamma = 1$ is the exponential distribution with parameter $c > 0$ denoted by expo($c$). The special case $\gamma = k/2$ and $c = 1/2$ is the χ²-distribution with $k \in \mathbb{N}$ degrees of freedom, see Exercise 2 on page 21.
Example 3.2 (gamma distribution for PP data). We fit the gamma distribution to the PP insurance data displayed in Figure 3.1.

Figure 3.7: Gamma distribution with MM and MLE fits applied to the PP insurance data. lhs: QQ plot; rhs: loss size index function.

Figure 3.8: Gamma distribution with MM and MLE fits applied to the PP insurance data. lhs: log-log plot; rhs: mean excess plot.

From Figures 3.7 and 3.8 we immediately conclude that the gamma model does not fit the PP data. The reason is that the data is too heavy tailed, which can be seen in the QQ plot in Figure 3.7 (lhs): the data at the right end of the distribution lies substantially above the line. The MM estimators manage to model the data up to some layer, the MLE estimators, however, are heavily distorted by the small claims which can be seen in the mean excess plot in Figure 3.8 (rhs). In fact, we have too many small claims (observations below 1’500). The MLE is heavily based on these small observations, in Figure 3.7 (rhs) and Figure 3.8 (lhs) we see that MLE fits well for small claims, whereas MM provides more appropriate results in the upper range of the data. Summarizing, we should choose more heavy tailed distribution functions to model this data and the resulting figures are already sufficient for rejecting the model.
Remark 3.3 (inverse Gaussian distribution). A distribution function which is also found quite often in the actuarial literature is the inverse Gaussian distribution, see for instance Section 3.9.6 in Kaas et al. [57]. Its density is given by
\[ g(y) = \frac{\alpha}{\sqrt{2\pi c}}\, y^{-3/2} \exp\left\{ -\frac{(\alpha - cy)^2}{2cy} \right\} = \frac{\alpha}{\sqrt{2\pi c}}\, y^{-3/2} \exp\left\{ -\frac{\alpha^2}{2cy} + \alpha - \frac{1}{2}\, cy \right\}, \]
where $\alpha > 0$ is a shape parameter and $c > 0$ a scale parameter. Observe that this density behaves similarly to the gamma density for $y \to \infty$. For the distribution function we have a closed form solution
\[ G(y) = \Phi\left( -\frac{\alpha}{\sqrt{cy}} + \sqrt{cy} \right) + e^{2\alpha}\, \Phi\left( -\frac{\alpha}{\sqrt{cy}} - \sqrt{cy} \right), \]
where $\Phi(\cdot)$ is the standard Gaussian distribution function. This can easily be checked by calculating the derivative of the latter. For the moment generating function and

the moments we have
\[ M_Y(r) = \exp\left\{ \alpha \left[ 1 - (1 - 2r/c)^{1/2} \right] \right\} \quad \text{for } r \le c/2, \]
\[ \mu_Y = \frac{\alpha}{c}, \qquad \sigma_Y^2 = \frac{\alpha}{c^2}, \]
\[ \mathrm{Vco}(Y) = \alpha^{-1/2}, \qquad \varsigma_Y = 3\alpha^{-1/2} > 0. \]

Figure 3.9: Inverse Gaussian distribution with MM and MLE fits applied to the PP insurance data: QQ plot.

From this we can calculate the MM estimators $\widehat{\alpha}^{\mathrm{MM}}$ and $\widehat{c}^{\,\mathrm{MM}}$, and the MLE estimators are given by
\[ \widehat{\alpha}^{\mathrm{MLE}} = \left[ \left( \frac{1}{n} \sum_{i=1}^n Y_i^{-1} \right) \left( \frac{1}{n} \sum_{i=1}^n Y_i \right) - 1 \right]^{-1} \quad \text{and} \quad \widehat{c}^{\,\mathrm{MLE}} = \frac{\widehat{\alpha}^{\mathrm{MLE}}}{\widehat{\mu}_n}. \]
In Figure 3.9 we see that this leads to an improvement of the fit compared to the gamma distribution. Overall it is still not convincing, especially in the tails, and because the inverse Gaussian distribution is less handy than the ones that will be presented below we refrain from further discussing this distribution function.
3.2.2 Weibull distribution

The Weibull distribution is named after Ernst Hjalmar Waloddi Weibull (1887-1979). However, it was first identified by Maurice Fréchet (1878-1973) in 1927; Weibull was probably the first to describe the distribution in detail, in 1951.
The Weibull distribution has two parameters, a shape parameter $\tau > 0$ and a scale parameter $c > 0$. We write $Y \sim \mathrm{Weibull}(\tau, c)$. The distribution function of $Y$ has positive support $\mathbb{R}_+$ with density for $y \ge 0$ given by
\[ g(y) = (c\tau)\, (cy)^{\tau - 1} \exp\{-(cy)^\tau\}. \]
We are especially interested in $\tau \in (0, 1)$ because this provides a slower decay of the survival function compared to the gamma distribution. For $y \ge 0$ we have
\[ G(y) = 1 - \exp\{-(cy)^\tau\}, \]
which still does not provide regularly varying tails at infinity but the decay of the survival function $\bar{G}$ is slower than in the gamma case for $\tau < 1$. The family of Weibull distributions is closed under multiplication with positive constants, that is, for $\rho > 0$ we have
\[ \rho Y \sim \mathrm{Weibull}(\tau, c/\rho). \]
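The moment generating function and the moments are given by
\[ M_Y(r) \text{ does not exist for } \tau < 1, \]
\[ \mu_Y = \frac{\Gamma(1 + 1/\tau)}{c}, \]
\[ \sigma_Y^2 = \frac{\Gamma(1 + 2/\tau)}{c^2} - \mu_Y^2, \]
\[ \varsigma_Y = \frac{1}{\sigma_Y^3} \left[ \frac{\Gamma(1 + 3/\tau)}{c^3} - 3\mu_Y \sigma_Y^2 - \mu_Y^3 \right]. \]
For $0 \le u_1 < u_2$ and $u, y > 0$ we obtain
\[ E[Y 1_{\{u_1 < Y \le u_2\}}] = \frac{\Gamma(1 + 1/\tau)}{c} \left[ \mathcal{G}(1 + 1/\tau, (cu_2)^\tau) - \mathcal{G}(1 + 1/\tau, (cu_1)^\tau) \right], \]
\[ I(G(y)) = \mathcal{G}(1 + 1/\tau, (cy)^\tau), \]
\[ e(u) = \frac{\Gamma(1 + 1/\tau)}{c} \left( \frac{1 - \mathcal{G}(1 + 1/\tau, (cu)^\tau)}{\exp\{-(cu)^\tau\}} \right) - u. \]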
The Weibull distribution does not have regularly varying tails at infinity, see Table
3.4.4 in Embrechts et al. [36]. In fact, the survival function Ḡ(y) = 1 − G(y) decays
as exp{−(cy)τ } to 0 for y → ∞.
For generating Weibull random numbers observe that we have the identity
$Y \overset{(d)}{=} \frac{1}{c} Z^{1/\tau}$ with $Z \sim \mathrm{expo}(1) \overset{(d)}{=} \Gamma(1, 1)$.
The R code for the Γ(1, 1) distribution is
> rgamma(n, shape=1, rate=1)

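The MM estimators are obtained from
$$
\widehat{c}^{\mathrm{MM}} = \frac{\Gamma(1 + 1/\widehat{\tau}^{\mathrm{MM}})}{\widehat{\mu}_n}
\quad \text{and} \quad
\frac{\widehat{\sigma}_n^2}{\widehat{\mu}_n^2} + 1 = \frac{\Gamma(1 + 2/\widehat{\tau}^{\mathrm{MM}})}{\Gamma(1 + 1/\widehat{\tau}^{\mathrm{MM}})^2}.
$$
The latter needs to be solved numerically in R:
# second MM equation on the log scale; uniroot() returns a list, the solution is in $root
f <- function(x,a){lgamma(1+2/x)-2*lgamma(1+1/x)-log(a+1)}
tau <- uniroot(f, c(0.0001,1), tol=0.0001, a=var(data)/mean(data)^2)$root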
For the MLE we need to solve the system of equations (numerically)
$$
c = \left(\frac{1}{n}\sum_{i=1}^n Y_i^{\tau}\right)^{-1/\tau}
\quad \text{and} \quad
\frac{1}{n}\sum_{i=1}^n \tau \log(cY_i)\left((cY_i)^{\tau} - 1\right) = 1.
$$
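The simulation identity and both estimation procedures can be sketched in R as follows. This is only a minimal sketch on simulated data: the parameter values, the sample size, the search interval (0.1, 10) and the helper names `mm_eq` and `mle_eq` are illustrative assumptions and are not taken from the PP data.

```r
set.seed(1)
tau0 <- 0.7; c0 <- 2; n <- 10000            # assumed "true" parameters
# simulate via Y = Z^(1/tau) / c with Z ~ expo(1) = Gamma(1, 1)
y <- rgamma(n, shape = 1, rate = 1)^(1 / tau0) / c0

# method of moments: solve the second MM equation (log scale) for tau
mm_eq <- function(tau, a) lgamma(1 + 2 / tau) - 2 * lgamma(1 + 1 / tau) - log(a + 1)
tau_mm <- uniroot(mm_eq, c(0.1, 10), a = var(y) / mean(y)^2)$root
c_mm   <- gamma(1 + 1 / tau_mm) / mean(y)

# MLE: insert c(tau) = (mean(Y^tau))^(-1/tau) into the second equation, solve for tau
mle_eq <- function(tau) {
  cc <- mean(y^tau)^(-1 / tau)
  tau * mean(log(cc * y) * ((cc * y)^tau - 1)) - 1
}
tau_mle <- uniroot(mle_eq, c(0.1, 10))$root
c_mle   <- mean(y^tau_mle)^(-1 / tau_mle)

round(c(tau_mm = tau_mm, c_mm = c_mm, tau_mle = tau_mle, c_mle = c_mle), 3)
```

Reducing the two MLE equations to a one-dimensional root search in τ keeps the numerical problem simple; for real data the search interval may need to be adapted.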
Example 3.4 (Weibull distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. From Figures 3.10 and 3.11 we see that the Weibull model
gives a better fit to the PP data than the gamma model. The reason is that
it allows for more probability mass in the tail of the distribution: the estimate for
τ lies in the interval (0.5, 0.75). The MM estimators manage to model the data up
to some layer. The MLEs, however, are still distorted by the big mass
of small claims, which can be seen in the mean excess plot in Figure 3.11 (rhs).
Summarizing, we should choose even more heavy-tailed distributions to model this
data, and we should carefully treat large claims.
Figure 3.10: Weibull distribution with MM and MLE fits applied to the PP insurance
data. lhs: QQ plot; rhs: loss size index function.
Figure 3.11: Weibull distribution with MM and MLE fits applied to the PP insurance
data. lhs: log-log plot; rhs: mean excess plot.
3.2.3 Log-normal distribution

Making the tail of the distribution function heavier than the Weibull distribution
tail leads us to the log-normal distribution. The log-normal distribution has two
parameters, a mean parameter µ ∈ R and a standard deviation parameter σ > 0.
We write Y ∼ LN(µ, σ²). The log-normal distribution has the property that
log Y ∼ N(µ, σ²). Therefore, almost every crucial quantity can be obtained from
normal distributions. The distribution of Y has positive support R+ with density
for y ≥ 0 given by
$$
g(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\frac{1}{y}\,\exp\left\{-\frac{1}{2}\,\frac{(\log y - \mu)^2}{\sigma^2}\right\}.
$$

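For y ≥ 0 we have distribution function
$$
G(y) = \Phi\left(\frac{\log y - \mu}{\sigma}\right),
$$
with Φ(·) denoting the standard Gaussian distribution function. The family of
log-normal distributions is closed towards multiplication with a positive constant,
that is, for ρ > 0 we have
$$
\rho Y \sim \mathrm{LN}(\mu + \log \rho, \sigma^2).
$$
We have
$$
\begin{aligned}
M_Y(r) &\ \text{does not exist},\\
\mu_Y &= \exp\left\{\mu + \sigma^2/2\right\},\\
\sigma_Y^2 &= \exp\left\{2\mu + \sigma^2\right\}\left(\exp\{\sigma^2\} - 1\right),\\
\mathrm{Vco}(Y) &= \left(\exp\{\sigma^2\} - 1\right)^{1/2},\\
\varsigma_Y &= \left(\exp\{\sigma^2\} + 2\right)\left(\exp\{\sigma^2\} - 1\right)^{1/2}.
\end{aligned}
$$
$$
\begin{aligned}
E\left[Y 1_{\{u_1 < Y \le u_2\}}\right] &= \mu_Y\left[\Phi\left(\frac{\log u_2 - (\mu + \sigma^2)}{\sigma}\right) - \Phi\left(\frac{\log u_1 - (\mu + \sigma^2)}{\sigma}\right)\right],\\
I(G(y)) &= \Phi\left(\frac{\log y - (\mu + \sigma^2)}{\sigma}\right),\\
e(u) &= \mu_Y\,\frac{1 - \Phi\left(\frac{\log u - (\mu + \sigma^2)}{\sigma}\right)}{1 - \Phi\left(\frac{\log u - \mu}{\sigma}\right)} - u.
\end{aligned}
$$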
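The log-normal distribution does not have a regularly varying survival function at
infinity, see Table 3.4.4 in Embrechts et al. [36]. For generating log-normal random
numbers we simply choose standard Gaussian random numbers Z ∼ Φ and then
set Y = exp{µ + σZ}.
The MM estimators are given by
$$
\widehat{\sigma}^{\mathrm{MM}} = \left[\log\left(\frac{\widehat{\sigma}_n^2}{\widehat{\mu}_n^2} + 1\right)\right]^{1/2}
\quad \text{and} \quad
\widehat{\mu}^{\mathrm{MM}} = \log \widehat{\mu}_n - (\widehat{\sigma}^{\mathrm{MM}})^2/2.
$$
The MLE is given by
$$
\widehat{\mu}^{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^n \log Y_i
\quad \text{and} \quad
(\widehat{\sigma}^{\mathrm{MLE}})^2 = \frac{1}{n}\sum_{i=1}^n \left(\log Y_i - \widehat{\mu}^{\mathrm{MLE}}\right)^2.
$$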
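Both estimators are available in closed form, which the following R sketch illustrates on simulated data. The parameter values and the sample size are assumptions chosen for illustration only.

```r
set.seed(1)
mu0 <- 7; sigma0 <- 1.2; n <- 10000         # assumed "true" parameters
y <- exp(mu0 + sigma0 * rnorm(n))           # Y = exp(mu + sigma * Z), Z standard Gaussian

# method of moments, based on sample mean and sample variance of Y
sigma_mm <- sqrt(log(var(y) / mean(y)^2 + 1))
mu_mm    <- log(mean(y)) - sigma_mm^2 / 2

# MLE: sample mean and (1/n) sample variance of log Y
mu_mle    <- mean(log(y))
sigma_mle <- sqrt(mean((log(y) - mu_mle)^2))

round(c(mu_mm = mu_mm, sigma_mm = sigma_mm, mu_mle = mu_mle, sigma_mle = sigma_mle), 3)
```

The MM estimators react strongly to the largest observations because they are driven by the sample variance of Y, whereas the MLE only uses the logged claims.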

Example 3.5 (log-normal distribution for PP data). We fit the PP insurance data
displayed in Figure 3.1. In Figures 3.12 and 3.13 we present the results. We observe
that the log-normal distribution gives quite a good fit. We give some comments
on the plots: The MM estimator looks convincing because the observations match
the lines quite well. The only things that slightly disturb the picture are the three
largest observations, see QQ plot. It seems that they are less heavy tailed than the
log-normal distribution would suggest. This is also the reason why the empirical
mean excess plot deviates from the log-normal distribution, see Figure 3.13 (rhs).

A little bit puzzling is the bad performance of the MLE. The reason is again that
more than 50% of the claims are less than 1’500. The MLE therefore is very
much based on these small observations and provides a good fit in that range of
observations, but it gives a bad fit for larger claims. We conclude that this PP data
set should be modeled with different distributions in different layers. The reason
for this heterogeneity is that PP insurance contracts have different modules such as
theft, water damage, fire, etc. and it is recommended (if data allows) to model each
of these modules separately. This may also explain the abnormalities in the log-log
plot because these different modules, in general, have different maximal covers.
Figure 3.12: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.
Figure 3.13: Log-normal distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot.
3.2.4 Log-gamma distribution
The log-gamma distribution is more heavy tailed than the log-normal distribution
and is obtained by assuming that log Y ∼ Γ(γ, c) for positive parameters γ and c.
The density for y ≥ 1 is given by
$$
g(y) = \frac{c^{\gamma}}{\Gamma(\gamma)}\,(\log y)^{\gamma-1}\, y^{-(c+1)},
$$
and the distribution function can be written as
$$
G(y) = \mathcal{G}(\gamma, c \log y).
$$
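For the moment generating function and the moments we have
$$
\begin{aligned}
M_Y(r) &\ \text{does not exist},\\
\mu_Y &= \left(\frac{c}{c-1}\right)^{\gamma} \quad \text{for } c > 1,\\
\sigma_Y^2 &= \left(\frac{c}{c-2}\right)^{\gamma} - \mu_Y^2 \quad \text{for } c > 2,\\
\varsigma_Y &= \frac{1}{\sigma_Y^3}\left[\left(\frac{c}{c-3}\right)^{\gamma} - 3\mu_Y\sigma_Y^2 - \mu_Y^3\right] \quad \text{for } c > 3.
\end{aligned}
$$
$$
\begin{aligned}
E\left[Y 1_{\{u_1 < Y \le u_2\}}\right] &= \left(\frac{c}{c-1}\right)^{\gamma}\left[\mathcal{G}(\gamma, (c-1)\log u_2) - \mathcal{G}(\gamma, (c-1)\log u_1)\right],\\
I(G(y)) &= \mathcal{G}(\gamma, (c-1)\log y),\\
e(u) &= \left(\frac{c}{c-1}\right)^{\gamma}\left(\frac{1 - \mathcal{G}(\gamma, (c-1)\log u)}{1 - \mathcal{G}(\gamma, c\log u)}\right) - u.
\end{aligned}
$$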

The log-gamma distribution has a regularly varying survival function at infinity

with tail index c > 0, see Table 3.4.2 in Embrechts et al. [36].
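Figure 3.14: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: QQ plot; rhs: loss size index function.
Figure 3.15: Log-gamma distribution with MM and MLE fits applied to the PP
insurance data. lhs: log-log plot; rhs: mean excess plot.
The MM estimators are obtained from
$$
\widehat{\gamma}^{\mathrm{MM}} = \frac{\log \widehat{\mu}_n}{\log\left(\frac{\widehat{c}^{\mathrm{MM}}}{\widehat{c}^{\mathrm{MM}} - 1}\right)}
\quad \text{and} \quad
\frac{\log(\widehat{\sigma}_n^2 + \widehat{\mu}_n^2)}{\log \widehat{\mu}_n} = \frac{\log \widehat{c}^{\mathrm{MM}} - \log(\widehat{c}^{\mathrm{MM}} - 2)}{\log \widehat{c}^{\mathrm{MM}} - \log(\widehat{c}^{\mathrm{MM}} - 1)},
$$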
where the latter is solved numerically using, e.g., the R command uniroot().
The MLE is obtained analogously to the MLE for gamma observations by simply
replacing Yi by log Yi .
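The numerical solution of the MM equations for the log-gamma distribution may be sketched as follows. The helper `loggamma_mm` and the search interval are hypothetical choices; the check uses exact moments for assumed parameters γ = 6 and c = 8, whereas in practice one would insert the sample mean and the sample variance.

```r
# MM estimators for the log-gamma distribution from mean mu and variance s2
# (hypothetical helper; requires mu > 1 and a solution c > 2)
loggamma_mm <- function(mu, s2) {
  target <- log(s2 + mu^2) / log(mu)
  f <- function(cc) (log(cc) - log(cc - 2)) / (log(cc) - log(cc - 1)) - target
  c_mm <- uniroot(f, c(2 + 1e-6, 1e4), tol = 1e-10)$root
  gamma_mm <- log(mu) / (log(c_mm) - log(c_mm - 1))
  c(gamma = gamma_mm, c = c_mm)
}

# consistency check on exact moments for assumed parameters
gamma0 <- 6; c0 <- 8
mu <- (c0 / (c0 - 1))^gamma0
s2 <- (c0 / (c0 - 2))^gamma0 - mu^2
est <- loggamma_mm(mu, s2)
est
```

The left-hand side of the second MM equation exceeds 2 whenever the variance is positive, and the right-hand side decreases from infinity (c close to 2) towards 2 (c large), so a sign change exists on a sufficiently wide interval.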

Example 3.6 (log-gamma distribution for PP data). We fit the PP insurance
data displayed in Figure 3.1. From Figures 3.14 and 3.15 we conclude that the
log-gamma model provides the best fit to the data among the models considered so
far. As already commented on in the log-normal example, we see that probably
the only thing that does not entirely fit the log-gamma distribution are the 3 or
4 largest claims, which are less heavy tailed than the log-gamma distribution would
suggest. The tail index of regular variation is about $\widehat{c} = 5.8$ in this example.
Example 3.7 (resulting distribution functions). We close this section by providing
the cumulative distribution functions in Figure 3.16 and the histograms in Figure
3.17 for the distributions considered.
Figure 3.16 (lhs) shows the gamma, inverse Gaussian and Weibull fits, Figure 3.16
(rhs) the log-normal and log-gamma fits. In Figure 3.17 we give the histogram for
the Weibull, log-normal and log-gamma fits.
Figure 3.16: Cumulative distribution functions with logged claim sizes.
Figure 3.17: Histogram with logged claim sizes on the x axis. [Panel title: histogram
logged claim sizes PP insurance; x axis: logged claim sizes; y axis: count; legend:
Weibull, log-normal and log-gamma distribution.]

3.2.5 Pareto distribution
We have seen that large claims often need special treatment. Therefore, large
claims are often modeled separately with either a Pareto or a generalized Pareto
distribution. Here, we concentrate on the Pareto distribution. The Pareto
distribution is named after Vilfredo Federico Damaso Pareto (1848-1923) who
initially used this distribution to describe the allocation of wealth.
The Pareto distribution specifies a (large claims) threshold θ > 0 and then only
models claims above this threshold, see also Example 2.16. The claims above this
threshold are assumed to have regularly varying tails with tail index α > 0. For
Y ∼ Pareto(θ, α), the density for y ≥ θ is given by
$$
g(y) = \frac{\alpha}{\theta}\left(\frac{y}{\theta}\right)^{-(\alpha+1)},
$$
and the distribution function can be written as
$$
G(y) = 1 - \left(\frac{y}{\theta}\right)^{-\alpha}.
$$
We have closedness towards multiplication with a positive constant, that is, for
ρ > 0 we have
$$
\rho Y \sim \mathrm{Pareto}(\theta\rho, \alpha). \tag{3.6}
$$
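For the moment generating function and the moments we have
$$
\begin{aligned}
M_Y(r) &\ \text{does not exist},\\
\mu_Y &= \theta\,\frac{\alpha}{\alpha-1} \quad \text{for } \alpha > 1,\\
\sigma_Y^2 &= \theta^2\,\frac{\alpha}{(\alpha-1)^2(\alpha-2)} \quad \text{for } \alpha > 2,\\
\varsigma_Y &= \frac{2(1+\alpha)}{\alpha-3}\left(\frac{\alpha-2}{\alpha}\right)^{1/2} \quad \text{for } \alpha > 3.
\end{aligned}
$$
For α > 1, θ ≤ u1 < u2 and u, y > θ we obtain
$$
\begin{aligned}
E\left[Y 1_{\{u_1 < Y \le u_2\}}\right] &= \theta\,\frac{\alpha}{\alpha-1}\left[\left(\frac{u_1}{\theta}\right)^{-\alpha+1} - \left(\frac{u_2}{\theta}\right)^{-\alpha+1}\right],\\
I(G(y)) &= 1 - \left(\frac{y}{\theta}\right)^{-\alpha+1},\\
e(u) &= \frac{1}{\alpha-1}\,u.
\end{aligned}
$$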
As soon as we only study tails of distributions we should use MLEs for parameter
estimation (the method of moments is not sufficiently robust against outliers).
Since the threshold θ has a natural meaning we only need to estimate α. The MLE
is given by
$$
\widehat{\alpha}^{\mathrm{MLE}} = \left(\frac{1}{n}\sum_{i=1}^n \log Y_i - \log \theta\right)^{-1}.
$$
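Lemma 3.8. Assume $Y_1, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} \mathrm{Pareto}(\theta, \alpha)$. We have
$$
E\left[\widehat{\alpha}^{\mathrm{MLE}}\right] = \frac{n}{n-1}\,\alpha
\quad \text{and} \quad
\mathrm{Var}\left(\widehat{\alpha}^{\mathrm{MLE}}\right) = \frac{n^2}{(n-1)^2(n-2)}\,\alpha^2.
$$
Proof. Choose $Z \sim \mathrm{expo}(\alpha) \overset{(d)}{=} \Gamma(1, \alpha)$. Then $\theta e^{Z} \overset{(d)}{=} Y \sim \mathrm{Pareto}(\theta, \alpha)$ (this can be seen
by a change of variables in the corresponding densities). This immediately implies that
$Z_i = \log Y_i - \log \theta \overset{\text{i.i.d.}}{\sim} \mathrm{expo}(\alpha)$. The sum of these i.i.d. exponential random variables is gamma
distributed with parameters γ = n and c = α. Using the scaling property (3.4) we conclude that
$$
(\widehat{\alpha}^{\mathrm{MLE}})^{-1} \sim \Gamma(n, n\alpha).
$$
This implies for k < n
$$
E\left[(\widehat{\alpha}^{\mathrm{MLE}})^{k}\right] = \int_0^{\infty} z^{-k}\,\frac{(n\alpha)^n}{\Gamma(n)}\, z^{n-1} e^{-n\alpha z}\, dz = (n\alpha)^k\,\frac{\Gamma(n-k)}{\Gamma(n)}. \tag{3.7}
$$
From this the claim follows. □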
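Lemma 3.8 can be checked by a small Monte Carlo experiment. The following R sketch uses the representation Y = θ exp(Z) from the proof; the values of θ, α, n and the number of simulations are assumptions chosen for illustration.

```r
set.seed(1)
theta <- 1; alpha <- 2; n <- 10; nsim <- 20000   # assumed values
# simulate nsim samples of size n and evaluate the MLE of alpha on each sample
alpha_mle <- replicate(nsim, {
  y <- theta * exp(rexp(n, rate = alpha))        # Y = theta * exp(Z), Z ~ expo(alpha)
  1 / (mean(log(y)) - log(theta))
})
mean_check <- c(simulated = mean(alpha_mle), lemma = n / (n - 1) * alpha)
var_check  <- c(simulated = var(alpha_mle),
                lemma = n^2 / ((n - 1)^2 * (n - 2)) * alpha^2)
mean_check
var_check
```

For small n the upward bias n/(n − 1) of the MLE is clearly visible, which matters when only few claims exceed the threshold.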
For the MLE of α it was assumed that the threshold θ is given in a natural way.
If this threshold needs to be detected from data, the Hill plot can be of help. For
the Hill method we refer to McNeil et al. [68], Section 7.2.4. We order the claims
according to Y(1) ≤ Y(2) ≤ . . . ≤ Y(n). The Hill plot explores the stability of the
MLEs when successively dropping the smallest observations. Therefore we define
for k < n the Hill estimator by
$$
\widehat{\alpha}^{H}_{k,n} = \left(\frac{1}{n-k+1}\sum_{i=k}^{n} \log Y_{(i)} - \log Y_{(k)}\right)^{-1}.
$$

The Hill estimator is based on the rationale that the Pareto distribution is closed
towards increasing thresholds, i.e. for Y ∼ Pareto(θ0, α) and θ1 > θ0 we have for
all y ≥ θ1
$$
P\left[\,Y > y \,\middle|\, Y \ge \theta_1\right] = \frac{(y/\theta_0)^{-\alpha}}{(\theta_1/\theta_0)^{-\alpha}} = \left(\frac{y}{\theta_1}\right)^{-\alpha}.
$$
Therefore, if the data comes from a Pareto distribution we should observe stability
in $\widehat{\alpha}^{H}_{k,n}$ for changing k. The confidence bounds of the Hill estimators are determined
by Lemma 3.8.
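A minimal R sketch of the Hill estimator is given next. The function name `hill`, the simulated Pareto sample and the grid of values k are illustrative assumptions; confidence bounds are not included here.

```r
# Hill estimator for a vector of indices k (1 <= k < n), following the definition above
hill <- function(y, k) {
  ys <- sort(y)                                  # Y_(1) <= ... <= Y_(n)
  n  <- length(ys)
  sapply(k, function(kk) 1 / (mean(log(ys[kk:n])) - log(ys[kk])))
}

set.seed(1)
y <- exp(rexp(5000, rate = 2.5))                 # Pareto(theta = 1, alpha = 2.5) sample (assumed)
k <- seq(1, 4901, by = 100)
alpha_h <- hill(y, k)
# Hill plot: estimator against the number of largest observations used
# plot(length(y) - k + 1, alpha_h, type = "l",
#      xlab = "number of observations", ylab = "Pareto parameter")
```

For exact Pareto data the values in `alpha_h` fluctuate around the true tail index, with growing uncertainty as fewer observations remain.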
Example 3.9 (Pareto for extremes of PP insurance). We start the analysis with
the PP insurance data. To perform this large claims analysis we choose only the
largest claims of Figure 3.1. The Hill plot k ↦ $\widehat{\alpha}^{H}_{k,n}$ is given in Figure
3.18 (together with confidence bounds of 1 standard deviation, estimated by
Lemma 3.8). We observe a fairly stable picture in k around the value α = 2.5 up to
the largest 100 claims. For larger claims the Hill estimator “disappears” to 4 or 5,
which (once more) indicates that the tail of the largest observations is not really
heavy tailed. This is similar to the log-normal and the log-gamma fits. Sidney Ira
Resnick [76] has called this phenomenon the “Hill horror plot”; it stems from the
difficulty that the Hill estimator cannot correctly adjust to non-Pareto-like tails.
The right-hand side of Figure 3.18 gives the log-log plot for α = 2.5. In accordance
with the Hill plot we see that the slope of the data is slightly less than this value α
for smaller claims, but the data becomes less heavy tailed further out in the tails.
This also becomes obvious from the mean excess plot and the QQ plot in Figure 3.19.
Figure 3.18: PP insurance data; lhs: Hill plot k ↦ $\widehat{\alpha}^{H}_{k,n}$ with confidence bounds of
1 standard deviation; rhs: log-log plot for α = 2.5.
Figure 3.19: PP insurance data largest claims only; lhs: QQ plot; rhs: mean excess
plot for α = 2.5.
Example 3.10 (Pareto for extremes of CP insurance). In a second analysis we
analyze the extremes of the CP claims data of Figure 3.2. The results are presented
in Figure 3.20. At first sight they look similar to the PP insurance example,
i.e. they begin to destabilize between the 150 and 100 largest claims. However, the
main difference is that the tail index is much smaller in the CP example. That is,
there is a higher potential for large claims for this line of business. Of course, this
is clear from a practical point of view because, in general, commercial insurance
contracts involve much bigger sums insured than personal insurance contracts.
However, also in this example it seems that the largest claims are less heavy tailed
than explained by a Pareto distribution, which results in the Hill horror plot
phenomenon described in the previous example.
Figure 3.20: CP insurance data; lhs: Hill plot k ↦ $\widehat{\alpha}^{H}_{k,n}$ with a confidence interval
of 1 standard deviation; rhs: log-log plot for α = 1.4.
Example 3.11 (nuclear power accident example). We revisit the nuclear power
accident data set studied in Hofert-Wüthrich [53]. In Figure 3.21 we plot all
nuclear power accidents that have occurred until the end of 2011 and which have
a claim size larger than 20 mio. USD (as of 2010). These events include Three
Mile Island (United States, 1979), Chernobyl (Ukraine, 1986) and Fukushima
(Japan, 2011). In Figure 3.22 we provide the Hill plot. We observe that this data
is very heavy tailed.
Figure 3.21: 61 largest nuclear power accidents until 2011; lhs: logged claim sizes
(in chronological order, the last entry is Fukushima); rhs: empirical distribution
function of claim sizes. [Panel titles: scatter plot logged claim sizes nuclear power
accidents; empirical distribution. Axes: claim sizes (log scale); logged claim sizes.]
The Hill plot suggests setting the tail index α to around 0.64, which means that we
have an infinite mean model (α < 1). The log-log plot in Figure 3.22 shows that this
tail index choice captures the slope quite well.

Figure 3.22: 61 largest nuclear power accidents until 2011; lhs: Hill plot k ↦ $\widehat{\alpha}^{H}_{k,n}$
with confidence bounds of 1 standard deviation; rhs: log-log plot for α = 0.64.
[Panel titles: Hill plot of nuclear power accidents; log-log plot (with alpha = 0.64
for the 61 largest observations). Axes: number of observations, Pareto parameter;
log (claim size), log (1−distribution function).]
3.3 Model selection

In the previous section we have presented different claim size distributions and we
have debated which one fits best to the observed data. The argumentation was
completely based on graphical tools like log-log plots. In statistics, however, there
are more methodological tools that consider these questions from a more analytical
point of view. The two most commonly used tests are the Kolmogorov-Smirnov
(KS) test and the Anderson-Darling (AD) test. These are discussed in Sections
3.3.1 and 3.3.2.
In Section 3.3.3 we give the χ²-goodness-of-fit test and we discuss the Akaike
information criterion (AIC) as well as the Bayesian information criterion (BIC).
3.3.1 Kolmogorov-Smirnov (KS) test

Andrey Nikolaevich Kolmogorov (1903-1987) was a world-leading probabilist;
in 1933 he gave the modern axiomatic foundations of probability theory.
Unfortunately, less is known about Nikolai Vasilyevich Smirnov (1900-1966).
The KS test is a non-parametric test that investigates whether a given sample
Y1, . . . , Yn fits to a particular continuous distribution function G0. Therefore, one
compares the empirical distribution function $\widehat{G}_n$ of the sample to the distribution
function G0. The argument is based on the Glivenko-Cantelli theorem which says
that the empirical distribution function of an i.i.d. sample converges uniformly to
the true underlying distribution function, P-a.s., if the number n of i.i.d.
observations goes to infinity, see Theorem 20.6 in Billingsley [13].
Assume we have an i.i.d. sequence Y1, Y2, . . . from an unknown continuous
distribution function G and we denote the corresponding empirical distribution
function of finite sample size n by $\widehat{G}_n$. We would like to test whether these samples
Y1, Y2, . . . may stem from G0. Consider the null hypothesis H0 : G = G0 against the
two-sided alternative hypothesis that these distribution functions differ. We define
the KS test statistic by
$$
D_n = D_n(Y_1, \ldots, Y_n) = \left\|\widehat{G}_n - G_0\right\|_{\infty} = \sup_{y}\left|\widehat{G}_n(y) - G_0(y)\right|.
$$
This KS test statistic has the property that, see (13.4) in Billingsley [12],
$$
\sqrt{n}\,D_n \Rightarrow \text{Kolmogorov distribution } K \quad \text{as } n \to \infty.
$$
The Kolmogorov distribution K is for y ∈ R+ given by
$$
K(y) = 1 - 2\sum_{j=1}^{\infty} (-1)^{j+1}\exp\left\{-2j^2y^2\right\}.
$$
The null hypothesis H0 is rejected on the significance level q ∈ (0, 1) if
$$
D_n > n^{-1/2}\,K^{\leftarrow}(1-q),
$$
where K←(1 − q) denotes the (1 − q)-quantile of the Kolmogorov distribution K.

q             20%    10%    5%     2%     1%
K←(1 − q)     1.07   1.22   1.36   1.52   1.63
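The KS test statistic and the quantiles of the Kolmogorov distribution can be evaluated numerically, as sketched below in R. The function names `K` and `ks_stat`, the truncation of the series after 100 terms and the simulated log-normal sample are assumptions; the null distribution G0 is treated as fully specified (no estimated parameters).

```r
# Kolmogorov distribution K(y), series truncated after jmax terms (assumption)
K <- function(y, jmax = 100) {
  j <- 1:jmax
  1 - 2 * sum((-1)^(j + 1) * exp(-2 * j^2 * y^2))
}

# KS statistic D_n: the supremum is attained at (or just before) an observation
ks_stat <- function(y, G0, ...) {
  n <- length(y)
  u <- G0(sort(y), ...)
  max(pmax((1:n) / n - u, u - (0:(n - 1)) / n))
}

set.seed(1)
y  <- rlnorm(500, meanlog = 7, sdlog = 1.2)            # simulated sample (assumed)
Dn <- ks_stat(y, plnorm, meanlog = 7, sdlog = 1.2)
q_crit <- uniroot(function(x) K(x) - 0.95, c(0.5, 3))$root   # quantile for q = 5%
reject <- Dn > q_crit / sqrt(length(y))
c(Dn = Dn, critical = q_crit / sqrt(length(y)))
```

The value `q_crit` can be compared with the 5% column of the table above; base R also provides `ks.test()` for the same statistic.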
Example 3.12 (KS test, PP insurance data). We apply the KS
test to the log-normal and the log-gamma fits of the PP insurance
data, see Examples 3.5 and 3.6. In the log-normal case we obtain for the MLE fit
Dn = 0.05 and for the method of moments fit Dn = 0.12. These values are far too
large for the large sample size of n = 61’053 (the critical value on the 1%
significance level is 1.63/√n ≈ 0.007) and the KS test clearly rejects
the null hypothesis of having a log-normal distribution on the 1% significance level.
If we look at Figure 3.23 (lhs) we see that these big values of the KS test statistic
are driven by small claims, i.e. we obtain a bad fit for small claims, whereas the
tails do not look too bad. The log-gamma fit looks better than the log-normal fit, see
Figure 3.23. It provides KS test statistic Dn = 0.04 for the MLE fit and Dn = 0.06
for the method of moments fit. However, these values are still far too large and
the null hypothesis is again rejected on the 1% significance level.
Conclusion. The claim size modeling should be split into different layers.

Figure 3.23: KS test statistics for method of moments and MLE fits applied to the
PP insurance data; lhs: log-normal distribution; rhs: log-gamma distribution.
Example 3.13 (KS test, tail distribution). In this example we investigate the tail
fits of the Pareto distributions in the CP and the PP examples for the n = 505
largest claims, see Examples 3.9 and 3.10. The results are presented in Figure 3.24.
For the PP insurance data we obtain Dn = 0.027 (for α = 2.5) and for the CP
insurance data we obtain Dn = 0.061 (for α = 1.4). The first value is sufficiently
small so that the null hypothesis cannot be rejected on the 5% significance level.
The CP insurance value is just about the critical value on the 5% significance
level (1.36/√505 ≈ 0.061), i.e. the resulting p-value is just about 5%. The plot of
the point-wise terms of Dn looks fine for the PP insurance data; however, the graph
for the CP insurance data looks a bit one-sided, suggesting two different regimes
(this can also be seen from Figure 3.20).
Figure 3.24: Point-wise terms of KS test statistics for MLE fits applied to the 505
largest claims; lhs: PP insurance data; rhs: CP insurance data.
3.3.2 Anderson-Darling (AD) test

The advantage of the (non-parametric) KS test is that it can be applied to any
situation of continuous distribution functions. The drawback of this large
generality, of course, is that it is often not very powerful and, especially, not very
good at detecting particular properties such as tail behavior.
The two statisticians Theodore Wilbur Anderson and Donald Allan Darling have
developed a modification of the KS test, the so-called AD test, which gives more
weight to the tails of the distributions. It is therefore more sensitive in detecting
tail fits, but on the other hand it has the disadvantage of not being non-parametric,
and critical values need to be calculated for every chosen distribution function.
The KS test statistic is modified by the introduction of a weight function
ψ : [0, 1] → R+ which then modifies the KS test statistic Dn as follows
$$
\sup_{y}\left|\widehat{G}_n(y) - G_0(y)\right|\sqrt{\psi(G_0(y))}.
$$
Different choices of ψ allow us to weight different regions of the support of the
distribution function differently; the KS test statistic is obtained by ψ ≡ 1. The
choice proposed by Anderson and Darling is ψ(t) = (t(1 − t))^{−1} in order to
investigate the tails of the distributions.
In contrast to the maximal difference between the empirical distribution function
$\widehat{G}_n$ and the null hypothesis G0 we could also consider a weighted L²-distance. This
leads to the Anderson-Darling modification of the Cramér-von Mises test. The AD
test statistic for ψ(t) = (t(1 − t))^{−1} is obtained from
$$
A_n^2 = n\int_{\mathbb{R}} \frac{\left(\widehat{G}_n(y) - G_0(y)\right)^2}{G_0(y)\left(1 - G_0(y)\right)}\, dG_0(y).
$$
Anderson and Darling have explicitly identified the asymptotic behavior of $A_n^2$ as
n → ∞ by determining the limiting characteristic function. We do not further
elaborate on this but refer to the literature in statistics.
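For a given sample the integral defining the AD test statistic reduces to a finite sum over the ordered observations. The following R sketch uses this standard computational form, which is stated here without derivation; the function name `ad_stat` and the simulated data are assumptions, and no critical values are provided.

```r
# AD statistic A_n^2 = -n - (1/n) * sum_i (2i - 1) * [log u_(i) + log(1 - u_(n+1-i))],
# where u_(i) = G0(Y_(i)) are the ordered transformed observations
ad_stat <- function(y, G0, ...) {
  n <- length(y)
  u <- G0(sort(y), ...)
  i <- 1:n
  -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
}

set.seed(1)
y  <- rlnorm(500, meanlog = 7, sdlog = 1.2)            # simulated sample (assumed)
A2 <- ad_stat(y, plnorm, meanlog = 7, sdlog = 1.2)
A2
```

The logarithms of u and 1 − u make explicit how observations far out in either tail receive a large weight.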
3.3.3 Goodness-of-fit and information criteria

There are many other criteria that can be applied for testing fits and distributional
choices. Many of them are based on asymptotic normality. For instance, a
χ²-goodness-of-fit test splits the support of the null hypothesis distribution function
G0 into K disjoint intervals Ik = [ck, ck+1), k = 1, . . . , K. Then, data is grouped
according to these intervals, i.e. Ok counts the number of observed realizations
Y1, . . . , Yn in interval Ik and Ek denotes the expected number of observations in Ik
according to the distribution function G0. The test statistic is then defined by
$$
X_{n,K}^2 = \sum_{k=1}^{K} \frac{(O_k - E_k)^2}{E_k}. \tag{3.8}
$$
If d parameters were estimated in G0, then $X_{n,K}^2$ is compared to a χ²-distribution
with K − 1 − d degrees of freedom, see also Exercise 2 on page 21. Often it is
suggested that we should have Ek > 4 for reasonable results. However, such
rules of thumb are not very reliable.
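The test statistic (3.8) can be computed along the following lines. In this R sketch the simulated exponential data, the choice of K = 10 intervals with equal probability under G0 and d = 0 estimated parameters are illustrative assumptions.

```r
set.seed(1)
y <- rgamma(1000, shape = 1, rate = 10)                # simulated claims (assumed)
# K = 10 intervals [c_k, c_{k+1}) with equal probability under the null G0
breaks <- c(0, qgamma((1:9) / 10, shape = 1, rate = 10), Inf)
O <- as.vector(table(cut(y, breaks, right = FALSE)))   # observed counts O_k
E <- length(y) * diff(pgamma(breaks, shape = 1, rate = 10))   # expected counts E_k
X2 <- sum((O - E)^2 / E)                               # test statistic (3.8)
d  <- 0                                                # no parameters estimated here
p_value <- pchisq(X2, df = length(O) - 1 - d, lower.tail = FALSE)
c(X2 = X2, p_value = p_value)
```

Choosing the interval boundaries as quantiles of G0 gives equal expected counts in all intervals, which avoids very small values of Ek.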
Within the framework of MLE methods the Akaike information criterion (AIC),
named after Hirotugu Akaike (1927-2009), and the Bayesian information criterion
(BIC) are often used, we refer to Akaike [2] and Section 2.2 in Congdon [27]. These
criteria are used to compare different distribution functions and densities. Assume
we want to compare two different densities g1 and g2 that were fitted to the data
Y = (Y1, . . . , Yn)′. The AIC is defined by
$$
\mathrm{AIC}^{(i)} = -2\,\ell_{Y}^{(i)} + 2\,d^{(i)},
$$
where $\ell_{Y}^{(i)}$ is the log-likelihood function of density gi for data Y and $d^{(i)}$ denotes
the number of estimated parameters in gi, for i = 1, 2. For MLE we maximize
$\ell_{Y}^{(i)}$ and in order to evaluate the AIC we penalize the model for having too many
parameters. The AIC then says that the model with the smallest AIC value should
be preferred.
The BIC uses a different penalty term for the number of parameters (all these
penalty terms are motivated by asymptotic results). It reads as
$$
\mathrm{BIC}^{(i)} = -2\,\ell_{Y}^{(i)} + \log(n)\,d^{(i)},
$$
and the model with the smallest BIC value should be preferred.
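As an illustration, the two criteria may be evaluated as in the following R sketch, which compares a log-normal fit (two parameters) with an exponential fit (one parameter). The simulated data and the helper name `ic` are assumptions; both MLEs are available in closed form.

```r
set.seed(1)
y <- rlnorm(1000, meanlog = 0, sdlog = 1)        # simulated claims (assumed)
n <- length(y)

# model 1: log-normal, d = 2 estimated parameters, closed-form MLE
mu    <- mean(log(y))
sigma <- sqrt(mean((log(y) - mu)^2))
ll1   <- sum(dlnorm(y, meanlog = mu, sdlog = sigma, log = TRUE))

# model 2: exponential, d = 1 estimated parameter, MLE c = 1 / mean(y)
ll2 <- sum(dexp(y, rate = 1 / mean(y), log = TRUE))

# information criteria from log-likelihood ll and number of parameters d
ic <- function(ll, d) c(AIC = -2 * ll + 2 * d, BIC = -2 * ll + log(n) * d)
res <- rbind(lognormal = ic(ll1, 2), exponential = ic(ll2, 1))
res
```

In each column the model with the smaller value is preferred; since log(n) > 2 for n ≥ 8, the BIC penalizes additional parameters more strongly than the AIC.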
Exercise 5 (AIC and BIC). Assume we have claim sizes Y = (Y1, . . . , Yn)′ with
n = 1000 which were generated by a gamma distribution, see Figure 3.25.
The sample mean and sample standard deviation are given by
$$
\widehat{\mu}_n = 0.1039 \quad \text{and} \quad \widehat{\sigma}_n = 0.1050.
$$
If we fit the parameters of the gamma distribution we obtain the method of
moments estimators and the MLEs
$$
\begin{aligned}
\widehat{\gamma}^{\mathrm{MM}} &= 0.9794 \quad \text{and} \quad \widehat{c}^{\mathrm{MM}} = 9.4249,\\
\widehat{\gamma}^{\mathrm{MLE}} &= 1.0013 \quad \text{and} \quad \widehat{c}^{\mathrm{MLE}} = 9.6360.
\end{aligned}
$$
This provides the fitted distributions displayed in Figure 3.26. The fits look perfect
and the corresponding log-likelihoods are given by
$$
\ell_{Y}(\widehat{\gamma}^{\mathrm{MM}}, \widehat{c}^{\mathrm{MM}}) = 1264.013 \quad \text{and} \quad \ell_{Y}(\widehat{\gamma}^{\mathrm{MLE}}, \widehat{c}^{\mathrm{MLE}}) = 1264.171.
$$
(a) Why is $\ell_{Y}(\widehat{\gamma}^{\mathrm{MLE}}, \widehat{c}^{\mathrm{MLE}}) > \ell_{Y}(\widehat{\gamma}^{\mathrm{MM}}, \widehat{c}^{\mathrm{MM}})$ and which fit should be preferred
according to AIC?

(b) The estimates of γ are very close to 1 and we could also use an exponential
distribution function. For the exponential distribution function we obtain the
MLE $\widehat{c}^{\mathrm{MLE}} = 9.6231$ and $\ell_{Y}(\widehat{c}^{\mathrm{MLE}}) = 1264.169$. Which model (gamma or
exponential) should be preferred according to the AIC and the BIC?

Figure 3.25: Claim sizes Y = (Y1, . . . , Yn)′ with n = 1000; lhs: observed data; rhs:
empirical distribution.
Figure 3.26: Fitted gamma distributions; lhs: log-log plot; rhs: QQ plot.

3.4 Calculating within layers for claim sizes

In the previous sections we have experienced that it is difficult to fit one parametric
distribution function to the entire range of possible outcomes of the claim sizes.
Therefore, we consider claim sizes in different layers. Another reason why different
layers of claim sizes are of interest is that re-insurance can often be bought for
different claims layers. For these reasons we would like to understand how claim
sizes behave in different layers. First we discuss the modeling issue and second we
describe modeling of re-insurance layers.
3.4.1 Claim size modeling using layers

We come back to the issue that the KS test rejects the most popular parametric
fits, see Example 3.12. We assume that $Y_1, Y_2, \ldots \overset{\text{i.i.d.}}{\sim} G$ and we would like to split
G into different layers. The simplest case is to choose two layers, see Example 2.16,
that is, choose a large claims threshold M > 0 such that G(M) ∈ (0, 1), i.e. G(M)
is bounded away from zero and one. We then define the disjoint decomposition
$$
\{Y_1 \le M\} \quad \text{and} \quad \{Y_1 > M\}.
$$
Assume that S ∼ CompPoi(λv, G). We consider the total claim Ssc in the small
claims layer and the total claim Slc in the large claims layer given by
$$
S_{sc} = \sum_{i=1}^{N} Y_i 1_{\{Y_i \le M\}} \quad \text{and} \quad S_{lc} = \sum_{i=1}^{N} Y_i 1_{\{Y_i > M\}}.
$$
Theorem 2.14 implies that Ssc and Slc are independent and compound Poisson
distributed with
$$
S_{sc} \sim \mathrm{CompPoi}\left(\lambda_{sc} v = \lambda G(M) v,\ G_{sc}(y) = P[Y_1 \le y \mid Y_1 \le M]\right),
$$
and
$$
S_{lc} \sim \mathrm{CompPoi}\left(\lambda_{lc} v = \lambda (1 - G(M)) v,\ G_{lc}(y) = P[Y_1 \le y \mid Y_1 > M]\right).
$$
Thus, we can model large claims and small claims separately. Observe that we
have the following decomposition
$$
\begin{aligned}
G(y) &= P[Y_1 \le y \mid Y_1 \le M]\, G(M) + P[Y_1 \le y \mid Y_1 > M]\, (1 - G(M))\\
&= G_{sc}(y)\, G(M) + G_{lc}(y)\, (1 - G(M)).
\end{aligned}
$$

Often a successful modeling approach involves 3 steps:
1. Choose threshold M > 0 sufficiently large so that many of the observations
fall into the lower layer (0, M]. In this lower layer one either fits a parametric
distribution function to the data or one directly works with the empirical
distribution function due to the Glivenko-Cantelli theorem. If a distribution
function is fitted one should ensure that this distribution function has
support in (0, M], for instance, by choosing a truncated gamma distribution.
2. Estimate probability G(M) of the event {Y1 ≤ M}.
3. Fit a Pareto distribution to Glc for threshold θ = M, i.e. estimate the tail
index α > 0 from the observations exceeding this threshold.
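The three steps above can be sketched in R as follows. The simulated data (a log-normal body mixed with Pareto claims above M), the threshold value and the name `G_hat` are assumptions for illustration; in the lower layer the sketch works with the empirical distribution function.

```r
set.seed(1)
M <- 50000                                             # assumed large claims threshold
# simulated claims (assumed): log-normal body plus Pareto(M, 2.5) large claims
y <- c(rlnorm(9800, 7, 1.2), M * exp(rexp(200, rate = 2.5)))

small <- y[y <= M]
large <- y[y > M]
G_sc  <- ecdf(small)                                   # step 1: empirical d.f. on (0, M]
p_M   <- mean(y <= M)                                  # step 2: estimate of G(M)
alpha <- 1 / (mean(log(large)) - log(M))               # step 3: Pareto MLE with theta = M

# composed distribution function G = G_sc * G(M) + G_lc * (1 - G(M))
G_hat <- function(x) {
  ifelse(x <= M, G_sc(x) * p_M, p_M + (1 - (x / M)^(-alpha)) * (1 - p_M))
}
c(p_M = p_M, alpha = alpha, G_at_2M = G_hat(2 * M))
```

By construction `G_hat` is continuous at the threshold, because the empirical distribution function of the lower layer equals 1 at M and the Pareto distribution function equals 0 at θ = M.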
Example 3.14. We revisit the PP and the CP insurance data sets. We choose
large claims threshold M = 50’000 in both cases. In the PP insurance data set we
have 237 observations above this threshold, which provides the estimate
$1 - \widehat{G}(M) = 237/61’053 = 0.39\%$. For the CP insurance example we have 272 claims
above this threshold, which provides the estimate $1 - \widehat{G}(M) = 1.87\%$. Next we
calculate the sample mean and the sample coefficient of variation in the small claims
layer {Yi ≤ M}:
$$
\begin{aligned}
\text{PP}: \quad & \widehat{\mu}^{\mathrm{PP}}_{\{Y_i \le M\}} = 2’805, \qquad \widehat{\mathrm{Vco}}^{\mathrm{PP}}_{\{Y_i \le M\}} = 1.80,\\
\text{CP}: \quad & \widehat{\mu}^{\mathrm{CP}}_{\{Y_i \le M\}} = 4’377, \qquad \widehat{\mathrm{Vco}}^{\mathrm{CP}}_{\{Y_i \le M\}} = 1.51.
\end{aligned}
$$
Figure 3.27: Empirical fit in small claims layer and Pareto distribution fit in large
claims layer, the gray lines show the large claims threshold; lhs: PP insurance data;
rhs: CP insurance data.

These should be compared to (3.1)-(3.2). We observe a substantial reduction of
the sample coefficient of variation in the small claims layer compared to the entire
range of possible outcomes. This is not surprising because large claims drive the
coefficient of variation. For CP insurance we also see that the sample mean in the
lower claims layer is substantially reduced. This is due to the fact that 1.87% of
claims exceed the threshold M = 50’000 and these claims may get very large and
drive the mean, see also the loss size index function in Figure 3.5.
Finally, we fit the distribution function G to the data. We choose the empirical
distribution functions below the threshold M and Pareto distributions for the tail
fit in the large claims layer, having tail parameters α as estimated in Examples 3.9
and 3.10 (this is also supported by the KS tests, see Example 3.13). The results
are presented in Figure 3.27. For PP insurance they look convincing, whereas the
CP insurance fit is not entirely satisfactory in the large claims layer (which might
ask for a bigger large claims threshold M).
3.4.2 Re-insurance layers and deductibles
Above we have calculated expected values in claims layers which were given by
$E[Y 1_{\{u_1 < Y \le u_2\}}]$ for various parametric distribution functions. This is of interest for
several reasons which we are going to discuss next.
(i) The first reason is that insurance contracts often have a deductible. On the
one hand small claims often cause too much administrative costs, and on the other
hand deductibles are also an instrument to prevent fraud. For instance, it
can become quite expensive if every insured claims that his umbrella got stolen.
Therefore, a deductible d > 0 of say 200 CHF is introduced and the insurance
company only covers claims (Y − d)+ that exceed this deductible. In this case the
pure risk premium for claim Y ∼ G is given by
$$
\begin{aligned}
E\left[(Y - d)_+\right] &= \int_d^{\infty} (y - d)\, dG(y) = E\left[Y 1_{\{Y > d\}}\right] - d\, P[Y > d]\\
&= P[Y > d]\left(E[Y \mid Y > d] - d\right) = P[Y > d]\, e(d),
\end{aligned} \tag{3.9}
$$
under the assumption that P[Y > d] > 0, where e(·) is the mean excess function of Y.
(ii) The second reason is that the insurance company may have a maximal insurance
cover per contract, i.e. it covers claims up to a maximal size of M > 0 and the
exceedances need to be paid by the insured; or, similarly, it may cover claims
exceeding M but has a re-insurance cover for these exceedances. In that case the
insurance company covers (Y ∧ M) and the pure risk premium for this (bounded)
claim is given by
$$
\begin{aligned}
E\left[Y \wedge M\right] &= \int_0^{M} y\, dG(y) + M\, P[Y > M] = E\left[Y 1_{\{Y \le M\}}\right] + M\, P[Y > M]\\
&= E[Y] - \left(E\left[Y 1_{\{Y > M\}}\right] - M\, P[Y > M]\right)\\
&= E[Y] - P[Y > M]\, e(M) = E[Y] - E\left[(Y - M)_+\right].
\end{aligned}
$$
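The two premium formulas can be checked numerically. The following R sketch assumes a Pareto claim size with illustrative parameters (in units of, say, 1000 CHF) and cross-checks the closed forms against numerical integration of the survival function; the names `surv`, `e`, `prem_deductible` and `prem_capped` are hypothetical.

```r
theta <- 1; alpha <- 2.5; d <- 2; M <- 10              # assumed values, d and M above theta
surv <- function(y) ifelse(y < theta, 1, (y / theta)^(-alpha))   # P[Y > y]
e    <- function(u) u / (alpha - 1)                    # Pareto mean excess function, u >= theta
EY   <- theta * alpha / (alpha - 1)                    # E[Y] for alpha > 1

prem_deductible <- surv(d) * e(d)                      # E[(Y - d)_+] = P[Y > d] e(d), see (3.9)
prem_capped     <- EY - surv(M) * e(M)                 # E[Y ^ M] = E[Y] - P[Y > M] e(M)

# cross-checks via E[(Y - d)_+] = int_d^Inf P[Y > y] dy and E[Y ^ M] = int_0^M P[Y > y] dy
check_deductible <- integrate(surv, lower = d, upper = Inf)$value
check_capped     <- integrate(surv, lower = 0, upper = M)$value
rbind(formula = c(prem_deductible, prem_capped),
      numeric = c(check_deductible, check_capped))
```

The numerical cross-check is useful for claim size distributions where no closed form of the mean excess function is available.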
If we combine the deductibles with the maximal cover we obtain the excess-of-loss (XL) (re-)insurance treaty. Assume we have a deductible u1 > 0 (in re-insurance terminology this is also called priority or retention). Then the insurance treaty “u2 XL u1” covers the claims layer (u1, u1 + u2], that is, this contract covers a maximal excess of u2 above the priority u1. The pure risk premium for such contracts is then given by
$$E[Y 1_{\{u_1 < Y \le u_1+u_2\}}].$$
An issue, when dealing with layers, is claims inflation. Assume we sell insurance
contracts with a deductible d > 0 and we ask for a pure risk premium E [(Y − d)+ ].
Since cash flows have time values this premium has to be revised carefully for later
periods as the following theorem shows.
Theorem 3.15 (leverage effect of claims inflation). Choose a fixed deductible d > 0 and assume that the claim at time 0 is given by Y0. Assume that there is a (deterministic) inflation index i > 0 such that the claim at time 1 can be represented by $Y_1 \overset{(d)}{=} (1+i)Y_0$. We have
$$E[(Y_1-d)_+] \ge (1+i)\, E[(Y_0-d)_+].$$
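Proof. We calculate the pure risk premium
$$
\begin{aligned}
E[(Y_1-d)_+] &= \int_0^\infty P[(Y_1-d)_+ > y]\, dy = \int_0^\infty P[Y_1 > y+d]\, dy \\
&= \int_d^\infty P[Y_1 > x]\, dx = \int_d^\infty P\left[Y_0 > \frac{x}{1+i}\right] dx \\
&= (1+i) \int_{d/(1+i)}^\infty P[Y_0 > y]\, dy,
\end{aligned}
$$
where we have twice applied a change of variables. The latter is calculated as follows
$$
\begin{aligned}
E[(Y_1-d)_+] &= (1+i) \left( \int_{d/(1+i)}^{d} P[Y_0 > y]\, dy + \int_d^\infty P[Y_0 > y]\, dy \right) \\
&= (1+i) \int_{d/(1+i)}^{d} P[Y_0 > y]\, dy + (1+i)\, E[(Y_0-d)_+].
\end{aligned}
$$
The first term on the right-hand side is non-negative, which proves the claim. □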
Example 3.16 (leverage effect of claims inflation). Assume that Y0 ∼ Pareto(θ, α) with α > 1 and choose a deductible d > θ. In that case we have, see (3.9),
$$E[(Y_0-d)_+] = \left(\frac{d}{\theta}\right)^{-\alpha} \frac{1}{\alpha-1}\, d.$$
Choose inflation index i > 0 such that θ(1 + i) < d. From (3.6) we obtain
$$Y_1 \overset{(d)}{=} (1+i)Y_0 \sim \text{Pareto}(\theta(1+i), \alpha).$$
This provides
$$
\begin{aligned}
E[(Y_1-d)_+] &= \left(\frac{d}{\theta(1+i)}\right)^{-\alpha} \frac{1}{\alpha-1}\, d \\
&= (1+i)^{\alpha} \left(\frac{d}{\theta}\right)^{-\alpha} \frac{1}{\alpha-1}\, d > (1+i)\, E[(Y_0-d)_+].
\end{aligned}
$$
We see that we obtain a strict inequality, i.e. the pure risk premium grows faster than the claim sizes themselves. The reason for this faster growth is that claims Y0 ≤ d may give rise to claims payments after the claims inflation adjustment, i.e. not only are the claim sizes growing under inflation but also the number of claims is growing if one does not adapt the deductible to inflation.
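The size of the leverage effect is easy to quantify in the Pareto case. The sketch below evaluates the closed-form premium of Example 3.16 before and after inflation; the numerical values of θ, α, d and i are hypothetical and only serve as an illustration.

```python
def pareto_stop_loss(d, theta, alpha):
    """E[(Y - d)_+] for Y ~ Pareto(theta, alpha); valid for d >= theta and alpha > 1."""
    if d < theta or alpha <= 1.0:
        raise ValueError("requires d >= theta and alpha > 1")
    return (d / theta) ** (-alpha) * d / (alpha - 1.0)

theta, alpha, d, i = 500_000.0, 2.5, 1_000_000.0, 0.03   # hypothetical values

premium_0 = pareto_stop_loss(d, theta, alpha)               # claim Y0 at time 0
premium_1 = pareto_stop_loss(d, theta * (1.0 + i), alpha)   # Y1 ~ Pareto(theta(1+i), alpha)

growth = premium_1 / premium_0   # equals (1 + i) ** alpha, which exceeds 1 + i
print(f"claims inflation: {i:.1%}, premium growth: {growth - 1.0:.2%}")
```

With a fixed deductible the premium grows by the factor $(1+i)^{\alpha}$, so the leverage is stronger for lighter Pareto tails (larger α).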

Chapter 4
Approximations for Compound Distributions
In Chapter 2 we have introduced several claims count distributions for the modeling of the number of claims N within a fixed time period. In Chapter 3 we have met several claim size distribution functions G for the modeling of the claim sizes Y1, Y2, . . . . Ultimately, we would always like to calculate the compound distribution function of S, see Definition 2.1. As explained in Proposition 2.2, we can easily calculate the moments and the moment generating function of this compound distribution. On the other hand, the distribution function of S given in (2.1) is a notoriously difficult object because it involves (too many) convolutions of the claim size distribution function G. The aim of this chapter is to explain how we can circumvent this difficulty.
The most commonly used practice in industry to solve this problem is to apply Monte Carlo simulations and then consider the resulting empirical distribution function as a sufficiently good approximation to the true distribution function, see the Glivenko-Cantelli theorem in Billingsley [13], Chapter 20. Though this is a feasible way, we do not recommend it. The issue is that it is often difficult to assess what sufficiently good means, i.e. the rates of convergence of the Monte Carlo samples may be very poor, which results in a very large number of required simulations. This is especially true for heavy tailed distribution functions of regularly varying type (3.3). Therefore, we would like to present other methods. These include approximations, the Panjer algorithm and fast Fourier transforms (FFT).
4.1 Approximations
In many cases approximations to S are used. This may be justified by the central
limit theorem (CLT) if the number of claims is large. Compound distributions may
have two different risk drivers in the tail of the distribution function, namely the
number of claims N may contribute to large values of S or single large claims in
Y1 , . . . , YN may drive extreme values in S. Let us concentrate on the compound
Poisson model, in particular, we would like to use the decomposition theorem in
the spirit of Example 2.16. In this case, mostly the claim sizes Yi contribute to
the tail of the distribution (if these are heavy-tailed). Therefore, we emphasize
that in the light of the compound Poisson model one should separate small from
large claims resulting in the independent decomposition S = Ssc + Slc . Then, if
the expected number of small claims vλsc is large, Ssc can be approximated by a
parametric distribution function and Slc should be modeled explicitly. This we are
going to describe in detail in the remainder of this chapter.
4.1.1 Normal approximation
The normal approximation is motivated by the CLT which goes back to de Moivre (1733) and Laplace (1812), see (1.2). It was then Aleksandr Mikhailovich Lyapunov (1857-1918) who stated it in the general version and who discovered the importance of the CLT.
The classical CLT holds for a fixed number of claims. In our approach the number of claims is not fixed, therefore we need a refinement of the CLT. We do this for a Poissonian number of claims N by keeping the expected claims frequency λ fixed and by sending the volume v → ∞.
Theorem 4.1. Assume S ∼ CompPoi(λv, G) with G having a positive radius of convergence ρ0 > 0 for its moment generating function. We have
$$\frac{S - \lambda v E[Y_1]}{\sqrt{\lambda v E[Y_1^2]}} \Rightarrow N(0,1) \qquad \text{as } v \to \infty.$$
Observe that we consider a special class of moment generating functions $M_{Y_1}$. As long as we work in the set-up of Ssc this is not a restriction because claim sizes are bounded by the large claims threshold M.
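Proof. Set µ = E[Y1] and define Zv by
$$Z_v = \frac{S - \lambda v \mu}{\sqrt{\lambda v E[Y_1^2]}},$$
and calculate its moment generating function for r ∈ (−ρ0, ρ0)
$$\log M_{Z_v}(r) = \log E[\exp\{rZ_v\}] = \lambda v \left( M_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) - 1 \right) - \frac{r \lambda v \mu}{\sqrt{\lambda v E[Y_1^2]}}.$$
We study the asymptotic behavior. For v → ∞ both numerator and denominator of the following expression go to zero, therefore we can apply l’Hôpital’s rule
$$
\begin{aligned}
\lim_{v\to\infty} \frac{M_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) - 1 - \frac{r\mu}{\sqrt{\lambda v E[Y_1^2]}}}{(\lambda v)^{-1}}
&= \lim_{v\to\infty} \frac{-M'_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) \frac{r v^{-3/2}}{2\sqrt{\lambda E[Y_1^2]}} + \frac{r \mu v^{-3/2}}{2\sqrt{\lambda E[Y_1^2]}}}{-\lambda^{-1} v^{-2}} \\
&= \lim_{v\to\infty} \frac{r \left( M'_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) - \mu \right)}{2 \lambda^{-1} v^{-1/2} \sqrt{\lambda E[Y_1^2]}}.
\end{aligned}
$$
Since numerator and denominator still converge to zero as v → ∞ we can apply l’Hôpital’s rule once more and obtain
$$
\begin{aligned}
\lim_{v\to\infty} \frac{M_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) - 1 - \frac{r\mu}{\sqrt{\lambda v E[Y_1^2]}}}{(\lambda v)^{-1}}
&= \lim_{v\to\infty} \frac{r M''_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right) \frac{-r v^{-3/2}}{2\sqrt{\lambda E[Y_1^2]}}}{-\lambda^{-1} v^{-3/2} \sqrt{\lambda E[Y_1^2]}} \\
&= \lim_{v\to\infty} \frac{\frac{1}{2} r^2 M''_{Y_1}\!\left(\frac{r}{\sqrt{\lambda v E[Y_1^2]}}\right)}{E[Y_1^2]} = \frac{1}{2} r^2,
\end{aligned}
$$
the last step follows from (1.3). This last expression exactly reflects the moment generating function of the standard Gaussian distribution, see (1.4), therefore the claim follows from Lemma 1.4. □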
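As a cross-check of the limit, the same result can be read off from a second order Taylor expansion of $M_{Y_1}$ around the origin, which is available because ρ0 > 0. This is only a sketch of an alternative route to the limit, with the shorthand $t_v$ introduced here for convenience:
$$
\begin{aligned}
t_v &= \frac{r}{\sqrt{\lambda v E[Y_1^2]}} \to 0 \quad \text{as } v \to \infty, \qquad
M_{Y_1}(t) = 1 + \mu t + \tfrac{1}{2} E[Y_1^2]\, t^2 + o(t^2) \quad \text{as } t \to 0, \\
\log M_{Z_v}(r) &= \lambda v \left( M_{Y_1}(t_v) - 1 - \mu t_v \right)
= \lambda v \left( \tfrac{1}{2} E[Y_1^2]\, t_v^2 + o(t_v^2) \right)
= \tfrac{1}{2} r^2 + o(1) \quad \text{as } v \to \infty,
\end{aligned}
$$
where the last step uses $\lambda v\, t_v^2 = r^2 / E[Y_1^2]$.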
Theorem 4.1 is the motivation for the following approximation of the distribution function of S
$$P[S \le x] = P\left[ \frac{S - \lambda v E[Y_1]}{\sqrt{\lambda v E[Y_1^2]}} \le \frac{x - \lambda v E[Y_1]}{\sqrt{\lambda v E[Y_1^2]}} \right] \approx \Phi\left( \frac{x - \lambda v E[Y_1]}{\sqrt{\lambda v E[Y_1^2]}} \right), \tag{4.1}$$
where Φ denotes the standard Gaussian distribution function. This approximation works well when v is large and if the claim sizes Yi do not have heavy tailed distribution functions G. Otherwise it under-estimates the true potential of large outcomes of S (because Theorem 4.1 provides a particularly good approximation solely around the mean of S). For rates of convergence we refer to the literature, for instance, see Embrechts et al. [36].
Note that the normal approximation (4.1) also allows for negative claims S, which under our model assumptions is excluded; thus, it is really an approximation that needs to be considered carefully.
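Approximation (4.1) only needs the first two moments of the claim sizes. The following sketch evaluates it with the Python standard library; the moments used (exponential claim sizes with mean 1000) are a hypothetical illustration and not the PP insurance data of the notes.

```python
import math

def std_normal_cdf(z):
    """Standard Gaussian distribution function Phi(z), stable in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def normal_approx_cdf(x, lam_v, m1, m2):
    """Approximation (4.1) of P[S <= x]; m1 = E[Y1], m2 = E[Y1^2]."""
    mean = lam_v * m1
    std = math.sqrt(lam_v * m2)
    return std_normal_cdf((x - mean) / std)

# Hypothetical input: lam_v = 100 expected claims, exponential claim sizes with mean 1000
lam_v, m1, m2 = 100.0, 1_000.0, 2_000_000.0

p_mean = normal_approx_cdf(lam_v * m1, lam_v, m1, m2)   # approximation at the mean of S
p_negative = normal_approx_cdf(0.0, lam_v, m1, m2)      # mass put on negative totals
print(f"P[S <= E[S]] ~ {p_mean:.3f}, mass on S < 0: {p_negative:.2e}")
```

The quantity `p_negative` is the probability mass that the approximation assigns to a negative total claim amount, which is the diagnostic discussed in the paragraph above.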
Example 4.2 (Normal approximation for PP insurance). We revisit the PP insur-
ance data of Example 3.14. We consider 3 different examples:
(a) Only small claims: in this example we only consider claim size distribution
function G(y) = P [Y ≤ y|Y ≤ M ], i.e. the claims are compactly supported in
(0, M ]. As explicit claim size distribution we choose the empirical distribution
of Example 3.14, see Figure 3.27 (lhs), with M = 500 000. We choose portfolio
size v such that λv = 100.

(b) Claim size distribution function G is chosen as in (a), but this time we choose
portfolio size v such that λv = 1000.
(c) In addition to (b) we add the large claims layer modeled by a Pareto distri-
bution with M = 500 000 and α = 2.5 and for the expected number of large
claims we set λlc v = 3.9.
Figure 4.1: Compound Poisson distribution of S and normal approximation (4.1)
in case (a), i.e. no large claims, expected number of claims 100; lhs: distribution
function; rhs: log-log plot.
For simplicity the true distribution function is evaluated by Monte Carlo simula-
tion, which contradicts our statement above, but is appropriate for sufficiently large
samples (and sufficient patience). We choose 100’000 simulations, this is further
illustrated in Example 4.11 below.
In Figure 4.1 we present the results of the normal approximation (4.1) in case (a). We observe an appropriately good fit around the mean but the normal approximation clearly under-estimates the tails of the true distribution function, see log-log plot in Figure 4.1 (rhs). Moreover, the true distribution function has positive skewness ςS = 0.43 whereas the normal approximation has zero skewness. In the normal approximation we obtain probability mass $\Phi\bigl(-\lambda v E[Y_1] / \sqrt{\lambda v E[Y_1^2]}\bigr) = 6 \cdot 10^{-7}$ for a negative total claim amount (which is fairly small).
In Figure 4.2 we show situation (b), which is the same as situation (a) except that we enlarge the portfolio size by a factor 10. We see better approximation properties due to the fact that we have convergence in distribution for portfolio size v → ∞. We observe a lower skewness ςS = 0.15 which improves the normal approximation, also in the tails.
Finally, in Figure 4.3 we also include large claims (in contrast to Figure 4.2) having an expected number of large claims of 3.9 and a Pareto tail parameter of α = 2.5. We see that in this case the normal approximation is useless in the tail, which strongly favors the large claims separation as suggested in Example 2.16.
Figure 4.2: Compound Poisson distribution of S and normal approximation (4.1) in case (b), i.e. no large claims, expected number of claims 1000; lhs: distribution function; rhs: log-log plot.
Figure 4.3: Compound Poisson distribution of S and normal approximation (4.1) in case (c), i.e. with large claims, total expected number of claims 1003.9; lhs: distribution function; rhs: log-log plot.
4.1.2 Translated gamma and log-normal approximations

In Example 4.2 we have seen that the normal approximation can be useful for large portfolio sizes v and under the exclusion of large claims. For small portfolio sizes the approximation may be bad because the true distribution has substantial skewness. This leads to the idea of approximating the small claims layer by other distribution functions that also enjoy skewness.
We choose k ∈ R and define the random variable
$$X = k + Z, \qquad \text{where } Z \sim \Gamma(\gamma, c) \text{ or } Z \sim \text{LN}(\mu, \sigma^2).$$
We have in the translated gamma case
$$E[X] = k + \gamma/c, \qquad \text{Var}(X) = \gamma/c^2 \qquad \text{and} \qquad \varsigma_X = 2\gamma^{-1/2},$$
and in the translated log-normal case
$$
\begin{aligned}
E[X] &= k + \exp\{\mu + \sigma^2/2\}, \\
\text{Var}(X) &= \exp\{2\mu + \sigma^2\}\,(\exp\{\sigma^2\} - 1), \\
\varsigma_X &= (\exp\{\sigma^2\} + 2)\,(\exp\{\sigma^2\} - 1)^{1/2}.
\end{aligned}
$$
The idea now is to do a fit of moments between S and X. Assume that S has finite third moment; then we choose X = k + Z as above such that the three parameters of X fulfill
$$E[X] = E[S], \qquad \text{Var}(X) = \text{Var}(S) \qquad \text{and} \qquad \varsigma_X = \varsigma_S, \tag{4.2}$$
and then this fitted random variable X is chosen as an approximation to S.
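The fit of moments (4.2) can be carried out mechanically once mean, variance and skewness of S are known. The sketch below is one possible way to do this, assuming positive skewness: the gamma case is solved in closed form from the three moment formulas above, and the log-normal case solves the skewness equation for σ² by bisection. The function names and the compound Poisson input (exponential claim sizes, with the compound Poisson moments $E[S] = \lambda v E[Y_1]$, $\text{Var}(S) = \lambda v E[Y_1^2]$ and $\varsigma_S = E[Y_1^3] / ((\lambda v)^{1/2} E[Y_1^2]^{3/2})$) are illustrative assumptions.

```python
import math

def fit_translated_gamma(mean, var, skew):
    """Solve (4.2) for X = k + Z with Z ~ Gamma(gamma, c); needs skew > 0."""
    if skew <= 0.0:
        raise ValueError("the translated gamma fit needs positive skewness")
    gamma = 4.0 / skew ** 2        # from skewness 2 * gamma^(-1/2)
    c = math.sqrt(gamma / var)     # from Var(X) = gamma / c^2
    k = mean - gamma / c           # from E[X] = k + gamma / c
    return k, gamma, c

def lognormal_skewness(sigma2):
    """Skewness of LN(mu, sigma^2); it does not depend on mu."""
    w = math.exp(sigma2)
    return (w + 2.0) * math.sqrt(w - 1.0)

def fit_translated_lognormal(mean, var, skew):
    """Solve (4.2) for X = k + Z with Z ~ LN(mu, sigma^2) by bisection in sigma^2."""
    if skew <= 0.0:
        raise ValueError("the translated log-normal fit needs positive skewness")
    lo, hi = 0.0, 1.0
    while lognormal_skewness(hi) < skew:   # the skewness is increasing in sigma^2
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lognormal_skewness(mid) < skew:
            lo = mid
        else:
            hi = mid
    sigma2 = 0.5 * (lo + hi)
    mu = 0.5 * (math.log(var / (math.exp(sigma2) - 1.0)) - sigma2)  # from Var(X)
    k = mean - math.exp(mu + 0.5 * sigma2)                          # from E[X]
    return k, mu, sigma2

# Hypothetical input: lam_v = 100, exponential claim sizes with mean 1000
lam_v, m1, m2, m3 = 100.0, 1.0e3, 2.0e6, 6.0e9
mean, var = lam_v * m1, lam_v * m2
skew = m3 / (math.sqrt(lam_v) * m2 ** 1.5)

k_gam, gamma, c = fit_translated_gamma(mean, var, skew)
k_ln, mu, sigma2 = fit_translated_lognormal(mean, var, skew)
print(k_gam, gamma, c)
print(k_ln, mu, sigma2)
```

Note that the translation k may well turn out negative, so the fitted X can take values below zero even though S cannot.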
Exercise 6. Assume that S has a compound Poisson distribution with expected number of claims λv > 0 and claim size distribution G having finite third moment.
1. Prove that the fit of moments approximation (4.2) for a translated gamma distribution for X provides the following system of equations
$$\lambda v\, E[Y_1] = k + \gamma/c, \qquad \lambda v\, E[Y_1^2] = \gamma/c^2 \qquad \text{and} \qquad \frac{E[Y_1^3]}{(\lambda v)^{1/2} E[Y_1^2]^{3/2}} = 2\gamma^{-1/2}.$$
2. Solve this system of equations for k ∈ R, γ > 0 and c > 0 and prove that it has a well-defined solution for G(0) = 0.
3. Why should this approximation not be applied to case (c) of Example 4.2?

Figure 4.4: Compound Poisson distribution of S and normal approximation (4.1), translated gamma and log-normal approximation (4.2) in case (a), i.e. no large claims, expected number of claims 100; lhs: distribution function; rhs: log-log plot.
Figure 4.5: Compound Poisson distribution of S and normal approximation (4.1), translated gamma and log-normal approximation (4.2) in case (b), i.e. no large claims, expected number of claims 1000; lhs: distribution function; rhs: log-log plot.
Example 4.3 (Translated gamma and log-normal approximations). We revisit

cases (a) and (b) of Example 4.2, that is, we only consider the small claims layer
and we would like to approximate the compound Poisson distribution in this small
claims layer by translated gamma and log-normal distributions.
The approximations for expected number of claims λv = 100, i.e. case (a), are
presented in Figure 4.4 and the ones for expected number of claims λv = 1000,

i.e. case (b), in Figure 4.5. In both cases we see that the translated gamma and log-
normal approximations provide remarkably good fits. For this reason, the small
claims layer is often approximated by one of these two parametric distribution
functions.
Observe that for k > λv we have a Chernoff type bound of (Stirling’s formula provides the asymptotic behavior log k! ∼ k log(k/e) as k → ∞)
$$P[N \ge k] \le \exp\{-k \log k - \lambda v + k \log(e \lambda v)\}.$$
This explains that the compound Poisson distribution with bounded claim sizes Yi ≤ M is less heavy tailed compared to the translated gamma and log-normal distributions.
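The Chernoff type bound can be compared with the exact Poisson tail. The sketch below sums the Poisson probability weights directly; the expected number of claims and the evaluation points k are hypothetical.

```python
import math

def poisson_tail(k, mean):
    """P[N >= k] for N ~ Poi(mean) and integer k > mean, by direct summation."""
    p = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1.0))  # P[N = k]
    total, j = 0.0, k
    while p > 1e-18 * total:   # terms decrease geometrically because j >= k > mean
        total += p
        j += 1
        p *= mean / j
    return total

def chernoff_bound(k, mean):
    """Upper bound exp{-k log k - mean + k log(e * mean)} for P[N >= k], k > mean."""
    return math.exp(-k * math.log(k) - mean + k * math.log(math.e * mean))

mean = 100.0   # hypothetical expected number of claims
for k in (110, 150, 200, 400):
    print(k, poisson_tail(k, mean), chernoff_bound(k, mean))
```

The bound decays faster than any exponential in k, which is the light tailed behavior referred to above.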
The KS test rejects the null hypothesis on the 5% significance level for the normal
approximation in both cases (a) and (b), whereas this is not the case for the
translated gamma and log-normal approximations in both cases (a) and (b), the
p-values are clearly bigger than 5%; for the exact p-values we refer to Table 4.1,
below. In case (a) the translated gamma approximation is favored, in case (b)
the translated log-normal approximation (though the differences in the latter are
negligible).
tes
4.1.3 Edgeworth approximation
The Edgeworth approximation is named after Francis Ysidro Edgeworth (1845-1926). The approximations presented in the previous section were quite ad-hoc, i.e. we have just chosen a distribution function that enjoys skewness and then we have done a fit of moments (with no further argument on the shape of the approximating distribution function). The Edgeworth approximation starts from the CLT and then tries to adjust higher order terms in approximation (4.1) by the evaluation of the appropriate moment generating function in terms of Taylor expansions.
Assume that S is compound Poisson with G having a positive radius of convergence ρ0 > 0, see Theorem 4.1. Then, we define the normalized random variable
$$Z = \frac{S - \lambda v E[Y_1]}{\sqrt{\lambda v E[Y_1^2]}}.$$
We have E[Z] = 0, Var(Z) = 1 and ςZ = ςS, and in fact the latter identity applies to all further normalized moments of Z and S. The aim now is to approximate the moment generating function of Z by appropriate terms coming from normal distributions. Therefore, we first consider the following Taylor expansion around the origin, choose n ≥ 3,
$$\log M_Z(r) = \sum_{k=0}^{n} \frac{\frac{d^k}{dr^k} \log M_Z(r)\big|_{r=0}}{k!}\, r^k + o(r^n) \qquad \text{as } r \to 0.$$
We set $a_k = \frac{d^k}{dr^k} \log M_Z(r)\big|_{r=0} / k!$. Note that we have $a_0 = \log M_Z(0) = 0$, $a_1 = E[Z] = 0$ and $a_2 = \text{Var}(Z)/2! = 1/2$. This provides the approximation
$$M_Z(r) \approx \exp\left\{ \frac{1}{2} r^2 + \sum_{k=3}^{n} a_k r^k \right\} = \exp\left\{ \frac{1}{2} r^2 \right\} \exp\left\{ \sum_{k=3}^{n} a_k r^k \right\}.$$
Using another Taylor expansion for $e^x = 1 + x + x^2/2! + \ldots$ applied to the latter exponential function in the last expression, the moment generating function of Z is approximated by
$$M_Z(r) \approx e^{r^2/2} \left( 1 + \sum_{k=3}^{n} a_k r^k + \frac{\left( \sum_{k=3}^{n} a_k r^k \right)^2}{2!} + \ldots \right).$$
Depending on the required precision as r → 0 we can choose more terms in the bracket (highlighted by “+ . . .”) and we can take more terms in the summation reflected by the upper index n in the summation. Thus, for appropriate constants bk ∈ R we get the approximation for small r
$$M_Z(r) \approx e^{r^2/2} \left( 1 + a_3 r^3 + \sum_{k \ge 4} b_k r^k \right). \tag{4.3}$$
Lemma 4.4. Let Φ denote the standard Gaussian distribution function and $\Phi^{(k)}$ its k-th derivative. Then for k ∈ N0 and r ∈ R
$$r^k e^{r^2/2} = (-1)^k \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(k+1)}(x)\, dx.$$
Proof. The proof goes by induction. Choose k = 0, then
$$\int_{-\infty}^{\infty} e^{rx}\, \Phi'(x)\, dx = \int_{-\infty}^{\infty} e^{rx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = M_X(r) = e^{r^2/2},$$
which is the moment generating function of X ∼ N(0, 1).
Induction step k → k + 1. Using integration by parts we have
$$(-1)^{k+1} \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(k+2)}(x)\, dx = (-1)^{k+1} \left[ e^{rx}\, \Phi^{(k+1)}(x) \right]_{-\infty}^{\infty} - (-1)^{k+1} \int_{-\infty}^{\infty} r e^{rx}\, \Phi^{(k+1)}(x)\, dx.$$
Note that the first term on the right-hand side is equal to zero because $\Phi^{(k+1)}(x)$ goes faster to zero than $e^{rx}$ may possibly converge to infinity. This and the induction assumption for k provide the identity
$$(-1)^{k+1} \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(k+2)}(x)\, dx = r\, (-1)^k \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(k+1)}(x)\, dx = r\, r^k e^{r^2/2},$$
which proves the claim. □
Lemma 4.4 allows us to rewrite approximation (4.3) as follows, set X ∼ N(0, 1),
$$
\begin{aligned}
M_Z(r) &\approx E\left[e^{rX}\right] - a_3 \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(4)}(x)\, dx + \sum_{k \ge 4} b_k (-1)^k \int_{-\infty}^{\infty} e^{rx}\, \Phi^{(k+1)}(x)\, dx \\
&= \int_{-\infty}^{\infty} e^{rx} \left( \Phi'(x) - a_3 \Phi^{(4)}(x) + \sum_{k \ge 4} b_k (-1)^k \Phi^{(k+1)}(x) \right) dx.
\end{aligned}
$$
Assume that Z has distribution function denoted by FZ, then the latter suggests the approximation, see Lemmas 1.2-1.3,
$$dF_Z(z) \approx \left( \Phi'(z) - a_3 \Phi^{(4)}(z) + \sum_{k \ge 4} b_k (-1)^k \Phi^{(k+1)}(z) \right) dz.$$
Integration then provides the Edgeworth approximation, set $x = \sqrt{\lambda v E[Y_1^2]}\, z + \lambda v E[Y_1]$,
$$P[S \le x] = F_Z(z) \approx \text{EW}(z) \overset{\text{def.}}{=} \Phi(z) - a_3 \Phi^{(3)}(z) + \sum_{k \ge 4} b_k (-1)^k \Phi^{(k)}(z). \tag{4.4}$$
This formula now highlights the refinement of the normal approximation (4.1), namely we correct the first order approximation Φ by higher order terms involving skewness and other higher order terms reflected by a3 and bk in (4.4).
The Edgeworth approximation (4.4) is elegant but its use requires some care as we are just going to highlight.
We first consider the derivatives $\Phi^{(k)}$ for k ≥ 1. The first derivative is given by
$$\Phi'(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2},$$
and the higher order derivatives for k ≥ 2 are given by
$$\Phi^{(k)}(z) = \frac{d^{k-1}}{dz^{k-1}} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} = O\left( z^{k-1} e^{-z^2/2} \right) \qquad \text{for } |z| \to \infty.$$
From this we immediately see that
$$\lim_{z \to -\infty} \text{EW}(z) = 0 \qquad \text{and} \qquad \lim_{z \to \infty} \text{EW}(z) = 1.$$
Attention. The issue with the Edgeworth approximation EW(z) is that it is not necessarily a distribution function because it does not need to be monotone in z, see Example 4.5, below!

Example 4.5. To see the non-monotonicity of EW(z) we only take skewness, i.e. $a_3 = \varsigma_Z \sigma_Z^3 / 6 = \varsigma_S / 6$, into account and the approximation $e^z \approx 1 + z$ in (4.4). We have
$$
\begin{aligned}
\Phi'(z) &= \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \\
\Phi^{(2)}(z) &= -z \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \\
\Phi^{(3)}(z) &= -\frac{1}{\sqrt{2\pi}} e^{-z^2/2} + z^2 \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \\
\Phi^{(4)}(z) &= z \frac{1}{\sqrt{2\pi}} e^{-z^2/2} + 2z \frac{1}{\sqrt{2\pi}} e^{-z^2/2} - z^3 \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.
\end{aligned}
$$
This implies
$$\frac{d}{dz} \text{EW}(z) = \Phi'(z) - a_3 \Phi^{(4)}(z) = \Phi'(z) \left( 1 - 3 a_3 z + a_3 z^3 \right). \tag{4.5}$$
Consider the function $h(z) = 1 - 3 a_3 z + a_3 z^3$ for positive skewness ςS > 0. Then we have
$$\lim_{z \to -\infty} h(z) = -\infty \qquad \text{and} \qquad \lim_{z \to \infty} h(z) = \infty,$$
which explains that the derivative of EW(z) has both signs and therefore EW(z) is not monotone. However, in the upper tail of the distribution of S, that is, for z sufficiently large, the Edgeworth approximation (4.4) is monotone and can be used as an appropriate approximation. We would like to emphasize that these monotonicity properties should always be carefully checked in the Edgeworth approximation.
We revisit the numerical examples given in Example 4.3.
In Figure 4.6 we give the approximation in case (a), i.e. expected number of claims equal to 100, and in Figure 4.7 we give the approximation in case (b), i.e. expected number of claims equal to 1000. In both cases we only choose the next additional moment which is the skewness and refers to term a3 and we choose the approximation $e^z \approx 1 + z$ in (4.4). We see in both cases that the Edgeworth approximation clearly outperforms the Gaussian approximation. However, the Edgeworth approximation is still light-tailed which can be seen by comparing it to the translated gamma approximation.
In Figure 4.8 we compare the Edgeworth ’density’ (4.5) to the Gaussian density. We clearly see the influence of the skewness parameter a3 and ςS > 0, respectively. Moreover, we also see that the influence of the skewness parameter is decreasing with higher expected number of claims. Of course, this exactly reflects the CLT, see Theorem 4.1.
If we calculate the minimal value of the Edgeworth ’density’ (4.5) we obtain in case (a) the value $-9.8 \cdot 10^{-4}$ and in case (b) the value $-4.1 \cdot 10^{-5}$. This exactly explains that the Edgeworth ’density’ is not a proper probability density because it violates the positivity property. However, this is only in the range of very small claims and therefore it can be used as approximation in the range of large claims.
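The monotonicity check can be automated. The following sketch implements the skewness-only versions of (4.4) and (4.5) on the standardized z-scale and searches a grid for the minimum of the Edgeworth ’density’. The grid and the function names are choices made here; since the evaluation is on the z-scale, the minima found need not coincide with the values quoted in the text.

```python
import math

def gauss_density(z):
    """Standard Gaussian density Phi'(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gauss_cdf(z):
    """Standard Gaussian distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def edgeworth_cdf(z, skew):
    """EW(z) = Phi(z) - a3 * Phi'''(z) with a3 = skew / 6, see (4.4)."""
    a3 = skew / 6.0
    return gauss_cdf(z) - a3 * (z * z - 1.0) * gauss_density(z)

def edgeworth_density(z, skew):
    """Derivative (4.5): Phi'(z) * (1 - 3 a3 z + a3 z^3)."""
    a3 = skew / 6.0
    return gauss_density(z) * (1.0 - 3.0 * a3 * z + a3 * z ** 3)

grid = [-8.0 + 0.001 * j for j in range(16001)]   # z in [-8, 8]
for skew in (0.43, 0.15):                          # skewness of cases (a) and (b)
    z_min = min(grid, key=lambda z: edgeworth_density(z, skew))
    print(f"skewness {skew}: minimal 'density' {edgeworth_density(z_min, skew):.2e} at z = {z_min:.2f}")

min_a = min(edgeworth_density(z, 0.43) for z in grid)
min_b = min(edgeworth_density(z, 0.15) for z in grid)
```

A negative minimum signals that EW is decreasing somewhere, so the approximation should only be used in the region where the ’density’ is positive.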

Figure 4.6: Compound Poisson distribution of S and normal approximation (4.1), translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case (a), i.e. no large claims, expected number of claims 100; lhs: distribution function; rhs: log-log plot.
Figure 4.7: Compound Poisson distribution of S and normal approximation (4.1), translated gamma approximation (4.2) and Edgeworth approximation (4.4) in case (b), i.e. no large claims, expected number of claims 1000; lhs: distribution function; rhs: log-log plot.
Finally, in Table 4.1 we present the p-values of the different approximations resulting from the KS test, see Section 3.3.1. In this particular case we see that the translated gamma distribution is preferred in case (a), whereas in case (b) the approximations are very similar. For this reason, one often chooses a translated gamma distribution in practice. Note that the Edgeworth approximation can be refined and improved by considering more terms in the Taylor expansion. This closes the example.
Figure 4.8: We compare the Edgeworth ’density’ (4.5) to the Gaussian density; lhs: in case (a), i.e. expected number of claims 100; rhs: in case (b), i.e. expected number of claims 1000.
                                        case (a)    case (b)
normal approximation                       0%          0%
translated gamma approximation            51%         57%
translated log-normal approximation        8%         59%
Edgeworth approximation                   13%         58%
Table 4.1: p-values of the KS test of Section 3.3.1.
4.2 Algorithms for compound distributions

4.2.1 Panjer algorithm
The Panjer algorithm (also known as Panjer’s recursion) goes back to Harry H. Panjer [73]. The Panjer algorithm assumes a specific property for the claims count distribution and then it uses this property in a clever way to develop a recursive algorithm for the calculation of the distribution function of S.
Throughout this section we assume that N has a claims count distribution that is supported in a possibly infinite interval A ⊂ N0 containing 0. The corresponding probability weights are denoted by pk for k ∈ N0 and we set pk = 0 for k ∉ A.
Definition 4.6 (Panjer distribution). N has a Panjer distribution if there exist constants a, b ∈ R such that for all k ∈ N we have the recursion
$$p_k = p_{k-1} \left( a + b/k \right).$$
Note that the Panjer distributions require p0 > 0, otherwise the recursion will not provide a well-defined distribution function.
Bjørn Sundt and William S. Jewell (1932-2003) have characterized the Panjer distributions. This is exactly stated in the following lemma.
Lemma 4.7 (Sundt-Jewell [83]). Assume N has a non-degenerate Panjer distribution. N is either binomially, Poisson or negative-binomially distributed.
Proof. In order for N to have a non-degenerate distribution function we need to have |A| > 1. Thus, we may and will choose as initialization k = 1 ∈ A (A is an interval containing 0). The Panjer distribution then provides for this k the identity p1 = p0 (a + b). To have a well-defined distribution function we need to have a + b ≥ 0, otherwise p1 < 0. The case a + b = 0 provides a degenerate distribution function, thus we even need to have a + b > 0.
Case (i). Assume a = 0. This implies b > 0 and
$$p_k = p_{k-1} \frac{b}{k} > 0 \qquad \text{for all } k \in \mathbb{N}.$$
This is exactly the Poisson distribution with parameters a = 0 and b = λv > 0 for A = N0 because for the Poisson distribution we have, see Section 2.2.2, $p_k / p_{k-1} = \lambda v / k$.
Case (ii). Assume a < 0. To have positive probabilities we need to make sure that a + b/k remains positive for all k ∈ A. This requires |A| < ∞. We denote the maximal value in A by v ∈ N (assuming it has pv > 0). The positivity constraint then provides b/v > −a and a + b/(v + 1) = 0. The latter implies that pk = 0 for all k > v and is equivalent to the requirement v = −(a + b)/a > 0. We set p = −a/(1 − a) ∈ (0, 1) which provides
$$p_k = p_{k-1} \left( a + \frac{b}{k} \right) = p_{k-1} \left( -\frac{p}{1-p} + \frac{b}{k} \right).$$
For the binomial distribution we have on A, see Section 2.2.1,
$$\frac{p_k}{p_{k-1}} = \frac{p}{1-p}\, \frac{v-k+1}{k} = -\frac{p}{1-p} + \frac{p}{1-p}\, \frac{v+1}{k}.$$
This is exactly the binomial distribution with parameters a = −p/(1 − p) and b = (v + 1)p/(1 − p) and A = {0, . . . , v}.
Case (iii). Assume a > 0. In this case we define γ = (a + b)/a > 0. This provides b = a(γ − 1) and
$$p_k = p_{k-1} \left( a + \frac{b}{k} \right) = p_{k-1}\, a \left( 1 + \frac{\gamma-1}{k} \right).$$
Since the latter should be summable in order to obtain a well-defined distribution function we need to have a < 1. For the negative-binomial distribution we have, see Proposition 2.20,
$$\frac{p_k}{p_{k-1}} = \frac{(1-p)(k+\gamma-1)}{k} = 1 - p + \frac{(1-p)(\gamma-1)}{k}.$$
This is exactly the negative-binomial distribution with parameters a = 1 − p and b = (1 − p)(γ − 1) and A = N0. This proves the lemma. □
The previous lemma shows that the (important) claims count distributions that
we have considered in Chapter 2 are Panjer distributions and the corresponding
choices a, b ∈ R are provided in the proof of Lemma 4.7.
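The parameter choices from the proof of Lemma 4.7 can be verified numerically by running the recursion of Definition 4.6 and comparing with the explicit probability weights. In this sketch the starting values p0 and the negative-binomial weights $\binom{k+\gamma-1}{k} p^\gamma (1-p)^k$ are assumptions about the parametrization (they are consistent with the ratios used in the proof); all numerical values are hypothetical.

```python
import math

def panjer_weights(p0, a, b, k_max):
    """Probability weights p_0, ..., p_kmax from p_k = p_{k-1} * (a + b / k)."""
    p = [p0]
    for k in range(1, k_max + 1):
        p.append(p[-1] * (a + b / k))
    return p

lam = 2.5                 # Poisson: a = 0, b = lam, p_0 = exp(-lam)
poisson = panjer_weights(math.exp(-lam), 0.0, lam, 10)

v, q = 8, 0.3             # binomial: a = -q/(1-q), b = (v+1) q/(1-q), p_0 = (1-q)^v
binomial = panjer_weights((1 - q) ** v, -q / (1 - q), (v + 1) * q / (1 - q), v)

gam, pr = 1.7, 0.4        # negative binomial: a = 1-pr, b = (1-pr)(gam-1), p_0 = pr^gam
negbin = panjer_weights(pr ** gam, 1 - pr, (1 - pr) * (gam - 1), 10)

print(sum(binomial))      # the binomial weights on {0, ..., v} add up to one
```

The sign of a identifies the family: a = 0 is Poisson, a < 0 is binomial and 0 < a < 1 is negative-binomial.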
Theorem 4.8 (Panjer algorithm [73]). Assume S has a compound distribution according to Model Assumptions 2.1 with N having a Panjer distribution with parameters a, b ∈ R and the claim size distribution G is discrete with support N. Denote gm = P[Y1 = m] for m ∈ N. Then we have for r ∈ N0
$$
f_r \overset{\text{def.}}{=} P[S = r] =
\begin{cases}
p_0 & \text{for } r = 0, \\
\sum_{k=1}^{r} \left( a + b \frac{k}{r} \right) g_k f_{r-k} & \text{for } r > 0.
\end{cases}
$$
Remarks.
• The Panjer algorithm requires a Panjer distribution for N and strictly positive and discrete claim sizes Yi ∈ N, P-a.s. Then it provides an algorithm that easily allows to calculate the compound distribution without doing the involved convolutions (2.1).
• Assume N ∼ Poi(λv), then we have a = 0 and b = λv and for r ∈ N
$$f_r = \lambda v \sum_{k=1}^{r} \frac{k}{r}\, g_k f_{r-k}. \tag{4.6}$$
This allows to apply the simple recursion
$$
\begin{aligned}
f_0 &= p_0 = e^{-\lambda v}, \\
f_1 &= \lambda v\, g_1 f_0, \\
f_2 &= \lambda v\, \tfrac{1}{2}\, g_1 f_1 + \lambda v\, g_2 f_0, \\
f_3 &= \lambda v\, \tfrac{1}{3}\, g_1 f_2 + \lambda v\, \tfrac{2}{3}\, g_2 f_1 + \lambda v\, g_3 f_0, \\
&\;\;\vdots
\end{aligned}
$$
Observe that fr only depends on f0, . . . , fr−1 which allows to perform this algorithm.
• More remarks concerning the Panjer algorithm are provided below.
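The recursion of Theorem 4.8 translates almost literally into code. The following is a minimal sketch, assuming the claim size weights are passed as a list `g` with `g[m]` = P[Y1 = m] and `g[0]` = 0; the function name and the small claim size distribution used below are hypothetical.

```python
import math

def panjer(p0, a, b, g, r_max):
    """Probability weights f_r = P[S = r], r = 0, ..., r_max, via Theorem 4.8.

    g[m] = P[Y1 = m] with g[0] = 0 (strictly positive claim sizes);
    weights beyond the end of the list g are treated as zero.
    """
    if g[0] != 0.0:
        raise ValueError("Theorem 4.8 requires g[0] = 0")
    f = [p0]
    for r in range(1, r_max + 1):
        k_top = min(r, len(g) - 1)
        f.append(sum((a + b * k / r) * g[k] * f[r - k] for k in range(1, k_top + 1)))
    return f

lam_v = 2.0
g = [0.0, 0.5, 0.3, 0.2]     # hypothetical claim sizes in {1, 2, 3}
f = panjer(math.exp(-lam_v), 0.0, lam_v, g, 60)   # Poisson case: a = 0, b = lam_v

total_mass = sum(f)
mean_S = sum(r * f_r for r, f_r in enumerate(f))
print(f"total mass {total_mass:.10f}, mean {mean_S:.6f}")
```

Checking that the weights add up to one and that the mean equals $\lambda v E[Y_1]$ is a cheap sanity test of any implementation; the cost of the recursion is of order $r_{\max}^2$ operations.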
In order to prove Theorem 4.8 we need a technical lemma, see also Lemma 1.5 in Schmidli [81].
Lemma 4.9. Make the assumptions of Theorem 4.8.
(i) For r ≥ n ≥ 1 we have
$$E\left[ Y_1 \,\middle|\, \sum_{i=1}^{n} Y_i = r \right] = r/n,$$
where, of course, we assume positive probability of the event $\{\sum_{i=1}^{n} Y_i = r\}$.
(ii) For r ≥ n ≥ 2
$$p_n\, g_r^{*n} = \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k\, p_{n-1}\, g_{r-k}^{*(n-1)}.$$
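Proof. Statement (i) uses the i.i.d. property of Y1, Y2, . . . to see that
$$n\, E\left[ Y_1 \,\middle|\, \sum_{i=1}^{n} Y_i = r \right] = \sum_{i=1}^{n} E\left[ Y_i \,\middle|\, \sum_{j=1}^{n} Y_j = r \right] = E\left[ \sum_{i=1}^{n} Y_i \,\middle|\, \sum_{i=1}^{n} Y_i = r \right] = r.$$
For proving (ii) we start on the right-hand side
$$
\begin{aligned}
\sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k\, p_{n-1}\, g_{r-k}^{*(n-1)}
&= p_{n-1} \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) P\left[ Y_1 = k, \sum_{i=2}^{n} Y_i = r-k \right] \\
&= p_{n-1} \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) P\left[ Y_1 = k, \sum_{i=1}^{n} Y_i = r \right] \\
&= p_{n-1} \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) P\left[ Y_1 = k \,\middle|\, \sum_{i=1}^{n} Y_i = r \right] g_r^{*n}.
\end{aligned}
$$
Observe that the last line is exactly the conditional expectation of a + bY1/r, conditioned on the event $\{\sum_{i=1}^{n} Y_i = r\}$, provided that $g_r^{*n} > 0$. Therefore, we can apply (i) of this lemma which provides
$$\sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k\, p_{n-1}\, g_{r-k}^{*(n-1)} = p_{n-1}\, E\left[ a + b \frac{Y_1}{r} \,\middle|\, \sum_{i=1}^{n} Y_i = r \right] g_r^{*n} = p_{n-1} \left( a + \frac{b}{n} \right) g_r^{*n} = p_n\, g_r^{*n},$$
where in the last step we have used that $(p_n)_{n \ge 0}$ is a Panjer distribution. □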
We are now ready to prove the Panjer algorithm theorem.
Proof of Theorem 4.8. Initialization is clear because from Y1 ≥ 1, P-a.s., we obtain f0 = P[S = 0] = P[N = 0] = p0. Then, for r ≥ 1, using Lemma 4.9 in the third step,
$$
\begin{aligned}
f_r &= \sum_{n=1}^{r} p_n\, g_r^{*n} = p_1 g_r + \sum_{n=2}^{r} p_n\, g_r^{*n} \\
&= p_1 g_r + \sum_{n=2}^{r} \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k\, p_{n-1}\, g_{r-k}^{*(n-1)} \\
&= p_1 g_r + \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k \sum_{n=2}^{r} p_{n-1}\, g_{r-k}^{*(n-1)} \\
&= p_1 g_r + \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k \sum_{n=1}^{r-k} p_n\, g_{r-k}^{*n} = p_1 g_r + \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k f_{r-k},
\end{aligned}
$$
where in the second last step we have used that $g_m^{*n} = 0$ for n > m. Observe $p_1 g_r = p_0 (a+b) g_r = f_0 (a+b) g_r$. Therefore the right-hand side is transformed to
$$f_r = p_0 (a+b) g_r + \sum_{k=1}^{r-1} \left( a + b \frac{k}{r} \right) g_k f_{r-k} = \sum_{k=1}^{r} \left( a + b \frac{k}{r} \right) g_k f_{r-k},$$
which proves the claim of the Panjer algorithm. □
Remarks.
• In practical applications the situation may occur that the initial value f0 is nonsensical on IT systems. This has to do with the fact that we can represent numbers only up to some precision. Let us explain this using the compound Poisson distribution providing Panjer algorithm (4.6). If the expected number of claims λv is very large, then on IT systems the initial value $f_0 = p_0 = e^{-\lambda v}$ may be interpreted as zero and thus the algorithm cannot start due to missing precision and a missing meaningful starting value. This is called numerical underflow.
In this case we can modify the Panjer algorithm as follows: choose any strictly positive starting value $\tilde f_0 > 0$ and develop the iteration
$$
\begin{aligned}
\tilde f_1 &= \lambda v\, g_1 \tilde f_0, \\
\tilde f_2 &= \lambda v\, \tfrac{1}{2}\, g_1 \tilde f_1 + \lambda v\, g_2 \tilde f_0, \\
\tilde f_3 &= \lambda v\, \tfrac{1}{3}\, g_1 \tilde f_2 + \lambda v\, \tfrac{2}{3}\, g_2 \tilde f_1 + \lambda v\, g_3 \tilde f_0, \\
&\;\;\vdots
\end{aligned}
$$
Observe that this provides a multiplicative shift from $f_r$ to $\tilde f_r$. The true probability weights are then found by
$$f_r = \exp\left\{ \log \tilde f_r + \log f_0 - \log \tilde f_0 \right\},$$
where we go over to the log-scale to avoid another multiplication with missing precision.
This multiplicative shift may lead to a numerical overflow, which might make it necessary to shift the algorithm forward and backward several times to get sensible values. Important at the end is a final check to see whether
$$\sum_{r=0}^{n} f_r \to 1 \qquad \text{as } n \to \infty,$$
in order to have total probability mass 1.
• We need to have discrete claim sizes Yi ∈ N. Of course, this can be modified to any other span d > 0, i.e. Yi ∈ dN, because for r ∈ N
$$P[S = dr] = P\left[ \sum_{i=1}^{N} Y_i = dr \right] = P\left[ \sum_{i=1}^{N} Y_i/d = r \right] = P\left[ \sum_{i=1}^{N} \tilde Y_i = r \right],$$
with $\tilde Y_i = Y_i/d \in \mathbb{N}$.
• If we have non-discrete claim sizes Yi we need to discretize them in order to

apply the Panjer algorithm. Therefore, we choose a span size d > 0 and we
consider for k ∈ N0 the probabilities
G((k + 1)d) − G(kd) = P [kd < Y1 ≤ (k + 1)d] .

tes
These probabilities can now either be shifted to the left or to the right end-
point of the interval [kd, (k+1)d]. We define the two new discrete distribution
functions for k ∈ N0
no
h i
+
gk+1 = P Y1+ = (k + 1)d = G((k + 1)d) − G(kd), (4.7)
and h i
gk− = P Y1− = kd = G((k + 1)d) − G(kd). (4.8)
This provides the following stochastic ordering
NL
N N N
Y1− ≤ Y1 ≤ Y1+ . S− = Yi− ≤ S = Yi ≤ S + = Yi+ ,
X X X
and
i=1 i=1 i=1
for Yi− being i.i.d. copies of Y1− and Yi+ being i.i.d. copies of Y1+ (also inde-
pendent of N ). Thus, we get lower and upper bounds S − ≤ S ≤ S + which
become more narrow the smaller we choose the span d. In most applications,
especially for small λv, these bounds/approximations are sufficient compared
to the other uncertainties involved in the prediction process (parameter esti-
mation uncertainty, etc.).
To $S^+$ we can directly apply the Panjer algorithm. $S^-$ is more subtle because
it may happen that $g_0^- > 0$ and, thus, the Panjer algorithm cannot be applied
in its classical form. In the case of the compound Poisson distribution
this problem can be circumvented quite easily due to the disjoint decomposition
theorem, Theorem 2.14, which says that
\[
S^- = \sum_{i=1}^{N} Y_i^- = \sum_{i=1}^{N} Y_i^- \, 1_{\{Y_i^- > 0\}} \overset{(d)}{=} \widetilde{S}^-
\]
has again a compound Poisson distribution with parameters $\widetilde{\lambda v} = \lambda v (1 - g_0^-)$
and weights of the claim sizes $\widetilde{g}_k^- = g_k^-/(1 - g_0^-)$ for k ∈ N. Finally, we apply
the Panjer algorithm to the compound Poisson distributed random variable
$\widetilde{S}^-$ to get the second bound.
Of course, there are more sophisticated discretization methods but often our
(rough) proposal is sufficient.
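The discretization (4.7)-(4.8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pareto_cdf` and `discretize` and the truncation index `k_max` are hypothetical choices, and the Pareto parameters are borrowed from the example that follows.

```python
def pareto_cdf(x, theta, alpha):
    """Pareto(theta, alpha) distribution function: G(x) = 1 - (x/theta)^(-alpha) for x >= theta."""
    return 0.0 if x < theta else 1.0 - (x / theta) ** (-alpha)


def discretize(cdf, d, k_max):
    """Return (g_minus, g_plus), two lists indexed by k = 0, ..., k_max + 1.

    The interval mass G((k+1)d) - G(kd) is put on the left endpoint kd
    for g_minus (4.8) and on the right endpoint (k+1)d for g_plus (4.7).
    Mass beyond (k_max + 1) * d is dropped, so both may sum to less than 1.
    """
    mass = [cdf((k + 1) * d) - cdf(k * d) for k in range(k_max + 1)]
    g_minus = mass + [0.0]   # g_minus[k]     = mass of (kd, (k+1)d]
    g_plus = [0.0] + mass    # g_plus[k + 1]  = mass of (kd, (k+1)d]
    return g_minus, g_plus


theta, alpha, d = 500_000, 2.5, 100_000
g_minus, g_plus = discretize(lambda x: pareto_cdf(x, theta, alpha), d, k_max=200)

# The two discretized means differ by exactly d times the retained mass.
mean_minus = sum(k * d * g for k, g in enumerate(g_minus))
mean_plus = sum(k * d * g for k, g in enumerate(g_plus))
```

Since `g_plus[0]` is zero by construction, the upper bound can be fed to the Panjer algorithm directly, whereas `g_minus[0]` may be positive for distributions with mass near zero and then needs the thinning step described above.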
Example 4.10 (Panjer algorithm compound Poisson distribution). We choose a
compound Poisson model with expected number of claims λv = 1 and Pareto claim
size distribution $Y_i \overset{i.i.d.}{\sim}$ Pareto(θ, α) with θ = 500’000 and α = 2.5. In a first step we
need to discretize the claim sizes. We calculate the distributions of $Y_i^- \le Y_i \le Y_i^+$
according to (4.7) and (4.8) with, for kd ≥ θ,
\[
g_k^- = g_{k+1}^+ = G((k+1)d) - G(kd)
= \left(\frac{kd}{\theta}\right)^{-\alpha} - \left(\frac{(k+1)d}{\theta}\right)^{-\alpha}.
\]
For kd < θ these weights are zero because θ is a multiple of both spans chosen below.
Figure 4.9: Discretized claim size distributions $(g_k^-)_k$ and $(g_k^+)_k$; lhs: case (i) with
span d = 100’000; rhs: case (ii) with span d = 10’000.
As span size we choose two different values: (i) d = 100’000 and (ii) d = 10’000. In
Figure 4.9 we plot the resulting probability weights $(g_k^-)_k$ and $(g_k^+)_k$. We see that
the discretization error disappears for decreasing span d.
We then implement the Panjer algorithm in R. The implementation is rather
straightforward. In a first step we invert the ordering in the claim size distributions
$(g_k^-)_k$ and $(g_k^+)_k$ so that in the second step we can apply matrix multiplications.
Since R vectors are indexed from 1, we store $g_k$ in g[1,k+1] and $f_r$ in f[r+1].
This looks as follows:
> # g is a 2 x (Kmax+1) matrix; row 2 receives k * g_k in reversed order
> for (k in 0:Kmax) { g[2,Kmax-k+1] <- g[1,k+1]*k }
> f <- numeric(Kmax+1)
> f[1] <- exp(-lambda * v)   # f_0 = P[N = 0]
> for (r in 1:Kmax) {
    f[r+1] <- g[2,(Kmax-r+1):Kmax] %*% f[1:r] * lambda * v / r
  }
The results are presented in Figures 4.10 and 4.11.
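For readers who prefer to experiment outside R, the same recursion can be sketched in Python. This is a minimal illustration and not the implementation behind the figures: the function name `panjer_compound_poisson` and the toy claim size distribution on {d, 2d} are assumptions made for the example.

```python
import math


def panjer_compound_poisson(lam_v, g, r_max):
    """Panjer recursion for S = Y_1 + ... + Y_N with N ~ Poisson(lam_v).

    g[k] = P[Y_1 = k d] for k = 0, 1, ..., and g[0] = 0 is required.
    Returns f with f[r] = P[S = r d] for r = 0, ..., r_max.
    """
    if g[0] != 0.0:
        raise ValueError("classical form needs g[0] = 0; thin the Poisson parameter first")
    f = [math.exp(-lam_v)]  # f_0 = P[N = 0]
    for r in range(1, r_max + 1):
        k_top = min(r, len(g) - 1)
        total = sum(k * g[k] * f[r - k] for k in range(1, k_top + 1))
        f.append(lam_v / r * total)  # f_r = (lam_v / r) * sum_k k g_k f_{r-k}
    return f


# Toy example (hypothetical weights): claims of size d or 2d with probability 1/2 each.
f = panjer_compound_poisson(1.0, [0.0, 0.5, 0.5], r_max=60)
```

In this toy case the first weights can be checked by hand, e.g. P[S = 2d] = P[N = 1]/2 + P[N = 2]/4.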
Figure 4.10: Discrete probability weights of compound Poisson distribution with
λv = 1 from Panjer algorithm; lhs: case (i) with span d = 100’000; rhs: case (ii)
with span d = 10’000.
In Figure 4.10 we plot the resulting probability weights of the (discretized)
compound Poisson distribution; the left-hand side gives the picture for span d = 100’000
and the right-hand side for d = 10’000. We observe that span d = 100’000
gives considerable differences between the lower and upper bounds reflected by $(g_k^-)_k$
and $(g_k^+)_k$, whereas for span d = 10’000 they are sufficiently close so that we obtain
appropriate approximations to the continuous Pareto distribution case. We also observe
that the resulting distribution has two obvious modes, see Figure 4.10 (rhs). These
reflect the cases of having one claim N = 1 and having N = 2 claims; the cases
N ≥ 3 only give smaller discontinuities.
Finally, in Figure 4.11 we show the log-log plots of the distribution functions. The
straight blue line reflects the Pareto distribution Y1 ∼ Pareto(θ, α), i.e. of having
exactly one claim with tail parameter α = 2.5 (which corresponds to the negative
slope of the blue line). We observe that asymptotically the compound Poisson
distribution with λv = 1 coincides with the Pareto claim size distribution.
Figure 4.11: Log-log plot of compound Poisson distribution with λv = 1 from
Panjer algorithm; lhs: case (i) with span d = 100’000; rhs: case (ii) with span
d = 10’000.
Example 4.11. We revisit case (c) of Example 4.2, that is, for large claims $S_{lc}$ we
assume a compound Poisson distribution with expected number of claims $\lambda_{lc} v = 3.9$
and Pareto(θ, α) claim size distribution with θ = 500’000 and α = 2.5. We choose
the same discretizations as in Example 4.10, see Figure 4.9, and then we apply the
Panjer algorithm to the large claims layer as explained above. The results for the
distribution of $S_{lc}$ are presented in Figures 4.12 and 4.13.
Figure 4.12: Discrete probability weights of compound Poisson distribution with
$\lambda_{lc} v = 3.9$ from Panjer algorithm; lhs: case (i) with span d = 100’000; rhs: case (ii)
with span d = 10’000.
Figure 4.13: Log-log plot of compound Poisson distribution with $\lambda_{lc} v = 3.9$ from
Panjer algorithm; lhs: case (i) with span d = 100’000; rhs: case (ii) with span
d = 10’000.
The results are very much in line with the ones of Example 4.10 and we should go
for span d = 10’000 which gives a sufficiently good approximation to the continuous
Pareto claim size distribution. Observe that due to $\lambda_{lc} v = 3.9$ the resulting
compound Poisson distribution has more modes now, see Figure 4.12 (rhs). In
Figure 4.13 we see that the asymptotic behavior is sandwiched between the Pareto
distribution Pareto(θ, α) with tail parameter α = 2.5 and this Pareto distribution
stretched with the expected number of claims $\lambda_{lc} v = 3.9$ (blue lines in Figure 4.13).
We also observe a very slow convergence to the asymptotic slope −α which tells us
that parameter estimation is a difficult task if only few observations are available.
Finally, we merge the large claims layer $S_{lc}$ of case (c) in Example 4.2 with the
corresponding small claims layer $S_{sc}$, see case (b) of Example 4.2. In the small
claims layer we choose a translated gamma distribution as approximation to the
true distribution function of $S_{sc}$, i.e. we set
\[
S = S_{sc} + S_{lc} \approx X_{sc} + S_{lc}, \tag{4.9}
\]
where $X_{sc}$ is the translated gamma approximation to $S_{sc}$ (see Example 2.16 and
(4.2)) and $S_{lc}$ models the large claims layer having a compound Poisson distribution
with Pareto claim sizes as described above.
Figure 4.14: Case (c) of Example 4.2: exact discretized distribution $X_{sc} + S_{lc}$ for
span d = 10’000, Monte Carlo approximation and normal approximation (only rhs).
lhs: discrete probability weights (upper and lower bounds); rhs: log-log plot (see
also Figure 4.3 (rhs)).
In order to calculate the compound Poisson random variable $S_{lc}$ we apply the Panjer
algorithm with span d = 10’000. The disjoint decomposition theorem, see Theorem
2.14 and Example 2.16, implies that in the compound Poisson case we may and
will assume that the large claims separation leads to an independent decoupling
of $S_{sc}$ and $S_{lc}$, and $X_{sc}$ and $S_{lc}$, respectively, see (4.9). Therefore, the aggregate
distribution of $X_{sc} + S_{lc}$ is obtained by a simple convolution of the marginal
distributions of $X_{sc}$ and $S_{lc}$. Using a further discretization of the distribution function of
$X_{sc}$ to the same span d = 10’000 as in the Panjer algorithm for $S_{lc}$, the convolution
of $X_{sc} + S_{lc}$ can easily be calculated analytically, i.e. no Monte Carlo simulation
is needed. Namely, denote the discrete probability weights of $X_{sc}$ by $(f_k^{(1)})_{k \ge 0}$ and
the discrete probability weights of $S_{lc}$ by $(f_k^{(2)})_{k \ge 0}$, i.e. set
\[
P[X_{sc} = kd] = f_k^{(1)} \qquad\text{and}\qquad P[S_{lc} = kd] = f_k^{(2)}.
\]
Then, due to independence, we have for all r ∈ N0
\[
f_r \overset{\text{def.}}{=} P[X_{sc} + S_{lc} = rd] = \sum_{k=0}^{r} f_k^{(1)} f_{r-k}^{(2)}. \tag{4.10}
\]
> # R is 1-based: f1[k+1] and f2[1,k+1] hold the weights for k = 0,...,Kmax
> for (k in 0:Kmax) { f2[2,Kmax-k+1] <- f2[1,k+1] }
> f <- numeric(Kmax+1)
> for (r in 0:Kmax) { f[r+1] <- f2[2,(Kmax-r+1):(Kmax+1)] %*% f1[1:(r+1)] }
The results are presented in Figure 4.14. On the left-hand side we present the
probability weights $(f_r)_{r \ge 0}$ and on the right-hand side the log-log plot of the
resulting distribution function. We observe that the Monte Carlo approximation
(100’000 simulations) performs poorly in the tail of the distribution, see Figure
4.14 (rhs), and one should avoid the simulation approach if possible. In particular,
for heavy-tailed distribution functions the Monte Carlo simulation approach
converges slowly. Note that convolution (4.10) is exact
up to the discretization error, and in some sense this discretized version can be
interpreted as an optimal Monte Carlo sample with equidistant observations.
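The convolution (4.10) itself is a one-liner in most languages. The following Python sketch uses a hypothetical helper `convolve` and made-up toy weights; it assumes both weight vectors live on the same grid {0, d, ..., (n−1)d}.

```python
def convolve(f1, f2):
    """Discrete convolution f[r] = sum_{k=0}^{r} f1[k] * f2[r-k], r = 0, ..., n-1.

    f1 and f2 are probability weights on the same span and of equal length n.
    Mass of the sum beyond index n - 1 is simply not computed.
    """
    if len(f1) != len(f2):
        raise ValueError("both weight vectors must have the same length")
    n = len(f1)
    return [sum(f1[k] * f2[r - k] for k in range(r + 1)) for r in range(n)]


# Toy weights (illustrative only, not the ones of Example 4.11).
f = convolve([0.5, 0.5, 0.0, 0.0], [0.25, 0.75, 0.0, 0.0])
```

Each weight costs O(r) multiplications, so the whole vector costs O(n²); for long vectors the Fourier approach of the next section is the usual alternative.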

We conclude that approximation (4.9) with a translated gamma distribution for

the small claims layer and a compound Poisson distribution with Pareto tails for
the large claims layer is often a good model for total claim amount modeling in
non-life insurance. Moreover, using a discretization with appropriate span size d
the resulting distribution function can be calculated analytically.
expected claim amount E[S]                 3’131’397
standard deviation Var(S)^{1/2}              338’819
coefficient of variation Vco(S)                10.8%
99.5%-VaR upper bound                      4’046’500
99.5%-VaR lower bound                      4’038’500
99.5%-VaR − E[S]                          ≈ 912’500
Table 4.2: Resulting key figures, the 99.5%-VaR corresponds to the 99.5%-quantile
of S, see Example 6.25, below. The 99.5%-VaR is calculated with the discretized
version with span d = 10’000, therefore we obtain upper and lower bounds resulting
from the discretization error.
Finally, in Table 4.2 we present the resulting key figures. We observe that the
resulting distribution is substantially more heavy-tailed than the Gaussian
distribution which is not surprising in view of Figure 4.14 (rhs).
4.2.2 Fast Fourier transform

We only briefly sketch the fast Fourier transform (FFT) in order to explain the
main idea. To this end, we follow Section 6.7 in Panjer [74].
In Chapter 1 we have introduced the Laplace-Stieltjes transform of X ∼ F given
by
\[
\widehat{m}_F(r) = M_X(-r) = E\left[e^{-rX}\right] = \int_{\mathbb{R}} e^{-rx} \, dF(x).
\]
The beauty of such transforms is that they allow for dealing elegantly with
independent random variables, in the sense that convolutions turn into products, i.e. for
X and Y independent we have
\[
M_{X+Y}(-r) = M_X(-r) \, M_Y(-r).
\]
Moreover, for compound distributed random variables S we have, see Proposition
2.2,
\[
M_S(-r) = M_N(\log M_{Y_1}(-r)). \tag{4.11}
\]
If we manage to identify the right-hand side of the latter equation, that is, find Z
such that $M_N(\log M_{Y_1}(-r)) = M_Z(-r)$, then Lemma 1.2 explains that S and Z
have the same distribution function and we do not need to perform the convolutions
(if Z is sufficiently explicit). This is also the idea behind the FFT.
However, the Laplace-Stieltjes transform is replaced by the Fourier
transform, which is named after Jean Baptiste Joseph Fourier
(1768-1830). We present the discretized case as it is usually used
in practice.
Assume we have finite support A = {0, . . . , n−1} and that $(f_l)_{l \in A}$
is a discrete distribution function on A. The discrete Fourier
transform of $(f_l)_l$ is defined by
\[
\hat{f}_z = \sum_{l=0}^{n-1} f_l \exp\left\{2\pi i \frac{zl}{n}\right\} \qquad\text{for } z \in A.
\]
Assume $S \sim (f_l)_l$, then we have, by a slight abuse of notation,
\[
\hat{f}_z = M_S\left(2\pi i \frac{z}{n}\right) = E\left[\exp\left\{2\pi i \frac{zS}{n}\right\}\right].
\]
The discrete Fourier transform has the following nice inversion formula
\[
f_l = \frac{1}{n} \sum_{z=0}^{n-1} \hat{f}_z \exp\left\{-2\pi i \frac{zl}{n}\right\} \qquad\text{for } l \in A.
\]
This now provides the idea for the first part of the algorithm: if we are able to
explicitly calculate the discrete Fourier transform $(\hat{f}_z)_z$, the inversion formula provides
the probability weights $(f_l)_l$. Note that this idea also applies if $(f_l)_l$ are weights
that do not necessarily add up to 1. This gives the following recipe.
• Step 1. Choose a threshold n ∈ N up to which we would like to determine the
distribution function of S, i.e. we would like to calculate P[S ≤ n − 1].
• Step 2. Discretize the claim severity distribution G to obtain weights $(g_k)_{k \in A}$,
for discretization we refer to the last section on the Panjer algorithm. Note
that typically we have $\sum_{k \in A} g_k < 1$, because claims Yi may exceed threshold
n − 1 with positive probability.
• Step 3. Calculate the discrete Fourier transform $(\hat{f}_z)_{z \in A}$ of $S \sim (f_l)_{l \in A}$ using
identity (4.11) with −r = 2πiz/n. Note that $\sum_{k \in A} g_k < 1$ does not harm the
calculation since it will simply cancel out in the next step, because there is
only a scaling factor missing.
• Step 4. Apply the inversion formula to obtain $(f_l)_{l \in A}$ from $(\hat{f}_z)_{z \in A}$.
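Steps 2 to 4 can be sketched for the compound Poisson case, where $M_N(\log \hat{g}_z) = \exp\{\lambda v(\hat{g}_z - 1)\}$. The Python code below is an illustration under assumptions: it uses a naive O(n²) transform instead of the FFT, the helper names `dft` and `compound_poisson_via_dft` are hypothetical, and the toy claim sizes have bounded support so that the probability mass of S beyond n − 1 is negligible (the discrete transform is periodic, so such mass would be wrapped around modulo n).

```python
import cmath
import math


def dft(weights, sign=+1):
    """Naive discrete Fourier transform with kernel exp(sign * 2*pi*i*z*l/n)."""
    n = len(weights)
    return [
        sum(w * cmath.exp(sign * 2j * math.pi * z * l / n) for l, w in enumerate(weights))
        for z in range(n)
    ]


def compound_poisson_via_dft(lam_v, g):
    """Weights f_l = P[S = l] for N ~ Poisson(lam_v) and claim weights g on {0, ..., n-1}."""
    n = len(g)
    g_hat = dft(g, sign=+1)                                   # transform of the claim sizes
    f_hat = [cmath.exp(lam_v * (gz - 1.0)) for gz in g_hat]   # Step 3: M_N(log g_hat_z)
    return [(x / n).real for x in dft(f_hat, sign=-1)]        # Step 4: inversion formula


n = 64
g = [0.0, 0.5, 0.5] + [0.0] * (n - 3)   # toy claim sizes 1 or 2, each with probability 1/2
f = compound_poisson_via_dft(1.0, g)
```

For this toy input the result agrees with the Panjer recursion up to rounding, which is a convenient cross-check when both methods are implemented.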
The remaining part now is the FFT which explains how we calculate the discrete
Fourier transform $(\hat{g}_z)_{z \in A}$ of $Y_1 \sim (g_l)_{l \in A}$ which is needed to apply identity (4.11)
for −r = 2πiz/n, i.e. for
\[
\hat{f}_z = M_S\left(2\pi i \frac{z}{n}\right) = M_N\left(\log M_{Y_1}\left(2\pi i \frac{z}{n}\right)\right) = M_N(\log \hat{g}_z).
\]
There is a nice recursive algorithm that allows us to calculate these discrete Fourier
transforms for the choices $n = 2^d$, d ∈ N0:
\begin{align*}
\hat{g}_z &= \sum_{l=0}^{2^d-1} g_l \exp\left\{2\pi i \frac{zl}{2^d}\right\} \\
&= \sum_{l=0}^{2^{d-1}-1} g_{2l} \exp\left\{2\pi i \frac{2zl}{2^d}\right\}
 + \sum_{l=0}^{2^{d-1}-1} g_{2l+1} \exp\left\{2\pi i \frac{z(2l+1)}{2^d}\right\} \\
&= \sum_{l=0}^{2^{d-1}-1} g_{2l} \exp\left\{2\pi i \frac{zl}{2^{d-1}}\right\}
 + \exp\left\{2\pi i \frac{z}{2^d}\right\} \sum_{l=0}^{2^{d-1}-1} g_{2l+1} \exp\left\{2\pi i \frac{zl}{2^{d-1}}\right\} \\
&= \hat{g}_z^{(0)} + \exp\left\{2\pi i \frac{z}{2^d}\right\} \hat{g}_z^{(1)},
\end{align*}
where $\hat{g}_z^{(0)}$ is the discrete Fourier transform of $(g_l^{(0)})_{l=0,\ldots,m-1} = (g_{2l})_{l=0,\ldots,m-1}$ and
$\hat{g}_z^{(1)}$ is the discrete Fourier transform of $(g_l^{(1)})_{l=0,\ldots,m-1} = (g_{2l+1})_{l=0,\ldots,m-1}$ for
$m = 2^{d-1}$. This can now be iterated until we have reduced the total length $2^d$ to $2^0 = 1$.
Observe that the total length of $(\hat{f}_z)_z$ is also $n = 2^d$. Therefore, exactly the same
recursive algorithm can also be applied for the calculation of the inversion formula
to obtain $(f_l)_l$.
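The even/odd splitting above translates directly into a recursive function. The following Python sketch keeps the sign convention of these notes, exp{+2πi zl/n}, which is the opposite of the forward transform in many numerical libraries; the function name `fft` and the sample weights are assumptions for illustration.

```python
import cmath
import math


def fft(g):
    """Recursive radix-2 transform g_hat_z = sum_l g_l exp(2*pi*i*z*l/n), n = len(g) = 2^d."""
    n = len(g)
    if n == 1:
        return [complex(g[0])]      # length 2^0 = 1: the transform is the weight itself
    if n % 2 != 0:
        raise ValueError("length must be a power of 2")
    m = n // 2
    even = fft(g[0::2])             # transform of (g_{2l})_l,   length m
    odd = fft(g[1::2])              # transform of (g_{2l+1})_l, length m
    out = []
    for z in range(n):
        twiddle = cmath.exp(2j * math.pi * z / n)
        # The half-length transforms are periodic in z with period m.
        out.append(even[z % m] + twiddle * odd[z % m])
    return out


g_hat = fft([0.1, 0.2, 0.3, 0.4, 0.0, 0.0, 0.0, 0.0])
```

Each level of the recursion does O(n) work and there are d = log₂ n levels, which gives the O(n log n) cost that makes the method attractive compared with the naive O(n²) sum.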

Chapter 5
Ruin Theory in Discrete Time

Ruin theory has its origin in the early twentieth century when
Ernst Filip Oskar Lundberg (1876-1965) [62] wrote his famous
Uppsala PhD thesis in 1903. It was later the distinguished
Swedish mathematician and actuary Harald Cramér (1893-1985)
[28, 29] who developed the cornerstones in collective risk
and ruin theory and made many of Lundberg’s ideas mathematically
rigorous. Therefore, the underlying process studied in
ruin theory is called Cramér-Lundberg process. For the collected
work of Cramér we refer to [30]. Since then a vast literature has
developed in this field, important contributions are Feller [42], Bühlmann [19],
Rolski et al. [79], Asmussen-Albrecher [7], Dickson [34], Kaas et al. [57] and many
scientific papers by Hans-Ulrich Gerber and Elias S.W. Shiu. Therefore,
this theory is sometimes also called Gerber-Shiu risk theory, see Kyprianou [59].
Because it is not our intention to write another textbook
on ruin theory we keep this chapter rather short and
only give some key results. In particular, we investigate
the importance of the tail of the claim size distribution.
Our short summary is mainly based on Schmidli [81] and
Rolski et al. [79], for a more comprehensive overview we
refer to the literature.
5.1 Net profit condition

We consider time series of premium payments πt and total claim
amount payments St over several accounting years t ∈ N. In
this set-up we study the question under which circumstances the
premia πt suffice to pay all claims St (instantaneously when they
occur). In order to do this, we define the following discrete time
surplus process $(C_t)_{t \in \mathbb{N}_0}$.
Definition 5.1 (surplus process). Choose t ∈ N. The surplus at time t is given by
\[
C_t = C_t^{(c_0)} = c_0 + \sum_{u=1}^{t} (\pi_u - S_u),
\]
for initial capital C0 = c0 ≥ 0 at time 0 and an i.i.d. sequence $(\pi_t, S_t)_{t \in \mathbb{N}}$ with:
• the premium πt received for accounting year t satisfies πt > 0, P-a.s.;
• the total claim amount St in accounting year t satisfies St ≥ 0, P-a.s.;
• πt and St are independent for all t ∈ N.
The last assumption in the previous definition is not really
necessary but it may simplify calculations.
The surplus process $(C_t)_{t \in \mathbb{N}_0}$ models the equity or the net
asset value process of an insurance company which starts
with (deterministic) initial capital C0 = c0 ≥ 0, collects
every year a premium πt and pays for the corresponding
(non-negative) claim St. At first sight it looks artificial
to model the premium πt stochastically. The reason for this
is that some results in ruin theory are derived under
randomized premia. The ultimate goal is to achieve
\[
C_t \ge 0 \qquad\text{for all } t \ge 0,
\]
otherwise the company cannot fulfill its liabilities at any point in time t ∈ N0. In
the present set-up we look at a homogeneous surplus process (having independent
and stationary increments Xt = πt − St). Moreover, no financial return on assets is
considered. Of course, this is a rather synthetic situation. For the present purpose
it is sufficient because it already highlights crucial issues and it will be refined for
solvency considerations in Chapter 10.
Definition 5.2 (ruin time and finite horizon ruin probability). We define the ruin
time τ of the surplus process $(C_t)_{t \in \mathbb{N}_0}$ by
\[
\tau = \inf\{s \in \mathbb{N}_0;\ C_s < 0\} \le \infty.
\]
The finite horizon ruin probability up to time t ∈ N and for initial capital c0 ≥ 0
is defined by
\[
\psi_t(c_0) = P[\tau \le t \,|\, C_0 = c_0] = P\left[\inf_{s=0,\ldots,t} C_s^{(c_0)} < 0\right].
\]
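Definitions 5.1 and 5.2 can be made concrete with a small simulation. The Python sketch below estimates the finite horizon ruin probability by Monte Carlo; the function name `finite_horizon_ruin_mc`, the constant premium 1.2 and the exponential annual claims with mean 1 are illustrative assumptions and not part of the notes.

```python
import random


def finite_horizon_ruin_mc(c0, t, draw_premium, draw_claim, n_paths=2000, seed=1):
    """Monte Carlo estimate of psi_t(c0) = P[tau <= t | C_0 = c0]."""
    ruined = 0
    for i in range(n_paths):
        rng = random.Random(seed * 1_000_003 + i)   # one reproducible stream per path
        surplus = c0
        for _ in range(t):
            surplus += draw_premium(rng) - draw_claim(rng)   # C_s = C_{s-1} + pi_s - S_s
            if surplus < 0:                                  # ruin time tau reached
                ruined += 1
                break
    return ruined / n_paths


# Hypothetical model: constant premium 1.2, exponential annual claims with mean 1.
premium = lambda rng: 1.2
claim = lambda rng: rng.expovariate(1.0)

psi_5 = finite_horizon_ruin_mc(1.0, 5, premium, claim)
psi_50 = finite_horizon_ruin_mc(1.0, 50, premium, claim)
```

Because every path has its own seeded stream, a longer horizon only extends the same paths, so the estimates are non-decreasing in t and non-increasing in c0 by construction, mirroring the monotonicity of the true ruin probabilities.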
Remark on the notation. Below we use that for c0 = 0 the stochastic process
$(C_t)_{t \in \mathbb{N}_0} = (C_t^{(0)})_{t \in \mathbb{N}_0}$ is a random walk on the probability space (Ω, F, P) starting
at zero. The general surplus process can then be described by
$(C_t^{(c_0)})_{t \in \mathbb{N}_0} = (C_t^{(0)} + c_0)_{t \in \mathbb{N}_0}$ under P and, as stated in Definition 5.2, we can
indicate the initial capital by using the notation P[·|C0 = c0]. In Markov process theory
it has become customary to write the latter as $P_{c_0}[\cdot]$, meaning that $(C_t)_{t \in \mathbb{N}_0}$ under
$P_{c_0}$ is equal in law to $(C_t^{(0)} + c_0)_{t \in \mathbb{N}_0}$ under P.
The event {τ ≤ t} can be written as follows
\[
\{\tau \le t\} = \left\{\inf\{s \in \mathbb{N}_0;\ C_s < 0\} \le t\right\} = \bigcup_{s=0,\ldots,t} \{C_s < 0\},
\]
and therefore τ is a stopping time w.r.t. the filtration generated by $(C_t)_{t \in \mathbb{N}_0}$. To
consider the limiting case t → ∞ we need to extend the positive real line by an
additional point {∞} because τ is not necessarily finite, P-a.s. We use the notation
R+ for the extended positive real line [0, ∞].
(m
The finite horizon ruin probability ψt (c0 ) is non-decreasing in t → ∞ and it is
bounded by 1 (because it is a probability). This immediately implies convergence
for t → ∞ and
we can define the ultimate ruin probability by the following limit

tes
ψ(c0 ) = lim ψt (c0 ) ∈ [0, 1]. (5.1)
t→∞
Lemma 5.3 (ultimate ruin probability). The ultimate ruin probability for initial
no
capital c0 ≥ 0 is given by

ψ(c0 ) = Pc0 [τ < ∞] = Pc0 inf Ct < 0 ∈ [0, 1].
t∈N0
Proof. The second equality is a direct consequence of the definition, note that
NL
[ [ [ [
{τ < ∞} = {τ ≤ t} = {Cs < 0} = {Ct < 0} = inf Ct < 0 .
t∈N0
t∈N0 t∈N0 s=0,...,t t∈N0
For the first equality we use the monotone convergence property of probability measures, note
{τ ≤ t} ⊂ {τ ≤ t + 1},
" #
[
Pc0 [τ < ∞] = Pc0 {τ ≤ t} = lim Pc0 [τ ≤ t] = lim ψt (c0 ) = ψ(c0 ).
t→∞ t→∞
t∈N0
We analyze this ultimate ruin probability in various situations. To this end, we
modify the surplus process $(C_t^{(c_0)})_{t \in \mathbb{N}_0}$. We define Z0 = 0 and for t ∈ N
\[
Z_t = C_t^{(c_0)} - c_0 = C_t^{(0)} = \sum_{u=1}^{t} (\pi_u - S_u) = \sum_{u=1}^{t} X_u, \tag{5.2}
\]
where we define the i.i.d. sequence $(X_t)_{t \in \mathbb{N}}$ by Xt = πt − St. In probability theory
the process $(Z_t)_{t \in \mathbb{N}_0}$ is called general random walk. A main object of interest of
random walk theory is the study of its long time behavior. The key theorem is the
following statement:
Theorem 5.4 (random walk theorem). Assume Xt are i.i.d. with P[X1 = 0] < 1
and E[|X1|] < ∞. The random walk $(Z_t)_{t \in \mathbb{N}_0}$ defined in (5.2) has one of the
following three behaviors
• if E[X1] > 0 then $\lim_{t \to \infty} Z_t = \infty$, P-a.s.;
• if E[X1] < 0 then $\lim_{t \to \infty} Z_t = -\infty$, P-a.s.;
• if E[X1] = 0 then $\liminf_{t \to \infty} Z_t = -\infty$ and $\limsup_{t \to \infty} Z_t = \infty$, P-a.s.
Proof. See, e.g., Proposition 7.2.3 in Resnick [77]. □
From now on we exclude the trivial case P[π1 − S1 = 0] = 1
and we assume that π1 and S1 have finite first moments.
The random walk theorem immediately gives the following
crucial corollary for our context:
Corollary 5.5 (ultimate ruin with probability one). Assume
E[π1] ≤ E[S1]. Then ψ(c0) ≡ 1 for any initial capital
c0 ≥ 0.
Proof. The random walk theorem implies for E[X1] = E[π1] − E[S1] ≤ 0
that $\liminf_{t \to \infty} Z_t = -\infty$, P-a.s., and thus $\liminf_{t \to \infty} C_t = -\infty$, $P_{c_0}$-a.s. (for any c0 ≥ 0). But
this means that we have ultimate ruin with probability 1. □
Hence, for avoiding ultimate ruin with positive probability we need to charge
an (expected) annual premium E[π1] which exceeds the expected annual claim
E[S1]. This gives rise to the following standard assumption.
Assumption 5.6 (net profit condition). The surplus process satisfies the net profit
condition (NPC) given by
\[
E[\pi_1] > E[S_1].
\]
Corollary 5.7. Assume that E[π1] > E[S1], then ψ(0) < 1.
Proof. The assumption E[π1] > E[S1] implies E[X1] > 0 and, thus, $\lim_{t \to \infty} Z_t = \infty$, P-a.s. This
implies that $P[\liminf_{t \to \infty} Z_t = -\infty] = 0$. The latter is equivalent to $P[\inf_{t \in \mathbb{N}_0} Z_t \ge 0] > 0$, see
for instance Proposition 7.2.1 in Resnick [77]. But then the proof follows. □

Moreover, observe that ψ(c0) is non-increasing in c0 (this can be seen path by
path because $C_t^{(c_0)} = Z_t + c_0$ is strictly increasing in the initial capital c0).
This implies that ψ(c0) ≤ ψ(0) < 1 under (NPC).
Our next goal is to find more explicit bounds on the ruin probability as a function
of the initial capital c0 ≥ 0.
5.2 Lundberg bound
We start with a lemma which gives the renewal property of the surplus process.
We define the distribution function F by S1 − π1 ∼ F. Thus, we have
$-X_t \overset{i.i.d.}{\sim} F$.
Note that from $S_1 \sim F_S$, $-\pi_1 \sim F_{-\pi}$ and independence of S1 and π1 it follows
$F = F_S * F_{-\pi}$.
Lemma 5.8. The finite horizon ruin probability and the ultimate ruin probability
satisfy the following equations for t ∈ N0 and initial capital c0 ≥ 0
\begin{align*}
\psi_{t+1}(c_0) &= 1 - F(c_0) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y), \\
\psi(c_0) &= 1 - F(c_0) + \int_{-\infty}^{c_0} \psi(c_0 - y) \, dF(y).
\end{align*}
Proof. We start with the finite horizon ruin probability. Observe that we have the disjoint
decomposition for c0 ≥ 0
\[
\{\tau \le t+1\} = \{\tau \le 1\} \cup \{1 < \tau \le t+1\} = \{S_1 - \pi_1 > c_0\} \cup \{1 < \tau \le t+1\}.
\]
The i.i.d. property of $(\pi_t, S_t)_t$ implies
\begin{align*}
\psi_{t+1}(c_0) = P_{c_0}[\tau \le t+1]
&= P[S_1 - \pi_1 > c_0] + P_{c_0}[1 < \tau \le t+1] \\
&= P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0}[\,1 < \tau \le t+1 \,|\, S_1 - \pi_1 = y] \, dF(y) \\
&= P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0}[\,1 < \tau \le t+1 \,|\, C_1 = c_0 - y] \, dF(y) \\
&= P[S_1 - \pi_1 > c_0] + \int_{-\infty}^{c_0} P_{c_0 - y}[\tau \le t] \, dF(y) \\
&= 1 - F(c_0) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y).
\end{align*}
The ultimate ruin probability statement is a direct consequence of the finite horizon statement.
Using that we have point-wise convergence (5.1) and that ψt is bounded by 1 which is integrable
w.r.t. dF we can apply the dominated convergence theorem to the finite horizon ruin probability
statement which provides the claim for the ultimate ruin probability as t → ∞. □
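When F is discrete, the recursion of Lemma 5.8 can be iterated numerically. As a minimal sketch, take the Bernoulli walk with P[Xt = 1] = p and P[Xt = −1] = 1 − p, so that F puts mass p on −1 and mass 1 − p on +1; the function name `finite_horizon_ruin_bernoulli` is a hypothetical choice. With the convention ψt(−1) = 1 the term 1 − F(c0) is absorbed into the sum.

```python
def finite_horizon_ruin_bernoulli(p, c0, t):
    """psi_t(c0) for the +1/-1 random walk via psi_{s+1}(c) = p psi_s(c+1) + (1-p) psi_s(c-1).

    Integer capitals only. Ruin from capital c needs at least c + 1 steps,
    so psi_s(c) = 0 for s <= c and truncating the grid at c0 + t + 1 is exact.
    """
    top = c0 + t
    psi = [0.0] * (top + 2)                 # psi_0(c) = 0 for all c >= 0
    for _ in range(t):
        new = [0.0] * (top + 2)
        for c in range(top + 1):
            below = psi[c - 1] if c >= 1 else 1.0   # psi_s(-1) = 1: already ruined
            new[c] = p * psi[c + 1] + (1.0 - p) * below
        psi = new
    return psi[c0]


psi_3 = finite_horizon_ruin_bernoulli(0.7, 0, 3)      # (1-p) + p (1-p)^2
psi_200 = finite_horizon_ruin_bernoulli(0.7, 0, 200)  # close to the ultimate ruin probability
```

Iterating in t shows the monotone convergence (5.1) numerically; the same scheme works for any lattice distribution F, at the cost of a longer inner sum.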

Definition 5.9 (Lundberg coefficient, adjustment coefficient). Assume there exists
an R > 0 such that
\[
M_{-X_1}(R) = M_{S_1 - \pi_1}(R) = 1.
\]
Then, this R > 0 is called Lundberg coefficient.
Lemma 5.10 (uniqueness of Lundberg coefficient). Assume that (NPC) holds and
that a Lundberg coefficient R > 0 exists. Then, R is unique.
Figure 5.1: Lundberg coefficient R of the function $r \mapsto M_{-X_1}(r)$.
Proof. Due to the existence of a Lundberg coefficient R > 0 and due to the independence
between S1 and π1 the following function is well-defined for all r ∈ [0, R] and satisfies
\[
r \mapsto h(r) = \log M_{S_1 - \pi_1}(r) = \log\left(M_{S_1}(r) \, M_{-\pi_1}(r)\right)
= \log E\left[e^{r S_1}\right] + \log E\left[e^{-r \pi_1}\right].
\]
Similar to Lemma 1.6 we see that h(r) is a convex function on [0, R] with h(0) = 0 and
$h'(0) = E[S_1 - \pi_1] < 0$ under (NPC). But then there is at most one R > 0 with h(R) = 0. This proves
the uniqueness of the Lundberg coefficient. □
Theorem 5.11 (Lundberg’s exponential bound). Assume that (NPC) holds and that
the Lundberg coefficient R > 0 exists. Then
\[
\psi(c_0) \le e^{-R c_0} \qquad\text{for all } c_0 \ge 0.
\]
Proof. It suffices to prove that $\psi_t(c_0) \le e^{-R c_0}$ for all t ∈ N because ψt(c0) ↑ ψ(c0) for t → ∞.
We apply Lemma 5.8 to the finite horizon ruin probability ψt(c0) to obtain the following proof
by induction.
t = 1: We apply Chebychev’s inequality to obtain for Lundberg coefficient R > 0
\[
\psi_1(c_0) = P_{c_0}[\tau \le 1] = P[S_1 - \pi_1 > c_0] = P\left[e^{R(S_1 - \pi_1)} > e^{R c_0}\right]
\le e^{-R c_0} M_{S_1 - \pi_1}(R) = e^{-R c_0}.
\]
t → t + 1: We assume that the claim holds true for ψt(c0). Then with Lemma 5.8
\begin{align*}
\psi_{t+1}(c_0) &= \int_{c_0}^{\infty} dF(y) + \int_{-\infty}^{c_0} \psi_t(c_0 - y) \, dF(y) \\
&\le \int_{c_0}^{\infty} e^{-R(c_0 - y)} \, dF(y) + \int_{-\infty}^{c_0} e^{-R(c_0 - y)} \, dF(y) \\
&= e^{-R c_0} M_{S_1 - \pi_1}(R) = e^{-R c_0},
\end{align*}
due to the choice of the Lundberg coefficient R > 0. This proves the Lundberg bound. □
Remarks on Lundberg’s exponential bound.
• Under (NPC) and the existence of the Lundberg coefficient
R > 0 we have an exponentially decaying bound on the ultimate
ruin probability as initial capital c0 → ∞, i.e.
\[
\psi(c_0) \le e^{-R c_0}.
\]
Set ε > 0 (small). There exists c0 = c0(R, ε) ≥ 0 such that
ψ(c0) ≤ ε. This means that in the Lundberg case we can
specify a maximal admissible ruin probability ε as tolerance
and then we can choose an appropriate initial capital c0 which
implies that the ultimate ruin probability ψ(c0) is bounded by this tolerance.
• The existence of the Lundberg coefficient R > 0 implies that $M_{S_1}(R) < \infty$
and, using Chebychev’s inequality,
\[
P[S_1 > x] = P\left[e^{R S_1} > e^{R x}\right] \le e^{-R x} M_{S_1}(R) \sim e^{-R x} \qquad\text{as } x \to \infty.
\]
This means that the claims S1 have exponentially decaying tails which are
so-called light tailed claims.
A main question is whether this exponential bound can be improved in the case
where the Lundberg coefficient exists. The difficulty in most cases is that the
ultimate ruin probability cannot be calculated explicitly. An exception is the Bernoulli
case.
Proposition 5.12 (Bernoulli random walk). Assume that Xt are i.i.d. with P[Xt =
1] = p and P[Xt = −1] = 1 − p for given p > 1/2. For all c0 ∈ N we have
\[
\psi(c_0) = \left(\frac{1-p}{p}\right)^{c_0 + 1}.
\]
Note that this model is obtained by assuming πt ≡ 1 and St ∈ {0, 2} with
probability p having a zero claim.
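Proof. We choose a finite interval (−1, a) for a ∈ N and define for fixed c0 ∈ [0, a) ∩ N0 the
stopping time
\[
\tau_a = \inf\{s \in \mathbb{N}_0;\ C_s = c_0 + Z_s \notin (-1, a)\}.
\]
The random walk theorem implies τa < ∞, P-a.s., because the interval (−1, a) is finite. We define
the random variable
\[
Y_t = \left(\frac{1-p}{p}\right)^{c_0 + Z_t} = \left(\frac{1-p}{p}\right)^{C_t}.
\]
It satisfies
\begin{align*}
E[\,Y_t \,|\, Y_{t-1}]
&= E\left[\left(\frac{1-p}{p}\right)^{c_0 + Z_{t-1} + X_t} \,\Big|\, Y_{t-1}\right]
 = Y_{t-1} \, E\left[\left(\frac{1-p}{p}\right)^{X_t} \,\Big|\, Y_{t-1}\right] \\
&= Y_{t-1} \left[(1-p) \left(\frac{1-p}{p}\right)^{-1} + p \, \frac{1-p}{p}\right] = Y_{t-1},
\end{align*}
thus $(Y_t)_{t \ge 0}$ is a martingale. Note that also the stopped process $(Y_{\tau_a \wedge t})_{t \ge 0}$ is a martingale.
Moreover, the latter martingale is bounded and since the stopping time is finite, P-a.s., we can
apply the stopping theorem, see Section 10.10 in Williams [85], which provides
\begin{align*}
\left(\frac{1-p}{p}\right)^{c_0} = E[Y_0] = E[Y_{\tau_a}]
&= \left(\frac{1-p}{p}\right)^{-1} P_{c_0}[C_{\tau_a} = -1] + \left(\frac{1-p}{p}\right)^{a} P_{c_0}[C_{\tau_a} = a] \\
&= \left(\frac{1-p}{p}\right)^{-1} P_{c_0}[C_{\tau_a} = -1] + \left(\frac{1-p}{p}\right)^{a} \left(1 - P_{c_0}[C_{\tau_a} = -1]\right),
\end{align*}
where the last step follows because $(C_t)_{t \in \mathbb{N}_0}$ leaves the interval (−1, a), $P_{c_0}$-a.s., either at −1 or
at a. This provides the identity
\[
P_{c_0}[C_{\tau_a} = -1]
= \frac{\left(\frac{1-p}{p}\right)^{c_0} - \left(\frac{1-p}{p}\right)^{a}}{\left(\frac{1-p}{p}\right)^{-1} - \left(\frac{1-p}{p}\right)^{a}}.
\]
Finally, note that $\{C_{\tau_a} = -1\}$ is increasing in a and thus
\[
\psi(c_0) = P_{c_0}[\tau < \infty] = \lim_{a \to \infty} P_{c_0}[C_{\tau_a} = -1] = \left(\frac{1-p}{p}\right)^{c_0 + 1},
\]
because p > 1 − p. This proves the proposition. □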
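The exit probability obtained in the proof can be sanity-checked numerically. The Python sketch below (the function name `exit_below_probability` is hypothetical) evaluates the closed form and lets one verify that it satisfies the one-step equation u(c) = p u(c + 1) + (1 − p) u(c − 1) with boundary values u(−1) = 1 and u(a) = 0, which characterizes the probability of leaving (−1, a) at −1.

```python
def exit_below_probability(p, c0, a):
    """P_{c0}[C_{tau_a} = -1] = (q^c0 - q^a) / (q^(-1) - q^a) with q = (1 - p) / p."""
    q = (1.0 - p) / p
    return (q ** c0 - q ** a) / (q ** (-1) - q ** a)


p, a = 0.6, 30
u = [exit_below_probability(p, c, a) for c in range(-1, a + 1)]   # u[c + 1] holds u(c)

# Largest violation of the one-step equation on the interior points c = 0, ..., a - 1.
max_residual = max(
    abs(u[c + 1] - (p * u[c + 2] + (1.0 - p) * u[c])) for c in range(0, a)
)

# For a large upper barrier the exit probability approaches the ultimate ruin probability.
psi_2 = exit_below_probability(p, 2, 200)
```

The residual is zero up to rounding, and `psi_2` agrees with ((1 − p)/p)³ to machine precision because the barrier term q^a is negligible for a = 200.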
The Lundberg coefficient is found by the positive solution of
\[
M_{-X_1}(R) = (1-p) e^{R} + p \, e^{-R} = 1, \qquad\text{i.e.}\qquad R = \log\left(\frac{p}{1-p}\right) > 0.
\]
This together with Proposition 5.12 provides in the Bernoulli case
\[
\psi(c_0) = \left(\frac{1-p}{p}\right) e^{-R c_0}.
\]
That is, the Lundberg bound is optimal in the sense that we cannot improve the
exponential order of decay because R is already maximal.
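When no closed form is available, the Lundberg coefficient can be found numerically from the equation of Definition 5.9. The following Python sketch uses bisection and relies on the convexity argument of Lemma 5.10: under (NPC) the moment generating function is below 1 on (0, R) and above 1 beyond R. The function name `lundberg_coefficient` and the bracket `r_hi` are assumptions of this illustration; the Bernoulli case serves as a check.

```python
import math


def lundberg_coefficient(mgf, r_hi, tol=1e-12):
    """Positive root R of mgf(r) = 1 on (0, r_hi], where mgf is r -> M_{-X_1}(r)."""
    if mgf(r_hi) < 1.0:
        raise ValueError("r_hi must satisfy mgf(r_hi) >= 1, i.e. lie at or beyond R")
    lo, hi = 0.0, r_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf(mid) < 1.0:
            lo = mid        # still left of R
        else:
            hi = mid        # at or right of R
    return 0.5 * (lo + hi)


p = 0.6
R = lundberg_coefficient(lambda r: (1.0 - p) * math.exp(r) + p * math.exp(-r), r_hi=5.0)

# Exact Bernoulli ruin probability versus Lundberg bound for a sample capital.
c0 = 4
psi_exact = ((1.0 - p) / p) ** (c0 + 1)
lundberg_bound = math.exp(-R * c0)
```

In the Bernoulli case the exact ruin probability equals the bound times the constant (1 − p)/p, so the two quantities decay at the same exponential rate.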
In most cases we cannot explicitly calculate the ultimate ruin probability ψ(c0).
Exceptions are the Bernoulli random walk of Proposition 5.12 and the Cramér-
Lundberg process in continuous time with exponential claim size distribution, see
(5.3.8) in Rolski et al. [79]. In the other cases where the Lundberg coefficient
exists we apply Lundberg’s exponential bound of Theorem 5.11, or refined versions
thereof. But the following question remains: what can we do if the Lundberg
coefficient does not exist, i.e. if the tail probability of St does not necessarily decay
exponentially?
5.3 Pollaczek-Khinchin formula
We assume (NPC) throughout this section, thus we know
that Ct → ∞, $P_{c_0}$-a.s., and ψ(0) < 1. Under these
assumptions we can study the (local) minima of the surplus
process. This study is done by looking at the ladder heights
that define these minima. We follow Bühlmann [18],
Section 6.2.6, Feller [42], Chapter XII, and Rolski et al. [79],
Chapter 6. We define the stopping times ν0 = 0 and for
k ∈ N
\[
\nu_k = \begin{cases}
\inf\left\{t > \nu_{k-1};\ Z_t < Z_{\nu_{k-1}}\right\} & \text{if } \nu_{k-1} < \infty, \\
\infty & \text{otherwise.}
\end{cases}
\]
νk is called the k-th strong descending ladder epoch, see (6.3.6) in Rolski et al. [79].
These stopping times form an increasing sequence that records the arrivals of new
ladder heights (descending records). For their distribution functions we have under
the i.i.d. properties of the Xt’s (independent and stationary increments)
\begin{align*}
P[\,\nu_k < \infty \,|\, \nu_{k-1} < \infty]
&= P\left[\inf\left\{t > \nu_{k-1};\ Z_t < Z_{\nu_{k-1}}\right\} < \infty \,\Big|\, \nu_{k-1} < \infty\right] \\
&= P\left[\inf\{t > \nu_0;\ Z_t < Z_{\nu_0}\} < \infty \,|\, \nu_0 < \infty\right] \\
&= P\left[\inf\{t > 0;\ Z_t < 0\} < \infty\right] = \psi(0) < 1.
\end{align*}
The probability of a finite ladder epoch is exactly equal to the ultimate ruin
probability ψ(0) with initial capital c0 = 0.
Note that we could have πt − St ≥ 0, P-a.s., which would imply that the ultimate
ruin probability ψ(0) = 0 because the premium collected is bigger than the max-
imal claim. We exclude this situation as it is not interesting for ruin probability
considerations and because the insured will (hopefully) never pay a premium that
exceeds his maximal possible loss. Henceforth, under (NPC) we throughout assume
that ψ(0) ∈ (0, 1) (where the upper bound follows from (NPC)).

We define the random variable
\[
K^+ = \sup\{k \in \mathbb{N}_0;\ \nu_k < \infty\}.
\]
$K^+$ counts the total number of finite ladder epochs, i.e. the total number of strong
descending records. We have (applying the tower property several times)
\[
P\left[K^+ = k\right] = P[\nu_k < \infty,\ \nu_{k+1} = \infty] = \psi(0)^k (1 - \psi(0)),
\]
that is, the total number of finite ladder epochs has a geometric distribution with
success probability 1 − ψ(0) ∈ (0, 1) under (NPC). On the set $\{K^+ = k\}$, k ≥ 1,
we study the ladder heights which are for l ≤ k given by
\[
Z_l^+ = Z_{\nu_{l-1}} - Z_{\nu_l} > 0, \qquad P\text{-a.s.}
\]
The random variable $Z_l^+$ measures by which amount the old local minimum $Z_{\nu_{l-1}}$ is
improved. Due to the i.i.d. property of the Xt’s, we have
\[
P\left[\bigcap_{l=1}^{k} \{Z_l^+ \le x_l\} \,\Big|\, K^+ = k\right]
= \prod_{l=1}^{k} P\left[Z_l^+ \le x_l \,\big|\, \nu_l < \infty\right] = \prod_{l=1}^{k} H(x_l), \tag{5.3}
\]
where the distribution function H neither depends on k nor on l. Thus, the ladder
heights $(Z_l^+)_{l=1,\ldots,k}$ are i.i.d. on the set $\{K^+ = k\}$. Finally, we consider the maximal
height achieved by $(-Z_t)_{t \in \mathbb{N}_0}$, this is the global minimum of the random walk
$(Z_t)_{t \in \mathbb{N}_0}$,
\[
M = \sum_{l=1}^{K^+} Z_l^+ = Z_0 - Z_{\nu_{K^+}} = -Z_{\nu_{K^+}} = \sup_{t \in \mathbb{N}_0} (-Z_t) = -\inf_{t \in \mathbb{N}_0} Z_t.
\]
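This now allows us to study the ultimate ruin probability as follows. Choose initial
capital c0 ≥ 0. The ultimate ruin probability is given by
\begin{align*}
\psi(c_0) &= P_{c_0}\left[\inf_{t \in \mathbb{N}_0} C_t < 0\right]
 = P_{c_0}\left[\inf_{t \in \mathbb{N}_0} C_t - c_0 < -c_0\right]
 = P\left[\inf_{t \in \mathbb{N}_0} Z_t < -c_0\right] \\
&= P[M > c_0] = \sum_{k \in \mathbb{N}_0} P\left[K^+ = k\right] P\left[\sum_{l=1}^{K^+} Z_l^+ > c_0 \,\Big|\, K^+ = k\right] \\
&= (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left(1 - P\left[\sum_{l=1}^{K^+} Z_l^+ \le c_0 \,\Big|\, K^+ = k\right]\right) \\
&= (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left(1 - H^{*k}(c_0)\right).
\end{align*}
This proves Spitzer’s formula, which is Corollary 6.3.1 in Rolski et al. [79]:
Theorem 5.13 (Spitzer’s formula). Assume ψ(0) ∈ (0, 1). Then
\[
\psi(c_0) = (1 - \psi(0)) \sum_{k \in \mathbb{N}} \psi(0)^k \left(1 - H^{*k}(c_0)\right).
\]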

The previous theorem goes back to Frank Ludvig Spitzer
(1926-1992). It gives us another description of the ruin probability
under (NPC). The difficulty is the determination of the ladder
height distribution H defined in (5.3). In special cases this can
be calculated explicitly. We give the compound Poisson case, for
further details we refer to Rolski et al. [79], Section 6.4.3. The
random walk is given by, see (5.2),
\[
Z_t = \sum_{u=1}^{t} (\pi_u - S_u) = \sum_{u=1}^{t} X_u.
\]
In classical (continuous time) ruin theory one starts with a homogeneous Poisson
point process $(\widetilde{N}_t)_{t \in \mathbb{R}_+}$ having constant intensity λv > 0 for the arrival of claims.
The premium income is modeled proportionally to time with constant premium
rate β > 0. The continuous time surplus process is then defined by $\widetilde{C}_0 = c_0 \ge 0$
and for t > 0
\[
\widetilde{C}_t = c_0 + \beta t - \sum_{u=1}^{\widetilde{N}_t} S_u, \tag{5.4}
\]
with i.i.d. claim amounts St satisfying St > 0, P-a.s., and with these claim amounts
being independent of the claims arrival process $(\widetilde{N}_t)_{t \in \mathbb{R}_+}$. This continuous time
surplus process $(\widetilde{C}_t)_{t \in \mathbb{R}_+}$ is called Cramér-Lundberg process. Definition 5.2 of the
ruin time is then extended to continuous time, namely
\[
\widetilde{\tau} = \inf\left\{s \in \mathbb{R}_+;\ \widetilde{C}_s < 0\right\} \le \infty.
\]
Note that ruin can only occur at time points where claims happen, otherwise the
no
continuous time surplus process (Cet )t∈R+ is strictly increasing with constant slope
β > 0 (in fact, the continuous time surplus process is a spectrally negative Lévy
process, see Chapter 1 in Kyprianou [59]). We define the inter-arrival times between
two claims by Wu , u ∈ N. For the homogeneous Poisson point process (N f)
t t∈R+
these inter-arrival times are i.i.d. exponentially distributed with parameter λv > 0.
Therefore, we can rewrite the continuous time surplus process in these claims arrival
times by, define $V_n = \sum_{u=1}^{n} W_u$,
\[ C_n \overset{\text{def.}}{=} \widetilde{C}_{V_n} = c_0 + \beta V_n - \sum_{u=1}^{\widetilde{N}_{V_n}} S_u = c_0 + \sum_{u=1}^{n} (\beta W_u - S_u). \]
This is exactly the set-up of Definition 5.1 with i.i.d. premia πt = βWt , t ∈ N.
The only thing that has changed is time, moving from t ∈ R+ to operational time
n ∈ N0 , and therefore
\[ \mathbb{P}\left[ \widetilde{\tau} < \infty \,\middle|\, \widetilde{C}_0 = c_0 \right] = \mathbb{P}_{c_0}[\tau < \infty] = \psi(c_0), \tag{5.5} \]
with πt = βWt . For (NPC) we require premium rate β > 0 such that
0 < E[X1 ] = βE[W1 ] − E[S1 ] = β/(λv) − E[S1 ] =⇒ β > λvE[S1 ].
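The embedding into operational time can be illustrated numerically. The following Python sketch simulates the random walk C_n = c0 + Σ (βW_u − S_u) at the claims arrival times and estimates the ruin probability over a finite number of claims. It assumes exponentially distributed claim sizes (an assumption made only for this illustration); for this case the classical closed form ψ(c0) = ρ exp{−(1 − ρ)c0/E[S1]} with ρ = λvE[S1]/β, which is not derived in these notes, serves as a benchmark. All function names and default values are hypothetical.

```python
import math
import random


def simulate_ruin_probability(c0, beta, lam_v, mean_claim,
                              n_paths=4000, n_claims=200, seed=1):
    """Monte Carlo estimate of P[C_n < 0 for some n <= n_claims]."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(n_paths):
        surplus = c0
        for _ in range(n_claims):
            w = rng.expovariate(lam_v)             # inter-arrival time W_u
            s = rng.expovariate(1.0 / mean_claim)  # claim size S_u (assumed exponential)
            surplus += beta * w - s                # increment X_u = beta * W_u - S_u
            if surplus < 0:                        # ruin can only occur at claim times
                ruined += 1
                break
    return ruined / n_paths


def ruin_probability_exponential_claims(c0, beta, lam_v, mean_claim):
    """Closed form for exponential claims, used only as a benchmark."""
    rho = lam_v * mean_claim / beta
    if not 0.0 < rho < 1.0:
        raise ValueError("(NPC) requires beta > lam_v * mean_claim")
    return rho * math.exp(-(1.0 - rho) * c0 / mean_claim)
```

The finite horizon n_claims slightly underestimates the ultimate ruin probability; under (NPC) the random walk drifts to +∞, so the truncation error becomes small once the horizon is large compared to c0.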

The exponential distribution has the lack-of-memory property, which means that
the waiting time for a next claim does not depend on how long we have already
been waiting for it. It is this property which allows us to calculate H explicitly
in the Cramér-Lundberg/compound Poisson case (5.4), namely, for x ≥ 0
\[ H(x) = 1 - \mathbb{E}[S_1]^{-1} \int_x^\infty \mathbb{P}[S_1 > y]\, dy. \tag{5.6} \]
We do not prove this statement; it uses the Wiener-Hopf factorization, for details
we refer to Theorem 6.4.4 in Rolski et al. [79].
Note that H is a distribution function on R+ because $\int_0^\infty \mathbb{P}[S_1 > y]\, dy = \mathbb{E}[S_1]$.
This then allows to state the following theorem which gives the Félix Pollaczek
(1892-1981) and Aleksandr Yakovlevich Khinchin (1894-1959) formula.
Theorem 5.14 (Pollaczek-Khinchin formula). Assume we have the compound
Poisson model (5.4) with (NPC) given by ρ = λvE[S1]/β ∈ (0, 1). The ultimate
ruin probability for initial capital c0 ≥ 0 is given by
\[ \psi(c_0) = (1-\rho) \sum_{k\in\mathbb{N}} \rho^k \left(1 - H^{*k}(c_0)\right), \]
with distribution function H given by (5.6).
Proof. See Rolski et al. [79], Theorem 6.4.4. □
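The Pollaczek-Khinchin series can be evaluated directly whenever the convolution powers of H are available. As a minimal sketch we assume exponential claim sizes with mean mu (an assumption for illustration only). Then (5.6) gives H(x) = 1 − exp{−x/mu}, so H^{*k} is an Erlang distribution function and 1 − H^{*k}(c0) is a finite Poisson-type sum. The truncated series is compared with the classical closed form ρ exp{−(1 − ρ)c0/mu}. Function names and the truncation level k_max are hypothetical.

```python
import math


def erlang_survival(k, z):
    """1 - H^{*k}(c0) for exponential H with mean mu, where z = c0 / mu."""
    term, total = 1.0, 0.0
    for j in range(k):
        total += term        # adds z**j / j!
        term *= z / (j + 1)
    return math.exp(-z) * total


def ruin_probability_pk(rho, c0, mu, k_max=500):
    """Truncated Pollaczek-Khinchin series for exponential claims."""
    z = c0 / mu
    return (1.0 - rho) * sum(rho ** k * erlang_survival(k, z)
                             for k in range(1, k_max + 1))


def ruin_probability_closed_form(rho, c0, mu):
    """Classical closed form for exponential claims (benchmark only)."""
    return rho * math.exp(-(1.0 - rho) * c0 / mu)
```

The neglected tail of the series is bounded by ρ^(k_max+1), which explains why a moderate truncation level suffices as long as ρ is not too close to 1.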

Remark. In the compound Poisson model (5.4) with (NPC) one can also prove an
integral equation for the ultimate ruin probability given by
\[ \psi(c_0) = \frac{\lambda v}{\beta} \left( \int_{c_0}^{\infty} (1 - F_S(x))\, dx + \int_0^{c_0} \psi(c_0 - x)(1 - F_S(x))\, dx \right), \]
with distribution function S1 ∼ FS. We do not prove this statement because the
Pollaczek-Khinchin formula is sufficient for our purposes. The exact assumptions
and a proof of this integral equation are, for instance, provided in Rolski et al. [79],
Theorem 5.3.2.
We conclude that for the compound Poisson case (5.4) we have three different
descriptions for the ultimate ruin probability: (i) probabilistic description, (ii)
Pollaczek-Khinchin formula from renewal theory, and (iii) the integral equation.
Depending on the problem one then chooses the most appropriate one, i.e. we can
apply different techniques coming from different fields to solve the questions.

5.4 Subexponential claims

A distribution function F supported on R+ is called subexponential if
\[ \lim_{x\to\infty} \frac{1 - F^{*2}(x)}{1 - F(x)} = 2. \]
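The defining limit can be checked numerically for a concrete heavy tailed example. The sketch below assumes a Pareto distribution with threshold 1 and tail parameter alpha, i.e. 1 − F(x) = x^(−alpha) for x ≥ 1 (chosen for illustration), and uses 1 − F^{*2}(x) = 1 − F(x) + ∫_0^x (1 − F(x − y)) dF(y), which for x ≥ 2 reduces to (x − 1)^(−alpha) plus an integral over [1, x − 1]. The integral is approximated by a composite Simpson rule; all names are hypothetical.

```python
def simpson(f, a, b, n=200_000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def convolution_tail_ratio(x, alpha):
    """(1 - F^{*2}(x)) / (1 - F(x)) for Pareto(threshold 1, alpha), x >= 2."""
    if x < 2.0:
        raise ValueError("this sketch only covers x >= 2")
    # For y in [x - 1, x] the factor 1 - F(x - y) equals 1; together with
    # 1 - F(x) this part contributes (x - 1)^(-alpha) in closed form.
    integral = simpson(
        lambda y: (x - y) ** (-alpha) * alpha * y ** (-alpha - 1.0),
        1.0, x - 1.0)
    return ((x - 1.0) ** (-alpha) + integral) / x ** (-alpha)
```

Evaluating the ratio at increasing x shows the slow approach to the limit 2 from above for this example.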
We start with a technical lemma that gives properties of subexponential distribution
functions and a characterization. We follow the proofs in Rolski et al. [79],
Section 2.5.2.
Lemma 5.15 (subexponential distribution functions). Assume F is subexponential
then the following statements hold true:
1. For all n ∈ N
\[ \lim_{x\to\infty} \frac{1 - F^{*n}(x)}{1 - F(x)} = n. \]
In fact, this is an if and only if statement.
2. For all r > 0
\[ \lim_{x\to\infty} e^{rx} (1 - F(x)) = \infty. \]
3. For all ε > 0 there exists D < ∞ such that for all n ≥ 2 and all x ≥ 0
\[ \frac{1 - F^{*n}(x)}{1 - F(x)} \le D (1+\varepsilon)^n. \]
Proof of Lemma 5.15. We start with the following statement for subexponential distribution
functions F: for all t ∈ R
\[ \lim_{x\to\infty} \frac{1 - F(x-t)}{1 - F(x)} = 1. \tag{5.7} \]
We first prove (5.7). Choose t ≥ 0, then we have for x > t, using monotonicity of F,
\[ \begin{aligned}
\frac{1 - F^{*2}(x)}{1 - F(x)} - 1 &= \frac{F(x) - F^{*2}(x)}{1 - F(x)} = \int_0^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \\
&= \int_0^t \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) + \int_t^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \\
&\ge F(t) + \frac{1 - F(x-t)}{1 - F(x)} \left( F(x) - F(t) \right).
\end{aligned} \]
This implies (the sandwich is for lim inf ≤ lim sup)
\[ 1 \le \lim_{x\to\infty} \frac{1 - F(x-t)}{1 - F(x)} \le \limsup_{x\to\infty}\, (F(x) - F(t))^{-1} \left( \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 - F(t) \right) = 1. \]
For t < 0 note that
\[ \lim_{x\to\infty} \frac{1 - F(x-t)}{1 - F(x)} = \lim_{x\to\infty} \frac{1}{\frac{1 - F(x)}{1 - F(x-t)}} = \lim_{y\to\infty} \frac{1}{\frac{1 - F(y-(-t))}{1 - F(y)}} = 1. \]
This proves (5.7). The second auxiliary statement is
\[ \lim_{x\to\infty} \int_0^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) = 1. \tag{5.8} \]
This is an immediate consequence of
\[ \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 = \frac{F(x) - F^{*2}(x)}{1 - F(x)} = \int_0^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y). \tag{5.9} \]
We now turn to the proof of the first statement of Lemma 5.15. We prove the claim by induction.
For n = 1, 2 the statement holds true by definition. Thus, we assume that it holds true for n ≥ 2
and we would like to prove it for n + 1. Choose ε > 0 then there exists x0 such that for all x > x0
\[ \left| \frac{1 - F^{*n}(x)}{1 - F(x)} - n \right| < \varepsilon. \]
This implies for x > x0
\[ \begin{aligned}
\frac{1 - F^{*(n+1)}(x)}{1 - F(x)} - 1 &= \frac{F(x) - F^{*(n+1)}(x)}{1 - F(x)} = \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)}\, dF(y) \\
&= \int_0^{x-x_0} \frac{1 - F^{*n}(x-y)}{1 - F(x-y)}\, \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) + \int_{x-x_0}^{x} \frac{1 - F^{*n}(x-y)}{1 - F(x)}\, dF(y).
\end{aligned} \]
The second integral is non-negative and using (5.7) we obtain
\[ \begin{aligned}
\limsup_{x\to\infty} \int_{x-x_0}^{x} \frac{1 - F^{*n}(x-y)}{1 - F(x)}\, dF(y) &\le \limsup_{x\to\infty} \int_{x-x_0}^{x} \frac{1}{1 - F(x)}\, dF(y) \\
&= \limsup_{x\to\infty} \frac{F(x) - F(x-x_0)}{1 - F(x)} = -1 + \limsup_{x\to\infty} \frac{1 - F(x-x_0)}{1 - F(x)} = 0.
\end{aligned} \]
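For the first integral we have for x > x0, using the triangle inequality,
\[ \begin{aligned}
\left| \int_0^{x-x_0} \frac{1 - F^{*n}(x-y)}{1 - F(x-y)}\, \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) - n \right|
&\le \left| \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) - 1 \right| n \\
&\quad + \left| \int_0^{x-x_0} \left( \frac{1 - F^{*n}(x-y)}{1 - F(x-y)} - n \right) \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \right| \\
&\le n \left| \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) - 1 \right| + \varepsilon \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y).
\end{aligned} \]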
Finally observe
\[ \int_0^{x-x_0} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) = \int_0^{x} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) - \int_{x-x_0}^{x} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y), \]
the first integral converges to 1, see (5.8), and the second integral converges to 0 because it is
non-negative with
\[ \begin{aligned}
\limsup_{x\to\infty} \int_{x-x_0}^{x} \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) &\le \limsup_{x\to\infty} \int_{x-x_0}^{x} \frac{1}{1 - F(x)}\, dF(y) \\
&= \limsup_{x\to\infty} \frac{F(x) - F(x-x_0)}{1 - F(x)} = -1 + \limsup_{x\to\infty} \frac{1 - F(x-x_0)}{1 - F(x)} = 0.
\end{aligned} \]
This proves that for all ε > 0 there exists x1 ≥ x0 such that for all x > x1 we have
\[ \left| \frac{1 - F^{*(n+1)}(x)}{1 - F(x)} - (n+1) \right| \le 4\varepsilon. \]
This proves the first statement of Lemma 5.15. We now turn to the second statement of the
lemma. Note that for 0 < y < x
\[ e^{rx} (1 - F(x)) = \frac{1 - F(x)}{1 - F(x-y)}\, (1 - F(x-y))\, e^{r(x-y)} e^{ry}. \]
Choose ε ∈ (0, 1) and y > (1/r) log(3/(1 − ε)) > 0. With (5.7) there exists x0 such that for all x > x0
\[ e^{rx} (1 - F(x)) \ge (1-\varepsilon)(1 - F(x-y))\, e^{r(x-y)} e^{ry} > 3\, (1 - F(x-y))\, e^{r(x-y)}. \]
This implies that the function x ↦ e^{rx}(1 − F(x)) grows by at least a factor 3 over every step
of length y, and therefore it has limit +∞. So there remains the proof
of the last statement of Lemma 5.15. Define $\alpha_n = \sup_{x\ge 0} (1 - F^{*n}(x))/(1 - F(x))$. Note that
the first assertion of the lemma implies that αn < ∞. Moreover, we have
$1 - F^{*(n+1)}(x) = 1 - F * F^{*n}(x) = 1 - F(x) + F * (1 - F^{*n})(x)$. This implies for any x0 ∈ (0, ∞)
\[ \begin{aligned}
\alpha_{n+1} &= \sup_{x\ge 0} \frac{1 - F(x) + F * (1 - F^{*n})(x)}{1 - F(x)} \\
&\le 1 + \sup_{0\le x\le x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)}\, dF(y) + \sup_{x> x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x)}\, dF(y) \\
&\le 1 + \frac{1}{1 - F(x_0)} + \sup_{x> x_0} \int_0^x \frac{1 - F^{*n}(x-y)}{1 - F(x-y)}\, \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \\
&\le 1 + \frac{1}{1 - F(x_0)} + \alpha_n \sup_{x> x_0} \int_0^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \\
&= 1 + \frac{1}{1 - F(x_0)} + \alpha_n \sup_{x> x_0} \left( \frac{1 - F^{*2}(x)}{1 - F(x)} - 1 \right),
\end{aligned} \]
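where we have used (5.9) in the last step. The subexponentiality of F implies that for all ε > 0
there exists x0 such that
\[ \alpha_{n+1} \le 1 + \frac{1}{1 - F(x_0)} + \alpha_n (1+\varepsilon). \]
Iteration provides
\[ \begin{aligned}
\alpha_{n+1} &\le 1 + \frac{1}{1 - F(x_0)} + \left( 1 + \frac{1}{1 - F(x_0)} + \alpha_{n-1} (1+\varepsilon) \right) (1+\varepsilon) \\
&\le \left( 1 + \frac{1}{1 - F(x_0)} \right) \sum_{k=0}^{n-1} (1+\varepsilon)^k + (1+\varepsilon)^n \\
&\le \left( 1 + \frac{1}{1 - F(x_0)} \right) \sum_{k=0}^{n} (1+\varepsilon)^k \le \left( 1 + \frac{1}{1 - F(x_0)} \right) \frac{1}{\varepsilon}\, (1+\varepsilon)^{n+1},
\end{aligned} \]
which proves the claim for $D = (1 + (1 - F(x_0))^{-1})/\varepsilon \in (0, \infty)$. This proves Lemma 5.15. □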
Statements 1. and 3. of Lemma 5.15 will be important in the analysis of the
Pollaczek-Khinchin formula. Statement 2. of Lemma 5.15 says that for subexponential
distributions the moment generating function for r > 0 does not exist,
choose X ∼ F with F subexponential
\[ \mathbb{E}[e^{rX}] = \int_0^\infty \mathbb{P}\left[ e^{rX} > y \right] dy = \int_0^\infty \mathbb{P}\left[ X > \log(y)/r \right] dy = r \int_{-\infty}^{\infty} e^{rx}\, \mathbb{P}[X > x]\, dx = \infty. \]
We conclude that for any r > 0 the moment generating function of subexponential
distributions does not exist, and therefore there is no Lundberg coefficient in this
case. We call such subexponential distributions heavy tailed distributions.
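Statement 2. of Lemma 5.15 is easy to observe numerically. The sketch below evaluates the exponentially tilted tail e^{rx}(1 − F(x)) for a Pareto survival function (which is subexponential, see Lemma 5.16 below) and, for contrast, for an exponential survival function, which is light tailed and not subexponential. Function names and parameter values are hypothetical and only serve the illustration.

```python
import math


def pareto_survival(x, theta, alpha):
    """1 - F(x) for a Pareto distribution with threshold theta and tail alpha."""
    return 1.0 if x < theta else (x / theta) ** (-alpha)


def exponential_survival(x, rate):
    """1 - F(x) for an exponential distribution (light tailed comparison)."""
    return math.exp(-rate * x) if x > 0.0 else 1.0


def tilted_tail(r, x, survival):
    """e^{rx} (1 - F(x)); diverges for every r > 0 if F is subexponential."""
    return math.exp(r * x) * survival(x)
```

For the Pareto tail the tilted tail explodes for every r > 0, however small, whereas for the exponential tail it stays bounded as long as r does not exceed the rate parameter.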

Theorem 2.5.5 in Rolski et al. [79] gives an important sufficient condition for having
a subexponential distribution.
Lemma 5.16 (regularly varying survival function). Assume that F is supported
on R+ and has regularly varying survival function at infinity with index α > 0,
i.e. for all y > 0
\[ \lim_{x\to\infty} \frac{1 - F(xy)}{1 - F(x)} = y^{-\alpha}, \]
then F is subexponential.
Proof. Assume that X1 and X2 are two i.i.d. random variables with regularly varying survival
functions with parameter α > 0. Note that we have for all ε ∈ (0, 1)
\[ \{X_1 + X_2 > x\} \subset \{X_1 > (1-\varepsilon)x\} \cup \{X_2 > (1-\varepsilon)x\} \cup \{X_1 > \varepsilon x,\ X_2 > \varepsilon x\}. \]
The i.i.d. property implies
\[ \mathbb{P}[X_1 + X_2 > x] \le 2\, \mathbb{P}[X_1 > (1-\varepsilon)x] + \mathbb{P}[X_1 > \varepsilon x]^2. \]
Thus, we have
\[ \limsup_{x\to\infty} \frac{1 - F^{*2}(x)}{1 - F(x)} \le \inf_{\varepsilon\in(0,1)} \limsup_{x\to\infty} \frac{2(1 - F((1-\varepsilon)x)) + (1 - F(\varepsilon x))^2}{1 - F(x)} = \inf_{\varepsilon\in(0,1)} 2(1-\varepsilon)^{-\alpha} = 2. \]
On the other hand we have for any positively supported distribution function F, see also (5.9),
\[ \frac{1 - F^{*2}(x)}{1 - F(x)} = 1 + \int_0^x \frac{1 - F(x-y)}{1 - F(x)}\, dF(y) \ge 1 + \int_0^x dF(y) = 1 + F(x), \]
since by assumption F(0) = 0. This immediately implies that
\[ \liminf_{x\to\infty} \frac{1 - F^{*2}(x)}{1 - F(x)} \ge 2. \]
Note that the lower bound holds true for any distribution function supported on R+. □
Remarks 5.17. Lemma 5.16 gives the connection to classical extreme value theory.
In extreme value theory one distinguishes three different domains of attraction for
tail behavior, see Section 3.3 in Embrechts et al. [36]: (i) Weibull case, which are
distribution functions with finite right endpoint of their support; (ii) Gumbel case,
which are light tailed to moderately heavy tailed distribution functions; (iii) Fréchet
case, which are heavy tailed distribution functions. The Fréchet case is exactly
characterized by regularly varying survival functions with (tail) index α > 0, see
Theorem 3.3.7 in Embrechts et al. [36]. This index has already been met in Section
3.2, see formula (3.3). Lemma 5.16 now says that every distribution function that
belongs to the Fréchet domain of attraction is also subexponential. However, the
class of subexponential distribution functions is larger than the class of distribution
functions with regularly varying survival functions, see Example 1.4.3 in Embrechts
et al. [36].

We apply the Pollaczek-Khinchin formula, see Theorem 5.14, to obtain the following
result in the subexponential case.
Theorem 5.18 (subexponential case, Embrechts-Veraverbeke theorem). Assume
we have the compound Poisson model (5.4) with (NPC) given by ρ = λvE[S1]/β ∈
(0, 1). Moreover, we assume that the ladder height distribution function H given
by (5.6) is subexponential. Then we have
\[ \lim_{c_0\to\infty} \frac{\psi(c_0)}{1 - H(c_0)} = \frac{\rho}{1-\rho}. \]
Proof. Our aim is to apply Lemma 5.15 to the Pollaczek-Khinchin formula. The latter provides
\[ \lim_{c_0\to\infty} \frac{\psi(c_0)}{1 - H(c_0)} = (1-\rho) \lim_{c_0\to\infty} \sum_{k\in\mathbb{N}} \rho^k\, \frac{1 - H^{*k}(c_0)}{1 - H(c_0)}. \]
Our aim is to exchange the limit c0 → ∞ and the infinite summation.
Note that Lemma 5.15 provides point-wise convergence of the last terms
to k as c0 → ∞, therefore our aim is to find a uniform integrable upper
bound so that we can apply the dominated convergence theorem. To this
end we choose ε ∈ (0, 1/ρ − 1). Then Lemma 5.15 implies that there exists D < ∞ such that for
all k ≥ 1 and c0 ≥ 0 we have a uniform integrable upper bound given by
\[ (1-\rho) \sum_{k\in\mathbb{N}} \rho^k\, \frac{1 - H^{*k}(c_0)}{1 - H(c_0)} \le (1-\rho) \sum_{k\in\mathbb{N}} \rho^k D (1+\varepsilon)^k = (1-\rho) D \sum_{k\in\mathbb{N}} (\rho(1+\varepsilon))^k < \infty, \]
because ρ(1 + ε) < 1. Thus, we have found a uniform integrable upper bound and this allows to
exchange the two limits. This provides
\[ \lim_{c_0\to\infty} \frac{\psi(c_0)}{1 - H(c_0)} = (1-\rho) \sum_{k\in\mathbb{N}} \rho^k \lim_{c_0\to\infty} \frac{1 - H^{*k}(c_0)}{1 - H(c_0)} = (1-\rho) \sum_{k\in\mathbb{N}} \rho^k k. \]
The last term is the expected value of the geometric distribution which is given by ρ/(1 − ρ).
This proves the theorem. □
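For completeness, the value of the last sum can be verified by differentiating the geometric series term by term, which is permitted for ρ ∈ (0, 1); this is a standard calculation that the proof uses without writing it out:

```latex
\[
(1-\rho) \sum_{k\in\mathbb{N}} k \rho^k
  = (1-\rho)\, \rho\, \frac{d}{d\rho} \sum_{k\ge 0} \rho^k
  = (1-\rho)\, \rho\, \frac{d}{d\rho} \frac{1}{1-\rho}
  = (1-\rho)\, \frac{\rho}{(1-\rho)^2}
  = \frac{\rho}{1-\rho}.
\]
```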
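Example 5.19 (Pareto claim sizes). We assume that we are in the compound
Poisson model of Theorem 5.14. The claim size distribution of S1 is given by a
Pareto distribution with threshold θ > 0 and tail parameter α > 1. Under these
assumptions we calculate the ladder height distribution H. For x ≥ θ
\[ \begin{aligned}
H(x) &= 1 - \mathbb{E}[S_1]^{-1} \int_x^\infty \mathbb{P}[S_1 > y]\, dy \\
&= 1 - \frac{\alpha-1}{\alpha}\, \theta^{-1} \int_x^\infty \left( \frac{y}{\theta} \right)^{-\alpha} dy = 1 - \frac{1}{\alpha} \left( \frac{x}{\theta} \right)^{-\alpha+1}.
\end{aligned} \]
This implies that H has a regularly varying survival function with tail index α − 1 >
0. Therefore, Lemma 5.16 implies that H is subexponential and we can apply
Theorem 5.18 to obtain
\[ \lim_{c_0\to\infty} \frac{\psi(c_0)}{\frac{1}{\alpha} \left( \frac{c_0}{\theta} \right)^{-\alpha+1}} = \frac{\lambda v \theta \alpha}{(\alpha-1)\beta - \lambda v \theta \alpha}. \]
That is, we have found in the Pareto (subexponential) case for α > 1
\[ \psi(c_0) \sim \frac{\lambda v\, \theta^{\alpha}}{(\alpha-1)\beta - \lambda v \theta \alpha}\; c_0^{-\alpha+1} \qquad \text{as } c_0 \to \infty, \]
where the constant follows from multiplying the limit above by $\frac{1}{\alpha}\theta^{\alpha-1}$.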
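The asymptotic formula of Example 5.19 is straightforward to evaluate. The following sketch computes ρ from the Pareto mean E[S1] = αθ/(α − 1), checks (NPC), and returns the approximation ρ/(1 − ρ) · (1 − H(c0)) of Theorem 5.18. It is only an asymptotic approximation for large c0, and the function and argument names are hypothetical.

```python
def pareto_ruin_asymptotic(c0, lam_v, beta, theta, alpha):
    """Asymptotic approximation rho / (1 - rho) * (1 - H(c0)) for Pareto claims."""
    if alpha <= 1.0:
        raise ValueError("finite mean requires alpha > 1")
    if c0 < theta:
        raise ValueError("formula for H is stated for c0 >= theta")
    mean_claim = alpha * theta / (alpha - 1.0)      # E[S_1]
    rho = lam_v * mean_claim / beta                 # (NPC) requires rho in (0, 1)
    if not 0.0 < rho < 1.0:
        raise ValueError("(NPC) violated: need beta > lam_v * E[S_1]")
    ladder_tail = (c0 / theta) ** (-alpha + 1.0) / alpha   # 1 - H(c0)
    return rho / (1.0 - rho) * ladder_tail
```

Doubling the initial capital c0 reduces this approximation only by the factor 2^(−α+1), in contrast to the exponential decay in the light tailed case.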

Conclusions. We conclude that the heavy tailed case may lead to a much more
dangerous ruin behavior. In Example 5.19 we obtain for the asymptotic ruin
behavior a power law decay as the initial capital goes to infinity, whereas in the
light tailed case we obtain the exponentially decaying Lundberg bound, see
Theorem 5.11. This is an impressive example that heavy tailed claims require
careful risk management practices. For instance, an excess-of-loss reinsurance
cover with retention level M > θ would completely change the ruin behavior of a
company facing Pareto distributed claims St, t ∈ N. Also the triggers of ruin are
very different in the two cases. In the light tailed case it is the big mass of claims
that causes ruin, whereas in the heavy tailed case it is the single large claim event
that causes ruin.
The most general version of the asymptotic ruin behavior in the subexponential
case goes back to Paul Embrechts and Noël Veraverbeke [38]. However,
an important missing piece in the argumentation was provided by Charles M.
Goldie. The Pareto case has previously been solved by Bengt von Bahr [8].

Chapter 6
Premium Calculation Principles
In Assumption 5.6 and the random walk Theorem 5.4 we have seen that we need
to charge an expected premium that exceeds the expected claim amount E[St],
otherwise we have ultimate ruin, P-a.s. This is the so-called (NPC). For the present
chapter we assume that the premium πt is deterministic, then (NPC) reads as
πt > E[St]. For simplicity (because we consider a fixed accounting year in this
chapter) we drop the time index t and then (NPC) is given by
π > E[S], (6.1)
with S ∼ FS. In this chapter
• we justify why the insurance company can charge a premium that exceeds the
average claim amount E[S], i.e. why the insured is willing to pay a premium
π that exceeds his expected claim E[S]; and
• we give different pricing principles to calculate premium loadings π − E[S] > 0.

Simple solution (expected value principle). Choose a fixed constant α > 0 and
charge (to everyone) the premium
π = (1 + α) E[S]. (6.2)
Are we happy with this solution?
Example 6.1 (expected value principle). We consider two different portfolios S1

and S2 with the same mean E[S1 ] = E[S2 ]. Under the previous simple solution
both insured pay the same insurance premium
π = (1 + α) E[S1 ] = (1 + α) E[S2 ] > E[S2 ] = E[S1 ].
We make an explicit distributional example.
• Assume S1 ∼ Γ(γ, c) with mean E[S1 ] = γ/c, and
• S2 ≡ γ/c is a constant.
Observe that there is absolutely no uncertainty in portfolio S2 , that is, we can

perfectly predict claim S2 (and, of course, also the insured can perfectly predict
his claim). But then it is natural that the insured is not willing to pay a premium
that exceeds his (maximal possible) loss S2 = γ/c, i.e. (hopefully) he refuses to pay
insurance premium π > E[S2 ] = S2 .
Conclusion. The premium loading should be risk-based! That is, the loading
π − E[S] > 0 should reflect the risk of fluctuations of S around its mean E[S].

6.1 Simple risk-based principles

The first notion of risk is always described by the variance of a random variable.
Therefore, we assume in this section that the second moment of S exists.
Variance loading principle. Choose a fixed constant α > 0 and define the
insurance premium π by
π = E[S] + αVar(S).
Revisiting Example 6.1 we obtain insurance premia with variance loading given by
\[ \pi_1 = \mathbb{E}[S_1] + \alpha \mathrm{Var}(S_1) = \frac{\gamma}{c} + \alpha \frac{\gamma}{c^2} > \frac{\gamma}{c} = \mathbb{E}[S_2] + \alpha \mathrm{Var}(S_2) = \pi_2. \]
That is, for the risky position S1 we now charge a premium that strictly exceeds the
expected value and the loading is zero for deterministic claims S2. An unpleasant
feature of the variance loading principle is that the calibration is difficult because
the variance is not handy for this purpose and, related to this, the principle is not
invariant under changes of currencies. Assume that rfx > 0 is the (deterministic)
exchange rate between two different currencies. Assume rfx ≠ 1 then we obtain
\[ \pi_{\rm fx} = \mathbb{E}[r_{\rm fx} S] + \alpha \mathrm{Var}(r_{\rm fx} S) = r_{\rm fx} \mathbb{E}[S] + r_{\rm fx}^2\, \alpha \mathrm{Var}(S) \neq r_{\rm fx}\, \pi. \]
This non-linearity of the variance implies that the premium cannot easily be scaled
with exchange rates and inflation indexes. Therefore, one often studies other
versions of variance principles which brings us to the next principle.
Standard deviation loading principle. Choose a fixed constant α > 0 and
define the insurance premium π by
\[ \pi = \mathbb{E}[S] + \alpha \mathrm{Var}(S)^{1/2} = \mathbb{E}[S] \left( 1 + \alpha \mathrm{Vco}(S) \right), \]
where the last equality requires that E[S] ≠ 0.
This principle gives an explicit meaning to the loading constant in (6.2), namely it
says that the loading constant should be proportional to the coefficient of variation
of S. If we revisit Example 6.1 we obtain premia
\[ \pi_1 = \mathbb{E}[S_1] + \alpha \mathrm{Var}(S_1)^{1/2} = \frac{\gamma}{c} + \alpha \frac{\gamma^{1/2}}{c} > \frac{\gamma}{c} = \mathbb{E}[S_2] + \alpha \mathrm{Var}(S_2)^{1/2} = \pi_2. \]
For the risky position S1 we charge a premium that strictly exceeds the expected
value and the loading is zero for deterministic claims S2. The standard deviation
loading principle is much better understood than the variance loading principle
because often practitioners have a good feeling for appropriate ranges of the
coefficient of variation. For instance, they know that for certain lines of business
it should be around 10%. Moreover, this principle is invariant under changes of
currencies. Assume that rfx > 0 is again the (deterministic) exchange rate between
two different currencies. Then we obtain the identity
\[ \pi_{\rm fx} = \mathbb{E}[r_{\rm fx} S] + \alpha \mathrm{Var}(r_{\rm fx} S)^{1/2} = r_{\rm fx} \mathbb{E}[S] + r_{\rm fx}\, \alpha \mathrm{Var}(S)^{1/2} = r_{\rm fx}\, \pi. \]
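The three simple principles and their behavior under a change of currency can be compared in a few lines. The sketch below uses the gamma portfolio S1 ∼ Γ(γ, c) of Example 6.1 with mean γ/c and variance γ/c²; the function names and the numerical values chosen in the usage are hypothetical.

```python
import math


def gamma_moments(shape, rate):
    """Mean and variance of a Gamma(shape, rate) claim, i.e. gamma/c and gamma/c^2."""
    return shape / rate, shape / rate ** 2


def expected_value_premium(mean, alpha):
    """Expected value principle (6.2)."""
    return (1.0 + alpha) * mean


def variance_premium(mean, variance, alpha):
    """Variance loading principle."""
    return mean + alpha * variance


def std_dev_premium(mean, variance, alpha):
    """Standard deviation loading principle."""
    return mean + alpha * math.sqrt(variance)


def convert_moments(mean, variance, r_fx):
    """Moments of r_fx * S for a deterministic exchange rate r_fx > 0."""
    return r_fx * mean, r_fx ** 2 * variance
```

For a deterministic claim (variance zero) both risk-based principles return the pure mean, whereas the expected value principle still charges a loading; converting the moments with r_fx ≠ 1 rescales the standard deviation premium by r_fx but not the variance premium.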
The previous examples consider rather simple premium loading principles and there
are more principles of this type such as the modified variance principle. In the next
section we describe more sophisticated principles which are motivated by economic
behavior of financial agents and give risk management perspectives. These more
advanced principles include:
• utility theory pricing principles
• Esscher premium principle
• probability distortion pricing principles
• cost-of-capital principles based on risk measures
• deflator pricing principles

6.2 Advanced premium calculation principles

6.2.1 Utility theory pricing principles
Utility theory aims at modeling the “happiness index” of financial agents, i.e. for
a financial agent holding a position X, we try to find an index that quantifies his
happiness generated by this position X.
Utility theory can be introduced in a rather general framework using preference
orderings. If this system of preference orderings is sufficiently regular then there
exists a so-called numerical representation for the preference ordering, see
Föllmer-Schied [44]. We always assume that there exists a John von Neumann
(1903-1957) and Oskar Morgenstern (1902-1977) representation for the preference
ordering on the set
\[ \mathcal{X} \subset L^1(\Omega, \mathcal{F}, \mathbb{P}). \]
The set $\mathcal{X}$ describes the (risky) positions $X \in \mathcal{X}$ of interest. In this set-up X
reflects gains. Thus, we restrict ourselves to a set $\mathcal{X}$ of available risky positions X
and among these positions we would like to choose the ones which make us as
happy as possible. The von Neumann-Morgenstern representation equips us
with a utility function u which is a function that has the following properties:
u : I → R
is strictly increasing on a non-empty interval I ⊂ R, where we assume that X ∈ I,
P-a.s., for all $X \in \mathcal{X}$, see Figure 6.1 for two examples.
Figure 6.1: lhs: exponential utility function with α = 0.05 and I = R, see (6.6);
rhs: power utility function with γ ∈ {0.5, 1, 1.5} and I = R+ , see (6.7).
In general, we are interested in risk-averse utility functions u : I → R, which means
that we additionally assume that u is strictly concave on I, see Figure 6.1. This
risk-averse utility function now allows us to define a preference ordering on the set
of risky positions in $\mathcal{X}$.
Definition 6.2. Assume u : I → R is strictly increasing and strictly concave on
the non-empty interval I ⊂ R (with X ∈ I, P-a.s., for all $X \in \mathcal{X}$). Then we prefer
the position $X \in \mathcal{X}$ over the position $Y \in \mathcal{X}$, write $X \succeq Y$, if
E[u(X)] ≥ E[u(Y)].
Colloquially speaking this means that holding position X makes us at least as
happy as holding position Y, therefore we prefer position X over position Y.
For $u \in C^2$ (twice differentiable) strictly increasing and strictly concave means
u′ > 0 and u″ < 0 on I, respectively.
Strict increasing property. Strictly increasing implies that if X ≥ Y, P-a.s.,
and X > Y with positive P-probability we have
E[u(X)] > E[u(Y)], (6.3)
i.e. we strictly prefer X over Y. In this context, X has always the interpretation of
a gain and if the gain of position X dominates the gain of position Y (in the above
sense) we have strict preference $X \succ Y$. We conclude: u introduces a preference
ordering on $\mathcal{X}$ where positive outcomes of $X \in \mathcal{X}$ describe gains and negative
outcomes losses.
Strict concavity property. Strict concavity implies that we can apply Jensen’s
inequality which provides
E[u(X)] ≤ u(E[X]), (6.4)
and if $X \in \mathcal{X}$ is non-deterministic we even have a strict inequality in (6.4). Thus,
for non-deterministic positions, strict concavity of u implies $\mathbb{E}[X] \succ X$. The
interpretation of this preference ordering is that under risk-aversion we try to avoid
uncertainties which results in the fact that we always prefer the mean value E[X]
over random outcomes X.
This latter property is exactly the argument why policyholders are willing to pay
an insurance premium π that exceeds their average claim amount E[Y], and hence
finance (NPC). Assume that a policyholder has (deterministic) initial wealth c0
and he faces a risk that may reduce his wealth by (the random amount) Y. Hence,
he holds a risky position c0 − Y and his happiness index of this position is given
by E[u(c0 − Y)] if u describes the (risk-averse) utility function of this policyholder.
The strict concavity and increasing properties now imply the following preference
ordering
E[u(c0 − Y)] < u(c0 − E[Y]).
The left-hand side describes the present happiness and the right-hand side describes
the happiness that he would achieve if he could exchange Y by E[Y]. Therefore,
any deterministic premium π > E[Y] such that
E[u(c0 − Y)] < u(c0 − π) < u(c0 − E[Y]),
would make him happier than his current position c0 − Y. Thus, strict concavity
and increasing property of u imply that he is willing to pay any premium π in
the (non-empty) interval
\[ \pi \in \left( \mathbb{E}[Y],\ c_0 - u^{-1}\left( \mathbb{E}[u(c_0 - Y)] \right) \right), \tag{6.5} \]
to improve his happiness position. The lower bound of this interval is the (NPC)
and the upper bound is the maximal price π that the policyholder will just tolerate
according to his risk-averse utility function u. The less risk-averse he is the narrower
the interval will get. The extreme case of risk-neutrality, which corresponds to the
linear function u(x) = x, will just provide that the upper bound is equal to the
lower bound in (6.5), and no insurance is necessary.
The most popular utility functions are, see also Figure 6.1:
• exponential utility function, constant absolute risk-aversion (CARA) utility
function: for α > 0 (defined on I = R)
\[ u(x) = 1 - \frac{1}{\alpha} \exp\{-\alpha x\}. \tag{6.6} \]
• power utility function, constant relative risk-aversion (CRRA) utility function,
isoelastic utility function (defined on I = R+)
\[ u(x) = \begin{cases} \dfrac{x^{1-\gamma}}{1-\gamma} & \text{for } \gamma \neq 1, \\[2mm] \log x & \text{for } \gamma = 1. \end{cases} \tag{6.7} \]
Example 6.3 (exponential utility function). Assume that the policyholder has
exponential utility function (6.6), he has initial wealth c0 and he faces a risky
position $Y \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ with Var(Y) > 0 and Y ≥ 0, P-a.s. This implies that
the expected claim is given by E[Y] > 0. The exponential utility function has the
following properties
\[ u'(x) = \exp\{-\alpha x\} > 0 \quad \text{and} \quad u''(x) = -\alpha \exp\{-\alpha x\} < 0, \]
therefore it is strictly increasing and concave on R, see Figure 6.1 (lhs). The inverse
is given by
\[ u^{-1}(y) = -\frac{1}{\alpha} \log\left( \alpha (1-y) \right). \]
This implies that the possible premium π is given in the non-empty interval, see
(6.5),
\[ \pi \in \left( \mathbb{E}[Y],\ \frac{1}{\alpha} \log \mathbb{E}\left[ \exp\{\alpha Y\} \right] \right). \]
The important observation in this example is that the price tolerance in π does
not depend on the initial wealth c0 of the policyholder. We will see that this
property uniquely holds true for the exponential utility function, and we may ask
the question how realistic this property is in real world behavior?
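The independence of the upper bound in (6.5) from the initial wealth c0 can be checked numerically. The sketch below evaluates c0 − u^{-1}(E[u(c0 − Y)]) for the exponential utility function (6.6) and a discrete claim Y; the discrete distribution is an assumption made for illustration, and all names are hypothetical.

```python
import math


def exp_utility(x, alpha):
    """Exponential (CARA) utility function (6.6)."""
    return 1.0 - math.exp(-alpha * x) / alpha


def exp_utility_inverse(y, alpha):
    """Inverse of (6.6), defined for y < 1."""
    return -math.log(alpha * (1.0 - y)) / alpha


def max_tolerated_premium(c0, alpha, outcomes, probs):
    """Upper bound of (6.5) for a discrete claim Y with given outcomes and probabilities."""
    expected_utility = sum(p * exp_utility(c0 - y, alpha)
                           for y, p in zip(outcomes, probs))
    return c0 - exp_utility_inverse(expected_utility, alpha)


def entropic_bound(alpha, outcomes, probs):
    """Closed form (1 / alpha) * log E[exp(alpha * Y)] of Example 6.3."""
    return math.log(sum(p * math.exp(alpha * y)
                        for y, p in zip(outcomes, probs))) / alpha
```

Calling max_tolerated_premium with different values of c0 returns the same number up to rounding, and this number agrees with entropic_bound and exceeds E[Y].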
Example 6.4 (power utility function). Assume that the policyholder has power
utility function (6.7), he has initial wealth c0 > 1 and he faces a risky position Y ∼
Bernoulli(p = 1/2). This implies that the expected claim is given by E[Y] = 1/2.
The power utility function has the following properties
\[ u'(x) = x^{-\gamma} > 0 \quad \text{and} \quad u''(x) = -\gamma x^{-\gamma-1} < 0, \]
therefore it is strictly increasing and concave on I = R+, see Figure 6.1 (rhs). For
our example we choose γ = 1. In this case the inverse of the utility function is
given by
\[ u^{-1}(y) = \exp\{y\}. \]
We calculate the upper bound in (6.5),
\[ \begin{aligned}
c_0 - u^{-1}\left( \mathbb{E}[u(c_0 - Y)] \right) &= c_0 - \exp\left\{ \mathbb{E}[\log(c_0 - Y)] \right\} \\
&= c_0 - \exp\left\{ \frac{1}{2} \log(c_0) + \frac{1}{2} \log(c_0 - 1) \right\} \\
&= c_0 - \sqrt{c_0 (c_0 - 1)} \overset{\text{def.}}{=} b(c_0).
\end{aligned} \]
This implies that the possible premium π is given in the non-empty interval, see
(6.5),
\[ \pi \in \left( \frac{1}{2},\ c_0 - \sqrt{c_0 (c_0 - 1)} \right). \]
The important observation in this example is that the price tolerance in π depends
on the initial wealth c0 > 1 of the policyholder.
The function b is defined on (1, ∞) and we have
\[ \lim_{c_0\to 1} b(c_0) = 1 \quad \text{and} \quad \lim_{c_0\to\infty} b(c_0) = 1/2, \]
the second can be seen by applying l’Hôpital’s rule to
$b(c_0) = c_0 \left( 1 - \sqrt{1 - 1/c_0} \right)$. That is, if the policyholder is
very poor, i.e. c0 close to 1, he is willing to pay almost the maximal possible claim
size 1, on the other hand if he is very rich, i.e. c0 very large, he is only willing to
pay for the average claim amount E[Y] = 1/2 because basically he can do the risk
bearing himself.
[Figure: function b(c0).]
The derivative of b is given by
\[ \begin{aligned}
b'(c_0) &= 1 - \frac{1}{2}\, \frac{2c_0 - 1}{\sqrt{c_0 (c_0 - 1)}} = \frac{\sqrt{c_0 (c_0 - 1)} - (c_0 - 1/2)}{\sqrt{c_0 (c_0 - 1)}} \\
&= \frac{\sqrt{c_0^2 - c_0} - \sqrt{c_0^2 - c_0 + 1/4}}{\sqrt{c_0 (c_0 - 1)}} < 0.
\end{aligned} \]
This shows that we have strict monotonicity in the initial capital c0 > 1, i.e. the
richer the policyholder the narrower the price tolerance interval (6.5), see also
Example 6.14, below.
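The behavior of the upper bound b(c0) in Example 6.4 can be tabulated with a short sketch. It evaluates b both by the closed form c0 − (c0(c0 − 1))^{1/2} and directly from the definition with logarithmic utility and Y ∼ Bernoulli(1/2); function names are hypothetical.

```python
import math


def b_closed_form(c0):
    """Upper bound b(c0) = c0 - sqrt(c0 * (c0 - 1)) of Example 6.4, for c0 > 1."""
    if c0 <= 1.0:
        raise ValueError("b is defined for c0 > 1")
    return c0 - math.sqrt(c0 * (c0 - 1.0))


def b_from_definition(c0):
    """c0 - u^{-1}(E[u(c0 - Y)]) with u = log and Y ~ Bernoulli(1/2)."""
    expected_utility = 0.5 * math.log(c0) + 0.5 * math.log(c0 - 1.0)
    return c0 - math.exp(expected_utility)
```

Evaluating b on an increasing grid of c0 values shows the decrease from values close to 1 (poor policyholder) towards the expected claim 1/2 (rich policyholder).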
Definition 6.5 (utility indifference price). The utility indifference price π =
π(u, S, c0) for utility function u, initial capital c0 ∈ I and risky position S is given
by the solution of
u(c0) = E[u(c0 + π − S)].
Of course, π and S need to be such that c0 + π − S ∈ I, P-a.s. This may give some
restrictions on the range of S if I is a bounded interval, see also Example 6.4.
The utility indifference price given in Definition 6.5 gives the insurance company
point of view. It is assumed that the insurance company has initial capital c0 ∈ I,
similar to the surplus process given in Definition 5.1. It will then only accept an
insurance contract S at price π if the utility does not decrease, i.e. if it is indifferent
about accepting S at price π and not selling such a contract.
Jensen’s inequality and the strict increasing property of u immediately provide the
following corollary.
Corollary 6.6. The utility indifference price π = π(u, S, c0) for initial capital c0,
risk-averse utility function u and risky position S satisfies
π = π(u, S, c0) > E[S].
Proof. Exercise. □
Example 6.7 (exponential utility function). Assume we have initial capital c0 ∈ R,
exponential utility function (6.6) with risk-aversion parameter α > 0, and we would
like to insure a risky position S ∼ N(µ, σ²). Thus, we need to solve
\[ 1 - \frac{1}{\alpha} \exp\{-\alpha c_0\} = \mathbb{E}\left[ 1 - \frac{1}{\alpha} \exp\{-\alpha (c_0 + \pi - S)\} \right]. \]
This is equivalent to solving
\[ \exp\{\alpha \pi\} = \mathbb{E}\left[ \exp\{\alpha S\} \right] = \exp\{\alpha\mu + \alpha^2\sigma^2/2\}. \]
Therefore we obtain utility indifference price for S
\[ \pi = \pi(u, S, c_0) = \mu + \alpha\sigma^2/2 > \mu. \]
Remarks.
• We obtain an insurance premium π > µ = E[S] (Jensen’s inequality) and
therefore (NPC) is fulfilled.
• The loading is of the form ασ²/2 = αVar(S)/2. That is, for the exponential
utility function we get a variance loading. This is exact for S ∼ N(µ, σ²)
and it is approximately true for other distribution functions (using a Taylor
approximation).
• The utility indifference price does not depend on the initial capital c0.
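For utility functions without a closed form solution, the equation of Definition 6.5 can be solved numerically. The sketch below uses bisection for a claim S with finitely many outcomes (an assumption made to keep the expectation a finite sum); since u is strictly increasing, the map π ↦ E[u(c0 + π − S)] − u(c0) is increasing and changes sign between the smallest and the largest outcome of S. All names are hypothetical.

```python
import math


def exp_utility(x, alpha):
    """Exponential (CARA) utility function (6.6)."""
    return 1.0 - math.exp(-alpha * x) / alpha


def indifference_price(utility, c0, outcomes, probs, tol=1e-10):
    """Solve u(c0) = E[u(c0 + pi - S)] for pi by bisection (S discrete)."""
    def gap(pi):
        return sum(p * utility(c0 + pi - s)
                   for s, p in zip(outcomes, probs)) - utility(c0)

    lo, hi = min(outcomes), max(outcomes)   # gap(lo) <= 0 <= gap(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the exponential utility function the result coincides with (1/α) log E[exp{αS}] for every c0, while for the logarithmic utility function (power utility with γ = 1, which requires c0 + π − S > 0) the price decreases as the initial capital grows.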
Exercise 7. Choose the exponential utility function (6.6).
• Calculate the utility indifference price for S ∼ Γ(γ, c).
• Calculate the utility indifference price for S ∼ Pareto(θ, α).
Proposition 6.8. Assume $u \in C^3$ is a risk-averse utility function on R. The
following two are equivalent:
• the utility indifference prices π = π(u, S, c0) do not depend on c0 for all S;
• the utility function u is of the form
u(x) = a − b exp{−cx},
for a ∈ R and b, c > 0.
Remark. Note that the utility function u(x) = a − b exp{−cx} gives the same
preference ordering as the exponential utility function (6.6) with c = α: if we have
two different utility functions u(·) and v(·) with v = a + bu for a ∈ R and b ∈ R+
then they generate the same preference ordering.
Proof of Proposition 6.8. The direction “⇐” is immediately clear just by evaluating Definition
6.5. So we prove the direction “⇒”. Definition 6.5 implies for the derivative w.r.t. c0
\[ u'(c_0) = \frac{\partial}{\partial c_0} \mathbb{E}[u(c_0 + \pi - S)] = \mathbb{E}\left[ u'(c_0 + \pi - S) \left( 1 + \frac{\partial}{\partial c_0} \pi(c_0) \right) \right] = \mathbb{E}\left[ u'(c_0 + \pi - S) \right], \]
where in the last step we have used the assumption that the premium does not depend on c0.
This implies that, using a change of sign,
\[ -u'(c_0) = \mathbb{E}\left[ -u'(c_0 + \pi - S) \right]. \]
Since u is a risk-averse utility function, we have v = −u′ is strictly increasing (because v′ =
−u″ > 0) and, thus, v can be used as a utility function on R (maybe not risk-averse, but
strictly increasing). The last equation explains that π = π(u, S, c0) = π(v, S, c0) is also the utility
indifference price for utility function v, since
\[ v(c_0) = \mathbb{E}\left[ v(c_0 + \pi - S) \right]. \]
This implies that for the same π = π(u, S, c0)
\[ u^{-1}\left( \mathbb{E}[u(c_0 + \pi - S)] \right) = v^{-1}\left( \mathbb{E}[v(c_0 + \pi - S)] \right), \tag{6.8} \]
for any c0 and S where π = π(u, S, c0) exists. The latter implies that (the proof is provided
below)
\[ \frac{u''(x)}{u'(x)} = \frac{v''(x)}{v'(x)} \qquad \text{for all } x \in \mathbb{R}. \tag{6.9} \]
Before we prove (6.9) we show that it provides the claim. Calculate
\[ \begin{aligned}
\frac{d}{dx} \frac{u''(x)}{u'(x)} &= \frac{u'''(x) u'(x) - (u''(x))^2}{(u'(x))^2} = \frac{u''(x)}{u'(x)} \left( \frac{u'''(x)}{u''(x)} - \frac{u''(x)}{u'(x)} \right) \\
&= \frac{u''(x)}{u'(x)} \left( \frac{v''(x)}{v'(x)} - \frac{u''(x)}{u'(x)} \right) = 0.
\end{aligned} \]
This implies that for some c ∈ R
\[ u''(x) = -c\, u'(x) \qquad \text{for all } x \in \mathbb{R}. \]
The solution to this differential equation is given by u(x) = a − b exp{−cx}. Risk-aversion and
increasing property provide b, c > 0.
So there remains to prove that (6.8) implies (6.9). This is proved by contradiction. Assume that
(6.9) does not hold true. Due to the differentiability property of u we can find a non-empty open
interval O ⊂ R with, w.l.o.g.,
\[ \frac{u''(x)}{u'(x)} < \frac{v''(x)}{v'(x)} \qquad \text{for all } x \in O. \]
We consider the function $u(v^{-1}(\cdot))$ on the non-empty open interval v(O) (note that v is strictly
increasing). We calculate
\[ \frac{d}{dx} u(v^{-1}(x)) = u'(v^{-1}(x))\, \frac{d}{dx} v^{-1}(x) = \frac{u'(v^{-1}(x))}{v'(v^{-1}(x))} > 0, \]
because both u and v are strictly increasing, and
\[ \begin{aligned}
\frac{d^2}{dx^2} u(v^{-1}(x)) &= \frac{u''(v^{-1}(x))}{(v'(v^{-1}(x)))^2} - \frac{u'(v^{-1}(x))\, v''(v^{-1}(x))}{(v'(v^{-1}(x)))^3} \\
&= \frac{u'(v^{-1}(x))}{(v'(v^{-1}(x)))^2} \left( \frac{u''(v^{-1}(x))}{u'(v^{-1}(x))} - \frac{v''(v^{-1}(x))}{v'(v^{-1}(x))} \right) < 0 \qquad \text{for all } x \in v(O).
\end{aligned} \]
This implies that $u(v^{-1}(\cdot))$ is a risk-averse utility function on the non-empty interval $v(O)$. Choose
a non-deterministic random variable $Y$ such that $Y \in O$, P-a.s. Since $O$ is a non-empty open
interval such a random variable can be chosen (i.e. no concentration in a single point). This
implies that $v(Y)$ is a non-deterministic random variable with range $v(O)$ and the strict concavity
of $u(v^{-1}(\cdot))$ on $v(O)$ implies
$$u^{-1}(E[u(Y)]) = u^{-1}\left(E\left[u(v^{-1}(v(Y)))\right]\right) < u^{-1}\left(u\left(v^{-1}(E[v(Y)])\right)\right) = v^{-1}(E[v(Y)]), \tag{6.10}$$
but this contradicts (6.8), which proves the claim. □

The proof of Proposition 6.8 provides insights into risk-aversion. Define the absolute
and the relative risk-aversions of a twice differentiable utility function $u$ by
$$\rho_{\mathrm{ARA}}(x) = \rho^u_{\mathrm{ARA}}(x) = -\frac{u''(x)}{u'(x)} \quad \text{and} \quad \rho_{\mathrm{RRA}}(x) = \rho^u_{\mathrm{RRA}}(x) = -x\,\frac{u''(x)}{u'(x)}.$$
Example 6.9 (exponential utility function). The exponential utility function (6.6)
with risk-aversion parameter $\alpha > 0$ satisfies for all $x \in \mathbb{R}$
$$\rho_{\mathrm{ARA}}(x) = \alpha.$$
This explains the terminology constant absolute risk-aversion (CARA) utility.
Example 6.10 (power utility function). The power utility function (6.7) with
risk-aversion parameter $\gamma > 0$ satisfies for all $x \in \mathbb{R}_+$
$$\rho_{\mathrm{RRA}}(x) = \gamma.$$
This explains the terminology constant relative risk-aversion (CRRA) utility.
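The two examples above can be checked numerically. The following is a minimal sketch, assuming the common forms u(x) = 1 − exp{−αx}/α and u(x) = x^(1−γ)/(1−γ) for the exponential and power utilities; the exact normalisations of (6.6) and (6.7) are not reproduced here, and they do not affect the ratio u''/u'. The helper name `risk_aversion` and the parameter values are illustrative choices.

```python
import math

def risk_aversion(u, x, h=1e-4):
    """Return (ARA, RRA) of utility u at x using central finite differences."""
    d1 = (u(x + h) - u(x - h)) / (2 * h)            # approximates u'(x)
    d2 = (u(x + h) - 2 * u(x) + u(x - h)) / h**2    # approximates u''(x)
    ara = -d2 / d1
    return ara, x * ara

alpha, gamma = 0.5, 2.0  # illustrative risk-aversion parameters

def exp_utility(x):
    """Assumed exponential (CARA) utility."""
    return 1.0 - math.exp(-alpha * x) / alpha

def power_utility(x):
    """Assumed power (CRRA) utility for gamma != 1, x > 0."""
    return x ** (1.0 - gamma) / (1.0 - gamma)

for x in (1.0, 2.0, 5.0):
    ara_exp, _ = risk_aversion(exp_utility, x)
    _, rra_pow = risk_aversion(power_utility, x)
    print(x, round(ara_exp, 4), round(rra_pow, 4))
```

Up to discretisation error, the absolute risk-aversion of the exponential utility stays at α and the relative risk-aversion of the power utility stays at γ for every x tried.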
Assume that $u$ and $v$ are two utility functions that are defined on the same interval
$I$. Then, $u$ is more risk-averse than $v$ if for any $X$ with range $I$ we have
$$u^{-1}(E[u(X)]) \le v^{-1}(E[v(X)]).$$
Proposition 6.11. Assume that $u$ and $v$ are twice differentiable utility functions
defined on the same interval $I \subset \mathbb{R}$. The following are equivalent:
• $u$ is more risk-averse than $v$ on $I$;
• $\rho^u_{\mathrm{ARA}}(x) \ge \rho^v_{\mathrm{ARA}}(x)$ for all $x \in I$.

Proof. The direction “⇒” is proved by exactly the same contradiction argument that is used to
prove claim (6.9) from (6.8). For the direction “⇐” we consider the function $u(v^{-1}(\cdot))$ on $v(I)$.
This is a strictly increasing function because $u$ and $v$ are utility functions, see proof of Proposition
6.8. The latter proof also implies
$$\begin{aligned}
\frac{d^2}{dx^2}u(v^{-1}(x)) &= \frac{u'(v^{-1}(x))}{(v'(v^{-1}(x)))^2}\left(\frac{u''(v^{-1}(x))}{u'(v^{-1}(x))} - \frac{v''(v^{-1}(x))}{v'(v^{-1}(x))}\right) \\
&= \frac{u'(v^{-1}(x))}{(v'(v^{-1}(x)))^2}\left(\rho^v_{\mathrm{ARA}}(v^{-1}(x)) - \rho^u_{\mathrm{ARA}}(v^{-1}(x))\right) \le 0 \quad \text{for all } x \in v(I).
\end{aligned}$$
The proof then follows similar to (6.10). □
The above proposition has a nice interpretation.

Corollary 6.12. Assume $u$ is more risk-averse than $v$. Then we have for the utility
indifference prices
$$\pi(u, S, c_0) \ge \pi(v, S, c_0).$$

Proof. We have the following: $\rho^u_{\mathrm{ARA}}(x) \ge \rho^v_{\mathrm{ARA}}(x)$ for all $x \in I$ implies
$$c_0 = u^{-1}(E[u(c_0 + \pi(u, S, c_0) - S)]) \le v^{-1}(E[v(c_0 + \pi(u, S, c_0) - S)]).$$
By definition of $\pi(v, S, c_0)$ we also have $c_0 = v^{-1}(E[v(c_0 + \pi(v, S, c_0) - S)])$.
Since both $v^{-1}$ and $v$ are strictly increasing we see that $\pi(u, S, c_0) \ge \pi(v, S, c_0)$. □
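Corollary 6.12 can be illustrated with exponential utilities. The sketch below assumes u(x) = −exp{−αx} (any affine transformation gives the same price; the normalisation of (6.6) is not reproduced here). Solving u(c0) = E[u(c0 + π − S)] then gives π = log(M_S(α))/α, independent of c0, and by Example 6.9 a larger α means a more risk-averse utility. The two-point claim distribution is a hypothetical choice.

```python
import math

def exp_utility_price(alpha, outcomes, probs):
    """Indifference price (1/alpha) * log E[exp(alpha * S)] for a discrete claim S."""
    mgf = sum(p * math.exp(alpha * s) for s, p in zip(outcomes, probs))
    return math.log(mgf) / alpha

# Hypothetical claim: 10 with probability 10%, otherwise 0.
outcomes, probs = [0.0, 10.0], [0.9, 0.1]
expected_claim = sum(p * s for s, p in zip(outcomes, probs))

alphas = (0.05, 0.1, 0.2)
prices = [exp_utility_price(a, outcomes, probs) for a in alphas]
for a, price in zip(alphas, prices):
    print(a, round(price, 4))
```

The prices increase with α and all lie above the expected claim, in line with the corollary.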
The last corollary also explains that the price elasticity interval (6.5) becomes
narrower for decreasing risk-aversion.
Theorem 6.13. Assume $u \in C^3$ is a risk-averse utility function on $I$. The following
are equivalent:
• $\pi(u, S, c_0)$ is decreasing in $c_0$ for all $S$;
• $\rho^u_{\mathrm{ARA}}(x)$ is decreasing for all $x \in I$.
Proof of Theorem 6.13. In complete analogy to the proof of Proposition 6.8 we obtain for the
direction “⇒” (calculating derivatives w.r.t. $c_0$)
$$c_0 \ge v^{-1}(E[v(c_0 + \pi(u, S, c_0) - S)]),$$
where $v = -u'$ is a utility function. This implies
$$u^{-1}(E[u(c_0 + \pi(u, S, c_0) - S)]) \ge v^{-1}(E[v(c_0 + \pi(u, S, c_0) - S)]).$$
Since this holds for any $c_0$ and $S$ we obtain that $v$ is more risk-averse than $u$, and Proposition
6.11 implies that $\rho^v_{\mathrm{ARA}}(x) \ge \rho^u_{\mathrm{ARA}}(x)$ for all $x \in I$. From this we obtain
$$-\frac{u'''}{u''} = -\frac{v''}{v'} \ge -\frac{u''}{u'},$$
and thus
$$\frac{d}{dx}\rho^u_{\mathrm{ARA}}(x) = -\frac{d}{dx}\frac{u''}{u'} = -\frac{u'''}{u'} + \frac{(u'')^2}{(u')^2} = -\frac{u''}{u'}\left(\frac{u'''}{u''} - \frac{u''}{u'}\right) \le 0.$$
This proves the first direction of the equivalence. The proof of the direction “⇐” is obtained by
reading the proof in the reverse direction (all the statements are equivalences). □
Example 6.14 (power utility function). The power utility function (6.7) with
risk-aversion parameter $\gamma > 0$ satisfies for all $x \in \mathbb{R}_+$
$$\rho_{\mathrm{ARA}}(x) = \gamma x^{-1}.$$
This is a strictly decreasing function in $x \in \mathbb{R}_+$. Therefore the utility indifference
price $\pi(u, S, c_0)$ becomes a decreasing function in $c_0$, see Theorem 6.13. This is the
property required by economists for having reasonable risk-averse utility functions.
This was already explored in Example 6.4.
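The monotonicity in c0 can be observed numerically. The following sketch solves u(c0) = E[u(c0 + π − S)] by bisection, assuming the power utility u(x) = x^(1−γ)/(1−γ) with γ = 2 (the normalisation of (6.7) is not reproduced here) and a hypothetical two-point claim. It requires c0 + E[S] − max S > 0 so that all arguments of u stay positive.

```python
def power_utility(x, gamma=2.0):
    """Assumed power (CRRA) utility for gamma != 1, defined for x > 0."""
    return x ** (1.0 - gamma) / (1.0 - gamma)

def indifference_price(c0, outcomes, probs, utility, tol=1e-10):
    """Solve u(c0) = E[u(c0 + pi - S)] for pi by bisection on [E[S], max S]."""
    def gap(pi):
        expected = sum(p * utility(c0 + pi - s) for s, p in zip(outcomes, probs))
        return expected - utility(c0)

    lo = sum(p * s for s, p in zip(outcomes, probs))  # at E[S] the gap is <= 0 (Jensen)
    hi = max(outcomes)                                # at max S the gap is >= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical claim: 0 or 1 with probability 1/2 each, so E[S] = 0.5.
outcomes, probs = [0.0, 1.0], [0.5, 0.5]
prices = {c0: indifference_price(c0, outcomes, probs, power_utility)
          for c0 in (1.0, 2.0, 5.0)}
for c0, price in prices.items():
    print(c0, round(price, 4))
```

In this example the price falls towards E[S] as the initial capital c0 grows, which is the behaviour described by Theorem 6.13 for decreasing absolute risk-aversion.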
Exercise 8. Choose exponential utility function (6.6).
• Assume $Y_1, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} \Gamma(\gamma, c)$. Calculate the utility indifference price for
$\sum_{i=1}^n Y_i$.
• Assume $S \sim \mathrm{CompPoi}(\lambda v = n, G = \Gamma(\gamma, c))$. Calculate the utility indifference
price for $S$.
• Compare the two results of the previous items.
• What can be said about diversification benefits?
6.2.2 Esscher premium
Choose a random variable $S \sim F$ with finite first moment given by
$$E[S] = \int_{\mathbb{R}} s \, dF(s).$$
Classical actuarial practice calculates premium loadings by giving more weight to bad
events compared to good events. Basically, this means that one does a change of measure
towards a less favorable probability measure. Hans Bühlmann [19] introduces this idea
in the actuarial literature by constructing the Esscher measure.
Define for $\alpha > 0$ the Esscher (probability) measure $F_\alpha$ of $F$ as follows:
$$F_\alpha(s) = \frac{1}{M_S(\alpha)} \int_{-\infty}^{s} e^{\alpha x} \, dF(x),$$
under the additional assumption that the moment generating function $M_S(\alpha)$ of $S$
exists in $\alpha$. Note that this defines a (normalized) probability measure $F_\alpha$.
Definition 6.15 (Esscher premium). Choose $S \sim F$ and assume that there exists
$r_0 > 0$ such that $M_S(r) < \infty$ for all $r \in (-r_0, r_0)$. The Esscher premium $\pi_\alpha$ of $S$
in $\alpha \in (0, r_0)$ is defined by
$$\pi_\alpha = E_\alpha[S] = \int_{\mathbb{R}} s \, dF_\alpha(s).$$
Corollary 6.16. Under the assumptions in Definition 6.15 we have
$$\pi_\alpha = \frac{d}{dr} \log M_S(r)\Big|_{r=\alpha} \ge E[S],$$
where the inequality is strict for non-deterministic $S$.

Proof. Note that Lemma 1.1 implies for $\alpha \in (0, r_0)$
$$\pi_\alpha = \frac{1}{M_S(\alpha)} \int_{\mathbb{R}} s\,e^{\alpha s} \, dF(s) = \frac{M_S'(\alpha)}{M_S(\alpha)} = \frac{d}{dr} \log M_S(r)\Big|_{r=\alpha}.$$
The claim then follows from Lemma 1.6. □
Example 6.17 (Esscher premium for Gaussian distributions). Choose $\alpha > 0$ and
assume that $S \sim N(\mu, \sigma^2)$. Then we have
$$\pi_\alpha = \frac{d}{dr} \log M_S(r)\Big|_{r=\alpha} = \mu + \alpha\sigma^2 > \mu = E[S].$$
In the Gaussian case we obtain the variance loading. Thus, the variance loading,
the exponential utility function and the Esscher premium principles provide exactly
the same insurance premium in the Gaussian case.
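The identity of Corollary 6.16 lends itself to a quick numerical check in the Gaussian case. The sketch below uses the standard Gaussian log moment generating function log M_S(r) = µr + σ²r²/2 and approximates the derivative by a central difference; the function names and parameter values are illustrative.

```python
def esscher_premium(log_mgf, alpha, h=1e-5):
    """Approximate d/dr log M_S(r) at r = alpha by a central difference."""
    return (log_mgf(alpha + h) - log_mgf(alpha - h)) / (2.0 * h)

mu, sigma, alpha = 100.0, 20.0, 0.01  # illustrative parameters

def gaussian_log_mgf(r):
    """log M_S(r) = mu * r + sigma^2 * r^2 / 2 for S ~ N(mu, sigma^2)."""
    return mu * r + 0.5 * sigma**2 * r**2

premium = esscher_premium(gaussian_log_mgf, alpha)
print(premium, mu + alpha * sigma**2)
```

Both printed values agree up to rounding, and the same helper can be applied to any other log moment generating function that is finite around α.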

Exercise 9 (Esscher premium for gamma distributions). Assume that S ∼ Γ(γ, c)

with γ, c > 0. Calculate the Esscher premium of S for α ∈ (0, c).
Conclusions.
• The Esscher premium can easily be calculated from the moment generating
function $M_S(r)$.
• The Esscher premium can only be calculated for light tailed claims, see also
Section 5.2 on the Lundberg coefficient. For heavier tailed claims the Esscher
premium reacts so sensitively that it becomes infinite. In the next
section we study probability distortion principles that allow for heavier
tails in premium calculation while still leading to a finite premium.
• In classical economic theory, prices are often derived by the assumption of
market clearing in a risk exchange economy. That is, if we assume that
we have (i) an economy with risky positions $S_1, \ldots, S_K$; (ii) market
participants who have an exponential utility function with risk aversion parameters
$\alpha_i > 0$; and (iii) market clearing in the sense that all risky positions are
allocated to the market participants, then one can prove that the risky positions
are exactly priced with the Esscher measure of the aggregate market
capitalization. This is in the spirit of Bühlmann [19] and is, for instance, found in
Tsanakas-Christofides [84].
6.2.3 Probability distortion pricing principles

In the previous section we have met a pricing principle that was based on probability
distortions. In that case it was only possible to calculate insurance prices for light
tailed claims because the distortion reacted very sensitively to heavy tails. In the

present section we look at probability distortions from a different angle which will
allow for more flexibility. Assume that $S \sim F$ with $S \ge 0$, P-a.s., then using
integration by parts the expected claim is calculated as
$$E[S] = \int_0^\infty x \, dF(x) = \int_0^\infty P[S > x] \, dx.$$
In this section we distort the survival function $P[S > x]$. Therefore, we introduce a
distortion function $h : [0, 1] \to [0, 1]$ which is a continuous, increasing and concave
function with $h(0) = 0$ and $h(1) = 1$. In Figure 6.2 we give two examples.
Figure 6.2: Distortion functions h of Examples 6.19 and 6.20, below, with γ = 1/2
and q = 0.1, respectively (plot “probability distortions”; curves: power distortion,
expected shortfall; both axes from 0 to 1).
• h(p) distorts the probability p with the property that h(p) ≥ p because h is
increasing and concave with h(0) = 0 and h(1) = 1.
• The concavity of h reflects risk aversion, similar to the utility functions used
in Section 6.2.1.
• Note that the existence of p ∈ (0, 1) with h(p) > p implies that h(p) > p for
all p ∈ (0, 1). Therefore, we assume under strict risk-aversion that h(p) > p
for all p ∈ (0, 1).

Definition 6.18. Assume that $h : [0, 1] \to [0, 1]$ is a continuous, increasing and
concave function with $h(0) = 0$, $h(1) = 1$ and $h(p) > p$ for all $p \in (0, 1)$. The
probability distorted price $\pi_h$ of $S \ge 0$ is defined by
$$\pi_h = E_h[S] = \int_0^\infty h(P[S > x]) \, dx.$$
We obtain a risk loading that provides
$$E[S] = \int_0^\infty P[S > x] \, dx \le \int_0^\infty h(P[S > x]) \, dx = E_h[S] = \pi_h,$$
where the inequality is strict for non-deterministic $S$.
Remarks.
• Similar to the Esscher premium we modify the probability distribution func-
tion of the claims S (in contrast to the utility theory approach where we
modify the claim sizes).
• The probability distortion approach is a technique to construct coherent risk
measures for bounded random variables. For a detailed outline we refer to Freddy
Delbaen [31], in particular to corresponding Example 4.7 and Corollary 7.6 which
relates convex games to coherent risk measures.
• This probability distortion approach is similar to life insurance pricing where one
constructs first order life tables out of second order life tables (expected mortality
rates) in order to have a security and profit margin, see also Denneberg [32].
Example 6.19 (probability distortion for Pareto distribution). Choose (non-negative)
claim $S \sim \mathrm{Pareto}(\theta, \alpha)$ with $\alpha > 1$ and $\theta > 0$, and probability distortion function,
see Example 4.5 in Delbaen [31] and Figure 6.2,
$$h(p) = p^\gamma \quad \text{for } \gamma \in (0, 1). \tag{6.11}$$
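The probability distorted price of $S$ is given by
$$\begin{aligned}
\pi_h &= \int_0^\infty h(P[S > x]) \, dx = \int_0^\theta 1^\gamma \, dx + \int_\theta^\infty \left[\left(\frac{x}{\theta}\right)^{-\alpha}\right]^\gamma dx \\
&= \int_0^\theta 1 \, dx + \int_\theta^\infty \left(\frac{x}{\theta}\right)^{-\alpha\gamma} dx = \int_0^\infty P[S_\gamma > x] \, dx,
\end{aligned}$$
where $S_\gamma \sim \mathrm{Pareto}(\theta, \alpha\gamma)$. This immediately implies
$$\pi_h = \theta\,\frac{\alpha\gamma}{\alpha\gamma - 1} > \theta\,\frac{\alpha}{\alpha - 1} = E[S] \quad \text{for } \gamma \in (1/\alpha, 1).$$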

In contrast to the Esscher premium we can calculate the probability distorted

premium also for heavy tailed claims as long as the risk aversion (concavity) is not
too large, i.e. in our case γ ∈ (1/α, 1).
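The closed form of Example 6.19 can be compared with a direct numerical integration of the distorted survival function. This is a rough sketch: the midpoint rule, the truncation point `upper` and the parameter values θ = 1, α = 3, γ = 0.8 are illustrative assumptions, chosen so that αγ > 1.

```python
def pareto_survival(x, theta, alpha):
    """P[S > x] for S ~ Pareto(theta, alpha): 1 up to theta, (x/theta)^(-alpha) above."""
    return 1.0 if x <= theta else (x / theta) ** (-alpha)

def distorted_price(survival, h, upper, steps=400_000):
    """Midpoint-rule approximation of the integral of h(P[S > x]) over [0, upper]."""
    dx = upper / steps
    return sum(h(survival((k + 0.5) * dx)) for k in range(steps)) * dx

theta, alpha, gamma = 1.0, 3.0, 0.8  # illustrative parameters with alpha * gamma > 1

numeric = distorted_price(
    lambda x: pareto_survival(x, theta, alpha),
    lambda p: p ** gamma,           # power distortion (6.11)
    upper=2000.0,
)
closed_form = theta * alpha * gamma / (alpha * gamma - 1.0)
expected_claim = theta * alpha / (alpha - 1.0)
print(numeric, closed_form, expected_claim)
```

The numerical value is slightly below the closed form because the integral is truncated at `upper`; both exceed the expected claim, which is the risk loading.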
Exercise 10. Choose power distortion function (6.11). Calculate the probability
distorted price of S ∼ Γ(1, c) and of S ∼ Bernoulli(p).
Example 6.20 (expected shortfall). Choose distortion function $h : [0, 1] \to [0, 1]$
as follows, see Remark 7.7 in Delbaen [31] and Figure 6.2: fix $q \in (0, 1)$ and define
$$h(x) = \begin{cases} x/q & \text{for } x \le q, \\ 1 & \text{otherwise.} \end{cases} \tag{6.12}$$
Choose $S \sim F$ with $S \ge 0$, P-a.s. The left-continuous generalized inverse of $F$ for
$\alpha \in (0, 1)$ is given by, see Chapter 1,
$$F^{\leftarrow}(\alpha) = \inf\{x \in \mathbb{R};\ F(x) \ge \alpha\}.$$
For simplicity we assume that $F$ is continuous and strictly increasing. This simplifies
considerations because then also $F^{\leftarrow}$ is continuous and strictly increasing and
$F(F^{\leftarrow}(\alpha)) = \alpha$ and $F^{\leftarrow}(F(x)) = x$, see Chapter 1 (strictly increasing property of
$F$ would not be necessary for getting the full flavor of this example). Consider the
survival function of $S$ given by $\bar{F}(x) = 1 - F(x) = P[S > x]$. Note that under our
assumptions
$$\{x > F^{\leftarrow}(1 - q)\} = \{F(x) > 1 - q\} = \{\bar{F}(x) < q\}.$$
This identity implies
$$\begin{aligned}
\pi_h = \int_0^\infty h(P[S > x]) \, dx &= \frac{1}{q} \int_{\{\bar{F}(x) \le q\}} \bar{F}(x) \, dx + \int_{\{\bar{F}(x) > q\}} 1 \, dx \\
&= \frac{1}{q} \int_{\{x \ge F^{\leftarrow}(1-q)\}} \bar{F}(x) \, dx + \int_{\{x < F^{\leftarrow}(1-q)\}} 1 \, dx \\
&= \frac{1}{q} \int_{F^{\leftarrow}(1-q)}^\infty P[S > x] \, dx + F^{\leftarrow}(1 - q).
\end{aligned}$$
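The continuity and strictly increasing property of $F$ implies that
$$P[S \ge F^{\leftarrow}(1 - q)] = 1 - P[S < F^{\leftarrow}(1 - q)] = 1 - F(F^{\leftarrow}(1 - q)) = q.$$
This provides, using continuity and strictly increasing property of $F$,
$$\begin{aligned}
\pi_h &= \frac{1}{P[S \ge F^{\leftarrow}(1 - q)]} \int_{F^{\leftarrow}(1-q)}^\infty P[S > x] \, dx + F^{\leftarrow}(1 - q) \\
&= \int_{F^{\leftarrow}(1-q)}^\infty P[S > x \,|\, S \ge F^{\leftarrow}(1 - q)] \, dx + F^{\leftarrow}(1 - q) \\
&= \int_0^\infty P[S > x \,|\, S \ge F^{\leftarrow}(1 - q)] \, dx = E[S \,|\, S \ge F^{\leftarrow}(1 - q)].
\end{aligned}$$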

The latter is exactly the so-called Tail-Value-at-Risk (TVaR) or the conditional tail
expectation (CTE) of the random variable $S$ at the $1 - q$ security level. Moreover,
$F^{\leftarrow}(1 - q)$ is the Value-at-Risk (VaR) of the random variable $S$ at the $1 - q$ security
level. The continuity of $F$ implies that this TVaR is equal to the expected shortfall
(ES) of $S$ at the security level $1 - q$, that is,
$$\pi_h = E[S \,|\, S \ge F^{\leftarrow}(1 - q)] = \frac{1}{q} \int_{1-q}^1 F^{\leftarrow}(u) \, du = \mathrm{ES}_{1-q}(S),$$
see Artzner et al. [5, 6], Acerbi-Tasche [1] and Lemma 2.16 in McNeil et al. [68].
The proof again uses the fact that for continuous distribution functions $F$ we have
$F(F^{\leftarrow}(\alpha)) = \alpha$ and then the left-hand side of the above statement can be obtained
by a change of variables from the right-hand side.
We conclude that under continuity assumptions the risk measure $\mathrm{ES}_{1-q}(S)$ can be
obtained via probability distortion (6.12), and following Delbaen [31] is therefore
coherent.
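As a numerical illustration of Example 6.20, the sketch below integrates the distorted survival function for a standard exponential claim S ~ Exp(1), which is an assumed example distribution not used in the text. For this distribution F←(1 − q) = −log q and, by the memoryless property, E[S | S ≥ F←(1 − q)] = −log q + 1.

```python
import math

def es_distortion(p, q):
    """Expected shortfall distortion h(p) = min(p / q, 1), see (6.12)."""
    return min(p / q, 1.0)

def distorted_price_exponential(q, upper=40.0, steps=200_000):
    """Midpoint-rule integral of h(P[S > x]) for S ~ Exp(1), truncated at `upper`."""
    dx = upper / steps
    return sum(es_distortion(math.exp(-(k + 0.5) * dx), q) for k in range(steps)) * dx

q = 0.01
value_at_risk = -math.log(q)               # F^{<-}(1 - q) for Exp(1)
expected_shortfall = value_at_risk + 1.0   # E[S | S >= VaR] by memorylessness
numeric_price = distorted_price_exponential(q)
print(numeric_price, expected_shortfall)
```

The distorted price and the conditional tail expectation coincide up to discretisation error, as the derivation above predicts for continuous F.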
Exercise 11. Choose probability distortion (6.12) for q = 1% and calculate the
probability distorted price for
• $S \sim \mathrm{LN}(\mu, \sigma^2)$,
• $S \sim \mathrm{Pareto}(\theta, \alpha)$ with $\alpha > 1$,
• $S_n = \sum_{i=1}^n Y_i$ with $Y_i \overset{\text{i.i.d.}}{\sim} \Gamma(1, 1)$ and study the diversification benefit of the
probability distorted price of $S_n$ as a function of $n \in \mathbb{N}$.

6.2.4 Cost-of-capital principles

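Denote by $\mathcal{X} \subset L^1(\Omega, \mathcal{F}, P)$ the set of (risky) positions $X$ of interest ($X$ denotes
losses here).
A risk measure $\varrho$ on $\mathcal{X}$ is a mapping
$$\varrho : \mathcal{X} \to \mathbb{R} \quad \text{with} \quad X \mapsto \varrho(X).$$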
Remarks.
• A risk measure $\varrho$ attaches to each (risky) position $X$ a value $\varrho(X) \in \mathbb{R}$.
• If the risk measure $\varrho$ is the regulatory risk measure then $\varrho(X) \in \mathbb{R}$ reflects the
necessary risk bearing capital that needs to be available within the insurance
company to run business $X$. This is the minimal equity the insurance company
needs to hold to balance possible shortfalls in the insurance portfolio.
• By a change of sign in $X$ we can observe the similarities to the expected
utility framework of Section 6.2.1.
• For having a “good” risk measure one requires additional properties for $\varrho$
such as monotonicity, coherence, etc. This is described below.
• The most commonly used risk measures are: variance, Value-at-Risk (VaR),
expected shortfall (ES) already met in Example 6.20. We further discuss
them below.
Assume a (regulatory) risk measure $\varrho : \mathcal{X} \to \mathbb{R}$ with $X \mapsto \varrho(X)$ is given. We would
like to price an insurance portfolio $S$ under the assumption $X = S - E[S] \in \mathcal{X}$. The
regulatory capital requirement then prescribes that the insurance company needs to
hold at least risk bearing capital $\varrho(S - E[S])$. This risk bearing capital $\varrho(S - E[S])$
quantifies the necessary financial strength of the insurance company so that it is
able to finance shortfalls beyond $E[S]$ exactly up to the amount $\varrho(S - E[S])$.
We assume $\varrho(S - E[S]) \ge 0$ (which is going to be justified below). Then the insurance
company looks for shareholders that are willing to provide this risk bearing
capital $\varrho(S - E[S]) > 0$. The shareholders will provide this capital as soon as the
promised expected return on this (invested) capital is sufficiently high. We call the
expected rate of return on this shareholder capital cost-of-capital rate $r_{\mathrm{CoC}} > 0$.
Thus, the shareholders’/investors’ expected return is
$$r_{\mathrm{CoC}}\,\varrho(S - E[S]) > 0$$
on their investment $\varrho(S - E[S]) > 0$.
Definition 6.21. The cost-of-capital pricing principle is given by
$$\pi_{\mathrm{CoC}} = E[S] + r_{\mathrm{CoC}}\,\varrho(S - E[S]).$$

Interpretation.
• For outcomes $S \le E[S]$: the claim can be financed by the pure risk premium
$E[S]$ solely.
• For outcomes $S > E[S]$: the pure risk premium $E[S]$ is not sufficient and the
shortfall $S - E[S] > 0$ needs to be paid from $\varrho(S - E[S])$. Thus, the investor’s
capital $\varrho(S - E[S])$ is at risk: he may lose (part of) it. Therefore, he will ask
for a cost-of-capital rate
$$r_{\mathrm{CoC}} > r_0,$$
if $r_0$ denotes the risk-free rate (he receives on a risk-free bank account with
the same time to maturity as his investment).

We give desired properties of risk measures. For details we refer to Artzner et
al. [5, 6], McNeil et al. [68] and Föllmer-Schied [44]. The first assumption is that
$\mathcal{X}$ is a convex cone containing $\mathbb{R}$, i.e. it satisfies
(1) $c \in \mathcal{X}$ for all $c \in \mathbb{R}$,
(2) $X + Y \in \mathcal{X}$ for all $X, Y \in \mathcal{X}$, and
(3) $\lambda X \in \mathcal{X}$ for all $X \in \mathcal{X}$ and $\lambda > 0$.
Then we state the following axioms for risk measures $\varrho$ on $\mathcal{X}$.
Axioms 6.22 (axioms for risk measures $\varrho$). Assume $\varrho$ is a risk measure on the
convex cone $\mathcal{X}$ containing $\mathbb{R}$. Then we define for $X, Y \in \mathcal{X}$, $c \in \mathbb{R}$ and $\lambda > 0$:
(a) normalization: $\varrho(0) = 0$;
(b) monotonicity: for $X, Y$ with $X \le Y$, P-a.s., we have $\varrho(X) \le \varrho(Y)$;
(c) translation invariance: for all $X$ and every $c$ we have $\varrho(X + c) = \varrho(X) + c$;
(d) positive homogeneity: for all $X$ and for every $\lambda > 0$ we have $\varrho(\lambda X) = \lambda\varrho(X)$;
(e) subadditivity: for all $X, Y$ we have $\varrho(X + Y) \le \varrho(X) + \varrho(Y)$.
Observe that some of the axioms imply others, e.g. positive homogeneity implies
normalization since $\varrho(0) = \varrho(\lambda 0) = \lambda\varrho(0)$ for all $\lambda > 0$ immediately says $\varrho(0) = 0$.
For a detailed analysis of such implications we refer to Section 6.1 in McNeil et
al. [68] and Section 9.1 in Wüthrich-Merz [88].
For our analysis we require (at least) normalization (a) and translation invariance
(c). We briefly comment on this.
Translation invariance. If we hold a risky position $X$ and if we inject capital
$c > 0$ then the loss is reduced to $X - c$. This implies for risk measure $\varrho$ that the
reduced position satisfies
$$\varrho(X - c) = \varrho(X) - c.$$
This justifies the definition of the regulatory risk measure as stated above. Namely,
if we sell a risky portfolio $S$ and we collect pure risk premium $E[S]$ then the risk
of the residual loss $S - E[S]$ is given by
$$\varrho(S - E[S]) = \varrho(S) - E[S].$$
Normalization and translation invariance. A balance sheet of an insurance
company is called acceptable if its (future) surplus $C_1 \in \mathcal{X}$ satisfies $\varrho(-C_1) \le 0$.
Assume that the insurance company sells a policy $S$ at price $\pi \ge E[S]$ and at the
same time it has initial capital $c_0 = \varrho(S - E[S]) \ge 0$. Then the future surplus of the
company is given by $C_1 = c_0 + \pi - S$. The regulator then checks the acceptability
condition which reads as
$$\varrho(-C_1) = \varrho(-(c_0 + \pi - S)) = -c_0 - \pi + \varrho(S) = -\pi + E[S] \le 0. \tag{6.13}$$
Thus, we have an acceptable position. Coming back to the cost-of-capital pricing
principle given in Definition 6.21 this needs to be interpreted as follows: assume
that an initial capital $c_0 > 0$ is provided by an investor who expects cost-of-capital
rate $r_{\mathrm{CoC}} > r_0$ on his investment. Thus, the insurance company also needs to
finance the cost-of-capital cash flow $r_{\mathrm{CoC}}\,c_0 = r_{\mathrm{CoC}}\,\varrho(S - E[S])$ to the investor.
This can exactly be done with the cost-of-capital premium $\pi_{\mathrm{CoC}}$ and the insurance
company keeps its acceptable position in (6.13) if $r_{\mathrm{CoC}}\,c_0$ is also considered as a
liability of the insurance company.
Monotonicity and normalization imply that more risky positions are charged
with higher capital requirements and, in particular, if we have only downside risks,
i.e. $X \ge 0$, P-a.s., then we will have positive capital charges $\varrho(X) \ge \varrho(0) = 0$.
Definition 6.23 (coherent risk measure). The risk measure $\varrho$ is called coherent if
it satisfies Axioms 6.22.
Coherent risk measures were introduced by Artzner et al. [5, 6] and the properties of
coherent risk measures are often regarded as useful in practice. In particular, the
subadditivity property means that if we merge two portfolios we expect diversification
benefits in the sense of a release of necessary risk bearing capital.
We close this section with a discussion of the three most popular risk measures.
Example 6.24. The standard deviation risk measure is for $S$ with finite second
moment given by
$$\varrho(S) = \alpha\,\sigma(S) = \alpha\,\mathrm{Var}(S)^{1/2},$$
for a given parameter $\alpha > 0$. This risk measure is normalized, positive homogeneous,
and subadditive. But it is neither translation invariant nor monotone. Note
that for the standard deviation risk measure the cost-of-capital pricing principle
coincides with the standard deviation loading principle presented in Section 6.1.
Example 6.25 (Value-at-Risk, VaR). The VaR of $S \sim F$ at security level $1 - q \in (0, 1)$
is given by the left-continuous generalized inverse of $F$ at $1 - q$, i.e.
$$\varrho(S) = \mathrm{VaR}_{1-q}(S) = F^{\leftarrow}(1 - q).$$
The VaR is normalized, monotone, translation invariant and positive homogeneous,
but it is not subadditive, and hence not coherent. There are many examples in the
literature showing this non-coherence, see, for instance, Artzner et al. [5, 6], McNeil
et al. [68] and Embrechts et al. [37].
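One standard type of counterexample to subadditivity uses two independent, rarely occurring losses. The following sketch is an illustration of my own choosing rather than an example taken from the cited references: each loss equals 100 with probability 4% and 0 otherwise, and the VaR is evaluated at level 95% via the left-continuous generalized inverse.

```python
from itertools import product

def value_at_risk(outcomes, probs, level):
    """Left-continuous generalized inverse: smallest x with F(x) >= level."""
    cumulative = 0.0
    for x, p in sorted(zip(outcomes, probs)):
        cumulative += p
        if cumulative >= level - 1e-12:  # small tolerance for rounding
            return x
    return max(outcomes)

# Hypothetical loss: 100 with probability 4%, otherwise 0.
outcomes, probs = [0.0, 100.0], [0.96, 0.04]

# Distribution of the sum of two independent copies.
sum_dist = {}
for (x1, p1), (x2, p2) in product(zip(outcomes, probs), repeat=2):
    sum_dist[x1 + x2] = sum_dist.get(x1 + x2, 0.0) + p1 * p2

level = 0.95
var_single = value_at_risk(outcomes, probs, level)
var_sum = value_at_risk(list(sum_dist), list(sum_dist.values()), level)
print(var_single, var_sum)
```

Each single loss has VaR equal to 0 at the 95% level, whereas the sum has a strictly positive VaR, so VaR(X + Y) > VaR(X) + VaR(Y) in this example.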
Example 6.26 (expected shortfall). The expected shortfall has already been
introduced in Example 6.20, where we have stated that the expected shortfall is equal
to the TVaR for continuous distribution functions $F$. Instead of introducing it via
probability distortion functions we can also directly define it. Assume that $S \sim F$
with $F$ continuous. Then we have
$$\varrho(S) = \mathrm{TVaR}_{1-q}(S) = E[S \,|\, S \ge \mathrm{VaR}_{1-q}(S)] = \frac{1}{q} \int_{1-q}^1 \mathrm{VaR}_u(S) \, du = \mathrm{ES}_{1-q}(S).$$
$\mathrm{ES}_{1-q}(S)$ is a coherent risk measure on $L^1(\Omega, \mathcal{F}, P)$. The cost-of-capital pricing
principle is then given by
$$\pi = E[S] + r_{\mathrm{CoC}}\,\mathrm{ES}_{1-q}(S - E[S]) = E[S] + r_{\mathrm{CoC}}\left(\mathrm{ES}_{1-q}(S) - E[S]\right).$$
This cost-of-capital pricing principle can also be obtained with probability distortion
functions: choose $h$ as in Example 6.20 and define the distortion function
$\tilde{h} : [0, 1] \to [0, 1]$ as follows
$$\tilde{h}(x) = (1 - r_{\mathrm{CoC}})\,x + r_{\mathrm{CoC}}\,h(x),$$
for fixed $r_{\mathrm{CoC}} \in (0, 1)$, see Figure 6.3.

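For a non-negative random variable $S \ge 0$ with continuous (and strictly increasing)
distribution function we obtain
$$\begin{aligned}
\pi_{\tilde{h}} &= \int_0^\infty \tilde{h}(P[S > x]) \, dx \\
&= (1 - r_{\mathrm{CoC}}) \int_0^\infty P[S > x] \, dx + r_{\mathrm{CoC}} \int_0^\infty h(P[S > x]) \, dx \\
&= (1 - r_{\mathrm{CoC}})\,E[S] + r_{\mathrm{CoC}}\,\mathrm{ES}_{1-q}(S),
\end{aligned}$$
which proves the claim.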
Remarks.
• Solvency II considers VaR1−q (S − E[S]) for 1 − q = 99.5% as the regulatory

risk measure.
• The Swiss Solvency Test considers ES1−q (S − E[S]) for 1 − q = 99% as the
regulatory risk measure.
• For rCoC one often sets 6% above the risk-free rate. However, this is a heavily
debated number because in stress periods this rate should probably be higher.

Figure 6.3: Distortion functions $h$ of Example 6.20 (expected shortfall) and
corresponding $\tilde{h}$ for expected shortfall cost-of-capital loading (plot “probability
distortions”; curves: expected shortfall (ES), ES cost-of-capital loading; both axes
from 0 to 1).
Exercise 12. Assume that $S \sim N(\mu, \sigma^2)$ has a Gaussian distribution. Choose
$1 - q = 99\%$ and $r_{\mathrm{CoC}} = 6\%$. The cost-of-capital pricing principle for the expected
shortfall risk measure gives price
$$\pi = \mu + r_{\mathrm{CoC}}\,\frac{1}{q}\,\frac{\sigma}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\Phi^{-1}(1 - q)\right)^2\right\}.$$
(a) Prove this statement.
(b) Calibrate the security level for the VaR risk measure such that the cost-of-capital
insurance price is the same as for the expected shortfall risk measure.
(c) Calibrate the standard deviation risk measure loading parameter $\alpha > 0$ such
that the price is the same as for the expected shortfall risk measure.
Remark. This parameter calibration only holds true under the Gaussian model
assumption.
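The formula stated in Exercise 12 can be evaluated and cross-checked numerically; this does not replace the proof asked for in (a). The sketch below uses `statistics.NormalDist` from the Python standard library and compares the closed form with a midpoint-rule approximation of the integral representation of the expected shortfall from Example 6.26. The values µ = 100 and σ = 20 are illustrative assumptions.

```python
import math
from statistics import NormalDist

def coc_price_gaussian(mu, sigma, q, r_coc):
    """Evaluate the closed-form price stated in Exercise 12."""
    z = NormalDist().inv_cdf(1.0 - q)
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mu + r_coc * sigma * density / q

def coc_price_numeric(mu, sigma, q, r_coc, steps=100_000):
    """Cross-check: ES as (1/q) * integral of VaR_u over (1 - q, 1), midpoint rule."""
    dist = NormalDist(mu, sigma)
    du = q / steps
    es = sum(dist.inv_cdf(1.0 - q + (k + 0.5) * du) for k in range(steps)) * du / q
    return mu + r_coc * (es - mu)

mu, sigma, q, r_coc = 100.0, 20.0, 0.01, 0.06  # mu and sigma are illustrative
closed = coc_price_gaussian(mu, sigma, q, r_coc)
numeric = coc_price_numeric(mu, sigma, q, r_coc)
print(closed, numeric)
```

The two values agree up to the quadrature error of the midpoint rule near u = 1.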
6.2.5 Deflator based pricing principles

Up to now we have completely neglected that cash flows also have a time value, i.e.,
in general, future cash flows need to be discounted for valuation purposes. In a
financial mathematics setting insurance cash flow valuation can be considered as a
pricing problem in an incomplete financial market setting.
The pricing in such a financial market can be done either by risk neutral measures
or equivalently by using (state price) deflators. Deflators were introduced in the
actuarial literature by Bühlmann [20, 21, 22] and heavily used in Wüthrich et al. [86]
and Wüthrich-Merz [88]. The terminology deflator was introduced by James Darrell
Duffie [35].
Assume that $\varphi$ is an integrable and strictly positive random variable with
$$E[\varphi] = d_0 = \frac{1}{1 + r_0} \in (0, 1].$$
Then, $d_0$ can be seen as deterministic discount factor and $r_0 \ge 0$ can be seen
as deterministic risk-free rate. This is the general version of a deflator $\varphi$. To
make deflator pricing comparable to the previously introduced pricing principles
we assume that $d_0 = 1$, i.e. no time values are added to cash flows.
Fix $\varphi \in L^1(\Omega, \mathcal{F}, P)$ strictly positive with $d_0 = 1$ and assume that $\varphi$ and $S$ are
positively correlated. Then we can define the deflator based price by
$$\pi^{(0)}_\varphi = E[\varphi S] \ge E[\varphi]\,E[S] = E[S].$$
We use the upper index in $\pi^{(0)}_\varphi$ to indicate that we set $d_0 = 1$.
Thus, all random variables $S$ which are positively correlated with $\varphi$ receive a positive
premium loading. The next example shows that this is a generalization of the
Esscher premium, or more generally, it can be understood as a probability distortion
principle because $\varphi$ allows to define the equivalent probability measure $P^*$ by
the Radon-Nikodym derivative as follows
$$\frac{dP^*}{dP} = \varphi,$$
because $\varphi$ is a strictly positive density w.r.t. $P$ for $d_0 = 1$. Then, we price $S$ under
the equivalent probability measure $P^*$ by
$$\pi^{(0)}_\varphi = E[\varphi S] = E^*[S].$$
Example 6.27 (Esscher premium). Choose a random variable $S$ and define
$\varphi = M_S(\alpha)^{-1} \exp\{\alpha S\}$ for given $\alpha > 0$ with $M_S(\alpha) < \infty$. It follows that $\varphi$ is strictly
positive, P-a.s., and normalized. That is, $\varphi$ is a deflator with $d_0 = 1$. Due to
the FKG inequality, see Fortuin et al. [45], it follows that $\varphi$ and $S$ are positively
correlated, and thus
$$\pi^{(0)}_\varphi = E[\varphi S] \ge E[S].$$
Observe the identity
$$\pi^{(0)}_\varphi = E[\varphi S] = \frac{1}{M_S(\alpha)}\,E\left[e^{\alpha S} S\right] = \pi_\alpha,$$
which is exactly the Esscher premium $\pi_\alpha$, and $P^*$ is the Esscher measure
corresponding to $F_\alpha$, see Section 6.2.2.
The previous example shows that the deflator approach is a generalization of the
Esscher premium. The crucial point is that $\varphi$ and $S$ are positively correlated so that
we obtain a positive premium loading. Moreover, this deflator approach also allows
for stochastic discounting by choosing a deflator $\varphi$ with $E[\varphi] \in (0, 1)$, and
generalizations to multiperiod problems are easily possible and straightforward. For
more details we refer to Wüthrich-Merz [88] and Wüthrich et al. [86].
Example 6.28 (cost-of-capital loading with expected shortfall). This example treats
the expected shortfall risk measure. Assume $S \sim F$
with continuous distribution function $F$. The VaR on security level $1 - q \in (0, 1)$
is then given by $\mathrm{VaR}_{1-q}(S) = F^{\leftarrow}(1 - q)$, see Example 6.25. Note again that
$F(F^{\leftarrow}(1 - q)) = 1 - q$, see Chapter 1. Choose $r_{\mathrm{CoC}} \in (0, 1)$ and define the
probability distortion
$$\varphi = (1 - r_{\mathrm{CoC}}) + \frac{r_{\mathrm{CoC}}}{q}\,1_{\{S \ge \mathrm{VaR}_{1-q}(S)\}} > 0, \quad \text{P-a.s.}$$
This choice and the continuity of $F$ imply
$$E[\varphi] = (1 - r_{\mathrm{CoC}}) + \frac{r_{\mathrm{CoC}}}{q}\,P[S \ge \mathrm{VaR}_{1-q}(S)] = (1 - r_{\mathrm{CoC}}) + \frac{r_{\mathrm{CoC}}}{q}\,q = 1,$$
that is, we obtain the required normalization. The premium is then given by
$$\begin{aligned}
\pi^{(0)}_\varphi &= E\left[\left((1 - r_{\mathrm{CoC}}) + \frac{r_{\mathrm{CoC}}}{q}\,1_{\{S \ge \mathrm{VaR}_{1-q}(S)\}}\right) S\right] \\
&= (1 - r_{\mathrm{CoC}})\,E[S] + \frac{r_{\mathrm{CoC}}}{q}\,E\left[1_{\{S \ge \mathrm{VaR}_{1-q}(S)\}}\,S\right] \\
&= (1 - r_{\mathrm{CoC}})\,E[S] + r_{\mathrm{CoC}}\,E[S \,|\, S \ge \mathrm{VaR}_{1-q}(S)] \\
&= E[S] + r_{\mathrm{CoC}}\left(E[S \,|\, S \ge \mathrm{VaR}_{1-q}(S)] - E[S]\right) \\
&= E[S] + r_{\mathrm{CoC}}\,\mathrm{ES}_{1-q}(S - E[S]).
\end{aligned}$$
We conclude that we exactly obtain the cost-of-capital loading principle with
expected shortfall as risk measure, see Example 6.26.
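The deflator of Example 6.28 can be checked on a concrete distribution. The sketch below assumes S ~ Exp(1), which is not used in the text, so that E[S] = 1 and the expected shortfall at level 1 − q equals 1 − log q. Expectations are approximated on a midpoint grid of probabilities u with S = F←(u); the grid size `n` is an arbitrary choice.

```python
import math

def deflator_price_exponential(q, r_coc, n=200_000):
    """Quantile-grid approximation of E[phi] and E[phi * S] for S ~ Exp(1)."""
    weight_sum, price_sum = 0.0, 0.0
    for k in range(n):
        u = (k + 0.5) / n                 # midpoint of the k-th probability cell
        s = -math.log(1.0 - u)            # S = F^{<-}(u) for Exp(1)
        # phi = (1 - r_CoC) + (r_CoC / q) * 1{S >= VaR_{1-q}(S)}
        phi = (1.0 - r_coc) + (r_coc / q if u >= 1.0 - q else 0.0)
        weight_sum += phi
        price_sum += phi * s
    return weight_sum / n, price_sum / n

q, r_coc = 0.01, 0.06
mean_phi, price = deflator_price_exponential(q, r_coc)
expected_shortfall = 1.0 - math.log(q)                 # ES_{1-q}(S) for Exp(1)
coc_price = 1.0 + r_coc * (expected_shortfall - 1.0)   # E[S] + r_CoC * (ES - E[S])
print(mean_phi, price, coc_price)
```

The approximated mean of the deflator is 1 and the deflator based price matches the cost-of-capital price up to discretisation error.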


Chapter 7
Tariffication and Generalized Linear Models
Assume we have $v \in \mathbb{N}$ insurance policies denoted by $l = 1, \ldots, v$. These insurance
policies should be sufficiently similar such that we have a homogeneous insurance
portfolio to which the law of large numbers (LLN) applies, see (1.1). The ideal
case of i.i.d. risks justifies the charge of the same premium to every policy. If
there is no perfect homogeneity (and there never is) then there are two different
possibilities of charging a premium: (a) everyone pays the same premium which
reflects more the aspect of social insurance, where one tries to achieve a balance
between the rich and the poor; (b) the individual premium should reflect the quality
of the specific insurance policy, i.e. we try to calculate risk adjusted premia. In
the present chapter we try to achieve (b). We explain this with the compound
Poisson model at hand. The aggregation and the disjoint decomposition properties
of the compound Poisson model $S \sim \mathrm{CompPoi}(\lambda v, G)$, see Theorems 2.12 and 2.14,
suggest the consideration of the following identity
$$S = \sum_{i=1}^{N} Y_i = \sum_{l=1}^{v} \sum_{i=1}^{N^{(l)}} Y_i^{(l)} = \sum_{l=1}^{v} S_l,$$
where $S_l = \sum_{i=1}^{N^{(l)}} Y_i^{(l)}$ models the total claim amount of policy $l = 1, \ldots, v$. This
decoupling provides independent compound Poisson distributions $S_l$. That is, we
have $S_l \sim \mathrm{CompPoi}(\lambda_l, G_l)$, where we set volume $v_l = 1$, $\lambda_l > 0$ is the expected
number of claims of policy $l$ and $Y_i^{(l)} \sim G_l$ describes the claim size distribution of
policy $l$. This implies for the mean value of $S$ the following decomposition
$$E[S] = \sum_{l=1}^{v} E[S_l] = \sum_{l=1}^{v} \lambda_l E[Y_1^{(l)}] = \lambda E[Y_1] \sum_{l=1}^{v} \frac{\lambda_l E[Y_1^{(l)}]}{\lambda E[Y_1]} = \mu \sum_{l=1}^{v} \chi^{(l)},$$
where $\mu = E[S]/v = \lambda E[Y_1]$ is the average claim over all policies and $\chi^{(l)} > 0$
reflects the risk characteristics of policy $l = 1, \ldots, v$. This means that in the case
of heterogeneity we should determine the risk characteristics $\chi^{(l)}$ for every policy $l$
to obtain risk adjusted premia because these risk characteristics may differ.
To avoid over-parametrization and to have sufficient volumes for a LLN we choose
a fixed number, say $K$, of tariff criteria (like age, type of car, kilometers yearly
driven, place of living, etc.) such that the total portfolio is divided into sufficiently
homogeneous sub-portfolios (risk classes). Then we try to modify the overall average
claim $\mu = E[S]/v = \lambda E[Y_1]$ to these risk classes such that their prices become
a function of the risk characteristics in the $K$ tariff criteria.
For this exposition we assume that we have only two tariff criteria, i.e. $K = 2$, and
we would like to set up a multiplicative tariff.
The generalization to $K > 2$ is then straightforward.
Assume we have $K = 2$ tariff criteria. The first criterion has $i \in \{1, \ldots, I\}$ risk
characteristics and the second criterion has $j \in \{1, \ldots, J\}$ risk characteristics.
Thus, we have $I \cdot J$ different risk classes, see Table 7.1.
        |  1   ···   j   ···   J
   -----+------------------------
    1   |
    ⋮   |
    i   |   risk classes (i, j)
    ⋮   |
    I   |
Table 7.1: K = 2 tariff criteria with I and J risk characteristics, respectively.
We assume that policy $l$ belongs to risk class $(i, j)$, write $\chi^{(l)} = \chi^{(i,j)}$. This provides
$$E[S] = \mu \sum_{i,j} v_{i,j}\,\chi^{(i,j)},$$
where $v_{i,j}$ denotes the number of policies belonging to risk class $(i, j)$. Our aim is
to set up a multiplicative tariff structure for these $K = 2$ tariff criteria, i.e. we assume
$$\chi^{(i,j)} = \chi_{1,i}\,\chi_{2,j}, \tag{7.1}$$
where $\chi_{k,l_k}$ describes the specifics of criterion $k$ if it has risk characteristics $l_k$.
Example 7.1 (multiplicative tariff). A classical example in car insurance is the
following: choose as tariff criteria the “kilometers yearly driven” and the “years
driven without an accident”.

1st tariff criterion $\chi_{1,i}$: “kilometers yearly driven”

2nd tariff criterion $\chi_{2,j}$: “years driven without an accident” (bonus-malus level)
Observe that the 1st tariff criterion is continuous, but typically it is discretized for
having finitely many risk characteristics, see Table 7.2 for an example.
                      no accident:  0 years  1 year  2 years  3 years  4 years  5 years  6+ years
 yearly km    χ1,·    χ2,·:         1.2      1.1     1.0      0.9      0.8      0.7      0.5
 0-10’000     0.8
 10-15’000    0.9
 15-20’000    1.0
 20-25’000    1.1                   χ(4,5) = χ1,4 χ2,5 = 1.1 · 0.8 = 0.88
 25’000+      1.2
Table 7.2: Tariffication scheme for K = 2 tariff criteria.

We have K = 2 tariff criteria. Criterion k = 1 has I = 5 risk characteristics and
criterion k = 2 has J = 7 risk characteristics. This gives I · J = 35 risk classes
(i, j) for i ∈ {1, . . . , I} and j ∈ {1, . . . , J}.
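The multiplicative structure (7.1) can be illustrated with a minimal sketch that looks up the two relativities of Table 7.2 and multiplies them. The function names and the base premium passed to `premium` are hypothetical and only serve the illustration.

```python
# Relativities copied from Table 7.2; all names below are illustrative assumptions.
chi1 = {"0-10'000": 0.8, "10-15'000": 0.9, "15-20'000": 1.0,
        "20-25'000": 1.1, "25'000+": 1.2}                      # yearly km
chi2 = {0: 1.2, 1: 1.1, 2: 1.0, 3: 0.9, 4: 0.8, 5: 0.7, 6: 0.5}  # accident-free years, 6 means 6+


def tariff_factor(km_class, years):
    """Return chi^(i,j) = chi_{1,i} * chi_{2,j} as in (7.1)."""
    return chi1[km_class] * chi2[min(years, 6)]


def premium(mu, km_class, years):
    """Price of a risk class for a given (assumed) overall average claim mu."""
    return mu * tariff_factor(km_class, years)


factor_45 = tariff_factor("20-25'000", 4)   # the cell (4, 5) highlighted in Table 7.2
n_classes = len(chi1) * len(chi2)           # I * J risk classes
```

The lookup reproduces the highlighted cell 1.1 · 0.8 = 0.88 and the count of I · J = 35 risk classes.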
The general aim is to determine tariff criteria such that they give sufficiently large
homogeneous risk classes. These risk classes are then priced by choosing appropri-
ate multiplicative pricing factors χk,lk .
Remarks.
• A prior choice of tariff criteria should be done using expert opinion. Statisti-
cal analysis should then select as few as possible significant criteria. However,
also market specifications of competitors are important to avoid adverse se-
lection.
• Related to the first item: the aim should be to build homogeneous risk classes
of sufficient volume such that LLN applies and we get statistical significance.
• Variable reduction techniques and multivariate statistical analysis need to be

applied to avoid an “over-correction” of dependent factors, e.g. in the rather
trivial example above, the relation between the factors is not immediately
clear: it could be that “kilometers yearly driven” is strongly related to “years
driven without an accident”. If this is the case we might correct twice for the
same factor.

• We consider a bivariate model using simple methods for categorical risk

classes and will then go over to more sophisticated models using generalized
linear model (GLM) techniques.
Assume we have two tariff criteria i and j which give I · J risk classes. Our aim
is to find appropriate multiplicative pricing factors χ1,i , i ∈ {1, . . . , I}, and χ2,j ,
j ∈ {1, . . . , J}, which describe the risk classes (i, j) according to the multiplicative
tariff structure (7.1).
We define by Si,j the total claim of risk class (i, j) and by vi,j the corresponding
volume with
\[ \sum_{i,j} v_{i,j} = v \quad\text{and}\quad \sum_{i,j} S_{i,j} = S. \]
This implies that we need to study

\[ E[S_{i,j}] = \frac{v_{i,j}}{v}\, E[S]\, \chi^{(i,j)} = v_{i,j}\, \mu\, \chi_{1,i}\, \chi_{2,j}, \]
where µ = λE[Y1 ] is the average claim per policy over the whole portfolio v,
i.e. E[S] = vµ.
7.1 Simple tariffication methods

We start with the method of Robert A. Bailey & LeRoy J. Simon [10] which was introduced in 1960 for rate-making. The method of Bailey & Simon is rather simple and it is not directly motivated by a stochastic model which considers the claim $S_{i,j}$ of risk class $(i,j)$ in a consistent way. It specifies $\mu$, $\chi_{1,i}$ and $\chi_{2,j} > 0$ such that the following expression is minimized
\[ X^2 = \sum_{i,j} \frac{\left(S_{i,j} - v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j}\right)^2}{v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j}}. \tag{7.2} \]
[Photo: L.J. Simon (right)]
The motivation behind this approach is that $X^2$ describes the test statistic of the $\chi^2$-goodness-of-fit test, see (3.8). This test rejects a model if $X^2$ exceeds the quantile of a $\chi^2$-distributed random variable on a certain significance level. Therefore, the aim is to choose the parameters such that $X^2$ becomes as small as possible.
[Photo: R.A. Bailey]
Note that this approach is not based on a stochastic model; it is just based on a statistical argument. Moreover, it has the following unpleasant feature.
Lemma 7.2. The minimizers of (7.2) have a (systematic) positive bias.

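Proof. We denote the minimizers of (7.2) by $\hat\mu$, $\hat\chi_{1,i}$ and $\hat\chi_{2,j}$. We would like to prove that
\[ \sum_{i,j} v_{i,j}\,\hat\mu\,\hat\chi_{1,i}\,\hat\chi_{2,j} \;\ge\; \sum_{i,j} S_{i,j} = S. \]
This can either be done by first summing over rows $i$ or columns $j$. Note that $\hat\chi_{2,j}$ is found by
\[ 0 \overset{!}{=} \frac{\partial X^2}{\partial \chi_{2,j}} = \sum_i \frac{\partial}{\partial \chi_{2,j}} \frac{\left(S_{i,j} - v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j}\right)^2}{v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j}}. \]
This provides estimates
\[ \hat\chi_{2,j} = \left( \frac{\sum_i S_{i,j}^2 / (v_{i,j}\,\hat\mu\,\hat\chi_{1,i})}{\sum_i v_{i,j}\,\hat\mu\,\hat\chi_{1,i}} \right)^{1/2}. \]
If we sum over $i$ and plug in the estimates $\hat\chi_{2,j}$ we obtain
\[ \sum_i v_{i,j}\,\hat\mu\,\hat\chi_{1,i}\,\hat\chi_{2,j} = \left( \sum_i v_{i,j}\,\hat\mu\,\hat\chi_{1,i} \right)^{1/2} \left( \sum_i \frac{S_{i,j}^2}{v_{i,j}\,\hat\mu\,\hat\chi_{1,i}} \right)^{1/2}. \]
Next we apply Schwarz' inequality to the terms on the right-hand side which provides the following lower bound
\[ \sum_i v_{i,j}\,\hat\mu\,\hat\chi_{1,i}\,\hat\chi_{2,j} \;\ge\; \sum_i \left( v_{i,j}\,\hat\mu\,\hat\chi_{1,i} \right)^{1/2} \left( \frac{S_{i,j}^2}{v_{i,j}\,\hat\mu\,\hat\chi_{1,i}} \right)^{1/2} = \sum_i S_{i,j}. \]
Summing over the columns $j$ proves the claim. □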
Example 7.3 (method of Bailey & Simon). We choose an example with two tariff
criteria. The first one specifies whether the car is owned or leased, the second
one specifies the age of the driver. For simplicity we set vi,j ≡ 1 and we aim to
determine the tariff factors µ, χ1,i and χ2,j . The method of Bailey & Simon then
requires minimization of
\[ X^2 = \sum_{i,j} \frac{\left(S_{i,j} - \mu\,\chi_{1,i}\,\chi_{2,j}\right)^2}{\mu\,\chi_{1,i}\,\chi_{2,j}}. \]
Note that we need to initialize the estimators for obtaining a unique solution. We set $\hat\mu = 1$ and $\hat\chi_{1,1} = 1$. The observations $S_{i,j}$ are given by, see also Figure 7.1,
21-30y 31-40y 41-50y 51-60y

owned 1300 1200 1000 1200
leased 1800 1300 1300 1500
We have $I \cdot J = 2 \cdot 4 = 8$ risk classes $(i,j)$ and observations $S_{i,j}$. The number of parameters to be estimated is $I + J - 1 = 5$ (taking into account the initializations $\hat\mu = \hat\chi_{1,1} = 1$). Minimizing $X^2$ numerically provides the following multiplicative tariff structure for $\chi_{1,i}$, $i \in \{1, 2\}$, and $\chi_{2,j}$, $j \in \{1, \ldots, 4\}$.

[Scatter plot: claim amount (vertical axis) against the age classes 21-30y, 31-40y, 41-50y, 51-60y (horizontal axis); markers L = leased, O = owned.]
Figure 7.1: Observations Si,j .
           21-30y   31-40y   41-50y   51-60y   $\hat\chi_{1,i}$
owned      1376     1112     1020     1197     1.0000
leased     1727     1395     1280     1503     1.2548
$\hat\chi_{2,j}$   1376     1112     1020     1197
In this example we have (systematic) positive bias as stated in Lemma 7.2, i.e.
\[ \sum_{i,j} \hat\chi_{1,i}\,\hat\chi_{2,j} = 10'611 \;>\; 10'600 = \sum_{i,j} S_{i,j} = S. \]
The method of Robert A. Bailey & Jan Jung (1922-2005) [9, 56] aims to remove the positive bias of the previous method, see Lemma 7.2 and Example 7.3. It is still a simple method that is not directly motivated by a stochastic model; however, we will see below that it is grounded in a stochastic model. It imposes unbiasedness of rows and columns by definition: Choose $\mu$, $\chi_{1,i}$ and $\chi_{2,j} > 0$ such that the rows $i$ and columns $j$ satisfy
\[ \sum_{j=1}^{J} v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j} = \sum_{j=1}^{J} S_{i,j}, \tag{7.3} \]
\[ \sum_{i=1}^{I} v_{i,j}\,\mu\,\chi_{1,i}\,\chi_{2,j} = \sum_{i=1}^{I} S_{i,j}. \tag{7.4} \]
[Photo: J. Jung]

Remarks.
• This method is also called method of total marginal sums.
• It is more robust than the method of Bailey & Simon.
• If Si,j are independent Poisson distributed with cross-classified means, then

the above system is exactly the MLE system that needs to be solved. We
will discuss this in Section 7.3.1 below.
• Both the method of Bailey & Simon and the method of Bailey & Jung are
rather pragmatic methods because they are not directly based on a stochastic
model. Therefore, in the remainder of this chapter we are going to describe
more sophisticated methods from a probabilistic point of view.
Example 7.4 (method of Bailey & Jung, method of total marginal sums). We
revisit the data of Example 7.3. This time we determine the parameters by solving
the system (7.3)-(7.4). This needs to be done numerically and provides the following
multiplicative tariff structure:
           21-30y   31-40y   41-50y   51-60y   $\hat\chi_{1,i}$
owned      1375     1108     1020     1197     1.0000
leased     1725     1392     1280     1503     1.2553
$\hat\chi_{2,j}$   1375     1108     1020     1197
We conclude that both methods give similar results for this example.
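A minimal sketch of how the system (7.3)-(7.4) may be solved numerically: the column equations are solved for $\chi_{2,j}$, the row equations for $\chi_{1,i}$, and the two steps are iterated. The fixed-point scheme, the function name and the normalisation $\mu = \chi_{1,1} = 1$ inside the loop are assumptions made for this illustration.

```python
# Data of Example 7.3 with volumes v_{i,j} = 1.
S = [[1300.0, 1200.0, 1000.0, 1200.0],
     [1800.0, 1300.0, 1300.0, 1500.0]]
v = [[1.0] * 4, [1.0] * 4]
I, J = len(S), len(S[0])


def marginal_sums(S, v, n_iter=100):
    """Iterate (7.4) for chi_2 and (7.3) for chi_1, with mu = 1 and chi_{1,1} = 1."""
    chi1 = [1.0] * I
    chi2 = [1.0] * J
    for _ in range(n_iter):
        chi2 = [sum(S[i][j] for i in range(I)) /
                sum(v[i][j] * chi1[i] for i in range(I)) for j in range(J)]
        chi1 = [sum(S[i]) / sum(v[i][j] * chi2[j] for j in range(J))
                for i in range(I)]
        c = chi1[0]                      # move the scale from chi_1 into chi_2
        chi1 = [x / c for x in chi1]
        chi2 = [x * c for x in chi2]
    return chi1, chi2


chi1, chi2 = marginal_sums(S, v)
row_sums = [sum(v[i][j] * chi1[i] * chi2[j] for j in range(J)) for i in range(I)]
col_sums = [sum(v[i][j] * chi1[i] * chi2[j] for i in range(I)) for j in range(J)]
```

For this complete table with unit volumes the solution is explicit: the factors are proportional to the row and column totals, so $\hat\chi_{1,2} = 5900/4700$.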
7.2 Gaussian approximation

7.2.1 Maximum likelihood estimation
In the previous section we have presented two pragmatic tariffication methods. In

this section we give a more advanced method, in the sense that we use an explicit
stochastic model. However, the approach is still pragmatic because the stochastic
model is assumed to be a good approximation to the true tariffication problem.
We consider the claims ratio defined by
Ri,j = Si,j /vi,j .
We have expected value for this claim ratio given by
E[Ri,j ] = µ χ1,i χ2,j .
We use two simple facts:
1. The simplest absolutely continuous distribution is the Gaussian distribution.

2. Taking logarithms turns products into sums.
Combining these two items suggests considering the following model
\[ X_{i,j} \overset{\text{def.}}{=} \log R_{i,j} \sim \mathcal{N}\left(\beta_0 + \beta_{1,i} + \beta_{2,j},\; \sigma^2\right). \]
Thus, taking logarithms may turn the multiplicative tariff structure into an additive
structure. If this logarithm Xi,j of Ri,j has a Gaussian distribution we have nice
mathematical properties. Therefore, we assume a log-normal distribution for Ri,j
which hopefully gives a good approximation to the true tariffication problem. These choices imply for the first two moments
\[ E[R_{i,j}] = e^{\beta_0 + \sigma^2/2}\, e^{\beta_{1,i}}\, e^{\beta_{2,j}} \quad\text{and}\quad \operatorname{Var}(R_{i,j}) = E[R_{i,j}]^2 \left(e^{\sigma^2} - 1\right). \]
Observe that the mean has the right multiplicative structure, set $\mu = e^{\beta_0 + \sigma^2/2}$, $\chi_{1,i} = e^{\beta_{1,i}}$ and $\chi_{2,j} = e^{\beta_{2,j}}$. However, the distributional properties are rather
different from compound models. Nevertheless, this log-linear Gaussian structure
is often used for tariffication because of its nice mathematical structure and because
popular statistical methods can be applied.
Set M = I · J and define for Xi,j = log Ri,j = log(Si,j /vi,j ) the vector
\[ X = (X_1, \ldots, X_M)' = (X_{1,1}, \ldots, X_{1,J}, \ldots, X_{I,1}, \ldots, X_{I,J})' \in \mathbb{R}^M. \tag{7.5} \]
Note that we change the enumeration of the observations because this is going to
be simpler in the sequel. Index m always refers to
m = m(i, j) = (i − 1)J + j ∈ {1, . . . , M = I · J}. (7.6)
We assume that X has a multivariate Gaussian distribution
X ∼ N (Zβ, Σ) , (7.7)
with diagonal covariance matrix $\Sigma = \sigma^2\, \mathrm{diag}(w_1, \ldots, w_M)$, parameter vector
\[ \beta = (\beta_0, \beta_{1,2}, \ldots, \beta_{1,I}, \beta_{2,2}, \ldots, \beta_{2,J})' \in \mathbb{R}^{I+J-1}, \]
and design matrix $Z \in \mathbb{R}^{M \times (I+J-1)}$ such that for $m = m(i,j)$
\[ E[X_{i,j}] = (Z\beta)_m = \beta_0 + \beta_{1,i} + \beta_{2,j}. \]
Throughout we assume that $Z$ has full rank. We initialize $\beta_{1,1} = \beta_{2,1} = 0$ and $\beta_0$ plays the role of the intercept. At the moment the weights $w_m$ do not have a natural meaning; often one sets $w_m = v_{i,j}^{-1}$ (inversely proportional to the underlying volume).
In view of Example 7.3 this gives the following table where the “1”s show to which classes each observation belongs:

m       owned   leased   21-30y   31-40y   41-50y   51-60y   X
1       1       0        1        0        0        0        7.17
2       1       0        0        1        0        0        7.09
3       1       0        0        0        1        0        6.91
4       1       0        0        0        0        1        7.09
5       0       1        1        0        0        0        7.50
6       0       1        0        1        0        0        7.17
7       0       1        0        0        1        0        7.17
8 = M   0       1        0        0        0        1        7.31
This table needs to be turned into the appropriate form so that it fits to (7.7).
Therefore we need to drop the columns “owned” and “21-30y” because of the
chosen normalization β1,1 = β2,1 = 0. This provides the following table:
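Columns of $Z$: intercept, leased, 31-40y, 41-50y, 51-60y.
\[ Z\beta = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_{1,2} \\ \beta_{2,2} \\ \beta_{2,3} \\ \beta_{2,4} \end{pmatrix} \]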
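The construction of the design matrix can be sketched for general $I$ and $J$. The following is a minimal illustration of the enumeration (7.6) and of the normalization $\beta_{1,1} = \beta_{2,1} = 0$; the function names are hypothetical.

```python
I, J = 2, 4  # owned/leased and four age classes, as in Example 7.3


def m_index(i, j, J):
    """Enumeration (7.6): m = (i - 1) J + j for 1-based i and j."""
    return (i - 1) * J + j


def design_matrix(I, J):
    """Rows ordered by m(i, j); columns: intercept, beta_{1,2..I}, beta_{2,2..J}."""
    Z = []
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            row = ([1]
                   + [int(i == k) for k in range(2, I + 1)]   # first criterion, level 1 dropped
                   + [int(j == k) for k in range(2, J + 1)])  # second criterion, level 1 dropped
            Z.append(row)
    return Z


Z = design_matrix(I, J)
```

Dropping the first level of each criterion is what keeps $Z$ at full rank $I + J - 1$.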
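Under assumption (7.7) we know that $X$ has density
\[ f(x) = \frac{1}{(2\pi)^{M/2}\, |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - Z\beta)'\, \Sigma^{-1} (x - Z\beta) \right\}. \]
This allows for the calculation of the MLE $\hat\beta^{\text{MLE}}$ of the parameter vector $\beta$:
\[ \hat\beta^{\text{MLE}} = \left( Z' \Sigma^{-1} Z \right)^{-1} Z' \Sigma^{-1} X. \tag{7.8} \]
The tariff factors can then be estimated by (avoiding the variance correction term)
\[ \hat\mu = \exp\left\{ \hat\beta_0^{\text{MLE}} \right\}, \qquad \hat\chi_{1,i} = \exp\left\{ \hat\beta_{1,i}^{\text{MLE}} \right\} \qquad\text{and}\qquad \hat\chi_{2,j} = \exp\left\{ \hat\beta_{2,j}^{\text{MLE}} \right\}. \]
If we have homoscedasticity, i.e. if we assume identical weights $w_{i,j} \equiv w$ and $\Sigma = \sigma^2 w \mathbf{1}$ (with $\mathbf{1}$ the identity matrix), then the estimator of $\beta$ is given by $\hat\beta^{\text{MLE}} = (Z'Z)^{-1} Z' X$.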
Example 7.5 (log-linear model). We use the data $S_{i,j}$ from Example 7.3. Assume $v_{i,j} \equiv 1$ and initialize $\hat\mu = 1$ and $\hat\chi_{1,1} = 1$. The log-linear MLE formula (7.8) provides the following multiplicative tariff structure:

           21-30y   31-40y   41-50y   51-60y   $\hat\chi_{1,i}$
owned      1368     1117     1020     1200     1.0000
leased     1710     1396     1274     1500     1.2495
$\hat\chi_{2,j}$   1368     1117     1020     1200
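The homoscedastic estimator $(Z'Z)^{-1}Z'X$ can be evaluated directly for this example. The sketch below assumes NumPy is available and follows the convention of the example that $\hat\mu = 1$, so the intercept is absorbed into $\hat\chi_{2,j}$; variable names are illustrative.

```python
import numpy as np

# Observations S_{i,j} of Example 7.3 (rows: owned, leased), v_{i,j} = 1.
S = np.array([[1300.0, 1200.0, 1000.0, 1200.0],
              [1800.0, 1300.0, 1300.0, 1500.0]])
I, J = S.shape
X = np.log(S).reshape(-1)            # row-major flattening matches m = (i - 1) J + j

# Design matrix with columns: intercept, leased, 31-40y, 41-50y, 51-60y.
Z = np.zeros((I * J, I + J - 1))
Z[:, 0] = 1.0
for i in range(I):
    for j in range(J):
        m = i * J + j
        if i > 0:
            Z[m, i] = 1.0
        if j > 0:
            Z[m, I - 1 + j] = 1.0

beta = np.linalg.solve(Z.T @ Z, Z.T @ X)                    # (Z'Z)^{-1} Z'X
chi1 = np.exp(np.concatenate(([0.0], beta[1:I])))           # chi_{1,1} = 1
chi2 = np.exp(beta[0] + np.concatenate(([0.0], beta[I:])))  # intercept absorbed (mu = 1)
fitted = np.outer(chi1, chi2)
```

Solving the normal equations with `np.linalg.solve` mirrors formula (7.8); for ill-conditioned designs a least-squares routine would be the more robust choice.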
We compare the results from the method of Bailey & Simon, the method of total
marginal sums (Bailey & Jung) and the log-linear MLE method.
[Figure: bar charts for the panels “owned” and “leased” over the age classes 21-30y, 31-40y, 41-50y and 51-60y, comparing the observed values with the fits of Bailey & Simon, marginal sums and linear regression.]
• We see that in this example all three methods provide similar results.
• Observe: the risk class (owned, 21-30y) is punished by the bad performance of (leased, 21-30y) and vice versa. A similar remark holds true for risk class (leased, 31-40y).
Remarks.
• The multiplicative tariff construction above has used the design matrix $Z = (z_{m,k})_{m,k} \in \mathbb{R}^{M \times (I+J-1)}$ which was generated by categorical variables. Categorical variables allow us to group observations into disjoint risk categories.
• Binary variables are a special case of categorical variables that can only have two specifications, 1 for true and 0 for false. Recall that all our $z_{m,k} \in \{0, 1\}$. E.g., the observation $S_{i,j}$ either belongs to the class “owned” or to the class “leased”.
• Often the linear regression model X = Zβ + ε with ε ∼
N (0, Σ) is introduced for continuous variables (zn,k )n,k
which generate the design matrix Z. E.g. if there is a
(clear) functional relationship between the age and the
tariff criterion χ2 , for instance if χ2 is a linear function
of age, then variable zn,k ∈ R+ modeling age is directly
reflecting this relationship (linear regression). For more
on this subject we refer to Frees [46], for the present dis-
cussion we concentrate on binary variables.

7.2.2 Goodness-of-fit analysis

Compared to the methods in the previous section, the log-linear MLE formula (7.8)
has the advantage that we can apply classical statistical methods for a goodness-of-
fit test and for variable selection/reduction. We introduce this statistical language.
For this discussion we assume homoscedasticity, i.e. identical weights
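\[ w_{i,j} = w \quad\text{and}\quad \Sigma = \sigma^2 w \mathbf{1}, \]
which implies $\hat\beta^{\text{MLE}} = (Z'Z)^{-1} Z' X$. The general case is treated in the next section.
We introduce the total sum of squares (the first and last equalities are definitions)
\[ SS_{\text{tot}} = \sum_m \left( X_m - \bar{X} \right)^2 = \sum_m \left( \hat{X}_m - \bar{X} \right)^2 + \sum_m \left( X_m - \hat{X}_m \right)^2 = SS_{\text{reg}} + SS_{\text{err}}, \tag{7.9} \]
with $\bar{X} = \frac{1}{M} \sum_{m=1}^{M} X_m$ and $\hat{X} = Z \hat\beta^{\text{MLE}}$.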
• SStot is the total difference between observations Xm and the sample mean
X without knowing the explaining variables Z.
• SSreg is the difference explained by the explaining variables Z.

• SSerr is the residual difference not explained by the regression.
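Proof of (7.9). We rewrite the total sum of squares $SS_{\text{tot}}$ in vector notation. Therefore we define
\[ \hat\varepsilon = X - Z\hat\beta^{\text{MLE}} = X - \hat{X} \quad\text{and}\quad \bar{\mathbf{X}} = \bar{X}\,(1, \ldots, 1)'. \tag{7.10} \]
We calculate
\[ X'X = (\hat{X} + \hat\varepsilon)'(\hat{X} + \hat\varepsilon) = \hat{X}'\hat{X} + 2\,\hat{X}'\hat\varepsilon + \hat\varepsilon'\hat\varepsilon. \]
The MLE $\hat\beta^{\text{MLE}}$ minimizes in the homoscedastic case $(X - Z\beta)'(X - Z\beta)$ and thus we have
\[ 0 = Z'(X - Z\hat\beta^{\text{MLE}}) = Z'\hat\varepsilon, \tag{7.11} \]
and as a consequence
\[ \hat{X}'\hat\varepsilon = (Z\hat\beta^{\text{MLE}})'\,\hat\varepsilon = 0. \]
This implies
\[ X'X = \hat{X}'\hat{X} + \hat\varepsilon'\hat\varepsilon. \]
We subtract on both sides $\bar{\mathbf{X}}'\bar{\mathbf{X}}$ to obtain
\[ SS_{\text{tot}} = X'X - \bar{\mathbf{X}}'\bar{\mathbf{X}} = \hat\varepsilon'\hat\varepsilon + \hat{X}'\hat{X} - \bar{\mathbf{X}}'\bar{\mathbf{X}} = SS_{\text{err}} + SS_{\text{reg}}, \]
where for the last step we need to observe that the intercept $\beta_0$ is contained in every row of the design matrix $Z$, therefore the first column in $Z$ is equal to $(1, \ldots, 1)'$. This and (7.11) imply $0 = (1, \ldots, 1)'\hat\varepsilon = \sum_m X_m - \sum_m \hat{X}_m$. This treats the cross-product terms leading to $\hat{X}'\hat{X} - \bar{\mathbf{X}}'\bar{\mathbf{X}} = SS_{\text{reg}}$. This proves (7.9). □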

We define the coefficient of determination $R^2$ by
\[ R^2 = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{err}}}{SS_{\text{tot}}} \in [0, 1]. \]
This is the ratio of the sum of squares $SS_{\text{reg}}$ explained by the regression to the total sum of squares $SS_{\text{tot}}$. If the model explains the structure well then $R^2$ should be close to 1, because $\hat{X}$ is able to explain the underlying structure.
For Example 7.5 we obtain $R^2 = 0.9202$ which speaks in favor of this model explaining the data $S_{i,j}$.
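The decomposition (7.9) and the coefficient of determination can be checked numerically. The sketch below assumes NumPy and uses the two-decimal values of $X = \log S$ as printed in the table after (7.7); with this rounded input the quoted value $R^2 = 0.9202$ appears to be reproduced. Variable names are illustrative.

```python
import numpy as np

# Two-decimal values of X as printed in the notes (an assumption about the input used).
X = np.array([7.17, 7.09, 6.91, 7.09, 7.50, 7.17, 7.17, 7.31])
# Design matrix: intercept, leased, 31-40y, 41-50y, 51-60y.
Z = np.array([[1, 0, 0, 0, 0], [1, 0, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
              [1, 1, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]],
             dtype=float)

beta = np.linalg.lstsq(Z, X, rcond=None)[0]   # homoscedastic MLE (Z'Z)^{-1} Z'X
X_hat = Z @ beta

ss_tot = np.sum((X - X.mean()) ** 2)
ss_reg = np.sum((X_hat - X.mean()) ** 2)
ss_err = np.sum((X - X_hat) ** 2)
r2 = ss_reg / ss_tot
```

The identity `ss_tot = ss_reg + ss_err` relies on the intercept column in `Z`, exactly as in the proof of (7.9).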
Residual standard deviation σ: For further analysis we also need the residual standard deviation $\sigma$. It is estimated (in the homoscedastic case) by
\[ \hat\sigma^2 = \frac{1}{M} \sum_m \left( X_m - \hat{X}_m \right)^2 = \frac{\hat\varepsilon'\hat\varepsilon}{M} = \frac{SS_{\text{err}}}{M}, \]
where $\hat\varepsilon$ was defined in (7.10). Define $r = I + J - 2$, i.e. the dimension of parameter $\beta$ is $r + 1$. $\hat\sigma^2$ is the MLE for $\sigma^2$ and $M\hat\sigma^2$ is distributed as $\sigma^2 \chi^2_{M-r-1}$, see, for instance, Section 7.4 in Johnson-Wichern [55]. Often, one also considers the unbiased variance parameter estimator
\[ s^2 = \frac{M}{M - r - 1}\, \hat\sigma^2. \]
Revisiting Example 7.5, we have $M = 8$ observations, $r + 1 = 5$ parameters and hence $df = M - r - 1 = 3$ degrees of freedom. In our case we obtain the residual standard error $s = 0.07447$, which corresponds to $\hat\sigma = s\sqrt{3/8} \approx 0.0456$.
Likelihood ratio test: Finally, we would like to see whether we need to include
a specific parameter βk,lk .
We have an $(r + 1) = (I + J - 1)$-dimensional parameter vector given by
\[ \beta = (\beta_0, \beta_{1,2}, \ldots, \beta_{1,I}, \beta_{2,2}, \ldots, \beta_{2,J})' \in \mathbb{R}^{r+1}. \]
Note that the model is, of course, invariant under permutation of parameters and components. Therefore, we can choose any specific ordering and to simplify notation we define
\[ \beta = (\beta_0, \beta_1, \ldots, \beta_r)' \in \mathbb{R}^{r+1}, \tag{7.12} \]
so that we have the ordering of components that is appropriate for the next layout.

Null hypothesis H0 : β0 = . . . = βp−1 = 0 for given p < r + 1.

1. Calculate the residual differences $SS^{\text{full}}_{\text{err}}$ and $\hat\sigma$ in the full model with $(r+1)$-dimensional parameter vector $\beta \in \mathbb{R}^{r+1}$.
2. Calculate the residual differences $SS^{H_0}_{\text{err}}$ in the reduced model $(\beta_p, \ldots, \beta_r)' \in \mathbb{R}^{r+1-p}$.
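We calculate the likelihood ratio $\Lambda$. Therefore, we denote the design matrix of the reduced model by $Z_0$. Then it is given by
\[ \begin{aligned}
\Lambda = \frac{\hat{f}_{H_0}(X)}{\hat{f}_{\text{full}}(X)}
&= \frac{\hat\sigma_{H_0}^{-M} \exp\left\{ -\frac{1}{2\hat\sigma_{H_0}^2} (X - Z_0 \hat\beta^{\text{MLE}}_{H_0})'(X - Z_0 \hat\beta^{\text{MLE}}_{H_0}) \right\}}{\hat\sigma_{\text{full}}^{-M} \exp\left\{ -\frac{1}{2\hat\sigma_{\text{full}}^2} (X - Z \hat\beta^{\text{MLE}}_{\text{full}})'(X - Z \hat\beta^{\text{MLE}}_{\text{full}}) \right\}} \\
&= \left( \frac{SS^{H_0}_{\text{err}}/M}{SS^{\text{full}}_{\text{err}}/M} \right)^{-M/2}
= \left( \frac{SS^{H_0}_{\text{err}}}{SS^{\text{full}}_{\text{err}}} \right)^{-M/2}
= \left( 1 + \frac{SS^{H_0}_{\text{err}} - SS^{\text{full}}_{\text{err}}}{SS^{\text{full}}_{\text{err}}} \right)^{-M/2}.
\end{aligned} \tag{7.13} \]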
The likelihood ratio test rejects the null hypothesis $H_0$ for small values of $\Lambda$. This is equivalent to rejection for large values of $(SS^{H_0}_{\text{err}} - SS^{\text{full}}_{\text{err}})/SS^{\text{full}}_{\text{err}}$.
This motivates to consider the test statistic
\[ F = \frac{SS^{H_0}_{\text{err}} - SS^{\text{full}}_{\text{err}}}{SS^{\text{full}}_{\text{err}}} \cdot \frac{M - r - 1}{p} = \frac{SS^{H_0}_{\text{err}} - SS^{\text{full}}_{\text{err}}}{p\, s^2_{\text{full}}}. \tag{7.14} \]
$F$ has an $F$-distribution with degrees of freedom given by $df_1 = p$ and $df_2 = M - r - 1$, see Result 7.6 in Johnson-Wichern [55] or (4.2) in Frees [46]. Therefore, we reject the null hypothesis $H_0$ on the significance level $1 - \alpha$ if
\[ F > F_{p, M-r-1}(\alpha), \tag{7.15} \]
where the latter denotes the $\alpha$ quantile of the $F$-distribution with degrees of freedom $df_1$ and $df_2$. The heteroscedastic case is given in (7.20).
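The test statistic (7.14) can be computed from two least-squares fits. The sketch below assumes NumPy, again uses the two-decimal values of $X$ printed in the notes, and tests the hypothesis that all parameters except the intercept vanish ($p = 4$). The helper name `ss_err` is hypothetical; the quantile in (7.15) is not computed here and would need an $F$-distribution routine (for instance `scipy.stats.f`, if SciPy is available).

```python
import numpy as np

X = np.array([7.17, 7.09, 6.91, 7.09, 7.50, 7.17, 7.17, 7.31])  # rounded log S (assumption)
Z_full = np.array([[1, 0, 0, 0, 0], [1, 0, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
                   [1, 1, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]],
                  dtype=float)
Z_0 = Z_full[:, :1]                       # reduced model: intercept only


def ss_err(Z, X):
    """Residual sum of squares of the least-squares fit of X on Z."""
    beta = np.linalg.lstsq(Z, X, rcond=None)[0]
    return float(np.sum((X - Z @ beta) ** 2))


M, n_par = Z_full.shape                   # n_par = r + 1
p = n_par - Z_0.shape[1]                  # number of parameters set to zero under H0
ss_full, ss_h0 = ss_err(Z_full, X), ss_err(Z_0, X)

s2_full = ss_full / (M - n_par)           # unbiased variance estimate s^2
F = (ss_h0 - ss_full) / (p * s2_full)     # test statistic (7.14)
s_full = s2_full ** 0.5
```

With this input the sketch gives a residual standard error of about 0.0745 on 3 degrees of freedom and an $F$ value of about 8.65 for $df_1 = 4$ and $df_2 = 3$.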

Example 7.6 (regression model, revisited). We revisit Example 7.5.
In Figure 7.2 we give the R output of the command lm.
• The lines Call give the MLE problem to be solved.
• The lines Residuals display $\hat\varepsilon$.
• The lines Coefficients give the MLE for the parameters $\beta_0$ (intercept), $\beta_{1,2}$ (leased) and $\beta_{2,2}, \ldots, \beta_{2,4}$. For these parameters a standard estimation error is calculated and a t-test is applied to each parameter individually to check whether it is different from zero, see formula (7.14) in Johnson-Wichern [55]. From this analysis we see that we might only question $\beta_{2,4}$ because of its large p-value of 0.1675; the other parameters are well justified by the observations.

Figure 7.2: R output of Example 7.5 using R command lm.
• The bottom lines then display the residual standard error $s = 0.07447$ on $df = 3$ degrees of freedom, the coefficient of determination $R^2 = 0.9202$, and the adjusted coefficient of determination $R_a^2$ which corrects for the degrees of freedom
\[ R_a^2 = 1 - \frac{SS_{\text{err}}}{SS_{\text{tot}}} \cdot \frac{M - 1}{M - r - 1}. \]
• The final line displays an F test statistic (7.14) of value 8.653 for df1 = 4 and df2 = 3 for dropping all variables except for the intercept β0. This gives a p-value of 5.36% which says that the null hypothesis is just about to be rejected on the 5% significance level and we stay with the full model.
• For the reduction of the variable “owned” or “leased” we obtain an F test statistic of 18.36 for df1 = 1 and df2 = 3. This gives a p-value of 2.34% which says that we reject the null hypothesis of setting β1,2 = 0 on the 5% significance level.
• In the reduced model β1,2 = 0 we obtain an F test statistic of 1.071 for df1 = 3 and df2 = 3 for dropping all remaining variables, i.e. β2,2 = . . . = β2,4 = 0. This gives a p-value of 45.52% which says that we cannot reject this null hypothesis on the 5% significance level.
We conclude that we need the variable to distinguish between “owned” and “leased”.
The classification in age classes “21-30y”, . . ., “51-60y” can be discussed. This
discussion will also depend on whether we want such a tariffication criterion and
whether our competitors consider similar variables.

7.3 Generalized linear models

In the previous section we have simply taken a log-normal approximation for the
total claim amounts Si,j in risk cells (i, j) which has led to a multiplicative structure
in a natural way. In the present section we express the expected claim of risk class
(i, j) as expected number of claims times the average claim, i.e.
\[ E[S_{i,j}] = E[N_{i,j}]\; E\big[Y^{(l)}_{i,j}\big], \]
where $N_{i,j}$ describes the number of claims in risk class $(i,j)$ and $Y^{(l)}_{i,j}$ the corresponding i.i.d. claim sizes for $l = 1, \ldots, N_{i,j}$ in risk class $(i,j)$.
We now analyze $N_{i,j}$ and $Y^{(l)}_{i,j}$ separately.
Definition 7.7 (exponential dispersion family). $X \sim f_X$ belongs to the exponential dispersion family if $f_X$ is of the form
\[ f_X(x; \theta, \phi) = \exp\left\{ \frac{x\theta - b(\theta)}{\phi/w} + c(x, \phi, w) \right\}, \]
write $X \sim \mathrm{EDF}(\theta, \phi, w, b(\cdot))$, where
  $w > 0$                   is a given weight,
  $\phi > 0$                is the dispersion parameter,
  $\theta \in \Theta$       is the (unknown) parameter of the distribution,
  $\Theta \subset \mathbb{R}$   is an open set of possible parameters $\theta$,
  $b: \Theta \to \mathbb{R}$    is the cumulant function,
  $c(\cdot, \cdot, \cdot)$  is the normalization, not depending on $\theta$.
fX can either be a density in the absolutely continuous sense or it can be probability

weights in the discrete case. Moreover, depending on the choice of the cumulant
function b(·) and of the possible parameters Θ the support of X may need to be
restricted to subsets of R.
Lemma 7.8. Choose a fixed cumulant function b(·) and assume that the exponen-
tial dispersion family EDF(θ, φ, w, b(·)) gives well-defined densities with identical
supports for all parameters θ ∈ Θ in an open set Θ. Assume that for any θ ∈ Θ
there exists a neighborhood of zero such that the moment generating function MX (r)
of X ∼ EDF(θ, φ, w, b(·)) is finite in this neighborhood of zero (for r). Then we
have for all θ ∈ Θ and r sufficiently close to zero
\[ M_X(r) = \exp\left\{ \frac{b(\theta + r\phi/w) - b(\theta)}{\phi/w} \right\}. \]

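Proof. Choose $\theta \in \Theta$ and $r$ in the neighborhood of zero such that $M_X(r)$ exists. Then we have
\[ \begin{aligned}
M_X(r) &= \int e^{rx} \exp\left\{ \frac{x\theta - b(\theta)}{\phi/w} + c(x, \phi, w) \right\} dx \\
&= \int \exp\left\{ \frac{x(\theta + r\phi/w) - b(\theta)}{\phi/w} + c(x, \phi, w) \right\} dx \\
&= \exp\left\{ \frac{b(\theta + r\phi/w) - b(\theta)}{\phi/w} \right\} \int \exp\left\{ \frac{x(\theta + r\phi/w) - b(\theta + r\phi/w)}{\phi/w} + c(x, \phi, w) \right\} dx.
\end{aligned} \]
We have assumed that $\Theta$ is an open set. Therefore, for any $\theta \in \Theta$ we have that $\theta_r = \theta + r\phi/w \in \Theta$ for $r$ sufficiently close to zero. Therefore, the last integrand is the density that corresponds to $\mathrm{EDF}(\theta_r, \phi, w, b(\cdot))$ and since this is a well-defined density with identical support for all $\theta_r \in \Theta$ this last integral is equal to 1. This proves the claim. □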
Corollary 7.9. We make the same assumptions as in Lemma 7.8 and in addition we assume that $b \in C^2$ in the interior of $\Theta$. Then we have
\[ E[X] = b'(\theta) \quad\text{and}\quad \operatorname{Var}(X) = \frac{\phi}{w}\, b''(\theta). \]
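Proof. In view of (1.3) we only need to calculate the first and second derivatives at zero of the moment generating function. We have from Lemma 7.8
\[ \frac{d}{dr} M_X(r) \Big|_{r=0} = \exp\left\{ \frac{b(\theta + r\phi/w) - b(\theta)}{\phi/w} \right\} b'(\theta + r\phi/w) \Big|_{r=0} = b'(\theta), \]
and
\[ \begin{aligned}
\frac{d^2}{dr^2} M_X(r) \Big|_{r=0} &= \exp\left\{ \frac{b(\theta + r\phi/w) - b(\theta)}{\phi/w} \right\} \left[ \left( b'(\theta + r\phi/w) \right)^2 + \frac{\phi}{w}\, b''(\theta + r\phi/w) \right] \Big|_{r=0} \\
&= \left( b'(\theta) \right)^2 + \frac{\phi}{w}\, b''(\theta).
\end{aligned} \]
Subtracting the squared mean $(b'(\theta))^2$ from the second moment gives the variance. □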
Example 7.10 (exponential dispersion family). In Chapters 2 and 3 we have

met several examples that belong to the exponential dispersion family. We revisit
these examples and explain how they fit into the exponential dispersion family
framework. These considerations also lead to an explicit explanation of the weight
w > 0. We start with the discrete case assuming X ∼ fX .
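• Binomial distribution: Choose $\Theta = \mathbb{R}$, $b(\theta) = \log(1 + e^\theta)$, $\phi = 1$ and $w = v$. In this case we obtain for $x \in \{0, 1/v, 2/v, \ldots, 1\}$
\[ \begin{aligned}
\frac{f_X(x; \theta, 1)}{\exp\{c(x, 1, v)\}} &= \exp\left\{ v \left( x\theta - \log(1 + e^\theta) \right) \right\} = \exp\left\{ v \left( x \log e^\theta - \log(1 + e^\theta) \right) \right\} \\
&= \exp\left\{ vx \log \frac{e^\theta}{1 + e^\theta} \right\} \exp\left\{ v(1 - x) \log \frac{1}{1 + e^\theta} \right\} = p^{vx} (1 - p)^{v - vx},
\end{aligned} \]
for $p = e^\theta/(1 + e^\theta) \in (0, 1)$. The first two moments are obtained by
\[ E[X] = b'(\theta) = \frac{e^\theta}{1 + e^\theta} = p, \]
and
\[ \operatorname{Var}(X) = \frac{1}{v}\, b''(\theta) = \frac{1}{v}\, \frac{e^\theta}{1 + e^\theta}\, \frac{1}{1 + e^\theta} = \frac{1}{v}\, p(1 - p). \]
From this we see that $N = vX \sim \mathrm{Binom}(v, p)$.
• Poisson distribution: Choose $\Theta = \mathbb{R}$, $b(\theta) = \exp\{\theta\}$, $\phi = 1$ and $w = v > 0$. In this case we obtain for $x \in \mathbb{N}_0/v$
\[ \frac{f_X(x; \theta, 1)}{\exp\{c(x, 1, v)\}} = \exp\left\{ v \left( x\theta - e^\theta \right) \right\} = \lambda^{vx} e^{-\lambda v}, \]
for $\lambda = e^\theta > 0$. The first two moments are obtained by
\[ E[X] = b'(\theta) = e^\theta = \lambda \quad\text{and}\quad \operatorname{Var}(X) = \frac{1}{v}\, b''(\theta) = \frac{1}{v}\, e^\theta = \frac{1}{v}\, \lambda. \]
From this we see that $N = vX \sim \mathrm{Poi}(\lambda v)$.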
In the absolutely continuous case we have the following examples.
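• Gaussian distribution: Choose $\Theta = \mathbb{R}$ and $b(\theta) = \theta^2/2$. In this case we have for $x \in \mathbb{R}$
\[ \frac{f_X(x; \theta, \phi)}{\exp\{c(x, \phi, w)\}} = \exp\left\{ \frac{x\theta - \theta^2/2}{\phi/w} \right\} = \exp\left\{ -\frac{1}{2}\, \frac{\theta^2 - 2x\theta}{\phi/w} \right\}, \]
which is the Gaussian density with mean $\theta$ and variance $\phi/w$.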
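• Gamma distribution: Choose $\Theta = -\mathbb{R}_+$ and $b(\theta) = -\log(-\theta)$. In this case we have for $x \in \mathbb{R}_+$
\[ \frac{f_X(x; \theta, \phi)}{\exp\{c(x, \phi, w)\}} = \exp\left\{ \frac{x\theta + \log(-\theta)}{\phi/w} \right\} = (-\theta)^{w/\phi} \exp\left\{ -\frac{-\theta w}{\phi}\, x \right\}, \]
this is a gamma density with shape parameter $\gamma = w/\phi > 0$ and scale parameter $c = -\theta w/\phi > 0$. The first two moments are obtained by
\[ E[X] = b'(\theta) = -1/\theta = \frac{\gamma}{c} \quad\text{and}\quad \operatorname{Var}(X) = \frac{\phi}{w\theta^2} = \frac{\gamma}{c^2}. \]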
For more examples we refer to Table 13.8 in Frees [46] on page 379.
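The moment formulas of Corollary 7.9 can be checked numerically for the Poisson member of the family. The sketch below sums the probability weights of $X = N/v$ with $N \sim \mathrm{Poi}(\lambda v)$ directly and compares the result with $b'(\theta)$ and $b''(\theta)/v$; the frequency of 10%, the volume of 50 and the truncation point are hypothetical choices.

```python
from math import exp, lgamma, log


def poisson_edf_moments(theta, v, n_max=400):
    """Mean and variance of X = N / v with N ~ Poi(v * exp(theta)), by direct summation."""
    lam_v = v * exp(theta)
    # Poisson weights computed on the log scale to avoid overflow in n!
    probs = [exp(n * log(lam_v) - lam_v - lgamma(n + 1)) for n in range(n_max)]
    mean = sum(p * n / v for n, p in enumerate(probs))
    var = sum(p * (n / v - mean) ** 2 for n, p in enumerate(probs))
    return mean, var


theta, v = log(0.1), 50.0          # hypothetical: lambda = 10%, volume v = 50
mean, var = poisson_edf_moments(theta, v)

b_prime = exp(theta)               # b'(theta)  for b(theta) = exp(theta)
b_second = exp(theta)              # b''(theta) for b(theta) = exp(theta)
```

The variance of the frequency $X$ shrinks with the volume $v$, which is the role played by the weight $w = v$ in the exponential dispersion family.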
The previous example shows that several popular distribution functions belong to
the exponential dispersion family. In the present layout we concentrate on the
Poisson and the gamma distributions for pricing the two components ’number
of claims’ and ’claims severities’. However, the theory holds true in much more
generality, especially within the exponential dispersion family. Our aim is to express
the expected claim of risk class (i, j) as expected number of claims times the average
claim, i.e.
\[ E[S_{i,j}] = E[N_{i,j}]\; E\big[Y^{(l)}_{i,j}\big], \]
where $N_{i,j}$ describes the number of claims in risk class $(i,j)$ and $Y^{(l)}_{i,j}$ the corresponding i.i.d. claim sizes for $l = 1, \ldots, N_{i,j}$ in risk class $(i,j)$. We then aim for
calculating a multiplicative tariff which considers risk characteristics χ’s for both
the number of claims and the claims severities.
We assume that Ni,j are independent with Ni,j ∼ Poi(λi,j vi,j ) and vi,j counting
the number of policies in risk class (i, j). Under these assumptions we derive a
multiplicative tariff structure for the number of claims determining the risk char-
acteristics. For the claim sizes we will do a similar construction by making a
gamma distributional assumption. Since the latter is slightly more involved than
the former we start with the Poisson case.
7.3.1 GLM for Poisson claims counts
We assume that Ni,j are independent with Ni,j ∼ Poi(λi,j vi,j ) and vi,j counting the
number of policies in risk class (i, j). In view of the exponential dispersion family
we set for the mean frequency, see Example 7.10,
\[ \lambda_{i,j} = E\left[ \frac{N_{i,j}}{v_{i,j}} \right] = b'(\theta_{i,j}) = \exp\{\theta_{i,j}\} = \exp\{(Z\beta)_m\}, \tag{7.16} \]
where we make the assumption of having a multiplicative tariff structure which
provides an additive structure on the log-scale reflected by Zβ and the index m =
m(i, j) was defined in (7.6). Thus, we assume that Xi,j = Ni,j /vi,j ∈ N0 /vi,j are
independent with
Xi,j ∼ EDF(θi,j , φ = 1, vi,j , b(·) = exp{·}).
Our aim is to estimate the parameter β under the assumption that for every risk
class (i, j) we have
θi,j = (Zβ)m .
The last piece in this consideration is the link function g which connects the mean
λi,j with the parameter vector. For obtaining a multiplicative structure the natural
link function is the so-called log-link function g(·) = log(·). Applying the log-link
function to (7.16) releases the parameter β in the following linear form
\[ g(\lambda_{i,j}) = \log \lambda_{i,j} = \log E[X_{i,j}] = \log(b'(\theta_{i,j})) = (Z\beta)_m. \]
This implies that
\[ g(b'(\theta_{i,j})) = (Z\beta)_m, \]
which gives the direct connection between the parameter vector $\beta$ and $\theta_{i,j}$ for $m = m(i,j)$. For the joint log-likelihood function of $X \in \mathbb{R}^M_+$ we obtain
\[ \ell_X(\beta) \propto \sum_m \frac{X_m \theta_m - \exp\{\theta_m\}}{1/v_m} = \sum_m \frac{X_m (Z\beta)_m - \exp\{(Z\beta)_m\}}{1/v_m}, \]

where we have applied the relabeling of the components of X and vi,j such that
they fit to the design matrix Z, see also (7.5).
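The MLE $\hat\beta^{\text{MLE}}$ for $\beta$ is found by the solution of
\[ \frac{\partial}{\partial\beta}\, \ell_X(\beta) = 0. \tag{7.17} \]
We calculate the partial derivatives of the log-likelihood function
\[ \begin{aligned}
\frac{\partial}{\partial\beta_l}\, \ell_X(\beta) &= \frac{\partial}{\partial\beta_l} \sum_m \frac{X_m \theta_m - \exp\{\theta_m\}}{1/v_m} = \sum_m \frac{X_m - \exp\{\theta_m\}}{1/v_m}\, \frac{\partial\theta_m}{\partial\beta_l} \\
&= \sum_m \frac{X_m - \exp\{(Z\beta)_m\}}{1/v_m}\, \frac{\partial (Z\beta)_m}{\partial\beta_l} = \sum_m \frac{X_m - \exp\{(Z\beta)_m\}}{1/v_m}\, z_{m,l},
\end{aligned} \]
where $Z = (z_{m,l})_{m,l} \in \mathbb{R}^{M \times (r+1)}$ for $\beta \in \mathbb{R}^{r+1}$, see also (7.12). If we define the weight matrix $V = \mathrm{diag}(v_1, \ldots, v_M)$ then we have just proved the following proposition:
Proposition 7.11. The solution to the MLE problem (7.17) in the Poisson case is given by the solution of
\[ Z' V \exp\{Z\beta\} = Z' V X. \]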
Remarks. One should observe the similarities between the Gaussian case (7.8) and the Poisson case
\[ Z' \Sigma^{-1} Z \hat\beta^{\text{MLE}} = Z' \Sigma^{-1} X \quad\text{and}\quad Z' V \exp\{Z \hat\beta^{\text{MLE}}\} = Z' V X. \]
The Gaussian case is solved analytically (assuming full rank of $Z$), the Poisson case can only be solved numerically, due to the presence of the exponential function. The Poisson case is rewritten as
\[ Z' V \exp\{Z \hat\beta^{\text{MLE}}\} - Z' N = 0, \]
where $N = VX$ denotes the vector of claims counts. Observe that the latter exactly leads to the method of total marginal sums by Bailey & Jung [9, 56] given by (7.3)-(7.4).
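A minimal sketch of a numerical solution of Proposition 7.11 by the Newton-Raphson algorithm, assuming NumPy. For illustration only, the figures of Example 7.3 are used as observations $X_m$ with volumes $v_m = 1$, so that the connection to the method of total marginal sums becomes visible; the function name, starting value and stopping rule are assumptions.

```python
import numpy as np

# Illustration only: figures of Example 7.3 used as observations, with v_m = 1.
X = np.array([1300.0, 1200.0, 1000.0, 1200.0, 1800.0, 1300.0, 1300.0, 1500.0])
v = np.ones_like(X)
Z = np.array([[1, 0, 0, 0, 0], [1, 0, 1, 0, 0], [1, 0, 0, 1, 0], [1, 0, 0, 0, 1],
              [1, 1, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]],
             dtype=float)


def poisson_mle(Z, X, v, n_iter=50, tol=1e-12):
    """Newton-Raphson for Z'V exp{Z beta} = Z'V X."""
    beta = np.zeros(Z.shape[1])
    beta[0] = np.log(np.sum(v * X) / np.sum(v))     # start at the overall mean
    for _ in range(n_iter):
        mu = np.exp(Z @ beta)
        score = Z.T @ (v * (X - mu))                # Z'V X - Z'V exp{Z beta}
        info = Z.T @ (Z * (v * mu)[:, None])        # Z' V diag(mu) Z
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


beta = poisson_mle(Z, X, v)
mu = np.exp(Z @ beta)
score = Z.T @ (v * (X - mu))
```

At the solution the fitted row and column totals coincide with the observed ones, so $\exp\{\hat\beta_{1,2}\} = 5900/4700 \approx 1.2553$ as in Example 7.4.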
7.3.2 GLM for gamma claim sizes

The analysis of gamma claim sizes is slightly more involved because it needs more transformations. We denote by $n_{i,j}$ the number of observations $Y^{(l)}_{i,j}$ in risk cell $(i,j)$. We assume that
\[ Y^{(l)}_{i,j} \overset{\text{i.i.d.}}{\sim} \Gamma(\gamma_{i,j}, c_{i,j}) \quad\text{for } l = 1, \ldots, n_{i,j}. \]
From the moment generating function given in Section 3.2.1 we immediately see that for given $n_{i,j}$ the convolution is given by
\[ Y_{i,j} = \sum_{l=1}^{n_{i,j}} Y^{(l)}_{i,j} \sim \Gamma(\gamma_{i,j}\, n_{i,j},\, c_{i,j}). \]

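We define the normalized random variable $X_m = Y_{i,j}/n_{i,j}$ where we again use the relabeling defined in (7.6). Observe that the family of gamma distributions is closed under multiplication by positive constants, see (3.4). Therefore, the density of $X_m$ is given by
\[ f_{X_m}(x) = \frac{(c_m n_m)^{\gamma_m n_m}}{\Gamma(\gamma_m n_m)}\, x^{\gamma_m n_m - 1} \exp\{-c_m n_m x\}. \tag{7.18} \]
Next we do a reparametrization similar to Example 7.10. Set $\gamma_m = 1/\phi_m$ and $c_m = -\theta_m/\phi_m$. This provides
\[ f_{X_m}(x) = \frac{(-\theta_m n_m/\phi_m)^{n_m/\phi_m}}{\Gamma(n_m/\phi_m)}\, x^{n_m/\phi_m - 1} \exp\left\{ -\frac{-\theta_m n_m}{\phi_m}\, x \right\}. \]
Finally, define the cumulant function $b(\theta) = -\log(-\theta)$ for $\theta < 0$, see Example 7.10. The density of $X_m = Y_{i,j}/n_{i,j}$ is then given by
\[ f_{X_m}(x) = \exp\left\{ \frac{\theta_m x - b(\theta_m)}{\phi_m/n_m} \right\} \frac{1}{\Gamma(n_m/\phi_m)} \left( \frac{n_m}{\phi_m} \right)^{n_m/\phi_m} x^{n_m/\phi_m - 1}. \]
Thus, we have $X_m \sim \mathrm{EDF}(\theta_m, \phi_m, n_m, b(\cdot))$ with $b(\theta) = -\log(-\theta)$ for $\theta \in \Theta = -\mathbb{R}_+$. The first two moments are given by
\[ E[X_m] = -\theta_m^{-1} \quad\text{and}\quad \operatorname{Var}(X_m) = \frac{\phi_m}{n_m}\, \theta_m^{-2}. \]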
As in the Poisson case we would like to have a multiplicative structure. This is
again achieved by using the log-link function for g and setting
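\[ -\log(-\theta_m) = (Z\beta)_m. \]
This choice and the log-link imply that
\[ g(E[X_m]) = \log E[X_m] = \log(-\theta_m^{-1}) = (Z\beta)_m. \]
For the joint log-likelihood function of $X \in \mathbb{R}^M_+$ we obtain
\[ \ell_X(\beta) \propto \sum_m \frac{n_m}{\phi_m} \left[ X_m \theta_m + \log(-\theta_m) \right] = \sum_m \frac{n_m}{\phi_m} \left[ -X_m \exp\{-(Z\beta)_m\} - (Z\beta)_m \right]. \]
The MLE $\hat\beta^{\text{MLE}}$ for $\beta$ is found by the solution of
\[ \frac{\partial}{\partial\beta}\, \ell_X(\beta) = 0. \tag{7.19} \]
We calculate the partial derivatives of the log-likelihood
\[ \begin{aligned}
\frac{\partial}{\partial\beta_l}\, \ell_X(\beta) &= \frac{\partial}{\partial\beta_l} \sum_m \frac{n_m}{\phi_m} \left[ X_m \theta_m + \log(-\theta_m) \right] = \sum_m \frac{n_m}{\phi_m} \left[ X_m + \theta_m^{-1} \right] \frac{\partial\theta_m}{\partial\beta_l} \\
&= \sum_m \frac{n_m}{\phi_m} \left[ X_m \exp\{-(Z\beta)_m\} - 1 \right] z_{m,l}.
\end{aligned} \]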

For rewriting the previous equation in matrix form we define the weight matrix $V_\theta = \mathrm{diag}(-\theta_1 n_1/\phi_1, \ldots, -\theta_M n_M/\phi_M)$. The last equation is then written as
\[ \frac{\partial}{\partial\beta}\, \ell_X(\beta) = Z' V_\theta X - Z' V_\theta \exp\{Z\beta\}. \]
We have just proved the following proposition:
Proposition 7.12. The solution to the MLE problem (7.19) in the gamma case is given by the solution of
\[ Z' V_\theta \exp\{Z\beta\} = Z' V_\theta X. \]
Remarks.
(m
• Proposition 7.12 for the gamma case looks very promising because it has
the same structure as Proposition 7.11 for the Poisson case. However, this
similarity is only at the first sight: the parameter β determines the θ which
is also integrated into the weight matrix Vθ . Therefore, the MLE β b MLE is
only found numerically, using either Fisher’s scoring method or the Newton-
Raphson algorithm.
tes
• For the general case within the exponential dispersion family with link func-
tion g we refer to Section 2.3.2 in Ohlsson-Johansson [72].
no
• We have seen that the weights wi,j are given by the number of policies vi,j in
the Poisson case and by the number of claims ni,j in the gamma case.
• We summarize the cases considered:

Gaussian case:
$$ Z'\Sigma^{-1}Z\,\hat\beta^{MLE} - Z'\Sigma^{-1}X = 0. $$
Poisson case:
$$ Z'V\exp\{Z\hat\beta^{MLE}\} - Z'VX = 0. $$
Gamma case:
$$ Z'V_{\hat\theta}\exp\{Z\hat\beta^{MLE}\} - Z'V_{\hat\theta}X = 0, $$
with $\hat\theta = -\exp\{-Z\hat\beta^{MLE}\}$.
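The numerical solution of the gamma case mentioned in the remarks can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in these notes: the design matrix has exactly two columns (so that the 2x2 system can be solved by hand), the weights $n_m/\phi_m$ are known, the data are made up, and the function name `gamma_glm_fisher_scoring` is hypothetical. For the log-link the expected Fisher information is $Z'\,\mathrm{diag}(n_m/\phi_m)\,Z$, so each Fisher scoring step adds the inverse information times the score to β.

```python
import math

def gamma_glm_fisher_scoring(X, Z, weights, n_iter=50, tol=1e-12):
    """Fisher scoring for a gamma GLM with log-link and two regression parameters.

    X: observations X_m > 0; Z: rows (z_m0, z_m1); weights: n_m / phi_m.
    """
    # crude starting value; assumes column 0 of Z is the intercept
    beta = [math.log(sum(X) / len(X)), 0.0]
    for _ in range(n_iter):
        score = [0.0, 0.0]
        info = [[0.0, 0.0], [0.0, 0.0]]
        for x, z, w in zip(X, Z, weights):
            mu = math.exp(z[0] * beta[0] + z[1] * beta[1])
            for k in range(2):
                # score component: sum_m (n_m/phi_m) [X_m exp{-(Z beta)_m} - 1] z_{m,k}
                score[k] += w * (x / mu - 1.0) * z[k]
                for l in range(2):
                    # expected information: sum_m (n_m/phi_m) z_{m,k} z_{m,l}
                    info[k][l] += w * z[k] * z[l]
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step = [(info[1][1] * score[0] - info[0][1] * score[1]) / det,
                (info[0][0] * score[1] - info[1][0] * score[0]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return beta

# made-up data: intercept plus one binary covariate
X = [1.0, 1.2, 2.0, 2.6]
Z = [(1.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
weights = [10.0, 20.0, 15.0, 5.0]
beta_hat = gamma_glm_fisher_scoring(X, Z, weights)
mu_hat = [math.exp(z[0] * beta_hat[0] + z[1] * beta_hat[1]) for z in Z]
```

With a single binary covariate the fitted means coincide with the weighted averages of the observations in the two groups, which gives a simple check of the estimating equation of Proposition 7.12.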

7.3.3 Variable reduction analysis

We consider variable reduction in the general case of the exponential dispersion
family under the assumption of a log-link function g. Having observations
X = (X1 , . . . , XM )' we can calculate the MLE $\hat\beta^{MLE}$ for β within the
exponential dispersion family with log-link function g and design matrix Z. This
then provides the estimate for the mean given by
$$ \hat\mu_m = b'(\hat\theta_m) = \exp\{(Z\hat\beta^{MLE})_m\}. $$
We define the function $h = (b')^{-1}$ which implies that $\hat\theta_m = h(\hat\mu_m)$. The log-likelihood function for this estimate is then given by
$$ \ell_X(\hat\mu) = \sum_m \left[ \frac{X_m h(\hat\mu_m) - b(h(\hat\mu_m))}{\phi/w_m} + c(X_m, \phi, w_m) \right], $$
where we assume that φm = φ for all m = 1, . . . , M . Observe that this maximizes
the likelihood function over all possible choices β (under given design matrix Z,
cumulant function b and log-link function g). Similar to the likelihood ratio test
(7.13) in the Gaussian model we do a likelihood ratio test for this model within the
exponential dispersion family. Therefore, we consider the model Zβ and compare
it to the saturated model which has as many parameters as observations:
$$ \ell_X(X) = \sum_m \left[ \frac{X_m h(X_m) - b(h(X_m))}{\phi/w_m} + c(X_m, \phi, w_m) \right]. $$
The scaled deviance is then defined by
$$ D^*(X, \hat\mu) = 2\left(\ell_X(X) - \ell_X(\hat\mu)\right) = \frac{2}{\phi} \sum_m w_m \left[ X_m h(X_m) - b(h(X_m)) - X_m h(\hat\mu_m) + b(h(\hat\mu_m)) \right]. $$
The deviance statistic is defined by
$$ D(X, \hat\mu) = \phi\, D^*(X, \hat\mu) = 2\phi \left(\ell_X(X) - \ell_X(\hat\mu)\right). $$
Observe that these deviance statistics play the role of the likelihood ratio Λ given
in (7.13) if we compare the model Zβ to the saturated model which is used as
benchmark in this analysis.
Similar to Section 7.2.2 we would now like to see whether we can reduce the number
of parameters in β ∈ Rr+1 .
Null hypothesis H0 : β0 = . . . = βp−1 = 0 for given p < r + 1.

1. Calculate the deviance statistic $D(X, \hat\mu_{full})$ in the full model β ∈ Rr+1 .
2. Calculate the deviance statistic $D(X, \hat\mu_{H_0})$ under the null hypothesis H0 .
Define the test statistic, see also (7.14),
$$ F = \frac{D(X, \hat\mu_{H_0}) - D(X, \hat\mu_{full})}{D(X, \hat\mu_{full})} \; \frac{M - r - 1}{p}. \tag{7.20} $$
The test statistic F has asymptotically an F -distribution with degrees of freedom
given by df 1 = p and df 2 = M − r − 1. Therefore, we apply the same criterion as
in (7.15).
A second test statistic considered is, see Lemma 3.1 in Ohlsson-Johansson [72],
$$ X^2 = D^*(X, \hat\mu_{H_0}) - D^*(X, \hat\mu_{full}). $$
The test statistic $X^2$ is asymptotically $\chi^2$-distributed with df = p degrees of
freedom. In order to calculate this latter test statistic we need to estimate the
dispersion parameter φ. For the Poisson case it is assumed to be 1, in the other cases
we have different options for the estimation of φ. Assume that θm was estimated by
$\hat\theta_m$ (under the assumption that φm = φ for all m and thus φ cancels in the MLE).
Then, we can estimate φ from Pearson’s residuals by
$$ \hat\phi_P = \frac{1}{M - r - 1} \sum_m w_m \frac{(X_m - b'(\hat\theta_m))^2}{b''(\hat\theta_m)}. $$
An alternative is to use the deviance and estimate
$$ \hat\phi_D = \frac{D(X, \hat\mu_{full})}{M - r - 1}. $$
We can also calculate $\hat\phi_P$ and $\hat\phi_D$ in the Poisson case; if they differ substantially
from 1, then we have either under- or over-dispersion.
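The two test statistics can be computed directly from the deviance statistics of the two nested models. The following is a small sketch under assumptions that are not part of the notes: the function name `deviance_tests` and the numerical inputs are hypothetical, and the dispersion is estimated by the deviance estimator $\hat\phi_D$ of the full model, so that the scaled deviances are $D/\hat\phi_D$.

```python
def deviance_tests(dev_h0, dev_full, M, n_params_full, p):
    """Return (F, phi_D, X2) for a reduced model nested in a full model.

    dev_h0, dev_full: deviance statistics D under H0 and in the full model;
    M: number of observations; n_params_full = r + 1; p: number of dropped parameters.
    """
    df2 = M - n_params_full                            # M - r - 1
    F = (dev_h0 - dev_full) / dev_full * df2 / p       # statistic (7.20)
    phi_D = dev_full / df2                             # deviance estimate of phi
    X2 = (dev_h0 - dev_full) / phi_D                   # difference of scaled deviances
    return F, phi_D, X2

# made-up deviances for illustration only
F, phi_D, X2 = deviance_tests(dev_h0=30.0, dev_full=20.0, M=25, n_params_full=5, p=2)
```

Under this particular choice of dispersion estimate the two statistics are linked by $X^2 = p\,F$; they are then compared with quantiles of the F- and the $\chi^2$-distribution, respectively.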
Finally, to check the accuracy of the model and the fit one should also study the
residuals. We can study Pearson’s residuals given by
$$ r_{P,m} = \frac{X_m - b'(\hat\theta_m)}{\sqrt{b''(\hat\theta_m)/w_m}}, $$
and the deviance residuals
$$ r_{D,m} = \mathrm{sgn}\left(X_m - b'(\hat\theta_m)\right) \sqrt{2 w_m \left[ X_m h(X_m) - b(h(X_m)) - X_m \hat\theta_m + b(\hat\theta_m) \right]} $$
for m = 1, . . . , M . These residuals should not show any structure because the
Xm were assumed to be independent, and the observed $r_{P,m}$ should be roughly
centered with similar variances φ.

Example 7.13. Assume that X1 , . . . , XM are independent with
$$ X_m \sim \mathrm{EDF}\left(\theta_m, \phi, w_m, b(\cdot) = (\cdot)^2/2\right). \tag{7.21} $$
From Example 7.10 we know that these Xm ’s have a Gaussian distribution, i.e. their
densities are given by
$$ f(x_m; \theta_m, \phi) = \frac{1}{\sqrt{2\pi\phi/w_m}} \exp\left\{ -\frac{1}{2} \frac{(x_m - \theta_m)^2}{\phi/w_m} \right\}. $$
The scaled deviance is given by, set $\hat\mu = b'(\hat\theta) = \hat\theta = \exp\{Z\hat\beta\}$,
$$ D^*(X, \hat\mu) = \frac{1}{\phi} \sum_m w_m \left(X_m - \hat\theta_m\right)^2, $$
and the deviance statistic is given by
$$ D(X, \hat\mu) = \sum_m w_m \left(X_m - \hat\theta_m\right)^2. $$
Compare (7.20) and (7.14) for the Gaussian model (7.21).
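The general deviance formula can be checked numerically against the Gaussian case of Example 7.13. The sketch below is only an illustration: the function names `deviance` and `pearson_dispersion`, the fitted means and the weights are hypothetical. The cumulant function b and the function h = (b')^{-1} are passed as arguments, and for b(θ) = θ²/2 the deviance statistic should reduce to the weighted sum of squares.

```python
def deviance(X, mu_hat, w, b, h):
    """Deviance statistic D(X, mu_hat) = 2 sum_m w_m [X h(X) - b(h(X)) - X h(mu) + b(h(mu))]."""
    total = 0.0
    for x, m, wm in zip(X, mu_hat, w):
        total += wm * (x * h(x) - b(h(x)) - x * h(m) + b(h(m)))
    return 2.0 * total

def pearson_dispersion(X, mu_hat, w, var_fun, n_params):
    """Pearson estimate of phi; var_fun(mu) = b''(h(mu)), n_params = r + 1."""
    s = sum(wm * (x - m) ** 2 / var_fun(m) for x, m, wm in zip(X, mu_hat, w))
    return s / (len(X) - n_params)

# Gaussian case: b(theta) = theta^2 / 2, h = identity, b'' = 1
b_gauss = lambda theta: 0.5 * theta ** 2
h_gauss = lambda mu: mu

# made-up observations, fitted means and weights
X = [1.0, 2.0, 4.0]
mu_hat = [1.5, 2.5, 3.0]
w = [2.0, 1.0, 3.0]

D = deviance(X, mu_hat, w, b_gauss, h_gauss)
weighted_ss = sum(wm * (x - m) ** 2 for x, m, wm in zip(X, mu_hat, w))
phi_P = pearson_dispersion(X, mu_hat, w, lambda mu: 1.0, n_params=1)
```

In the Gaussian case the Pearson and the deviance estimates of φ coincide, because both are the weighted sum of squares divided by M − r − 1.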

Exercise 13. Calculate the deviance statistics for the Poisson and the gamma
model, see also (3.4) in Ohlsson-Johansson [72].

Chapter 8
Bayesian Models and Credibility Theory
In the previous chapter we have done tariffication using GLM. This was done
by splitting the total portfolio into different homogeneous risk classes (i, j). The
volume measures in these risk classes (i, j) were given by vi,j in Section 7.3.1 and
by ni,j in Section 7.3.2, respectively. It may happen that a risk
class (i, j) has only small volume vi,j and ni,j , respectively, i.e. only a few policies
or claims fall into that risk class. In that case an observation Ni,j and Si,j may not
be very informative and single outliers may disturb the whole picture. Credibility
theory aims to deal with such situations in that it specifies a tariff of the following
structure
$$ \mu_{i,j} = \alpha_{i,j}\, S_{i,j} + (1 - \alpha_{i,j})\, \mu, $$
i.e. the tariff µi,j for the next accounting year is calculated as a credibility weighted
average between the individual past observation Si,j and the overall average µ
with credibility weight αi,j ∈ [0, 1]. For αi,j = 1 we completely believe in the past
observations, for αi,j = 0 we only believe in the overall average µ. Credibility
theory makes this approach rigorous and specifies the credibility weights.
Credibility theory belongs to the field of Bayesian statistics:
• There are exact Bayesian methods which allow for analytical solutions.
• There are simulation methods such as the Markov chain Monte Carlo (MCMC)
method which allow for numerical solutions of Bayesian models.
• There are approximations such as linear credibility methods which give opti-
mal solutions in sub-spaces of possible solutions.
Central to these methods is Bayes’ rule.
8.1 Exact Bayesian models

We start by explaining Bayes’ rule. The basic idea of Bayes’ rule goes back to the
Reverend Thomas Bayes (1701-1761) who discovered the rule during the 1740s. It was
then Richard Price (1723-1791) who devoted much of his time to cleaning up and
preparing Bayes’ essay on the probability of causes, and he submitted “An essay
toward solving a problem in the doctrine of chances” to the Royal Society’s
Philosophical Transactions. In 1774 Pierre-Simon Laplace discovered the rule
independently and brought it into today’s form. Therefore, Bayes’ rule should be called
Bayes-Price-Laplace’s rule. For a historical review we refer to McGrayne [67]. As
we will see shortly, Bayes’ rule is the mathematical tool to combine prior knowledge
and observations into posterior knowledge. Technically it exchanges probabilities,
therefore it is also known under the name method of inverse probabilities.
[Portrait: T. Bayes]
Assume we have observations X that have density fθ (x). Often the difficulty is
that the parameter θ is not known. In previous chapters we have estimated this
parameter with the MLE method and with the method of moments. These methods are
purely observation based. What can we do if we have no past observations or only
scarce past observations? This is the question we would like to answer in this
chapter. It will lead to a new attitude and to a new estimation method.
[Portrait: R. Price]
We specify a prior distribution/density π for the (unknown) parameter θ. We
will explain below how this prior distribution is specified. The joint density of
observation X and parameter θ is then given by
$$ f(x, \theta) = f_\theta(x)\,\pi(\theta). $$
Bayes’ rule allows us to calculate the posterior distribution of θ, given observation x,
$$ \pi(\theta|x) = \frac{f_\theta(x)\,\pi(\theta)}{\int f_\theta(x)\,\pi(\theta)\, d\theta} \propto f_\theta(x)\,\pi(\theta). $$
This means we start with a prior distribution π(θ). This prior distribution either
expresses expert knowledge or is determined from a portfolio of similar business.
Having observed x, we modify the prior belief π to obtain the posterior distribu-
tion π(θ|x) that reflects both prior knowledge π(θ) about θ and experience x, that
is, the prior belief π(θ) is improved by the arriving observation x. Thus, when-
ever an observation arrives we can update our knowledge about θ, which constantly
improves our estimation of the unknown model parameter θ.

This is exactly what Bayesian and credibility theory is about.
We start with an explicit example to show how this mechanism works.
8.1.1 Poisson-gamma model

In this section we present one of the most popular Bayesian models
which has a closed form solution for the posterior distribution. As
mentioned in Bühlmann-Gisler [24], this mathematical model goes
back to Fritz Bichsel (1921-1999) [11]. He introduced it
in the 1960s to calculate a bonus-malus tariff system for Swiss
motor third party liability insurance. The aim was to punish bad
drivers and to reward good drivers which has led to bonus-malus
considerations.
[Portrait: F. Bichsel]
Definition 8.1 (Poisson-gamma model). Assume fixed volumes vt > 0 are given
for t ∈ N.
• Conditionally, given Λ, the components of N = (N1 , . . . , NT ) are independent
with Nt ∼ Poi(Λvt ).
• Λ ∼ Γ(γ, c) with prior parameters γ > 0 and c > 0.

Remark. Observe that there is a fundamental difference to the negative-binomial

distribution considered in Section 2.2.4. Here, we assume that N1 , . . . , NT belong
all to the same Λ, whereas for having independent negative-binomial distributions
N1 , . . . , NT every component belongs to another latent factor Λ1 , . . . , ΛT . In the
latter case the components of N are independent, whereas in the former case they
are dependent (we only have conditional independence).
Theorem 8.2. Assume N = (N1 , . . . , NT ) follows the Poisson-gamma model of
Definition 8.1. The posterior distribution of Λ, conditional on N , is given by
$$ \Lambda|\{N\} \sim \Gamma\left( \gamma + \sum_{t=1}^T N_t,\; c + \sum_{t=1}^T v_t \right). $$
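Proof. The posterior is given by
$$ \pi(\lambda|N) \propto \prod_{t=1}^T e^{-\lambda v_t} \frac{(\lambda v_t)^{N_t}}{N_t!} \; \frac{c^\gamma}{\Gamma(\gamma)} \lambda^{\gamma-1} e^{-c\lambda} \propto \lambda^{\gamma + \sum_{t=1}^T N_t - 1} \exp\left\{ -\left(c + \sum_{t=1}^T v_t\right)\lambda \right\}. $$
This is a gamma density with the required properties. □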

Remarks 8.3.
• The posterior is again a gamma distribution but with modified parameters.
For the parameters we obtain the updates
$$ \gamma \mapsto \gamma_T^{post} = \gamma + \sum_{t=1}^T N_t \quad\text{and}\quad c \mapsto c_T^{post} = c + \sum_{t=1}^T v_t. $$
Often γ and c are called prior parameters and $\gamma_T^{post}$ and $c_T^{post}$ posterior
parameters (at time T ).
• The remarkable property in the Poisson-gamma model is that the posterior
distribution stays in the same family of distributions as the prior distribution.
There are more examples of this kind as we will see below. Many of these
examples belong to the exponential dispersion family with conjugate priors.
• For the estimation of the unknown parameter Λ we obtain the following prior
and posterior estimators
$$ \lambda_0 = E[\Lambda] = \frac{\gamma}{c}, $$
$$ \hat\lambda_T^{post} = E[\Lambda|N] = \frac{\gamma_T^{post}}{c_T^{post}} = \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t}. $$
We analyze the posterior estimator $\hat\lambda_T^{post}$ in more detail below, which will
provide the basic credibility formula.
no
Corollary 8.4. Assume N = (N1 , . . . , NT ) follows the Poisson-gamma model of
Definition 8.1. The posterior estimator $\hat\lambda_T^{post}$ has the following credibility form
$$ \hat\lambda_T^{post} = \alpha_T\, \hat\lambda_T + (1 - \alpha_T)\, \lambda_0, $$
with credibility weight αT and observation based estimator $\hat\lambda_T$ given by
$$ \alpha_T = \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t} \in (0, 1) \quad\text{and}\quad \hat\lambda_T = \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t. $$
The (mean square error) uncertainty of this estimator is given by
$$ E\left[ \left(\Lambda - \hat\lambda_T^{post}\right)^2 \Big|\, N \right] = \frac{\gamma_T^{post}}{(c_T^{post})^2} = (1 - \alpha_T)\, \frac{1}{c}\, \hat\lambda_T^{post}. $$
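Proof. In view of Theorem 8.2 we have for the posterior mean
$$ \hat\lambda_T^{post} = \frac{\gamma + \sum_{t=1}^T N_t}{c + \sum_{t=1}^T v_t} = \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t} \; \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t + \frac{c}{c + \sum_{t=1}^T v_t} \; \frac{\gamma}{c} = \alpha_T\, \hat\lambda_T + (1 - \alpha_T)\, \lambda_0. $$
This proves the first claim. For the estimation uncertainty we have
$$ E\left[ \left(\Lambda - \hat\lambda_T^{post}\right)^2 \Big|\, N \right] = \mathrm{Var}(\Lambda|N) = \frac{\gamma_T^{post}}{(c_T^{post})^2} = (1 - \alpha_T)\, \frac{1}{c}\, \hat\lambda_T^{post}. $$
□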
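The posterior update of Theorem 8.2 and the credibility form of Corollary 8.4 can be compared numerically. The following sketch uses made-up claims counts and volumes; the function names `poisson_gamma_posterior` and `credibility_form` are hypothetical helpers, not notation from these notes.

```python
def poisson_gamma_posterior(gamma, c, N, v):
    """Posterior parameters (gamma_post, c_post) of Lambda given claims counts N and volumes v."""
    return gamma + sum(N), c + sum(v)

def credibility_form(gamma, c, N, v):
    """Posterior mean written as credibility weighted average; returns (estimate, alpha_T)."""
    alpha = sum(v) / (c + sum(v))        # credibility weight alpha_T
    lam_obs = sum(N) / sum(v)            # observation based estimator
    lam0 = gamma / c                     # prior mean lambda_0
    return alpha * lam_obs + (1.0 - alpha) * lam0, alpha

# made-up example: prior mean 0.1, three years of observations
gamma, c = 2.0, 20.0
N = [3, 1, 4]
v = [20.0, 25.0, 30.0]

gamma_post, c_post = poisson_gamma_posterior(gamma, c, N, v)
post_mean = gamma_post / c_post
cred_mean, alpha_T = credibility_form(gamma, c, N, v)
post_var = gamma_post / c_post ** 2              # posterior variance of Lambda
mse_formula = (1.0 - alpha_T) / c * post_mean    # right-hand side of Corollary 8.4
```

In this example the credibility weight is 75/95, so the posterior mean lies much closer to the observed frequency 8/75 than to the prior mean 0.1.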
Remarks 8.5.
• Corollary 8.4 shows that the posterior estimator $\hat\lambda_T^{post}$ is a credibility weighted
average between the prior guess λ0 and the purely observation based estimator
$\hat\lambda_T$ with credibility weight αT ∈ (0, 1).
• The credibility weight αT has the following properties:
1. for the number of observed years T → ∞: αT → 1 (since vt ≥ 1 for all
t if vt counts the number of policies);
2. for the volume vt → ∞: αT → 1;
3. for the prior uncertainty going to infinity, i.e. c → 0: αT → 1;
4. for the prior uncertainty going to zero, i.e. c → ∞: αT → 0.
Note that
$$ \mathrm{Var}(\Lambda) = \frac{\gamma}{c^2} = \frac{1}{c}\, \lambda_0. $$
For c large we have an informative prior distribution, for c small we have a vague
prior distribution and for c = 0 we have a non-informative or improper prior
distribution. The latter means that we have no prior parameter knowledge
(this has to be understood in an asymptotic sense).
• The observation based estimator satisfies, see Estimators 2.27 and 2.32,
$$ \hat\lambda_T^{MV} = \hat\lambda_T^{MLE} = \hat\lambda_T. $$
• The posterior estimator $\hat\lambda_T^{post}$ has the nice property of a recursive update
structure which is important in many situations, see next corollary.
Corollary 8.6. Assume N = (N1 , . . . , NT ) follows the Poisson-gamma model of
Definition 8.1. Let $\hat\lambda_T^{post}$ denote the posterior estimator and $\hat\lambda_{T-1}^{post}$ the posterior
estimator in the sub-model where we only have observed (N1 , . . . , NT −1 ). The posterior
estimator $\hat\lambda_T^{post}$ has the following recursive update structure
$$ \hat\lambda_T^{post} = \beta_T\, \frac{N_T}{v_T} + (1 - \beta_T)\, \hat\lambda_{T-1}^{post}, $$
with credibility weight
$$ \beta_T = \frac{v_T}{c + \sum_{t=1}^T v_t} \in (0, 1). $$

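Proof. In view of Corollary 8.4 we have for the posterior mean
$$ \begin{aligned} \hat\lambda_T^{post} &= \frac{\sum_{t=1}^T v_t}{c + \sum_{t=1}^T v_t} \; \frac{1}{\sum_{t=1}^T v_t} \sum_{t=1}^T N_t + \frac{c}{c + \sum_{t=1}^T v_t} \; \frac{\gamma}{c} \\ &= \frac{1}{c + \sum_{t=1}^T v_t} \sum_{t=1}^T N_t + \frac{c + \sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^T v_t} \; \frac{c}{c + \sum_{t=1}^{T-1} v_t} \; \frac{\gamma}{c} \\ &= \frac{1}{c + \sum_{t=1}^T v_t} \left( \sum_{t=1}^{T-1} N_t + N_T \right) + (1 - \beta_T)(1 - \alpha_{T-1})\, \lambda_0. \end{aligned} $$
For the first term we have
$$ \begin{aligned} \frac{1}{c + \sum_{t=1}^T v_t} \left( \sum_{t=1}^{T-1} N_t + N_T \right) &= \frac{v_T}{c + \sum_{t=1}^T v_t} \; \frac{N_T}{v_T} + \frac{c + \sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^T v_t} \; \frac{\sum_{t=1}^{T-1} v_t}{c + \sum_{t=1}^{T-1} v_t} \; \frac{1}{\sum_{t=1}^{T-1} v_t} \sum_{t=1}^{T-1} N_t \\ &= \beta_T\, \frac{N_T}{v_T} + (1 - \beta_T)\, \alpha_{T-1}\, \hat\lambda_{T-1}. \end{aligned} $$
Collecting all terms provides the claim. □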
Conclusions. For pricing such a portfolio, we need to have prior information λ0
about the premium. This prior information can come from experts, from similar
portfolios, from market information or from a combination thereof. If we have
no observations we charge premium λ0 . When we start to collect observations
N1 , N2 , . . . , we constantly update the premium by the rule
$$ \hat\lambda_t^{post} = \beta_t\, \frac{N_t}{v_t} + (1 - \beta_t)\, \hat\lambda_{t-1}^{post}, $$
for t ≥ 1, where we set $\hat\lambda_0^{post} = \lambda_0$. The prior information has an uncertainty
parameter c for the credibility weighting of λ0 . The bigger the prior uncertainty
the faster the prior knowledge will disappear as t → ∞. In the limit (as t → ∞)
we have a premium that is completely based on the observations and which, in this
Poisson-gamma case, coincides with the MLE.
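The recursive update rule can be sketched in a few lines and compared with the closed form posterior mean of Theorem 8.2. The function name `recursive_premium` and the data are hypothetical and serve only as illustration.

```python
def recursive_premium(lam0, c, N, v):
    """Apply the update rule year by year, starting from the prior premium lam0."""
    lam = lam0
    cum_v = 0.0
    path = []
    for n_t, v_t in zip(N, v):
        cum_v += v_t
        beta_t = v_t / (c + cum_v)                       # credibility weight beta_t
        lam = beta_t * n_t / v_t + (1.0 - beta_t) * lam  # update of the premium
        path.append(lam)
    return path

# made-up example: prior parameters gamma = 2, c = 20, i.e. lambda_0 = 0.1
gamma, c = 2.0, 20.0
N = [3, 1, 4]
v = [20.0, 25.0, 30.0]

path = recursive_premium(gamma / c, c, N, v)
batch = (gamma + sum(N)) / (c + sum(v))   # posterior mean using all years at once
```

The last entry of `path` should agree with `batch`, so the insurer only needs to store the current premium and the accumulated volume, not the full claims history.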

However, the credibility formula of Corollary 8.4 is of special interest when we only
have a few observations, i.e. t small, and these few observations are only based on
a small portfolio, i.e. vs small for all s ≤ t. In such cases the credibility weight αt
may be around, say, 60% and therefore the prior mean λ0 substantially smooths
the purely observation based estimator $\hat\lambda_t$. This way we get much more stability
and reliability in the premium calculation because we add an additional source of
information to the premium calculation problem (prior choice).
8.1.2 Exponential dispersion family with conjugate priors

The crucial property of the Poisson-gamma model is that the prior and the poste-
rior distributions belong to the same family of parametric distributions, only the
parameters change from prior parameters to posterior parameters. There are many
examples of this type. The best known examples belong to the exponential disper-
sion family with conjugate priors. We have already met the exponential dispersion
family in Definition 7.7, X ∼ EDF(θ, φ, w, b(·)) has (generalized) density
$$ f_X(x; \theta, \phi) = \exp\left\{ \frac{x\theta - b(\theta)}{\phi/w} + c(x, \phi, w) \right\}, $$
for an (unknown) parameter θ in the open set $\boldsymbol{\Theta}$. In the Bayesian case we model
this parameter as a random variable Θ with a prior distribution π on $\boldsymbol{\Theta}$ and then try to
determine the posterior distribution after we have collected observations X1 , . . . , XT
that belong to this EDF(θ, φ, w, b(·)).
Model Assumptions 8.7 (exponential dispersion family with conjugate priors).
Assume fixed volumes wt > 0, t = 1, . . . , T , a dispersion parameter φ > 0 and a
cumulant function $b : \boldsymbol{\Theta} \to \mathbb{R}$ on an open set $\boldsymbol{\Theta} \subset \mathbb{R}$ are given.
• Assume the random variable Θ has the following density on $\boldsymbol{\Theta}$
$$ \pi_{x_0,\tau}(\theta) = \exp\left\{ \frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau) \right\}, $$
with fixed prior parameters x0 ∈ I and τ ∈ (0, cτ ), and d(·, ·) describes the
normalization. I ⊂ R denotes the possible choices of x0 so that $\pi_{x_0,\tau}$ is a
well-defined density on $\boldsymbol{\Theta}$ for all τ ∈ (0, cτ ) for a fixed given constant cτ > 0.
• Conditionally, given $\Theta \in \boldsymbol{\Theta}$, the components of X = (X1 , . . . , XT ) are in-
dependent with Xt ∼ EDF(Θ, φ, wt , b(·)), having well-defined densities with
identical support (for all $\Theta \in \boldsymbol{\Theta}$ and all t = 1, . . . , T ).
Theorem 8.8. We make Model Assumptions 8.7 and assume that the domain I of
possible prior choices x0 is an open interval which contains the range of Xt for all
$\Theta \in \boldsymbol{\Theta}$ and t = 1, . . . , T . The posterior distribution of Θ, given X, is given by the
density $\pi_{\hat{x}_T^{post},\, \tau^{post}}(\theta)$ with
$$ \tau^{post} = \left[ \sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2} \right]^{-1/2} < \tau \quad\text{with}\quad \tau^{post} \in (0, c_\tau), $$
$$ \hat{x}_T^{post} = \alpha_T\, \hat{x}_T^{MV} + (1 - \alpha_T)\, x_0 \in I, $$
with credibility weight αT and (minimum variance) estimator $\hat{x}_T^{MV}$
$$ \alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \phi/\tau^2} \quad\text{and}\quad \hat{x}_T^{MV} = \frac{1}{\sum_{t=1}^T w_t} \sum_{t=1}^T w_t X_t, $$
where for the minimum variance statement we additionally assume that the second
moments of Xt |{Θ} exist and the cumulant function $b \in C^2$ in the interior of $\boldsymbol{\Theta}$.

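Proof. Bayes’ rule gives for the posterior distribution of Θ, conditionally given X,
$$ \begin{aligned} \pi(\theta|X) &\propto \prod_{t=1}^T f_{X_t}(X_t; \theta, \phi)\, \pi_{x_0,\tau}(\theta) \propto \prod_{t=1}^T \exp\left\{ \frac{X_t\theta - b(\theta)}{\phi/w_t} \right\} \exp\left\{ \frac{x_0\theta - b(\theta)}{\tau^2} \right\} \\ &= \exp\left\{ \left[ \sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2} \right] \theta - \left[ \sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2} \right] b(\theta) \right\} \\ &= \exp\left\{ (\tau^{post})^{-2} \left( \left[ \sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2} \right]^{-1} \left[ \sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2} \right] \theta - b(\theta) \right) \right\}. \end{aligned} $$
Observe that $0 < \tau^{post} < \tau < c_\tau$ and
$$ \left[ \sum_{t=1}^T \frac{w_t}{\phi} + \frac{1}{\tau^2} \right]^{-1} \left[ \sum_{t=1}^T \frac{X_t w_t}{\phi} + \frac{x_0}{\tau^2} \right] = \alpha_T\, \frac{1}{\sum_{t=1}^T w_t/\phi} \sum_{t=1}^T \frac{w_t}{\phi} X_t + (1 - \alpha_T)\, x_0 \in I. $$
Therefore, we obtain the posterior density $\pi_{\hat{x}_T^{post},\, \tau^{post}}$ which is a well-defined density on $\boldsymbol{\Theta}$ by assump-
tion. There remains the proof of the minimum variance statement. For fixed parameter $\Theta \in \boldsymbol{\Theta}$
we know that the components of X = (X1 , . . . , XT ) are independent with Xt ∼ EDF(Θ, φ, wt , b(·)). Corollary 7.9
(or its generalization) implies
$$ E[X_t|\Theta] = b'(\Theta) \quad\text{and}\quad \mathrm{Var}(X_t|\Theta) = \frac{\phi}{w_t}\, b''(\Theta). \tag{8.1} $$
Note that Θ does not depend on t, therefore the statement of the minimum variance estimator
follows from Lemma 2.26. This closes the proof. □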
Theorem 8.9 (credibility estimator). We make the assumptions of Theorem 8.8.
In addition we assume that $\exp\{(x_0\theta - b(\theta))/\tau^2\}$ vanishes on the boundary of $\boldsymbol{\Theta}$
for all x0 ∈ I and τ ∈ (0, cτ ) and that $b \in C^1$ in the interior of $\boldsymbol{\Theta}$. We have
$$ E[b'(\Theta)] = x_0 \quad\text{and}\quad E[b'(\Theta)|X] = \hat{x}_T^{post} = \alpha_T\, \hat{x}_T^{MV} + (1 - \alpha_T)\, x_0, $$
see Theorem 8.8 for notation.

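Proof. In view of Theorem 8.8 it suffices to prove the first statement for all x0 ∈ I and τ ∈ (0, cτ ).
$$ \begin{aligned} E[b'(\Theta)] &= \int_{\boldsymbol{\Theta}} b'(\theta) \exp\left\{ \frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau) \right\} d\theta \\ &= x_0 - \tau^2 \int_{\boldsymbol{\Theta}} \frac{x_0 - b'(\theta)}{\tau^2} \exp\left\{ \frac{x_0\theta - b(\theta)}{\tau^2} + d(x_0, \tau) \right\} d\theta \\ &= x_0 - \tau^2 \exp\{d(x_0, \tau)\} \left[ \exp\left\{ \frac{x_0\theta - b(\theta)}{\tau^2} \right\} \right]_{\partial\boldsymbol{\Theta}} = x_0. \end{aligned} $$
□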
Example 8.10 (exact credibility model). We make the assumptions of Theorem
8.9 but we extend the random vector to (X1 , . . . , XT , XT +1 ), i.e. we add one additional
component XT +1 to the random vector, and we assume that conditionally, given Θ,
these components are all independent satisfying Model Assumptions 8.7. Our aim
is to price XT +1 based on the observations X1 , . . . , XT and on the prior knowledge
$\pi_{x_0,\tau}$. Therefore we calculate the conditional expectations by applying the tower
property. This provides
$$ \begin{aligned} E[X_{T+1}|X_1, \ldots, X_T] &= E\left[ E[X_{T+1}|\Theta, X_1, \ldots, X_T] \,\big|\, X_1, \ldots, X_T \right] \\ &= E\left[ E[X_{T+1}|\Theta] \,\big|\, X_1, \ldots, X_T \right] \\ &= E\left[ b'(\Theta) \,\big|\, X_1, \ldots, X_T \right] \qquad (8.2) \\ &= \alpha_T\, \hat{x}_T^{MV} + (1 - \alpha_T)\, x_0. \end{aligned} $$
Thus, we get a credibility weighted average for the premium of XT +1 which is based
on the prior knowledge $\pi_{x_0,\tau}$ and on the past experience X1 , . . . , XT . Similar to
Corollary 8.6 we obtain a recursive update structure for this experience premium,
which allows us to express the premium more and more accurately as time passes
(under the above stationarity assumptions, of course).
Remarks 8.11.
• Examples that belong to the exponential dispersion family with conjugate
priors are: Poisson-gamma model, gamma-gamma model, (log-)normal-normal
model. For detailed information we refer to Chapter 2 in Bühlmann-Gisler [24].
• All the models that have been studied in GLM Chapter 7 can also be studied
in the Bayesian sense as illustrated above.
• Theorem 8.8 gives an additional way of parameter estimation within the
exponential dispersion family. In contrast to the MLEs and the minimum
variance estimators, this Bayesian way also allows us to include prior
information, which may come from experts or from similar business.
Moreover, parameter uncertainty is quantified by the posterior distribution.
• This Bayesian idea can be extended to other families of distributions, for
example the Pareto-gamma case is treated in Section 2.6 of Bühlmann-Gisler
[24].
Example 8.12 (gamma-gamma model). We close this section with the example of
the gamma-gamma model. We recall Example 7.10. Choose fixed volumes wt > 0,
t = 1, . . . , T , and dispersion parameter φ = 1/γ > 0. Assume that conditionally,
given Θ > 0, X1 , . . . , XT are independent gamma distributed with densities
$$ f_{X_t}(x; \Theta, \phi) = \frac{(\Theta w_t/\phi)^{w_t/\phi}}{\Gamma(w_t/\phi)}\, x^{w_t/\phi - 1} \exp\{-\Theta w_t/\phi\; x\} \quad\text{for } x \in \mathbb{R}_+. $$
This is the form used in (7.18) with c = Θ/φ. Observe that the range of the random
variables Xt is R+ and that we obtain well-defined gamma densities on R+ for all
Θ ∈ R+ and all t = 1, . . . , T . This motivates the choice of the open set $\widetilde{\boldsymbol{\Theta}} = \mathbb{R}_+$
for the possible parameter choices Θ.
Thus, we need to show two things: (i) the density $f_{X_t}(x; \Theta, \phi)$ belongs to the
exponential dispersion family for a particular cumulant function $b : \boldsymbol{\Theta} \to \mathbb{R}$; (ii)
this will allow us to define the conjugate prior density $\pi_{x_0,\tau}$ for which we would like
to show that we can apply Theorem 8.9.
Item (i) was already done in Example 7.10, however we will do it once more because
the signs need a careful treatment.
$$ \begin{aligned} f_{X_t}(x; \Theta, \phi) &= \Theta^{w_t/\phi} \exp\{-\Theta w_t/\phi\; x\} \exp\{c(x, \phi, w_t)\} \\ &= \exp\left\{ \log\Theta\; w_t/\phi - \Theta w_t/\phi\; x \right\} \exp\{c(x, \phi, w_t)\} \\ &= \exp\left\{ \frac{x(-\Theta) - (-\log(-(-\Theta)))}{\phi/w_t} \right\} \exp\{c(x, \phi, w_t)\}. \end{aligned} $$
The last formula seems to be a waste of minus signs, but with the definitions
ϑ = −Θ and b(ϑ) = − log(−ϑ) for ϑ < 0 we see that the gamma density belongs
to the exponential dispersion family, that is, by a slight abuse of notation in $f_{X_t}$,
$$ f_{X_t}(x; \vartheta, \phi) = \exp\left\{ \frac{x\vartheta - b(\vartheta)}{\phi/w_t} \right\} \exp\{c(x, \phi, w_t)\}. $$
Moreover, we set $\boldsymbol{\Theta} = -\widetilde{\boldsymbol{\Theta}} = -\mathbb{R}_+$ for the domain of b. Corollary 7.9 then implies
for all t = 1, . . . , T
$$ E[X_t|\Theta] = b'(\vartheta) = \frac{1}{-\vartheta} = \Theta^{-1} \in \mathbb{R}_+. $$
This completes task (i).
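(ii) The prior density on $\boldsymbol{\Theta}$ is then chosen by
$$ \pi_{x_0,\tau}(\vartheta) = \exp\left\{ \frac{x_0\vartheta - b(\vartheta)}{\tau^2} + d(x_0, \tau) \right\} \propto (-\vartheta)^{\frac{1}{\tau^2} + 1 - 1} \exp\left\{ -\frac{x_0}{\tau^2}(-\vartheta) \right\}. $$
This is a gamma density, set θ = −ϑ, with shape parameter $1 + 1/\tau^2 > 0$ and scale
parameter $x_0/\tau^2$. This implies that we should choose I = R+ and τ > 0. In view of
Theorem 8.8 the assumptions are fulfilled because I is an open interval containing
all possible observations Xt , and thus Theorem 8.8 can be applied.
Next we observe that this density vanishes on the boundary of $\widetilde{\boldsymbol{\Theta}} = -\boldsymbol{\Theta}$ given
by the set {0} ∪ {∞}. Therefore, we have from Theorem 8.9 (we also perform the
whole calculation to back test the result)
$$ \begin{aligned} x_0 = E[b'(\vartheta)] = E\left[\Theta^{-1}\right] &= \int_{\mathbb{R}_+} \theta^{-1}\, \frac{(x_0/\tau^2)^{1 + 1/\tau^2}}{\Gamma(1 + 1/\tau^2)}\, \theta^{\frac{1}{\tau^2} + 1 - 1} \exp\left\{ -\frac{x_0}{\tau^2}\theta \right\} d\theta \\ &= \frac{(x_0/\tau^2)^{1 + 1/\tau^2}}{\Gamma(1 + 1/\tau^2)}\; \frac{\Gamma(1/\tau^2)}{(x_0/\tau^2)^{1/\tau^2}} \int_{\mathbb{R}_+} \frac{(x_0/\tau^2)^{1/\tau^2}}{\Gamma(1/\tau^2)}\, \theta^{\frac{1}{\tau^2} - 1} \exp\left\{ -\frac{x_0}{\tau^2}\theta \right\} d\theta = x_0. \end{aligned} $$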
Moreover, the posterior mean is given by
$$ E\left[\Theta^{-1} \,\big|\, X\right] = \hat{x}_T^{post} = \alpha_T\, \hat{x}_T^{MV} + (1 - \alpha_T)\, x_0, $$
with credibility weight
$$ \alpha_T = \frac{\sum_{t=1}^T w_t}{\sum_{t=1}^T w_t + \phi/\tau^2}. $$
Thus, with τ > 0 we can control the sensitivities of the estimates.
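The credibility formula of the gamma-gamma model can be back-tested numerically against a direct conjugate update of the gamma prior. The sketch below is an illustration under stated assumptions: the function name `gamma_gamma_posterior_mean` and the data are hypothetical, and the direct route uses that the likelihood is proportional to $\Theta^{w_t/\phi}\exp\{-\Theta w_t x/\phi\}$ in Θ, so that the gamma prior with shape $1 + 1/\tau^2$ and second parameter $x_0/\tau^2$ is updated by adding $\sum_t w_t/\phi$ and $\sum_t w_t X_t/\phi$, respectively.

```python
def gamma_gamma_posterior_mean(x0, tau2, phi, X, w):
    """Posterior mean of 1/Theta computed in two ways; returns (direct, credibility, alpha_T)."""
    sum_w = sum(w)
    sum_wx = sum(wt * xt for wt, xt in zip(w, X))

    # (a) direct conjugate update of the gamma prior on Theta
    shape_post = 1.0 + 1.0 / tau2 + sum_w / phi
    rate_post = x0 / tau2 + sum_wx / phi
    # for a gamma density with exponent shape - 1 and exponential rate r: E[1/Theta] = r / (shape - 1)
    direct = rate_post / (shape_post - 1.0)

    # (b) credibility formula of Theorem 8.9
    alpha = sum_w / (sum_w + phi / tau2)
    x_mv = sum_wx / sum_w
    cred = alpha * x_mv + (1.0 - alpha) * x0
    return direct, cred, alpha

# made-up prior parameters and observations
direct, cred, alpha_T = gamma_gamma_posterior_mean(
    x0=2.0, tau2=0.5, phi=2.0, X=[1.5, 3.0, 2.5], w=[1.0, 2.0, 1.0]
)
```

Both routes give the same premium here; decreasing `tau2` moves the weight away from the observations and towards the prior mean `x0`.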
8.2 Linear credibility estimation

In Model Assumptions 8.7 we have studied Bayesian models which were based on the
exponential dispersion family with conjugate priors. As a result we are able to
explicitly calculate the posterior distribution in these models and, moreover, this
posterior distribution belongs to the same class of distributions as the prior itself,
see Theorem 8.8. In many applied modeling problems we do not face such an ideal
situation. Nowadays there are powerful simulation techniques that can handle more
complicated problems. In the case of Bayesian analysis we can use
Markov chain Monte Carlo (MCMC) methods which will provide the posterior dis-
tribution in almost any situation where we can write down the posterior density
up to the normalizing constant. That is, whenever we have a posterior density of
the following form
$$ \pi(\theta|x) \propto f_\theta(x)\,\pi(\theta), $$
[Portrait: H. Bühlmann]

and the right-hand side of this proportionality is explicit as a function of θ, we can
use an acceptance-rejection simulation algorithm (within MCMC methods) which
allows us to approximate π(θ|x) empirically. For MCMC methods we refer to the related
literature, see for instance Congdon [27], Gilks et al. [47], Green
[50, 51], Johansen et al. [54] or Robert [78].
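To illustrate the idea of sampling from a posterior that is only known up to its normalizing constant, the following sketch implements a random-walk Metropolis algorithm, which is one simple instance of an MCMC method with an acceptance-rejection step. It is not taken from the cited literature. The target is the Poisson-gamma posterior of Theorem 8.2, chosen because the exact posterior mean is available for comparison; the parameter values, the proposal standard deviation, the chain length and the function names are assumptions made for this illustration.

```python
import math
import random

def log_target(lam, shape, rate):
    """Log of the unnormalized posterior density lam^(shape-1) * exp(-rate * lam) on lam > 0."""
    if lam <= 0.0:
        return -math.inf
    return (shape - 1.0) * math.log(lam) - rate * lam

def metropolis(shape, rate, n_steps=50000, burn_in=5000, step_sd=0.05, seed=1):
    """Random-walk Metropolis sampler; returns the samples after the burn-in phase."""
    rng = random.Random(seed)
    lam = shape / rate                       # start the chain at a reasonable value
    logp = log_target(lam, shape, rate)
    samples = []
    for i in range(n_steps):
        prop = lam + rng.gauss(0.0, step_sd)             # symmetric proposal
        logp_prop = log_target(prop, shape, rate)
        accept_prob = math.exp(min(0.0, logp_prop - logp))
        if rng.random() < accept_prob:                   # acceptance-rejection step
            lam, logp = prop, logp_prop
        if i >= burn_in:
            samples.append(lam)
    return samples

# posterior parameters of a made-up Poisson-gamma example: gamma + sum N = 10, c + sum v = 95
samples = metropolis(shape=10.0, rate=95.0)
mcmc_mean = sum(samples) / len(samples)
exact_mean = 10.0 / 95.0
```

Only ratios of the target density enter the acceptance probability, which is why the normalizing constant is never needed. The empirical mean of the chain should be close to the exact posterior mean, up to Monte Carlo error.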
Linear credibility theory is not based on simulation methods
but tries to approximate the posterior mean by the best linear
estimator. This we are going to describe more explicitly in this
section. The key model of this analysis is the model [25] of Hans Bühlmann
and Erwin Straub (1938-2004).
[Portrait: E. Straub]
8.2.1 Bühlmann-Straub (BS) model
Model 8.13 (Bühlmann-Straub (BS) model [25]). Assume we have I risk classes
and T random variables per risk class. Assume fixed volumes wi,t > 0, i = 1, . . . , I
and t = 1, . . . , T , are given.
• Conditionally, given Θi , the components of X i = (Xi,1 , . . . , Xi,T ) are inde-
pendent with the first two conditional moments given by
$$ E[X_{i,t}|\Theta_i] = \mu(\Theta_i), $$
$$ \mathrm{Var}(X_{i,t}|\Theta_i) = \frac{\sigma^2(\Theta_i)}{w_{i,t}}. $$
• The pairs (Θ1 , X 1 ), . . . , (ΘI , X I ) are independent and Θ1 , . . . , ΘI are i.i.d.

Throughout, we assume that the second moments are finite, i.e. $E[X_{i,t}^2] < \infty$ for
all i, t.
Remarks 8.14.
• We assume that each risk class i is characterized by a risk characteristic Θi
with range $\boldsymbol{\Theta}$. A priori (before having any observations Xi,t ) all risk classes
are considered to be similar which is expressed by the i.i.d. property of Θi .
This describes our prior knowledge about the risk classes.
• The conditional mean and variance are characterized by the two functions
$\mu : \boldsymbol{\Theta} \to \mathbb{R}$ and $\sigma^2 : \boldsymbol{\Theta} \to \mathbb{R}_+$; $\Theta \mapsto \mu(\Theta)$ and $\Theta \mapsto \sigma^2(\Theta)$.
• If we set I = 1, i.e. we only have one risk class, then an explicit example of the
BS Model 8.13 is given by the exponential dispersion family with conjugate
priors, Model Assumptions 8.7. The conditional mean and variance are then
modeled by, see (8.1),
$$ \mu(\Theta) = b'(\Theta) \quad\text{and}\quad \sigma^2(\Theta) = \phi\, b''(\Theta), $$
for the corresponding (sufficiently smooth) cumulant function $b : \boldsymbol{\Theta} \to \mathbb{R}$.
For the BS credibility estimator we define the following structural parameters:
$$ \mu_0 = E[\mu(\Theta_1)] \quad \text{collective mean}, \tag{8.3} $$
$$ \tau^2 = \mathrm{Var}(\mu(\Theta_1)) \quad \text{variance between risk classes}, \tag{8.4} $$
$$ \sigma^2 = E[\sigma^2(\Theta_1)] \quad \text{variance within risk class}. \tag{8.5} $$
8.2.2 Bühlmann-Straub credibility formula
The full Bayesian estimator for the (unknown) mean µ(Θi ) of risk class i is given
by
$$ \widehat{\mu(\Theta_i)} = E[\mu(\Theta_i)|X_1, \ldots, X_I]. \tag{8.6} $$
In the exponential dispersion family with conjugate priors this posterior mean can
be calculated explicitly, see Theorem 8.9. In most other situations, however, this is
not the case. Therefore, we approximate this posterior mean. We briefly describe
how this approximation is done. Assume that all considered random variables are
square integrable, thus we work in the Hilbert space $L^2(\Omega, \mathcal{F}, P)$ of square integrable
random variables, where the inner product is given by
$$ \langle X, Y \rangle = E[XY] \quad\text{for } X, Y \in L^2(\Omega, \mathcal{F}, P). $$
In this Hilbert space the random variables X 1 , . . . , X I generate the subspace G(X)
of all σ(X 1 , . . . , X I )-measurable random variables. The posterior mean $\widehat{\mu(\Theta_i)}$,
given by (8.6), is the element of the subspace G(X) that minimizes the $L^2$-distance
between this subspace G(X) and µ(Θi ). Since we have a Hilbert space this estimate
$\widehat{\mu(\Theta_i)}$ corresponds to the orthogonal projection of µ(Θi ) onto G(X). In the general
case this minimization and orthogonal projection to G(X), respectively, has a too
complicated form. To reduce this complexity we restrict the orthogonal projection
to simpler subsets L of G(X). This will provide approximations to $\widehat{\mu(\Theta_i)} \in G(X)$
in the more restricted subsets L ⊂ G(X). We define the following two subsets
$$ L(X, 1) = \left\{ \hat\mu = a_0 + \sum_{i,t} a_{i,t} X_{i,t};\; a_0 \in \mathbb{R},\; a_{i,t} \in \mathbb{R} \text{ for all } i, t \right\} \subset G(X), $$
$$ L_{\mu_0}(X) = \left\{ \hat\mu = \sum_{i,t} a_{i,t} X_{i,t};\; a_{i,t} \in \mathbb{R} \text{ for all } i, t \text{ and } E[\hat\mu] = \mu_0 \right\} \subset G(X). $$
The first subset L(X, 1) includes the constants which will imply unbiasedness of
the estimators, whereas in the second case for $L_{\mu_0}(X)$ we need to enforce unbi-
asedness by a side constraint.
Definition 8.15 (inhomogeneous and homogeneous credibility estimator). We as-
sume that the BS Model 8.13 is fulfilled with collective mean µ0 ∈ R.
The inhomogeneous (linear) credibility estimator of µ(Θi ) based on X 1 , . . . , X I is
defined by
$$ \widehat{\widehat{\mu(\Theta_i)}} = \arg\min_{\hat\mu \in L(X,1)} E\left[ (\hat\mu - \mu(\Theta_i))^2 \right]. $$
The homogeneous (linear) credibility estimator of µ(Θi ) based on X 1 , . . . , X I is
defined by
$$ \widehat{\widehat{\mu(\Theta_i)}}^{hom} = \arg\min_{\hat\mu \in L_{\mu_0}(X)} E\left[ (\hat\mu - \mu(\Theta_i))^2 \right]. $$
Remark 8.16. The inhomogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}$ is the best approx-
imation to µ(Θi ) (in the $L^2$-sense) among all linear estimators given by L(X, 1).
Because L(X, 1) is a subset of G(X), we immediately obtain for the mean square
error with the Pythagorean theorem for successive orthogonal projections
$$ E\left[ \left( \widehat{\widehat{\mu(\Theta_i)}} - \mu(\Theta_i) \right)^2 \right] = E\left[ \left( \widehat{\mu(\Theta_i)} - \mu(\Theta_i) \right)^2 \right] + E\left[ \left( \widehat{\widehat{\mu(\Theta_i)}} - \widehat{\mu(\Theta_i)} \right)^2 \right]. \tag{8.7} $$
The left-hand side describes the error of the inhomogeneous
credibility estimator which can be split (right-hand side) into
the error of the (best) Bayesian estimator and the approxima-
tion error by the inhomogeneous credibility estimator to the
Bayesian estimator, see Theorem 3.14 in Bühlmann-Gisler [24].
In a similar spirit the homogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}^{hom}$
is the best approximation to $\widehat{\widehat{\mu(\Theta_i)}}$ and $\widehat{\mu(\Theta_i)}$ within $L_{\mu_0}(X)$.
[Portrait: A. Gisler]

Theorem 8.17 (inhomogeneous and homogeneous BS estimator). We assume that
the BS Model 8.13 is fulfilled with parameters µ0 , τ 2 and σ 2 given by (8.3)-(8.5).
The inhomogeneous credibility estimator is given by
$$ \widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T}\, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T})\, \mu_0, $$
with credibility weight αi,T and observation based estimator $\widehat{X}_{i,1:T}$
$$ \alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \sigma^2/\tau^2} \quad\text{and}\quad \widehat{X}_{i,1:T} = \frac{1}{\sum_{t=1}^T w_{i,t}} \sum_{t=1}^T w_{i,t} X_{i,t}. $$
The homogeneous credibility estimator is given by
$$ \widehat{\widehat{\mu(\Theta_i)}}^{hom} = \alpha_{i,T}\, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T})\, \hat\mu_T, $$
with estimate
$$ \hat\mu_T = \frac{1}{\sum_{i=1}^I \alpha_{i,T}} \sum_{i=1}^I \alpha_{i,T}\, \widehat{X}_{i,1:T}. $$
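The two estimators of Theorem 8.17 are easy to evaluate once the structural parameters are given. The sketch below assumes that σ² and τ² are known (in practice they need to be estimated, which is not covered here) and uses made-up observations and volumes; the function name `buhlmann_straub` is hypothetical.

```python
def buhlmann_straub(X, w, sigma2, tau2, mu0=None):
    """Bühlmann-Straub credibility estimators for all risk classes.

    X[i][t], w[i][t]: observations and volumes of risk class i in period t.
    If mu0 is given, the inhomogeneous estimator is returned; otherwise the
    homogeneous estimator, which replaces mu0 by the estimate mu_T.
    Returns (estimates, credibility weights).
    """
    kappa = sigma2 / tau2                       # credibility coefficient sigma^2 / tau^2
    alphas, means = [], []
    for Xi, wi in zip(X, w):
        w_sum = sum(wi)
        alphas.append(w_sum / (w_sum + kappa))  # credibility weight of risk class i
        means.append(sum(wt * xt for wt, xt in zip(wi, Xi)) / w_sum)  # weighted average
    if mu0 is None:
        mu = sum(a * m for a, m in zip(alphas, means)) / sum(alphas)  # estimate mu_T
    else:
        mu = mu0
    return [a * m + (1.0 - a) * mu for a, m in zip(alphas, means)], alphas

# made-up example with two risk classes and two periods
X = [[1.0, 2.0], [3.0, 5.0]]
w = [[10.0, 10.0], [40.0, 40.0]]

inhom, alphas = buhlmann_straub(X, w, sigma2=20.0, tau2=1.0, mu0=2.0)
hom, _ = buhlmann_straub(X, w, sigma2=20.0, tau2=1.0)
```

In this example the large risk class receives credibility weight 0.8 and the small one 0.5, so the estimate of the small class is pulled more strongly towards the collective mean.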
tes
Proof of Theorem 8.17. The theorem can be proved by brute force doing convex optimizations (using the method of Lagrange in the latter case) or we can apply Hilbert space techniques using projection properties, see Chapters 3 and 4 in Bühlmann-Gisler [24]. We do the brute force calculation because it is quite straightforward. We minimize
$$h(a) = E\left[ \left( a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i) \right)^2 \right]$$
over all possible choices $a_0, a_{l,t} \in \mathbb{R}$. This requires that we calculate all derivatives w.r.t. these parameters and set them equal to zero:
$$\frac{\partial}{\partial a_0} h(a) = 2 E\left[ a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i) \right] \overset{!}{=} 0, \qquad (8.8)$$
$$\frac{\partial}{\partial a_{j,s}} h(a) = 2 E\left[ X_{j,s} \left( a_0 + \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i) \right) \right] \overset{!}{=} 0. \qquad (8.9)$$
Equation (8.8) immediately implies that
$$a_0 = \mu_0 \left( 1 - \sum_{l,t} a_{l,t} \right).$$
Plugging this into (8.9) and using (8.8) once more immediately gives for all $j, s$ the requirement
$$\mathrm{Cov}\left( X_{j,s}, \sum_{l,t} a_{l,t} X_{l,t} - \mu(\Theta_i) \right) = 0.$$
Using the uncorrelatedness between different risk classes (which is implied by the independence) we obtain the following (normal) equations, see Corollary 3.17 and Section 4.3 in Bühlmann-Gisler [24],
$$a_0 = \mu_0 \left( 1 - \sum_{l,t} a_{l,t} \right), \qquad (8.10)$$
$$\mathrm{Cov}\left( X_{j,s}, \mu(\Theta_i) \right) = \sum_{t=1}^T a_{j,t} \, \mathrm{Cov}\left( X_{j,s}, X_{j,t} \right) \qquad \text{for all } j, s. \qquad (8.11)$$
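We calculate these last covariance terms
$$\mathrm{Cov}(X_{j,s}, X_{j,t}) = E\left[ \mathrm{Cov}(X_{j,s}, X_{j,t} \mid \Theta_j) \right] + \mathrm{Cov}\left( E[X_{j,s} \mid \Theta_j], E[X_{j,t} \mid \Theta_j] \right)$$
$$= \frac{1}{w_{j,s}} E\left[ \sigma^2(\Theta_j) \right] 1_{\{t=s\}} + \mathrm{Var}(\mu(\Theta_j)) = \frac{\sigma^2}{w_{j,s}} 1_{\{t=s\}} + \tau^2 > 0.$$
The first covariance is given by
$$\mathrm{Cov}(X_{j,s}, \mu(\Theta_i)) = \mathrm{Var}(\mu(\Theta_i)) \, 1_{\{j=i\}} = \tau^2 \, 1_{\{j=i\}}.$$
This implies that the left-hand side of (8.11) is equal to 0 for $j \neq i$ and because $\mathrm{Cov}(X_{j,s}, X_{j,t}) \ge \tau^2 > 0$ it follows that $a_{j,s} = 0$ for all $j \neq i$. This is not surprising because we have assumed that the different risk classes are independent. Therefore (8.10)-(8.11) reduce to
$$a_0 = \mu_0 \left( 1 - \sum_{t=1}^T a_{i,t} \right) \overset{\rm def.}{=} \mu_0 \left( 1 - \alpha_{i,T} \right), \qquad (8.12)$$
$$\tau^2 = a_{i,s} \frac{\sigma^2}{w_{i,s}} + \tau^2 \sum_{t=1}^T a_{i,t} = a_{i,s} \frac{\sigma^2}{w_{i,s}} + \tau^2 \alpha_{i,T} \qquad \text{for all } s. \qquad (8.13)$$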
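This defines $\alpha_{i,T} = \sum_{t=1}^T a_{i,t}$ and we still need to see that this credibility weight has the claimed form. Requirement (8.13) then implies for all $s$
$$a_{i,s} = (1 - \alpha_{i,T}) \frac{\tau^2}{\sigma^2} w_{i,s}.$$
If we sum this over $s$ we obtain
$$\alpha_{i,T} = \sum_{s=1}^T a_{i,s} = (1 - \alpha_{i,T}) \frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s}.$$
Solving this for $\alpha_{i,T}$ gives the following credibility weights
$$\alpha_{i,T} = \frac{\frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s}}{\frac{\tau^2}{\sigma^2} \sum_{s=1}^T w_{i,s} + 1} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \frac{\sigma^2}{\tau^2}},$$
and the $a_{i,s}$ are given by
$$a_{i,s} = \frac{\tau^2}{\sigma^2} \left( 1 - \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \frac{\sigma^2}{\tau^2}} \right) w_{i,s} = \frac{\tau^2}{\sigma^2} \, \frac{\frac{\sigma^2}{\tau^2}}{\sum_{t=1}^T w_{i,t} + \frac{\sigma^2}{\tau^2}} \, w_{i,s} = \alpha_{i,T} \frac{w_{i,s}}{\sum_{t=1}^T w_{i,t}}.$$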
If we collect all the terms we have found the following inhomogeneous credibility estimator
$$\widehat{\widehat{\mu(\Theta_i)}} = \alpha_{i,T} \frac{1}{\sum_{t=1}^T w_{i,t}} \sum_{s=1}^T w_{i,s} X_{i,s} + (1 - \alpha_{i,T}) \, \mu_0 = \alpha_{i,T} \, \widehat{X}_{i,1:T} + (1 - \alpha_{i,T}) \, \mu_0.$$
This proves the first claim. An important observation is that this credibility estimator is unbiased for $\mu_0$. Therefore, it coincides with the estimator we would have obtained by projecting onto
$$L_{\mu_0}(\mathbf{X}, 1) = L(\mathbf{X}, 1) \cap \left\{ \widehat{\mu} \in L^2(\Omega, \mathcal{F}, P) : E[\widehat{\mu}] = \mu_0 \right\}.$$
The proof for the homogeneous credibility estimator goes along the same lines as for the inhomogeneous one, using the method of Lagrange and replacing (8.8) by the side constraint
$$\mu_0 = E[\widehat{\mu}] = E\left[ \sum_{i,t} a_{i,t} X_{i,t} \right] = \sum_{i,t} a_{i,t} E[X_{i,t}] = \sum_{i,t} a_{i,t} \, \mu_0,$$
which implies $\sum_{i,t} a_{i,t} = 1$. An alternative proof would go by using the iterative property and the linearity of orthogonal projections on subspaces. For details we refer to Section 4.6 in Bühlmann-Gisler [24]. This closes the proof of Theorem 8.17. □
Remarks 8.18 (interpretation of the BS formula of Theorem 8.17). The BS formula provides the best linear approximations to the true premium $\mu(\Theta_i)$ and the Bayesian estimator $\widehat{\mu(\Theta_i)}$ in the $L^2$-sense, see also (8.7).
The inhomogeneous and the homogeneous credibility estimators are somewhat different which may also lead to different interpretations.
• For the inhomogeneous credibility estimator we assume that there is prior knowledge on $\mu(\Theta_i)$ in the form of the prior mean parameter $\mu_0$. This prior knowledge has uncertainty described by the variance parameter $\tau^2$ and the resulting estimator is the classical credibility weighted average between portfolio experience $\widehat{X}_{i,1:T}$ and prior knowledge $\mu_0$ which leads to the credibility weights $\alpha_{i,T}$. To calculate this estimator it is sufficient to have one risk class only.
• The homogeneous credibility estimator can be interpreted as the modified version of the inhomogeneous one if we do not have prior knowledge. In this case we extract additional information from similar portfolios. That is, we consider all risk classes simultaneously to obtain $\widehat{\mu}_T$ which replaces the prior knowledge $\mu_0$. The precision that is given to this overall knowledge $\widehat{\mu}_T$ depends on the volatility between the risk classes, i.e. on the significance of particular observations.
• The so-called credibility coefficient $\kappa$ is defined by, see Bühlmann-Gisler [24] page 84,
$$\kappa = \frac{\sigma^2}{\tau^2}. \qquad (8.14)$$
It describes the ratio of uncertainties within risk classes and between risk classes. This is the crucial ratio that determines the credibility weights
$$\alpha_{i,T} = \frac{\sum_{t=1}^T w_{i,t}}{\sum_{t=1}^T w_{i,t} + \kappa}.$$

The latter (homogeneous) case can now be used for tariffication of risk factors on different risk classes, similar to the GLM Chapter 7. The overall premium is given by $\widehat{\mu}_T$, the experience of risk class $i$ is given by $\widehat{X}_{i,1:T}$ and the credibility weight $\alpha_{i,T} \in (0, 1)$ explains how this information needs to be combined to obtain the risk adjusted premium of risk class $i$.
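
The role of the credibility coefficient can be illustrated with a minimal Python sketch. The function names and all numerical inputs below are hypothetical and only serve as illustration; the formulas are the credibility weight from (8.14) and the inhomogeneous estimator of Theorem 8.17.

```python
def credibility_weight(total_volume: float, kappa: float) -> float:
    """Credibility weight alpha = sum_t w_t / (sum_t w_t + kappa), kappa = sigma^2 / tau^2."""
    if total_volume < 0 or kappa <= 0:
        raise ValueError("need total_volume >= 0 and kappa > 0")
    return total_volume / (total_volume + kappa)


def inhomogeneous_estimate(x_hat: float, total_volume: float,
                           kappa: float, mu0: float) -> float:
    """Credibility weighted average of the observed mean x_hat and the prior mean mu0."""
    alpha = credibility_weight(total_volume, kappa)
    return alpha * x_hat + (1.0 - alpha) * mu0


# Hypothetical inputs: the weight grows towards 1 as the volume increases.
for volume in (100.0, 1_000.0, 10_000.0):
    alpha = credibility_weight(volume, kappa=1_000.0)
    print(f"volume {volume:>8.0f}: alpha = {alpha:.3f}")
```

A volume equal to $\kappa$ gives credibility weight 1/2, so $\kappa$ can be read as the volume at which individual experience and collective information receive equal weight.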
8.2.3 Estimation of structural parameters

In order to apply the homogeneous credibility estimator there remains the specification of the structural parameters $\sigma^2$ and $\tau^2$. We make the same choice as in Bühlmann-Gisler [24]. We define
$$\widehat{s}_i^2 = \frac{1}{T-1} \sum_{t=1}^T w_{i,t} \left( X_{i,t} - \widehat{X}_{i,1:T} \right)^2.$$
A straightforward calculation shows that this is an unbiased estimator for $\sigma^2(\Theta_i)$, conditionally given $\Theta_i$. But this immediately implies that $\widehat{s}_i^2$ is an unbiased estimator for $\sigma^2$ for all $i$. Therefore, we set
$$\widehat{\sigma}_T^2 = \frac{1}{I} \sum_{i=1}^I \widehat{s}_i^2, \qquad (8.15)$$
with $E[\widehat{\sigma}_T^2] = \sigma^2$. Observe that one risk class is sufficient to get an estimate for $\sigma^2$.
If we have prior knowledge $\mu_0$ then $\tau^2$ should be calibrated such that it quantifies the reliability of this prior knowledge. If we use the homogeneous credibility estimator then $\tau^2$ is estimated from the volatility between the risk classes (here we need more than one risk class $i$). Therefore, we define the weighted sample mean over all observations
$$\bar{X} = \frac{1}{\sum_{i,t} w_{i,t}} \sum_{i,t} w_{i,t} X_{i,t} = \frac{1}{\sum_{i,t} w_{i,t}} \sum_i \left( \sum_t w_{i,t} \right) \widehat{X}_{i,1:T}.$$
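In analogy to (2.7) we define
$$\widehat{v}_T^2 = \frac{I}{I-1} \sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}} \left( \widehat{X}_{i,1:T} - \bar{X} \right)^2.$$
Similar to Lemma 2.29 we can calculate the expected value of $\widehat{v}_T^2$ which then shows that we need to define
$$\widehat{t}_T^2 = c_w \left( \widehat{v}_T^2 - \frac{I \, \widehat{\sigma}_T^2}{\sum_{j,s} w_{j,s}} \right),$$
with constant
$$c_w = \frac{I-1}{I} \left[ \sum_i \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}} \left( 1 - \frac{\sum_t w_{i,t}}{\sum_{j,s} w_{j,s}} \right) \right]^{-1}.$$
This estimator has the unbiasedness property $E[\widehat{t}_T^2] = \tau^2$, we refer to Section 4.8 in Bühlmann-Gisler [24]. The only difficulty is that it might become negative which, of course, is nonsense for estimating $\tau^2$. Therefore, we set for the final estimator
$$\widehat{\tau}_T^2 = \max\left\{ \widehat{t}_T^2, 0 \right\}. \qquad (8.16)$$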
Example 8.19. We do Exercise 4.1 of Bühlmann-Gisler [24]. We have I = 5 risk classes and for every risk class we have T = 5 observations.

                        t = 1        2        3        4        5
risk class 1   v_{1,t}    729      786      872      951     1019
               S_{1,t}    583     1100      262      837     1630
               X_{1,t}  80.0%   139.9%    30.0%    88.0%   160.0%
risk class 2   v_{2,t}   1631     1802     2090     2300     2368
               S_{2,t}     99     1298      326      463      895
               X_{2,t}   6.1%    72.0%    15.6%    20.1%    37.8%
risk class 3   v_{3,t}    796      827      874      917      944
               S_{3,t}   1433      496      699     1742     1038
               X_{3,t} 180.0%    60.0%    80.0%   190.0%   110.0%
risk class 4   v_{4,t}   3152     3454     3715     3859     4198
               S_{4,t}   1765     4145     3121     4129     3358
               X_{4,t}  56.0%   120.0%    84.0%   107.0%    80.0%
risk class 5   v_{5,t}    400      420      422      424      440
               S_{5,t}     40        0      169     1018       44
               X_{5,t}  10.0%     0.0%    40.0%   240.1%    10.0%

Table 8.1: Observed claims Si,t and corresponding numbers of policies vi,t .
The data is provided in Table 8.1. We have claims Si,t and corresponding numbers
of policies vi,t . In order to apply the BS model we choose volumes wi,t = vi,t ,
i.e. the volumes wi,t are determined by the number of policies in the corresponding
cell (i, t) and we define the claims ratios Xi,t = Si,t /vi,t . Our aim is to apply the BS
model to (Xi,t )i,t . Observe that the application of the BS model is motivated by
the fact that some cells have small volumes and volatile claims ratios. Therefore,
Bayesian methods are applied to smooth the premia.
We would like to calculate the homogeneous credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}^{\,\rm hom}$ for the claims ratios of the risk classes $i = 1, \ldots, 5$, see Theorem 8.17. Therefore, we first need to estimate the structural parameters. With formulas (8.15) and (8.16) we obtain $\widehat{\sigma}_T^2 = 261.2$ and $\widehat{\tau}_T^2 = 0.1021$. This gives estimated credibility coefficient $\widehat{\kappa}_T = \widehat{\sigma}_T^2 / \widehat{\tau}_T^2 = 2558$ and from this we can estimate the credibility weights $\alpha_{i,T}$. The estimates are provided in Table 8.2.

                                          risk class 1  risk class 2  risk class 3  risk class 4  risk class 5
\widehat{\alpha}_{i,T}                           63.0%         79.9%         63.0%         87.8%         45.2%
\widehat{X}_{i,1:T}                             101.3%         30.2%        124.1%         89.9%         60.4%
\widehat{\widehat{\mu(\Theta_i)}}^{hom}          93.5%         40.3%        107.9%         88.7%         71.3%

Table 8.2: Estimated credibility weights $\widehat{\alpha}_{i,T}$, observation based estimate $\widehat{X}_{i,1:T}$ and homogeneous credibility estimate $\widehat{\widehat{\mu(\Theta_i)}}^{\,\rm hom}$ of the claims ratio at time T = 5.

We see that in risk class 4 we have big volumes $v_{4,t}$ which results in a high credibility weight estimate of $\widehat{\alpha}_{4,T} = 87.8\%$. In risk class 5 we have small volumes $v_{5,t}$ which results in a low credibility weight estimate of $\widehat{\alpha}_{5,T} = 45.2\%$. From this we calculate the credibility weighted overall claims ratio $\widehat{\mu}_T = 80.4\%$ (which should be compared to the sample mean $\bar{X} = 77.9\%$) and from this we finally calculate the homogeneous credibility estimators for the claims ratios, see Table 8.2. We observe smoothing of $\widehat{X}_{i,1:T}$ towards $\widehat{\mu}_T$ according to the credibility weights $\widehat{\alpha}_{i,T}$.
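
The calculations of Example 8.19 can be retraced with the following Python sketch. It is only an illustration of formulas (8.15), (8.16) and Theorem 8.17 applied to the data of Table 8.1; the function name `buhlmann_straub_homogeneous` and the returned dictionary keys are hypothetical choices, and the case of a zero estimate for $\tau^2$ is simply rejected here (an assumption of this sketch, not a rule taken from the text).

```python
# Data of Table 8.1: volumes v[i][t] and claims S[i][t], I = 5 risk classes, T = 5 years.
v = [
    [729, 786, 872, 951, 1019],
    [1631, 1802, 2090, 2300, 2368],
    [796, 827, 874, 917, 944],
    [3152, 3454, 3715, 3859, 4198],
    [400, 420, 422, 424, 440],
]
S = [
    [583, 1100, 262, 837, 1630],
    [99, 1298, 326, 463, 895],
    [1433, 496, 699, 1742, 1038],
    [1765, 4145, 3121, 4129, 3358],
    [40, 0, 169, 1018, 44],
]


def buhlmann_straub_homogeneous(w, X):
    """Homogeneous credibility estimates for volumes w[i][t] and claims ratios X[i][t]."""
    I, T = len(w), len(w[0])
    w_i = [sum(row) for row in w]          # total volume per risk class
    w_tot = sum(w_i)                       # total volume over all cells
    # Observation based estimators: volume weighted mean per risk class.
    x_hat = [sum(a * b for a, b in zip(w[i], X[i])) / w_i[i] for i in range(I)]
    # Estimate of sigma^2, formula (8.15).
    s2 = [sum(a * (b - x_hat[i]) ** 2 for a, b in zip(w[i], X[i])) / (T - 1)
          for i in range(I)]
    sigma2 = sum(s2) / I
    # Estimate of tau^2, formula (8.16).
    x_bar = sum(w_i[i] * x_hat[i] for i in range(I)) / w_tot
    shares = [w_i[i] / w_tot for i in range(I)]
    v2 = I / (I - 1) * sum(p * (x - x_bar) ** 2 for p, x in zip(shares, x_hat))
    c_w = (I - 1) / I / sum(p * (1.0 - p) for p in shares)
    tau2 = max(c_w * (v2 - I * sigma2 / w_tot), 0.0)
    if tau2 == 0.0:
        raise ValueError("tau^2 estimate is zero; credibility weights degenerate")
    kappa = sigma2 / tau2
    # Credibility weights, overall mean and homogeneous estimates (Theorem 8.17).
    alpha = [wi / (wi + kappa) for wi in w_i]
    mu_hat = sum(a * x for a, x in zip(alpha, x_hat)) / sum(alpha)
    estimates = [a * x + (1.0 - a) * mu_hat for a, x in zip(alpha, x_hat)]
    return {"sigma2": sigma2, "tau2": tau2, "kappa": kappa, "alpha": alpha,
            "x_hat": x_hat, "x_bar": x_bar, "mu_hat": mu_hat, "estimates": estimates}


# Claims ratios X[i][t] = S[i][t] / v[i][t]; volumes w = v as in the example.
X = [[s / vol for s, vol in zip(S[i], v[i])] for i in range(len(v))]
result = buhlmann_straub_homogeneous(v, X)
print(f"sigma2 = {result['sigma2']:.1f}, tau2 = {result['tau2']:.4f}")
print("alpha     :", [f"{a:.1%}" for a in result["alpha"]])
print("estimates :", [f"{m:.1%}" for m in result["estimates"]])
```

Up to rounding, the printed values should agree with the figures quoted in the example and in Table 8.2.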
Exercise 14.
(a) Choose the data of Table 8.1 and calculate the inhomogeneous credibility estimators $\widehat{\widehat{\mu(\Theta_i)}}$ for the claims ratios under the assumption that the collective mean is given by $\mu_0 = 90\%$ and the variance between risk classes is given by $\tau^2 = 0.20$.
(b) What changes if the variance between risk classes is given by $\tau^2 = 0.05$?
8.2.4 Prediction error in the Bühlmann-Straub model

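Observe that the credibility estimator $\widehat{\widehat{\mu(\Theta_i)}}$ is used to estimate $\mu(\Theta_i)$ and to predict next year's claim $X_{i,T+1}$. Similar to (1.8) we can analyze the total prediction error
$$X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}} = \left( X_{i,T+1} - \mu(\Theta_i) \right) + \left( \mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}} \right).$$
If we assume that $X_{i,1}, \ldots, X_{i,T+1}$ are independent, conditionally given $\Theta_i$, then we obtain from unbiasedness
$$E\left[ \left( X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}} \right)^2 \right] = E\left[ \left( X_{i,T+1} - \mu(\Theta_i) \right)^2 \right] + E\left[ \left( \mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}} \right)^2 \right]$$
$$= E\left[ \mathrm{Var}\left( X_{i,T+1} \mid \Theta_i \right) \right] + E\left[ \left( \mu(\Theta_i) - \widehat{\widehat{\mu(\Theta_i)}} \right)^2 \right] = \frac{\sigma^2}{w_{i,T+1}} + (1 - \alpha_{i,T}) \, \tau^2, \qquad (8.17)$$
see Theorem 4.3 in Bühlmann-Gisler [24]. Similarly we obtain for the homogeneous credibility estimator, see Theorem 4.6 in Bühlmann-Gisler [24],
$$E\left[ \left( X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\,\rm hom} \right)^2 \right] = \frac{\sigma^2}{w_{i,T+1}} + (1 - \alpha_{i,T}) \, \tau^2 \left( 1 + \frac{1 - \alpha_{i,T}}{\sum_i \alpha_{i,T}} \right). \qquad (8.18)$$
The expressions in (8.17) and (8.18) are called mean square error of prediction (MSEP). We will come back to this notion in Section 9.3 and for a comprehensive treatment we refer to Section 3.1 in Wüthrich-Merz [87].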
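
The two MSEP formulas translate directly into code. The following Python sketch evaluates (8.17) and (8.18) for given structural parameters; the function names and the numerical inputs are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def msep_inhomogeneous(sigma2: float, tau2: float, alpha: float, w_next: float) -> float:
    """MSEP (8.17): process variance sigma2 / w_next plus estimation error (1 - alpha) * tau2."""
    return sigma2 / w_next + (1.0 - alpha) * tau2


def msep_homogeneous(sigma2: float, tau2: float, alpha: float,
                     alpha_sum: float, w_next: float) -> float:
    """MSEP (8.18): as (8.17), with an extra term for estimating the collective mean."""
    return sigma2 / w_next + (1.0 - alpha) * tau2 * (1.0 + (1.0 - alpha) / alpha_sum)


# Hypothetical inputs: sigma^2 = 200, tau^2 = 0.1, alpha = 0.5, sum of all alphas = 2,
# next year's volume w_{i,T+1} = 1000.
print(msep_inhomogeneous(200.0, 0.1, 0.5, 1000.0))
print(msep_homogeneous(200.0, 0.1, 0.5, 2.0, 1000.0))
```

The homogeneous MSEP is never smaller than the inhomogeneous one for the same parameters, since the additional factor in (8.18) is at least 1.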
Exercise 15. Estimate the prediction uncertainty $E[(X_{i,T+1} - \widehat{\widehat{\mu(\Theta_i)}}^{\,\rm hom})^2]$ for the data of Example 8.19 under the assumption that the volume grows 5% in each risk class.
Exercise 16. We consider Example 4.1 of Bühlmann-Gisler [24]. The observed numbers of policies $v_i$ and claims counts $N_i$ in 21 different regions are given in Table 8.3.

region i        v_i       N_i
 1           50’061     3’880
 2           10’135       794
 3          121’310     8’941
 4           35’045     3’448
 5           19’720     1’672
 6           39’092     5’186
 7            4’192       314
 8           19’635     1’934
 9           21’618     2’285
10           34’332     2’689
11           11’105       661
12           56’590     4’878
13           13’551     1’205
14           19’139     1’646
15           10’242       850
16           28’137     2’229
17           33’846     3’389
18           61’573     5’937
19           17’067     1’530
20            8’263       671
21          148’872    15’014
total       763’525    69’153

Table 8.3: Observed volumes $v_i$ and claims counts $N_i$ in regions $i = 1, \ldots, 21$.

Calculate the homogeneous credibility estimators for each region $i$ under the assumption that $N_i | \Theta_i$ has a Poisson distribution with mean $\mu(\Theta_i) v_i = \Theta_i \lambda_0 v_i$.
Hint: For the estimation of the credibility coefficient $\kappa = \sigma^2 / \tau^2$ one should use that $N_i | \Theta_i$ is Poisson distributed which has direct consequences for the corresponding variance $\sigma^2(\Theta_i)$, see also Proposition 2.8.


Chapter 9
Claims Reserving
This chapter will give a completely new perspective on non-life insurance business which has not been tackled so far in these notes. Until now we have assumed that the total claim amount for a fixed accounting year can be described by a compound distribution of the form
$$S_t = \sum_{i=1}^{N_t} Y_i^{(t)},$$
where $t = 1, \ldots, T$ denotes the different accounting years and $N_t$ counts the number of claims in accounting year $t$. This was the base model for the study of the surplus process $(C_t)_{t \in \mathbb{N}_0}$ in Chapter 5 and it was also the base assumption for parameter estimation (based on past claims experience) for the prediction of future claims. This model suggests that we have $N_t$ claims in accounting year $t$ and their claim sizes $Y_1^{(t)}, \ldots, Y_{N_t}^{(t)}$ describe the total payouts to the insured. The issue in practice is that a typical non-life insurance claim cannot be settled immediately at occurrence, i.e. if $Y_i^{(t)}$ describes the claim amount of claim $i$ in accounting year $t$ then, in general, this claim amount is not observable at time $t$ due to a settlement delay that allows for a final assessment only later. Likewise $N_t$ is not observable at the end of accounting year $t$ because there might be claims that have occurred in year $t$ but which are reported only later. We describe reasons for such reporting and settlement delays in the next section. As a consequence we need to predict future cash flows of claims that have occurred in the past in order to have a sound basis for pricing future insurance contracts. This task is exactly the claims reserving problem and it assesses outstanding loss liabilities for past claims. The prediction of these outstanding loss liabilities constitutes the claims reserves. Importantly, these claims reserves typically are the largest position on the liability side of the balance sheet of a non-life insurance company and are essential for the financial strength of the company. Therefore, we aim to describe the claims reserving process in this chapter and we also would like to describe the uncertainties involved.
9.1 Outstanding loss liabilities

Claims in non-life insurance are triggered by an accident which is an event that
causes (financial) damages covered by an insurance contract. The date of claims
occurrence is called the accident date. Typically, time elapses until such a claim is
in the administrative system of the insurance company and is available for statis-
tical analysis. The time lag between the accident date and the registration at the
insurance company is called reporting delay and the date of registration is termed
reporting date.
The reporting delay can be small, say days, but it can also be very large, for example several years. Reasons for such reporting delays are that claims are not immediately reported to the insurance company, for instance, a stolen bike is only reported once it is clear that it will not “reappear”, but of course the accident date is the day the bike was stolen. Large reporting delays are typically caused by claims which are not immediately noticed. A common example is an asbestos claim which is typically caused a long time before cancer is diagnosed and reported. The accident date refers to the event when there was contact with asbestos, the trigger of the cancer, and not to the date of the breakout of the asbestos disease.
Once a claim is reported to the insurance company it typically cannot be settled immediately. The insurance company starts an investigation, observes the recovery process, waits for external information, external bills, court decisions, etc. This process may last for several years for more involved claims. Of course, the insurance company cannot wait with claims payments until there is a final assessment of the claim but it will continuously pay for justified claims benefits. Therefore, insurance claims trigger a whole sequence of cash flows after the reporting date. This period is called settlement period and the final assessment of a claim is called settlement date or closing date.
Thus, we have three important (ordered) dates for non-life insurance claims:
accident date T1 ≤ reporting date T2 ≤ settlement date T3 .

In addition, there are the following two important dates: beginning of insurance period $U_1$ and ending of insurance period $U_2 > U_1$, we always assume $U_2 < \infty$. Typically, the insurance company is only liable for a claim if $T_1 \in [U_1, U_2]$, thus, we only consider claims that have accident dates $T_1$ which fall into the insured period $[U_1, U_2]$ specified in the insurance contract.

[Figure 9.1 (time line showing the insurance period $[U_1, U_2]$, accident date $T_1$, reporting date $T_2$, claims payments and claims closing $T_3$): Non-life insurance run-off showing insurance period $[U_1, U_2]$ and a claim with accident date $T_1 \in [U_1, U_2]$, reporting date $T_2 > U_2$ and settlement date $T_3 > T_2$. Moreover, we have claims payments during the settlement period.]

If we denote today’s time point by $t \ge U_1$ we can have four different situations:
1. $t < T_1$. Such (possible) claims have not yet occurred. If the company is “lucky” then $T_1 > U_2$. This means that it is not liable for this particular claim with the actual insurance policy because the contract is already terminated at claims occurrence. Be careful, the company may still be liable for this particular claim, namely, if the contract is renewed and $T_1$ falls into the renewed insurance period, but renewals are not of interest for the present discussion.
In this first case $t < T_1$ the only information available at the insurance company is the insurance contracts signed, i.e. the exposure for which it is liable in case of a claims occurrence $T_1 \in [U_1, U_2]$.
2. $T_1 \le t < T_2$ and $T_1 \in [U_1, U_2]$. In this case the insurance claim has occurred but it has not yet been reported to the insurance company. These claims are called incurred but not yet reported (IBNyR) claims. For such claims we do not have any individual claims information (because it is IBNyR) but we already have external information, like economic environment (e.g. unemployment rate, inflation rate, financial distress), weather conditions (storm, flood, earthquake, etc.), nuclear accident, flu epidemic, and so on. This external information already gives a hint whether we should expect more or fewer claims to be reported.
3. $T_2 \le t < T_3$ and $T_1 \in [U_1, U_2]$. These claims are reported at the company but the final assessment is still missing. Typically, we are in the situation where more and more information about the claim arrives, i.e. the prediction uncertainty in the final assessment decreases. However, these claims are not completely resolved and therefore they are called incurred but not enough reported (IBNeR) claims or reported but not settled (RBNS) claims. The settlement period $[T_2, T_3]$ is also the period within which claims payments are done, see Figure 9.1.
During the settlement period we receive more and more information on the individual claim like accident date, cause of accident, type of accident, line-of-business and contracts involved, claims assessments and predictions by claims adjusters, payments already done, etc.

4. $T_3 < t$ and $T_1 \in [U_1, U_2]$. The claim is settled, the file is closed and stored and we expect no further payments for that claim. In some circumstances, it may be necessary that a claim file is re-opened due to unexpected further claims development. If this happens too often then the files are probably closed too early and the claims settlement philosophy should be reviewed in that particular company. A systematic re-opening may also ask for a special assessment for unexpected re-openings, for example, for contracts with a cover for relapses that is unlimited in time.
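
The four situations can be summarized in a small Python sketch. The names `ClaimStatus` and `claim_status` are hypothetical, the coverage check $T_1 \in [U_1, U_2]$ is left out for brevity, and the boundary case $t = T_3$ (which is not covered by the list above) is treated as settled; this last choice is an assumption of the sketch.

```python
from enum import Enum


class ClaimStatus(Enum):
    NOT_OCCURRED = "not yet occurred"
    IBNYR = "incurred but not yet reported"
    RBNS = "reported but not settled (IBNeR)"
    SETTLED = "settled"


def claim_status(t: float, accident: float, reporting: float, settlement: float) -> ClaimStatus:
    """Classify a claim at time t given its accident, reporting and settlement dates."""
    if not accident <= reporting <= settlement:
        raise ValueError("dates must satisfy T1 <= T2 <= T3")
    if t < accident:
        return ClaimStatus.NOT_OCCURRED    # situation 1: t < T1
    if t < reporting:
        return ClaimStatus.IBNYR           # situation 2: T1 <= t < T2
    if t < settlement:
        return ClaimStatus.RBNS            # situation 3: T2 <= t < T3
    return ClaimStatus.SETTLED             # situation 4 (t = T3 included by assumption)


# Hypothetical claim: accident at 1.0, reported at 2.0, settled at 4.0.
for today in (0.5, 1.5, 3.0, 4.5):
    print(today, claim_status(today, 1.0, 2.0, 4.0).value)
```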
To give statistical statements about insurance contracts and claims behavior, insurance companies build homogeneous groups and sub-portfolios to which a LLN applies. In non-life insurance, contracts are often grouped into business lines such as private property, commercial property, private liability, commercial liability, accident insurance, health insurance, motor third party liability insurance, motor hull insurance, etc. If this classification is too rough it can further be divided into sub-portfolios, for example, private property can be divided by hazard categories like fire, water, theft, etc. Often such sub-classes are built by geographical markets and for different legislations.
Once these (hopefully) homogeneous risk classes are built we can study all claims that belong to such a sub-portfolio. These claims are further categorized by the accident date. Claims that fall into the same period are triggered by similar external factors like weather conditions and economic environment, therefore such a classification is reasonable. Since the usual time scale for insurance contracts and business consolidation is years, claims are typically gathered on the yearly time scale. Therefore, we consider accounting years denoted by $k \in \mathbb{N}$. All claims that have occurrence date $T_1 \in (k - 1, k]$ are called claims with accident year $k$. These claims generate cash flows which are also considered on the consolidated yearly level, i.e. all payments that are done in the same accounting year are aggregated.
This motivates the following claims reserving notation for fixed $i \in \mathbb{N}$ and $j \in \mathbb{N}_0$
$X_{i,j}$ = all payments done for claims with accident year $i$ in accounting year $k = i + j \in \mathbb{N}$.
Thus, we consider all claims (for a given sub-portfolio) which have accident dates $T_1 \in (i - 1, i]$ in the same year, i.e. the same accident year $i$. For these claims we consider aggregate cash flows which are further sub-divided by their payment delays denoted by $j \in \mathbb{N}_0$ and called development years. For instance,
$X_{i,0}$ = payments in year $(i - 1, i]$ for claims with accident year $i$;
$X_{i,1}$ = payments in year $(i, i + 1]$ for claims with accident year $i$;
$X_{i,j}$ = payments in year $(i - 1 + j, i + j]$ for claims with accident year $i$.
Moreover, a common assumption is that there is a fixed maximal settlement delay $J \in \mathbb{N}$, i.e. $X_{i,j} \equiv 0$ for all development years $j \ge J$. Of course, this maximal settlement delay $J$ depends on the business line considered, typically it is smaller for property insurance and larger for liability insurance. At time $t \in \mathbb{N}$, with $t \ge J$, this motivates the graphical representation given in Table 9.1.

accident                       development years j
year i       0              1              ...    j    ...    J−1
1            X_{1,0}        X_{1,1}        ...                X_{1,J−1}
...          ...            ...                               ...
t−J+1        X_{t−J+1,0}    X_{t−J+1,1}    ...                X_{t−J+1,J−1}
...
i                 observations D_t
...                                        to be predicted D_t^c
t−1
t            X_{t,0}        X_{t,1}        ...                X_{t,J−1}

Table 9.1: Claims development triangle/trapezoid $D_t$ at time $t \ge J$.

This table displays three time axes: (1) accident year axis $i \in \{1, \ldots, t\}$ (vertical axis); (2) the development year axis $j \in \{0, \ldots, J - 1\}$ (horizontal axis); and (3) the accounting year axis $k = i + j \in \{1, \ldots, t + J - 1\}$ (diagonal axis). In claims reserving all three time axes are important: (1) $i$ collects the claims with the same accident year; (2) $j$ describes payments with the same payment delay (relative to the accident year); and (3) $k = i + j$ describes the payments that are done in the same accounting year (and hence are influenced by the same external factors like inflation). Therefore, we denote the accounting year payments by
$$X_k = \sum_{i+j=k} X_{i,j} = \sum_{i=1 \vee (k-J+1)}^{t \wedge k} X_{i,k-i} = \sum_{j=0 \vee (k-t)}^{(J-1) \wedge (k-1)} X_{k-j,j}.$$
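
The summation bounds in the two representations of $X_k$ can be checked with a short Python sketch. The payments below are purely hypothetical toy numbers on a full rectangle with $t = 3$ accident years and $J = 3$ development years, and the function names are illustrative only.

```python
def accounting_year_payment(X, k, t, J):
    """Sum X[(i, k - i)] over accident years i = max(1, k-J+1), ..., min(t, k)."""
    first, last = max(1, k - J + 1), min(t, k)
    return sum(X[(i, k - i)] for i in range(first, last + 1))


def accounting_year_payment_by_delay(X, k, t, J):
    """Same diagonal sum, indexed by development years j = max(0, k-t), ..., min(J-1, k-1)."""
    first, last = max(0, k - t), min(J - 1, k - 1)
    return sum(X[(k - j, j)] for j in range(first, last + 1))


t, J = 3, 3
# Hypothetical payments X[(i, j)] for accident year i and development year j.
X = {(i, j): 100 * i + j for i in range(1, t + 1) for j in range(J)}

# Accounting years run over k = 1, ..., t + J - 1.
by_year = {k: accounting_year_payment(X, k, t, J) for k in range(1, t + J)}
print(by_year)
```

Both functions walk along the same diagonal $i + j = k$, so they return identical values, and summing over all accounting years recovers the total of all payments in the rectangle.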
At time t ∈ N we are liable for all claims that have occurred with accident years
i ≤ t. We call these claims past exposure claims. Some of these past exposure
claims are already settled (if the settlement date T3 ≤ t), others belong either to
the class IBNeR claims (if the reporting date T2 ≤ t but the settlement date T3 > t)
or to the class IBNyR claims (if the reporting date T2 > t).
On the aggregate level we have the following payment information at time t ∈ N
for past exposure claims
Dt = {Xi,j ; i + j ≤ t, 1 ≤ i ≤ t, 0 ≤ j ≤ J − 1} . (9.1)
This information exactly corresponds to the upper triangle (if $t = J$) or the upper trapezoid (if $t > J$) of Table 9.1. These past exposure claims will generate cash flows in future accounting years given by
$$D_t^c = \{ X_{i,j} ;\ i + j > t,\ 1 \le i \le t,\ 0 \le j \le J - 1 \}.$$
This corresponds to the lower triangle in Table 9.1. This lower triangle $D_t^c$ is called outstanding loss liabilities and it is the major object of interest. Namely, these outstanding loss liabilities constitute the liabilities of the insurance company originating from past exposures. In particular, the company needs to build appropriate provisions so that it is able to fulfill these future cash flows. These provisions are called claims reserves and they should satisfy the following requirements:
• the claims reserves should be evaluated such that they consider all relevant (past) information;
• the claims reserves should be a best-estimate for the outstanding loss liabilities adjusted for time value of money.
Basically, this means that we need to predict the lower triangle $D_t^c$ based on all available information $\mathcal{F}_t \supset D_t$ at time $t$. In particular, we need to define a stochastic model on the probability space $(\Omega, \mathcal{F}, P)$ (i) that allows to incorporate past information $\mathcal{F}_t \subset \mathcal{F}$; (ii) that reflects the characteristics of past observations $D_t$; (iii) that is able to predict future payments of the outstanding loss liabilities $D_t^c$; and (iv) that is able to attach time values to these future cash flows $X_{i,j}$, $i + j > t$.
Of course, this is ambitious and we will build such a stochastic model step-by-step. For the time-being we skip the task of attaching time values to cash flows and we only consider nominal payments. The total nominal claims payments for accident year $i$ are given by
$$\sum_{j=0}^{J-1} X_{i,j} = \sum_{l=1}^{N_i} Y_l^{(i)} = S_i,$$
thus, for assessing the total claim amount $S_i$ of accounting year $i$ we need to describe the claims settlement process $X_{i,0}, \ldots, X_{i,J-1}$. In particular, we need to predict the (unobserved) future cash flows for the outstanding loss liabilities to quantify the total claim $S_i$ of accounting year $i$.
We assume that the latest observed accident/accounting year is $t = I$ and we will do all considerations based on this accounting year.

The (nominal) best-estimate reserves at time $I \ge J$ for past exposure claims are then (under these model assumptions) defined by
$$R = \sum_{i+j>I} E\left[ X_{i,j} \mid \mathcal{F}_I \right] = \sum_{(i,j) \in \mathcal{I}_I^c} E\left[ X_{i,j} \mid \mathcal{F}_I \right],$$
where we define the sets $\mathcal{I}$, $\mathcal{I}_I$ and $\mathcal{I}_I^c$ of indexes as follows
$$\mathcal{I} = \{1, \ldots, I\} \times \{0, \ldots, J - 1\},$$
$$\mathcal{I}_I = \{ (i, j) \in \mathcal{I} ;\ i + j \le I \} \qquad \text{and} \qquad \mathcal{I}_I^c = \mathcal{I} \setminus \mathcal{I}_I,$$
i.e. $\mathcal{I}_I^c$ exactly corresponds to the lower triangle $D_I^c$. $(\mathcal{F}_t)_{t \ge 0}$ is a filtration on $(\Omega, \mathcal{F}, P)$ with $X_{i,j}$ being $\mathcal{F}_{i+j}$-measurable for all $(i, j) \in \mathcal{I}$.
The best-estimate reserves $R$ are a predictor for the (nominal) outstanding loss liabilities of past exposure claims given by
$$\sum_{(i,j) \in \mathcal{I}_I^c} X_{i,j},$$
based on the available information $\mathcal{F}_I$ at time $I$. Often $\mathcal{F}_t$ and $D_t$ are identified, i.e. there is no other information than the claims themselves.
The next question of interest is the uncertainty in this prediction called prediction uncertainty. That is, we investigate the possible fluctuation of the true cash flows around their best-estimate reserves. If the confidence interval is narrow, we can predict the outstanding loss liabilities accurately. If we obtain wide confidence bounds, an additional risk margin is necessary which protects against possible shortfalls in the cash flows. We will discuss this below.
9.2 Claims reserving algorithms

The title of this section contains the word “algorithms”. Initially, actuaries in the insurance industry designed algorithms that make it possible to determine claims reserves $R$. These algorithms need to be understood as guidelines to obtain claims reserves. Only much later did actuaries start to think about stochastic models underlying these algorithms. In this section we present claims reserving from this algorithmic point of view and in the next section we present stochastic models that support these algorithms.
The two most popular algorithms are the so-called chain-ladder (CL) algorithm and the Bornhuetter-Ferguson (BF) algorithm [16]. These two algorithms take different viewpoints. The CL algorithm takes the position that the observations $D_I$ are extrapolated into the lower triangle, the BF algorithm takes the position that the lower triangle $D_I^c$ is extrapolated independently of the observations using expert knowledge. Depending on the line of business and the progress of the claims development process one or the other may provide better predictions. Only actuarial experience may tell which one should be preferred in which particular situation. Therefore, we are going to present both algorithms from a rather mechanical point of view.
9.2.1 Chain-ladder algorithm

For the study of the CL algorithm we need to define cumulative payments
$$C_{i,j} = \sum_{l=0}^{j} X_{i,l},$$
that is, we sum all the payments $X_{i,l}$ for fixed accident years $i$ so that ultimately we obtain $C_{i,J-1} = S_i$, if $S_i$ denotes the total claim that corresponds to accident year $i$.
CL idea. All accident years $i \in \{1, \ldots, I\}$ behave similarly and for cumulative payments we have approximately
$$C_{i,j+1} \approx f_j \, C_{i,j}, \qquad (9.2)$$
for given factors $f_j > 0$. These factors $f_j$ are called CL factors, age-to-age factors or link ratios.
The structure (9.2) immediately provides the intuition for estimating the ultimate claim $C_{i,J-1}$ based on the observations $D_I$, namely, choose for every accident year $i$ the observation on the last observed diagonal, that is $C_{i,I-i}$, and multiply this observation with the successive CL factors $f_{I-i}, \ldots, f_{J-2}$.
The remaining difficulty is that, in general, the CL factors $f_j$ are not known and, hence, need to be estimated. Assuming that a volume weighted estimate provides the most reliable results we set in view of (9.2)
$$\widehat{f}_j^{\,CL} = \frac{\sum_{i=1}^{I-j-1} C_{i,j+1}}{\sum_{i=1}^{I-j-1} C_{i,j}} = \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sum_{n=1}^{I-j-1} C_{n,j}} \, \frac{C_{i,j+1}}{C_{i,j}}. \qquad (9.3)$$
This formula (9.3) expresses that we should divide the sums of observed successive columns by each other which exactly reflects (9.2).
Thus, we calculate a volume weighted average of the individual loss development ratios $C_{i,j+1} / C_{i,j}$ which have been observed in $D_I$. In Table 9.3 we provide the claims reserving example of Wüthrich-Merz [87].
Equipped with these CL factor estimators we predict the ultimate claim $C_{i,J-1}$ for $i > I - J + 1$ by
$$\widehat{C}_{i,J-1}^{\,CL} = C_{i,I-i} \prod_{j=I-i}^{J-2} \widehat{f}_j^{\,CL}, \qquad (9.4)$$
and, in general, $\widehat{C}_{i,n}^{\,CL} = C_{i,I-i} \prod_{j=I-i}^{n-1} \widehat{f}_j^{\,CL}$ for $i + n > I$.
The CL reserves at time $I$ for accident years $i > I - J + 1$ are given by
$$\widehat{R}_i^{\,CL} = \widehat{C}_{i,J-1}^{\,CL} - C_{i,I-i} = C_{i,I-i} \left( \prod_{j=I-i}^{J-2} \widehat{f}_j^{\,CL} - 1 \right),$$
and aggregated over all accident years we predict the outstanding loss liabilities of past exposure by
$$\widehat{R}^{\,CL} = \sum_{i > I-(J-1)} \widehat{R}_i^{\,CL}.$$
An example is presented in Tables 9.2, 9.3 and 9.4.

i \ j          0          1        2        3        4        5        6        7        8        9
1      5’946’975  3’721’237  895’717  207’760  206’704   62’124   65’813   14’850   11’130   15’813
2      6’346’756  3’246’406  723’222  151’797   67’824   36’603   52’752   11’186   11’646
3      6’269’090  2’976’223  847’053  262’768  152’703   65’444   53’545    8’924
4      5’863’015  2’683’224  722’532  190’653  132’976   88’340   43’329
5      5’778’885  2’745’229  653’894  273’395  230’288  105’224
6      6’184’793  2’828’338  572’765  244’899  104’957
7      5’600’184  2’893’207  563’114  225’517
8      5’288’066  2’440’103  528’043
9      5’290’793  2’357’936
10     5’675’568

Table 9.2: Observed payments $X_{i,j}$ with $(i, j) \in \mathcal{I}_I$ with $I = J = 10$ (rows: accident years $i$, columns: development years $j$).

no
0 1 2 3 4 5 6 7 8 9
1 5’946’975 9’668’212 10’563’929 10’771’690 10’978’394 11’040’518 11’106’331 11’121’181 11’132’310 11’148’124
2 6’346’756 9’593’162 10’316’383 10’468’180 10’536’004 10’572’608 10’625’360 10’636’546 10’648’192
3 6’269’090 9’245’313 10’092’366 10’355’134 10’507’837 10’573’282 10’626’827 10’635’751
4 5’863’015 8’546’239 9’268’771 9’459’424 9’592’399 9’680’740 9’724’068
5 5’778’885 8’524’114 9’178’009 9’451’404 9’681’692 9’786’916
6 6’184’793 9’013’132 9’585’897 9’830’796 9’935’753
7 5’600’184 8’493’391 9’056’505 9’282’022
8 5’288’066 7’728’169 8’256’211

9 5’290’793 7’648’729
10 5’675’568
$\hat f_j^{CL}$ 1.4925 1.0778 1.0229 1.0148 1.0070 1.0051 1.0011 1.0010 1.0014
Table 9.3: Observed cumulative payments $C_{i,j}$ with $(i,j) \in \mathcal{I}_I$ and estimated CL factors $\hat f_j^{CL}$.
Chapter 9. Claims Reserving
i \ j 0 1 2 3 4 5 6 7 8 9 $\widehat R_i^{CL}$
1 . . . . . . . . . . 0
2 . . . . . . . . . 10’663’318 15’126
3 . . . . . . . . 10’646’884 10’662’008 26’257
4 . . . . . . . 9’734’574 9’744’764 9’758’606 34’538
5 . . . . . . 9’837’277 9’847’906 9’858’214 9’872’218 85’302
6 . . . . . 10’005’044 10’056’528 10’067’393 10’077’931 10’092’247 156’494
7 . . . . 9’419’776 9’485’469 9’534’279 9’544’580 9’554’571 9’568’143 286’121
8 . . . 8’445’057 8’570’389 8’630’159 8’674’568 8’683’940 8’693’030 8’705’378 449’167
9 . . 8’243’496 8’432’051 8’557’190 8’616’868 8’661’208 8’670’566 8’679’642 8’691’971 1’043’242
10 . 8’470’989 9’129’696 9’338’521 9’477’113 9’543’206 9’592’313 9’602’676 9’612’728 9’626’383 3’950’815
total 6’047’061
Table 9.4: CL predicted cumulative payments $\widehat C^{CL}_{i,j}$ and estimated CL reserves $\widehat R_i^{CL}$. A dot marks a cell that is already observed at time $I$, see Table 9.3.
i      prior estimate $\hat\mu_i$   $\hat\beta^{CL}_{I-i}$   $\widehat C^{BF}_{i,J-1}$   $\widehat C^{CL}_{i,J-1}$   BF reserves $\widehat R_i^{BF}$   CL reserves $\widehat R_i^{CL}$
1      11’653’101  100.0%  11’148’124  11’148’124
2      11’367’306   99.9%  10’664’316  10’663’318     16’124     15’126
3      10’962’965   99.8%  10’662’749  10’662’008     26’998     26’257
4      10’616’762   99.6%   9’761’643   9’758’606     37’575     34’538
5      11’044’881   99.1%   9’882’350   9’872’218     95’434     85’302
6      11’480’700   98.4%  10’113’777  10’092’247    178’024    156’494
7      11’413’572   97.0%   9’623’328   9’568’143    341’305    286’121
8      11’126’527   94.8%   8’830’301   8’705’378    574’089    449’167
9      10’986’548   88.0%   8’967’375   8’691’971  1’318’646  1’043’242
10     11’618’437   59.0%  10’443’953   9’626’383  4’768’384  3’950’815
total                                              7’356’580  6’047’061
Table 9.5: Claims reserves from the BF method and the CL method.
9.2.2 Bornhuetter-Ferguson algorithm

The Bornhuetter-Ferguson (BF) method [16], named after Ronald Bornhuetter and Ronald E. Ferguson, is based on the assumption of having prior information $\hat\mu_i$ for the expected ultimate claim of accident year $i$. This prior information then allows us to predict $\mathcal{D}_I^c$ as soon as we have a so-called claims development pattern $(\gamma_j)_{j=0,\dots,J-1}$ which describes the proportions paid in each development year. Thus, the BF method extrapolates prior knowledge into the lower triangle $\mathcal{D}_I^c$ using a development pattern.
BF idea. All accident years $i \in \{1,\dots,I\}$ behave similarly and payments approximately behave as
$$X_{i,j} \approx \gamma_j\, \hat\mu_i, \tag{9.5}$$
for given prior information $\hat\mu_i$ and given development pattern $(\gamma_j)_{j=0,\dots,J-1}$ with normalization $\sum_{j=0}^{J-1} \gamma_j = 1$.
The prior value $\hat\mu_i$ should reflect the total expected ultimate claim $E[C_{i,J-1}] = E[S_i]$ of accident year $i$. It is assumed that this estimate is given externally by expert opinion which, in theory, should not be based on $\mathcal{D}_I$. There only remains the estimation of the development pattern $\gamma_j$. In view of the CL method, one defines the following estimates for the development pattern:
$$\hat\beta_j^{CL} = \prod_{l=j}^{J-2} \frac{1}{\hat f_l^{CL}} = \frac{\prod_{l=0}^{j-1} \hat f_l^{CL}}{\prod_{l=0}^{J-2} \hat f_l^{CL}}.$$
This ratio exactly reflects the proportion paid after the first $j$ development periods (according to the estimated CL pattern). Therefore, we obtain the estimates
$$\hat\gamma_0^{CL} = \hat\beta_0^{CL}, \qquad \hat\gamma_j^{CL} = \hat\beta_j^{CL} - \hat\beta_{j-1}^{CL} \ \text{ for } j = 1,\dots,J-2, \qquad \hat\gamma_{J-1}^{CL} = 1 - \hat\beta_{J-2}^{CL}.$$
Equipped with these estimators we predict the ultimate claim $C_{i,J-1}$ for $i > I-J+1$ in the BF method by
$$\widehat C^{BF}_{i,J-1} = C_{i,I-i} + \hat\mu_i \sum_{j=I-i+1}^{J-1} \hat\gamma_j^{CL} = C_{i,I-i} + \hat\mu_i \left(1 - \hat\beta^{CL}_{I-i}\right). \tag{9.6}$$
The BF reserves at time $I$ for accident years $i > I-J+1$ are given by
$$\widehat R_i^{BF} = \hat\mu_i \sum_{j=I-i+1}^{J-1} \hat\gamma_j^{CL} = \hat\mu_i \left(1 - \hat\beta^{CL}_{I-i}\right),$$
and aggregated over all accident years we predict the outstanding loss liabilities of past exposure by
$$\widehat R^{BF} = \sum_{i>I-(J-1)} \widehat R_i^{BF}.$$
An example is provided in Table 9.5.
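As an illustration of (9.6), the following Python sketch recomputes the BF and CL reserves of Table 9.5. It is a sketch under stated assumptions: it uses the CL factors of Table 9.3 rounded to four digits, so the results agree with Table 9.5 only approximately, and the helper name `beta_pattern` is hypothetical.

```python
# Rounded CL factors from Table 9.3 and prior estimates mu_i from Table 9.5.
f_hat = [1.4925, 1.0778, 1.0229, 1.0148, 1.0070, 1.0051, 1.0011, 1.0010, 1.0014]
mu_hat = [11653101, 11367306, 10962965, 10616762, 11044881,
          11480700, 11413572, 11126527, 10986548, 11618437]
# Last observed cumulative payments C_{i,I-i} (diagonal of Table 9.3), i = 1, ..., 10.
diagonal = [11148124, 10648192, 10635751, 9724068, 9786916,
            9935753, 9282022, 8256211, 7648729, 5675568]


def beta_pattern(factors):
    """beta_j = prod_{l=j}^{J-2} 1/f_l for j = 0, ..., J-1, with beta_{J-1} = 1."""
    n_dev = len(factors) + 1
    beta = [1.0] * n_dev
    for j in range(n_dev - 2, -1, -1):
        beta[j] = beta[j + 1] / factors[j]
    return beta


beta = beta_pattern(f_hat)
I = len(mu_hat)

# Accident year i (1-based) has been observed up to development year I - i.
bf_reserves = [mu * (1.0 - beta[I - i]) for i, mu in enumerate(mu_hat, start=1)]
cl_reserves = [c * (1.0 / beta[I - i] - 1.0) for i, c in enumerate(diagonal, start=1)]
bf_ultimates = [c + r for c, r in zip(diagonal, bf_reserves)]

for i in range(1, I + 1):
    print(i, round(100 * beta[I - i], 1), round(bf_reserves[i - 1]), round(cl_reserves[i - 1]))
print("total", round(sum(bf_reserves)), round(sum(cl_reserves)))
```

Both reserves use the same weight $1 - \hat\beta^{CL}_{I-i}$: the BF reserve applies it to the prior $\hat\mu_i$, the CL reserve applies it to the CL ultimate, which is why `cl_reserves` can be written with $1/\hat\beta^{CL}_{I-i} - 1$ times the diagonal.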
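We conclude this section with a comparison of the BF and CL predictors. To this end, we rewrite formula (9.4) as follows
$$\widehat C^{CL}_{i,J-1} = C_{i,I-i} + C_{i,I-i} \prod_{j=I-i}^{J-2} \hat f_j^{CL} \left(1 - \frac{1}{\prod_{j=I-i}^{J-2} \hat f_j^{CL}}\right).$$
This gives the following comparison
$$\widehat C^{CL}_{i,J-1} = C_{i,I-i} + \left(1 - \hat\beta^{CL}_{I-i}\right) \widehat C^{CL}_{i,J-1},$$
$$\widehat C^{BF}_{i,J-1} = C_{i,I-i} + \left(1 - \hat\beta^{CL}_{I-i}\right) \hat\mu_i.$$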
Thus, we see that we have the same structure. The only difference is that for the BF method we use the external estimate $\hat\mu_i$ for the ultimate claim and in the CL method the observation based estimate $\widehat C^{CL}_{i,J-1}$. Therefore, we have two complementary prediction positions, which gives exactly the explanation mentioned in the introduction to Section 9.2. For further remarks (also detailed remarks on the example in Tables 9.2-9.5) we refer to Wüthrich-Merz [87].
9.3 Stochastic claims reserving methods

In the previous section we have presented algorithms for the calculation of the claims reserves $R$. Of course, we should also estimate the precision of these predictions, i.e. by how much the true payouts $\sum_{i+j>I} X_{i,j}$ may deviate from these predictions, see also (1.8). This brings us back to the notion of risk measures of Section 6.2.4. In claims reserving, the most popular risk measure is the conditional mean square error of prediction (MSEP) because it can be calculated or estimated explicitly in many examples. Assume $\widehat X$ is a $\mathcal{D}_I$-measurable predictor for the random variable $X$. The conditional MSEP is defined by
$$\operatorname{msep}_{X|\mathcal{D}_I}\big(\widehat X\big) = E\left[\big(X - \widehat X\big)^2 \,\Big|\, \mathcal{D}_I\right]. \tag{9.7}$$
The conditional MSEP is an $L^2$-distance measure. This conditional MSEP can be decoupled into two parts, the so-called process uncertainty and the parameter estimation error as follows, see also (1.8),
$$\operatorname{msep}_{X|\mathcal{D}_I}\big(\widehat X\big) = \operatorname{Var}(X \mid \mathcal{D}_I) + \big(E[X \mid \mathcal{D}_I] - \widehat X\big)^2. \tag{9.8}$$
If all parameters are known and if we can calculate $E[X \mid \mathcal{D}_I]$ then we should set $\widehat X = E[X \mid \mathcal{D}_I]$ because this minimizes the conditional MSEP, see (9.8). In any other case we try to estimate $E[X \mid \mathcal{D}_I]$ as accurately as possible and then we try to determine the possible sources of parameter uncertainty in this estimation. In order to analyze this prediction uncertainty we need to put the claims reserving algorithms into a stochastic framework.
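The decomposition (9.8) can be checked numerically. The sketch below assumes a purely hypothetical three-point conditional distribution of $X$ given $\mathcal{D}_I$ and compares the left-hand side of (9.8) with the sum of process variance and squared estimation error for several predictors.

```python
# Hypothetical conditional distribution of X given D_I: values and probabilities.
values = [80.0, 100.0, 130.0]
probs = [0.2, 0.5, 0.3]


def msep(predictor):
    """E[(X - predictor)^2 | D_I] for the discrete distribution above, see (9.7)."""
    return sum(p * (x - predictor) ** 2 for x, p in zip(values, probs))


mean = sum(p * x for x, p in zip(values, probs))
variance = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))

for x_hat in (90.0, mean, 120.0):
    lhs = msep(x_hat)
    rhs = variance + (mean - x_hat) ** 2  # process variance + squared estimation error
    print(x_hat, lhs, rhs)
```

The predictor equal to the conditional mean gives the smallest value, which is the process variance alone.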
For the CL method there are different stochastic models that provide the CL reserves as predictors:
• distribution-free CL model of Thomas Mack [64],
• over-dispersed Poisson (ODP) model of Renshaw-Verrall [75] and England-Verrall [39] with MLE parameter estimates,
• Bayesian CL model of Gisler and Wüthrich [49] and Bühlmann, De Felice, Gisler, Moriconi and Wüthrich [23].
We are going to describe them. Mack’s distribution-free CL model [64] is probably the most popular stochastic claims reserving model. It is straightforward from a stochastic point of view and easy to implement. A crucial contribution by Mack was the derivation of an estimate for the estimation error term. In the present text we consider the gamma-gamma Bayesian CL model in detail. This model belongs to the family of Bayesian CL models for which the conditional MSEP can be calculated explicitly. Mack’s formula will drop out as an approximation to the full Bayesian formula in the non-informative prior case of this model.
For the BF method there are different approaches such as
• BF ODP model of Alai, Merz and Wüthrich [3, 4],
• BF model of Mack [65],
• BF model of Saluz, Gisler and Wüthrich [80],
• Bayesian BF model of England, Verrall and Wüthrich [40].
Some of these models also use estimates of γj different from the ones previously
suggested. In the present text we are not going to consider stochastic models for
the BF method.

9.3.1 Gamma-gamma Bayesian CL model

In this section we consider an explicit distributional model that belongs to the
exponential dispersion family with conjugate priors. The advantage of such an
explicit distributional model is that we can calculate the posterior distribution
analytically. This allows to determine all quantities of interest in closed form.
From these explicit formulas we will also derive the distribution-free ones.
Model Assumptions 9.1 (gamma-gamma Bayesian CL model). Assume that $\sigma_j > 0$, $j = 0,\dots,J-2$, are given fixed constants.
(a) Conditionally, given $\Theta = (\Theta_0,\dots,\Theta_{J-2})$, $(C_{i,j})_{j=0,\dots,J-1}$ are independent (in $i$) Markov processes (in $j$) with conditional distributions
$$F_{i,j+1} = \left.\frac{C_{i,j+1}}{C_{i,j}}\right|_{C_{i,j},\Theta} \sim \Gamma\left(C_{i,j}\sigma_j^{-2},\ \Theta_j C_{i,j}\sigma_j^{-2}\right).$$
(b) $\Theta_j$ are independent and $\Gamma(\gamma_j, f_j(\gamma_j-1))$-distributed with given prior constants $f_j > 0$, $\gamma_j > 1$.
(c) $\Theta$ and $C_{i,0}$ are independent and $C_{i,0} > 0$, P-a.s.

For given parameters $\Theta$ we have for the conditional means
$$E[C_{i,j+1} \mid C_{i,j}, \Theta] = C_{i,j}\, E[F_{i,j+1} \mid C_{i,j}, \Theta] = \Theta_j^{-1} C_{i,j}.$$
From this we see that $\Theta_j^{-1}$ plays the role of the CL factor introduced in (9.2). We have
$$E\left[\Theta_j^{-1}\right] = f_j(\gamma_j - 1)\,\frac{1}{\gamma_j - 1} = f_j.$$
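This explains the choices of the prior parameters of the distribution of $\Theta_j$. The joint likelihood function of the observations $\mathcal{D}_I$ and the parameters $\Theta$ is given by
$$h(\mathcal{D}_I, \theta) = \prod_{(i,j)\in\mathcal{I}_I,\, j\ge 1} \frac{\left(\theta_{j-1} C_{i,j-1}/\sigma_{j-1}^2\right)^{C_{i,j-1}/\sigma_{j-1}^2}}{\Gamma\left(C_{i,j-1}/\sigma_{j-1}^2\right)}\, F_{i,j}^{\,C_{i,j-1}/\sigma_{j-1}^2 - 1} \exp\left\{-\frac{\theta_{j-1} C_{i,j-1}}{\sigma_{j-1}^2} F_{i,j}\right\}$$
$$\qquad \times\ g(C_{1,0},\dots,C_{I,0}) \prod_{j=0}^{J-2} \frac{(f_j(\gamma_j-1))^{\gamma_j}}{\Gamma(\gamma_j)}\, \theta_j^{\gamma_j - 1} \exp\{-\theta_j f_j(\gamma_j - 1)\}.$$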
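$g(C_{1,0},\dots,C_{I,0})$ denotes the density of the first column $j = 0$. This allows us to apply Bayes’ rule which provides for the posterior of $\Theta$, conditionally given $\mathcal{D}_I$,
$$h(\theta \mid \mathcal{D}_I) \propto \prod_{j=0}^{J-2} \theta_j^{\,\gamma_j + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2} - 1} \exp\left\{-\theta_j\left[f_j(\gamma_j - 1) + \sum_{i=1}^{I-j-1} \frac{C_{i,j+1}}{\sigma_j^2}\right]\right\}.$$
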
We have just proved the following lemma:
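Lemma 9.2. Under Model Assumptions 9.1, the posteriors of $\Theta_0,\dots,\Theta_{J-2}$ are, conditionally given $\mathcal{D}_I$, independent with
$$\Theta_j \mid \mathcal{D}_I \sim \Gamma\left(\gamma_j + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2},\ \ f_j(\gamma_j - 1) + \sum_{i=1}^{I-j-1} \frac{C_{i,j+1}}{\sigma_j^2}\right).$$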
Corollary 9.3. Under Model Assumptions 9.1, the posterior Bayesian CL factors are given by
$$\hat f_j^{BCL} = E\left[\Theta_j^{-1} \mid \mathcal{D}_I\right] = \alpha_j \hat f_j^{CL} + (1 - \alpha_j) f_j,$$
with CL factor estimate $\hat f_j^{CL}$ given by (9.3) and credibility weight
$$\alpha_j = \frac{\sum_{i=1}^{I-j-1} C_{i,j}}{\sum_{i=1}^{I-j-1} C_{i,j} + \sigma_j^2(\gamma_j - 1)} \in (0,1). \tag{9.9}$$
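Proof. The proof is a straightforward application of the gamma distributional properties, namely
$$E\left[\Theta_j^{-1} \mid \mathcal{D}_I\right] = \frac{1}{\gamma_j + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2} - 1}\left[f_j(\gamma_j - 1) + \sum_{i=1}^{I-j-1} \frac{C_{i,j+1}}{\sigma_j^2}\right]$$
$$= \frac{\gamma_j - 1}{\gamma_j - 1 + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}}\, f_j + \frac{\sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}}{\gamma_j - 1 + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}}\ \frac{\sum_{i=1}^{I-j-1} \frac{C_{i,j+1}}{\sigma_j^2}}{\sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}}.$$
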
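The identity in Corollary 9.3 can be checked numerically. The following Python sketch uses hypothetical column data and hypothetical prior parameters; the function names are illustrative only. It compares the posterior mean obtained from the gamma posterior of Lemma 9.2 with the credibility weighted form.

```python
def posterior_cl_factor(c_j, c_j1, sigma2, gamma, f_prior):
    """Posterior mean E[Theta_j^{-1} | D_I] from the gamma posterior of Lemma 9.2.

    c_j, c_j1: observed C_{i,j} and C_{i,j+1} for i = 1, ..., I-j-1.
    """
    shape = gamma + sum(c_j) / sigma2
    rate = f_prior * (gamma - 1.0) + sum(c_j1) / sigma2
    return rate / (shape - 1.0)  # mean of 1/Theta for Theta ~ Gamma(shape, rate)


def credibility_cl_factor(c_j, c_j1, sigma2, gamma, f_prior):
    """Credibility form of Corollary 9.3; returns (f_BCL, alpha_j)."""
    f_cl = sum(c_j1) / sum(c_j)
    alpha = sum(c_j) / (sum(c_j) + sigma2 * (gamma - 1.0))
    return alpha * f_cl + (1.0 - alpha) * f_prior, alpha


# Hypothetical column of a triangle and hypothetical prior parameters.
c_j = [100.0, 110.0, 120.0]
c_j1 = [150.0, 160.0, 185.0]
sigma2, f_prior = 4.0, 1.4

for gamma in (50.0, 5.0, 1.000001):
    direct = posterior_cl_factor(c_j, c_j1, sigma2, gamma, f_prior)
    cred, alpha = credibility_cl_factor(c_j, c_j1, sigma2, gamma, f_prior)
    print(gamma, round(alpha, 6), round(direct, 6), round(cred, 6))
```

As $\gamma_j$ decreases towards 1 the credibility weight approaches 1 and the Bayesian factor approaches the classical CL factor, here 495/330 = 1.5.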
Remarks 9.4.
• Corollary 9.3 is the key for the derivation of the CL reserves. The result says that the CL factors should be estimated by a credibility weighted average between the classical CL estimate $\hat f_j^{CL}$ and the prior estimate $f_j$ with credibility weight $\alpha_j \in (0,1)$.
• The parameter $\gamma_j$ describes the degree of information contained in the prior distribution of $\Theta_j$. If we let $\gamma_j \to 1$ then we obtain $\alpha_j \to 1$. In this case we give full credibility to the observation based estimate, i.e. $\hat f_j^{BCL} \to \hat f_j^{CL}$.
• Observe that the individual development factors $F_{i,j+1}$ satisfy the Bühlmann-Straub model, see Model 8.13: conditionally given $\Theta_j$ and $C_{1,j},\dots,C_{I,j}$, the $F_{i,j+1}$ are independent with
$$E[F_{i,j+1} \mid C_{i,j}, \Theta_j] = \mu(\Theta_j) = \Theta_j^{-1}, \tag{9.10}$$
$$\operatorname{Var}(F_{i,j+1} \mid C_{i,j}, \Theta_j) = \frac{\sigma_j^2(\Theta_j)}{C_{i,j}} = \frac{\sigma_j^2 \Theta_j^{-2}}{C_{i,j}}. \tag{9.11}$$
Thus, $C_{i,j}$ plays the role of the volume measure and $\sigma_j^2(\Theta_j) = \sigma_j^2\Theta_j^{-2}$ plays the role of the variance function. We calculate, see (8.4) and (8.5),
$$\tau_j^2 = \operatorname{Var}(\mu(\Theta_j)) = f_j^2\, \frac{1}{\gamma_j - 2},$$
$$\tilde\sigma_j^2 = E\left[\sigma_j^2\Theta_j^{-2}\right] = \sigma_j^2 f_j^2\, \frac{\gamma_j - 1}{\gamma_j - 2}.$$
This implies for the credibility coefficient, see (8.14),
$$\kappa_j = \frac{\tilde\sigma_j^2}{\tau_j^2} = \sigma_j^2(\gamma_j - 1).$$
Therefore, we obtain the classical structure for the credibility weights, see Theorem 8.17 and (9.9),
$$\alpha_j = \frac{\sum_{i=1}^{I-j-1} C_{i,j}}{\sum_{i=1}^{I-j-1} C_{i,j} + \kappa_j}.$$
Note that the Bühlmann-Straub formula requires $\gamma_j > 2$, otherwise the credibility coefficient $\kappa_j$ cannot be calculated. However, (9.9) is more general because the second prior moment of $\Theta_j^{-1}$ does not need to exist.
Theorem 9.5. Under Model Assumptions 9.1, the Bayesian CL predictor for $C_{i,J-1}$ with $i + J - 1 > I$ is given by
$$\widehat C^{BCL}_{i,J-1} = E[C_{i,J-1} \mid \mathcal{D}_I] = C_{i,I-i} \prod_{j=I-i}^{J-2} \hat f_j^{BCL}.$$
Proof. We use the conditional independence between different accident years, the conditional Markov property and the tower property to obtain
$$\widehat C^{BCL}_{i,J-1} = E\big[E[C_{i,J-1} \mid C_{i,0},\dots,C_{i,I-i},\Theta] \,\big|\, \mathcal{D}_I\big] = C_{i,I-i}\, E\left[\prod_{j=I-i}^{J-2} \Theta_j^{-1} \,\Big|\, \mathcal{D}_I\right].$$
Using the posterior independence of Lemma 9.2 and Corollary 9.3 proves the claim. □
Remark 9.6. Theorem 9.5 explains that our Model Assumptions 9.1 give the CL reserves if we let the prior distributions of $\Theta_j^{-1}$ become non-informative, i.e. for $\gamma_j \to 1$, $j = I-i,\dots,J-2$, we have
$$\widehat C^{BCL}_{i,J-1} \to \widehat C^{CL}_{i,J-1}.$$
For this reason we can use the gamma-gamma Bayesian CL model as one example
that replicates the CL algorithm (9.4) in the non-informative limit. This analogy
allows to study prediction uncertainty within Model Assumptions 9.1.

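For the conditional MSEP we obtain, see (9.8),
$$\operatorname{msep}_{C_{i,J-1}|\mathcal{D}_I}\big(\widehat C^{BCL}_{i,J-1}\big) = \operatorname{Var}(C_{i,J-1} \mid \mathcal{D}_I) + \big(E[C_{i,J-1} \mid \mathcal{D}_I] - \widehat C^{BCL}_{i,J-1}\big)^2 = \operatorname{Var}(C_{i,J-1} \mid \mathcal{D}_I).$$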
This shows the optimality of the Bayesian CL predictor within our model and there
remains the calculation of the conditional variance of the ultimate claim.
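Theorem 9.7. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
$$\operatorname{msep}_{C_{i,J-1}|\mathcal{D}_I}\big(\widehat C^{BCL}_{i,J-1}\big) = \widehat C^{BCL}_{i,J-1}\, \Gamma_{I-i} + \big(\widehat C^{BCL}_{i,J-1}\big)^2 \Delta_{I-i},$$
$$\operatorname{msep}_{\sum_i C_{i,J-1}|\mathcal{D}_I}\Big(\sum_i \widehat C^{BCL}_{i,J-1}\Big) = \sum_i \operatorname{msep}_{C_{i,J-1}|\mathcal{D}_I}\big(\widehat C^{BCL}_{i,J-1}\big) + 2\sum_{i<l} \widehat C^{BCL}_{i,J-1}\, \widehat C^{BCL}_{l,J-1}\, \Delta_{I-i},$$
where we define
$$\Gamma_k = \sum_{j=k}^{J-2} \sigma_j^2 \prod_{n=j}^{J-2} \left(\hat f_n^{BCL}\ \frac{\sigma_n^2(\gamma_n - 1) + \sum_{i=1}^{I-n-1} C_{i,n}}{\sigma_n^2(\gamma_n - 2) + \sum_{i=1}^{I-n-1} C_{i,n}}\right),$$
$$\Delta_k = \prod_{j=k}^{J-2} \left(\frac{\sigma_j^2(\gamma_j - 1) + \sum_{i=1}^{I-j-1} C_{i,j}}{\sigma_j^2(\gamma_j - 2) + \sum_{i=1}^{I-j-1} C_{i,j}}\right) - 1.$$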
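Proof. We first decouple accident years
$$\operatorname{msep}_{\sum_i C_{i,J-1}|\mathcal{D}_I}\Big(\sum_i \widehat C^{BCL}_{i,J-1}\Big) = \operatorname{Var}\Big(\sum_i C_{i,J-1} \,\Big|\, \mathcal{D}_I\Big) = \sum_{i,l} \operatorname{Cov}(C_{i,J-1}, C_{l,J-1} \mid \mathcal{D}_I).$$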
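We calculate these covariance terms. Applying the tower property for conditional expectations implies
$$\operatorname{Cov}(C_{i,J-1}, C_{l,J-1} \mid \mathcal{D}_I) = E[\operatorname{Cov}(C_{i,J-1}, C_{l,J-1} \mid \mathcal{D}_I, \Theta) \mid \mathcal{D}_I] + \operatorname{Cov}(E[C_{i,J-1} \mid \mathcal{D}_I, \Theta],\, E[C_{l,J-1} \mid \mathcal{D}_I, \Theta] \mid \mathcal{D}_I). \tag{9.12}$$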
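We start with the first term on the right-hand side of (9.12). Observe that this term is zero for $i \neq l$ because of the conditional independence between different accident years. Therefore, we only consider the case $i = l$. For this case we have, applying the tower property,
$$\begin{aligned}
\operatorname{Var}(C_{i,J-1} \mid \mathcal{D}_I, \Theta)
&= E[\operatorname{Var}(C_{i,J-1} \mid C_{i,J-2}, \Theta) \mid \mathcal{D}_I, \Theta] + \operatorname{Var}(E[C_{i,J-1} \mid C_{i,J-2}, \Theta] \mid \mathcal{D}_I, \Theta) \\
&= E\left[C_{i,J-2}\, \sigma_{J-2}^2 \Theta_{J-2}^{-2} \mid \mathcal{D}_I, \Theta\right] + \operatorname{Var}\left(C_{i,J-2}\, \Theta_{J-2}^{-1} \mid \mathcal{D}_I, \Theta\right) \\
&= C_{i,I-i} \prod_{j=I-i}^{J-3} \Theta_j^{-1}\ \sigma_{J-2}^2 \Theta_{J-2}^{-2} + \Theta_{J-2}^{-2} \operatorname{Var}(C_{i,J-2} \mid \mathcal{D}_I, \Theta).
\end{aligned}$$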
Hence, we obtain the well-known recursive formula for the process variance in the CL method (see Section 3.2.2 in Wüthrich-Merz [87]). By iterating the recursion we find for given $\Theta$ (see also Lemma 3.6 in Wüthrich-Merz [87])
$$\operatorname{Var}(C_{i,J-1} \mid \mathcal{D}_I, \Theta) = C_{i,I-i} \sum_{j=I-i}^{J-2}\ \prod_{m=I-i}^{j-1} \Theta_m^{-1}\ \sigma_j^2 \Theta_j^{-2} \prod_{n=j+1}^{J-2} \Theta_n^{-2}. \tag{9.13}$$

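Applying the operator $E[\,\cdot \mid \mathcal{D}_I]$ to (9.13) and using the posterior independence of the random variables $\Theta_j$ we obtain
$$\begin{aligned}
E[\operatorname{Var}(C_{i,J-1} \mid \mathcal{D}_I, \Theta) \mid \mathcal{D}_I]
&= C_{i,I-i} \sum_{j=I-i}^{J-2}\ \prod_{m=I-i}^{j-1} \hat f_m^{BCL}\ \sigma_j^2 \prod_{n=j}^{J-2} E\left[\Theta_n^{-2} \mid \mathcal{D}_I\right] \\
&= C_{i,I-i} \sum_{j=I-i}^{J-2}\ \prod_{m=I-i}^{j-1} \hat f_m^{BCL}\ \sigma_j^2 \prod_{n=j}^{J-2} \big(\hat f_n^{BCL}\big)^2\, \frac{\gamma_n - 1 + \sum_{i=1}^{I-n-1} \frac{C_{i,n}}{\sigma_n^2}}{\gamma_n - 2 + \sum_{i=1}^{I-n-1} \frac{C_{i,n}}{\sigma_n^2}} \\
&= \widehat C^{BCL}_{i,J-1} \sum_{j=I-i}^{J-2} \sigma_j^2 \prod_{n=j}^{J-2} \hat f_n^{BCL}\, \frac{\gamma_n - 1 + \sum_{i=1}^{I-n-1} \frac{C_{i,n}}{\sigma_n^2}}{\gamma_n - 2 + \sum_{i=1}^{I-n-1} \frac{C_{i,n}}{\sigma_n^2}}
= \widehat C^{BCL}_{i,J-1}\, \Gamma_{I-i}.
\end{aligned}$$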
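For the second term in (9.12) we have, where w.l.o.g. we assume $i \le l$,
$$\begin{aligned}
\operatorname{Cov}(E[C_{i,J-1} \mid \mathcal{D}_I, \Theta],\, E[C_{l,J-1} \mid \mathcal{D}_I, \Theta] \mid \mathcal{D}_I)
&= C_{i,I-i}\, C_{l,I-l}\, \operatorname{Cov}\left(\prod_{j=I-i}^{J-2} \Theta_j^{-1},\ \prod_{j=I-l}^{J-2} \Theta_j^{-1} \,\Big|\, \mathcal{D}_I\right) \\
&= C_{i,I-i}\, C_{l,I-l} \left(E\left[\prod_{j=I-l}^{I-i-1} \Theta_j^{-1} \prod_{j=I-i}^{J-2} \Theta_j^{-2} \,\Big|\, \mathcal{D}_I\right] - E\left[\prod_{j=I-i}^{J-2} \Theta_j^{-1} \,\Big|\, \mathcal{D}_I\right] E\left[\prod_{j=I-l}^{J-2} \Theta_j^{-1} \,\Big|\, \mathcal{D}_I\right]\right) \\
&= \widehat C^{BCL}_{i,J-1}\, \widehat C^{BCL}_{l,J-1} \left(\prod_{j=I-i}^{J-2} \frac{\gamma_j - 1 + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}}{\gamma_j - 2 + \sum_{i=1}^{I-j-1} \frac{C_{i,j}}{\sigma_j^2}} - 1\right)
= \widehat C^{BCL}_{i,J-1}\, \widehat C^{BCL}_{l,J-1}\, \Delta_{I-i}.
\end{aligned}$$
This proves the statements. □
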
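We analyze the terms $\Gamma_k$ and $\Delta_k$. They involve the following factors
$$\frac{\sigma_j^2(\gamma_j - 1) + \sum_{i=1}^{I-j-1} C_{i,j}}{\sigma_j^2(\gamma_j - 2) + \sum_{i=1}^{I-j-1} C_{i,j}} = 1 + \frac{\sigma_j^2}{\sigma_j^2(\gamma_j - 2) + \sum_{i=1}^{I-j-1} C_{i,j}}.$$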
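Under the assumption that $\sigma_j^2 \ll \sum_{i=1}^{I-j-1} C_{i,j}$ we obtain for $\gamma_j \to 1$ (for $\Delta_k$ we also use a first order Taylor expansion, see also (9.25) below)
$$\Gamma_k \approx \sum_{j=k}^{J-2} \sigma_j^2 \prod_{n=j}^{J-2} \hat f_n^{CL} \overset{\text{def.}}{=} \tilde\Gamma_k, \tag{9.14}$$
$$\Delta_k \approx \sum_{j=k}^{J-2} \frac{\sigma_j^2}{\sum_{i=1}^{I-j-1} C_{i,j} - \sigma_j^2} \approx \sum_{j=k}^{J-2} \frac{\sigma_j^2}{\sum_{i=1}^{I-j-1} C_{i,j}} \overset{\text{def.}}{=} \tilde\Delta_k. \tag{9.15}$$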
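To get a feeling for the quality of approximation (9.15), the following Python sketch compares the exact $\Delta_k$ in the limit $\gamma_j \to 1$ with its first order Taylor expansion and with $\tilde\Delta_k$. The variance parameters and column sums are hypothetical values chosen such that $\sigma_j^2$ is small relative to the column sums.

```python
# Hypothetical variance parameters sigma_j^2 and column sums sum_i C_{i,j}, j = k, ..., J-2.
sigma2 = [100.0, 50.0, 10.0]
col_sums = [5.0e7, 4.0e7, 3.0e7]

# Exact Delta_k of Theorem 9.7 in the limit gamma_j -> 1:
# each factor becomes sum_i C_{i,j} / (sum_i C_{i,j} - sigma_j^2).
delta_exact = 1.0
for s2, c in zip(sigma2, col_sums):
    delta_exact *= c / (c - s2)
delta_exact -= 1.0

# First order Taylor expansion prod(1 + x_j) - 1 ~ sum(x_j) ...
delta_taylor = sum(s2 / (c - s2) for s2, c in zip(sigma2, col_sums))
# ... and dropping sigma_j^2 in the denominators gives Delta-tilde_k of (9.15).
delta_tilde = sum(s2 / c for s2, c in zip(sigma2, col_sums))

print(delta_exact, delta_taylor, delta_tilde)
```

For these inputs the three values agree to several significant digits, which is the situation assumed in the derivation of the approximations.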
This motivates for the non-informative prior case $\gamma_j \to 1$ the approximation
$$\operatorname{msep}_{C_{i,J-1}|\mathcal{D}_I}\big(\widehat C^{CL}_{i,J-1}\big) = \big(\widehat C^{CL}_{i,J-1}\big)^2 \sum_{j=I-i}^{J-2} \frac{s_j^2}{(\hat f_j^{CL})^2} \left(\frac{1}{\widehat C^{CL}_{i,j}} + \frac{1}{\sum_{l=1}^{I-j-1} C_{l,j}}\right), \tag{9.16}$$
for $s_j^2 = \sigma_j^2 (\hat f_j^{CL})^2$. This is the famous Mack formula [64] that gave the first rigorous derivation of an estimate for the conditional MSEP in the CL model (note that Mack [64] uses $s_j^2$ as variance parameter, whereas in the Bayesian model we use $\sigma_j^2\Theta_j^{-2}$, see (9.11)).

Example 9.8 (gamma-gamma Bayesian CL model and Mack’s formula). We come back to the claims reserving example presented in Table 9.3. We consider the gamma-gamma Bayesian CL model with non-informative priors $\gamma_j \to 1$ in Theorem 9.7. In this non-informative prior case we have $\alpha_j = 1$, see (9.9), and therefore we obtain $\hat f_j^{BCL} = \hat f_j^{CL}$ and $\widehat C^{BCL}_{i,J-1} = \widehat C^{CL}_{i,J-1}$. This immediately implies that the claims reserves for the outstanding loss liabilities in this non-informative prior Bayesian CL model are given by Tables 9.4 and 9.5.
There remains the calculation of the prediction uncertainty in this non-informative prior Bayesian CL model. In order to do this we need an estimate for $\sigma_j^2$. From (9.11) we see that $\sigma_j^2(\Theta_j) = \sigma_j^2\Theta_j^{-2}$ which is compared to $s_j^2 = \sigma_j^2(\hat f_j^{CL})^2$. If we estimate $\Theta_j^{-2}$ by $(\hat f_j^{CL})^2$ then we can find estimates $\hat\sigma_j^2 = \hat s_j^2/(\hat f_j^{CL})^2$ once we have estimated $s_j^2$. The estimation of the latter is done rather ad-hoc by the classical estimates, see Lemma 3.5 in Wüthrich-Merz [87],
$$\hat s_j^2 = \frac{1}{I-j-2} \sum_{i=1}^{I-j-1} C_{i,j} \left(\frac{C_{i,j+1}}{C_{i,j}} - \hat f_j^{CL}\right)^2. \tag{9.17}$$
For triangles $I = J$ the variance parameter $s_{J-2}^2$ cannot be estimated because we do not have sufficiently many observations in this last column. In practice, one therefore often sets, see (3.13) in Wüthrich-Merz [87],
$$\hat s_{J-2}^2 = \min\left\{\hat s_{J-3}^4/\hat s_{J-4}^2;\ \hat s_{J-3}^2;\ \hat s_{J-4}^2\right\}. \tag{9.18}$$
This provides the estimates in Table 9.6.

j                  0       1       2       3       4       5       6       7       8       9
$\hat s_j$        135.25   33.80   15.76   19.85    9.34    2.00    0.82    0.22    0.06
$\hat\sigma_j$     90.62   31.36   15.41   19.56    9.27    1.99    0.82    0.22    0.06
Table 9.6: Estimated standard deviation parameters in the non-informative priors Bayesian model where we set $\hat\sigma_j = \hat s_j/\hat f_j^{CL}$.
These parameters provide the results for the square-rooted conditional MSEPs
given in Table 9.7. We observe that for the total claims reserves the 1 standard
deviation confidence bounds are about 7.7% of the total claims reserves. We also
observe that the full formula given by Theorem 9.7 with non-informative priors and
Mack’s formula (9.16) are very close, i.e. 462’967 versus 462’960. This observation
holds true for many typical non-life insurance data sets and it justifies the use of
the simpler formula.
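The estimates (9.17)-(9.18) and Mack’s formula (9.16) can be reproduced with a short Python sketch. The data are the cumulative payments of Table 9.3. The helper names are hypothetical, and the aggregation over accident years uses the covariance term of Theorem 9.7 with $\Delta_{I-i}$ replaced by the approximation $\tilde\Delta_{I-i}$ of (9.15), which is an assumption made here to obtain the total.

```python
from math import sqrt

# Cumulative payments C[i][j] from Table 9.3 (list index 0 is accident year i = 1).
C = [
    [5946975, 9668212, 10563929, 10771690, 10978394, 11040518, 11106331, 11121181, 11132310, 11148124],
    [6346756, 9593162, 10316383, 10468180, 10536004, 10572608, 10625360, 10636546, 10648192],
    [6269090, 9245313, 10092366, 10355134, 10507837, 10573282, 10626827, 10635751],
    [5863015, 8546239, 9268771, 9459424, 9592399, 9680740, 9724068],
    [5778885, 8524114, 9178009, 9451404, 9681692, 9786916],
    [6184793, 9013132, 9585897, 9830796, 9935753],
    [5600184, 8493391, 9056505, 9282022],
    [5288066, 7728169, 8256211],
    [5290793, 7648729],
    [5675568],
]
I, J = len(C), len(C[0])


def col_sum(j, n):
    """sum_{l=1}^{n} C_{l,j}: column j summed over the first n accident years."""
    return sum(C[l][j] for l in range(n))


# CL factors (9.3).
f = [col_sum(j + 1, I - j - 1) / col_sum(j, I - j - 1) for j in range(J - 1)]

# Variance parameters (9.17); the last one via the extrapolation (9.18).
s2 = []
for j in range(J - 2):
    n = I - j - 1
    s2.append(sum(C[l][j] * (C[l][j + 1] / C[l][j] - f[j]) ** 2 for l in range(n)) / (n - 1))
s2.append(min(s2[-1] ** 2 / s2[-2], s2[-1], s2[-2]))


def projected(r):
    """CL predictions of row r (accident year i = r + 1), keyed by j = I-i, ..., J-1."""
    k = len(C[r]) - 1
    path = {k: float(C[r][k])}
    for j in range(k, J - 1):
        path[j + 1] = path[j] * f[j]
    return path


def mack_msep(r):
    """Single accident year conditional MSEP, Mack's formula (9.16)."""
    path, k = projected(r), len(C[r]) - 1
    terms = sum(s2[j] / f[j] ** 2 * (1.0 / path[j] + 1.0 / col_sum(j, I - j - 1))
                for j in range(k, J - 1))
    return path[J - 1] ** 2 * terms


def delta_tilde(k):
    """Approximation (9.15) with sigma_j^2 = s_j^2 / f_j^2."""
    return sum(s2[j] / f[j] ** 2 / col_sum(j, I - j - 1) for j in range(k, J - 1))


ult = [projected(r)[J - 1] for r in range(I)]
msep_single = [mack_msep(r) for r in range(I)]
cov = 2.0 * sum(ult[r] * ult[q] * delta_tilde(len(C[r]) - 1)
                for r in range(I) for q in range(r + 1, I))
msep_total = sum(msep_single) + cov

print([round(sqrt(v), 2) for v in s2])
print([round(sqrt(v)) for v in msep_single], round(sqrt(cov)), round(sqrt(msep_total)))
```

The printed values can be compared with Table 9.6 and with the Mack column of Table 9.7.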
i                 CL reserves $\widehat R_i^{CL}$   msep$^{1/2}$ Bayes’   msep$^{1/2}$ Mack (9.16)   in % reserves
0
1                      15’126       267       267     1.8%
2                      26’257       914       914     3.5%
3                      34’538     3’058     3’058     8.9%
4                      85’302     7’628     7’628     8.9%
5                     156’494    33’341    33’341    21.3%
6                     286’121    73’467    73’467    25.7%
7                     449’167    85’399    85’398    19.0%
8                   1’043’242   134’337   134’337    12.9%
9                   3’950’815   410’824   410’817    10.4%
covariance$^{1/2}$              116’811   116’810
total               6’047’061   462’967   462’960     7.7%
Table 9.7: Claims reserves and prediction uncertainty in the non-informative priors gamma-gamma Bayesian CL model and Mack’s formula (9.16).

9.3.2 Over-dispersed Poisson model

Another stochastic model that has attracted a lot of attention is the so-called over-dispersed Poisson (ODP) model. It goes back to Renshaw-Verrall [75]. Peter D. England and Richard J. Verrall [39] have done much to popularize the model. It belongs to the family of GLMs and it is quite attractive because bootstrap simulation can easily be applied.

Model Assumptions 9.9 (over-dispersed Poisson model). Assume there exist positive parameters $\mu_1,\dots,\mu_I$, $\gamma_0,\dots,\gamma_{J-1}$ and $\phi$ such that all $X_{i,j}$ are independent (in $i$ and $j$) with
$$\frac{X_{i,j}}{\phi} \sim \operatorname{Poi}(\mu_i\gamma_j/\phi).$$
Observe that
$$E[X_{i,j}] = \mu_i\gamma_j, \qquad \operatorname{Var}(X_{i,j}) = \phi\,\mu_i\gamma_j.$$
We have a cross-classified mean with $\mu_i$ modeling the exposure of accident year $i$ and $\gamma_j$ the development pattern of the payout delay $j$. In order to make the parameters $\mu_i$ and $\gamma_j$ uniquely identifiable we need a side constraint. The two commonly used side constraints are either
$$\mu_1 = 1 \qquad \text{or} \qquad \sum_{j=0}^{J-1} \gamma_j = 1.$$

The first option is more convenient in the application of GLM methods, the second
option gives an explicit meaning to the pattern (γj )j , namely, that it corresponds
to the cash flow pattern.
The best-estimate reserves at time $I$ are given by
$$R = \sum_{i+j>I} E[X_{i,j} \mid \mathcal{D}_I] = \sum_{i+j>I} \mu_i\gamma_j.$$
Hence, we need to estimate the parameters $\mu_i$ and $\gamma_j$. This is done with MLE methods. We assume $J = I$ which simplifies notation. Having observations $\mathcal{D}_I$ allows us to estimate the parameters. The log-likelihood function for $\mu = (\mu_1,\dots,\mu_I)$, $\gamma = (\gamma_0,\dots,\gamma_{J-1})$ and $\phi$ is given by
$$\ell_{\mathcal{D}_I}(\mu,\gamma,\phi) = \sum_{(i,j)\in\mathcal{I}_I} -\mu_i\gamma_j/\phi + (X_{i,j}/\phi)\log(\mu_i\gamma_j/\phi) - \log((X_{i,j}/\phi)!).$$
Calculating the derivatives w.r.t. $\mu$ and $\gamma$ and setting them equal to zero implies that we need to solve the following system of equations to find the MLEs
$$\sum_{i=1}^{I-j} \mu_i\gamma_j = \sum_{i=1}^{I-j} X_{i,j} \qquad \text{for all } j = 0,\dots,J-1, \tag{9.19}$$
$$\sum_{j=0}^{I-i} \mu_i\gamma_j = \sum_{j=0}^{I-i} X_{i,j} \qquad \text{for all } i = 1,\dots,I, \tag{9.20}$$
w.l.o.g. under the side constraint $\sum_{j=0}^{J-1} \gamma_j = 1$. The remarkable fact about the MLE system (9.19)-(9.20) is that it can be solved explicitly and that it provides the CL reserves. Moreover, the constant dispersion parameter $\phi$ cancels and is not relevant for estimating the reserves.
Theorem 9.10. Under Model Assumptions 9.9, the MLEs for $\mu$ and $\gamma$, given $\mathcal{D}_I$, are given by
$$\hat\mu_i^{MLE} = \widehat C^{CL}_{i,J-1} \qquad \text{and} \qquad \hat\gamma_j^{MLE} = \prod_{k=j}^{J-2} \frac{1}{\hat f_k^{CL}} \left(1 - \frac{1}{\hat f_{j-1}^{CL}}\right),$$
for $i = 1,\dots,I$ and $j = 1,\dots,J-1$. Moreover, $\hat\gamma_0^{MLE} = \prod_{k=0}^{J-2} 1/\hat f_k^{CL}$. For the estimated reserves we have
$$\widehat R_i^{ODP} = \hat\mu_i^{MLE} \sum_{j=I-i+1}^{J-1} \hat\gamma_j^{MLE} = \widehat R_i^{CL}.$$
Proof. For the proof we refer to Lemma 2.16, Corollary 2.18 and Remarks 2.19 in Wüthrich-Merz [87]. Basically, the proof goes by induction along the last observed diagonal in $\mathcal{D}_I$. □
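The statement of Theorem 9.10 can be illustrated numerically. The sketch below is an assumption-laden illustration, not the proof technique of the notes: it solves the MLE system (9.19)-(9.20) by a simple alternating fixed-point iteration on a small hypothetical incremental triangle and compares the resulting ODP reserves with the CL reserves.

```python
# Hypothetical incremental triangle X[i][j]; row i is observed for j = 0, ..., n-1-i.
X = [
    [100.0, 50.0, 15.0, 5.0],
    [110.0, 50.0, 20.0],
    [120.0, 65.0],
    [130.0],
]
n = len(X)

# Alternating fixed-point iteration on the MLE equations (9.19)-(9.20).
mu = [1.0] * n
gam = [1.0] * n
for _ in range(2000):
    # (9.20): mu_i * sum_{j <= I-i} gamma_j = sum_{j <= I-i} X_{i,j}
    mu = [sum(X[i]) / sum(gam[: len(X[i])]) for i in range(n)]
    # (9.19): gamma_j * sum_{i <= I-j} mu_i = sum_{i <= I-j} X_{i,j}
    gam = [sum(X[i][j] for i in range(n - j)) / sum(mu[: n - j]) for j in range(n)]

# Impose the side constraint sum_j gamma_j = 1; the products mu_i * gamma_j are unchanged.
scale = sum(gam)
gam = [g / scale for g in gam]
mu = [m * scale for m in mu]

odp_reserves = [mu[i] * sum(gam[len(X[i]):]) for i in range(n)]

# CL reserves from the corresponding cumulative triangle, for comparison.
C = [[sum(row[: j + 1]) for j in range(len(row))] for row in X]
f = [sum(C[i][j + 1] for i in range(n - j - 1)) / sum(C[i][j] for i in range(n - j - 1))
     for j in range(n - 1)]
cl_ultimates = []
for row in C:
    ult = row[-1]
    for fj in f[len(row) - 1:]:
        ult *= fj
    cl_ultimates.append(ult)
cl_reserves = [u - row[-1] for u, row in zip(cl_ultimates, C)]

print([round(r, 4) for r in odp_reserves])
print([round(r, 4) for r in cl_reserves])
```

After normalization the fitted $\mu_i$ can also be compared with `cl_ultimates`, in line with $\hat\mu_i^{MLE} = \widehat C^{CL}_{i,J-1}$.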
Remarks 9.11.
• Theorem 9.10 goes back to Hachemeister-Stanard [52], Kremer [60] and Mack [63].
• Theorem 9.10 explains the popularity of the ODP model for claims reserving, because it provides the CL reserves. Thus, we have found a second stochastic model that can be used to explain the CL algorithm from a stochastic point of view.
• In this model we can also give an estimate for the conditional MSEP. This uses that MLEs are approximated by standard Gaussian asymptotic results for GLM. For details we refer to England-Verrall [39] and Wüthrich-Merz [87], Section 6.4.3.
• The ODP framework also allows us to give an estimate for the conditional MSEP in the BF method, and it justifies the choice $\hat\beta_j^{CL} = \sum_{k=0}^{j} \hat\gamma_k^{MLE}$. For details we refer to Alai et al. [3, 4].
9.4 Claims development result

In the previous sections we have given a static point of view of claims reserving. However, claims reserving should be understood as a dynamic process, where more and more information becomes available over time and prediction is continuously improved according to this new knowledge. This is also the view that needs to be taken for solvency considerations.
We consider the run-off situation, and thus the last accident year $I$ is kept fixed. In the run-off situation the flow of information (9.1) is changed to (with a slight abuse of notation)
$$\mathcal{D}_t = \{X_{i,j};\ i + j \le t,\ 1 \le i \le I,\ 0 \le j \le J-1\},$$
i.e. this generates a filtration denoted by $(\mathcal{D}_t)_{t\ge 0}$ on $(\Omega, \mathcal{F}, P)$ that describes the flow of information (we set $\mathcal{D}_t = \sigma(\mathcal{D}_t)$). At time $t \ge I$ the ultimate claim of accident year $i > t - J + 1$ is predicted by
$$\widehat C^{(t)}_{i,J-1} = E[C_{i,J-1} \mid \mathcal{D}_t]. \tag{9.21}$$
This is the predictor that minimizes the conditional MSEP at time $t$. The best-estimate reserves at time $t$ for accident year $i > t - J + 1$ are provided by
$$R_i^{(t)} = \widehat C^{(t)}_{i,J-1} - C_{i,t-i}, \tag{9.22}$$
where $C_{i,t-i}$ is the last cumulative payment of accident year $i$ contained in $\mathcal{D}_t$. In accounting year $t+1$ we then collect new information resulting in $\mathcal{D}_{t+1}$ and we do payments $X_{i,t-i+1} = C_{i,t-i+1} - C_{i,t-i}$. This allows us to define the so-called claims development result (CDR) of accident year $i$ in accounting year $t+1$ by, see Michael Merz and Wüthrich [69],
$$\mathrm{CDR}_{i,t+1} = R_i^{(t)} - \left(X_{i,t-i+1} + R_i^{(t+1)}\right) = \widehat C^{(t)}_{i,J-1} - \widehat C^{(t+1)}_{i,J-1}. \tag{9.23}$$
The claims development result $\mathrm{CDR}_{i,t+1}$ explains how we change the prediction of the ultimate claim when new information is available. If the claims development result is negative we have a loss in the P&L statement because we have under-estimated the outstanding loss liabilities at time $t$, otherwise we have a gain. This is exactly the classical earning statement view in order to understand the risk that derives from the development of the outstanding loss liabilities.
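The bookkeeping behind the claims development result can be sketched as follows. This is an illustration under assumptions: the predictors are plain CL predictors (the non-informative prior limit of the Bayesian CL model) instead of general conditional expectations, and both the triangle at time $I$ and the new diagonal at time $I+1$ are hypothetical numbers.

```python
def cl_ultimates(tri):
    """CL ultimates with volume-weighted factors estimated from all observed pairs."""
    n_dev = len(tri[0])
    factors = []
    for j in range(n_dev - 1):
        rows = [r for r in tri if len(r) > j + 1]
        factors.append(sum(r[j + 1] for r in rows) / sum(r[j] for r in rows))
    ultimates = []
    for row in tri:
        value = float(row[-1])
        for fj in factors[len(row) - 1:]:
            value *= fj
        ultimates.append(value)
    return ultimates


# Hypothetical cumulative triangle at time I and the same run-off one year later.
tri_I = [[100.0, 150.0, 165.0], [110.0, 160.0], [120.0]]
tri_I1 = [[100.0, 150.0, 165.0], [110.0, 160.0, 178.0], [120.0, 185.0]]

ult_I = cl_ultimates(tri_I)
ult_I1 = cl_ultimates(tri_I1)

cdr = []
for k, (row_I, row_I1) in enumerate(zip(tri_I, tri_I1)):
    reserve_t = ult_I[k] - row_I[-1]       # reserves at time I
    reserve_t1 = ult_I1[k] - row_I1[-1]    # reserves at time I+1
    payment = row_I1[-1] - row_I[-1]       # payment made in accounting year I+1
    cdr.append(reserve_t - (payment + reserve_t1))

for k, value in enumerate(cdr, start=1):
    print(k, round(value, 3), round(ult_I[k - 1] - ult_I1[k - 1], 3))
```

In this hypothetical run-off the new diagonal is higher than predicted, so the claims development results of the open accident years are negative, i.e. losses in the P&L statement.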
The tower property immediately gives the following crucial statement:
Corollary 9.12. Assume $C_{i,J-1}$ has finite first moment. Then we have
$$E[\mathrm{CDR}_{i,t+1} \mid \mathcal{D}_t] = 0.$$
This corollary explains that on average we neither expect losses nor gains but the prediction is just unbiased. Note that (9.21) defines a martingale in $t$ and remark that square-integrable martingales have uncorrelated innovations (claims development results). Our aim is to study the uncertainty in this position measured by the conditional MSEP. For simplicity we set $t = I$,
$$\operatorname{msep}_{\mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0) = E\left[(\mathrm{CDR}_{i,I+1} - 0)^2 \mid \mathcal{D}_I\right] = \operatorname{Var}(\mathrm{CDR}_{i,I+1} \mid \mathcal{D}_I) = \operatorname{Var}\big(\widehat C^{(I+1)}_{i,J-1} \mid \mathcal{D}_I\big). \tag{9.24}$$
Thus, we need to study the volatility of the one-period update. We do this in the gamma-gamma Bayesian CL model of Model Assumptions 9.1. Of course, Lemma 9.2 easily extends to the following lemma.
Lemma 9.13. Choose $t \ge I$. Under Model Assumptions 9.1, the posteriors of $\Theta_0,\dots,\Theta_{J-2}$ are independent, conditionally given $\mathcal{D}_t$, with
$$\Theta_j \mid \mathcal{D}_t \sim \Gamma\left(\gamma_j + \sum_{i=1}^{(t-j-1)\wedge I} \frac{C_{i,j}}{\sigma_j^2},\ \ f_j(\gamma_j - 1) + \sum_{i=1}^{(t-j-1)\wedge I} \frac{C_{i,j+1}}{\sigma_j^2}\right).$$
The Bayesian CL predictor for $C_{i,J-1}$, $i > t - J + 1$, is given by
$$\widehat C^{(t)}_{i,J-1} = E[C_{i,J-1} \mid \mathcal{D}_t] = C_{i,t-i} \prod_{j=t-i}^{J-2} \hat f_j^{(t)},$$
with posterior expected Bayesian CL factors given by $\hat f_j^{(t)} = E[\Theta_j^{-1} \mid \mathcal{D}_t]$.


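Next we exploit the recursive structure of credibility estimators, see for instance Corollary 8.6. These hold true in quite some generality; for the current exposition we restrict to $t = I, I+1$ because these are the only indexes of interest for the analysis of (9.24). For $t = I+1$ and $j \ge 0$ we have
$$\begin{aligned}
\hat f_j^{(I+1)} = E\left[\Theta_j^{-1} \mid \mathcal{D}_{I+1}\right]
&= \frac{f_j(\gamma_j - 1) + \sum_{i=1}^{I-j} \frac{C_{i,j+1}}{\sigma_j^2}}{\gamma_j - 1 + \sum_{i=1}^{I-j} \frac{C_{i,j}}{\sigma_j^2}} \\
&= \frac{\frac{C_{I-j,j+1}}{\sigma_j^2}}{\gamma_j - 1 + \sum_{i=1}^{I-j} \frac{C_{i,j}}{\sigma_j^2}} + \frac{f_j(\gamma_j - 1) + \sum_{i=1}^{I-j-1} \frac{C_{i,j+1}}{\sigma_j^2}}{\gamma_j - 1 + \sum_{i=1}^{I-j} \frac{C_{i,j}}{\sigma_j^2}} \\
&= \beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)},
\end{aligned}$$
with $\mathcal{D}_I$-measurable credibility weight
$$\beta_j^{(I+1)} = \frac{C_{I-j,j}}{\sigma_j^2(\gamma_j - 1) + \sum_{i=1}^{I-j} C_{i,j}} \in [0,1].$$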
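The recursive update can be verified numerically. The sketch below uses hypothetical data for one development period $j$ and hypothetical prior parameters; it compares the direct posterior mean at time $I+1$ with the credibility weighted update of the factor at time $I$.

```python
def bayes_factor(c_j, c_j1, sigma2, gamma, f_prior):
    """E[Theta_j^{-1} | data] for observed pairs (C_{i,j}, C_{i,j+1})."""
    numerator = f_prior * (gamma - 1.0) + sum(c_j1) / sigma2
    denominator = gamma - 1.0 + sum(c_j) / sigma2
    return numerator / denominator


# Hypothetical data: pairs known at time I, plus one new pair arriving at time I+1.
c_j, c_j1 = [100.0, 110.0], [150.0, 160.0]
new_c_j, new_c_j1 = 120.0, 185.0      # C_{I-j,j} (known at time I) and C_{I-j,j+1} (new)
sigma2, gamma, f_prior = 4.0, 3.0, 1.4

f_I = bayes_factor(c_j, c_j1, sigma2, gamma, f_prior)
f_I1_direct = bayes_factor(c_j + [new_c_j], c_j1 + [new_c_j1], sigma2, gamma, f_prior)

# Credibility weight beta_j^{(I+1)}: it only uses column j, hence it is known at time I.
beta = new_c_j / (sigma2 * (gamma - 1.0) + sum(c_j) + new_c_j)
f_I1_recursive = beta * new_c_j1 / new_c_j + (1.0 - beta) * f_I

print(f_I, f_I1_direct, f_I1_recursive, beta)
```

The only quantity in the update that is unknown at time $I$ is the new observation `new_c_j1`, which is the structural fact used in the MSEP calculation of the claims development result.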
The important observation is that there is only one random term in $\hat f_j^{(I+1)}$, conditionally given $\mathcal{D}_I$. This is crucial in the calculation of the conditional MSEP of the claims development result prediction. We start with a lemma.
Lemma 9.14. Under Model Assumptions 9.1 we have for $I - i + 1 \le J - 1$
$$\operatorname{Var}(C_{i,I-i+1} \mid \mathcal{D}_I) = C_{i,I-i}\, \sigma_{I-i}^2 \big(\hat f_{I-i}^{(I)}\big)^2\, \Upsilon_{I-i} + C_{i,I-i}^2 \big(\hat f_{I-i}^{(I)}\big)^2 (\Upsilon_{I-i} - 1),$$
where
$$\Upsilon_k = \frac{\sigma_k^2(\gamma_k - 1) + \sum_{l=1}^{I-k-1} C_{l,k}}{\sigma_k^2(\gamma_k - 2) + \sum_{l=1}^{I-k-1} C_{l,k}} = 1 + \frac{\sigma_k^2}{\sigma_k^2(\gamma_k - 2) + \sum_{l=1}^{I-k-1} C_{l,k}}.$$
Proof. This is a straightforward consequence of Theorem 9.7 if we set $J - 2 = I - i$ in the latter theorem. □
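Theorem 9.15. Under Model Assumptions 9.1 the Bayesian CL predictor satisfies
$$\operatorname{msep}_{\mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0) = \big(\widehat C^{(I)}_{i,J-1}\big)^2 \left[(\Psi_{I-i} + 1) \prod_{j=I-i+1}^{J-2} \left(\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right) - 1\right],$$
$$\operatorname{msep}_{\sum_i \mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0) = \sum_i \operatorname{msep}_{\mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0) + 2 \sum_{i<l} \widehat C^{(I)}_{i,J-1}\, \widehat C^{(I)}_{l,J-1} \left[\left(\beta_{I-i}^{(I+1)} \Psi_{I-i} + 1\right) \prod_{j=I-i+1}^{J-2} \left(\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right) - 1\right],$$
where
$$\Psi_{I-i} = \frac{\operatorname{Var}(C_{i,I-i+1} \mid \mathcal{D}_I)}{\big(\hat f_{I-i}^{(I)} C_{i,I-i}\big)^2} = \left(\frac{\sigma_{I-i}^2}{C_{i,I-i}} + 1\right) \Upsilon_{I-i} - 1.$$
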

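Proof. We first decouple the accident years
$$\operatorname{msep}_{\sum_i \mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0) = \operatorname{Var}\Big(\sum_i \widehat C^{(I+1)}_{i,J-1} \,\Big|\, \mathcal{D}_I\Big) = \sum_{i,l} \operatorname{Cov}\big(\widehat C^{(I+1)}_{i,J-1},\ \widehat C^{(I+1)}_{l,J-1} \mid \mathcal{D}_I\big).$$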
We calculate these covariance terms. Observe
$$\widehat C^{(I+1)}_{i,J-1} = C_{i,I-i+1} \prod_{j=I-i+1}^{J-2} \hat f_j^{(I+1)} = C_{i,I-i+1} \prod_{j=I-i+1}^{J-2} \left(\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)}\right).$$
The only random terms under the measure $P(\,\cdot \mid \mathcal{D}_I)$ are $C_{i,I-i+1}, C_{i-1,I-i+2}, \dots, C_{I-J+2,J-1}$. All these random variables belong to different accident years $i$ and to different development periods $j$. Therefore, they are independent given $\mathcal{D}_I$; this follows from Model Assumptions 9.1 and Lemma 9.2. Moreover, we have unbiasedness in the sense (use the tower property)
$$E\left[\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)} \,\Big|\, \mathcal{D}_I\right] = E\left[\hat f_j^{(I+1)} \mid \mathcal{D}_I\right] = \hat f_j^{(I)}.$$
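In the first step we decouple the covariance as follows
$$\operatorname{Cov}\big(\widehat C^{(I+1)}_{i,J-1},\ \widehat C^{(I+1)}_{l,J-1} \mid \mathcal{D}_I\big) = E\left[\widehat C^{(I+1)}_{i,J-1}\, \widehat C^{(I+1)}_{l,J-1} \mid \mathcal{D}_I\right] - \widehat C^{(I)}_{i,J-1}\, \widehat C^{(I)}_{l,J-1},$$
with
$$E\left[\widehat C^{(I+1)}_{i,J-1}\, \widehat C^{(I+1)}_{l,J-1} \mid \mathcal{D}_I\right] = E\left[C_{i,I-i+1} \prod_{j=I-i+1}^{J-2} \hat f_j^{(I+1)}\ C_{l,I-l+1} \prod_{m=I-l+1}^{J-2} \hat f_m^{(I+1)} \,\Big|\, \mathcal{D}_I\right].$$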
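We first treat the case $i = l$. In that case we have, using the conditional independence,
$$\begin{aligned}
E\left[\big(\widehat C^{(I+1)}_{i,J-1}\big)^2 \mid \mathcal{D}_I\right]
&= E\left[(C_{i,I-i+1})^2 \prod_{j=I-i+1}^{J-2} \left(\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)}\right)^2 \,\Big|\, \mathcal{D}_I\right] \\
&= E\left[(C_{i,I-i+1})^2 \mid \mathcal{D}_I\right] \prod_{j=I-i+1}^{J-2} E\left[\left(\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)}\right)^2 \,\Big|\, \mathcal{D}_I\right],
\end{aligned}$$
which allows us to calculate each term individually. The unbiasedness implies
$$\begin{aligned}
E\left[\big(\hat f_j^{(I+1)}\big)^2 \mid \mathcal{D}_I\right]
&= E\left[\left(\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)}\right)^2 \,\Big|\, \mathcal{D}_I\right] \\
&= \operatorname{Var}\left(\beta_j^{(I+1)}\, \frac{C_{I-j,j+1}}{C_{I-j,j}} + \left(1 - \beta_j^{(I+1)}\right) \hat f_j^{(I)} \,\Big|\, \mathcal{D}_I\right) + \big(\hat f_j^{(I)}\big)^2 \\
&= \left(\frac{\beta_j^{(I+1)}}{C_{I-j,j}}\right)^2 \operatorname{Var}(C_{I-j,j+1} \mid \mathcal{D}_I) + \big(\hat f_j^{(I)}\big)^2 = \left[\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right] \big(\hat f_j^{(I)}\big)^2.
\end{aligned}$$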
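Now we collect all the terms to obtain for $i = l$
$$\operatorname{Var}\big(\widehat C^{(I+1)}_{i,J-1} \mid \mathcal{D}_I\big) = \big(\widehat C^{(I)}_{i,J-1}\big)^2 \left[(\Psi_{I-i} + 1) \prod_{j=I-i+1}^{J-2} \left(\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right) - 1\right].$$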
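There remains the case of different accident years. W.l.o.g. we assume $i < l$ which implies $I - i + 1 > I - l + 1$. This and conditional independence given $\mathcal{D}_I$ implies for the covariance between these accident years
$$\begin{aligned}
E\left[\widehat C^{(I+1)}_{i,J-1}\, \widehat C^{(I+1)}_{l,J-1} \mid \mathcal{D}_I\right]
&= E\left[C_{i,I-i+1} \prod_{j=I-i+1}^{J-2} \hat f_j^{(I+1)}\ C_{l,I-l+1} \prod_{m=I-l+1}^{J-2} \hat f_m^{(I+1)} \,\Big|\, \mathcal{D}_I\right] \\
&= C_{l,I-l} \prod_{m=I-l}^{I-i-1} \hat f_m^{(I)}\ E\left[C_{i,I-i+1}\, \hat f_{I-i}^{(I+1)} \mid \mathcal{D}_I\right] \prod_{j=I-i+1}^{J-2} E\left[\big(\hat f_j^{(I+1)}\big)^2 \mid \mathcal{D}_I\right] \\
&= \widehat C^{(I)}_{l,I-i} \left(\operatorname{Cov}\big(C_{i,I-i+1},\ \hat f_{I-i}^{(I+1)} \mid \mathcal{D}_I\big) + C_{i,I-i} \big(\hat f_{I-i}^{(I)}\big)^2\right) \prod_{j=I-i+1}^{J-2} \left[\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right] \big(\hat f_j^{(I)}\big)^2 \\
&= \widehat C^{(I)}_{i,J-1}\, \widehat C^{(I)}_{l,J-1} \left[\beta_{I-i}^{(I+1)} \Psi_{I-i} + 1\right] \prod_{j=I-i+1}^{J-2} \left[\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right].
\end{aligned}$$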
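Similar to (9.16) we consider in a first step the linear approximation
$$\begin{aligned}
\operatorname{msep}_{\mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0)
&= \big(\widehat C^{(I)}_{i,J-1}\big)^2 \left[(\Psi_{I-i} + 1) \prod_{j=I-i+1}^{J-2} \left(\big(\beta_j^{(I+1)}\big)^2 \Psi_j + 1\right) - 1\right] \\
&\approx \big(\widehat C^{(I)}_{i,J-1}\big)^2 \left[\Psi_{I-i} + \sum_{j=I-i+1}^{J-2} \big(\beta_j^{(I+1)}\big)^2 \Psi_j\right].
\end{aligned} \tag{9.25}$$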
In the non-informative prior case $\gamma_j \to 1$ we have asymptotic credibility weights
$$\beta_j^{(I+1)} \to \tilde\beta_j^{(I+1)} = \frac{C_{I-j,j}}{\sum_{l=1}^{I-j} C_{l,j}}. \tag{9.26}$$
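For the variance constants we consider the same limit as $\gamma_k \to 1$ which provides, see also page 223 and in particular approximation (9.15),
$$\Upsilon_k = 1 + \frac{\sigma_k^2}{\sigma_k^2(\gamma_k - 2) + \sum_{l=1}^{I-k-1} C_{l,k}} \ \to\ 1 + \frac{\sigma_k^2}{\sum_{l=1}^{I-k-1} C_{l,k} - \sigma_k^2} \approx 1 + \frac{\sigma_k^2}{\sum_{l=1}^{I-k-1} C_{l,k}}.$$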
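Applying this non-informative prior case and using again the linear approximation we obtain
$$\Psi_{I-i} = \left(\frac{\sigma_{I-i}^2}{C_{i,I-i}} + 1\right) \Upsilon_{I-i} - 1 \approx \left(\frac{\sigma_{I-i}^2}{C_{i,I-i}} + 1\right) \left[1 + \frac{\sigma_{I-i}^2}{\sum_{l=1}^{i-1} C_{l,I-i}}\right] - 1 \approx \frac{\sigma_{I-i}^2}{C_{i,I-i}} + \frac{\sigma_{I-i}^2}{\sum_{l=1}^{i-1} C_{l,I-i}}.$$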
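The remaining terms are approximated in the non-informative prior case completely similarly, i.e.
$$\begin{aligned}
\big(\beta_j^{(I+1)}\big)^2 \Psi_j
&\approx \left(\frac{C_{I-j,j}}{\sum_{l=1}^{I-j} C_{l,j}}\right)^2 \left(\frac{\sigma_j^2}{C_{I-j,j}} + \frac{\sigma_j^2}{\sum_{l=1}^{I-j-1} C_{l,j}}\right) \\
&= \left(\frac{C_{I-j,j}}{\sum_{l=1}^{I-j} C_{l,j}}\right)^2 \left(\frac{\sum_{l=1}^{I-j-1} C_{l,j} + C_{I-j,j}}{C_{I-j,j} \sum_{l=1}^{I-j-1} C_{l,j}}\right) \sigma_j^2 \\
&= \frac{C_{I-j,j}}{\sum_{l=1}^{I-j} C_{l,j}}\ \frac{\sigma_j^2}{\sum_{l=1}^{I-j-1} C_{l,j}} = \tilde\beta_j^{(I+1)}\, \frac{\sigma_j^2}{\sum_{l=1}^{I-j-1} C_{l,j}}.
\end{aligned}$$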
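In analogy to (9.16) we obtain for the uncertainties in the claims development result prediction in the non-informative prior case
$$
\mathrm{msep}_{\mathrm{CDR}_{i,I+1}|\mathcal{D}_I}(0)
= \big(\widehat{C}_{i,J-1}^{CL}\big)^2\left[\frac{s_{I-i}^2}{\big(\widehat{f}_{I-i}^{CL}\big)^2}\,\frac{1}{C_{i,I-i}}
+\frac{s_{I-i}^2}{\big(\widehat{f}_{I-i}^{CL}\big)^2}\,\frac{1}{\sum_{l=1}^{i-1}C_{l,I-i}}
+\sum_{j=I-i+1}^{J-2}\widetilde{\beta}_j^{(I+1)}\,\frac{s_j^2}{\big(\widehat{f}_j^{CL}\big)^2}\,\frac{1}{\sum_{l=1}^{I-j-1}C_{l,j}}\right],
\tag{9.27}
$$
for $s_j^2 = \sigma_j^2\big(\widehat{f}_j^{CL}\big)^2$ and $\widetilde{\beta}_j^{(I+1)}$ given by (9.26). This is the Merz-Wüthrich (MW)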
formula, see (3.17) in [69]. We also refer to Bühlmann et al. [23]. Formula (9.16)
is often called total run-off uncertainty and formula (9.27) corresponds to the one-year
run-off uncertainty. Comparing these two formulas we observe that from the
total uncertainty the first term with index j = I − i also appears in the one-year
uncertainty. From the remaining terms j ≥ I − i + 1 of the summation
in (9.16) only the second terms survive. These second terms correspond to the
parameter estimation error and need to be scaled with $\widetilde{\beta}_j^{(I+1)}$ to obtain the one-year
uncertainty, which reflects the release of parameter uncertainty when new
information (a new diagonal in the claims development triangle) arrives.
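As a rough illustration of how (9.27) can be evaluated numerically, the following Python sketch computes the approximate MW conditional MSEP per accident year. It rests on several assumptions that are not taken from the text above: the triangle is square (I = J) and stored as a NumPy array with rows i = 1, ..., I and columns j = 0, ..., J − 1, the chain-ladder factors are the usual ratios of column sums over the observed accident years, and the variance parameters $s_j^2$ are supplied by the user. The function name `mw_msep`, the toy triangle and the values of `s2` are hypothetical.

```python
import numpy as np

def mw_msep(C, s2):
    """Approximate MW formula (9.27) for a square cumulative triangle.

    C  : (I, I) array, C[i-1, j] = C_{i,j}; entries with i + j > I may be NaN.
    s2 : length I-1 array with the variance parameters s_j^2 (given, not estimated here).
    Returns an array with one conditional MSEP per accident year i = 1, ..., I.
    """
    C = np.asarray(C, dtype=float)
    I, J = C.shape
    assert I == J, "this sketch assumes a square triangle"
    # Chain-ladder factors f_j = sum_{l<=I-j-1} C_{l,j+1} / sum_{l<=I-j-1} C_{l,j} (assumed estimator)
    f = np.array([C[:I - j - 1, j + 1].sum() / C[:I - j - 1, j].sum() for j in range(J - 1)])
    ratio = np.asarray(s2, dtype=float) / f**2       # s_j^2 / (f_j^CL)^2
    msep = np.zeros(I)                                # accident year 1 is fully developed
    for i in range(2, I + 1):                         # 1-based accident year
        k = I - i                                     # last observed development period
        c_last = C[i - 1, k]                          # C_{i,I-i}
        c_ult = c_last * np.prod(f[k:])               # chain-ladder predictor of C_{i,J-1}
        bracket = ratio[k] / c_last + ratio[k] / C[:i - 1, k].sum()
        for j in range(k + 1, J - 1):                 # j = I-i+1, ..., J-2
            beta = C[I - j - 1, j] / C[:I - j, j].sum()        # weight (9.26)
            bracket += beta * ratio[j] / C[:I - j - 1, j].sum()
        msep[i - 1] = c_ult**2 * bracket
    return msep

# Hypothetical 3 x 3 toy triangle (not the data of Example 9.8)
triangle = np.array([[100.0, 150.0, 165.0],
                     [110.0, 170.0, np.nan],
                     [120.0, np.nan, np.nan]])
s2 = [4.0, 1.0]
print(np.sqrt(mw_msep(triangle, s2)))                 # square-rooted MSEP per accident year
```

The function only touches observed entries of the triangle, so unobserved cells may safely hold NaN. Estimating $s_j^2$ (including the last development period, which typically requires an extrapolation) is deliberately left outside the sketch.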
Example 9.16. We revisit claims reserving Example 9.8 and calculate the claims
development result uncertainty. We consider the non-informative prior case and
we choose the same parameter estimates as in Example 9.8. Moreover, we consider
the approximate MW formula (9.27).
The results are presented in Table 9.8. We see that in this example the one-year
claims development result uncertainty, measured by the square-rooted conditional
MSEP, amounts to 91% of the total uncertainty. The reason for this high value is that
knowing the next diagonal in the claims development triangle already releases a
major part of the claims run-off risks. For the next accounting year we predict payments
of 3’873’205, which is almost 2/3 of the total claims reserves, i.e. we have a rather
fast claims settlement in this example and a fast decrease of run-off uncertainties.
Typically, the square-rooted conditional MSEP of the claims development result is
in the range of 50% to 95% relative to the total uncertainty; the former relates to
liability insurance and the latter to property insurance.

            CL reserves      total msep^{1/2}    CDR msep^{1/2}    CDR msep^{1/2} /
   i        \hat{R}_i^{CL}   Mack (9.16)         MW (9.27)         total msep^{1/2}
   1             15’126              267               267              100%
   2             26’257              914               884               97%
   3             34’538            3’058             2’948               96%
   4             85’302            7’628             7’018               92%
   5            156’494           33’341            32’470               97%
   6            286’121           73’467            66’178               90%
   7            449’167           85’398            50’296               59%
   8          1’043’242          134’337           104’311               78%
   9          3’950’815          410’817           385’773               94%
 total        6’047’061          462’960           420’220               91%

Table 9.8: Claims reserves and prediction uncertainty: Mack’s formula (9.16) for
the total uncertainty and Merz-Wüthrich formula (9.27) for one-year claims
development uncertainty.
Exercise 17 (Italian motor third party liability insurance example). We revisit the
Italian motor third party liability insurance example of Bühlmann et al. [23]. The
field study considers 12 × 12 run-off triangles of 37 Italian insurance companies
at the end of 2006. For these data the claims reserves and the corresponding
conditional MSEPs for the total run-off uncertainty and for the one-year claims
development result uncertainty were calculated using Mack’s formula (9.16) and
the MW formula (9.27), respectively. The results are presented in Table 9.9. Note
that for confidentiality reasons the volumes of the 4 biggest companies were all set
equal to 100.0 and the order of these 4 companies is arbitrary.
Interpret these results.

 company    business    total msep^{1/2}    CDR msep^{1/2}     CDR msep^{1/2} / total msep^{1/2}
            volume      (in % reserves)     (in % reserves)    (in %)
    1        100.0            4.03                3.24               80.4
    2        100.0            2.90                2.36               81.4
    3        100.0            2.41                1.98               82.3
    4        100.0            3.45                2.85               82.6
    5         61.8            3.66                3.04               82.9
    6         56.9            5.54                4.50               81.2
    7         53.0            4.52                3.70               81.8
    8         49.4            4.60                3.82               83.1
    9         46.2            5.61                4.59               81.8
   10         41.6            5.32                4.36               82.0
   ...         ...             ...                 ...                ...
   30          3.5           18.02               14.78               82.0
   31          3.4           17.23               13.92               80.8
   32          2.6           18.73               14.89               79.5
   33          2.5           23.11               19.10               82.6
   34          2.2           20.83               17.53               84.2
   35          2.0           17.01               13.87               81.5
   36          1.8           26.16               21.54               82.4
   37          1.8           27.79               22.25               80.1
 Total                        0.96                0.78               81.8

Table 9.9: Italian motor third party liability insurance example of Bühlmann et
al. [23]. Prediction uncertainties: Mack’s formula (9.16) for the total uncertainty
and Merz-Wüthrich formula (9.27) for one-year claims development uncertainty.

Chapter 10
Solvency Considerations
In the previous chapters we have mainly discussed the modeling of insurance
contracts, the related liability cash flows and the implications for tariffication.
Recalling the discussion in Chapter 1, the insurance company organizes the equal
balance within the community. That is, it issues insurance contracts at a fixed
premium and in return it promises to cover all (financial) claims that fall under
these contracts. Of course, we need to make sure that the insurance company
can keep its promises. This is exactly the crucial task of supervision (regulation)
and sound risk management practice. Regulation aims to protect the policyholder
in that it requires (by law) the insurance company to follow risk management
requirements and to be sufficiently well capitalized so that it can fulfill its
promises also under certain stress scenarios. This is exactly what we would like
(and need) to study in the present chapter.
We have already touched on this issue in Chapter 5 on ruin theory. The main purpose
of Chapter 5 was to explain that there is a huge difference in ruin behavior between
light tailed and heavy tailed claims. Beyond that insight the random walk model of
Chapter 5 is much too simple to reflect real world insurance problems. Therefore,
we modify the ultimate ruin probability considerations so that they reflect the
current risk management task. In a first step we discuss more general risk
management views (for a comprehensive discussion we refer to Wüthrich-Merz [88]),
and in a second step we discuss more explicitly the solvency and risk management
implementations used in the insurance industry.
10.1 Balance sheet and solvency

In Chapter 1 of Wüthrich-Merz [88] we have provided the balance sheet of an
insurance company. It may look as follows (we only provide the positions that are
relevant for non-life insurance companies):
assets                                   liabilities
---------------------------------------  ---------------------------------------
cash and cash equivalents                deposits
debt securities                          policyholder deposits
bonds                                    reinsurance deposits
loans                                    borrowings
mortgages                                money market
real estate                              hybrid debt
equity                                   convertible debt
equity securities                        insurance liabilities
private equity                           claims reserves
investments in associates                premium reserves
hedge funds
derivatives                              derivatives
futures, swaptions, equity options
insurance and other receivables          insurance and other payables
reinsurance assets                       reinsurance liabilities
property and equipment                   employee benefit plan
intangible assets                        provisions
goodwill
deferred acquisition costs
income tax assets                        income tax liabilities
other assets                             other liabilities

Table 10.1: Balance sheet of a non-life insurance company at a fixed point in time.
Table 10.1 presents a snapshot of a non-life insurance company’s balance sheet,
that is, it reflects all positions at a certain moment in time $t \in \mathbb{R}_+$. The left
hand side shows the assets at time point t and the right hand side shows the
liabilities at the same time point t. We denote the value of the assets at time t by
$A_t$, and $L_t$ denotes the value of the liabilities at time t.
In the language of Chapter 5, we can think of $A_t$ as denoting all asset values in the
company at time t. These comprise the initial capital, all premia received and all
other amounts received minus the payments made up to time t. These amounts
are invested at the financial market and, thus, are allocated to the different asset
classes displayed in Table 10.1. On the other hand, the liabilities $L_t$ reflect the
value of all obligations accepted by the insurance company that are still open at
time t.
In the context of the ruin theory of Chapter 5 we should have $A_t \ge L_t$ in order to cover
the liabilities by asset values at time t. In fact, we have studied the continuous
time surplus process $(\widetilde{C}_t)_{t\in\mathbb{R}_+}$, given by $\widetilde{C}_t = A_t - L_t$, which should fulfill for a
given large probability $1 - p \in (0, 1)$
$$
P\left[\left.\inf_{t\in\mathbb{R}_+}\widetilde{C}_t \ge 0\,\right|\,\widetilde{C}_0 = c_0\right]
= P\left[\inf_{t\in\mathbb{R}_+}\,(A_t - L_t) \ge 0\right] \ge 1-p, \tag{10.1}
$$
see (5.4) and (5.5). Since an insurance company cannot continuously verify the
solvency situation, condition (10.1) is only checked on the discrete time grid $t \in \mathbb{N}_0$;
this is similar to (5.5). In fact, one goes even further than that, as we describe
next in several steps.
Step 1 (one-period problem). Let us assume that we are at time t = 0 and
we would like to check a solvency condition (no ruin condition) similar to (10.1).
Moreover, we assume that at time 0 we have only sold one-year contracts (one-year
risk exposures) for which we receive a premium at time 0 and for which the claim
is paid at the end of the year, i.e. at time t = 1.
The total asset value at time 0 is given by $A_0$. This value is invested at the financial
market and generates value $A_1$ at time 1. Thus, for this one-period problem the
no ruin condition reads as follows:
for a given large probability $1 - p \in (0, 1)$ the initial capital and the asset strategy
should be chosen such that
$$
P\left[A_1 \ge L_1\right] = P\left[L_1 - A_1 \le 0\right] \ge 1-p. \tag{10.2}
$$
This means that we need to choose the initial capital $c_0$ and the asset strategy, which
maps value $A_0$ at time 0 to value $A_1$ at time 1, such that the (given stochastic)
liabilities $L_1$ can be covered with large probability at time 1. Note that $A_1$ and $L_1$
are, in general, not independent.
Step 2 (risk measure). The no ruin condition in (10.2) is described under the
Value-at-Risk risk measure $\mathrm{VaR}_{1-p}(L_1 - A_1)$ on security level $1 - p \in (0, 1)$, see
Example 6.25. Assume we have a normalized and translation invariant risk measure
$\varrho$, see (6.13), then more generally
the initial capital and the asset strategy should be chosen such that
$$
\varrho\left(L_1 - A_1\right) \le 0. \tag{10.3}
$$
Solvency II uses the VaR risk measure on the 1 − p = 99.5% security level and the
Swiss Solvency Test (SST) uses the TVaR risk measure on the 1 − p = 99% security
level, see also Examples 6.25, 6.20 and 6.26. The main aspect is now concerned
with the stochastic modeling of position $L_1 - A_1$.
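The conditions (10.2) and (10.3) can be checked by simulation once a joint model for $A_1$ and $L_1$ is available. The following Python sketch is only an illustration under stated assumptions: the log-normal toy model, its parameters, the correlation of 0.3 and the function names `empirical_var` and `empirical_tvar` are hypothetical, and the TVaR estimator uses one common convention (the mean of all sample points at or above the empirical VaR), which may differ in detail from the definition in Example 6.26.

```python
import numpy as np

def empirical_var(x, level):
    """Empirical left-quantile: the k-th smallest sample point with k = ceil(level * n)."""
    xs = np.sort(np.asarray(x, dtype=float))
    k = max(int(np.ceil(level * xs.size)) - 1, 0)
    return xs[k]

def empirical_tvar(x, level):
    """Mean of all sample points at or above the empirical VaR (assumed convention)."""
    x = np.asarray(x, dtype=float)
    return x[x >= empirical_var(x, level)].mean()

rng = np.random.default_rng(seed=1)
n = 200_000
# Hypothetical toy model: dependent log-normal asset value A1 and liability value L1
z_a = rng.standard_normal(n)
z_l = 0.3 * z_a + np.sqrt(1.0 - 0.3**2) * rng.standard_normal(n)
A1 = 130.0 * np.exp(0.02 + 0.05 * z_a)
L1 = 100.0 * np.exp(0.10 * z_l)
deficit = L1 - A1                                   # position L1 - A1

prob_no_ruin = np.mean(A1 >= L1)                    # Monte Carlo estimate of P[A1 >= L1]
print("P[A1 >= L1]       ~", prob_no_ruin)
print("VaR  99.5% of L1-A1 ~", empirical_var(deficit, 0.995))
print("TVaR 99%   of L1-A1 ~", empirical_tvar(deficit, 0.99))
```

In this sketch the balance sheet would be called acceptable w.r.t. the chosen risk measure if the corresponding printed value is non-positive; otherwise the initial capital or the asset strategy of the toy model would have to be changed.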

Step 3 (market(-consistent) values). The main difficulty is the stochastic modeling
of $L_1 - A_1$. Some positions in this difference are traded at active financial
markets. For these positions we need to stochastically model their market prices
at time 1 (viewed from time 0). However, most positions on the liability side of
the balance sheet are not traded at active markets. For these positions we need to
determine market-consistent values and their stochastic developments in a
mark-to-model approach. Let us explain the rationale behind this with the liabilities $L_1$
at hand and using the claims reserving context of Chapter 9.
Assume we can split the liabilities $L_1$ into two elements:
(i) payments $X_1$ made at time 1 (similar to Section 9.1 we map all payments in
accounting year (0, 1] to its endpoint);
(ii) outstanding loss liabilities $L_1^+$ at time 1.
The liabilities at time 1 are then given by
$$
L_1 = X_1 + L_1^+.
$$
The easier part is the modeling of $X_1$. We need to find a stochastic model that is
able to predict the payments $X_1$ and capture the dependencies with $A_1$ and $L_1^+$.
The more complicated part is $L_1^+$. This amount should reflect a market-consistent
value for the outstanding loss liabilities at time 1. Observe that it differs from the
best-estimate reserves $R^{(1)}$ given in (9.22) in two crucial ways:
(1) The best-estimate reserves $R^{(1)}$ were calculated on a nominal basis, i.e. the
time value of money was not considered because no discounting was applied
to $R^{(1)}$.
(2) The best-estimate reserves $R^{(1)}$ are conditional expectations, conditioned on
the information $\mathcal{F}_1$. That is, these are expected payouts and we should add
a (risk, market-value) margin/loading to obtain market-consistent values for
risk averse financial agents being willing to do the run-off of these liabilities,
see Chapter 6.
The two adjustments (1) and (2) are motivated by the fact that $L_1^+$ should
reflect a price at which another insurance company is willing to take over the
liabilities at time 1 and to complete the run-off of the outstanding loss liabilities.
Step 4 (acceptability and solvency). As described above, we have three building
blocks $A_1$, $X_1$ and $L_1^+$ that we need to model stochastically (note that these
building blocks are not independent). In the last step, we then need to evaluate risk
measure condition (10.3). If this condition is fulfilled we have an acceptable balance
sheet and the company is solvent at time 0. If (10.3) is not fulfilled we have an
unacceptable balance sheet and it needs to be modified to achieve solvency. Options
for modification are the following: change the asset strategy so that it better
matches the liabilities; reduce liabilities and mitigate uncertainties in liabilities (if
possible); inject more initial capital.
In the remainder of this chapter we discuss the modeling of the asset deficit at time
t = 1, where the asset deficit is for $t \in \mathbb{N}_0$ defined by
$$
\mathrm{AD}_t \overset{\text{def.}}{=} L_t - A_t = X_t + L_t^+ - A_t. \tag{10.4}
$$
Thus, the insurance company is solvent at time 0 (w.r.t. the risk measure $\varrho$) if
$$
\varrho\left(\mathrm{AD}_1\right) = \varrho\left(L_1 - A_1\right) \le 0.
$$
10.2 Risk modules
Typically the modeling of the asset deficit $\mathrm{AD}_1$ defined in (10.4) is split into different
modules that reflect different risk classes. In a first step each risk class is studied
individually and in a second step the results are aggregated to obtain the overall
picture.

Figure 10.1: lhs: Swiss Solvency Test risk modules; rhs: Solvency II risk modules
(sources [26] and [41]).
One can question whether this modeling approach is smart. Modeling individual
risk classes can still be fine, but aggregation is far from straightforward because
it is very difficult to capture the interaction between the different risk classes.
Nevertheless we would like to describe the approach used in practice.
In Figure 10.1 we show the individual risk modules used in the Swiss Solvency Test
and in Solvency II. Overall they are rather similar though some differences exist.
Often one considers the following 4 risk classes, which are driven by the risk factors
described next:

1. Market risk. We cite SCR.5.1. of QIS5 [41]: “Market risk arises from the level
or volatility of market prices of financial instruments. Exposure to market
risk is measured by the impact of movements in the level of financial variables
such as stock prices, interest rates, real estate prices and exchange rates.”
2. Insurance risk. Insurance risk is typically split into the different insurance
branches: non-life insurance, life insurance, health insurance and reinsurance.
Here we concentrate on non-life insurance risk. This is further
subdivided into (i) reserve risk which describes outstanding loss liabilities of
past exposure claims; and (ii) premium risk which describes the risk deriving
from newly sold contracts that give an exposure over the next accounting
period.
3. Credit risk. We cite SCR.6.1. of QIS5 [41]: “The counterparty default risk
module should reflect possible losses due to unexpected default, or deterioration
in the credit standing, of the counterparties and debtors of undertakings
over the forthcoming twelve months. The scope of the counterparty default
risk module includes risk-mitigating contracts, such as reinsurance arrangements,
securitisations and derivatives, and receivables from intermediaries,
as well as any other credit exposures which are not covered in the spread risk
sub-module.”
4. Operational risk. We cite SCR.3.1. of QIS5 [41]: “Operational risk is the risk
of loss arising from inadequate or failed internal processes, or from personnel
and systems, or from external events. Operational risk should include legal
risks, and exclude risks arising from strategic decisions, as well as reputation
risks. The operational risk module is designed to address operational risks to
the extent that these have not been explicitly covered in other risk modules.”
Let us formalize these risk factors and classes. To this end, we first consider the
beginning of accounting year 1. At time t = 0 the asset deficit is given by
$$
\mathrm{AD}_0 = L_0 - A_0.
$$
We assume that $X_0 = 0$, which implies that $L_0 = L_0^+$; thus, $L_0^+$ is the value of all
liabilities that need to be settled after t = 0. For simplification we assume that
the liabilities consist of insurance liabilities only. In this case, $L_0^+$ describes the
liabilities stemming from claims with accident date prior to t = 0 (these are the
liabilities of past exposure claims; we denote them by previous year claims PY, see
also Chapter 9), and of claims with accident date in accounting year (0, 1] (these
are the liabilities of the new premium exposure if we assume one-year contracts
only; we denote them by current year claims CY). Summarizing, this implies on
the liability side of the balance sheet at time t = 0 (with the obvious notation)
$$
L_0 = L_0^+ = L_0^{PY} + L_0^{CY}.
$$

On the asset side of the balance sheet we have (this is also a simplified version)
$$
A_0 = c_0 + A_0^{PY} + \pi^{CY},
$$
where $A_0^{PY}$ are the provisions to cover the PY liabilities $L_0^{PY}$, $\pi^{CY}$ is the premium
received for the CY claims $L_0^{CY}$ and $c_0$ is the initial capital. As described above,
this amount $A_0$ is invested at the financial market and provides value $A_1$ at time
t = 1. This value needs to be compared to
$$
L_1 = X_1 + L_1^+ = X_1^{PY} + X_1^{CY} + X_1^{Op} + L_1^{+,PY} + L_1^{+,CY},
$$
where $X_1^{PY}$ are the payments for PY claims, $X_1^{CY}$ are the payments for CY claims,
$L_1^{+,PY}$ is the value of the outstanding loss liabilities at time t = 1 for claims with
accident date prior to t = 0, and $L_1^{+,CY}$ is the value of the outstanding loss liabilities
at time t = 1 for CY claims (i.e. accident date in (0, 1]). Thus, if we merge these
two values, $L_1^+ = L_1^{+,PY} + L_1^{+,CY}$, we obtain the new outstanding loss liabilities for
past exposure claims with accident date prior to t = 1. Finally, $X_1^{Op}$ denotes
the operational risk loss payment where, for simplicity, we assume that this can
immediately be settled. We conclude that the asset deficit at time 1 is given by
\begin{align}
\mathrm{AD}_1 &= X_1^{PY} + X_1^{CY} + X_1^{Op} + L_1^{+,PY} + L_1^{+,CY} - A_1 \tag{10.5}\\
&= \left(X_1^{PY} + L_1^{+,PY}\right) + \left(X_1^{CY} + L_1^{+,CY}\right) + X_1^{Op} - A_1. \tag{10.6}
\end{align}
Let us comment on (10.5)-(10.6).
• Formula (10.5) gives the split into payments and outstanding liabilities. This
view is crucial for doing asset-and-liability management, i.e. to compare the
structure of the asset portfolio to the maturities of the liabilities.
• Formula (10.6) provides the split into PY risk and CY risk. The PY risk is
mainly described by the claims development result described in Section 9.4.
The CY risk is described by a compound distribution as, for instance, seen in
Example 4.11. However, both descriptions only consider nominal claims. In
order to obtain values we still need to account for the time value of the cash
flow payments and to add a risk margin for bearing the run-off risks; thus
these values also depend on financial market movements.
• Coming back to the risk modules: market risk affects all variables in (10.5);
insurance risk is mainly reflected in $X_1^{PY}$, $L_1^{+,PY}$, $X_1^{CY}$ and $L_1^{+,CY}$; credit risk
is a main risk driver in $A_0$; and operational risk is reflected in $X_1^{Op}$.
In the remainder we concentrate on the modeling of insurance liabilities.

10.3 Insurance liability variables

10.3.1 Market-consistent value
We still face the difficulty that we should attach market-consistent values to the
insurance liabilities which provides value at time t = 1 given by
$$
L_1^{Ins} \overset{\text{def.}}{=} X_1^{PY} + L_1^{+,PY} + X_1^{CY} + L_1^{+,CY}.
$$
Note that in our terminology $L_1 = L_1^{Ins} + X_1^{Op}$. Assume that $X = (X_1, \ldots, X_n)$
denotes the random cash flow that is generated by the insurance liabilities. We
assume that this cash flow is adapted to the filtration $\mathbb{F} = (\mathcal{F}_s)_{s\ge 1}$. In analogy
to Wüthrich-Merz [87] we need to choose an appropriate (state price) deflator
$\varphi = (\varphi_1, \ldots, \varphi_n)$ (which is $\mathbb{F}$-adapted and strictly positive, $P$-a.s.) and then
$$
L_1^{Ins} = \frac{1}{\varphi_1}\sum_{s\ge 1} E\left[\left.\varphi_s X_s\right|\mathcal{F}_1\right]
$$
provides a market-consistent value in an arbitrage-free pricing system described by
the triple $(P, \mathbb{F}, \varphi)$. For a one-period problem this was already described in Section
6.2.5. Under the assumption of uncorrelatedness of $\varphi_s$ and $X_s$, conditionally given
$\mathcal{F}_1$, we can rewrite the market-consistent value of the insurance liabilities as
\begin{align}
L_1^{Ins} &= \sum_{s\ge 1}\frac{1}{\varphi_1}E\left[\left.\varphi_s\right|\mathcal{F}_1\right]E\left[\left.X_s\right|\mathcal{F}_1\right]
= \sum_{s\ge 1}P(1,s)\,E\left[\left.X_s\right|\mathcal{F}_1\right] \tag{10.7}\\
&= X_1^{PY} + X_1^{CY} + \sum_{s\ge 2}P(1,s)\,E\left[\left.X_s\right|\mathcal{F}_1\right], \notag
\end{align}
where $P(1, s)$ denotes the price at time 1 of the zero-coupon bond that matures at
time s. Note that viewed from time 0 both $P(1, s)$ and $E[X_s|\mathcal{F}_1]$ are $\mathcal{F}_1$-measurable
random variables.
Under all the previous assumptions (in particular the uncorrelatedness assumption
(10.7)) the acceptability requirement (10.3) reads as:
The initial capital $c_0$ and the asset strategy should be chosen such that
$$
\varrho\left(X_1^{PY} + X_1^{CY} + X_1^{Op} + \sum_{s\ge 2}P(1,s)\,E\left[\left.X_s\right|\mathcal{F}_1\right] - A_1\right) \le 0. \tag{10.8}
$$
Since the asset deficit still has a rather involved form the model is further simplified.
Denote the expected values
$$
p(1,s) = E\left[\left.P(1,s)\right|\mathcal{F}_0\right] \qquad\text{and}\qquad x_s = E\left[\left.X_s\right|\mathcal{F}_0\right].
$$
Then we use the following linear approximation
$$
P(1,s)\,E\left[\left.X_s\right|\mathcal{F}_1\right] \approx p(1,s)\,x_s + \left(P(1,s)-p(1,s)\right)x_s + p(1,s)\left(E\left[\left.X_s\right|\mathcal{F}_1\right]-x_s\right).
$$
The first term $p(1, s)x_s$ is the expected value (viewed from time 0) of the time-1
price $P(1, s)E[X_s|\mathcal{F}_1]$. The term $(P(1, s) - p(1, s))\,x_s$ captures uncertainties in
financial discounting and $p(1, s)\,(E[X_s|\mathcal{F}_1] - x_s)$ describes volatilities in the insurance
cash flows. The cross term of the uncertainties was dropped in this approximation.
Typically, the above terms are assumed to be independent so that they can
be studied individually and aggregation is obtained by simply convolving their
marginal distributions.
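To make explicit which term is neglected, the product can be expanded exactly around the time-0 expectations. The following short check uses only the definitions of $p(1,s)$ and $x_s$ given above:

```latex
\begin{align*}
P(1,s)\,E[X_s|\mathcal{F}_1]
&= \Big(p(1,s) + \big(P(1,s)-p(1,s)\big)\Big)\Big(x_s + \big(E[X_s|\mathcal{F}_1]-x_s\big)\Big)\\
&= p(1,s)\,x_s + \big(P(1,s)-p(1,s)\big)\,x_s + p(1,s)\,\big(E[X_s|\mathcal{F}_1]-x_s\big)\\
&\quad + \big(P(1,s)-p(1,s)\big)\big(E[X_s|\mathcal{F}_1]-x_s\big).
\end{align*}
```

The last product is the cross term. It is of second order in the two deviations, and both deviations have conditional mean zero given $\mathcal{F}_0$; hence the neglected term has conditional mean zero exactly when $P(1,s)$ and $E[X_s|\mathcal{F}_1]$ are uncorrelated given $\mathcal{F}_0$.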
This approximation implies that for (10.8) we study the following three terms
$$
\begin{aligned}
Z_1 &= \sum_{s\ge 1}\Big[p(1,s)\,x_s + \left(P(1,s)-p(1,s)\right)x_s\Big] - A_1,\\
Z_2 &= \sum_{s\ge 1}p(1,s)\left(E\left[\left.X_s\right|\mathcal{F}_1\right]-x_s\right),\\
Z_3 &= X_1^{Op}.
\end{aligned}
$$
$Z_1$ describes market and credit risks, $Z_2$ describes insurance risk and $Z_3$ describes
operational risk. In non-life insurance one often assumes that these three random
variables are independent.
In the remainder of this chapter we describe the insurance liability variable $Z_2$. For
the other terms we refer to the solvency literature QIS5 [41], Swiss Solvency Test
[43] and Wüthrich-Merz [88].
10.3.2 Insurance risk
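We study insurance risk given by
$$
Z_2 = \sum_{s\ge 1}p(1,s)\left(E\left[\left.X_s\right|\mathcal{F}_1\right]-x_s\right).
$$
As already mentioned the insurance variables are separated into PY variables and
CY variables w.r.t. the valuation date t = 0. This provides the split
$$
Z_2 = Z_2^{PY} + Z_2^{CY}
\overset{\text{def.}}{=} \sum_{s\ge 1}p(1,s)\left(E\left[\left.X_s^{PY}\right|\mathcal{F}_1\right]-x_s^{PY}\right)
+ \sum_{s\ge 1}p(1,s)\left(E\left[\left.X_s^{CY}\right|\mathcal{F}_1\right]-x_s^{CY}\right).
$$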
The final simplification is that we assume that there are deterministic payout
patterns $(\gamma_s^{PY})_{s\ge 1}$ and $(\gamma_s^{CY})_{s\ge 1}$, for instance, obtained by the CL method, see
Theorem 9.10 (and the estimation errors in these patterns are neglected). Then the last
expressions can be modified to
$$
\begin{aligned}
Z_2^{PY} &= \left(\sum_{s\ge 1}p(1,s)\,\gamma_s^{PY}\right)\left[X_1^{PY} + R^{(1)} - R^{(0)}\right],\\
Z_2^{CY} &= \left(\sum_{s\ge 1}p(1,s)\,\gamma_s^{CY}\right)\Big[E\left[\left.S_1\right|\mathcal{F}_1\right] - E\left[\left.S_1\right|\mathcal{F}_0\right]\Big].
\end{aligned}
$$
The first line $Z_2^{PY}$ reflects the study of the claims development result, see (9.23).
The second line $Z_2^{CY}$ describes the total nominal claim $S_1$ of accident year (0, 1]
that is caused by the premium exposure $\pi^{CY}$. The terms in the round brackets
are the deterministic discount factors that respect the underlying maturities of the
cash flows; the terms in the square brackets are the random terms that need further
modeling and analysis.
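The deterministic factor in the round brackets is a simple weighted sum and can be sketched in Python as follows. The payout pattern, the flat interest rate used to build the expected zero-coupon bond prices $p(1,s)$, the realized nominal deviation and the function name `discounted_pattern_factor` are hypothetical choices made only for illustration.

```python
import numpy as np

def discounted_pattern_factor(pattern, zcb_prices):
    """Return sum_s p(1, s) * gamma_s for a payout pattern gamma and prices p(1, s)."""
    pattern = np.asarray(pattern, dtype=float)
    zcb_prices = np.asarray(zcb_prices, dtype=float)
    if pattern.shape != zcb_prices.shape:
        raise ValueError("pattern and zcb_prices must have the same length")
    return float(np.sum(zcb_prices * pattern))

gamma_py = np.array([0.5, 0.3, 0.2])        # hypothetical payout pattern for s = 1, 2, 3
r = 0.02                                     # hypothetical flat annual interest rate
s = np.arange(1, gamma_py.size + 1)
p = (1.0 + r) ** -(s - 1.0)                  # assumed p(1, s); note p(1, 1) = 1
factor = discounted_pattern_factor(gamma_py, p)

nominal_deviation = 12.0                     # hypothetical value of X1PY + R(1) - R(0)
z2_py = factor * nominal_deviation           # corresponding realization of Z2PY
print(factor, z2_py)
```

With a pattern that sums to one and positive interest rates the factor lies below one, so the discounted deviation is slightly smaller in absolute value than the nominal one.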
Claims development result
The claims development result for PY claims, given by
$$
\mathrm{CDR}_1 = -\left[X_1^{PY} + R^{(1)} - R^{(0)}\right],
$$
has expected value 0, see Corollary 9.12, if the claims reserves are defined by
conditional expectations in a Bayesian model. Therefore, there remains the study
of higher moments. In practice, one restricts to the second moment:
• Calculate for every line of business the conditional MSEP of the claims
development result prediction, for instance using MW formula (9.27). This
provides a variance estimate for every line of business.
• Specify a correlation matrix between the different lines of business, see for
instance SCR.9.34. in QIS5 [41].
• The previous two items allow us to aggregate the uncertainties of the individual
lines of business to obtain the overall variance over the sum of all lines of
business.
• Fit a translated gamma or log-normal distribution to these first two moments,
assuming that the mean is exactly given by $R^{(0)}$. This provides the
distribution of $\mathrm{CDR}_1$.
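The four steps above can be sketched in a few lines of Python. All numbers (the square-rooted MSEPs per line of business, the correlation matrix and the reserves $R^{(0)}$) are hypothetical, the function names are illustrative, and only the log-normal variant of the last step is shown; the fit is a plain method-of-moments match.

```python
from statistics import NormalDist

import numpy as np

def aggregate_std(stds, corr):
    """Standard deviation of a sum, given marginal standard deviations and correlations."""
    stds = np.asarray(stds, dtype=float)
    corr = np.asarray(corr, dtype=float)
    return float(np.sqrt(stds @ corr @ stds))

def lognormal_params(mean, std):
    """Parameters (mu, sigma) of a log-normal distribution with the given mean and std."""
    sigma2 = np.log1p((std / mean) ** 2)
    return float(np.log(mean) - 0.5 * sigma2), float(np.sqrt(sigma2))

msep_sqrt = np.array([30.0, 40.0, 20.0])     # hypothetical msep^{1/2} per line of business
corr = np.array([[1.00, 0.50, 0.25],
                 [0.50, 1.00, 0.25],
                 [0.25, 0.25, 1.00]])        # hypothetical correlation matrix
R0 = 1000.0                                  # hypothetical total reserves R(0)

total_std = aggregate_std(msep_sqrt, corr)
mu, sigma = lognormal_params(R0, total_std)  # log-normal for X1PY + R(1) with mean R(0)
q995 = float(np.exp(mu + sigma * NormalDist().inv_cdf(0.995)))
print("aggregated std:", total_std)
print("99.5% quantile of -CDR1:", q995 - R0)
```

Since $-\mathrm{CDR}_1 = X_1^{PY} + R^{(1)} - R^{(0)}$, quantiles of the loss are obtained by shifting the fitted log-normal quantiles by $R^{(0)}$, as done in the last line.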
Premium liability risk
The claim $E[S_1|\mathcal{F}_1]$ resulting from the premium exposure $\pi^{CY}$ is split into two
independent random variables $S_{sc}$ and $S_{lc}$, where $S_{sc}$ reflects all small claims below
a given threshold M and $S_{lc}$ the claims above that threshold, see Examples 2.16
and 4.11.
The large claim $S_{lc}$ is modeled per line of business (or by peril) by independent
compound Poisson distributions with Pareto claims severities, and aggregation is
done using the aggregation property of Theorem 2.12, resulting in a compound Poisson
distribution. The latter can be determined, for instance, with the Panjer algorithm,
see Theorem 4.8.
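The large claims step can be illustrated by the following Python sketch. It assumes that the severities have already been discretized on a common lattice (the discretization of a Pareto distribution is not shown), it merges independent compound Poisson distributions by adding the intensities and mixing the severity distributions with intensity weights, and it then applies the Panjer recursion in its compound Poisson form. The intensities, the severity probability weights and the function names are hypothetical.

```python
import numpy as np

def merge_compound_poisson(lams, pmfs):
    """Merge independent compound Poisson distributions with severities on one lattice.

    Returns the total intensity and the intensity-weighted mixture of the severity pmfs.
    """
    lams = np.asarray(lams, dtype=float)
    pmfs = np.asarray(pmfs, dtype=float)       # one row per line of business
    lam = float(lams.sum())
    return lam, (lams[:, None] * pmfs).sum(axis=0) / lam

def panjer_poisson(lam, g, n_max):
    """Panjer recursion for S = Y_1 + ... + Y_N with N ~ Poisson(lam), P[Y = k] = g[k]."""
    g = np.asarray(g, dtype=float)
    f = np.zeros(n_max + 1)
    f[0] = np.exp(-lam * (1.0 - g[0]))         # P[S = 0]
    for r in range(1, n_max + 1):
        k = np.arange(1, min(r, g.size - 1) + 1)
        f[r] = lam / r * np.sum(k * g[k] * f[r - k])
    return f

# Two hypothetical lines of business with severities on the lattice {0, 1, 2, 3}
lam, g = merge_compound_poisson([0.5, 1.5],
                                [[0.0, 0.6, 0.3, 0.1],
                                 [0.0, 0.2, 0.3, 0.5]])
f = panjer_poisson(lam, g, n_max=60)
print("total intensity:", lam)
print("P[S <= 10]:", f[:11].sum())
```

Truncating the recursion at `n_max` is harmless as long as the neglected tail mass is small; the sum of the computed probabilities gives a quick check of this.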

The small claim $S_{sc}$ is treated similarly to the claims development result, i.e. we
estimate per line of business the first two moments. These moments are aggregated using
an appropriate correlation matrix, see for instance Section 8.4.2. in the technical
Swiss Solvency Test document [43], and a gamma or a log-normal distribution is fitted
to these first two moments.
Remarks.
• In the Swiss Solvency Test one distinguishes between pure process risk and
parameter uncertainty for the small claims layer. Process risk is diversifiable
with increasing volume, whereas parameter uncertainty is not. As a result
the coefficient of variation per line of business has a similar form as has been
found for the compound binomial distribution, see Proposition 2.24. That
is, for volume v → ∞ the coefficient of variation does not vanish but stays
strictly positive.
• In the Swiss Solvency Test one aggregates in addition so-called scenarios. The
motivation for this is that the present model cannot reflect all uncertainties
and therefore it is slightly disturbed by scenarios. These scenarios are basically
claims of Bernoulli type, i.e. they occur with a certain probability and if
they occur they have a given amount.
• For the aggregation between PY and CY claims it is either assumed that they
are independent, or that the claims development result uncertainty $\mathrm{CDR}_1$ and
the CY small claim $S_{sc}$ are again aggregated via a correlation matrix and then
an overall distribution is fitted to the resulting first two moments.
Market-value margin
The careful reader will have noticed that we have lost the risk margin somewhere
on the way to the final result. We will not further discuss the risk and market-value
margin here, we only want to mention that the current calculation of the
market-value margin is quite ad-hoc, see Chapter 6 in Swiss Solvency Test [43] and Section
10.3 in Wüthrich-Merz [87], and further refinements are necessary.


Bibliography
[1] Acerbi, C., Tasche, D. (2002). On the coherence of expected shortfall. Journal of
Banking and Finance 26/7, 1487-1503.
[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans-
actions on Automatic Control 19/6, 716-723.
[3] Alai, D.H., Merz, M., Wüthrich, M.V. (2009). Mean square error of prediction in
the Bornhuetter-Ferguson claims reserving method. Annals of Actuarial Science
4/1, 7-31.
[4] Alai, D.H., Merz, M., Wüthrich, M.V. (2010). Prediction uncertainty in the
Bornhuetter-Ferguson claims reserving method: revisited. Annals of Actuarial Sci-
ence 5/1, 7-17.
[5] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1997). Thinking coherently. Risk
10/11, 68-71.
[6] Artzner, P., Delbaen, F., Eber, J.M., Heath, D. (1999). Coherent measures of risk.
Mathematical Finance 9/3, 203-228.
[7] Asmussen, S., Albrecher, H. (2010). Ruin Probabilities. 2nd edition. World Scien-
tific.
[8] Bahr, von B. (1975). Asymptotic ruin probabilities when exponential moments do
not exist. Scandinavian Actuarial Journal 1975, 6-10.
[9] Bailey, R.A. (1963). Insurance rates with minimum bias. Proc. CAS, 4-11.
[10] Bailey, R.A., Simon, L.J. (1960). Two studies on automobile insurance ratemaking.
ASTIN Bulletin 1, 192-217.
[11] Bichsel, F. (1964). Erfahrungstarifierung in der Motorfahrzeug-Haftpflicht-

Versicherung. Bulletin of the Swiss Association of Actuaries 1964, 119-130.
[12] Billingsley, P. (1968). Probability and Measure. Wiley.
[13] Billingsley, P. (1995). Probability and Measure. 3rd edition. Wiley.
[14] Boland, P.J. (2007). Statistical and Probabilistic Methods in Actuarial Science.
Chapman & Hall/CRC.
[15] Bolthausen, E., Wüthrich, M.V. (2013). Bernoulli’s law of large numbers. ASTIN
Bulletin 43/2, 73-79.
[16] Bornhuetter, R.L., Ferguson, R.E. (1972). The actuary and IBNR. Proc. CAS,
Vol. LIX, 181-195.
[17] Boyd, S., Vandenberghe, L. (2004). Convex Optimization. Cambridge University

Press.
[18] Bühlmann, H. (1970). Mathematical Methods in Risk Theory. Springer.
[19] Bühlmann, H. (1980). An economic premium principle. ASTIN Bulletin 11/1, 52-
60.
[20] Bühlmann, H. (1992). Stochastic discounting. Insurance: Mathematics and Eco-
nomics 11/2, 113-127.
[21] Bühlmann, H. (1995). Life insurance with stochastic interest rates. In: Financial
Risk in Insurance, G. Ottaviani (ed.), Springer, 1-24.
[22] Bühlmann, H. (2004). Multidimensional valuation. Finance 25, 15-29.
[23] Bühlmann, H., De Felice, M., Gisler, A., Moriconi, F., Wüthrich, M.V. (2009).
Recursive credibility formula for chain ladder factors and the claims development
result. ASTIN Bulletin 39/1, 275-306.
[24] Bühlmann, H., Gisler, A. (2005). A Course in Credibility Theory and its Applica-
tions. Springer.
[25] Bühlmann, H., Straub, E. (1970). Glaubwürdigkeit für Schadensätze. Bulletin of

the Swiss Association of Actuaries 1970, 111-131.
[26] Bundesamt für Privatversicherungen (2004). Weissbuch des Schweizer Solvenztests.

November 2004.
[27] Congdon, P. (2006). Bayesian Statistical Modelling. 2nd edition. Wiley.
[28] Cramér, H. (1930). On the Mathematical Theory of Risk. Skandia Jubilee Volume,
Stockholm.
[29] Cramér, H. (1955). Collective Risk Theory. Skandia Jubilee Volume, Stockholm.
[30] Cramér, H. (1994). Collected Works. Volumes I & II. Edited by A. Martin-Löf.
Springer.
[31] Delbaen, F. (2000). Coherent Risk Measures. Cattedra Galileiana. Pisa.
[32] Denneberg, D. (1989). Verzerrte Wahrscheinlichkeiten in der Versicherungsmathe-

matik, quantilabhängige Prämienprinzipien. Mathematik Arbeitspapiere 34, Uni-
versity of Bremen.
[33] Denuit, M., Maréchal, X., Pitrebois, S., Walhin, J.-F. (2007). Actuarial Modelling
of Claims Count. Wiley.
[34] Dickson, D.C.M. (2005). Insurance Risk and Ruin. Cambridge University Press.
[35] Duffie, D. (2001). Dynamic Asset Pricing Theory. 3rd edition. Princeton University
Press.
[36] Embrechts, P., Klüppelberg, C., Mikosch, T. (2003). Modelling Extremal Events
for Insurance and Finance. 4th printing. Springer.
[37] Embrechts, P., Nešlehová, J., Wüthrich, M.V. (2009). Additivity properties for
Value-at-Risk under Archimedean dependence and heavy-tailedness. Insurance:
Mathematics and Economics 44/2, 164-169.
[38] Embrechts, P., Veraverbeke, N. (1982). Estimates for the probability of ruin with
special emphasis on the possibility of large claims. Insurance: Mathematics and
Economics 1/1, 55-72.
[39] England, P.D., Verrall, R.J. (2002). Stochastic claims reserving in general insur-
ance. British Actuarial Journal 8/3, 443-518.
[40] England, P.D., Verrall, R.J., Wüthrich, M.V. (2012). Bayesian overdispersed Pois-
son model and the Bornhuetter-Ferguson claims reserving method. Annals of Ac-
tuarial Science 6/2, 258-283.
[41] European Commission (2010). QIS 5 Technical Specifications, Annex to Call for
Advice from CEIOPS on QIS5.
tes
[42] Feller, W. (1966). An Introduction to Probability Theory and its Applications.
Volume II. Wiley.
[43] FINMA (2006). Swiss Solvency Test. FINMA SST Technisches Dokument, Version
no
2. October 2006.
[44] Föllmer, H., Schied, A. (2004). Stochastic Finance, An Introduction in Discrete

Time. 2nd edition. de Gruyter.
[45] Fortuin, C.M., Kasteleyn, P.W., Ginibre, J. (1971). Correlation inequalities on

NL
some partially ordered sets. Communication Mathematical Physics 22/2, 89-103.
[46] Frees, E.W. (2010). Regression Modeling with Actuarial and Financial Applica-
tions. Cambridge University Press.
[47] Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (1996). Markov Chain Monte Carlo
in Practice. Chapman & Hall.
[48] Gisler, A. (2011). Nicht-Leben Versicherungsmathematik. Lecture Notes, ETH

Zurich.
[49] Gisler, A., Wüthrich, M.V. (2008). Credibility for the chain ladder reserving
method. ASTIN Bulletin 38/2, 565-600.
[50] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination. Biometrika 82/4, 711-732.

250 Bibliography
[51] Green, P.J. (2003). Trans-dimensional Markov chain Monte Carlo. In: Highly
Structured Stochastic Systems, P.J. Green, N.L. Hjort, S. Richardson (eds.), Oxford
Statistical Science Series, 179-206. Oxford University Press.
[52] Hachemeister, C.A., Stanard, J.N. (1975). IBNR claims count estimation with
static lag functions. ASTIN Colloquium 1975, Portugal.
[53] Hofert, M., Wüthrich, M.V. (2013). Statistical review of nuclear power accidents.
Asia-Pacific Journal of Risk and Insurance 7/1, Article 1.
[54] Johansen, A.M., Evers, L., Whiteley, N. (2010). Monte Carlo Methods. Lecture
Notes, Department of Mathematics, University of Bristol.
[55] Johnson, R.A., Wichern, D.W. (1998). Applied Multivariate Statistical Analysis.
4th edition. Prentice-Hall.
[56] Jung, J. (1968). On automobile insurance ratemaking. ASTIN Bulletin 5, 41-48.
[57] Kaas, R., Goovaerts, M., Dhaene, J., Denuit, M. (2008). Modern Actuarial Risk
Theory, Using R. 2nd edition. Springer.
[58] Kehlmann, D. (2005). Die Vermessung der Welt. Rowohlt Verlag.
[59] Kyprianou, A. (2014). Gerber-Shiu Risk Theory. Springer.
[60] Kremer, E. (1985). Einführung in die Versicherungsmathematik. Vandenhoeck &
Ruprecht, Göttingen.
[61] Lehmann, E.L. (1983). Theory of Point Estimation. Wiley.
[62] Lundberg, F. (1903). Approximerad framställning av sannolikhetsfunktionen.
Återförsäkring av kollektivrisker. Almqvist & Wiksell, Uppsala.
[63] Mack, T. (1991). A simple parametric model for rating automobile insurance or
estimating IBNR claims reserves. ASTIN Bulletin 21/1, 93-109.
[64] Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder
reserve estimates. ASTIN Bulletin 23/2, 213-225.
[65] Mack, T. (2008). The prediction error of Bornhuetter/Ferguson. ASTIN Bulletin
38/1, 87-103.
[66] McCullagh, P., Nelder, J.A. (1983). Generalized Linear Models. Chapman & Hall.
[67] McGrayne, S.B. (2011). The Theory That Would Not Die. Yale University Press.
[68] McNeil, A.J., Frey, R., Embrechts, P. (2005). Quantitative Risk Management:
Concepts, Techniques and Tools. Princeton University Press.
[69] Merz, M., Wüthrich, M.V. (2008). Modelling the claims development result for
solvency purposes. CAS E-Forum, Fall 2008, 542-568.
[70] Merz, M., Wüthrich, M.V. (2013). Mathematik für Wirtschaftswissenschaftler.
Vahlen.
[71] Mikosch, T. (2006). Non-Life Insurance Mathematics. Springer.
[72] Ohlsson, E., Johansson, B. (2010). Non-Life Insurance Pricing with Generalized
Linear Models. Springer.
[73] Panjer, H.H. (1981). Recursive evaluation of a family of compound distributions.
ASTIN Bulletin 12/1, 22-26.
[74] Panjer, H.H. (2006). Operational Risk: Modeling Analytics. Wiley.
[75] Renshaw, A.E., Verrall, R.J. (1998). A stochastic model underlying the
chain-ladder technique. British Actuarial Journal 4/4, 903-923.
[76] Resnick, S.I. (1997). Heavy tail modeling of teletraffic data. Annals of Statistics
25/5, 1805-1869.
[77] Resnick, S.I. (2002). Adventures in Stochastic Processes. 3rd printing. Birkhäuser.
[78] Robert, C.P. (2001). The Bayesian Choice. 2nd edition. Springer.
[79] Rolski, T., Schmidli, H., Schmidt, V., Teugels, J. (1999). Stochastic Processes for
Insurance and Finance. Wiley.
[80] Saluz, A., Gisler, A., Wüthrich, M.V. (2011). Development pattern and prediction
error for the stochastic Bornhuetter-Ferguson claims reserving model. ASTIN
Bulletin 41/2, 279-317.
[81] Schmidli, H. (2007). Risk Theory. Lecture Notes, University of Cologne.
[82] Schweizer, M. (2009). Stochastic Processes and Stochastic Analysis. Lecture Notes,
ETH Zurich.
[83] Sundt, B., Jewell, W.S. (1981). Further results on recursive evaluation of compound
distributions. ASTIN Bulletin 12/1, 27-39.
[84] Tsanakas, A., Christofides, N. (2006). Risk exchange with distorted probabilities.
ASTIN Bulletin 36/1, 219-243.
[85] Williams, D. (1991). Probability with Martingales. Cambridge University Press.
[86] Wüthrich, M.V., Bühlmann, H., Furrer, H. (2010). Market-Consistent Actuarial
Valuation. 2nd edition. Springer.
[87] Wüthrich, M.V., Merz, M. (2008). Stochastic Claims Reserving Methods in
Insurance. Wiley.
[88] Wüthrich, M.V., Merz, M. (2013). Financial Modeling, Actuarial Valuation and
Solvency in Insurance. Springer.

List of exercises
Exercise 1, page 18
Exercise 2, page 21
Corollary 2.7, page 28
Exercise 3, page 50
Exercise 4, page 60
Exercise 5, page 81
Exercise 6, page 94
Corollary 6.6, page 140
Exercise 7, page 141

Index
F-distribution, 171
χ²-distribution, 21, 61
χ²-goodness-of-fit test, 49, 80
p-value, 21

absolutely continuous distribution, 15
acceptable, 152, 238
accident
    date, 206
    year, 208
AD (asset deficit), 239
AD test, 80
adjustment coefficient, 120
admissible, 32
age-to-age factor, 212
aggregation property, 30
AIC, 81
Akaike information criterion, 81
Akaike, Hirotugu, 81
alternative hypothesis, 21
Anderson, Theodore Wilbur, 80
Anderson-Darling test, 80
approximation
    Edgeworth, 96
    normal, 90
    translated gamma, 93
    translated log-normal, 93
arbitrage-free pricing, 242
asset deficit, 239

Bühlmann, Hans, 145, 194
Bühlmann-Straub model, 194
Bahr, von Bengt, 132
Bailey, Robert A., 162
balance sheet, 236
Bayes’ rule, 184
Bayes, Thomas, 184
Bayesian
    inference, 40, 183
Bayesian CL
    factor, 220
    predictor, 221
Bayesian information criterion, 81
Bernoulli
    distribution, 26
    experiment, 26
    random walk, 121
Bernoulli, Jakob, 12
best-estimate reserves, 211
BF
    method, 211, 216
    reserves, 216
BIC, 81
Bichsel, Fritz, 185
binary variable, 168
binomial distribution, 26, 174
    definition, 26
    moments, 27
Bornhuetter, Ronald, 216
Bornhuetter-Ferguson method, 211, 216
BS model, 194

CARA utility function, 138
categorial variable, 168
CDR, 227, 244
    uncertainty, 232
central limit theorem, 13, 90
chain-ladder method, 211, 212
chain-ladder model
    distribution-free, 218
Chebychev’s inequality, 121
Chebychev, Pafnuty Lvovich, 17
chi-square distribution, 21, 61
chi-square-goodness-of-fit test, 49, 80
CL factor, 212
    Bayes, 220
    estimate, 220
CL method, 211, 212
CL model
    gamma-gamma Bayes, 219
    MSEP, 222
CL reserves, 213
claims
    counts, 23
    frequency, 26
claims development
    result, 227, 244
    triangle, 209
claims inflation, 86
claims reserves, 210, 211
claims reserving, 205
    algorithm, 211
    method stochastic, 217
closing date, 206
CLT, 13, 90
CoC, 151
    rate, 151
coefficient of determination, 170
coefficient of variation, 16, 58
coherent risk measure, 148, 153
collective mean, 195
collective risk model, 23
compound binomial distribution, 27
    definition, 27
    moments, 28
compound distribution, 23
    definition, 23
    moments, 24
compound negative-binomial distribution, 39
    definition, 39
    moments, 40
compound Poisson distribution, 30
    aggregation property, 30
    decomposition property, 33
    definition, 30
    moments, 30
concave, 136
conditional tail expectation, 150
conjugate prior, 189
constant absolute risk-aversion, 138
constant relative risk-aversion, 138
continuous variable, 168
convergence in distribution, 17
convex cone, 152
convolution, 25
cost-of-capital, 151, 154
    rate, 151, 154
Cramér, Harald, 115
Cramér-Lundberg process, 115
credibility coefficient, 199, 221
credibility estimator, 190
    homogeneous, 196
    inhomogeneous, 196
credibility weight, 183, 186, 189
credit risk, 240, 243
CRRA utility function, 138
CTE, 150
cumulant function, 173
cumulant generating function, 18
current year claim, 240
CY claim, 240
CY risk, 243

Darling, Donald Allan, 80
De Moivre, Abraham, 13, 90
decomposition property, 33
deductible, 85
deflator, 156, 242
Delbaen, Freddy, 148
density, 15, 58
descending ladder epoch, 123
design matrix, 166
development year, 208
deviance statistics, 180
discrete distribution, 15
discretization, 106
disjoint decomposition, 32
    property, 33
dispersion, 173
distortion function, 147
distribution function, 14
distribution-free CL model, 218
Duffie, James Darrell, 156

EDF, 173
Edgeworth approximation, 96
Edgeworth, Francis Ysidro, 96
Embrechts, Paul, 132
Embrechts-Veraverbeke theorem, 131
empirical
    distribution function, 56
    loss size index function, 57
    mean excess function, 57
England, Peter D., 225
ES, 150
Esscher
    measure, 145
    premium, 145, 156
estimation error, 21
estimator, 21
expectation, 15
expected claims frequency, 26
expected shortfall, 149, 150, 154
expected value, 58
expected value principle, 133
exponential dispersion family, 173, 189
exponential distribution, 61
exponential utility function, 138

F-distribution, 171
fast Fourier transform, 112
Ferguson, Ronald E., 216
FFT, 112
finite horizon ruin probability, 116
first moment, 15
Fisher, Sir Ronald Aylmer, 45
Fourier, Jean Baptiste Joseph, 113
Fréchet, Maurice, 64

gamma distribution, 37, 59, 175
gamma-gamma Bayes CL model, 219
gamma-gamma model, 191
Gauss, Carl Friedrich, 18
Gaussian distribution, 17, 175
generalized inverse, 20
generalized linear model, 159
generalized linear models, 173
Gerber, Hans-Ulrich, 115
Gerber-Shiu risk theory, 115
Gisler, Alois, 196
Glivenko-Cantelli theorem, 77, 89
GLM, 159, 173
Goldie, Charles M., 132
goodness-of-fit, 80, 169

happiness index, 135
heavy tailed, 129
Hill
    estimator, 73
    plot, 73
histogram, 55
homogeneous credibility estimator, 196

i.i.d., 19
IBNeR, 207
IBNyR, 207
incomplete gamma function, 59
independent and identically distributed, 19
individual claim size, 23, 53
informative prior, 187
inhomogeneous credibility estimator, 196
insurance risk, 240, 243
inverse Gaussian distribution, 62
inversion formula, 113
isoelastic utility function, 138

Jewell, William S., 102
Jung, Jan, 164

Khinchin, Aleksandr Yakovlevich, 126
Kolmogorov distribution, 78
Kolmogorov, Andrey Nikolaevich, 77
Kolmogorov-Smirnov test, 77
KS test, 77

ladder
    epoch, 123
    height, 123
Laplace, Pierre-Simon, 13, 90, 184
Laplace-Stieltjes transform, 16, 112
large claims separation, 35
law of large numbers, 12
layer, 58, 83
leverage effect, 86
likelihood function, 45
likelihood ratio test, 170
linear credibility, 183, 193
link function, 176
link ratio, 212
LLN, 12
log-gamma distribution, 69
log-likelihood function, 45
log-linear model, 167
log-link function, 176
log-log plot, 57
log-normal distribution, 65
loss size index function, 57, 58
Lundberg
    bound, 119, 120
    coefficient, 120
Lundberg, Ernst Filip Oskar, 115
Lyapunov, Aleksandr Mikhailovich, 90

Mack CL model, 218
Mack formula, 223
Mack, Thomas, 218
margin, 238
market risk, 240, 243
market-consistent, 238
    value, 242
market-value margin, 238, 245
Markov chain Monte Carlo, 183
maximum likelihood estimator, 45
maximum likelihood method, 40
MCMC, 183
mean, 15, 58
mean excess function, 57, 58
mean excess plot, 57
mean square error of prediction
    conditional, 217
Merz, Michael, 227
Merz-Wüthrich formula, 232
method of
    Bailey & Jung, 164
    Bailey & Simon, 162
    total marginal sums, 164
method of moments, 40
minimum variance estimator, 41
mixed Poisson distribution, 36
    definition, 36
MLE, 40, 45
MM, 40
model risk, 13
moment estimator, 41
moment generating function, 16, 18, 58
moments, 15
monotonicity, 152
Monte Carlo simulation, 89
Morgenstern, Oskar, 136
MSEP, 217, 222
multiplicative tariff, 160
multivariate Gaussian distribution, 166
    density, 167
MV, 41
MW formula, 232

negative-binomial distribution, 37
    definition, 37
    moments, 38
net profit condition, 118
Neumann, von John, 136
non-informative prior, 187
normal approximation, 90
normalization, 152
NPC, 118
null hypothesis, 21
number of claims, 23

ODP model, 225
one-period problem, 237
one-year uncertainty, 232
operational risk, 240, 243
    loss, 241
outstanding loss liabilities, 206, 210
over-dispersed Poisson model, 225

p-value, 21
Panjer
    algorithm, 101, 103
    distribution, 102
    recursion, 101
Panjer, Harry H., 101
parameter estimation
    claims count distribution, 40
    error, 217
Pareto distribution, 72
Pareto, Vilfredo Federico Damaso, 72
past exposure claim, 209
Pearson’s residuals, 181
Poisson distribution, 28, 175
    definition, 28
    moments, 29
Poisson, Siméon Denis, 29
Poisson-gamma model, 185
Pollaczek, Félix, 126
Pollaczek-Khinchin formula, 123, 126
positive homogeneity, 152
posterior
    distribution, 184
    parameter, 186
power utility function, 138
prediction error, 21, 202
predictor, 20, 217
premium
    calculation principle, 133
    CY, 241
    elements, 13
    liability risk, 240, 244
previous year claim, 240
Price, Richard, 184
prior
    distribution, 184
    parameter, 186
probability distortion, 146
probability space, 14
process uncertainty, 217
provisions, 210, 241
pure randomness, 13
PY claim, 240
PY risk, 243

radius of convergence, 16
Radon-Nikodym derivative, 156
random variables, 14
random walk theorem, 118
RBNS, 207
re-insurance, 85
regularly varying, 59, 130
renewal property, 119
reporting
    date, 206
    delay, 206
reserve risk, 240
reserves, 210, 211
residual standard deviation, 170
Resnick, Sidney Ira, 74
Riemann-Stieltjes integral, 15
risk
    averse, 136
    bearing capital, 150
    characteristics, 160
    class, 160
    components, 13
    margin, 238, 245
    measure, 150, 237
    modules, 239
ruin probability
    finite horizon, 116
    ultimate, 117
ruin theory, 115
ruin time, 116

sample
    estimators, 41
    mean, 41, 54
    variance, 41, 54
scale parameter, 59
scaled deviance, 180
scatter plot, 54
settlement
    date, 206
    delay, 208
    period, 206
shape parameter, 59
Shiu, Elias S.W., 115
significance level, 21
Simon, LeRoy J., 162
skewness, 16, 58
Smirnov, Nikolai Vasilyevich, 77
solvency, 238
Solvency II, 237
Spitzer’s formula, 124
Spitzer, Frank Ludvig, 125
SST, 237
standard assumptions for compound distributions, 23
standard deviation, 16
standard deviation loading principle, 134
stochastic claims reserving method, 217
stopping time, 123
Straub, Erwin, 194
structural parameter, 200
subadditivity, 152
subexponential, 127, 129, 131
Sundt, Bjørn, 102
surplus process, 115, 236
survival function, 19, 58
Swiss Solvency Test, 237

tail index, 59, 130
Tail-Value-at-Risk, 150
tariff criterion, 160
tariffication, 159
total claim amount, 23
total uncertainty, 232
tower property, 19
translated gamma approximation, 93
translated log-normal approximation, 93
translation invariance, 152
TVaR, 150, 237

ultimate ruin probability, 117
utility function
    exponential, 138
    power, 138
utility indifference price, 140
utility theory, 135

vague prior, 187
value
    assets, 236
    liabilities, 236
Value-at-Risk, 150, 153
VaR, 150, 153, 237
Var, 15
variable reduction analysis, 180
variance, 15, 58
variance loading principle, 134
Vco, 16
Veraverbeke, Noël, 132
Verrall, Richard J., 225
volume, 26

Weibull distribution, 63
Weibull, Ernst Hjalmar Waloddi, 64

zero claim, 53
zero-coupon bond, 242

